arXiv Paper Digest

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao

Venue: NeurIPS 2025

First: 2025-10-27T17:59:59+00:00 · Latest: 2025-10-27T17:59:59+00:00

Comments: NeurIPS 2025, produced by Pointcept, project page: https://pointcept.github.io/Concerto

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.

Summary

Concerto is a minimalist self-supervised model that combines 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding to learn spatial representations. In linear probing for 3D scene perception, it outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, and also beats their feature concatenation. With full fine-tuning, it reaches 80.7% mIoU on ScanNet, a new SOTA result. The authors also introduce a variant for video-lifted point clouds and a translator that linearly projects Concerto representations into CLIP's language space for open-world perception. The learned features show superior fine-grained geometric and semantic consistency.
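
The mIoU figure quoted above is a mean intersection-over-union score. As a rough illustration of how such a score is computed from per-point labels, the sketch below uses a hypothetical helper `mean_iou` and toy labels. The actual ScanNet evaluation protocol (class list, ignored labels) is not described in this digest and may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target.

    Hypothetical helper for illustration; not the official benchmark code.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither prediction nor target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-point labels for six points and three classes (made-up data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")  # per-class IoUs are 1/3, 2/3, 1/2, so the mean is 0.500
```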

Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Authors: Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski

Venue: NeurIPS 2025

First: 2025-10-27T17:59:51+00:00 · Latest: 2025-10-27T17:59:51+00:00

Comments: NeurIPS 2025, 38 pages, 22 figures

Abstract

Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or a few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that aligns with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at https://zsh2000.github.io/track-inpaint-resplat.github.io/.

Summary

This work addresses the challenge of preserving a subject's semantic identity in 3D and 4D generation. The proposed method, TIRE (Track, Inpaint, REsplat), takes an initial 3D asset produced by an existing 3D generative model, uses video tracking to identify the regions that need modification, and progressively infills those regions with a subject-driven 2D inpainting model. The modified multi-view 2D observations are then resplatted back to 3D while maintaining consistency. Experiments show that TIRE significantly improves identity preservation over state-of-the-art methods.

PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

Authors: Yuqian Yuan, Wenqiao Zhang, Xin Li, Shihao Wang, Kehan Li, Wentong Li, Jun Xiao, Lei Zhang, Beng Chin Ooi

First: 2025-10-27T17:59:32+00:00 · Latest: 2025-10-27T17:59:32+00:00

Comments: 22 pages, 13 figures

Abstract

Multimodal large language models (MLLMs) have demonstrated strong general-purpose capabilities in open-world visual comprehension. However, most existing MLLMs primarily focus on holistic, scene-level understanding, often overlooking the need for fine-grained, object-centric reasoning. In this paper, we present PixelRefer, a unified region-level MLLM framework that enables advanced fine-grained understanding over user-specified regions across both images and videos. Motivated by the observation that LLM attention predominantly focuses on object-level tokens, we propose a Scale-Adaptive Object Tokenizer (SAOT) to generate compact and semantically rich object representations from free-form regions. Our analysis reveals that global visual tokens contribute mainly in early LLM layers, inspiring the design of PixelRefer-Lite, an efficient variant that employs an Object-Centric Infusion module to pre-fuse global context into object tokens. This yields a lightweight Object-Only Framework that substantially reduces computational cost while maintaining high semantic fidelity. To facilitate fine-grained instruction tuning, we curate PixelRefer-2.2M, a high-quality object-centric instruction dataset. Extensive experiments across a range of benchmarks validate that PixelRefer achieves leading performance with fewer training samples, while PixelRefer-Lite offers competitive accuracy with notable gains in efficiency.

Summary

PixelRefer is a unified region-level MLLM framework for fine-grained understanding of user-specified regions in images and videos. Its Scale-Adaptive Object Tokenizer (SAOT) generates compact, semantically rich object representations from free-form regions. PixelRefer-Lite, an efficient variant, uses an Object-Centric Infusion module to pre-fuse global context into object tokens, which reduces computational cost while maintaining semantic fidelity. Experiments show that PixelRefer achieves leading performance with fewer training samples, and that PixelRefer-Lite offers competitive accuracy with notable efficiency gains.

Alita-G: Self-Evolving Generative Agent for Agent Generation

Authors: Jiahao Qiu, Xuan Qi, Hongru Wang, Xinzhe Juan, Yimin Wang, Zelin Zhao, Jiayi Geng, Jiacheng Guo, Peihang Li, Jingzhe Shi, Shilong Liu, Mengdi Wang

First: 2025-10-27T17:59:14+00:00 · Latest: 2025-10-27T17:59:14+00:00

Comments: 15 pages, 3 figures

Abstract

Large language models (LLMs) have been shown to perform better when scaffolded into agents with memory, tools, and feedback. Beyond this, self-evolving agents have emerged, but current work largely limits adaptation to prompt rewriting or failure retries. Therefore, we present ALITA-G, a self-evolution framework that transforms a general-purpose agent into a domain expert by systematically generating, abstracting, and curating Model Context Protocol (MCP) tools. In this framework, a generalist agent executes a curated suite of target-domain tasks and synthesizes candidate MCPs from successful trajectories. These are then abstracted to parameterized primitives and consolidated into an MCP Box. At inference time, ALITA-G performs retrieval-augmented MCP selection with the help of each tool's descriptions and use cases, before executing an agent equipped with the MCP Executor. Across several benchmarks (GAIA, PathVQA, and Humanity's Last Exam), ALITA-G attains strong gains while reducing computation costs. On GAIA validation, it achieves 83.03% pass@1 and 89.09% pass@3, establishing a new state-of-the-art result while reducing mean tokens per example by approximately 15% relative to a strong baseline agent. ALITA-G thus provides a principled pathway from generalist capability to reusable, domain-specific competence, improving both accuracy and efficiency on complex reasoning tasks.

Summary

ALITA-G is a self-evolution framework that transforms a general-purpose agent into a domain expert by generating, abstracting, and curating Model Context Protocol (MCP) tools. It achieves strong gains on GAIA, PathVQA, and Humanity's Last Exam. On GAIA validation, it reaches 83.03% pass@1 and 89.09% pass@3, a new state-of-the-art result, while reducing mean tokens per example by approximately 15% relative to a strong baseline agent.
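
The pass@1 and pass@3 figures above are attempt-based success rates. One common reading, assumed here because the digest does not state the paper's exact protocol, is that a task counts as solved at k if any of its first k attempts succeeds. The helper `pass_at_k` and the outcome table below are hypothetical.

```python
from typing import List


def pass_at_k(attempts_per_task: List[List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    Assumed definition for illustration; the paper may use a different estimator.
    """
    if not attempts_per_task:
        return 0.0
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)


# Made-up outcomes: four tasks, three ordered attempts each.
results = [
    [True, False, False],   # solved on the first attempt
    [False, False, True],   # solved on the third attempt
    [False, True, False],   # solved on the second attempt
    [False, False, False],  # never solved
]
p1 = pass_at_k(results, 1)
p3 = pass_at_k(results, 3)
print(p1, p3)  # 0.25 0.75
```

Under this reading, pass@3 can never be lower than pass@1, which is consistent with the reported 83.03% and 89.09%.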

Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models

Authors: Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab

First: 2025-06-05T17:55:23+00:00 · Latest: 2025-10-27T17:59:13+00:00

Comments: The Thirty-Ninth Annual Conference on Neural Information Processing Systems

Abstract

Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable, all without any extra computational overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.

Summary

The paper formulates LLM unlearning as a constrained optimization problem. A logit-margin flattening loss enforces forgetting on a designated forget set, while a hard constraint on a separate retain set preserves retention; the authors report more stable and efficient optimization than with scalarized trade-off losses. The problem is solved with a scalable primal-dual algorithm. Experiments on the TOFU and MUSE benchmarks show that the approach matches or exceeds state-of-the-art baselines in removing targeted information while preserving downstream utility.
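
To make the primal-dual idea concrete, here is a minimal sketch on a toy scalar problem. It does not reproduce the paper's logit-margin flattening loss or its retain constraint: the quadratic "forget" objective, the quadratic "retain" constraint, the step sizes, and the function name `primal_dual` are all assumptions chosen so the answer can be checked by hand.

```python
def primal_dual(steps: int = 2000, lr: float = 0.05, dual_lr: float = 0.05,
                eps: float = 1.0):
    """Gradient descent-ascent on the Lagrangian of a toy constrained problem.

    minimize   forget(theta) = (theta - 2)^2      (stand-in for a forgetting loss)
    subject to retain(theta) = theta^2 <= eps     (stand-in for a retain constraint)
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: descend the Lagrangian forget + lam * (retain - eps).
        grad = 2.0 * (theta - 2.0) + lam * 2.0 * theta
        theta -= lr * grad
        # Dual step: raise lam while the constraint is violated, keep lam >= 0.
        lam = max(0.0, lam + dual_lr * (theta ** 2 - eps))
    return theta, lam


theta, lam = primal_dual()
print(round(theta, 3), round(lam, 3))
```

For this toy problem the constrained optimum is theta = 1 with multiplier 1, so the dual variable settles at the price of the retain constraint. This is the sense in which the dual variable's dynamics expose the forgetting/retention trade-off.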

UNDREAM: Bridging Differentiable Rendering and Photorealistic Simulation for End-to-end Adversarial Attacks

Authors: Mansi Phute, Matthew Hull, Haoran Wang, Alec Helbling, ShengYun Peng, Willian Lunardi, Martin Andreoni, Wenke Lee, Duen Horng Chau

First: 2025-10-19T16:38:03+00:00 · Latest: 2025-10-27T17:59:01+00:00

Abstract

Deep learning models deployed in safety critical applications like autonomous driving use simulations to test their robustness against adversarial attacks in realistic conditions. However, these simulations are non-differentiable, forcing researchers to create attacks that do not integrate simulation environmental factors, reducing attack success. To address this limitation, we introduce UNDREAM, the first software framework that bridges the gap between photorealistic simulators and differentiable renderers to enable end-to-end optimization of adversarial perturbations on any 3D objects. UNDREAM enables manipulation of the environment by offering complete control over weather, lighting, backgrounds, camera angles, trajectories, and realistic human and object movements, thereby allowing the creation of diverse scenes. We showcase a wide array of distinct physically plausible adversarial objects that UNDREAM enables researchers to swiftly explore in different configurable environments. This combination of photorealistic simulation and differentiable optimization opens new avenues for advancing research of physical adversarial attacks.

Summary

The research aims to improve robustness testing of deep learning models in safety-critical applications by addressing the non-differentiability of simulations. The authors introduce UNDREAM, a software framework that bridges photorealistic simulators and differentiable renderers, enabling end-to-end optimization of adversarial perturbations on 3D objects. The framework gives control over environmental factors such as weather, lighting, backgrounds, camera angles, and trajectories, and supports the creation of diverse, physically plausible adversarial objects in configurable environments. This combination opens new avenues for research on physical adversarial attacks.

Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

Authors: Mingliang Zhai, Hansheng Liang, Xiaomeng Fan, Zhi Gao, Chuanhao Li, Che Sun, Xu Bin, Yuwei Wu, Yunde Jia

First: 2025-10-23T08:02:08+00:00 · Latest: 2025-10-27T17:58:06+00:00

Comments: 16 pages, 7 figures, 8 tables

Abstract

Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtaining additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool-usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Based on the pipeline, we collect the EQA-RT dataset that contains about 18K tasks, divided into a training set EQA-RT-Train, and two test sets EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2% to 20.2% over state-of-the-art baselines, while outperforming the zero-shot ToolEQA by 10% in success rate. In addition, ToolEQA also achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. See our homepage at https://tooleqa.github.io.

Multi-Agent Evolve: LLM Self-Improve through Co-evolution

Authors: Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, Jiaxuan You

Venue: ICLR 2026

First: 2025-10-27T17:58:02+00:00 · Latest: 2025-10-27T17:58:02+00:00

Comments: 29 pages, 4 figures, submitted to ICLR 2026

Abstract

Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.

Summary

The paper proposes Multi-Agent Evolve (MAE), a framework in which an LLM improves its reasoning through co-evolution. MAE instantiates a triplet of interacting agents (Proposer, Solver, Judge) from a single LLM and optimizes their behaviors with reinforcement learning across mathematics, reasoning, and general knowledge Q&A tasks. Experiments with Qwen2.5-3B-Instruct show an average improvement of 4.54% on multiple benchmarks. The approach reduces reliance on human-curated datasets and verifiable rewards, which the authors present as making it more scalable and data-efficient.

PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection

Authors: Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan

First: 2025-10-27T17:57:52+00:00 · Latest: 2025-10-27T17:57:52+00:00

Abstract

We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.

Summary

PRISM-Bench is a benchmark of puzzle-based visual tasks that evaluates both problem-solving ability and the reasoning process. Its diagnostic task gives a model a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, and asks the model to identify the first incorrect step. Evaluations of state-of-the-art MLLMs reveal a gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. This highlights the need for diagnostic evaluation in developing trustworthy MLLMs.
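
The diagnostic task can be scored as a simple exact-match problem. The sketch below assumes each example has one gold index for the first incorrect CoT step and that the model outputs one predicted index. The function name `first_error_accuracy` and the data are hypothetical, and the benchmark's official scoring may differ.

```python
from typing import List


def first_error_accuracy(predicted: List[int], gold: List[int]) -> float:
    """Share of examples where the predicted first-error step equals the gold step."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Made-up example: 1-based index of the first wrong step in four CoTs.
gold_steps = [3, 1, 4, 2]
model_steps = [3, 2, 4, 4]
accuracy = first_error_accuracy(model_steps, gold_steps)
print(accuracy)  # 0.5
```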

Lightweight Robust Direct Preference Optimization

Authors: Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe

First: 2025-10-27T17:55:06+00:00 · Latest: 2025-10-27T17:55:06+00:00

Comments: arXiv admin note: substantial text overlap with arXiv:2509.02709

Abstract

Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent works have proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data. However, these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO which accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism and incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.

Summary

The research aims to make Direct Preference Optimization (DPO) for fine-tuning large language models (LLMs) more robust to data noise and overfitting. It proposes DPO-PRO, a lightweight robust fine-tuning algorithm that uses a distributionally robust optimization (DRO) formulation to account for uncertainty in the preference distribution, avoiding excessive conservatism and high computational cost. On standard alignment benchmarks and a real-world public health task, DPO-PRO improves robustness to noisy preference signals compared to existing DPO variants.
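
For context, the standard per-pair DPO loss that DPO-PRO builds on can be written in a few lines. The sketch below shows only that baseline loss. The robust DPO-PRO objective is not spelled out in this digest and is not reproduced here. The function name `dpo_loss`, the `beta` value, and the log-probabilities are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy equal to the reference: zero margin, so the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy that raises the chosen response relative to the rejected one.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(neutral, improved)
```

The loss keeps decreasing as the margin grows, which gives a policy an incentive to become very confident even when the preference label itself is uncertain. The abstract describes DPO-PRO as penalizing exactly this kind of overconfidence under weak preference signals.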

InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras

Authors: Erich Liang, Roma Bhattacharjee, Sreemanti Dey, Rafael Moschopoulos, Caitlin Wang, Michel Liao, Grace Tan, Andrew Wang, Karhan Kayan, Stamatis Alexandropoulos, Jia Deng

Venue: NeurIPS 2025

First: 2025-10-27T17:54:57+00:00 · Latest: 2025-10-27T17:54:57+00:00

Comments: Accepted at NeurIPS 2025 DB Track, Camera Ready Version. Supplementary material included

Abstract

Accurately tracking camera intrinsics is crucial for achieving 3D understanding from 2D video. However, most 3D algorithms assume that camera intrinsics stay constant throughout a video, which is often not true for many real-world in-the-wild videos. A major obstacle in this field is a lack of dynamic camera intrinsics benchmarks: existing benchmarks typically offer limited diversity in scene content and intrinsics variation, and none provide per-frame intrinsic changes for consecutive video frames. In this paper, we present Intrinsics in Flux (InFlux), a real-world benchmark that provides per-frame ground truth intrinsics annotations for videos with dynamic intrinsics. Compared to prior benchmarks, InFlux captures a wider range of intrinsic variations and scene diversity, featuring 143K+ annotated frames from 386 high-resolution indoor and outdoor videos with dynamic camera intrinsics. To ensure accurate per-frame intrinsics, we build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox to improve its accuracy and robustness. Using our benchmark, we evaluate existing baseline methods for predicting camera intrinsics and find that most struggle to achieve accurate predictions on videos with dynamic intrinsics. For the dataset, code, videos, and submission, please visit https://influx.cs.princeton.edu/.

Summary

InFlux is a real-world benchmark for self-calibration of dynamic camera intrinsics in videos. It addresses the lack of diverse benchmarks with per-frame intrinsic changes by providing per-frame ground truth intrinsics annotations. To obtain accurate annotations, the authors build a comprehensive lookup table of calibration experiments and extend the Kalibr toolbox. Evaluations show that most existing methods struggle to predict accurate intrinsics on videos with dynamic intrinsics, highlighting the need for better algorithms. The dataset, code, and videos are available at https://influx.cs.princeton.edu/.
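
To see why per-frame intrinsics matter, consider a plain pinhole camera whose focal length changes between frames, for example during a zoom. The sketch below is illustrative only: it ignores lens distortion, the class `Intrinsics` and the function `project` are hypothetical, and the numbers are made up rather than taken from InFlux.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Intrinsics:
    fx: float  # focal length in pixels along x
    fy: float  # focal length in pixels along y
    cx: float  # principal point x, in pixels
    cy: float  # principal point y, in pixels


def project(point: Tuple[float, float, float], k: Intrinsics) -> Tuple[float, float]:
    """Pinhole projection of a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


point = (0.5, 0.25, 2.0)  # metres, camera frame
frame_wide = Intrinsics(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
frame_zoom = Intrinsics(fx=1500.0, fy=1500.0, cx=960.0, cy=540.0)

uv_wide = project(point, frame_wide)
uv_zoom = project(point, frame_zoom)
print(uv_wide, uv_zoom)  # (1210.0, 665.0) (1335.0, 727.5)
```

The same 3D point lands 125 pixels further right after the zoom. A method that assumes constant intrinsics has to explain that shift as camera or scene motion instead.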

LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas

Authors: Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang

First: 2025-10-23T17:59:55+00:00 · Latest: 2025-10-27T17:53:30+00:00

Comments: 9 pages, preprint. Project page: https://snap-research.github.io/layercomposer/

Abstract

Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.

中文标题/摘要

标题：LayerComposer：基于空间感知分层画布的交互式个性化T2I

尽管现有的个性化生成模型在视觉保真度方面表现出色，但它们缺乏对空间组成进行交互式控制的能力，并且在处理多个主题时扩展性较差。为了解决这些限制，我们提出了LayerComposer，这是一种交互式的多主题文本到图像生成框架。我们的方法引入了两个主要贡献：（1）分层画布，这是一种新颖的表示方法，其中每个主题放置在不同的层上，从而实现无遮挡的组合；（2）锁定机制，该机制保持选定层的高保真度，同时允许其余层灵活适应周围环境。类似于专业的图像编辑软件，所提出的分层画布允许用户通过直观的层操作放置、调整大小或锁定输入主题。我们灵活的锁定机制不需要对架构进行更改，而是依赖于固有的位置嵌入以及一种新的互补数据采样策略。广泛的实验表明，与多主题个性化图像生成的最新方法相比，LayerComposer在空间控制和身份保留方面具有优越性。

Summary / 总结

LayerComposer is an interactive framework for personalized, multi-subject text-to-image generation with spatial control. It makes two contributions: a layered canvas, in which each subject sits on its own layer so that composition is occlusion-free, and a locking mechanism that preserves selected layers with high fidelity while the remaining layers adapt to the surrounding context. Experiments show that LayerComposer achieves better spatial control and identity preservation than state-of-the-art methods for multi-subject personalized image generation.

LayerComposer 是一个交互式框架，用于生成多主体的个性化文本到图像输出。它引入了一种分层画布，其中每个主体位于独立的层上，允许无遮挡的组合和剩余层的灵活适应。锁定机制保留选定的层，同时根据上下文适应其他层。实验表明，LayerComposer 在多主体图像生成中的空间控制和身份保留方面优于现有方法。

ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

Authors: Mohammed Baharoon, Luyang Luo, Michael Moritz, Abhinav Kumar, Sung Eun Kim, Xiaoman Zhang, Miao Zhu, Mahmoud Hussain Alabbad, Maha Sbayel Alhazmi, Neel P. Mistry, Lucas Bijnens, Kent Ryan Kleinschmidt, Brady Chrisler, Sathvik Suryadevara, Sri Sai Dinesh Jaliparthi, Noah Michael Prudlo, Mark David Marino, Jeremy Palacio, Rithvik Akula, Di Zhou, Hong-Yu Zhou, Ibrahim Ethem Hamamci, Scott J. Adams, Hassan Rayhan AlOmaish, Pranav Rajpurkar

First: 2025-07-29T17:27:15+00:00 · Latest: 2025-10-27T17:51:47+00:00

Abstract

We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at https://rexrank.ai/ReXGroundingCT. The dataset is available at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.

中文标题/摘要

标题：ReXGroundingCT：用于根据自由文本报告分割发现的3D胸部CT数据集

我们介绍了ReXGroundingCT，这是首个将自由文本发现与胸部CT扫描的像素级3D分割链接起来的公开数据集。该数据集包括3,142份非对比胸部CT扫描，配以CT-RATE标准化的放射学报告。数据集构建遵循一个结构化的三阶段管道。首先，使用GPT-4从最初用土耳其语书写的报告中提取并标准化发现、描述和元数据，并将其机器翻译成英语。其次，GPT-4o-mini将每个发现分类到肺和胸膜异常的层次结构分类法中。第三，为所有CT体积生成3D注释：训练集由认证放射科医生进行质量保证，验证集和测试集由认证放射科医生完全注释。此外，还创建了一个补充的推理数据集，使用GPT-4o和器官分割模型推导出的定位坐标，为在CT体积中定位发现提供逐步的层次解剖推理。ReXGroundingCT包含16,301个注释实体，覆盖8,028个文本到3D分割对，来自3,142份非对比CT扫描的多种放射学模式。约79%的发现是局灶性异常，21%是非局灶性异常。数据集包括50例公开验证集和100例私有测试集，均由认证放射科医生注释。该数据集为CT成像中自由文本发现分割和基于地面真值的放射学报告生成奠定了基础。模型在私有测试集上的性能在https://rexrank.ai/ReXGroundingCT上的公共排行榜上发布。数据集可在https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT获取。

Summary / 总结

ReXGroundingCT is a dataset that links free-text findings from radiology reports to pixel-level 3D segmentations in chest CT scans. It was constructed with a three-stage pipeline: GPT-4 extracted and standardized findings from the reports, GPT-4o-mini categorized each finding into a hierarchical ontology, and 3D annotations were produced, with board-certified radiologists quality-assuring the training set and fully annotating the validation and test sets. The dataset includes 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs; about 79% of findings are focal abnormalities and 21% are non-focal. It supports developing models that segment findings from free-text reports in chest CT scans, and model performance on the private test set is tracked on a public leaderboard.

ReXGroundingCT 是一个将胸部 CT 报告中的自由文本发现与像素级 3D 分割链接起来的数据集。该数据集通过三阶段管道构建：GPT-4 负责提取并标准化报告中的发现，GPT-4o-mini 负责将发现分类，认证放射科医生负责质量保证和注释。数据集包含 16,301 个注释实体，覆盖 8,028 个文本到 3D 分割对，其中约 79% 的发现是局灶性异常，21% 是非局灶性异常。该数据集为 CT 成像中的自由文本发现分割和有依据的（grounded）放射学报告生成奠定了基础，模型性能可在公共排行榜上获取。

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Authors: Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik

Venue: NeurIPS 2025 Spotlight

First: 2025-10-11T20:13:59+00:00 · Latest: 2025-10-27T17:51:21+00:00

Comments: Accepted as a Spotlight Paper at NeurIPS 2025

Abstract

Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel CLIP-based, open-domain, promptable foundation model for generating scene graphs. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.

中文标题/摘要

标题：ESCA：通过场景图生成为具身智能体提供情境

多模态大型语言模型（MLLMs）正迅速向通用具身代理发展。然而，现有的MLLMs无法可靠地捕捉低级视觉特征与高级文本语义之间的细微联系，导致语义接地薄弱和感知不准确。为克服这一挑战，我们提出ESCA框架，通过将具身代理的感知与空间-时间场景图联系起来，来重新定义具身代理。其核心是SGCLIP，这是一种基于CLIP的新型开放领域、可提示的基础模型，用于生成场景图。SGCLIP通过神经符号管道在87K+开放领域视频上进行训练，自动生成的字幕与模型自身生成的场景图对齐，从而无需人工标注。我们证明SGCLIP在基于提示的推理和任务特定微调方面表现出色，分别在场景图生成和动作定位基准上达到最先进的结果。ESCA结合SGCLIP提高了基于开源和商用MLLMs的具身代理的感知能力，实现了两个具身环境中的最先进的性能。值得注意的是，ESCA显著减少了代理感知错误，并使开源模型超越了专有基准。我们将在https://github.com/video-fm/LASER发布SGCLIP模型训练源代码，并在https://github.com/video-fm/ESCA发布具身代理。

Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Authors: Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen

First: 2025-10-27T17:50:19+00:00 · Latest: 2025-10-27T17:50:19+00:00

Comments: Project page: https://lookahead-anchoring.github.io

Abstract

Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.
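
The placement of anchors relative to generation windows can be illustrated with a short sketch. This is not the authors' implementation: the function name `lookahead_schedule`, the fixed-size non-overlapping windows, and the choice to measure the lookahead distance from the last frame of each window are all assumptions made here for illustration only.

```python
from typing import List, Tuple


def lookahead_schedule(num_frames: int, window: int, lookahead: int) -> List[Tuple[int, int, int]]:
    """Return (start, end, anchor_time) for each generation window.

    Hypothetical convention: frames [start, end) are generated together, and
    the keyframe anchor sits `lookahead` frames after the window's last frame,
    so it lies ahead of the window rather than inside it.
    """
    if window <= 0 or lookahead <= 0:
        raise ValueError("window and lookahead must be positive")
    schedule = []
    for start in range(0, num_frames, window):
        end = min(start + window, num_frames)
        anchor_time = (end - 1) + lookahead  # always beyond the current window
        schedule.append((start, end, anchor_time))
    return schedule


# Toy example: 10 frames, windows of 4 frames, anchor 3 frames ahead.
schedule = lookahead_schedule(num_frames=10, window=4, lookahead=3)
for start, end, anchor_time in schedule:
    print(f"frames [{start}, {end}) -> anchor at t={anchor_time}")
```

In this toy schedule, a larger `lookahead` pushes each anchor further from its window, which corresponds to the abstract's statement that larger distances allow more motion freedom while smaller ones strengthen identity adherence.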

中文标题/摘要

标题：前瞻锚定：在音频驱动的人体动画中保留角色身份

音频驱动的人体动画模型在时间自回归生成过程中经常出现身份漂移的问题，导致角色逐渐失去身份。一种解决方案是生成关键帧作为中间时间锚点，以防止退化，但这需要额外的关键帧生成阶段，并可能限制自然的运动动态。为了解决这一问题，我们提出了前瞻锚定，它利用位于当前生成窗口之后的未来时间步的关键帧，而不是窗口内的关键帧。这将关键帧从固定边界转变为方向性信标：模型在响应即时音频提示的同时不断追随这些未来的锚点，通过持续的引导保持一致的身份。这还使自关键帧成为可能，其中参考图像作为前瞻目标，完全消除了关键帧生成的需要。我们发现，时间前瞻的距离自然控制了表现力和一致性之间的平衡：较大的距离允许更大的运动自由度，而较小的距离则加强身份一致性。当应用于三个最近的人体动画模型时，前瞻锚定实现了更优的唇部同步、身份保留和视觉质量，展示了在多种不同架构中改进的时间条件。视频结果可在以下链接查看：https://lookahead-anchoring.github.io

Summary / 总结

The paper addresses identity drift in audio-driven human animation models by proposing Lookahead Anchoring, which places keyframes at future timesteps ahead of the current generation window rather than inside it. The model continuously pursues these future anchors while responding to immediate audio cues, and the lookahead distance controls the trade-off between expressivity and consistency: larger distances allow more motion freedom, smaller ones strengthen identity adherence. Using the reference image as the lookahead target removes the need for a separate keyframe generation stage. Applied to three recent human animation models, Lookahead Anchoring improves lip synchronization, identity preservation, and visual quality.

论文提出了一种前瞻锚定方法，通过使用未来时间步的关键帧作为时间锚点，来解决音频驱动的人体动画模型中的身份漂移问题。该方法使模型能够在保持一致身份的同时，仍能实现自然的运动动态。实验结果显示，前瞻锚定在三个最新的人体动画模型中提高了唇部同步、身份保持和视觉质量，并通过可调节的时间前瞻距离有效平衡了表现力和一致性。

UrbanVLA: A Vision-Language-Action Model for Urban Micromobility

Authors: Anqi Li, Zhiyong Wang, Jiazhao Zhang, Minghan Li, Yunpeng Qi, Zhibo Chen, Zhizheng Zhang, He Wang

First: 2025-10-27T17:46:43+00:00 · Latest: 2025-10-27T17:46:43+00:00

Abstract

Urban micromobility applications, such as delivery robots, demand reliable navigation across large-scale urban environments while following long-horizon route instructions. This task is particularly challenging due to the dynamic and unstructured nature of real-world city areas, yet most existing navigation methods remain tailored to short-scale and controllable scenarios. Effective urban micromobility requires two complementary levels of navigation skills: low-level capabilities such as point-goal reaching and obstacle avoidance, and high-level capabilities, such as route-visual alignment. To this end, we propose UrbanVLA, a route-conditioned Vision-Language-Action (VLA) framework designed for scalable urban navigation. Our method explicitly aligns noisy route waypoints with visual observations during execution, and subsequently plans trajectories to drive the robot. To enable UrbanVLA to master both levels of navigation, we employ a two-stage training pipeline. The process begins with Supervised Fine-Tuning (SFT) using simulated environments and trajectories parsed from web videos. This is followed by Reinforcement Fine-Tuning (RFT) on a mixture of simulation and real-world data, which enhances the model's safety and adaptability in real-world settings. Experiments demonstrate that UrbanVLA surpasses strong baselines by more than 55% in the SocialNav task on MetaUrban. Furthermore, UrbanVLA achieves reliable real-world navigation, showcasing both scalability to large-scale urban environments and robustness against real-world uncertainties.

中文标题/摘要

标题：UrbanVLA：一种城市微移动性的视觉-语言-行动模型

城市微移动应用，如配送机器人，需要在大规模城市环境中可靠导航并遵循长期路线指令。由于现实世界城市区域的动态和非结构化特性，这一任务尤其具有挑战性，而现有的大多数导航方法仍然针对短距离和可控场景。有效的城市微移动需要两种互补的导航技能层次：低级能力，如点目标到达和障碍物避免，以及高级能力，如路线-视觉对齐。为此，我们提出UrbanVLA，一种基于路线条件的视觉-语言-行动（VLA）框架，用于可扩展的城市导航。我们的方法在执行过程中明确地将嘈杂的路线航点与视觉观察对齐，并随后规划轨迹以驱动机器人。为了使UrbanVLA掌握这两种导航技能层次，我们采用两阶段训练管道。该过程始于使用模拟环境和从网络视频解析的轨迹进行监督微调（SFT）。随后是基于模拟和现实世界数据混合的强化微调（RFT），这增强了模型在现实世界环境中的安全性和适应性。实验表明，UrbanVLA在MetaUrban的SocialNav任务中比强基线高出55%以上。此外，UrbanVLA实现了可靠的现实世界导航，展示了其在大规模城市环境中的可扩展性和对现实世界不确定性具有鲁棒性。

Summary / 总结

UrbanVLA is a Vision-Language-Action model designed for urban micromobility navigation, addressing the challenges of large-scale, dynamic city environments. It uses a two-stage training process, starting with supervised fine-tuning on simulated data and web videos, followed by reinforcement fine-tuning on mixed simulation and real-world data. Experiments show that UrbanVLA outperforms strong baselines by more than 55% in the SocialNav task on MetaUrban and demonstrates reliable real-world navigation capabilities.

UrbanVLA 是一种用于城市微移动性的视觉-语言-行动模型，旨在应对动态和非结构化的城市环境挑战。它采用两阶段训练流程，首先在模拟环境和网络视频中进行监督微调，然后在模拟和真实世界数据上进行强化微调。这种方法使 UrbanVLA 能够在路线-视觉对齐和障碍物避免方面表现出色，在 MetaUrban 的 SocialNav 任务中超越了强大的基线超过 55%，并展示了在真实世界中的可靠导航能力。

LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Authors: Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang

First: 2025-10-09T05:12:09+00:00 · Latest: 2025-10-27T17:46:32+00:00

Comments: 34 pages, 5 figures, 7 tables

Abstract

Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families (foundation, text-bridge, spatial, multimodal, epigenomic, and agentic) and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.

中文标题/摘要

标题：LLM4Cell：单细胞生物学中大型语言和代理模型综述

大型语言模型（LLMs）和新兴的代理框架正在通过使自然语言推理、生成注释和多模态数据整合来逐步改变单细胞生物学。然而，进展仍然分散在不同的数据模态、架构和评估标准之间。LLM4Cell 提供了首个统一的综述，涵盖了为单细胞研究开发的 58 个基础和代理模型，这些模型涵盖了 RNA、ATAC、多组学和空间模态。我们将这些方法分为五类：基础模型、文本桥梁、空间模型、多模态、表观遗传学模型和代理模型，并将它们映射到八个关键分析任务，包括注释、轨迹和扰动建模以及药物反应预测。基于超过 40 个公共数据集，我们分析了基准适用性、数据多样性以及伦理或可扩展性限制，并在涵盖生物学基础、多组学对齐、公平性、隐私和可解释性的 10 个领域维度上评估了模型。通过将数据集、模型和评估领域联系起来，LLM4Cell 提供了语言驱动的单细胞智能的首个集成视图，并概述了可解释性、标准化和可信模型开发中的开放挑战。

Summary / 总结

LLM4Cell surveys 58 foundation and agentic models for single-cell biology, covering RNA, ATAC, multi-omic, and spatial modalities. It categorizes these models into five families and maps them to eight analytical tasks. Drawing on over 40 public datasets, it evaluates models across 10 domain dimensions and outlines open challenges in interpretability, standardization, and trustworthy model development.

LLM4Cell 概述了58个用于单细胞生物学的基础和代理模型，涵盖了RNA、ATAC、多组学和空间模态。这些模型被分为五个类别，并映射到八个分析任务。通过对40个数据集和10个领域维度的评估，指出了可解释性、标准化和模型开发中的挑战。

Now you see me! Attribution Distributions Reveal What is Truly Important for a Prediction

Authors: Nils Philipp Walter, Jilles Vreeken, Jonas Fischer

First: 2025-03-10T13:59:57+00:00 · Latest: 2025-10-27T17:45:30+00:00

Abstract

Neural networks are regularly employed in high-stakes decision-making, where understanding and transparency are key. Attribution methods have been developed to gain insight into which input features neural networks use for a specific prediction. Although widely used in computer vision, these methods often result in unspecific saliency maps that fail to identify the relevant information that led to a decision, as results on different benchmarks confirm. Here, we revisit the common attribution pipeline and identify one cause for the lack of specificity in attributions: attributions are computed for isolated logits. Instead, we suggest combining the attributions of multiple class logits, in analogy to how the softmax combines information across logits. By computing probability distributions of attributions over classes for each spatial location in the image, we unleash the true capabilities of existing attribution methods, revealing better object- and instance-specificity and uncovering discriminative as well as shared features between classes. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, we show that this reconsideration of how and where we compute attributions across the network improves established attribution methods while staying agnostic to model architectures. We make the code publicly available: https://github.com/nilspwalter/var.
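
The idea of a per-location distribution over classes can be sketched in a few lines. This is a minimal illustration, not the authors' code: the abstract only says the combination works "in analogy to" the softmax, so the plain softmax used here, the function name `attribution_distribution`, and the nested-list layout `attributions[class][row][col]` are assumptions.

```python
import math
from typing import List

Map3D = List[List[List[float]]]  # indexed as [class][row][col]


def attribution_distribution(attributions: Map3D) -> Map3D:
    """Normalize per-class attribution maps into a distribution over classes.

    For every spatial location, a softmax is taken across the class axis
    (assumed normalization), so the values at that location sum to 1.
    """
    num_classes = len(attributions)
    height = len(attributions[0])
    width = len(attributions[0][0])
    probs = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]
    for y in range(height):
        for x in range(width):
            scores = [attributions[c][y][x] for c in range(num_classes)]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            for c in range(num_classes):
                probs[c][y][x] = exps[c] / total
    return probs


# Toy example: two classes on a 1x2 image.
toy = [
    [[2.0, 0.0]],  # class 0 attributions
    [[0.0, 0.0]],  # class 1 attributions
]
dist = attribution_distribution(toy)
print(dist)
```

In the toy input, the first pixel is attributed more strongly to class 0, so class 0 receives most of the probability mass there, while the second pixel, attributed equally to both classes, is split evenly. A location that every class attributes equally therefore stops looking class-specific.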

中文标题/摘要

标题：现在你看到我！归因分布揭示预测中真正重要的因素

神经网络在高风险决策中经常被使用，而理解和透明度是关键。已经开发了归因方法来了解神经网络在特定预测中使用了哪些输入特征。尽管这些方法在计算机视觉中广泛应用，但它们通常会产生不具体的显著性图，无法识别导致决策的相关信息，这得到了不同的基准结果的支持。在这里，我们重新审视了常见的归因管道，并确定缺乏具体性的原因之一是单独计算logits的归因。相反，我们建议将多个类logits的归因结合起来，类似于softmax如何在logits之间结合信息。通过在图像中的每个空间位置上计算归因的概率分布，我们释放了现有归因方法的真正能力，揭示了更好的对象和实例特异性，并发现了类间具有区分性和共享性的特征。在包括网格点游戏和随机化基线检查在内的常见基准上，我们展示了这种重新考虑如何以及在哪里在网络中计算归因的方法改进了现有的归因方法，同时保持对模型架构的中立性。我们已将代码公开发布：https://github.com/nilspwalter/var.

Summary / 总结

This paper addresses the unspecific saliency maps that attribution methods often produce, a problem for neural networks used in high-stakes decision-making. The authors trace part of the problem to computing attributions for isolated logits and instead compute, for each spatial location in the image, a probability distribution of attributions over classes. This improves object- and instance-specificity and uncovers discriminative as well as shared features between classes, while remaining agnostic to model architecture. The approach improves established attribution methods on common benchmarks, including the grid-pointing game and randomization-based sanity checks.

本文探讨了神经网络中由归因方法产生的不具体显著图的问题，这对高风险决策至关重要。作者提出了一种新方法，该方法在每个图像空间位置上为每个类计算归因的概率分布，而不是单独归因于logits。这种方法增强了对象和实例的特异性，并揭示了类间具有区分性和共享性的特征，同时不依赖于模型架构。该方法在常见基准和随机化基线检查上进行了验证，证明了其有效性。

More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Authors: Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai

Venue: NeurIPS 2025

First: 2025-10-27T17:44:56+00:00 · Latest: 2025-10-27T17:44:56+00:00

Comments: Accepted by NeurIPS 2025. The code will be made available at https://github.com/H-EmbodVis/MERGE

Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it also extends to depth estimation with little effort. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE.

中文标题/摘要

标题：不止于生成：通过文本到图像扩散模型统一生成与深度估计

生成深度估计方法利用预训练文本到图像扩散模型中存储的丰富视觉先验，展示了惊人的零样本能力。然而，训练过程中的参数更新会导致预训练模型图像生成能力的灾难性退化。我们提出了MERGE，一种从固定预训练文本到图像模型出发的统一图像生成与深度估计模型。MERGE表明，预训练文本到图像模型不仅可以进行图像生成，还可以轻松扩展到深度估计。具体而言，MERGE引入了一种插拔框架，通过简单的插拔转换器可以无缝切换图像生成和深度估计模式。同时，我们提出了组重用机制，以促进参数重用并提高额外可学习参数的利用率。MERGE释放了预训练文本到图像模型的强大深度估计能力，同时保留其原始的图像生成能力。与用于图像生成和深度估计的其他统一模型相比，MERGE在多个深度估计基准测试中取得了最先进的性能。代码将在https://github.com/H-EmbodVis/MERGE公开。

Summary / 总结

MERGE is a unified model for image generation and depth estimation that starts from a fixed pre-trained text-to-image model. It introduces a play-and-plug framework allowing seamless switching between image generation and depth estimation modes, and a Group Reuse Mechanism to improve parameter utilization. MERGE outperforms other unified models in multiple depth estimation benchmarks while maintaining the original image generation capability of the pre-trained model.

MERGE 是一个从固定预训练文本到图像模型出发的统一模型，用于图像生成和深度估计。它引入了一个可玩即插即用框架，允许在图像生成和深度估计模式之间无缝切换，并提出了一种组重用机制以提高参数利用率。MERGE 在多个深度估计基准测试中表现出色，同时保持了预训练模型的原始图像生成能力。

RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation

Authors: Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki

First: 2025-10-27T17:41:38+00:00 · Latest: 2025-10-27T17:41:38+00:00

Comments: Website: https://robotarenainf.github.io

Abstract

The pursuit of robot generalists, instructable agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting VLA evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.

中文标题/摘要

标题：RobotArena $\infty$: 通过实到模拟转换实现可扩展的机器人基准测试

机器人通才——能够跨不同环境执行多种任务的可指导代理——的追求需要严格的可扩展评估。然而，机器人策略的实际测试仍然受到根本限制：它劳动密集、速度慢、在大规模下不安全且难以复现。现有的模拟基准测试同样有限，因为它们在相同的合成领域内训练和测试策略，无法评估从真实世界演示或不同模拟环境训练的模型。随着策略的范围和复杂性扩大，这些障碍只会加剧，因为机器人中的“成功”往往依赖于对执行质量的细致人类判断。在本文中，我们介绍了一种新的基准测试框架，通过将VLA评估转移到辅以在线人类反馈的大规模模拟环境中来克服这些挑战。利用视觉语言模型、二维到三维生成建模和可微渲染的最新进展，我们的方法自动将广泛使用的机器人数据集中的视频演示转换为模拟对应物。在这些数字孪生中，我们使用自动化的视觉语言模型引导评分和从众包工人处收集的可扩展人类偏好判断来评估VLA策略，将人类参与从繁琐的场景设置、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性，我们系统地沿多个轴线（如纹理和物体放置）对模拟环境进行扰动，对策略在受控变化下的泛化能力进行压力测试。结果是一个不断演进、可复现且可扩展的基准测试，用于真实世界训练的机器人操作策略，填补了当今机器人领域的一项关键缺失能力。

Summary / 总结

This paper introduces RobotArena $\infty$, a benchmarking framework that evaluates VLA robot policies in large-scale simulated environments built by real-to-sim translation. It uses vision-language models, 2D-to-3D generative modeling, and differentiable rendering to convert video demonstrations from widely used robot datasets into simulated counterparts. Policies are assessed with both automated VLM-guided scoring and human preference judgments collected from crowdworkers, which shifts human involvement from scene setup, resetting, and safety supervision to lightweight preference comparisons. To measure robustness, the simulated environments are systematically perturbed along axes such as textures and object placements. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies.

本文介绍了RobotArena $\infty$，这是一种通过实到仿转换技术来评估机器人通用性的新基准框架。它利用视觉语言模型、二维到三维生成建模和可微渲染将真实世界的机器人演示转换为模拟环境。该框架使用自动评分和从众包工人收集的人类偏好判断来评估策略，将人类参与从繁琐的任务转变为轻量级的偏好比较。关键发现包括在受控环境扰动下的策略稳健性，展示了用于真实训练的机器人操作策略的持续演化、可重复和可扩展基准。

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang

Venue: NeurIPS 2025

First: 2025-10-27T17:38:17+00:00 · Latest: 2025-10-27T17:38:17+00:00

Comments: Accepted at NeurIPS 2025

Abstract

Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought (CoT) supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.

中文标题/摘要

标题：EgoThinker：以时空思维链揭示第一人称视角推理

第一人称视角视频推理关注摄像机背后不可观测的主体，该主体动态塑造环境，因此需要推断隐藏意图并识别细粒度交互。这一核心挑战限制了当前的多模态大型语言模型（MLLMs），它们擅长可见事件推理但缺乏具身的第一人称理解。为弥合这一差距，我们引入了EgoThinker，一种通过时空思维链监督和两阶段学习课程赋予MLLMs稳健的第一人称视角推理能力的新框架。首先，我们引入了EgoRe-5M，这是一个从1300万段多样化的第一人称视角视频片段中构建的大规模问答数据集，该数据集包含多分钟的片段，并标注了详细的思维链推理依据和密集的手-物体定位。其次，我们在EgoRe-5M上使用SFT培养推理技能，随后进行强化微调（RFT）以进一步增强时空定位。实验结果表明，EgoThinker在多个第一人称视角基准测试中优于现有方法，同时在细粒度的时空定位任务中取得了显著改进。完整代码和数据可在 https://github.com/InternRobotics/EgoThinker 获取。

Summary / 总结

EgoThinker addresses egocentric video reasoning with a framework that combines spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. It introduces EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips and annotated with detailed CoT rationales and dense hand-object grounding. Training applies SFT on EgoRe-5M, followed by reinforcement fine-tuning (RFT) to enhance spatio-temporal localization. EgoThinker outperforms existing methods on multiple egocentric benchmarks and shows substantial improvements in fine-grained spatio-temporal localization tasks.

EgoThinker 通过引入结合时空链式推理监督和两阶段学习课程的新框架，解决了自中心视频推理的挑战。该方法利用 EgoRe-5M 数据集，这是一个大规模的自中心问答数据集，来训练多模态大语言模型（MLLMs），以实现稳健的自中心推理能力。实验结果表明，该方法在多个基准测试中显著提高了细粒度的时空定位，并优于现有方法。完整的代码和数据可在 https://github.com/InternRobotics/EgoThinker 获取。

ReCode: Unify Plan and Action for Universal Granularity Control

Authors: Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu

First: 2025-10-27T17:35:15+00:00 · Latest: 2025-10-27T17:35:15+00:00

Abstract

Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
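
The recursive decomposition described above can be sketched as follows. This is a toy illustration under stated assumptions, not the ReCode implementation: in the paper an LLM agent generates the finer-grained sub-functions, whereas here a hard-coded lookup table (`EXPANSIONS`) stands in for it, and the primitive action names and the `expand` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical primitive actions that the environment can execute directly.
PRIMITIVES = {"goto", "pick", "place"}

# Hypothetical stand-in for the agent: maps a placeholder function name to the
# finer-grained steps it decomposes into.
EXPANSIONS: Dict[str, List[str]] = {
    "tidy_room": ["move_cup_to_shelf"],
    "move_cup_to_shelf": ["fetch_cup", "goto", "place"],
    "fetch_cup": ["goto", "pick"],
}


def expand(step: str, depth: int = 0, max_depth: int = 10) -> List[str]:
    """Recursively expand a placeholder until only primitive actions remain."""
    if step in PRIMITIVES:
        return [step]  # base case: already executable
    if depth >= max_depth:
        raise RecursionError(f"decomposition of {step!r} exceeded max depth")
    if step not in EXPANSIONS:
        raise KeyError(f"no decomposition known for placeholder {step!r}")
    actions: List[str] = []
    for sub_step in EXPANSIONS[step]:
        actions.extend(expand(sub_step, depth + 1, max_depth))
    return actions


plan = expand("tidy_room")
print(plan)
```

Because plans and actions share one representation (a function name), the same `expand` call works whether it starts from a coarse goal or from a step that is already primitive, which is the sense in which decision granularity becomes a matter of how deep the recursion goes.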

中文标题/摘要

标题：ReCode：统一计划与行动以实现通用粒度控制

现实世界任务需要在不同粒度上做出决策，人类通过利用统一的认知表示，将计划视为高层次的行动形式而表现出色。然而，当前基于大型语言模型（LLM）的代理缺乏在不同决策粒度之间灵活操作的关键能力。这一限制源于现有范式强制将高层次规划与低层次行动严格分离，这阻碍了动态适应性并限制了泛化能力。我们提出了ReCode（递归代码生成），这是一种新的范式，通过在单一代码表示中统一规划与行动来解决这一限制。在该表示中，ReCode 将高层次计划视为抽象占位函数，代理随后递归地将其分解为更细粒度的子函数，直到达到基本动作。这种递归方法消除了计划与行动之间的刚性边界，使代理能够动态控制其决策粒度。此外，递归结构自然生成了丰富的、多粒度的训练数据，使模型能够学习层次决策过程。大量实验表明，ReCode 在推理性能上显著超越了先进的基线，并在训练中展示了出色的数据效率，验证了我们核心见解：通过递归代码生成统一规划与行动是实现通用粒度控制的强大而有效的途径。代码可在 https://github.com/FoundationAgents/ReCode 获取。

Summary / 总结

ReCode addresses the limitation of current LLM-based agents in handling varying decision granularities by unifying planning and action within a recursive code representation. This approach treats high-level plans as abstract functions that are recursively decomposed into finer-grained sub-functions until primitive actions are reached, enabling dynamic control of decision granularity. Experiments show that ReCode outperforms advanced baselines in inference and demonstrates strong data efficiency, validating the effectiveness of unifying planning and action through recursive code generation for universal granularity control.

ReCode通过将计划和行动统一到递归代码表示中来解决当前基于LLM的代理在处理不同决策粒度时的局限性。该方法将高层计划视为抽象函数，并递归分解为更细粒度的动作，从而实现动态决策粒度控制。实验表明，ReCode在推理性能上超越了先进基线，并展示了出色的数据效率，验证了这种统一方法在实现通用粒度控制方面的有效性和强大性。

SafeCOMM: A Study on Safety Degradation in Fine-Tuned Telecom Large Language Models

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Fernando Koch, Walid Saad, Holger Boche

First: 2025-05-29T13:31:51+00:00 · Latest: 2025-10-27T17:35:06+00:00

Abstract

Fine-tuning large language models (LLMs) on telecom datasets is a common practice to adapt general-purpose models to the telecom domain. However, little attention has been paid to how this process may compromise model safety. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries. In this paper, we investigate this issue by fine-tuning LLMs on three representative telecom datasets and show that safety degrades even for light telecom domain adaptation. To this end, we introduce TeleHarm, the first telecom-specific red-teaming benchmark, which we use alongside established Direct-Harm and HexPhi datasets to systematically assess harmful behavior. We further extend our analysis to publicly available TeleLLMs that were continually pre-trained on large telecom corpora, revealing that safety alignment is severely lacking, primarily due to the omission of safety-focused instruction tuning. To address these issues, we evaluate three realignment defenses: SafeInstruct, SafeLoRA, SafeMERGE. We show that, across all settings, the proposed defenses can effectively restore safety without compromising telecom task performance, leading to Safe teleCOMMunication (SafeCOMM) models. Our work serves as both a diagnostic study and practical guide for safety realignment in telecom-tuned LLMs, underscoring the need for safety-aware instruction and fine-tuning in the telecom domain.

中文标题/摘要

标题：SafeCOMM：电信领域微调大型语言模型安全退化研究

在电信数据集上微调大型语言模型（LLMs）是将通用模型适应电信领域的一种常见做法。然而，人们很少关注这一过程如何可能损害模型的安全性。最近的研究表明，即使是温和的微调也可能降低LLMs的安全对齐，使其对有害或不道德的用户查询作出响应。在本文中，我们通过在三个代表性的电信数据集上微调LLMs来研究这一问题，并表明即使是轻微的电信领域适应也会导致安全退化。为此，我们引入了TeleHarm，这是首个针对电信领域的红队基准测试，我们使用它与已建立的Direct-Harm和HexPhi数据集一起系统地评估有害行为。我们进一步将分析扩展到公开可用的TeleLLMs，这些模型在大规模电信语料库上持续预训练，揭示出安全对齐严重不足，主要原因是缺乏安全重点指令微调。为了解决这些问题，我们评估了三种重新对齐防御：SafeInstruct、SafeLoRA、SafeMERGE。我们证明，在所有设置中，提出的防御措施可以有效地恢复安全性，而不损害电信任务性能，从而生成Safe电信通信（SafeCOMM）模型。我们的工作既是诊断研究，也是电信领域微调LLMs安全重新对齐的实用指南，强调了在电信领域需要安全意识的指令和微调。

Summary / 总结

This study investigates how fine-tuning large language models (LLMs) on telecom datasets can degrade model safety, even for light adaptation. It introduces TeleHarm, a telecom-specific red-teaming benchmark, and evaluates three safety realignment defenses: SafeInstruct, SafeLoRA, and SafeMERGE. The research demonstrates that these defenses can restore safety without affecting telecom task performance, leading to improved Safe teleCOMMunication models.

研究探讨了将大型语言模型（LLM）微调在电信数据集上如何导致模型安全性的下降，即使是对轻度适应也是如此。研究引入了TeleHarm，一个电信特定的红队基准，并评估了三种安全对齐防御：SafeInstruct、SafeLoRA和SafeMERGE。研究显示，这些防御可以在不损害电信任务性能的情况下恢复安全性，从而生成更安全的电信模型（SafeCOMM）。

Revising Second Order Terms in Deep Animation Video Coding

Authors: Konstantin Schmidt, Thomas Richter

Venue: EUSIPCO 2025 (https://eusipco2025.org/wp-content/uploads/pdfs/0000691.pdf)

First: 2025-10-27T17:32:08+00:00 · Latest: 2025-10-27T17:32:08+00:00

Abstract

First Order Motion Model (FOMM) is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication because, first, it operates at a very low bitrate and, second, its computational complexity is moderate compared to other learning-based video codecs. However, it has strong limitations by design. Since it generates facial animations by warping source images, it fails to recreate videos with strong head movements. This work concentrates on one specific kind of head movement, namely head rotations. We show that replacing the Jacobian transformations in FOMM with a global rotation helps the system perform better on items with head rotations while saving 40% to 80% of bitrate on P-frames. Moreover, we apply state-of-the-art normalization techniques to the discriminator to stabilize the adversarial training, which is essential for generating visually appealing videos. We evaluate the performance with the learned metrics LPIPS and DISTS to show the success of our optimizations.

中文标题/摘要

标题：修订深度动画视频编码中的二次项

一阶运动模型是一种生成模型，基于少量从关键点提取的运动信息来动画化人体头部。它是一种很有前景的视频通信解决方案，因为它可以在非常低的比特率下运行，而且与其他基于学习的视频编解码器相比，其计算复杂度适中。然而，它在设计上存在很强的局限性。由于它是通过扭曲源图像来生成面部动画的，因此无法重现具有强烈头部运动的视频。本工作集中于一种特定类型的头部运动，即头部旋转。我们通过将FOMM中的雅可比变换替换为全局旋转，使系统在处理头部旋转的项目时性能更好，同时在P帧上节省了40%到80%的比特率。此外，我们应用了最先进的归一化技术到判别器，以稳定对抗训练，这对于生成视觉上吸引人的视频至关重要。我们通过学习的度量LPIPS和DISTS来评估性能，以展示我们优化的成功。

Summary / 总结

This work revises the First Order Motion Model to improve its performance on head rotations, reducing bitrate by 40% to 80% on P-frames. The authors replace Jacobian transformations with global rotations and apply advanced normalization techniques to the discriminator, enhancing the stability and visual quality of generated videos. Evaluation metrics LPIPS and DISTS demonstrate the effectiveness of these optimizations.

该研究改进了一阶运动模型（FOMM）以更好地处理头部旋转，P帧的比特率降低了40%到80%。作者用全局旋转替换雅可比变换，并对判别器应用先进的归一化技术，提高生成视频的稳定性和视觉质量。通过LPIPS和DISTS等评估指标展示了这些优化的有效性。

Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance

Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Syed Zawad, Farhan Ahmed, Heiko Ludwig, Holger Boche

First: 2025-06-06T20:34:06+00:00 · Latest: 2025-10-27T17:31:21+00:00

Abstract

Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.

中文标题/摘要

标题：修复后处理中的问题：LLM后训练数据质量和模型性能的比较研究

近年来，关于大型语言模型（LLMs）的研究越来越多地关注后训练和与专门用于增强指令跟随、世界知识和专业技能的数据集的对齐。然而，大多数用于领先开源和闭源LLM的后训练数据集仍然对公众不可访问，且关于其构建过程的信息有限。这种透明度的缺乏促使最近开发了开源后训练语料库。虽然在这些开放替代品上进行训练可以产生与领先模型相当的性能，但由于大规模进行系统性比较的显著计算成本，这种比较仍然具有挑战性，因此这些比较大多不存在。因此，在评估数据质量时，尚不清楚特定样本、任务类型或整理策略如何影响下游性能。在这项工作中，我们首次对两个主要的开源后训练数据集——Tulu-3-SFT-Mix和SmolTalk——进行了全面的并排分析。使用Magpie框架，我们为每个样本标注了详细的质量指标，包括对话结构（单轮对话 vs. 多轮对话）、任务类别、输入质量和响应质量，并推导出统计指标，揭示了两个数据集在结构和质量上的相似性和差异。基于这些见解，我们设计了一种有原则的数据整理方案，生成了一个新的数据混合体TuluTalk，其样本数量比任一原始数据集少14%，但在关键基准测试上匹配或超过了它们的性能。我们的发现为在实际资源限制内构建更有效的后训练数据集提供了可操作的见解。为了支持未来的研究，我们公开发布了标注后的原始数据集和我们整理的TuluTalk混合体。

Enhancing Graph Neural Networks: A Mutual Learning Approach

Authors: Paul Agbaje, Arkajyoti Mitra, Afia Anjum, Pranali Khose, Ebelechukwu Nwafor, Habeeb Olufowobi

First: 2025-10-22T04:07:48+00:00 · Latest: 2025-10-27T17:26:39+00:00

Abstract

Knowledge distillation (KD) techniques have emerged as a powerful tool for transferring expertise from complex teacher models to lightweight student models, particularly beneficial for deploying high-performance models in resource-constrained devices. This approach has been successfully applied to graph neural networks (GNNs), harnessing their expressive capabilities to generate node embeddings that capture structural and feature-related information. In this study, we depart from the conventional KD approach by exploring the potential of collaborative learning among GNNs. In the absence of a pre-trained teacher model, we show that relatively simple and shallow GNN architectures can synergetically learn efficient models capable of performing better during inference, particularly in tackling multiple tasks. We propose a collaborative learning framework where ensembles of student GNNs mutually teach each other throughout the training process. We introduce an adaptive logit weighting unit to facilitate efficient knowledge exchange among models and an entropy enhancement technique to improve mutual learning. These components dynamically empower the models to adapt their learning strategies during training, optimizing their performance for downstream tasks. Extensive experiments conducted on three datasets each for node and graph classification demonstrate the effectiveness of our approach.

中文标题/摘要

标题：增强图神经网络：一种互学方法

知识蒸馏（KD）技术已成为将复杂教师模型的知识转移到轻量级学生模型的强大工具，特别适用于在资源受限设备中部署高性能模型。该方法已成功应用于图神经网络（GNNs），利用其表达能力生成节点嵌入，捕捉结构和特征相关信息。在本研究中，我们有别于传统的KD方法，探索GNNs之间协作学习的潜力。在没有预训练教师模型的情况下，我们展示了相对简单的浅层GNN架构可以协同学习高效的模型，在推理过程中表现更好，特别是在处理多个任务时。我们提出了一种协作学习框架，其中学生GNN的集合在整个训练过程中相互教学。我们引入了自适应logit权重单元以促进模型之间的高效知识交流，并引入了熵增强技术以提高互学效果。这些组件在训练过程中动态地赋予模型调整其学习策略的能力，优化其下游任务的性能。在节点分类和图分类各三个数据集上进行的广泛实验表明了我们方法的有效性。

Summary / 总结

This study proposes a collaborative learning framework for enhancing graph neural networks (GNNs) by mutual teaching among student GNNs, without relying on a pre-trained teacher model. The approach uses an adaptive logit weighting unit and an entropy enhancement technique to facilitate efficient knowledge exchange and improve mutual learning. Experiments on three datasets each for node classification and graph classification show that this method can generate better-performing models for inference, especially for multiple tasks.

该研究提出了一种协作学习方法，通过让学生GNN相互学习来增强图神经网络（GNN）。这种方法不需要预训练的教师模型，而是使用简单的浅层GNN协同提高性能。该方法包括一个自适应logit权重单元和熵增强技术，以促进有效的知识交流和改进相互学习。在三个数据集上的节点和图分类实验表明，该方法有效提高了GNN在下游任务中的性能。
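
As a rough illustration of students teaching each other without a pre-trained teacher, the sketch below computes a per-sample loss for one student as its own cross-entropy plus a KL term pulling it toward a peer's prediction. This is the generic mutual-learning formulation with a fixed `weight`, used here as an assumption; it does not reproduce the paper's adaptive logit weighting unit or its entropy enhancement technique, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_loss(own_logits, peer_logits, label, weight=1.0):
    """Cross-entropy on the true label plus a KL term toward the peer."""
    p_own = softmax(own_logits)
    p_peer = softmax(peer_logits)
    cross_entropy = -math.log(p_own[label])
    # The fixed weight is an assumption; the paper adapts this weighting.
    return cross_entropy + weight * kl_divergence(p_peer, p_own)
```

In training, each student would minimise its own `mutual_loss` against the current predictions of its peers, so the roles of teacher and student are symmetric.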

Sequential Multi-Agent Dynamic Algorithm Configuration

Authors: Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian

Venue: NeurIPS 2025

First: 2025-10-27T17:11:03+00:00 · Latest: 2025-10-27T17:11:03+00:00

Comments: NeurIPS 2025

Abstract

Dynamic algorithm configuration (DAC) is a recent trend in automated machine learning, which can dynamically adjust the algorithm's configuration during the execution process and relieve users from tedious trial-and-error tuning tasks. Recently, multi-agent reinforcement learning (MARL) approaches have improved the configuration of multiple heterogeneous hyperparameters, making various parameter configurations for complex algorithms possible. However, many complex algorithms have inherent inter-dependencies among multiple parameters (e.g., determining the operator type first and then the operator's parameter), which previous approaches do not consider, leading to sub-optimal results. In this paper, we propose the sequential multi-agent DAC (Seq-MADAC) framework to address this issue by considering the inherent inter-dependencies of multiple parameters. Specifically, we propose a sequential advantage decomposition network, which can leverage action-order information through sequential advantage decomposition. Experiments from synthetic functions to the configuration of multi-objective optimization algorithms demonstrate Seq-MADAC's superior performance over state-of-the-art MARL methods and show strong generalization across problem classes. Seq-MADAC establishes a new paradigm for the widespread dependency-aware automated algorithm configuration. Our code is available at https://github.com/lamda-bbo/seq-madac.

中文标题/摘要

标题：顺序多智能体动态算法配置

动态算法配置（DAC）是自动机器学习的最新趋势，可以在执行过程中动态调整算法配置，减轻用户繁琐的试错调优任务。最近，多智能体强化学习（MARL）方法提高了多种异构超参数的配置能力，使得复杂算法的各种参数配置成为可能。然而，许多复杂算法中存在多个参数之间的固有依赖关系（例如，先确定算子类型，再确定算子的参数），而这些依赖关系在先前的方法中并未被考虑，导致结果次优。本文提出了一种顺序多智能体DAC（Seq-MADAC）框架，通过考虑多个参数之间的固有依赖关系来解决这一问题。具体而言，我们提出了一种顺序优势分解网络，可以通过顺序优势分解利用动作顺序信息。从合成函数到多目标优化算法的配置实验表明，Seq-MADAC相比最先进的MARL方法表现出更优的性能，并且在不同问题类别上具有很强的泛化能力。Seq-MADAC为广泛存在的依赖感知自动算法配置问题建立了一个新的范式。我们的代码可在https://github.com/lamda-bbo/seq-madac获取。

Summary / 总结

The research aims to improve dynamic algorithm configuration by addressing the inter-dependencies among multiple parameters in complex algorithms. The method introduces Seq-MADAC, which uses a sequential advantage decomposition network to consider these dependencies. Experiments show that Seq-MADAC outperforms existing MARL methods and demonstrates strong generalization across different problem classes.

研究旨在通过解决复杂算法中多个参数之间的依赖关系来改进动态算法配置。提出的Seq-MADAC框架使用顺序优势分解网络来考虑这些依赖关系。实验表明，Seq-MADAC在不同问题类别上表现出色，优于现有MARL方法，并展示了强大的泛化能力。
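
The dependency in the abstract's example (choose the operator type first, then that operator's parameter) can be sketched as a two-step sequential selection in which the second choice is conditioned on the first. The advantage tables, operator names, and the additive joint score below are hypothetical illustration values, not the paper's sequential advantage decomposition network.

```python
# Hypothetical advantage estimates for the first decision (operator type).
OPERATOR_ADV = {"mutation": 0.2, "crossover": 0.5}

# Hypothetical advantage estimates for the second decision (parameter value),
# conditioned on the operator that was chosen first.
PARAM_ADV = {
    "mutation": {0.1: 0.4, 0.5: 0.1},
    "crossover": {0.6: 0.0, 0.9: 0.3},
}


def configure():
    """Pick the operator first, then a parameter given that operator."""
    op = max(OPERATOR_ADV, key=OPERATOR_ADV.get)
    param = max(PARAM_ADV[op], key=PARAM_ADV[op].get)
    # Assumed additive decomposition: joint score is the sum of the
    # per-step conditional advantages along the chosen order.
    joint = OPERATOR_ADV[op] + PARAM_ADV[op][param]
    return op, param, joint
```

An approach that picks both values independently has no way to express that a parameter value is only meaningful for one operator type; conditioning the second table on the first choice is the simplest way to make that ordering explicit.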

When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning

Authors: Anirban Das, Irtaza Khalid, Rafael Peñaloza, Steven Schockaert

Venue: NeurIPS 2025

First: 2025-10-27T17:09:16+00:00 · Latest: 2025-10-27T17:09:16+00:00

Comments: accepted at NeurIPS 2025 D&B track

Abstract

Designing models that can learn to reason in a systematic way is an important and long-standing challenge. In recent years, a wide range of solutions have been proposed for the specific case of systematic relational reasoning, including Neuro-Symbolic approaches, variants of the Transformer architecture, and specialised Graph Neural Networks. However, existing benchmarks for systematic relational reasoning focus on an overly simplified setting, based on the assumption that reasoning can be reduced to composing relational paths. In fact, this assumption is hard-baked into the architecture of several recent models, leading to approaches that can perform well on existing benchmarks but are difficult to generalise to other settings. To support further progress in the field of systematic relational reasoning with neural networks, we introduce NoRA, a new benchmark which adds several levels of difficulty and requires models to go beyond path-based reasoning.

中文标题/摘要

标题：无路可通罗马：系统神经关系推理基准测试

设计能够以系统方式学习推理的模型是一个重要且长期存在的挑战。近年来，针对系统关系推理的具体情况提出了多种解决方案，包括神经符号方法、Transformer架构的变体和专门的图神经网络。然而，现有的系统关系推理基准测试集中在过于简化的设置上，基于推理可以简化为关系路径组合的假设。实际上，这一假设已经嵌入到许多最近模型的架构中，导致这些方法在现有基准测试上表现良好，但在其他设置中难以泛化。为了支持系统关系推理领域中神经网络的进一步进展，我们引入了NoRA，这是一个新的基准测试，增加了多个难度级别，要求模型超越基于路径的推理。

Summary / 总结

The paper addresses the challenge of designing models that can reason systematically, which has been a long-standing issue. It introduces NoRA, a new benchmark that goes beyond the simplistic path-based reasoning assumed in existing benchmarks. The key finding is that current models, which are optimized for existing benchmarks, struggle with the more complex scenarios introduced by NoRA, highlighting the need for models that can generalize better to different settings.

该论文旨在设计能够进行系统性推理的模型，引入了NoRA这一新基准，该基准超越了基于路径的推理。作者提出NoRA以支持神经网络在系统性关系推理领域的进一步进展，并增加了多个难度级别。实验结果表明，现有的通常基于路径的方法在NoRA上表现不佳，突显了需要更通用的方法的必要性。

DPGLA: Bridging the Gap between Synthetic and Real Data for Unsupervised Domain Adaptation in 3D LiDAR Semantic Segmentation

Authors: Wanmeng Li, Simone Mosco, Daniel Fusaro, Alberto Pretto

Venue: IROS 2025

First: 2025-10-27T17:05:59+00:00 · Latest: 2025-10-27T17:05:59+00:00

Comments: This paper has been accepted for publication at the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

Abstract

Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize data mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method in this paper.

中文标题/摘要

标题：DPGLA：合成数据与真实数据之间在3D LiDAR语义分割无监督领域适应中的桥梁

为智能自主系统标注真实世界的LiDAR点云成本高昂。为克服这一限制，基于自训练的无监督领域适应（UDA）已被广泛用于通过利用合成点云数据来提高点云语义分割性能。然而，我们认为现有方法未能有效利用未标注数据，因为它们依赖预定义或固定的置信度阈值，导致性能不佳。在本文中，我们提出了一种动态伪标签过滤（DPLF）方案，以增强点云UDA语义分割中真实数据的利用。此外，我们设计了一个简单且高效的先验引导数据增强管道（PG-DAP），以减轻合成与真实世界点云之间的领域偏移。最后，我们利用数据混合一致性损失来促使模型学习上下文无关的表示。我们实现了该方法，并通过与最新方法的广泛比较对其进行了全面评估。在两个具有挑战性的合成到真实点云语义分割任务上的实验表明，我们的方法实现了优越的性能。消融研究证实了DPLF和PG-DAP模块的有效性。我们在本文中发布了我们方法的代码。

Summary / 总结

This paper addresses the challenge of annotating real-world LiDAR point clouds by proposing a method that enhances the utilization of unlabeled data through a Dynamic Pseudo-Label Filtering scheme and a Prior-Guided Data Augmentation Pipeline. The approach also employs data mixing consistency loss to improve context-free representation learning. Experiments show that the proposed method outperforms state-of-the-art techniques on two challenging synthetic-to-real point cloud semantic segmentation tasks, and ablation studies confirm the effectiveness of the proposed modules.

本文通过提出一种动态伪标签过滤（DPLF）方案和先验引导数据增强管道（PG-DAP），旨在解决为自主系统标注真实LiDAR点云的挑战。该方法还利用数据混合一致性损失来提高上下文无关的表示。实验表明，所提出的方法在两个具有挑战性的合成到真实点云语义分割任务中优于最先进的技术，并且消融研究证实了DPLF和PG-DAP模块的有效性。

Toward Carbon-Neutral Human AI: Rethinking Data, Computation, and Learning Paradigms for Sustainable Intelligence

Authors: KC Santosh, Rodrigue Rizk, Longwei Wang

First: 2025-10-27T17:02:30+00:00 · Latest: 2025-10-27T17:02:30+00:00

Comments: 9 pages, 3 figures

Abstract

The rapid advancement of Artificial Intelligence (AI) has led to unprecedented computational demands, raising significant environmental and ethical concerns. This paper critiques the prevailing reliance on large-scale, static datasets and monolithic training paradigms, advocating for a shift toward human-inspired, sustainable AI solutions. We introduce a novel framework, Human AI (HAI), which emphasizes incremental learning, carbon-aware optimization, and human-in-the-loop collaboration to enhance adaptability, efficiency, and accountability. By drawing parallels with biological cognition and leveraging dynamic architectures, HAI seeks to balance performance with ecological responsibility. We detail the theoretical foundations, system design, and operational principles that enable AI to learn continuously and contextually while minimizing carbon footprints and human annotation costs. Our approach addresses pressing challenges in active learning, continual adaptation, and energy-efficient model deployment, offering a pathway toward responsible, human-centered artificial intelligence.

中文标题/摘要

标题：向碳中和的人工智能迈进：重新思考数据、计算和学习范式以实现可持续智能

人工智能（AI）的迅猛发展带来了前所未有的计算需求，引发了重大的环境和伦理问题。本文批评了当前对大规模、静态数据集和单一训练范式的依赖，提倡转向受人类启发的可持续AI解决方案。我们提出了一种新的框架——人类AI（HAI），强调增量学习、碳意识优化和人机协作，以提高适应性、效率和问责性。通过借鉴生物认知原理并利用动态架构，HAI旨在在性能与生态责任之间取得平衡。我们详细阐述了理论基础、系统设计和操作原则，使AI能够持续、情境化地学习，同时减少碳足迹和人工注释成本。我们的方法解决了主动学习、持续适应和能源高效模型部署等紧迫挑战，提供了一条负责任的人本化人工智能之路。

Summary / 总结

This paper addresses the environmental and ethical concerns arising from the high computational demands of AI. It proposes a Human AI (HAI) framework that emphasizes incremental learning, carbon-aware optimization, and human-in-the-loop collaboration to enhance adaptability and efficiency. The key findings include theoretical foundations, system design, and operational principles that enable continuous and context-aware learning while reducing carbon footprints and human annotation costs.

本文针对AI高能耗带来的环境和伦理问题，提出了一个强调增量学习、碳意识优化和人类在环合作的Human AI (HAI)框架，以提高适应性和效率。主要发现包括理论基础、系统设计和操作原则，这些原则使AI能够实现持续且上下文相关的学习，同时减少碳足迹和人工标注成本。

DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning

Authors: Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta

First: 2025-07-28T03:34:15+00:00 · Latest: 2025-10-27T17:00:52+00:00

Comments: accepted at ECAI 2025; offline cross-domain reinforcement learning with a guided diffusion model

Abstract

Cross-domain offline reinforcement learning (RL) seeks to enhance sample efficiency in offline RL by utilizing additional offline source datasets. A key challenge is to identify and utilize source samples that are most relevant to the target domain. Existing approaches address this challenge by measuring domain gaps through domain classifiers, target transition dynamics modeling, or mutual information estimation using contrastive loss. However, these methods often require large target datasets, which is impractical in many real-world scenarios. In this work, we address cross-domain offline RL under a limited target data setting, identifying two primary challenges: (1) Dataset imbalance, which is caused by large source and small target datasets and leads to overfitting in neural network-based domain gap estimators, resulting in uninformative measurements; and (2) Partial domain overlap, where only a subset of the source data is closely aligned with the target domain. To overcome these issues, we propose DmC, a novel framework for cross-domain offline RL with limited target samples. Specifically, DmC utilizes $k$-nearest neighbor ($k$-NN) based estimation to measure domain proximity without neural network training, effectively mitigating overfitting. Then, by utilizing this domain proximity, we introduce a nearest-neighbor-guided diffusion model to generate additional source samples that are better aligned with the target domain, thus enhancing policy learning with more effective source samples. Through theoretical analysis and extensive experiments in diverse MuJoCo environments, we demonstrate that DmC significantly outperforms state-of-the-art cross-domain offline RL methods, achieving substantial performance gains.

中文标题/摘要

标题：DmC：面向离线跨域强化学习的最近邻引导扩散模型

跨域离线强化学习（RL）旨在通过利用额外的离线源数据集来提高离线RL的样本效率。关键挑战在于识别和利用与目标域最相关的源样本。现有方法通过领域分类器、目标转移动力学建模或基于对比损失的互信息估计来测量领域差距。然而，这些方法通常需要大量的目标数据集，这在许多实际场景中是不切实际的。在本文中，我们针对有限目标数据集下的跨域离线RL，识别出两个主要挑战：（1）数据集不平衡，由于源数据集大而目标数据集小，导致基于神经网络的领域差距估计器过拟合，产生无信息的测量；（2）部分领域重叠，只有源数据的一部分与目标域紧密对齐。为了解决这些问题，我们提出了DmC，一种新的有限目标样本下的跨域离线RL框架。具体而言，DmC利用基于$k$-最近邻（$k$-NN）的估计来测量领域接近度，无需进行神经网络训练，有效缓解了过拟合。然后，通过利用这种领域接近度，我们引入了一个最近邻引导的扩散模型来生成与目标域更对齐的额外源样本，从而使用更有效的源样本增强策略学习。通过理论分析和在多种MuJoCo环境中的广泛实验，我们证明DmC显著优于最先进的跨域离线RL方法，实现了显著的性能提升。

Summary / 总结

The paper addresses the challenge of cross-domain offline reinforcement learning with limited target data, focusing on dataset imbalance and partial domain overlap. It proposes DmC, a framework that uses $k$-nearest neighbor estimation to measure domain proximity without neural network training, and introduces a nearest-neighbor-guided diffusion model to generate more aligned source samples. Experiments show that DmC outperforms existing methods in various MuJoCo environments, achieving significant performance gains.

论文针对有限目标数据下的跨域离线强化学习问题，重点关注数据集不平衡和部分领域重叠。提出了一种名为DmC的新框架，利用$k$-最近邻估计来测量领域接近度，无需训练神经网络，并引入了最近邻引导的扩散模型生成更符合目标领域的源样本。实验结果表明，DmC在各种MuJoCo环境中显著优于现有方法，实现了显著的性能提升。
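
The idea of measuring domain proximity with $k$-NN instead of a trained network can be sketched as follows. The specific score used here (mean Euclidean distance from a source sample to its `k` nearest target samples) is an assumption for illustration, as are the function names; the paper's exact estimator may differ.

```python
import math


def knn_distance(sample, targets, k=2):
    """Mean Euclidean distance from `sample` to its k nearest target samples."""
    dists = sorted(math.dist(sample, t) for t in targets)
    k = min(k, len(dists))  # guard against very small target sets
    return sum(dists[:k]) / k


def rank_source_by_proximity(sources, targets, k=2):
    """Order source samples from closest to farthest from the target domain."""
    return sorted(sources, key=lambda s: knn_distance(s, targets, k))
```

Because nothing is fitted, the score cannot overfit to a small target dataset, which is the motivation the abstract gives for avoiding neural domain-gap estimators. Such a score could then serve as the guidance signal for a generative model, a step not shown here.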

AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation

Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu

First: 2025-03-13T08:22:28+00:00 · Latest: 2025-10-27T16:55:55+00:00

Abstract

While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10% in key metrics.

中文标题/摘要

标题：AttentionRAG：检索增强生成中的注意力引导上下文剪枝

尽管RAG在LLM应用中表现出色，但其有效性受到检索上下文不断增加的长度的阻碍，这引入了信息冗余和巨大的计算开销。现有的上下文剪枝方法，如LLMLingua，缺乏上下文意识，且在控制压缩率方面灵活性有限，往往导致剪枝不足或信息丢失过多。在本文中，我们提出了一种名为AttentionRAG的注意力引导上下文剪枝方法，用于RAG系统。AttentionRAG的核心思想在于其注意力焦点机制，该机制将RAG查询重新表述为下一个词预测范式。该机制将查询的语义焦点隔离到单个词，从而实现查询与检索上下文之间精确高效的注意力计算。在LongBench和Babilong基准上的广泛实验表明，AttentionRAG在上下文压缩方面最高可达到6.3倍的压缩率，且在关键指标上比LLMLingua方法高出约10%。

Summary / 总结

AttentionRAG is an attention-guided context pruning method for RAG systems aimed at reducing the length of retrieved contexts to minimize redundancy and computational overhead. It reformulates RAG queries into a next-token prediction paradigm, focusing on the semantic core of the query to enable precise attention calculations. Experiments show that AttentionRAG can compress contexts up to 6.3 times while improving key metrics by around 10% compared to LLMLingua methods.

AttentionRAG 是一种针对 RAG 系统的注意力导向上下文剪枝方法，旨在减少检索到的上下文长度，以减少信息冗余和计算开销。它将 RAG 查询重新表述为下一个词预测范式，聚焦于查询的语义核心，以实现精确和高效的注意力计算。实验表明，AttentionRAG 可以将上下文压缩多达 6.3 倍，并在关键指标上比 LLMLingua 方法高出约 10%。
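
Once the query's focus has been reduced to a single token, pruning amounts to keeping the context segments that token attends to most. The sketch below assumes the attention weights have already been extracted from a model and are supplied as plain numbers; the sentence-level granularity, the max-pooling of token scores, and the `keep_ratio` parameter are hypothetical choices, not details confirmed by the abstract.

```python
import math


def prune_context(sentences, token_attention, keep_ratio=0.5):
    """Keep the sentences that receive the most attention from the focus token.

    sentences:        list of context sentences
    token_attention:  one non-empty list of attention weights per sentence,
                      giving the focus token's attention to each of its tokens
    """
    # Score each sentence by its most-attended token.
    scored = [(max(weights), i) for i, weights in enumerate(token_attention)]
    n_keep = max(1, math.ceil(len(sentences) * keep_ratio))
    top = sorted(scored, reverse=True)[:n_keep]
    # Restore the original order so the pruned context still reads naturally.
    keep = sorted(i for _, i in top)
    return [sentences[i] for i in keep]
```

Adjusting `keep_ratio` gives direct control over the compression rate, which is the kind of flexibility the abstract says existing pruning methods lack.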

RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation

Authors: Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Zhucun Xue, Yong Liu, Shuicheng Yan

First: 2025-01-14T22:03:00+00:00 · Latest: 2025-10-27T16:55:19+00:00

Abstract

In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.

中文标题/摘要

标题：RWKV-UNet：通过长距离合作改进UNet以实现有效的医学图像分割

近年来，深度学习在医学图像分割领域取得了显著进展，特别是在卷积神经网络(CNN)和Transformer模型方面。然而，CNN在捕捉长距离依赖性方面存在局限性，而Transformer则面临高计算复杂度的问题。为了解决这一问题，我们提出了一种名为RWKV-UNet的新模型，该模型将RWKV（Receptance Weighted Key Value）结构集成到U-Net架构中。这种集成增强了模型捕捉长距离依赖性和提高上下文理解的能力，这对于准确的医学图像分割至关重要。我们构建了一个强大的编码器，其中包含结合了CNN和RWKV的全局-局部空间感知(GLSP)模块。我们还提出了一种跨通道混合(CCM)模块，以通过多尺度特征融合改进跳跃连接，实现全局通道信息整合。在11个基准数据集上的实验表明，RWKV-UNet在各种类型的医学图像分割任务中达到了最先进的性能。此外，较小的变体RWKV-UNet-S和RWKV-UNet-T在准确性和计算效率之间取得了平衡，使其适用于更广泛的临床应用。

Summary / 总结

The research aims to improve medical image segmentation by addressing the limitations of CNNs and transformers. RWKV-UNet integrates the RWKV structure into U-Net to enhance long-range dependency capture and contextual understanding. Experiments on 11 benchmark datasets demonstrate that RWKV-UNet achieves state-of-the-art performance, with smaller variants balancing accuracy and efficiency for clinical applications.

研究旨在通过解决CNN和transformer的局限性，提高医学图像分割的效果。RWKV-UNet将RWKV集成到U-Net中，以捕捉长距离依赖关系并增强上下文理解。实验在11个基准数据集上显示，RWKV-UNet在各种医学图像分割任务中表现出色，较小的变体在准确性和计算效率之间取得平衡，适用于更广泛的临床应用。

FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou

First: 2025-10-27T16:54:08+00:00 · Latest: 2025-10-27T16:54:08+00:00

Abstract

This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Alternatively, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at https://future-item.github.io/FreeFuse/

中文标题/摘要

标题：FreeFuse：通过测试时自动掩码的多主题LoRA融合

本文提出FreeFuse，这是一种无需训练的新型方法，用于通过自动融合多个主题LoRA进行多主题文本到图像生成。与现有方法要么专注于推理前的LoRA权重合并，要么依赖分割模型和噪声混合等复杂技术来隔离LoRA输出不同，我们的关键洞察是，可以从跨注意力层权重中自动推导出上下文感知的动态主题掩码。数学分析表明，在推理时直接应用这些掩码到LoRA输出可以很好地近似于将主题LoRA集成到扩散模型并在掩码区域单独使用的情况。FreeFuse展示了更高的实用性和效率，因为它不需要额外的训练，不需要修改LoRA，不需要辅助模型，也不需要用户定义的提示模板或区域指定。相反，它只需要用户提供LoRA激活词即可无缝集成到标准工作流程中。广泛的实验验证了在多主题生成任务中，FreeFuse在生成质量和易用性方面均优于现有方法。项目页面位于https://future-item.github.io/FreeFuse/

Summary / 总结

FreeFuse is a training-free approach for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs using context-aware dynamic subject masks derived from cross-attention layer weights. This method avoids the need for pre-inference LoRA weight merging, segmentation models, or noise blending, and only requires users to provide LoRA activation words. Experiments show that FreeFuse outperforms existing methods in both generation quality and usability for multi-subject tasks without additional training or modifications.

FreeFuse 是一种无需训练的方法，用于多主题文本到图像生成，通过从交叉注意力层权重中自动提取上下文感知的动态主题掩码来融合多个主题LoRA。该方法避免了预推理LoRA权重合并、分割模型或噪声混合的需求。实验结果表明，FreeFuse 在多主题任务中在生成质量和易用性方面优于现有方法，无需额外训练或修改LoRA。
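
The masking idea can be sketched on flat lists: each spatial position is assigned to the subject whose activation word receives the highest cross-attention weight there, and each subject's LoRA output is added only inside its own mask. The hard argmax assignment, the data layout, and the function names are assumptions for illustration; the abstract does not specify how the masks are derived from the attention weights.

```python
def subject_masks(attn):
    """Build one binary mask per subject from cross-attention weights.

    attn[s][p] is the (hypothetical) attention weight of subject s's
    activation word at spatial position p.
    """
    n_subjects, n_positions = len(attn), len(attn[0])
    masks = [[0.0] * n_positions for _ in range(n_subjects)]
    for p in range(n_positions):
        winner = max(range(n_subjects), key=lambda s: attn[s][p])
        masks[winner][p] = 1.0
    return masks


def fuse_lora_outputs(base, lora_deltas, masks):
    """Add each subject's LoRA delta only where that subject's mask is active."""
    fused = list(base)
    for delta, mask in zip(lora_deltas, masks):
        for p, m in enumerate(mask):
            fused[p] += m * delta[p]
    return fused
```

Since the masks come from attention weights computed during inference, this kind of fusion needs no extra training, segmentation model, or user-drawn regions, matching the practicality claims in the abstract.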

Localising under the drape: proprioception in the era of distributed surgical robotic system

Authors: Martin Huber, Nicola A. Cavalcanti, Ayoob Davoodi, Ruixuan Li, Christopher E. Mower, Fabio Carrillo, Christoph J. Laux, Francois Teyssere, Thibault Chandanson, Antoine Harlé, Elie Saghbiny, Mazda Farshad, Guillaume Morel, Emmanuel Vander Poorten, Philipp Fürnstahl, Sébastien Ourselin, Christos Bergeles, Tom Vercauteren

First: 2025-10-27T16:50:12+00:00 · Latest: 2025-10-27T16:50:12+00:00

Abstract

Despite their mechanical sophistication, surgical robots remain blind to their surroundings. This lack of spatial awareness causes collisions, system recoveries, and workflow disruptions, issues that will intensify with the introduction of distributed robots with independent interacting arms. Existing tracking systems rely on bulky infrared cameras and reflective markers, providing only limited views of the surgical scene and adding hardware burden in crowded operating rooms. We present a marker-free proprioception method that enables precise localisation of surgical robots under their sterile draping despite associated obstruction of visual cues. Our method relies solely on lightweight stereo-RGB cameras and novel transformer-based deep learning models. It builds on the largest multi-centre spatial robotic surgery dataset to date (1.4M self-annotated images from human cadaveric and preclinical in vivo studies). By tracking the entire robot and surgical scene, rather than individual markers, our approach provides a holistic view robust to occlusions, supporting surgical scene understanding and context-aware control. We demonstrate an example of potential clinical benefits during in vivo breathing compensation with access to tissue dynamics, unobservable under state-of-the-art tracking, and accurately localise robots in multi-robot systems for future intelligent interaction. In addition, compared with existing systems, our method eliminates markers and improves tracking visibility by 25%. To our knowledge, this is the first demonstration of marker-free proprioception for fully draped surgical robots, reducing setup complexity, enhancing safety, and paving the way toward modular and autonomous robotic surgery.

中文标题/摘要

标题：在覆盖物下的定位：分布式手术机器人时代的本体感受

尽管手术机器人技术复杂，但它们仍然对周围环境一无所知。这种空间意识的缺乏导致了碰撞、系统恢复和工作流程中断等问题，这些问题将随着具有独立交互臂的分布式机器人的引入而加剧。现有的跟踪系统依赖于笨重的红外摄像头和反射标记，只能提供有限的手术场景视图，并在拥挤的手术室中增加了硬件负担。我们提出了一种无标记的本体感受方法，使手术机器人在其无菌覆盖物下实现精确定位，尽管视觉线索受到阻碍。我们的方法仅依赖于轻量级的立体RGB摄像头和新型基于Transformer的深度学习模型。它基于迄今为止最大的多中心空间机器人手术数据集（140万张自注释图像，来自人体尸体和临床前体内研究）。通过跟踪整个机器人和手术场景，而不是单独的标记，我们的方法提供了对遮挡具有鲁棒性的整体视图，支持手术场景理解和上下文感知控制。我们展示了在体内呼吸补偿中的潜在临床益处示例，该方法可以获取组织动力学信息，而这是现有最先进的跟踪技术无法观察到的；我们还在多机器人系统中实现了准确定位，以支持未来的智能交互。此外，与现有系统相比，我们的方法消除了标记并将跟踪可见性提高了25%。据我们所知，这是首次展示用于完全覆盖的手术机器人的无标记本体感受，降低了设置复杂性，提高了安全性，并为模块化和自主机器人手术铺平了道路。

Summary / 总结

This paper addresses the lack of spatial awareness in surgical robots, which causes collisions and workflow disruptions. It introduces a marker-free proprioception method that relies on lightweight stereo-RGB cameras and transformer-based deep learning models. The method precisely localises surgical robots under sterile draping and provides a holistic view that is robust to occlusions. Compared with existing systems, it improves tracking visibility by 25%, and it shows potential clinical benefits in in vivo breathing compensation and accurate localisation in multi-robot systems.

该论文解决了手术机器人缺乏空间感知的问题，这一问题可能导致碰撞和工作流程中断。它提出了一种基于轻量级立体RGB摄像头和基于Transformer的深度学习模型的无标记本体感知方法。该方法能够在手术机器人覆盖无菌布的情况下提供精确的定位，并提供一个对遮挡具有鲁棒性的整体视图。与现有系统相比，它将跟踪可见性提高了25%，并展示了呼吸补偿等潜在临床益处以及在多机器人系统中的准确定位，为模块化和自主机器人手术铺平了道路。

Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2

Authors: Naveenkumar G Venkataswamy, Yu Liu, Soumyabrata Dey, Stephanie Schuckers, Masudul H Imtiaz

First: 2025-10-07T17:33:41+00:00 · Latest: 2025-10-27T16:44:15+00:00

Comments: This submission has been withdrawn because it duplicates significant content from another version of the paper already available on arXiv as arXiv:2412.13063

Abstract

Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.

中文标题/摘要

标题：基于智能手机的高质量可见光虹膜识别V2

基于智能手机的可见光谱（VIS）虹膜识别由于光照变化、色素差异以及缺乏标准化捕获控制而困难。本研究提出了一种紧凑的端到端管道，确保在捕获过程中符合ISO/IEC 29794-6质量标准，并证明在普通设备上实现准确的VIS虹膜识别是可行的。通过一个自定义的Android应用程序进行实时构图、清晰度评估和反馈，我们引入了包含47名受试者752张合规图像的CUVIRIS数据集。开发了一种轻量级的基于MobileNetV3的多任务分割网络（LightIrisNet）用于设备端高效处理，并将Transformer匹配器（IrisFormer）适应到VIS域。在标准化协议下，并与先前的CNN基线进行对比基准测试，OSIRIS在FAR=0.01时达到TAR为97.9%（EER为0.76%），而仅在UBIRIS.v2上训练的IrisFormer在CUVIRIS上的EER为0.057%。该捕获应用程序、训练模型和数据集的公共子集已发布以支持可重复性。这些结果表明，标准化捕获和适应VIS的轻量级模型使智能手机上的准确和实用虹膜识别成为可能。
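The abstract reports TAR at a fixed FAR and the equal error rate (EER). As a minimal sketch of how these metrics are commonly computed from matcher scores, the code below assumes that higher scores mean a better match and sweeps thresholds over the observed scores. The function names and the toy scores are hypothetical and are not taken from the paper or from OSIRIS.

```python
def far_tar_at_threshold(genuine, impostor, threshold):
    """Accept a comparison when score >= threshold; return (FAR, TAR)."""
    tar = sum(s >= threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return far, tar


def tar_at_far(genuine, impostor, target_far):
    """Best TAR among thresholds whose FAR does not exceed target_far."""
    best = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        far, tar = far_tar_at_threshold(genuine, impostor, t)
        if far <= target_far:
            best = max(best, tar)
    return best


def equal_error_rate(genuine, impostor):
    """Approximate EER: average of FAR and FRR where they are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far, tar = far_tar_at_threshold(genuine, impostor, t)
        frr = 1.0 - tar  # false reject rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (illustrative only).
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5]
print(tar_at_far(genuine_scores, impostor_scores, 0.25))
print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of scores the EER is coarse; evaluation toolkits typically interpolate between thresholds, which this sketch does not attempt.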

Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier

Authors: Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro

First: 2025-10-27T16:40:17+00:00 · Latest: 2025-10-27T16:40:17+00:00

Comments: 16 pages, 11 figures

Abstract

The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk of misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

中文标题/摘要

标题：通过情感推理验证器实现多模态LLM的情感一致推理

近期多模态大型语言模型（MLLMs）的发展正在将人机交互（HCI）从表面交流转变为更加细腻和情感智能的交流。为了实现这一转变，理解情感变得至关重要，使系统能够捕捉到用户意图背后的微妙线索。此外，为预测的情感提供忠实的解释对于确保可解释性和建立用户信任至关重要。然而，当前基于MLLM的方法往往生成与目标标签相悖的情感解释，有时甚至与其自身预测的情感相矛盾。这种不一致性对误解构成重大风险，并在交互环境中削弱了可靠性。为了解决这一问题，我们提出了一种新颖的方法：情感推理验证器（ERV）和解释奖励。我们的方法引导模型在多模态情感识别过程中产生明确与目标情感一致的推理，而无需修改模型架构或需要额外的配对视频描述注释。我们的方法在MAFW和DFEW数据集上显著提高了忠实解释预测一致性和解释情感准确性。通过广泛的实验和人工评估，我们表明，我们的方法不仅增强了解释与预测之间的对齐，还使MLLM能够提供情感一致、值得信赖的交互，标志着向真正的人类级HCI系统迈出关键一步。

Summary / 总结

The paper proposes the Emotional Rationale Verifier (ERV) and an Explanation Reward to enhance the consistency and accuracy of emotion explanations in Multimodal Large Language Models (MLLMs). This method ensures that the generated explanations align with the target emotions without altering the model architecture or requiring additional annotations. Experiments on MAFW and DFEW datasets show significant improvements in faithful explanation-prediction consistency and explanation emotion accuracy, leading to more emotionally coherent and trustworthy interactions.

论文提出了情感推理验证器（ERV）和解释奖励，以增强多模态大型语言模型（MLLM）中情绪解释的一致性和准确性。该方法确保生成的解释与目标情绪一致，无需修改模型架构或额外的配对视频描述注释。在MAFW和DFEW数据集上的实验显示，解释预测一致性以及解释情绪准确性有了显著提高，从而实现了更加情感一致和值得信赖的交互，标志着向真正的人类级人机交互系统迈出了关键一步。

Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation

Authors: Kaiyu Song, Hanjiang Lai, Yaqing Zhang, Chuangjian Cai, Yan Pan, Kun Yue, Jian Yin

First: 2025-10-24T08:51:48+00:00 · Latest: 2025-10-27T16:38:35+00:00

Abstract

In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to "see" all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of $1024^3$. The code will be released at: https://github.com/psky1111/Tencent-TSSR.

中文标题/摘要

标题：拓扑雕塑家，形状精炼师：离散扩散模型在高保真3D网格生成中的应用

在本文中，我们介绍了拓扑雕塑家，形状精炼师（TSSR），一种基于离散扩散模型（DDMs）生成高质量、艺术家风格3D网格的新方法。TSSR的主要动机是实现高度准确的标记预测，同时允许并行生成，这是相对于顺序自回归方法的一个重要优势。通过使TSSR“同时”看到所有网格标记，我们解锁了新的效率和控制水平。我们通过三个关键创新利用这种并行生成能力：1）解耦训练和混合推理，将基于DDM的生成分为拓扑雕塑阶段和后续的形状精炼阶段。这种战略解耦使TSSR能够有效捕捉复杂的局部拓扑和总体的全局形状。2）改进的 hourglass 架构，通过面-顶点-序列级别的旋转位置嵌入（RoPE）增强双向注意力，从而捕捉网格结构中的更丰富上下文信息。3）一种新颖的连接损失，作为拓扑约束，进一步增强生成网格的真实性和保真度。在复杂数据集上的大量实验表明，TSSR能够生成高质量的3D艺术家风格网格，能够在惊人的空间分辨率$1024^3$下达到高达10,000个面。代码将在以下地址发布：https://github.com/psky1111/Tencent-TSSR。

Summary / 总结

The paper introduces Topology Sculptor, Shape Refiner (TSSR), a method for generating high-quality 3D meshes using Discrete Diffusion Models. Motivated by the need for accurate token prediction with parallel generation, TSSR employs decoupled training and hybrid inference, an improved hourglass architecture with bidirectional attention and rotational positional embeddings, and a novel connection loss. These innovations enable TSSR to generate meshes with up to 10,000 faces at a spatial resolution of $1024^3$, achieving high fidelity and realism. The authors state that the code will be released at https://github.com/psky1111/Tencent-TSSR.

论文介绍了使用离散扩散模型生成高质量3D网格的Topology Sculptor, Shape Refiner (TSSR)方法。该方法旨在实现准确且高效的网格生成，通过解耦训练和混合推理、改进的沙漏（hourglass）架构以及新颖的连接损失，TSSR能够并行生成网格，同时捕捉局部拓扑和全局形状。实验结果显示，TSSR能够生成高达10,000个面、分辨率为$1024^3$的高质量网格。代码将在https://github.com/psky1111/Tencent-TSSR发布。

Towards Deep Physics-Informed Kolmogorov-Arnold Networks

Authors: Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis

First: 2025-10-27T16:35:01+00:00 · Latest: 2025-10-27T16:35:01+00:00

Comments: 73 pages, 22 figures

Abstract

Since their introduction, Kolmogorov-Arnold Networks (KANs) have been successfully applied across several domains, with physics-informed machine learning (PIML) emerging as one of the areas where they have thrived. In the PIML setting, Chebyshev-based physics-informed KANs (cPIKANs) have become the standard due to their computational efficiency. However, like their multilayer perceptron-based counterparts, cPIKANs face significant challenges when scaled to depth, leading to training instabilities that limit their applicability to several PDE problems. To address this, we propose a basis-agnostic, Glorot-like initialization scheme that preserves activation variance and yields substantial improvements in stability and accuracy over the default initialization of cPIKANs. Inspired by the PirateNet architecture, we further introduce Residual-Gated Adaptive KANs (RGA KANs), designed to mitigate divergence in deep cPIKANs where initialization alone is not sufficient. Through empirical tests and information bottleneck analysis, we show that RGA KANs successfully traverse all training phases, unlike baseline cPIKANs, which stagnate in the diffusion phase in specific PDE settings. Evaluations on seven standard forward PDE benchmarks under a fixed training pipeline with adaptive components demonstrate that RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets - often by several orders of magnitude - while remaining stable in settings where the others diverge.

中文标题/摘要

标题：迈向深度物理信息柯尔莫哥洛夫-阿诺尔德网络

自引入以来，柯尔莫哥洛夫-阿诺尔德网络（KANs）已在多个领域成功应用，其中物理信息机器学习（PIML）已成为它们蓬勃发展的领域之一。在PIML设置中，基于切比雪夫的物理信息KANs（cPIKANs）因其计算效率而成为标准。然而，与基于多层感知机的同类模型一样，cPIKANs在深度扩展时面临重大挑战，导致训练不稳定，限制了它们在多个偏微分方程（PDE）问题中的应用。为了解决这一问题，我们提出了一种基底无关的、类似于Glorot的初始化方案，该方案保持了激活方差，与cPIKANs的默认初始化相比显著提高了稳定性和准确性。受PirateNet架构的启发，我们进一步引入了残差门控自适应KANs（RGA KANs），旨在缓解仅靠初始化不足以解决的深度cPIKANs发散问题。通过实证测试和信息瓶颈分析，我们表明RGA KANs能够成功穿越所有训练阶段，而基线cPIKANs在特定PDE设置中的扩散阶段会停滞不前。在带有自适应组件的固定训练管道下对七个标准前向PDE基准进行的评估表明，RGA KANs始终优于参数匹配的cPIKANs和PirateNets，通常高出几个数量级，同时在其他模型发散的设置中保持稳定。

Summary / 总结

This paper addresses the difficulty of training deep Chebyshev-based physics-informed Kolmogorov-Arnold Networks (cPIKANs) by proposing a basis-agnostic, Glorot-like initialization scheme and introducing Residual-Gated Adaptive KANs (RGA KANs). The initialization improves stability and accuracy over the default cPIKAN initialization, and RGA KANs outperform parameter-matched cPIKANs and PirateNets on seven standard forward PDE benchmarks, often by several orders of magnitude, while remaining stable where the others diverge.

本文旨在通过提出一种基底无关的初始化方案并引入残差门控自适应Kolmogorov-Arnold网络（RGA KANs）来解决深度物理信息Kolmogorov-Arnold网络（cPIKANs）的训练难题。作者证明了RGA KANs在稳定性与准确性上优于cPIKANs，特别是在深层网络中，并在七个标准PDE基准测试中，RGA KANs的性能显著优于参数匹配的cPIKANs和PirateNets，通常高出几个数量级。
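For readers unfamiliar with Chebyshev-based KAN layers, here is a minimal sketch of a forward pass. It assumes the common formulation in which each input is squashed with tanh into [-1, 1] and then expanded in Chebyshev polynomials with learnable coefficients. The initializer `fan_scaled_coeffs` is a hypothetical placeholder that simply shrinks the coefficient scale with the number of summed terms; it is not the Glorot-like scheme proposed in the paper, whose details are not given in the abstract.

```python
import math
import random


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using T_{d+1} = 2x T_d - T_{d-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def chebyshev_kan_layer(inputs, coeffs):
    """Compute out[j] = sum_i sum_d coeffs[j][i][d] * T_d(tanh(inputs[i]))."""
    degree = len(coeffs[0][0]) - 1
    basis = [chebyshev_basis(math.tanh(x), degree) for x in inputs]
    return [
        sum(c * t for row, b in zip(out_coeffs, basis) for c, t in zip(row, b))
        for out_coeffs in coeffs
    ]


def fan_scaled_coeffs(fan_in, fan_out, degree, rng):
    """Hypothetical placeholder init (NOT the paper's scheme): zero-mean
    Gaussian whose std shrinks with the number of terms in each output sum."""
    std = 1.0 / math.sqrt(fan_in * (degree + 1))
    return [
        [[rng.gauss(0.0, std) for _ in range(degree + 1)] for _ in range(fan_in)]
        for _ in range(fan_out)
    ]


rng = random.Random(0)
coeffs = fan_scaled_coeffs(fan_in=3, fan_out=2, degree=4, rng=rng)
print(chebyshev_kan_layer([0.1, -0.5, 2.0], coeffs))
```

Stacking many such layers is where the training instabilities described in the abstract arise, since the variance of the activations can drift from layer to layer unless the coefficient scale is chosen carefully.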

Mixed Precision Training of Neural ODEs

Authors: Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang

First: 2025-10-27T16:32:56+00:00 · Latest: 2025-10-27T16:32:56+00:00

Comments: Code available at https://github.com/EmoryMLIP/rampde; 26 pages, 4 figures

Abstract

Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively performing all computations in low precision can lead to roundoff errors and instabilities. Therefore, mixed precision training schemes usually store the weights in high precision and use low-precision computations only for whitelisted operations. Despite their success, these principles are currently not reliable for training continuous-time architectures such as neural ordinary differential equations (Neural ODEs). This paper presents a mixed precision training framework for neural ODEs, combining explicit ODE solvers with a custom backpropagation scheme, and demonstrates its effectiveness across a range of learning tasks. Our scheme uses low-precision computations for evaluating the velocity, parameterized by the neural network, and for storing intermediate states, while stability is provided by a custom dynamic adjoint scaling and by accumulating the solution and gradients in higher precision. These contributions address two key challenges in training neural ODEs: the computational cost of repeated network evaluations and the growth of memory requirements with the number of time steps or layers. Along with the paper, we publish our extendable, open-source PyTorch package rampde, whose syntax resembles that of leading packages to provide a drop-in replacement in existing codes. We demonstrate the reliability and effectiveness of our scheme using challenging test cases and on neural ODE applications in image classification and generative models, achieving approximately 50% memory reduction and up to 2x speedup while maintaining accuracy comparable to single-precision training.

中文标题/摘要

标题：神经ODE的混合精度训练

利用低精度计算已成为深度学习中应对不断增长的计算成本（由越来越大的模型和数据集引起）的标准策略。然而，简单地将所有计算都置于低精度可能会导致舍入误差和不稳定性。因此，混合精度训练方案通常将权重存储在高精度中，并仅在白名单操作中使用低精度计算。尽管取得了成功，这些原则目前在训练连续时间架构（如神经常微分方程（Neural ODEs））时并不可靠。本文提出了一种针对神经ODEs的混合精度训练框架，结合显式ODE求解器和自定义反向传播方案，并展示了其在各种学习任务中的有效性。我们的方案使用低精度计算来评估由神经网络参数化的速度，并用于存储中间状态，而稳定性则通过自定义动态伴随缩放和在更高精度下累积解和梯度来提供。这些贡献解决了训练神经ODEs的两个关键挑战：重复网络评估的计算成本和随时间步数或层数增加的内存需求增长。除了论文，我们还发布了可扩展的开源PyTorch包rampde，其语法类似于领先包，可在现有代码中提供即插即用的替代方案。我们使用具有挑战性的测试案例和神经ODE在图像分类和生成模型中的应用，展示了该方案的可靠性和有效性，实现了约50%的内存减少和高达2倍的速度提升，同时保持与单精度训练相当的准确性。

Summary / 总结

This paper addresses the challenge of training neural ordinary differential equations (Neural ODEs) using mixed precision techniques. The authors propose a framework that uses low-precision computations for evaluating the velocity and storing intermediate states, while maintaining stability through custom dynamic adjoint scaling and higher precision accumulation. This approach reduces memory usage by about 50% and achieves up to 2x speedup while maintaining accuracy comparable to single-precision training. The method is demonstrated on various learning tasks, including image classification and generative models.

该论文解决了使用混合精度技术训练神经常微分方程（Neural ODEs）的挑战。它提出了一种框架，使用低精度计算来评估速度并存储中间状态，同时通过自定义动态伴随缩放和更高精度的累积来保持稳定性。该方法实现了约50%的内存减少和最多2倍的加速，同时不牺牲准确性。其有效性通过各种学习任务得到验证，包括图像分类和生成模型。
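The benefit of accumulating the solution in higher precision can be illustrated with a toy forward Euler solver. This is a sketch under stated assumptions and does not use the rampde API: half precision is emulated with the standard-library `struct` format `"e"`, Python's double-precision floats stand in for the higher precision, the test problem dz/dt = -z is chosen for illustration, and the custom backpropagation and dynamic adjoint scaling described in the abstract are not shown.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def euler_solve(velocity, z0, step, num_steps,
                low_precision_velocity=False, low_precision_state=False):
    """Explicit Euler with optional emulated half-precision rounding."""
    z = z0
    for _ in range(num_steps):
        if low_precision_velocity:
            # Emulate evaluating the network in half precision.
            v = to_half(velocity(to_half(z)))
        else:
            v = velocity(z)
        z = z + step * v  # accumulation happens in double precision
        if low_precision_state:
            z = to_half(z)  # emulate accumulating the solution in half precision
    return z


def decay(z):
    """Toy velocity field: dz/dt = -z."""
    return -z


reference = euler_solve(decay, 1.0, 0.001, 1000)
mixed = euler_solve(decay, 1.0, 0.001, 1000, low_precision_velocity=True)
all_low = euler_solve(decay, 1.0, 0.001, 1000,
                      low_precision_velocity=True, low_precision_state=True)
print("mixed error:  ", abs(mixed - reference))
print("all-low error:", abs(all_low - reference))
```

In the all-low variant each Euler update is only one or two half-precision spacings wide, so rounding distorts every step; keeping the accumulator in higher precision avoids this even though the velocity itself is still evaluated in half precision.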