arXiv 论文速递

Generative View Stitching

Authors: Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

First: 2025-10-28T17:59:58+00:00 · Latest: 2025-10-28T17:59:58+00:00

Comments: Project website: https://andrewsonga.github.io/gvs

Abstract

Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.

中文标题/摘要

标题：生成式视图缝合

自回归视频扩散模型能够生成长期稳定且与历史一致的序列，但无法用未来条件引导当前生成。在具有预定义摄像机轨迹的摄像机引导视频生成中，这一限制会导致摄像机与生成场景发生碰撞，之后自回归迅速崩溃。为解决这一问题，我们提出了生成式视图缝合（GVS），该方法并行采样整个序列，使生成场景忠实于预定义的摄像机轨迹的每一部分。我们的主要贡献是一种采样算法，将先前用于机器人规划的扩散缝合技术扩展到视频生成。虽然此类缝合方法通常需要专门训练的模型，但GVS与任何使用扩散强迫（Diffusion Forcing，一种常见的序列扩散框架）训练的现成视频模型兼容，我们证明该框架已经具备缝合所需的条件。我们还引入了全方位引导（Omni Guidance）技术，通过同时以过去和未来为条件来增强缝合的时序一致性，并支撑我们提出的闭环机制以实现长程连贯性。总体而言，GVS实现了稳定、无碰撞、帧间一致且能够闭合回环的摄像机引导视频生成，适用于各种预定义摄像机路径，包括奥斯卡·路特斯瓦德（Oscar Reutersvärd）的不可能楼梯。结果最好以视频形式查看：https://andrewsonga.github.io/gvs

Uniform Discrete Diffusion with Metric Path for Video Generation

Authors: Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang

First: 2025-10-28T17:59:57+00:00 · Latest: 2025-10-28T17:59:57+00:00

Comments: 19 pages, 10 figures

Abstract

Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA

中文标题/摘要

标题：均匀离散扩散与度量路径的视频生成

连续空间视频生成已经取得了快速进展，而离散方法则因误差累积和长上下文不一致而落后。在本文中，我们重新审视了离散生成建模，并提出了均匀离散扩散与度量路径（URSA），这是一个简单而强大的框架，弥合了与连续方法之间的差距，实现可扩展的视频生成。URSA的核心在于将视频生成任务表述为离散时空标记的迭代全局细化。它结合了两个关键设计：线性化度量路径和分辨率相关的时间步偏移机制。这些设计使URSA能够高效地扩展到高分辨率图像合成和长时间视频生成，同时所需的推理步骤显著减少。此外，我们还引入了一种异步时间微调策略，该策略在单一模型中统一了多种任务，包括插值和图像到视频生成。在具有挑战性的视频和图像生成基准上的广泛实验表明，URSA始终优于现有离散方法，并且在性能上与最先进的连续扩散方法相当。代码和模型可在 https://github.com/baaivision/URSA 获取

Summary / 总结

URSA is a framework that addresses the limitations of discrete video generation methods by integrating a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. This approach enables efficient high-resolution image synthesis and long-duration video generation with fewer inference steps. Experiments show that URSA outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods on challenging benchmarks.

URSA通过结合离散时空令牌的迭代全局精炼、线性化度量路径和分辨率依赖的时间步长调整机制，解决了离散视频生成方法的局限性。这种方法使得URSA能够高效地进行高分辨率图像合成和长时视频生成，并且所需的推理步骤更少。实验表明，URSA在挑战性基准测试中优于现有离散方法，并且与最先进的连续扩散方法的性能相当。

Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan

First: 2025-10-28T17:59:02+00:00 · Latest: 2025-10-28T17:59:02+00:00

Abstract

Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.

中文标题/摘要

标题：MoE中的路由问题：通过显式路由指导扩展扩散变换器

混合专家（MoE）已成为一种强大的范式，用于在保持计算效率的同时扩展模型容量。尽管MoE在大型语言模型（LLMs）中取得了显著成功，但将其应用于扩散变换器（DiTs）的努力仅取得有限成效。我们将其差距归因于语言和视觉标记之间的根本差异。语言标记在语义上密集且具有显著的标记间变异性，而视觉标记表现出空间冗余和功能异质性，阻碍了视觉MoE中的专家专业化。为此，我们提出了ProMoE，这是一种具有两步路由器的MoE框架，该路由器带有显式路由指导，以促进专家专业化。具体而言，这种指导鼓励路由器通过根据功能角色进行条件路由将图像标记划分为条件集和非条件集，并通过基于语义内容的可学习原型进行原型路由来细化条件图像标记的分配。此外，基于原型路由在潜在空间中的相似性专家分配提供了一种自然机制，以纳入显式的语义指导，我们验证了这种指导对于视觉MoE至关重要。在此基础上，我们提出了一种路由对比损失，以显式增强原型路由过程，促进专家内部的一致性和专家之间的多样性。在ImageNet基准上的广泛实验表明，ProMoE在修正流和DDPM训练目标下均超越了最先进的方法。代码和模型将公开发布。

Summary / 总结

The paper addresses the challenge of applying Mixture-of-Experts (MoE) to Diffusion Transformers (DiTs) by introducing ProMoE, which uses a two-step router with explicit routing guidance to promote expert specialization. The method partitions image tokens into conditional and unconditional sets based on their functional roles and refines these assignments through prototypical routing. This approach enhances intra-expert coherence and inter-expert diversity, leading to superior performance on the ImageNet benchmark compared to existing methods under both Rectified Flow and DDPM training objectives.

论文引入了ProMoE框架，该框架使用带有显式路由指导的两步路由器来促进专家专业化。这种指导根据功能角色将图像令牌划分为条件性和非条件性集合，并通过基于语义内容的可学习原型进行细化路由。提出的路由对比损失进一步增强了这一过程，提高了专家内部的一致性和专家之间的多样性。在ImageNet基准测试上的实验表明，ProMoE在Rectified Flow和DDPM训练目标下均优于现有方法。
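
The two-step routing described in the ProMoE abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the top-k selection, and the choice to send unconditional tokens to one dedicated expert are all assumptions made for illustration, and the prototypes here are fixed vectors rather than learned parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def two_step_route(tokens, is_conditional, prototypes, top_k=1):
    """Hypothetical two-step router.

    Step 1 (conditional routing): split tokens by functional role.
    Unconditional tokens go to a dedicated expert whose id is
    len(prototypes) (an assumption of this sketch).
    Step 2 (prototypical routing): each conditional token is sent to the
    top_k experts whose prototype is most similar to the token.
    """
    uncond_expert = len(prototypes)
    assignments = []
    for tok, cond in zip(tokens, is_conditional):
        if not cond:
            assignments.append([uncond_expert])
            continue
        sims = [cosine(tok, p) for p in prototypes]
        ranked = sorted(range(len(prototypes)), key=lambda i: sims[i], reverse=True)
        assignments.append(ranked[:top_k])
    return assignments

# Two prototypes (experts 0 and 1); the third token is unconditional.
routes = two_step_route(
    tokens=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    is_conditional=[True, True, False],
    prototypes=[[1.0, 0.0], [0.0, 1.0]],
)
print(routes)
```

Because assignment is driven by similarity in a shared latent space, a contrastive objective on the same similarities is a natural way to sharpen expert specialization, which is the role the abstract gives to the routing contrastive loss.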

Physics-Informed Latent Neural Operator for Real-time Predictions of time-dependent parametric PDEs

Authors: Sharmila Karumuri, Lori Graham-Brady, Somdatta Goswami

First: 2025-01-14T20:38:30+00:00 · Latest: 2025-10-28T17:58:31+00:00

Abstract

Deep operator networks (DeepONets) have shown significant promise as surrogate models for systems governed by partial differential equations (PDEs), enabling accurate mappings between infinite-dimensional function spaces. However, when applied to systems with high-dimensional input-output mappings arising from large numbers of spatial and temporal collocation points, these models often require heavily overparameterized networks, leading to long training times. Latent DeepONet addresses some of these challenges by introducing a two-step approach: first learning a reduced latent space using a separate model, followed by operator learning within this latent space. While efficient, this method is inherently data-driven and lacks mechanisms for incorporating physical laws, limiting its robustness and generalizability in data-scarce settings. In this work, we propose PI-Latent-NO, a physics-informed latent neural operator framework that integrates governing physics directly into the learning process. Our architecture features two coupled DeepONets trained end-to-end: a Latent-DeepONet that learns a low-dimensional representation of the solution, and a Reconstruction-DeepONet that maps this latent representation back to the physical space. By embedding PDE constraints into the training via automatic differentiation, our method eliminates the need for labeled training data and ensures physics-consistent predictions. The proposed framework is both memory- and compute-efficient, exhibiting near-constant scaling with problem size and demonstrating significant speedups over traditional physics-informed operator models. We validate our approach on a range of parametric PDEs, showcasing its accuracy, scalability, and suitability for real-time prediction in complex physical systems.

中文标题/摘要

标题：基于物理信息的潜在神经算子模型用于实时预测时变参数化偏微分方程

深度算子网络（DeepONet）作为偏微分方程（PDEs）支配系统的代理模型，显示出显著的潜力，能够实现无限维函数空间之间的精确映射。然而，当应用于具有高维输入输出映射的系统时，这些模型通常需要高度过参数化的网络，导致较长的训练时间。潜在DeepONet通过引入两步方法部分解决了这些问题：首先使用单独的模型学习一个减少的潜在空间，然后在该潜在空间内进行算子学习。尽管这种方法高效，但它本质上是数据驱动的，缺乏整合物理定律的机制，限制了其在数据稀缺环境中的鲁棒性和泛化能力。在本文中，我们提出了一种基于物理信息的潜在神经算子框架（PI-Latent-NO），直接将支配物理定律整合到学习过程中。我们的架构包括两个端到端训练的耦合DeepONet：一个潜在DeepONet学习解的低维表示，一个重建DeepONet将这种潜在表示映射回物理空间。通过自动微分将PDE约束嵌入训练中，我们的方法消除了对标记训练数据的需求，并确保物理一致的预测。所提出的方法在内存和计算效率方面都表现出色，随着问题规模的增加几乎保持恒定的扩展性，并且在传统基于物理信息的算子模型上显示出显著的速度提升。我们在一系列参数化PDEs上验证了我们的方法，展示了其准确性、可扩展性和在复杂物理系统中进行实时预测的适用性。

Summary / 总结

The research aims to improve the efficiency and robustness of deep operator networks (DeepONets) for solving partial differential equations (PDEs) by integrating physical laws directly into the learning process. The proposed PI-Latent-NO framework uses two coupled DeepONets: a Latent-DeepONet for learning a low-dimensional representation and a Reconstruction-DeepONet for mapping this representation back to the physical space. This approach eliminates the need for labeled training data and ensures physics-consistent predictions. Results on a range of parametric PDEs show significant speedups over traditional physics-informed operator models together with accurate, scalable predictions, making the method suitable for real-time prediction in complex physical systems.

本文提出了一种名为PI-Latent-NO的物理感知潜神经算子框架，以解决高维输入输出映射系统的深度算子网络（DeepONet）训练难题。该方法通过两个耦合的DeepONet实现：一个潜DeepONet用于学习低维表示，另一个重建DeepONet将该表示映射回物理空间。通过自动微分嵌入PDE约束，该方法消除了对标注训练数据的需求，并确保物理一致性预测。实验结果展示了在各种参量PDE中的显著加速和准确性，适用于复杂物理系统的实时预测。

Retrieval-Augmented Generation-based Relation Extraction

Authors: Sefika Efeoglu, Adrian Paschke

First: 2024-04-20T14:42:43+00:00 · Latest: 2025-10-28T17:56:27+00:00

Comments: published at the Semantic Web journal. The last version is available: https://doi.org/10.1177/22104968251385519

Abstract

Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy relies heavily on access to labeled data and substantial computational resources. Large Language Models (LLMs) emerge as promising solutions to these challenges; however, they may return hallucinated responses due to their own training data. To overcome these limitations, this work proposes Retrieval-Augmented Generation-based Relation Extraction (RAG4RE), offering a pathway to enhance the performance of relation extraction tasks. We evaluate the effectiveness of the RAG4RE approach with different LLMs on established benchmarks: the TACRED, TACREV, Re-TACRED, and SemEval RE datasets. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results demonstrate that RAG4RE surpasses the performance of traditional RE approaches based solely on LLMs, most evidently on the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both the TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.

中文标题/摘要

标题：基于检索增强生成的关系提取

信息提取（IE）是一种将无结构文本数据转换为结构化格式的变革性过程，通过实体和关系提取（RE）方法实现。识别实体对之间的关系在这一框架中起着关键作用。尽管存在各种关系提取技术，但它们的有效性很大程度上依赖于标记数据和大量计算资源的访问。为应对这些挑战，大型语言模型（LLMs）成为有希望的解决方案；然而，它们可能会由于自身的训练数据而产生幻觉响应。为克服这些限制，本文提出了检索增强生成的关系提取（RAG4RE），提供了一种提高关系提取任务性能的途径。本文利用不同的LLMs评估了我们RAG4RE方法的有效性。通过使用TACRED、TACREV、Re-TACRED和SemEval RE等基准数据集，我们的目标是全面评估我们RAG4RE方法的有效性。特别地，我们在研究中利用了包括Flan T5、Llama2和Mistral在内的主要LLMs。我们的研究结果表明，我们的RAG4RE方法在TACRED数据集及其变体中超过了仅基于LLMs的传统RE方法。此外，与之前的RE方法相比，我们的方法在TACRED和TACREV数据集上表现出色，突显了其有效性和在自然语言处理中推进RE任务的潜力。

Summary / 总结

The paper proposes RAG4RE, a method for relation extraction that leverages Large Language Models (LLMs) augmented with retrieval techniques to improve performance. The study evaluates RAG4RE using LLMs like Flan T5, Llama2, and Mistral on benchmarks such as TACRED, TACREV, Re-TACRED, and SemEval RE datasets. The results show that RAG4RE outperforms traditional LLM-based relation extraction methods, especially on the TACRED dataset and its variations.

该研究提出了一种名为RAG4RE的关系抽取方法，利用大型语言模型（LLMs）结合检索技术来提升性能。研究使用Flan T5、Llama2和Mistral等LLM，在TACRED、TACREV、Re-TACRED和SemEval RE等基准数据集上评估了RAG4RE的效果。结果显示，RAG4RE在TACRED及其变体数据集上显著优于传统的基于LLM的关系抽取方法。
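
As a rough illustration of the retrieval-augmented prompting idea behind RAG4RE, the sketch below retrieves the most similar labeled sentence and places it in the prompt ahead of the query. Everything here is an assumption made for illustration: the abstract does not specify the retriever or prompt template, so token-overlap (Jaccard) similarity stands in for the retriever, and the helper names, prompt wording, and example data are hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences (stand-in retriever score)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def retrieve(query, examples):
    """Return the labeled example whose sentence is most similar to the query."""
    return max(examples, key=lambda ex: jaccard(query, ex["sentence"]))

def build_prompt(sentence, head, tail, examples):
    """Assemble a prompt that shows the retrieved example before the query."""
    ex = retrieve(sentence, examples)
    return (
        f"Example sentence: {ex['sentence']}\n"
        f"Example relation: {ex['relation']}\n"
        f"Sentence: {sentence}\n"
        f"What is the relation between '{head}' and '{tail}'?"
    )

# Illustrative labeled examples (not taken from any benchmark).
examples = [
    {"sentence": "Bill Gates founded Microsoft in 1975.", "relation": "org:founded_by"},
    {"sentence": "Paris is the capital of France.", "relation": "no_relation"},
]
prompt = build_prompt("Steve Jobs founded Apple in 1976.", "Apple", "Steve Jobs", examples)
print(prompt)
```

The resulting prompt would then be passed to an LLM such as those named in the abstract; that call is omitted here because it depends on the model and serving setup.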

ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

Authors: Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu

First: 2025-10-28T17:55:42+00:00 · Latest: 2025-10-28T17:55:42+00:00

Abstract

Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.

中文标题/摘要

标题：ComboBench：大型语言模型能否操控物理设备来玩虚拟现实游戏？

虚拟现实（VR）游戏要求玩家将高层次的语义动作转化为精确的设备操作，使用控制器和头戴式显示器（HMD）。尽管人类基于常识和身体理解能够直观地进行这种转化，但大型语言模型（LLMs）能否有效复制这种能力仍待探索。本文介绍了一个基准测试，ComboBench，评估LLMs将语义动作转化为跨262个场景的VR设备操作序列的能力，这些场景来自四款流行的VR游戏：《半条命： Alyx》、《Into the Radius》、《Moss: Book II》和《Vivecraft》。我们评估了七款LLMs，包括GPT-3.5、GPT-4、GPT-4o、Gemini-1.5-Pro、LLaMA-3-8B、Mixtral-8x7B和GLM-4-Flash，与标注的基准和人类表现进行比较。结果显示，尽管像Gemini-1.5-Pro这样的顶级模型展示了强大的任务分解能力，但在程序推理和空间理解方面仍不及人类。不同游戏之间的表现差异显著，表明对交互复杂性的敏感性。少量示例显著提高了性能，表明有可能针对性地增强LLMs的VR操作能力。所有材料已发布于https://sites.google.com/view/combobench。

Datasheets for Machine Learning Sensors

Authors: Matthew Stewart, Yuke Zhang, Pete Warden, Yasmine Omri, Shvetank Prakash, Jacob Huckelberry, Joao Henrique Santos, Shawn Hymel, Benjamin Yeager Brown, Jim MacArthur, Nat Jeffries, Emanuel Moss, Mona Sloane, Brian Plancher, Vijay Janapa Reddi

First: 2023-06-15T04:24:13+00:00 · Latest: 2025-10-28T17:53:16+00:00

Abstract

Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These "ML sensors" enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial settings to wildlife tracking for conservation efforts. As such, there is a need to provide transparency in the operation of such ML-enabled sensing systems through comprehensive documentation. This is needed to enable their reproducibility, to address new compliance and auditing regimes mandated in regulation and industry-specific policy, and to verify and validate the responsible nature of their operation. To address this gap, we introduce the datasheet for ML sensors framework. We provide a comprehensive template, collaboratively developed in academia-industry partnerships, that captures the distinct attributes of ML sensors, including hardware specifications, ML model and dataset characteristics, end-to-end performance metrics, and environmental impacts. Our framework addresses the continuous streaming nature of sensor data, real-time processing requirements, and embeds benchmarking methodologies that reflect real-world deployment conditions, ensuring practical viability. Aligned with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), our approach enhances the transparency and reusability of ML sensor documentation across academic, industrial, and regulatory domains. To show the application of our approach, we present two datasheets: the first for an open-source ML sensor designed in-house and the second for a commercial ML sensor developed by industry collaborators, both performing computer vision-based person detection.

中文标题/摘要

标题：机器学习传感器数据表

机器学习（ML）在嵌入式AI传感系统中变得越来越普遍。“ML传感器”能够实现上下文相关、实时的数据采集和决策，在从工业环境中的异常检测到用于保育工作的野生动物追踪等众多应用中发挥着作用。因此，有必要通过全面的文档提供这些ML驱动传感系统的透明度，以实现其可重复性，应对监管和行业特定政策中规定的新的合规性和审计要求，并验证和确认其操作的负责任性。为了解决这一缺口，我们提出了ML传感器数据表框架。我们提供了一个全面的模板，该模板在学术界-工业界合作伙伴关系中共同开发，涵盖了ML传感器的独特属性，包括硬件规格、ML模型和数据集特征、端到端性能指标以及环境影响。我们的框架解决了传感器数据的连续流特性、实时处理要求，并嵌入了反映实际部署条件的基准测试方法，确保其实用性。与FAIR原则（可发现性、可访问性、互操作性和可重用性）一致，我们的方法增强了ML传感器文档在学术界、工业界和监管领域的透明度和可重用性。为了展示我们方法的应用，我们展示了两个数据表：一个是内部设计的开源ML传感器，另一个是由工业合作者开发的商业ML传感器，两者都基于计算机视觉进行人员检测。

Summary / 总结

The paper introduces the datasheet for ML sensors framework to enhance transparency and reproducibility in ML-enabled sensing systems. It provides a comprehensive template capturing hardware, ML model, dataset, performance, and environmental impacts, addressing continuous streaming and real-time processing. The framework aligns with FAIR principles and includes two case studies: an in-house open-source sensor and a commercial sensor, both performing person detection using computer vision.

论文提出了ML传感器的数据表框架，以增强ML启用传感系统的透明度和可重复性。该框架提供了一个综合模板，涵盖了硬件、ML模型、数据集、性能和环境影响，解决了连续流式传输和实时处理的问题。展示了两个数据表：一个是内部开发的开源传感器，另一个是与行业合作开发的商业传感器，两者都侧重于基于计算机视觉的人体检测。

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Authors: Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

First: 2025-10-28T17:53:13+00:00 · Latest: 2025-10-28T17:53:13+00:00

Abstract

Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data and demonstrated an average performance gain of ~20% over the corresponding base models, delivering state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.

中文标题/摘要

标题：代理数据协议：统一多样有效的LLM代理微调数据集

大规模监督微调AI代理的公开研究结果相对罕见，因为代理训练数据的收集面临独特挑战。在本文中，我们提出观点认为瓶颈不在于缺乏底层数据源，而在于大量数据分散在不同的格式、工具和接口中。为此，我们引入了代理数据协议（ADP），这是一种轻量级的表示语言，作为不同格式代理数据集之间的“中间语言”，并统一了下游代理训练管道。ADP的设计足够表达各种任务，包括API/工具使用、浏览、编程、软件工程和一般代理工作流程，同时保持简单易解析和训练，无需在每个数据集级别进行工程设计。在实验中，我们将13个现有代理训练数据集统一为ADP格式，并将标准化的ADP数据转换为多个代理框架的训练就绪格式。我们进行了SFT，并展示了相对于对应基础模型约20%的性能提升，且在标准编程、浏览、工具使用和研究基准测试中达到或接近SOTA性能，无需特定领域调整。所有代码和数据均已公开发布，希望ADP能帮助降低标准化、可扩展和可重复代理训练的门槛。

Summary / 总结

This work addresses the challenge of unifying diverse datasets for fine-tuning AI agents through the introduction of the agent data protocol (ADP). ADP serves as a lightweight representation language that standardizes various agent datasets, facilitating their integration into unified training pipelines. Experiments showed an average performance gain of about 20% over base models and state-of-the-art or near-state-of-the-art performance on standard benchmarks without domain-specific tuning. The code and data are publicly available to promote standardized, scalable, and reproducible agent training.

本文通过引入代理数据协议（ADP）解决了将多样化的数据集统一起来用于AI代理微调的挑战。ADP作为一种轻量级的表示语言，标准化了各种代理数据集，使其能够集成到统一的训练管道中。实验结果显示，与基模型相比，平均性能提高了约20%，并在标准基准测试中达到了最先进的或接近最先进的性能，无需特定领域的调整。代码和数据已公开发布，以促进标准化、可扩展和可重复的代理训练。

Tongyi DeepResearch Technical Report

Authors: Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

First: 2025-10-28T17:53:02+00:00 · Latest: 2025-10-28T17:53:02+00:00

Comments: https://tongyi-agent.github.io/blog

Abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

中文标题/摘要

标题：通义DeepResearch技术报告

我们介绍了通义DeepResearch，这是一种代理型大型语言模型，特别设计用于长程、深入的信息搜索研究任务。为激励自主深度研究代理，通义DeepResearch通过结合代理中期训练和代理后训练的端到端训练框架进行开发，从而实现跨复杂任务的可扩展推理和信息搜索。我们设计了一个完全自动化的数据合成管道，无需依赖昂贵的人工注释，为所有训练阶段提供支持。通过为每个阶段构建定制化环境，我们的系统确保了稳定和一致的交互。通义DeepResearch拥有总计305亿个参数，每个token仅激活33亿个参数，在包括Humanity's Last Exam、BrowseComp、BrowseComp-ZH、WebWalkerQA、xbench-DeepSearch、FRAMES和xbench-DeepSearch-2510在内的一系列代理深度研究基准测试中实现了领先性能。我们开源了该模型、框架和完整解决方案，以赋能社区。

Summary / 总结

Tongyi DeepResearch is an agentic large language model designed for long-horizon, deep information-seeking research tasks. It uses an end-to-end training framework combining agentic mid-training and agentic post-training to enable scalable reasoning and information seeking. The model, with 30.5 billion total parameters and 3.3 billion activated per token, achieves state-of-the-art results on benchmarks such as Humanity's Last Exam and BrowseComp. The team open-sources the model, framework, and complete solutions to support the community.

Tongyi DeepResearch 是一种专门针对长程、深入研究任务的代理型大型语言模型。它通过结合中期训练和后训练的端到端训练框架，实现可扩展的推理和信息检索。该模型拥有305亿参数，每token激活33亿参数，在Humanity's Last Exam和BrowseComp等基准测试中表现出色。研究还开源了模型、框架和解决方案，以促进社区发展。

Greedy Sampling Is Provably Efficient for RLHF

Authors: Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

Venue: NeurIPS 2025

First: 2025-10-28T17:52:08+00:00 · Latest: 2025-10-28T17:52:08+00:00

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.

中文标题/摘要

标题：贪婪采样在RLHF中可证明高效

人类反馈强化学习（RLHF）已成为大型语言模型后训练的关键技术。尽管其在实验上取得了成功，但对RLHF的理论理解仍然有限，因为仅使用偏好反馈学习KL正则化目标带来了额外的挑战，不同于经典的RL。现有工作主要研究基于奖励的Bradley-Terry（BT）偏好模型，并扩展了利用乐观或悲观的经典设计。与此相反，本工作考虑了更一般的偏好模型（其实际相关性最近已被观察到），并获得了比现有结果大得多的性能保证。令人惊讶的是，这些结果是从直接使用经验估计（即贪婪采样）的算法中得出的，而不是像以前的工作那样构建乐观或悲观的估计。这一洞察根植于KL正则化目标下最优策略类的独特结构特性，并进一步将其专门化到BT模型，突显了贪婪采样在RLHF中的惊人充分性。

Summary / 总结

This work addresses the theoretical understanding of Reinforcement Learning from Human Feedback (RLHF), focusing on the general preference model. It analyzes algorithms that use greedy sampling, that is, the empirical estimates directly, and obtains performance guarantees with order-wise improvements over existing ones. The key insight is rooted in the unique structural property of the optimal policy class under the KL-regularized target, demonstrating the sufficiency of greedy sampling in RLHF even without optimistic or pessimistic estimates.

该研究关注强化学习从人类反馈（RLHF）的理论理解，重点是通用偏好模型。它引入了使用贪婪采样的算法，实现了显著优于现有方法的性能保证。关键洞察源于在KL正则化目标下最优策略类的独特结构特性，展示了即使不使用乐观或悲观估计，贪婪采样在RLHF中的充分性。
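
To make the "KL-regularized target" in the greedy sampling entry above concrete, the sketch below covers only the reward-based (Bradley-Terry style) special case over a finite set of responses, not the general preference model the paper studies. For the objective of maximizing expected reward minus `beta` times the KL divergence to a reference policy, the maximizer has the standard closed form pi(y) ∝ pi_ref(y) · exp(r(y) / beta). Greedy sampling, as the abstract uses the term, corresponds to plugging the empirical reward estimate into this formula with no optimism or pessimism bonus. The function name and the numbers are illustrative assumptions.

```python
import math

def kl_regularized_policy(ref_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_ref) over a finite set.

    ref_probs: reference policy probabilities (all must be > 0).
    rewards:   reward estimate for each response (e.g. an empirical estimate).
    beta:      KL regularization strength (> 0).
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

# Greedy use: feed the empirical reward estimate in directly, with no bonus term.
policy = kl_regularized_policy(ref_probs=[0.5, 0.5], rewards=[1.0, 0.0], beta=1.0)
print(policy)
```

With a uniform reference and a reward gap of 1 at `beta = 1`, the first response gets probability equal to the logistic function at 1, which shows how `beta` controls how far the policy moves from the reference.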

ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

Authors: Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang

First: 2025-10-28T17:51:50+00:00 · Latest: 2025-10-28T17:51:50+00:00

Abstract

Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10--30% reduction in exploratory token consumption.

中文标题/摘要

标题：ParallelMuse：自主并行思考在深度信息搜索中的应用

并行思考扩展了探索的广度，补充了信息搜索(IS)代理的深度探索，进一步增强了问题解决能力。然而，传统的并行思考在这个环境中面临两个关键挑战：从头开始反复展开的低效性，以及在答案生成过程中整合长程推理轨迹的困难，因为有限的上下文容量无法全面考虑推理过程。为了解决这些问题，我们提出了ParallelMuse，这是一种为深度IS代理设计的两阶段范式。第一阶段，功能指定的部分展开，将生成的序列划分为功能区域，并进行不确定性引导的路径重用和分支，以提高探索效率。第二阶段，压缩推理聚合，利用推理冗余无损地压缩与答案推导相关的信息，并综合生成一个连贯的最终答案。跨多个开源代理和基准的实验表明，性能最高可提升62%，同时探索性令牌消耗减少10-30%。

Summary / 总结

ParallelMuse addresses the inefficiency and integration challenges of conventional parallel thinking in deep information-seeking by proposing a two-stage approach. The first stage, Functionality-Specified Partial Rollout, partitions sequences and reuses paths to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, compresses reasoning to synthesize a coherent final answer. Experiments show up to 62% performance improvement with a 10-30% reduction in exploratory token consumption.

ParallelMuse通过提出两阶段方法来解决传统并行思考在深度信息搜索中的效率和整合问题。第一阶段，功能指定的部分展开，将序列分区并重用路径以提高探索效率。第二阶段，压缩推理聚合，压缩推理以合成一致的最终答案。实验显示，性能最高可提升62%，同时探索性令牌消耗减少10-30%。

AgentFold: Long-Horizon Web Agents with Proactive Context Management

Authors: Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang

First: 2025-10-28T17:51:50+00:00 · Latest: 2025-10-28T17:51:50+00:00

Comments: 26 pages, 9 figures

Abstract

LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.

中文标题/摘要

标题：AgentFold：具有主动上下文管理的长时网络代理

基于LLM的网络代理在信息检索方面展现出巨大的潜力，但它们在长时任务上的有效性受到上下文管理基本权衡的阻碍。现有的基于ReAct的代理因积累噪声的原始历史而面临上下文饱和的问题，而固定地在每一步总结完整历史的方法则面临不可逆地丢失关键细节的风险。为解决这些问题，我们引入了AgentFold，这是一种以主动上下文管理为中心的新代理范式，灵感来源于人类认知过程中的回顾性整合。AgentFold将其上下文视为一个动态的认知工作空间，需要积极地塑造，而不是被动地记录。在每一步，它学习执行一个“折叠”操作，以在多个尺度上管理其历史轨迹：它可以进行精细的浓缩以保留关键的细粒度细节，或进行深度整合以抽象掉整个多步子任务。在主要基准测试上的结果令人瞩目：仅通过简单的监督微调（无需持续预训练或RL），我们的AgentFold-30B-A3B代理在BrowseComp上达到了36.2%，在BrowseComp-ZH上达到了47.3%。值得注意的是，这一性能不仅超越或匹配了规模大得多的开源模型，如DeepSeek-V3.1-671B-A37B，还超越了领先的专有代理，如OpenAI的o4-mini。

Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang

First: 2025-10-28T17:50:40+00:00 · Latest: 2025-10-28T17:50:40+00:00

Abstract

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.

中文标题/摘要

标题：重新利用合成数据实现细粒度的搜索代理监督

基于LLM的搜索代理越来越多地通过以实体为中心的合成数据进行训练，以解决复杂的、知识密集型的任务。然而，现有的训练方法，如组相对策略优化（GRPO），会丢弃这些丰富的实体信息，转而依赖稀疏的结果导向奖励。这一关键限制使得它们无法区分那些具有实质性正确推理但最终答案有误的“近似正确”样本与完全失败，从而丢弃了有价值的学习信号。我们通过利用训练过程中丢弃的实体来解决这一问题。我们的实证分析表明，在代理推理过程中识别出的正确实体数量与最终答案的准确性之间存在强烈的正相关关系。基于这一洞察，我们提出了实体感知组相对策略优化（E-GRPO），这是一种新的框架，它定义了一个密集的实体感知奖励函数。E-GRPO根据样本与实体的匹配程度分配部分奖励，使模型能够有效地从这些“近似正确”样本中学习。在各种问答（QA）和深度研究基准上的实验表明，E-GRPO在所有情况下都显著优于GRPO基线。此外，我们的分析表明，E-GRPO不仅在准确性上表现更优，还诱导了更高效的推理策略，需要更少的工具调用，展示了更有效和样本高效的方法来对齐搜索代理。

Summary / 总结

The paper addresses a limitation of prevailing training methods for LLM-based search agents, such as GRPO: they discard rich entity information and rely on sparse, outcome-based rewards, so they cannot distinguish near-miss samples from complete failures. To address this, the authors introduce Entity-aware Group Relative Policy Optimization (E-GRPO), which assigns partial rewards to incorrect samples in proportion to their entity match rate. Experiments show that E-GRPO outperforms the GRPO baseline in accuracy and induces more efficient reasoning policies that require fewer tool calls.

论文针对Group Relative Policy Optimization (GRPO)在训练LLM搜索代理时丢弃丰富实体信息的局限性，提出了Entity-aware Group Relative Policy Optimization (E-GRPO)，该方法根据样本的实体匹配率给予部分奖励，使模型能够从近似正确但最终答案有误的样本中学习。实验表明，E-GRPO在问答和深度研究基准测试中表现出更高的准确性和更高效的推理策略，需要更少的工具调用。
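
The reward shaping described in this entry can be illustrated with a minimal sketch. The abstract only states that incorrect samples receive partial rewards proportional to their entity match rate; the weight `alpha`, the function names, and the toy entity lists below are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def entity_match_rate(found_entities, gold_entities):
    """Fraction of ground-truth entities that appear in the rollout's reasoning."""
    gold = set(gold_entities)
    if not gold:
        return 0.0
    return len(gold & set(found_entities)) / len(gold)


def entity_aware_reward(is_correct, found_entities, gold_entities, alpha=0.5):
    """Full reward for a correct answer, partial credit otherwise.

    alpha is a hypothetical weight (an assumption, not from the abstract);
    with alpha < 1 an incorrect sample always scores below a correct one.
    """
    if is_correct:
        return 1.0
    return alpha * entity_match_rate(found_entities, gold_entities)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rollout group for one question (entities are made up for the example).
gold = {"Marie Curie", "Sorbonne", "polonium", "1903"}
group = [
    (True, ["Marie Curie", "Sorbonne", "polonium", "1903"]),  # correct answer
    (False, ["Marie Curie", "Sorbonne", "polonium"]),         # near-miss
    (False, []),                                              # complete failure
]
rewards = [entity_aware_reward(ok, found, gold) for ok, found in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Under an outcome-only reward the second and third rollouts would both score 0 and be indistinguishable; here the near-miss keeps a higher advantage than the complete failure.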

MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection

Authors: Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, Jiaqi Ma

First: 2025-10-28T17:49:42+00:00 · Latest: 2025-10-28T17:49:42+00:00

Abstract

Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.

中文标题/摘要

标题：MIC-BEV：多基础设施相机鸟瞰图变换器及其关系感知融合在3D物体检测中的应用

基于基础设施的感知在智能交通系统中起着关键作用，提供全局态势感知并支持协同自主。然而，现有的基于相机的检测模型在这些场景中往往表现不佳，这主要是由于多视图基础设施设置、多样化的相机配置、视觉输入退化以及各种道路布局带来的挑战。我们提出了MIC-BEV，这是一种基于Transformer的鸟瞰图（BEV）感知框架，用于基础设施多相机3D物体检测。MIC-BEV灵活支持具有异构内在和外在参数的可变数量的相机，并在传感器退化条件下表现出强大的鲁棒性。MIC-BEV中提出的图增强融合模块通过利用相机与BEV单元之间的几何关系以及潜在的视觉线索，将多视图图像特征整合到BEV空间中。为了支持训练和评估，我们引入了M2I，这是一个用于基础设施物体检测的合成数据集，包含多样化的相机配置、道路布局和环境条件。在M2I和现实世界数据集RoScenes上的广泛实验表明，MIC-BEV在3D物体检测中达到了最先进的性能，并且在极端天气和传感器退化等具有挑战性的条件下仍然保持鲁棒性。这些结果突显了MIC-BEV在实际部署中的潜力。数据集和源代码可在：https://github.com/HandsomeYun/MIC-BEV获取。

Learning to Drive Safely with Hybrid Options

Authors: Bram De Cooman, Johan Suykens

First: 2025-10-28T17:40:04+00:00 · Latest: 2025-10-28T17:40:04+00:00

Abstract

Out of the many deep reinforcement learning approaches for autonomous driving, only few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in specific. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.

中文标题/摘要

标题：利用混合选项学习安全驾驶

在众多用于自主驾驶的深度强化学习方法中，只有少数采用了选项（或技能）框架。这令人惊讶，因为该框架天然适合于分层控制应用，尤其是自主驾驶任务。因此，在这项工作中，我们将选项框架应用于高速公路上的自主驾驶任务。具体而言，我们为纵向和横向操作定义了专门的选项，并嵌入了安全和舒适约束。这样，先前的领域知识可以被纳入学习过程，并且可以更容易地约束学习到的驾驶行为。我们提出了几种分层控制的选项设置，并根据最新的强化学习技术推导出实用算法。通过分别选择纵向和横向控制的动作，引入的混合选项策略获得了与人类驾驶员相同的表达能力和灵活性，同时比经典连续动作策略更容易解释。在所有研究的方法中，这些灵活的混合选项策略在各种交通条件下表现最佳，优于基线动作策略。

Summary / 总结

This paper applies the options framework to autonomous driving on highways. It defines dedicated options for longitudinal and lateral maneuvers with embedded safety and comfort constraints, integrating prior domain knowledge into the learning process. Policies over hybrid options, which select longitudinal and lateral actions separately, perform best under varying traffic conditions and outperform the baseline policies over actions. They match the expressiveness and flexibility of human drivers while being easier to interpret than classical policies over continuous actions.

本文探讨了在高速公路上应用选项框架进行自主驾驶的方法。它定义了针对纵向和横向操作的具体选项，并嵌入了安全和舒适约束，将领域知识整合到学习过程中。所提出的基于混合选项的分层控制策略在各种交通条件下优于传统的连续动作策略，展示了更好的表达性和可解释性，同时保持了人类驾驶员的灵活性。

ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources

Authors: Jason Wu, Yuyang Yuan, Kang Yang, Lance Kaplan, Mani Srivastava

Venue: NeurIPS 2025

First: 2025-02-11T17:19:44+00:00 · Latest: 2025-10-28T17:37:03+00:00

Comments: Accepted to NeurIPS 2025

Abstract

Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.

中文标题/摘要

标题：ADMN：一种面向动态输入噪声和计算资源的逐层自适应多模态网络

多模态深度学习系统由于多种传感模态提供的鲁棒性，在动态场景中得到部署。然而，它们在计算资源可用性（由于多租户、设备异构性等）和输入质量（来自传感器馈送污染、环境噪声等）波动方面存在困难。静态配置的多模态系统无法适应随时间变化的计算资源，而现有的动态网络则难以应对严格的计算预算。此外，这两种系统通常忽视了模态质量变化的影响。因此，遭受严重污染的模态可能会无谓地消耗本应分配给其他模态的资源。我们提出了ADMN，一种适应层的深度多模态网络，能够同时应对这两个挑战：它根据所有模态的计算资源约束调整激活层的总数，并根据模态质量不断重新分配输入模态的层。我们的评估展示了ADMN可以在减少高达75%的浮点运算的同时，达到最先进的网络的准确度。

Summary / 总结

The paper addresses the challenges of deploying multimodal deep learning systems in dynamic scenarios with varying compute resources and input quality. ADMN, a layer-wise adaptive network, adjusts the number of active layers across modalities to meet strict compute constraints and reallocates layers based on input quality. Experiments demonstrate that ADMN can achieve comparable accuracy to state-of-the-art networks while reducing up to 75% of floating-point operations.

论文提出了ADMN，一种适应动态计算资源和输入质量变化的多模态网络。它通过调整各模态的活跃层数量来满足严格的计算约束，并根据输入质量重新分配层。实验表明，ADMN可以在减少高达75%的浮点运算的同时，达到与最新网络相当的准确度。
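
To make the idea of a fixed layer budget shared across modalities concrete, here is a small sketch. It is not ADMN's mechanism: the abstract does not describe how layers are selected, so the greedy proportional rule, the function name `allocate_layers`, and the quality scores below are all assumptions chosen for illustration.

```python
def allocate_layers(quality, budget, max_layers):
    """Split a total layer budget across modalities in proportion to quality.

    quality:    dict mapping modality name -> non-negative quality score
    budget:     total number of active layers allowed by the compute constraint
    max_layers: depth of each modality's backbone (cap per modality)

    Greedy highest-quotient rule (an illustrative stand-in, not the paper's
    method): each layer goes to the modality with the largest
    quality / (layers_already_assigned + 1).
    """
    alloc = {m: 0 for m in quality}
    for _ in range(budget):
        candidates = [m for m in quality if alloc[m] < max_layers]
        if not candidates:
            break  # every backbone is already fully active
        best = max(candidates, key=lambda m: quality[m] / (alloc[m] + 1))
        alloc[best] += 1
    return alloc


# Hypothetical scenario: the depth stream is heavily corrupted.
clean = allocate_layers({"rgb": 0.5, "depth": 0.5}, budget=8, max_layers=12)
noisy = allocate_layers({"rgb": 0.75, "depth": 0.25}, budget=8, max_layers=12)
print(clean)
print(noisy)
```

The total stays at the budget in both cases; only the split changes, which mirrors the entry's point that a corrupted modality should not consume layers better spent elsewhere.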

Multi-Agent Scenario Generation in Roundabouts with a Transformer-enhanced Conditional Variational Autoencoder

Authors: Li Li, Tobias Brinkmann, Till Temmen, Markus Eisenbarth, Jakob Andert

First: 2025-10-28T17:36:52+00:00 · Latest: 2025-10-28T17:36:52+00:00

Abstract

With the increasing integration of intelligent driving functions into serial-produced vehicles, ensuring their functionality and robustness poses greater challenges. Compared to traditional road testing, scenario-based virtual testing offers significant advantages in terms of time and cost efficiency, reproducibility, and exploration of edge cases. We propose a Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) model for generating multi-agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts, yet remain relatively underexplored in current research. The results show that the proposed model can accurately reconstruct original scenarios and generate realistic, diverse synthetic scenarios. Besides, two Key-Performance-Indicators (KPIs) are employed to evaluate the interactive behavior in the generated scenarios. Analysis of the latent space reveals partial disentanglement, with several latent dimensions exhibiting distinct and interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles. The results demonstrate the model's capability to generate scenarios for the validation of intelligent driving functions involving multi-agent interactions, as well as to augment data for their development and iterative improvement.

中文标题/摘要

标题：基于Transformer增强的条件变分自编码器的环形交叉口多智能体场景生成

随着智能驾驶功能在量产车辆中的集成程度不断提高，确保其功能性和鲁棒性提出了更大的挑战。与传统的道路测试相比，基于场景的虚拟测试在时间、成本效率、可重复性和边缘情况探索方面具有显著优势。我们提出了一种Transformer增强的条件变分自编码器（CVAE-T）模型，用于生成环形交叉口的多智能体交通场景，这些场景具有高车辆动态和复杂布局的特点，但在当前研究中仍相对未被充分探索。结果表明，所提出的模型能够准确重构原始场景并生成真实、多样的合成场景。此外，使用了两个关键性能指标（KPIs）来评估生成场景中的交互行为。对潜在空间的分析揭示了部分解耦，多个潜在维度对场景属性（如车辆进入时间、退出时间和速度曲线）具有独特的和可解释的影响。结果表明，该模型能够生成涉及多智能体交互的智能驾驶功能验证场景，以及用于其开发和迭代改进的数据增强。

Summary / 总结

The research targets scenario-based virtual testing of intelligent driving functions, which is more time- and cost-efficient and more reproducible than traditional road testing. The study proposes a Transformer-enhanced Conditional Variational Autoencoder (CVAE-T) to generate multi-agent traffic scenarios in roundabouts, which are characterized by high vehicle dynamics and complex layouts yet remain underexplored. The model accurately reconstructs original scenarios and generates realistic, diverse synthetic ones, and two key performance indicators (KPIs) are used to evaluate interactive behavior in the generated scenarios. Latent space analysis reveals partial disentanglement: several latent dimensions have distinct, interpretable effects on scenario attributes such as vehicle entry timing, exit timing, and velocity profiles.

研究旨在通过基于场景的虚拟测试来提升智能驾驶功能的可靠性和鲁棒性，特别是在具有高车辆动态特性的环形交叉口。研究提出了一种Transformer增强的条件变分自编码器（CVAE-T），用于生成多代理交通场景。该模型成功地重建了原始场景，并生成了现实且多样的合成场景。使用关键性能指标（KPIs）评估生成场景中的交互行为，分析发现潜在空间中部分维度对场景属性如车辆进入和退出时间以及速度曲线有明确的影响。研究结果表明，该模型能够生成用于验证和开发涉及多代理交互的智能驾驶功能的场景，并增强相关数据。

Pearl: A Foundation Model for Placing Every Atom in the Right Location

Authors: Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Roy Tal Dew, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard IV, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan

First: 2025-10-28T17:36:51+00:00 · Latest: 2025-10-28T17:36:51+00:00

Abstract

Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers 3.6× improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.

中文标题/摘要

标题：珍珠：一种用于将每个原子放置在正确位置的基础模型

准确预测蛋白质-配体复合物的三维结构仍然是计算药物发现中的一个基本挑战，限制了治疗设计的进度和成功率。深度学习方法最近显示出作为结构预测工具的强大潜力，实现了跨多种生物分子系统的有希望的准确性。然而，它们的表现和实用性受到稀缺实验数据、低效架构、物理上无效的构象以及在推理时利用辅助信息能力有限的限制。为了解决这些问题，我们引入了珍珠（Placing Every Atom in the Right Location），一种大规模蛋白质-配体共折叠的基础模型。珍珠通过三个关键创新来应对这些挑战：(1) 包括大规模合成数据的训练方案，以克服数据稀缺；(2) 结合SO(3)-等变扩散模块的架构，以内在地尊重三维旋转对称性，提高泛化能力和样本效率；(3) 可控推理，包括支持蛋白质和非聚合物组件的通用多链模板系统以及双重无条件/有条件模式。珍珠在蛋白质-配体共折叠方面建立了新的最佳性能。在生成准确（RMSD < 2 Å）且物理上有效的构象这一关键指标上，珍珠在公共Runs N' Poses和PoseBusters基准测试中超越了AlphaFold 3和其他开源基线，相比次优模型分别提升了14.5%和14.2%。在口袋条件共折叠领域，珍珠在更严格的RMSD < 1 Å阈值下，在一组具有挑战性的实际药物靶点上实现了3.6倍的改进。最后，我们证明了模型性能与训练中使用的合成数据集大小直接相关。

Summary / 总结

Pearl is a foundation model for predicting the three-dimensional structures of protein-ligand complexes, addressing limitations of deep learning methods such as data scarcity and physically invalid poses. It introduces three key innovations: training with large-scale synthetic data, an SO(3)-equivariant diffusion module that respects 3D rotational symmetries, and controllable inference. On generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open-source baselines on the public Runs N' Poses and PoseBusters benchmarks, with 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, it delivers a 3.6-fold improvement on a proprietary set of challenging drug targets at the stricter RMSD < 1 Å threshold.

Pearl 是一种基础模型，旨在准确预测蛋白质-配体复合物的三维结构，解决数据稀缺和物理上无效构象等问题。它引入了三大创新：大规模合成数据训练、SO(3)-等变扩散模块以尊重三维旋转对称性，以及可控推理模式。Pearl 在公共基准测试 Runs N' Poses 和 PoseBusters 上超越了 AlphaFold 3 和其他开源基线，相比次优模型分别提高了 14.5% 和 14.2%。在口袋条件共折叠设置下，它在更严格的 RMSD < 1 Å 阈值下，于一组具有挑战性的真实药物靶点上实现了 3.6 倍的改进。
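
The RMSD thresholds quoted in this entry (2 Å and 1 Å) can be made concrete with a short sketch of how a pose success rate might be computed. It assumes predicted and reference ligand coordinates are already in the same frame with atoms in matching order; it does not handle symmetry-equivalent atoms or the physical-validity checks the benchmarks also apply. The function names and toy coordinates are illustrative.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation in Å between two equally ordered atom lists."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(pairs, threshold):
    """Share of (predicted, reference) poses with RMSD strictly below threshold."""
    hits = sum(1 for pred, ref in pairs if rmsd(pred, ref) < threshold)
    return hits / len(pairs)


# Toy three-atom ligand; the second pose is the reference shifted 1 Å along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
poses = [(reference, reference), (shifted, reference)]

print(rmsd(shifted, reference))
print(success_rate(poses, 2.0), success_rate(poses, 1.0))
```

The shifted pose passes the 2 Å criterion but fails the stricter 1 Å one, which is why results reported at the two thresholds are not directly comparable.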

MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Authors: Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

Venue: NeurIPS 2025

First: 2025-05-26T18:20:22+00:00 · Latest: 2025-10-28T17:35:54+00:00

Comments: Accepted to NeurIPS 2025 DB Track

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/

中文标题/摘要

标题：MMPerspective：MLLMs 是否理解视角？视角感知、推理与鲁棒性的全面基准

理解视角是人类视觉感知的基础，但多模态大型语言模型（MLLMs）如何内化视角几何学仍不清楚。我们引入了MMPerspective，这是首个专门设计用于系统评估MLLMs对视角理解的基准，通过10个精心设计的任务，涵盖三个互补维度：视角感知、推理和鲁棒性。基准包括2,711个真实世界和合成图像实例，以及5,083个问题-答案对，这些对探索关键能力，如消失点感知、计数、视角类型推理、三维空间中的线关系理解、视角保持变换的不变性等。通过对43个最先进的MLLMs进行全面评估，我们发现显著的局限性：尽管模型在表面感知任务上表现出色，但在组合推理和在扰动下保持空间一致性方面却遇到困难。进一步的分析揭示了模型架构、规模与视角能力之间的有趣模式，突显了鲁棒性瓶颈和链式思考提示的好处。MMPerspective为诊断和推进视觉语言系统的空间理解提供了有价值的测试平台。资源可访问：https://yunlong10.github.io/MMPerspective/

Summary / 总结

MMPerspective is a benchmark designed to evaluate MLLMs' understanding of perspective through 10 tasks covering perception, reasoning, and robustness. It includes 2,711 image instances and 5,083 question-answer pairs. The evaluation of 43 state-of-the-art MLLMs revealed that models perform well on surface-level tasks but struggle with compositional reasoning and maintaining spatial consistency. The study also highlights the importance of model architecture and scale in perspective capabilities.

MMPerspective 是一个基准，旨在从感知、推理和鲁棒性三个维度评估 MLLMs 对视角的理解。它包含 2,711 张图像实例和 5,083 个问题-答案对。对 43 个最先进的 MLLMs 的评估表明，尽管模型在基本的感知任务上表现良好，但在复杂推理和在扰动下保持空间一致性方面存在困难。该基准突显了在视觉-语言系统中提高空间理解的必要性。

InteractComp: Evaluating Search Agents With Ambiguous Queries

Authors: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo

First: 2025-10-28T17:35:54+00:00 · Latest: 2025-10-28T17:35:54+00:00

Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.

中文标题/摘要

标题：InteractComp：评估具有模糊查询的搜索代理

语言代理在网页搜索和信息检索方面展现了巨大的潜力。然而，这些搜索代理假设用户查询是完整且明确的，这一假设与现实不符，用户通常从不完整查询开始，需要通过互动来澄清。但大多数代理在搜索过程中缺乏互动机制，现有的基准测试也无法评估这一能力。为解决这一差距，我们引入了InteractComp，一个旨在评估搜索代理是否能够识别查询的模糊性并在搜索过程中主动互动以解决这一问题的基准测试。我们遵循易于验证、互动以澄清的原则，通过目标-干扰方法构建了涵盖9个领域的210个专家精选问题，这些问题包含只有通过互动才能消除的真实模糊性。对17个模型的评估揭示了显著的失败：最佳模型的准确率仅为13.73%，而在提供完整上下文时可达71.50%，这暴露出系统性的过度自信而非推理缺陷。强制互动带来了显著的收益，证明了当前策略未能激发的潜在能力。纵向分析显示，互动能力在15个月内停滞不前，而搜索性能提高了七倍，揭示了一个关键的盲点。这种停滞，加上搜索任务固有的即时反馈，使InteractComp成为评估和训练搜索代理互动能力的宝贵资源。代码可在https://github.com/FoundationAgents/InteractComp 获取。

Summary / 总结

InteractComp is a benchmark designed to evaluate whether search agents can recognize ambiguous queries and interact with users to resolve them. It consists of 210 expert-curated questions across 9 domains, built so that the ambiguity can only be resolved through interaction. Evaluation of 17 models showed that the best model reached only 13.73% accuracy, against 71.50% when given complete context, which points to systematic overconfidence rather than reasoning deficits. Forced interaction produced dramatic gains, indicating latent capabilities that current strategies fail to engage. Longitudinal analysis shows that interaction capabilities stagnated over 15 months while search performance improved seven-fold. The benchmark is valuable for both evaluating and training interaction capabilities in search agents.

InteractComp 是一个基准，用于评估搜索代理处理含糊查询并与其用户互动的能力。它包含来自9个领域的210个专家策划的问题，模型在没有互动的情况下识别和解决含糊性方面表现出显著失败，准确率仅为13.73%，而完整上下文下为71.50%。强制互动显著提高了性能，突显了模型的潜在能力。纵向分析表明，互动能力停滞不前，而搜索性能则提高了七倍，这表明需要更好地训练搜索代理的互动能力。

SAGE: Structure-Aware Generative Video Transitions between Diverse Clips

Authors: Mia Kan, Yilin Liu, Niloy Mitra

First: 2025-10-28T17:35:02+00:00 · Latest: 2025-10-28T17:35:02+00:00

Comments: Website: https://kan32501.github.io/sage.github.io/

Abstract

Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zeroshot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.

中文标题/摘要

标题：SAGE：多样片段之间的结构感知生成视频过渡

视频过渡旨在合成两个片段之间的中间帧，但简单的线性混合方法会引入伪影，限制了专业使用或破坏时间连贯性。传统技术（交叉淡入淡出、形态变化、帧插值）和最近的生成过渡方法可以生成高质量的合理中间帧，但它们在处理涉及较大时间间隔或显著语义差异的多样片段时存在困难，留下了内容感知和视觉连贯过渡的空白。我们通过借鉴艺术工作流程，提炼出如轮廓对齐和重要特征插值等策略，以保持结构和感知连续性来应对这一挑战。在此基础上，我们提出了SAGE（结构感知生成视频过渡）作为零样本方法，结合通过线图和运动流提供的结构指导与生成合成，使过渡平滑且语义一致，无需微调。广泛的实验和与当前替代方案（如[FILM, TVG, DiffMorpher, VACE, GI]）的比较表明，SAGE在定量指标和用户研究中均优于经典和生成基线，用于生成多样片段之间的过渡。代码将在接受后发布。

Summary / 总结

SAGE addresses the challenge of synthesizing smooth and semantically consistent video transitions between diverse clips by combining structural guidance with generative synthesis. It uses line maps and motion flow to align silhouettes and interpolate salient features, overcoming the limitations of naive and traditional methods. Experiments show that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies, particularly in handling large temporal gaps and significant semantic differences.

研究旨在通过解决传统技术和朴素方法的局限性，改进视频过渡效果，这些方法常常会产生伪影或在处理大时间间隔时失败。SAGE 是一种结构感知的生成方法，利用线图和运动流引导生成合成，从而实现平滑且语义一致的过渡。实验表明，SAGE 在生成不同片段之间的高质量过渡方面优于经典和生成基线方法。

OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs

Authors: Yifu Lu, Shengjie Liu, Li Dong

First: 2025-10-28T17:28:01+00:00 · Latest: 2025-10-28T17:28:01+00:00

Comments: 9 pages, 4 figures

Abstract

Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR training. Experiments show that the dataset presents a challenging but solvable benchmark, and the proposed reward is effective when combined with GRPO-style algorithms, highlighting the importance of leveraging topological structure and data complexity in multi-turn tool use.

中文标题/摘要

标题：OrchDAG：多轮交互中的复杂工具编排与计划DAG

随着代理工具调用的兴起，代理工具使用已引起关注，但现有大多数工作忽略了多轮工具交互的复杂性。我们引入了OrchDAG，这是一种合成数据生成管道，将工具执行建模为具有可控复杂性的有向无环图（DAG）。使用此数据集，我们评估了模型性能，并提出了一种基于图的奖励来增强RLVR训练。实验表明，该数据集提供了一个具有挑战性但可解决的基准，并且提出的奖励与GRPO风格的算法结合使用时是有效的，突显了在多轮工具使用中利用拓扑结构和数据复杂性的重要性。

Summary / 总结

The research motivation is to address the complexity of multi-turn tool interactions in agentic tool use, which is often overlooked. The main method is OrchDAG, a synthetic data generation pipeline in which tool executions are modeled as directed acyclic graphs (DAGs) with controllable complexity. Key experimental findings show that the resulting dataset provides a challenging but solvable benchmark and that the proposed graph-based reward, when combined with GRPO-style algorithms, effectively enhances RLVR training.

研究动机是解决代理式工具使用中多轮工具交互的复杂性，这在现有工作中经常被忽视。主要方法是构建名为OrchDAG的合成数据生成管道，将工具执行建模为具有可控复杂性的有向无环图（DAG）。关键实验发现表明，所生成的数据集构成了一个具有挑战性但可解决的基准，并且提出的基于图的奖励与GRPO风格算法结合使用时，可以有效增强RLVR训练。
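
As a rough illustration of treating a tool plan as a DAG, the sketch below derives a valid execution order with Python's standard `graphlib` module and scores a predicted plan against a reference by edge overlap. The abstract does not define the graph-based reward, so the edge-level F1 here is only a stand-in, and the tool names and plans are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical reference plan: each tool maps to the tools it depends on.
reference_plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare_prices": {"search_flights", "search_hotels"},
    "book_trip": {"compare_prices"},
}

# Hypothetical model output that misses one dependency.
predicted_plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare_prices": {"search_flights"},
    "book_trip": {"compare_prices"},
}


def execution_order(plan):
    """Return one order in which every tool runs after its dependencies."""
    return list(TopologicalSorter(plan).static_order())


def edges(plan):
    """Flatten a plan into (dependency, tool) edge pairs."""
    return {(dep, tool) for tool, deps in plan.items() for dep in deps}


def edge_f1(predicted, reference):
    """Edge-overlap F1 between two plans (illustrative stand-in reward)."""
    pred_edges, ref_edges = edges(predicted), edges(reference)
    if not pred_edges and not ref_edges:
        return 1.0
    overlap = len(pred_edges & ref_edges)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_edges)
    recall = overlap / len(ref_edges)
    return 2 * precision * recall / (precision + recall)


order = execution_order(reference_plan)
score = edge_f1(predicted_plan, reference_plan)
print(order)
print(score)
```

A structure-aware score of this kind gives partial credit to a plan that gets most dependencies right, which a binary success signal cannot do.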

Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Authors: Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim

Venue: EMNLP 2025

First: 2024-11-02T15:23:28+00:00 · Latest: 2025-10-28T17:26:20+00:00

Comments: 8 pages for main body, 19 pages in total

Abstract

As Large Language Models (LLMs) expand across domains, LLM judges have become essential for systems evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite which integrates tournament structure on top of head-to-head comparison. The application of a tournament structure and direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and allows higher reliability in system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. Those experiments collectively demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an easy-to-use web demonstration and code to foster adoption of Arena-Lite, streamlining model selection across research and industry communities. Arena-Lite demo and code are available at https://huggingface.co/spaces/NCSOFT/ArenaLite

中文标题/摘要

标题：Arena-Lite：基于锦标赛直接对比的高效可靠大型语言模型评估

随着大型语言模型（LLMs）在各个领域扩展，LLM评判者已成为系统评估的必要组成部分。当前的基准测试通常将系统输出与基线进行比较。尽管这种基于基线的方法方便，但其可靠性低于直接系统间比较。我们提出了Arena-Lite，它在一对一比较的基础上整合了锦标赛结构。应用锦标赛结构和直接比较消除了基线输出的需求，减少了所需的比较次数，并允许更高的系统排名可靠性。我们进行了两项实验：（1）受控随机建模和（2）使用真实LLM评判者的实证验证。这些实验共同证明，Arena-Lite即使在较小的数据集或较弱的评判者情况下，也能以较少的比较次数实现更高的可靠性。我们提供了一个易于使用的网络演示和代码，以促进Arena-Lite的采用，简化研究和工业社区中的模型选择。Arena-Lite的演示和代码可在 https://huggingface.co/spaces/NCSOFT/ArenaLite 获取

Summary / 总结

Arena-Lite is designed to evaluate Large Language Models (LLMs) more efficiently and reliably by using a tournament-based direct comparison method, which eliminates the need for baseline outputs and reduces the number of required comparisons. The method consistently achieves higher reliability even with smaller datasets or weaker judges, as demonstrated through controlled stochastic modeling and empirical validation with a real LLM judge. The experiments show that Arena-Lite can provide more reliable system rankings with fewer comparisons than baseline-mediated approaches.

Arena-Lite 通过使用基于锦标赛的直接对比方法来更高效和可靠地评估大型语言模型（LLMs），这种方法不需要基准数据并减少了所需比较的数量。实验表明，即使使用较小的数据集或较弱的评判者，Arena-Lite 也能通过更少的比较提供更准确的系统排名，这通过控制的随机建模和实际 LLM 评判者的验证得到了验证。
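
The comparison-count argument can be illustrated with a small sketch. The code below is a minimal single-elimination bracket, assuming a hypothetical `judge(a, b)` callable that returns the preferred system. It is not Arena-Lite's actual procedure (the abstract does not specify the bracket details); it only shows that head-to-head play needs no baseline outputs and that a bracket over n systems uses n - 1 comparisons.

```python
from typing import Callable, List, Tuple


def single_elimination(
    systems: List[str],
    judge: Callable[[str, str], str],
) -> Tuple[str, int]:
    """Run a single-elimination bracket; return (winner, number of comparisons)."""
    if not systems:
        raise ValueError("need at least one system")
    remaining = list(systems)
    comparisons = 0
    while len(remaining) > 1:
        next_round = []
        # Pair adjacent systems; an unpaired last system gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            next_round.append(judge(remaining[i], remaining[i + 1]))
            comparisons += 1
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return remaining[0], comparisons


# Hypothetical judge: prefers the system with the higher hidden quality score.
quality = {"A": 0.2, "B": 0.9, "C": 0.5, "D": 0.7, "E": 0.1}
winner, n_comparisons = single_elimination(
    list(quality), lambda a, b: a if quality[a] >= quality[b] else b
)
```

In practice the judge would be an LLM call on a pair of system outputs for the same prompt, and results would be aggregated over many prompts; both of those steps are left out here.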

Eye-Tracking, Mouse Tracking, Stimulus Tracking, and Decision-Making Datasets in Digital Pathology

Authors: Veronica Thai, Rui Li, Meng Ling, Shuning Jiang, Jeremy Wolfe, Raghu Machiraju, Yan Hu, Zaibo Li, Anil Parwani, Jian Chen

First: 2025-10-28T17:18:43+00:00 · Latest: 2025-10-28T17:18:43+00:00

Comments: 16 pages, 9 figures, submitted to Nature Scientific Data

Abstract

Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. Such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.

中文标题/摘要

标题：眼动追踪、鼠标追踪、刺激追踪和决策数据集在数字病理学中的应用

对千兆像素全切片图像（WSIs）的解释是病理学家的一项重要但困难的任务。他们的诊断准确率估计平均约为70%。增加第二名病理学家并不能显著提高决策一致性。该领域缺乏足够的行为数据来解释诊断错误和不一致性。为了填补这一空白，我们介绍了PathoGaze1.0，这是一个全面的行为数据集，捕捉了癌症诊断过程中整个诊断工作流程中的动态视觉搜索和决策过程。该数据集包括19名病理学家对397张WSIs进行解释时收集的18.69小时的眼动追踪、鼠标交互、刺激追踪、视窗导航和诊断决策数据（EMSVD）。数据收集过程通过一个基于应用的测试平台PTAH强调生态效度。总共记录了171,909个注视点、263,320次扫视和1,867,362次鼠标交互事件。此外，此类数据还可以用于提高病理学家和可能支持人类专家的AI系统的培训。所有实验均在https://osf.io/hj9a7预先注册，并且完整的数据集及分析代码可在https://go.osu.edu/pathogaze获取。

Summary / 总结

The research aims to improve the understanding of diagnostic errors and inconsistencies in digital pathology by providing a comprehensive behavioral dataset, PathoGaze1.0. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data collected from 19 pathologists interpreting 397 whole-slide images. The recordings contain 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events, which can help explain diagnostic processes and improve training for both pathologists and AI systems.

该论文介绍了PathoGaze1.0数据集，该数据集捕捉了病理学家在癌症诊断过程中视觉搜索和决策过程。数据集包括19名病理学家在397张WSI上进行18.69小时的眼动追踪、鼠标交互和刺激追踪数据，通过PTAH测试床强调生态有效性。该数据集记录了171,909个注视点、263,320次扫视和1,867,362次鼠标交互事件，可用于提高病理学家的培训和辅助人类专家的AI系统的训练。所有实验均已预先注册，数据集已公开发布。

Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning

Authors: Nitin Rai, Daeun Choi, Nathan S. Boyd, Arnold W. Schumann

First: 2025-10-28T17:16:47+00:00 · Latest: 2025-10-28T17:16:47+00:00

Comments: 26 pages, 8 figures, and 2 tables

Abstract

Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approximately 40 articles on FM applications for SSDM, focusing on large language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates and resources, or to submit papers, code, or datasets, visit https://github.com/nitin-dominic/AgriPathogenDatabase.

中文标题/摘要

标题：精准农业中特定地点病害和害虫管理的进步：从推理驱动的基础模型到适应性、反馈学习

作物的特定地点病害管理（SSDM）通过机器学习和深度学习（ML和DL）的实时计算机视觉技术得到了迅速发展。研究从手工特征提取演进到大规模自动化特征学习。借助基础模型（FMs），作物病害数据集现在以根本不同的方式处理。与传统神经网络不同，FMs整合了视觉和文本数据，解释文本中的症状，推理症状管理关系，并支持种植者和教育者的交互式问答。机器人领域的适应性和模仿学习进一步使田间病害管理成为可能。本文综述了约40篇关于FMs在SSDM应用的文章，重点关注大型语言模型（LLMs）和视觉语言模型（VLMs），讨论了它们在适应性学习（AL）、强化学习（RL）和数字孪生框架中的作用，用于精准喷洒。主要发现：（a）FMs在2023-24年因文献激增而受到关注；（b）VLMs超越了LLMs，出版物增加了5-10倍；（c）智能喷洒领域的RL和AL仍处于初级阶段；（d）带有RL的数字孪生可以虚拟模拟精准喷洒；（e）解决模拟与现实之间的差距对于实际部署至关重要；（f）人机协作仍然有限，尤其是在人类在环方法中，机器人检测早期症状，人类验证不确定的案例；（g）具有实时反馈的多模态FMs将推动下一代SSDM。欲获取更新、资源和贡献，请访问https://github.com/nitin-dominic/AgriPathogenDatabase，提交论文、代码或数据集。

Summary / 总结

This paper reviews the advancement of site-specific disease management (SSDM) in crops using machine and deep learning techniques. It highlights the transition from handcrafted feature extraction to large-scale automated feature learning with foundation models (FMs) that integrate visual and textual data. Key findings include the growing literature on FMs, vision-language models (VLMs) outpacing large language models (LLMs) in publication growth, the nascent state of reinforcement learning (RL) and adaptive learning (AL) for smart spraying, and the importance of addressing the sim-to-real gap for practical deployment. Multi-modal FMs with real-time feedback are expected to drive future advancements in SSDM.

本文探讨了利用机器和深度学习技术在作物中实现定点病害管理（SSDM）的进步。它强调了从手工特征提取到大规模自动特征学习的过渡，使用基础模型（FMs）整合视觉和文本数据。关键发现包括FMs文献的增长、VLMs在LLMs上的超越、RL和AL在智能喷洒中的起步阶段、以及解决模拟到现实差距的重要性。实时反馈的多模态FMs预计将推动SSDM的未来发展。

A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries

Authors: Xin Zhang, Yuqi Song, Fei Zuo

First: 2025-10-28T17:06:40+00:00 · Latest: 2025-10-28T17:06:40+00:00

Abstract

The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network's learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model's effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.

中文标题/摘要

标题：一种用于鲁棒检测AI生成的面部伪造的双分支CNN

生成AI的迅速发展使得创建高度逼真的伪造面部图像成为可能，这对AI安全、数字媒体完整性和公众信任构成了重大威胁。面部伪造技术，从面部替换和属性编辑到强大的扩散图像合成，正越来越多地被用于诸如误导信息、身份欺诈和诽谤等恶意目的。这一不断增长的挑战突显了迫切需要鲁棒且通用的面部伪造检测方法，作为AI安全基础设施的关键组成部分。在本文中，我们提出了一种新颖的双分支卷积神经网络用于面部伪造检测，该网络利用来自空间域和频域的互补线索。RGB分支捕获语义信息，而频域分支则专注于生成模型难以抑制的高频伪影。引入了通道注意力模块以自适应地融合这些异构特征，突出显示最有助于伪造鉴别信息的通道。为了引导网络的学习过程，我们设计了一种统一的损失函数FSC Loss，结合了焦点损失、监督对比损失和频域中心边距损失，以增强类别可分性和鲁棒性。我们在DiFF基准上评估了我们的模型，该基准包括来自四种代表性方法（文本到图像、图像到图像、面部替换和面部编辑）生成的伪造图像。我们的方法在所有类别中均表现出色，并优于平均水平的人类准确性。这些结果表明了该模型的有效性及其在保护AI生态系统免受视觉伪造攻击方面的潜在贡献。

Summary / 总结

This paper addresses the challenge of detecting AI-generated facial forgeries, which are becoming increasingly realistic and pose significant security risks. To tackle this, the authors propose a dual-branch CNN that combines spatial and frequency domain cues. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts. A channel attention module is used to fuse these features, and a unified loss function, FSC Loss, is designed to enhance class separability and robustness. The model outperforms average human accuracy on the DiFF benchmark, covering various forgery techniques, demonstrating its effectiveness in detecting AI-generated forgeries.

本文针对日益逼真的AI生成面部伪造带来的安全风险，提出了一种双分支CNN，该网络同时捕捉空间和频域特征，并通过通道注意力模块融合这些特征。引入了一种统一的损失函数FSC Loss，以提高类别可分性和鲁棒性。在DiFF基准测试上的实验表明，该方法在不同伪造类型上的性能优于人类准确度，表明其在防范视觉伪造攻击方面的有效性。
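
Of the three terms in FSC Loss, the focal loss has a standard, widely used form. The sketch below shows only that standard binary focal loss; the abstract does not give the paper's exact formulation, term weights, or the contrastive and frequency center margin terms, so the defaults `alpha=0.25` and `gamma=2.0` are assumptions taken from common practice rather than from the paper.

```python
import math


def binary_focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Standard binary focal loss for one sample.

    p is the predicted probability of the positive (forged) class, y is 0 or 1.
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)      # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)   # confident and correct: heavily down-weighted
hard = binary_focal_loss(0.30, 1)   # low probability on the true class: much larger loss
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified samples, so training concentrates on hard examples; with `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy.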

Causal Ordering for Structure Learning From Time Series

Authors: Pedro P. Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris

First: 2025-10-28T17:06:15+00:00 · Latest: 2025-10-28T17:06:15+00:00

Comments: 32 pages

Abstract

Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grows. A common way to simplify the task is to use so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as is standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), which uses diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d = 3$ to $6$ variables, $T = 200$ to $5{,}000$ samples), DOTS improves mean window-graph F1 from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d = 20$ to $36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph F1 while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.

中文标题/摘要

标题：时间序列结构学习中的因果排序

从时间序列数据中预测因果结构对于理解生理学、脑连接性、气候动力学和社会经济行为中的复杂现象至关重要。时间序列中的因果发现受到识别真实因果关系的组合复杂性的阻碍，尤其是在变量和时间点数量增加时。简化任务的一种常见方法是所谓的排序方法。传统排序方法固有限制了结果模型的表示能力。在本文中，我们通过利用多个有效的因果排序来解决这一问题，而不是像标准做法那样使用单一排序。我们提出了DOTS（扩散排序时间结构），使用基于扩散的因果发现方法处理时间数据。通过整合多个排序，DOTS 有效地恢复了潜在有向无环图的传递闭包，减轻了单一排序方法固有的虚假特征。我们基于标准假设（如平稳性和加性噪声模型）形式化了该问题，并利用扩散过程进行评分匹配以实现高效的海森矩阵估计。广泛的实验验证了该方法的有效性。在合成和真实世界数据集上的实证评估表明，DOTS 在性能上优于最先进的基线方法，提供了一种可扩展且稳健的时间因果发现方法。在合成基准测试中（d=3-6个变量，T=200-5,000个样本），DOTS 将窗口图 F1 值从 0.63（最佳基线）提高到 0.81。在 CausalTime 真实世界基准测试中（d=20-36），虽然基线方法在单个数据集上表现最好，但DOTS 在平均摘要图 F1 值上达到最高，同时将运行时间相对图优化方法减半。这些结果确立了DOTS 作为时间因果发现的可扩展且准确的解决方案。

Summary / 总结

This paper addresses the challenge of causal structure learning from time series data by proposing DOTS (Diffusion Ordered Temporal Structure), which integrates multiple valid causal orderings to mitigate the limitations of single-ordering methods. DOTS uses diffusion-based causal discovery and score matching with diffusion processes to efficiently estimate the Hessian. Experiments on both synthetic and real-world datasets show that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery, with significant improvements in F1 scores and reduced runtime compared to graph-optimisation methods.

该论文旨在解决时间序列数据中的因果结构学习问题，这对于理解复杂现象至关重要。作者提出了DOTS（Diffusion Ordered Temporal Structure），该方法通过整合多个有效的因果排序来提高模型的表示能力。实验结果表明，DOTS在合成数据集和真实世界数据集上均优于现有方法，提供了一种可扩展且稳健的时间因果发现方法。在合成基准上，DOTS将窗口图F1均值从0.63提高到0.81；在CausalTime基准上，DOTS实现了最高的平均摘要图F1值，并将运行时间减半，相比图优化方法。
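
The link between multiple orderings and the transitive closure can be seen on a toy graph. The sketch below is not DOTS itself (which obtains orderings via diffusion-based score matching and handles time lags); it brute-forces every valid ordering of a small, hypothetical DAG and keeps only the pairs that agree across all of them. For a DAG, the pairs on which every valid ordering agrees are exactly the ancestor-descendant pairs, that is, the transitive closure.

```python
from itertools import permutations
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def topological_orderings(nodes: List[str], edges: Set[Edge]) -> List[Tuple[str, ...]]:
    """Enumerate every ordering in which each parent precedes its child (brute force)."""
    valid = []
    for perm in permutations(nodes):
        position = {node: idx for idx, node in enumerate(perm)}
        if all(position[u] < position[v] for u, v in edges):
            valid.append(perm)
    return valid


def pairs_consistent_with(orderings: List[Tuple[str, ...]]) -> Set[Edge]:
    """Return pairs (u, v) where u precedes v in every given ordering."""
    nodes = orderings[0]
    result = set()
    for u in nodes:
        for v in nodes:
            if u != v and all(o.index(u) < o.index(v) for o in orderings):
                result.add((u, v))
    return result


# Toy DAG: a -> b -> c, plus an unrelated node d.
nodes = ["a", "b", "c", "d"]
edges = {("a", "b"), ("b", "c")}
orderings = topological_orderings(nodes, edges)

single = pairs_consistent_with(orderings[:1])   # one ordering: includes spurious pairs with d
combined = pairs_consistent_with(orderings)     # all orderings: only ancestor-descendant pairs
```

A single ordering admits every earlier-to-later pair as a candidate edge, including pairs involving the unrelated node `d`; intersecting over orderings removes those, which is the kind of spurious artifact the abstract attributes to single-ordering approaches.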

Symbolic Snapshot Ensembles

Authors: Mingyue Liu, Andrew Cropper

First: 2025-10-28T17:01:38+00:00 · Latest: 2025-10-28T17:01:38+00:00

Abstract

Inductive logic programming (ILP) is a form of logical machine learning. Most ILP algorithms learn a single hypothesis from a single training run. Ensemble methods train an ILP algorithm multiple times to learn multiple hypotheses. In this paper, we train an ILP algorithm only once and save intermediate hypotheses. We then combine the hypotheses using a minimum description length weighting scheme. Our experiments on multiple benchmarks, including game playing and visual reasoning, show that our approach improves predictive accuracy by 4% with less than 1% computational overhead.

中文标题/摘要

标题：符号快照集成

归纳逻辑编程（ILP）是一种逻辑形式的机器学习。大多数ILP算法从单次训练运行中学习一个假设。集成方法多次训练ILP算法以学习多个假设。在本文中，我们仅训练一次ILP算法并保存中间假设。然后，我们使用最小描述长度加权方案结合这些假设。我们的实验在多个基准上进行，包括游戏和视觉推理，结果显示我们的方法通过提高4%的预测准确性，同时计算开销不到1%。
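
A minimum description length weighting can be sketched as follows. The abstract does not state the paper's weighting formula, so this example assumes a common choice: each saved hypothesis gets a weight proportional to 2 raised to the negative of its description length in bits, and predictions are made by weighted vote. The snapshot hypotheses and their lengths are hypothetical.

```python
from typing import Callable, List, Sequence

Hypothesis = Callable[[int], bool]


def mdl_weights(description_lengths: Sequence[float]) -> List[float]:
    """Turn description lengths (in bits) into normalised weights; shorter is heavier."""
    shortest = min(description_lengths)
    # Subtract the minimum before exponentiating for numerical stability.
    raw = [2.0 ** -(length - shortest) for length in description_lengths]
    total = sum(raw)
    return [r / total for r in raw]


def ensemble_predict(hypotheses: Sequence[Hypothesis], weights: Sequence[float], x: int) -> bool:
    """Weighted majority vote over the saved hypotheses."""
    positive_mass = sum(w for h, w in zip(hypotheses, weights) if h(x))
    return positive_mass >= 0.5


# Three hypothetical snapshots saved during one training run.
snapshots = [lambda x: x > 3, lambda x: x > 5, lambda x: x % 2 == 0]
weights = mdl_weights([10.0, 11.0, 14.0])
prediction = ensemble_predict(snapshots, weights, 4)
```

Because the snapshots come from one run, the only extra cost in this sketch is storing them and evaluating the vote, which is in the spirit of the low overhead reported in the abstract.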

VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree

Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng

Venue: NeurIPS 2025 poster

First: 2025-10-26T14:36:15+00:00 · Latest: 2025-10-28T16:57:22+00:00

Comments: NeurIPS 2025 poster

Abstract

Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree, which utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, multi-dimensional priors are injected into visual language models (VLMs) to enhance node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.

中文标题/摘要

标题：VADTree：基于层次粒度感知树的无训练视频异常检测

视频异常检测（VAD）专注于在视频中识别异常。监督方法需要大量领域内训练数据，并且无法为异常提供清晰的解释。相比之下，无训练方法利用大型预训练模型的知识储备和语言互动性来检测异常。然而，当前固定长度的时间窗口采样方法难以准确捕捉具有不同时间跨度的异常。因此，我们提出了VADTree，利用层次粒度感知树（HGTree）结构进行灵活的VAD采样。VADTree利用预训练的通用事件边界检测（GEBD）模型嵌入的知识来表征潜在的异常事件边界。具体来说，VADTree基于边界置信度将视频分解为通用事件节点，并进行自适应粗细层次结构构建和冗余去除以构建HGTree。然后，将多维先验注入视觉语言模型（VLMs）以增强节点级别的异常感知，并通过大型语言模型（LLMs）实现通用事件节点的异常推理。最后，使用跨簇节点相关方法整合多粒度异常评分。在三个具有挑战性的数据集上的广泛实验表明，VADTree在无训练设置中实现了最先进的性能，同时大幅减少了采样的视频片段数量。代码将在https://github.com/wenlongli10/VADTree上提供。

Summary / 总结

VADTree proposes a Hierarchical Granularity-Aware Tree (HGTree) structure for flexible sampling in video anomaly detection (VAD), leveraging a pre-trained Generic Event Boundary Detection (GEBD) model to identify potential anomaly event boundaries. VADTree decomposes videos into generic event nodes and constructs an HGTree through adaptive coarse-fine hierarchical structuring and redundancy removal. This method integrates multi-dimensional priors into visual language models to enhance anomaly perception and uses large language models for anomaly reasoning. Experiments show that VADTree outperforms existing training-free methods while significantly reducing the number of sampled video segments.

VADTree 提出了一种层次粒度感知树（HGTree）结构，用于灵活的视频异常检测（VAD）采样，利用预训练的通用事件边界检测（GEBD）模型来识别潜在的异常事件边界。VADTree 将视频分解为通用事件节点，通过自适应粗细层次结构构建 HGTree，并整合多粒度的异常得分。实验表明，VADTree 在无训练设置中优于现有方法，同时减少了采样的视频片段数量。

Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation

Authors: Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi

First: 2025-10-28T16:48:03+00:00 · Latest: 2025-10-28T16:48:03+00:00

Comments: 12 Pages

Abstract

With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B. Despite using only 1.23M learnable parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.

中文标题/摘要

标题：基于前缀的适配在零样本跨语言迁移中的应用

随着新大型语言模型（LLMs）如Llama和Mistral的发布，由于它们的多语言预训练和强大的泛化能力，零样本跨语言迁移变得越来越可行。然而，将这些仅解码器的LLMs适应到新的跨语言任务仍然具有挑战性。虽然参数高效微调（PeFT）技术如低秩适应（LoRA）被广泛应用，但基于前缀的技术如软提示调优、前缀调优和Llama Adapter则较少被探索，尤其是在仅解码器模型的零样本迁移中。我们对三种基于前缀的方法在从英语到35多种高资源和低资源语言的零样本跨语言迁移进行了全面研究。进一步的分析还探讨了语言家族和书写系统之间的迁移，以及从1B到24B的模型规模扩展的影响。使用Llama 3.1 8B，前缀方法在Belebele基准测试中比LoRA基线高出6%。Mistral v0.3 7B也观察到了类似改进。尽管前缀调优仅使用了1.23M学习参数，我们在多种基准测试中仍实现了持续改进。这些发现突显了基于前缀技术作为LoRA的有效且可扩展替代方案的潜力，特别是在低资源多语言环境中。

Summary / 总结

This study investigates the effectiveness of prefix-based methods for zero-shot cross-lingual transfer from English to 35+ languages using Llama 3.1 8B and Mistral v0.3 7B models. The research compares these methods to LoRA baselines and finds that prefix methods outperform LoRA by up to 6% on the Belebele benchmark, demonstrating their potential as a scalable alternative in low-resource settings.

研究探讨了使用Llama 3.1 8B和Mistral v0.3 7B进行从英语到35多种语言的零样本跨语言迁移时，前缀基方法的有效性，并将其与LoRA基线进行比较。研究发现，前缀方法在Belebele基准测试中比LoRA基线高出6%，这表明前缀方法在低资源环境中具有潜在的有效性和可扩展性。

Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation

Authors: Yilin Xie, Shiqiang Zhang, Joel A. Paulson, Calvin Tsay

First: 2024-10-22T10:56:52+00:00 · Latest: 2025-10-28T16:44:42+00:00

Comments: 18 pages, 4 figures, 5 tables

Abstract

Bayesian optimization relies on iteratively constructing and optimizing an acquisition function. The latter turns out to be a challenging, non-convex optimization problem itself. Despite the relative importance of this step, most algorithms employ sampling- or gradient-based methods, which do not provably converge to global optima. This work investigates mixed-integer programming (MIP) as a paradigm for global acquisition function optimization. Specifically, our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) formulation introduces a piecewise-linear approximation for Gaussian process kernels and admits a corresponding MIQP representation for acquisition functions. The proposed method is applicable to uncertainty-based acquisition functions for any stationary or dot-product kernel. We analyze the theoretical regret bounds of the proposed approximation, and empirically demonstrate the framework on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.

中文标题/摘要

标题：高斯过程获取函数的分段线性核近似全局优化

贝叶斯优化依赖于迭代构建和优化获取函数。后者本身是一个具有挑战性的非凸优化问题。尽管这一步骤相对重要，大多数算法仍采用采样或梯度方法，这些方法不能保证收敛到全局最优。本文研究了混合整数规划（MIP）作为全局获取函数优化的范式。具体而言，我们提出的分段线性核混合整数二次规划（PK-MIQP）形式化引入了高斯过程核的分段线性近似，并为此类获取函数提供了相应的MIQP表示。所提出的方法适用于任何平稳或点积核的基于不确定性获取函数。我们分析了所提出近似的理论后悔界，并在合成函数、约束基准和超参数调整任务上进行了实证演示。

Summary / 总结

This paper addresses the challenge of optimizing acquisition functions in Bayesian optimization, which is itself a non-convex problem. It proposes a Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) method that uses a piecewise-linear approximation of Gaussian process kernels to formulate the acquisition function as a mixed-integer quadratic programming problem. The method is applicable to uncertainty-based acquisition functions with any stationary or dot-product kernel. The authors analyze theoretical regret bounds for the proposed approximation and demonstrate the framework empirically on synthetic functions, constrained benchmarks, and a hyperparameter tuning task.

本文解决了贝叶斯优化中获取函数优化的挑战，这是一个非凸问题。提出了一种Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP) 方法，通过使用高斯过程核的分段线性近似将获取函数形式化为混合整数二次规划问题。该方法适用于各种基于不确定性获取函数和平稳或点积核。理论分析表明，所提出的近似具有遗憾边界，而实验证实在合成函数、约束基准和超参数调整任务上的结果证明了其有效性。
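
The core approximation can be illustrated in isolation. The sketch below replaces a squared-exponential kernel, viewed as a function of distance r, with a linear interpolant between breakpoints. The evenly spaced breakpoints, the unit lengthscale, and the range [0, 4] are assumptions for illustration; the paper's breakpoint selection and its MIQP encoding of the segments are not shown.

```python
import math
from bisect import bisect_right
from typing import List


def se_kernel(r: float, lengthscale: float = 1.0) -> float:
    """Squared-exponential kernel as a function of distance r >= 0."""
    return math.exp(-(r * r) / (2.0 * lengthscale * lengthscale))


def make_breakpoints(r_max: float, n_segments: int) -> List[float]:
    """Evenly spaced breakpoints on [0, r_max] (an assumed, simple placement)."""
    return [r_max * i / n_segments for i in range(n_segments + 1)]


def piecewise_linear_kernel(r: float, breakpoints: List[float]) -> float:
    """Linearly interpolate the kernel between its values at the breakpoints."""
    if r >= breakpoints[-1]:
        return se_kernel(breakpoints[-1])
    i = max(bisect_right(breakpoints, r) - 1, 0)
    left, right = breakpoints[i], breakpoints[i + 1]
    t = (r - left) / (right - left)
    return (1.0 - t) * se_kernel(left) + t * se_kernel(right)


breakpoints = make_breakpoints(r_max=4.0, n_segments=16)
grid = [4.0 * i / 400 for i in range(401)]
max_error = max(abs(se_kernel(r) - piecewise_linear_kernel(r, breakpoints)) for r in grid)
```

Each linear segment can be selected with binary variables in a mixed-integer program, which is what makes a globally solvable formulation possible; more segments reduce the approximation error at the cost of more integer variables.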

DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices

Authors: Suman Kunwar

First: 2025-10-21T10:55:32+00:00 · Latest: 2025-10-28T16:44:35+00:00

Comments: 8 pages, 8 figures

Abstract

The rise of convenience packaging has led to the generation of enormous amounts of waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n), including our proposed YOLOv8n-CBAM model, using our annotated dataset designed for recycling. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~96%) but suffered from high latency (~0.22 s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 80% mAP) with ultra-fast inference (~0.03 s) and significantly smaller model sizes (<7 MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of "Greener AI" models to support real-time, sustainable waste sorting on edge devices.

中文标题/摘要

标题：DWaste：利用移动和边缘设备进行更环保的废物分类的计算机视觉平台

便利包装的兴起导致了大量废物的产生，使得高效的废物分类对于可持续废物管理至关重要。为了解决这一问题，我们开发了DWaste，一个基于计算机视觉的平台，旨在在资源受限的智能手机和边缘设备上进行实时废物分类，包括离线功能。我们使用了我们为回收设计的标注数据集，对各种图像分类模型（EfficientNetV2S/M、ResNet50/101、MobileNet）和目标检测模型（YOLOv8n、YOLOv11n，包括我们提出的YOLOv8n-CBAM模型）进行了基准测试。我们发现准确性和资源消耗之间存在明显的权衡：最佳分类器EfficientNetV2S实现了高准确率（约96%），但存在高延迟（约0.22秒）和较高的碳排放。相比之下，轻量级的目标检测模型在超快推理（约0.03秒）和显著较小的模型大小（<7MB）方面表现出色，使其成为实时、低功耗使用的理想选择。模型量化进一步提高了效率，大幅减少了模型大小和VRAM使用量，最多可减少75%。我们的工作展示了成功实施“更环保的AI”模型，以支持边缘设备上的实时、可持续废物分类。

Summary / 总结

DWaste is a computer vision platform for real-time waste sorting on resource-constrained devices. It benchmarks various models and finds that lightweight object detection models, such as YOLOv8n-CBAM, offer strong performance with ultra-fast inference and low resource consumption, making them suitable for real-time, low-power use. Model quantization further enhances efficiency, reducing model size and VRAM usage significantly.

DWaste 是一个用于资源受限设备的实时垃圾分类平台，使用了包括 EfficientNetV2S、ResNet、MobileNet 和 YOLOv8n-CBAM 等多种模型。研究发现，EfficientNetV2S 提供了高精度但伴随高延迟和碳排放，而轻量级模型如 YOLOv8n-CBAM 则提供了强大的性能、超快的推理速度和更小的模型大小，使其适合用于实时、低功耗使用。模型量化进一步提高了效率，通过减少模型大小和 VRAM 使用量最多可达 75%。这项工作展示了‘绿色 AI’模型在边缘设备上实现实时可持续垃圾分类的成功实施。
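
A reduction of up to 75% is what one would expect when 32-bit floating-point weights are stored as 8-bit integers (4 bytes down to 1 byte per weight). The abstract does not state which quantization scheme was used, so the sketch below is an assumption: a simple symmetric per-tensor int8 quantization on a few made-up weights, together with the storage arithmetic.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.90]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

bytes_fp32 = 4 * len(weights)   # 32-bit floats
bytes_int8 = 1 * len(weights)   # 8-bit integers (ignoring the single scale value)
reduction = 1.0 - bytes_int8 / bytes_fp32
```

The restored weights differ from the originals by at most half a quantization step, which is the accuracy cost traded for the smaller footprint.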

RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning

Authors: Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang

First: 2024-05-11T16:22:00+00:00 · Latest: 2025-10-28T16:43:19+00:00

Comments: Published in Pattern Recognition

Abstract

Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models, chosen for their source-code availability: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional approach of training these tokens with training data, we propose to learn these tokens with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be done efficiently in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute improvements of 5.1% to 32.4% in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.

中文标题/摘要

标题：RETTA：检索增强的测试时适应零样本视频字幕生成

尽管完全监督的视频字幕取得了显著进展，但零样本方法仍然很少被探索。在本文中，我们提出了一种新颖的零样本视频字幕框架，名为检索增强的测试时适应（RETTA），该框架利用现有的大规模预训练视觉和语言模型，在测试时直接生成字幕。具体而言，我们使用四个关键模型来连接视频和文本：通用视频-文本检索模型XCLIP、通用图像-文本匹配模型CLIP、文本对齐模型AnglE和文本生成模型GPT-2，因为这些模型具有开源代码。主要挑战是如何使文本生成模型充分了解给定视频的内容，以便生成相应的字幕。为了解决这个问题，我们提出使用可学习的标记作为这四个冻结模型GPT-2、XCLIP、CLIP和AnglE之间的通信媒介。不同于传统的用训练数据训练这些标记的方法，我们提出使用几个精心设计的损失函数下的推理数据的软目标来学习这些标记，从而使标记能够吸收适合GPT-2的视频信息。该过程可以在几次迭代中高效完成（我们在实验中使用了16次迭代），并且不需要真实数据。在MSR-VTT、MSVD和VATEX三个广泛使用的数据集上的大量实验结果表明，与几种最先进的零样本视频字幕方法相比，在主要指标CIDEr上绝对提高了5.1%-32.4%。

Summary / 总结

The paper proposes RETTA, a zero-shot video captioning framework that leverages pretrained models for test-time adaptation. It uses a retrieval model (XCLIP), an image-text matching model (CLIP), a text alignment model (AnglE), and a text generation model (GPT-2) to bridge video and text. The key innovation is the use of learnable tokens to enable GPT-2 to generate captions by absorbing video information, improving CIDEr scores by 5.1%-32.4% on MSR-VTT, MSVD, and VATEX datasets compared to state-of-the-art methods.

该论文提出了一种名为RETTA的零样本视频字幕框架，利用预训练模型进行测试时适应。它使用XCLIP、CLIP、AnglE和GPT-2四个模型来连接视频和文本，通过可学习的标记作为这些模型之间的通信媒介。这些标记通过从推理数据中学习软目标来学习，使GPT-2能够生成与视频内容更匹配的字幕。在MSR-VTT、MSVD和VATEX上的实验表明，与现有零样本方法相比，CIDEr得分提高了5.1%到32.4%。

Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT

Authors: Xu Jiang, Huiying Pan, Ligen Shi, Jianing Sun, Wenfeng Xu, Xing Zhao

First: 2025-10-28T16:13:14+00:00 · Latest: 2025-10-28T16:13:14+00:00

Comments: 8 pages, 6 figures

Abstract

Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain, the method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to the current methods in terms of quantitative metrics.

中文标题/摘要

标题：基于物理启发的高斯柯尔莫哥洛夫-阿诺尔德网络在锥束CT中X射线散射校正

锥束CT（CBCT）利用平板探测器实现高空间分辨率的三维成像。然而，在数据采集过程中，CBCT容易受到散射的影响，这会导致重建图像中的CT值偏差和组织对比度降低，最终降低诊断准确性。为了解决这一问题，我们提出了一种基于深度学习的散射伪影校正方法，该方法受到物理先验知识的启发。利用在投影域中观察到的点散射概率密度分布具有旋转对称性的事实，该方法使用高斯径向基函数（RBF）来建模点散射函数，并将其嵌入到柯尔莫哥洛夫-阿诺尔德网络（KAN）层中，该层提供了高效非线性映射能力，用于学习高维散射特征。通过结合散射光子分布的物理特性以及KAN的复杂函数映射能力，该模型提高了其准确表示散射的能力。通过合成和实际扫描实验验证了该方法的有效性。实验结果表明，该模型可以有效地校正重建图像中的散射伪影，并在定量指标方面优于当前方法。

Summary / 总结

This paper proposes a deep learning method for scatter artifact correction in cone-beam CT, inspired by physical principles. It uses Gaussian Radial Basis Functions to model point scatter and integrates this into Kolmogorov-Arnold Networks, which are capable of learning high-dimensional features. The method is validated through both synthetic and real-scan experiments, showing effective correction of scatter artifacts and superior performance compared to existing methods in quantitative metrics.

该论文提出了一种用于修正锥束CT图像中散射伪影的深度学习方法。该方法借鉴了物理先验知识，使用高斯径向基函数（RBF）来建模点散射函数，并将其嵌入到Kolmogorov-Arnold网络（KAN）层中，该层提供了高效的非线性映射能力。该模型通过合成和实际扫描实验得到了验证，显示出其在修正散射伪影方面的有效性，并在定量指标上优于现有方法。

Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models

Authors: Ehsan Zeraatkar, Salah A Faroughi, Jelena Tešić

First: 2025-02-18T01:52:41+00:00 · Latest: 2025-10-28T16:06:34+00:00

Abstract

Super-resolution (SR) is crucial for enhancing the spatial fidelity of Earth System Model (ESM) outputs, allowing fine-scale structures vital to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer-based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce two frequency-aware frameworks: the Vision Transformer-Tuned Sinusoidal Implicit Representation (ViSIR), combining Vision Transformers and sinusoidal activations to mitigate spectral bias, and the Vision Transformer Fourier Representation Network (ViFOR), which integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in PSNR and significantly higher SSIM. Detailed ablation and scaling studies highlight the benefit of full-field training, the impact of frequency hyperparameters, and the potential for generalization. The results establish ViFOR as a state-of-the-art, scalable solution for climate data downscaling. Future extensions will address temporal super-resolution, multimodal climate variables, automated parameter selection, and integration of physical conservation constraints to broaden scientific applicability.
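The abstract does not describe ViFOR's filters in detail, so the following is only a minimal sketch of what splitting a signal into separate low- and high-frequency parts with a Fourier transform can look like. The 1D signal, the naive DFT, the hard `cutoff`, and all function names are assumptions for illustration, not the paper's design.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a tiny example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_bands(signal, cutoff):
    """Split a real signal into low- and high-frequency components.

    Bins whose frequency index min(k, n - k) is at most `cutoff` go to the
    low band; everything else goes to the high band. The two bands sum back
    to the original signal.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    return idft(low_spec), idft(high_spec)


signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
low, high = split_bands(signal, cutoff=1)
print(low)
print(high)
```

In a frequency-aware model, each band could then be handled by its own branch. How ViFOR actually does this is not stated in the abstract.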

中文标题/摘要

标题：频率感知的视觉变换器用于地球系统模型的高保真超分辨率

超分辨率（SR）对于增强地球系统模型（ESM）输出的空间保真度至关重要，允许从粗略模拟中恢复对气候科学至关重要的细尺度结构。然而，传统的深度超分辨率方法，包括卷积和基于变换器的模型，往往会表现出频谱偏差，优先重建低频内容而非有价值的高频细节。在本文中，我们引入了两种频率感知框架：视觉变换器调谐正弦隐式表示（ViSIR），结合视觉变换器和正弦激活以减轻频谱偏差，以及视觉变换器傅里叶表示网络（ViFOR），该网络整合了显式的傅里叶基过滤器以实现独立的低频和高频学习。在E3SM-HR地球系统数据集上，这些模型在表面温度、短波和长波通量方面优于领先的CNN、GAN和vanilla变换器基线，ViFOR在峰值信噪比（PSNR）上表现出高达2.6 dB的改进，并且显著提高了结构相似性（SSIM）。详细的消融和扩展研究突显了全域训练的好处、频率超参数的影响以及泛化的潜力。结果确立了ViFOR作为最先进的、可扩展的气候数据降尺度解决方案。未来扩展将解决时间超分辨率、多模态气候变量、自动参数选择以及物理守恒约束的集成，以扩大科学应用范围。

Summary / 总结

This work addresses the challenge of enhancing the spatial fidelity of Earth System Model outputs through super-resolution. It introduces two frequency-aware frameworks, ViSIR and ViFOR, which combine Vision Transformers with sinusoidal activations and explicit Fourier-based filtering, respectively. These models outperform existing CNN, GAN, and vanilla transformer baselines, with ViFOR achieving up to 2.6 dB improvements in PSNR and higher SSIM. The study demonstrates the benefits of full-field training and the impact of frequency hyperparameters, establishing ViFOR as a state-of-the-art solution for climate data downscaling.
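To put the reported PSNR gain in perspective, the sketch below uses the standard definition PSNR = 10·log10(peak² / MSE) to convert a decibel gain into a mean-squared-error ratio. It assumes both models are scored against the same peak value. The helper names are hypothetical and no figures from the paper are reproduced beyond the 2.6 dB quoted above.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of `gain_db` (same peak)."""
    return 10.0 ** (gain_db / 10.0)


# A 2.6 dB gain corresponds to the baseline MSE being this many times larger.
print(round(mse_ratio_for_gain(2.6), 2))
```

Under that assumption, a 2.6 dB improvement means the baseline's mean squared error is roughly 1.8 times larger.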

本文旨在通过高保真超分辨率提高地球系统模型的精细尺度气候结构恢复。提出了两种频率感知框架ViSIR和ViFOR，分别通过结合Vision Transformers与正弦激活、以及显式的Fourier基过滤来缓解频谱偏差。实验结果表明，ViFOR在E3SM-HR数据集上优于CNN、GAN和vanilla Transformer基线，PSNR最高提高2.6 dB，SSIM得分更高。

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Authors: Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Shaohui Lin, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun

First: 2025-02-07T18:59:56+00:00 · Latest: 2025-10-28T16:02:48+00:00

Comments: https://github.com/VITA-MLLM/Long-VITA

Abstract

We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully open-source and reproducible. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
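The abstract does not spell out how the logits-masked language modeling head works. One plausible reading, assumed here, is that during prefill only the hidden states that actually need a next-token prediction (typically the last position) are projected to vocabulary logits, so the full sequence-by-vocabulary logit matrix is never materialized. The toy sketch below contrasts the two. Every name and shape in it is hypothetical.

```python
def lm_head(hidden, weight):
    """Project one hidden vector to vocabulary logits (weight is vocab x dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def full_logits(hidden_states, weight):
    """Baseline: logits for every position (sequence_length x vocab values)."""
    return [lm_head(h, weight) for h in hidden_states]


def masked_logits(hidden_states, weight, keep=1):
    """Assumed masking: project only the last `keep` positions."""
    return [lm_head(h, weight) for h in hidden_states[-keep:]]


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 positions, dim 2
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]         # vocab 3, dim 2
print(masked_logits(hidden_states, weight))
```

With a long prompt and a large vocabulary, the saving is proportional to the sequence length, since one row of logits replaces one row per position.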

中文标题/摘要

标题：Long-VITA：将大型多模态模型扩展至100万标记，同时保持短语境准确性

我们介绍了Long-VITA，一种简单而有效的大型多模态模型，用于长上下文视觉语言理解任务。它能够同时处理和分析4K帧或100万标记的图像、视频和文本，并在短上下文多模态任务中提供先进的性能。我们提出了一种有效的多模态训练方案，从大型语言模型开始，经过视觉语言对齐、一般知识学习，以及两次长序列微调阶段。我们还实现了上下文并行分布式推理和logits掩蔽语言建模头部，以在模型推理过程中将Long-VITA扩展到无限长的图像和文本输入。关于训练数据，Long-VITA基于来自公共数据集的1700万样本混合，并在各种多模态基准测试中展示了最先进的性能，与具有内部数据的最新模型相比。Long-VITA完全开源且可重现。通过利用我们的推理设计，Long-VITA模型在单个节点8个GPU的情况下实现了2倍的预填充加速和4倍的上下文长度扩展。我们希望Long-VITA能够作为竞争基准，为开源社区在推进长上下文多模态理解方面提供有价值的见解。

Summary / 总结

Long-VITA is a large multi-modal model designed for long-context visual-language understanding tasks. It uses a multi-modal training schema involving language model initialization, vision-language alignment, general knowledge learning, and long-sequence fine-tuning. During inference, it employs context-parallelism and logits-masked language modeling to handle long inputs. Long-VITA shows state-of-the-art performance on various benchmarks and achieves a 2x prefill speedup and 4x context length extension on a single node with 8 GPUs.

Long-VITA 是一种针对长上下文视觉语言理解任务的大规模多模态模型，能够处理多达 100 万的标记。它采用多阶段训练方案和上下文并行分布式推理，同时在短上下文和长上下文任务上都表现出高水平的准确性。Long-VITA 在各种基准测试中表现出最先进的性能，并在单个节点配备 8 块 GPU 的情况下实现了 2 倍的预填充加速和 4 倍的上下文长度扩展。

GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding

Authors: Miruna Oprescu, David K. Park, Xihaier Luo, Shinjae Yoo, Nathan Kallus

Venue: NeurIPS 2025

First: 2025-02-07T19:56:01+00:00 · Latest: 2025-10-28T16:01:40+00:00

Comments: 29 pages, 6 figures, 6 tables, NeurIPS 2025

Abstract

Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding -- where covariates are influenced by past treatments and, in turn, affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet), a theoretically grounded neural framework that combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences. GST-UNet explicitly adjusts for time-varying confounders and captures non-linear spatial and temporal dependencies, enabling valid causal inference from a single observed trajectory in data-scarce settings. We validate its effectiveness in synthetic experiments and in a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire. Together, these results position GST-UNet as a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
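For readers unfamiliar with iterative G-computation, here is a deliberately tiny tabular sketch for two treatment steps with one time-varying confounder. GST-UNet replaces the conditional means below with neural regressions over spatiotemporal data. The record layout `(A0, L1, A1, Y)`, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def g_computation(records, a0, a1):
    """Estimate E[Y under (a0, a1)] from records of (A0, L1, A1, Y).

    L1 is a confounder measured after A0 and before A1, so it is affected by
    past treatment and affects future treatment. Assumes every L1 value seen
    under A0 = a0 also occurs together with A1 = a1 (positivity).
    """
    # Step 1 (innermost regression): Q1(l1) = E[Y | A0=a0, L1=l1, A1=a1].
    outcomes = defaultdict(list)
    for r_a0, l1, r_a1, y in records:
        if r_a0 == a0 and r_a1 == a1:
            outcomes[l1].append(y)
    q1 = {l1: mean(ys) for l1, ys in outcomes.items()}

    # Step 2: average Q1 over the distribution of L1 given A0 = a0.
    l1_given_a0 = [l1 for r_a0, l1, _, _ in records if r_a0 == a0]
    return mean([q1[l1] for l1 in l1_given_a0])


records = [
    (1, 0, 1, 1.0), (1, 0, 1, 3.0),  # A0=1, L1=0, A1=1
    (1, 1, 1, 5.0),                  # A0=1, L1=1, A1=1
    (1, 1, 0, 0.0), (1, 1, 0, 0.0),  # A0=1, L1=1, A1=0
    (0, 0, 0, 9.0),                  # A0=0
]
print(g_computation(records, a0=1, a1=1))
```

In this toy data the naive average of Y among units with A0 = 1 and A1 = 1 is 3.0, whereas the G-computation estimate reweights by how often each L1 value occurs after A0 = 1 and is therefore higher.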

中文标题/摘要

标题：GST-UNet：一种用于时空因果推断的神经框架，包含时间变化混杂因素

从时空观测数据中估计因果效应在公共卫生、环境科学和政策评估中至关重要，因为随机实验往往不可行。现有方法要么依赖于强结构假设，要么无法处理诸如干扰、空间混杂因素、时间延续效应和时间变化混杂因素等关键挑战——即协变量受过去处理的影响，并反过来影响未来处理。我们提出了GST-UNet（G-计算时空UNet），这是一种理论基础扎实的神经框架，结合了基于U-Net的时空编码器和基于回归的迭代G-计算，以估计复杂干预序列下的特定位置潜在结果。GST-UNet明确调整了时间变化混杂因素，并捕捉了非线性空间和时间依赖性，使其能够在数据稀缺的情况下从单个观测轨迹中进行有效的因果推断。我们在合成实验和2018年加州坎普大火期间野火烟雾暴露与呼吸系统住院分析中验证了其有效性。这些结果共同将GST-UNet定位为一个原则性的且易于使用的时空因果推断框架，推动了政策相关和科学领域的可靠估计。

Summary / 总结

GST-UNet is a neural framework designed to estimate causal effects from spatiotemporal observational data, addressing challenges such as time-varying confounding. It combines a U-Net-based spatiotemporal encoder with iterative G-computation for valid causal inference. Experimental results show its effectiveness in synthetic and real-world wildfire smoke exposure studies, demonstrating its capability to handle complex intervention sequences and non-linear dependencies.

GST-UNet 是一种用于时空观测数据因果推断的神经框架，解决了干扰、时空混杂和时间变化混杂等挑战。它结合了基于 U-Net 的时空编码器和迭代 G-计算，以估计特定位置的潜在结果。该框架通过合成实验和 2018 年加州坎普大火期间野火烟雾暴露与呼吸系统住院分析的有效性验证，展示了其在处理复杂干预序列和提供可靠因果推断方面的优势，特别是在数据稀缺的环境中。

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

Authors: Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang

First: 2025-10-28T15:56:36+00:00 · Latest: 2025-10-28T15:56:36+00:00

Abstract

With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark's challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.

中文标题/摘要

标题：OSWorld-MCP：评估计算机使用代理工具调用能力的基准测试

随着决策和推理能力的进步，多模态代理在计算机应用场景中显示出强大的潜力。过去的评估主要评估了GUI交互技能，而由模型上下文协议（MCP）支持的工具调用能力则被很大程度上忽视了。将具有集成工具调用能力的代理与仅评估GUI交互能力的代理进行比较是不公平的。我们提出了OSWorld-MCP，这是首个全面且公平的基准测试，用于评估计算机使用代理在真实环境中的工具调用、GUI操作和决策能力。我们设计了一种新颖的自动化代码生成流水线来创建工具，并将它们与现有工具的精选集合结合。严格的手动验证产生了158个高质量工具（覆盖7种常见应用），每个工具都经过验证，确保其正确功能、实际适用性和多功能性。对OSWorld-MCP上的先进多模态代理进行广泛评估表明，MCP工具通常提高了任务成功率（例如，从OpenAI o3在15步中的8.3%提高到20.4%，从Claude 4 Sonnet在50步中的40.1%提高到43.3%），突显了评估工具调用能力的重要性。然而，即使是最强大的模型，其工具调用率也只有36.3%，表明仍有改进的空间，突显了基准测试的挑战性。通过明确测量MCP工具使用技能，OSWorld-MCP加深了对多模态代理的理解，并为在复杂、工具辅助环境中评估性能设立了新标准。我们的代码、环境和数据可在https://osworld-mcp.github.io/公开获取。

LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis

Authors: Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang

First: 2025-10-28T15:55:36+00:00 · Latest: 2025-10-28T15:55:36+00:00

Abstract

With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations. Many do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition. The latter remains unsatisfactory because the one-step fine-tuning model that serves as their basis performs weakly in practice, and because these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target-domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
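For context on what is being initialized, the sketch below shows a plain LoRA forward pass with the commonly used default initialization (small random A, zero B), under which the adapted layer starts out identical to the frozen one. This is the baseline that data-aware schemes aim to improve on, not LoRA-DA itself. The helper names, the 0.02 standard deviation, and the `alpha / rank` scaling convention are assumptions for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute (W + (alpha / r) * B A) x without forming the product B A.

    W is d_out x d_in (frozen), A is r x d_in, B is d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def default_lora_init(d_out, d_in, rank, seed=0):
    """Data-independent default: random A, zero B, so B A = 0 at the start."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B


W = [[1.0, 0.0], [0.0, 1.0]]
A, B = default_lora_init(d_out=2, d_in=2, rank=1)
print(lora_forward(W, A, B, [1.0, 2.0]))
```

A data-aware initializer would replace `default_lora_init` with values of A and B estimated from target-domain samples. The abstract describes that estimate only at the level of its bias and variance terms, so it is not reproduced here.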

中文标题/摘要

标题：LoRA-DA：基于渐近分析的数据感知低秩适应初始化

随着大语言模型（LLM）的广泛应用，LoRA 成为一种主导的参数高效微调（PEFT）方法，其初始化方法引起了越来越多的关注。然而，现有方法存在明显局限性：许多方法未结合目标域数据，而基于梯度的方法仅通过一阶梯度分解浅层次地利用数据，这由于其基础的一步微调模型的实证性能不佳，以及这些方法缺乏严格的理论基础或依赖于严格的各向同性假设而令人不满意。本文基于渐近分析建立了数据感知LoRA初始化的理论框架。从一个旨在最小化微调模型与目标模型参数差异的通用优化目标出发，我们推导出一个包含偏差项和方差项的优化问题：偏差项与微调模型和目标模型的参数距离相关，并通过Fisher梯度形式近似以保留各向异性；方差项通过Fisher信息考虑了采样随机性引入的不确定性。通过解决该问题，我们获得LoRA的最优初始化策略。基于此理论框架，我们开发了高效算法LoRA-DA，从少量目标域样本中估计优化问题中的项并获得最优LoRA初始化。在多个基准上的实验结果表明，LoRA-DA在最终准确度上始终优于现有初始化方法。额外的研究表明，LoRA-DA具有更快、更稳定的收敛性、在不同秩上的鲁棒性以及仅具有较小的初始化开销。源代码将在发表后发布。

Summary / 总结

This paper addresses the limitations of existing LoRA initialization methods by proposing a data-aware approach, LoRA-DA, based on asymptotic analysis. The method derives an optimization problem that includes a bias term approximated using a Fisher-gradient formulation and a variance term accounting for sampling uncertainty. Empirical results show that LoRA-DA improves final accuracy and offers faster, more stable convergence compared to existing methods, with minimal initialization overhead and robustness across ranks.

本文通过提出一种数据感知的方法LoRA-DA来解决现有LoRA初始化方法的局限性。它利用渐近分析推导出一个包含偏差项和方差项的优化问题，从而得到最优的初始化策略。实验结果表明，LoRA-DA在多个基准测试中提高了最终的准确率，并且具有更快、更稳定的收敛速度、跨秩的鲁棒性和较小的初始化开销，相比现有方法更具优势。

Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks

Authors: Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan

First: 2025-10-28T15:45:15+00:00 · Latest: 2025-10-28T15:45:15+00:00

Abstract

Despite the popularity of reinforcement learning (RL) in wireless networks, existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL) are data inefficient and short-sighted. Such RL-based solutions cannot generalize to novel network states since they capture only statistical patterns rather than the underlying physics and logic from wireless data. These limitations become particularly challenging in complex wireless networks with high dynamics and long-term planning requirements. To address these limitations, in this paper, a novel dual-mind world model-based learning framework is proposed with the goal of optimizing completeness-weighted age of information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component to learn the dynamics and logic of the wireless network, and to provide long-term link scheduling over reliable imagined trajectories. Link scheduling is learned through end-to-end differentiable imagined trajectories with logical consistency over an extended horizon rather than relying on wireless data obtained from environment interactions. Moreover, through imagination rollouts, the proposed world model can jointly reason about network states and plan link scheduling. During intervals without observations, the proposed method remains capable of making efficient decisions. Extensive experiments are conducted on a realistic simulator based on Sionna with real-world physical channels, ray tracing, and scene objects with material properties. Simulation results show that, compared to state-of-the-art RL baselines and a world model with only System 1, the proposed world model significantly improves data efficiency and achieves strong generalization and adaptation to unseen environments.
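The abstract does not define the completeness weighting in CAoI, so the sketch below covers only the plain age-of-information quantity it builds on: in each time slot the age grows by one unless an update is delivered, in which case it resets to the time elapsed since that update was generated. The discrete-slot model, the `deliveries` mapping, and the function name are assumptions for illustration.

```python
def age_of_information(deliveries, horizon, initial_age=0):
    """Per-slot age of information at a receiver.

    `deliveries` maps a slot index to the generation time of the update
    delivered in that slot (hypothetical input format).
    """
    ages = []
    age = initial_age
    for t in range(horizon):
        if t in deliveries:
            age = t - deliveries[t]  # reset to the delivered update's age
        else:
            age += 1                 # no fresh update, information gets older
        ages.append(age)
    return ages


# An update generated at slot 1 arrives at slot 2; another generated at
# slot 5 arrives immediately.
print(age_of_information({2: 1, 5: 5}, horizon=6))
```

A link scheduler that minimizes average age decides which links transmit in each slot so that these resets happen often enough where they matter.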

中文标题/摘要

标题：双心智世界模型：动态无线网络中学习的一般框架

尽管强化学习（RL）在无线网络中非常流行，但现有的依赖于无模型强化学习（MFRL）和基于模型的强化学习（MBRL）的方法数据效率低下且目光短浅。这些基于RL的解决方案无法将学到的知识推广到新的网络状态，因为它们只能捕捉统计模式而不能从无线数据中提取出潜在的物理和逻辑规律。这些限制在具有高动态性和长期规划需求的复杂无线网络中尤为突出。为了解决这些限制，本文提出了一种新颖的基于双心智世界模型的学习框架，旨在在挑战性的毫米波V2X场景中优化加权信息新鲜度（CAoI）。受认知心理学的启发，所提出的双心智世界模型包括一个模式驱动的系统1组件和一个逻辑驱动的系统2组件，用于学习无线网络的动力学和逻辑，并提供基于可靠想象轨迹的长期链路调度。链路调度是通过端到端可微想象轨迹学习的，这些轨迹在长时间范围内具有逻辑一致性，而不是依赖于从环境交互中获得的无线数据。此外，通过想象滚动，所提出的世界模型可以联合推理网络状态并规划链路调度。在无观测间隔期间，所提出的方法仍能做出高效的决策。基于Sionna的现实模拟器进行了广泛的实验，该模拟器基于实际物理信道、射线追踪和具有材料属性的场景对象。仿真结果表明，所提出的世界模型在数据效率方面取得了显著改进，并且在未见过的环境中具有强大的泛化能力和适应性，优于最先进的RL基线方法，以及仅包含系统1的世界模型方法。

Summary / 总结

The paper addresses the limitations of existing reinforcement learning methods in wireless networks, particularly their data inefficiency and inability to generalize to novel network states. It proposes a dual-mind world model framework that combines a pattern-driven System 1 and a logic-driven System 2 to learn the dynamics and logic of the wireless network. The method uses end-to-end differentiable imagined trajectories for long-term link scheduling and demonstrates significant improvements in data efficiency and generalization compared to state-of-the-art RL baselines and a world model with only System 1. Extensive experiments were conducted on a realistic simulator to validate these findings.

论文针对现有无线网络中的强化学习方法存在的数据效率低和难以泛化到新网络状态的问题，提出了一种结合模式驱动的System 1和逻辑驱动的System 2的双心智世界模型框架。该方法通过端到端可微的想象轨迹进行长期链路调度，并展示了与最先进的RL基线和仅包含System 1的世界模型相比，在数据效率和泛化能力方面的显著改进。在基于Sionna的真实模拟器上进行了大量实验以验证这些结果。