arXiv Paper Digest

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng

First: 2025-10-30T17:59:55+00:00 · Latest: 2025-10-30T17:59:55+00:00

Comments: Project Page: https://video-cof.github.io

Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io

Summary

This study investigates whether video generation models can serve as zero-shot reasoners by evaluating Veo-3 across 12 dimensions, including spatial, geometric, and temporal logic, and curates the evaluation data into the MME-CoF benchmark for Chain-of-Frame reasoning. The findings show that while the model exhibits promising reasoning in short-horizon spatial coherence and fine-grained grounding, it remains limited in long-horizon causal reasoning and abstract logic. It is therefore not yet reliable as a standalone zero-shot reasoner but could complement dedicated reasoning models.

OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes

Authors: Yukun Huang, Jiwen Yu, Yanning Zhou, Jianan Wang, Xintao Wang, Pengfei Wan, Xihui Liu

First: 2025-10-30T17:59:51+00:00 · Latest: 2025-10-30T17:59:51+00:00

Comments: Project page: https://yukun-huang.github.io/OmniX/

Abstract

There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.

Summary

OmniX advances panorama-based 2D lifting to generate graphics-ready 3D scenes suitable for PBR, relighting, and simulation. It uses a cross-modal adapter to repurpose 2D generative models for panoramic perception and generation, addressing the limitation that existing approaches focus solely on appearance. Experiments show OmniX's effectiveness in both panoramic perception and 3D scene generation, opening a path to immersive and physically realistic virtual worlds.

UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection

Authors: Jigang Fan, Quanlin Wu, Shengjie Luo, Liwei Wang

Venue: NeurIPS 2025 Spotlight

First: 2025-06-03T17:49:41+00:00 · Latest: 2025-10-30T17:59:46+00:00

Abstract

The detection of ligand binding sites for proteins is a fundamental step in Structure-Based Drug Design. Despite notable advances in recent years, existing methods, datasets, and evaluation metrics are confronted with several key challenges: (1) current datasets and methods are centered on individual protein-ligand complexes and neglect that diverse binding sites may exist across multiple complexes of the same protein, introducing significant statistical bias; (2) ligand binding site detection is typically modeled as a discontinuous workflow, employing binary segmentation and subsequent clustering algorithms; (3) traditional evaluation metrics do not adequately reflect the actual performance of different binding site prediction methods. To address these issues, we first introduce UniSite-DS, the first UniProt (Unique Protein)-centric ligand binding site dataset, which contains 4.81 times more multi-site data and 2.08 times more overall data compared to the previously most widely used datasets. We then propose UniSite, the first end-to-end ligand binding site detection framework supervised by set prediction loss with bijective matching. In addition, we introduce Average Precision based on Intersection over Union (IoU) as a more accurate evaluation metric for ligand binding site prediction. Extensive experiments on UniSite-DS and several representative benchmark datasets demonstrate that IoU-based Average Precision provides a more accurate reflection of prediction quality, and that UniSite outperforms current state-of-the-art methods in ligand binding site detection. The dataset and codes will be made publicly available at https://github.com/quanlin-wu/unisite.

Summary

The paper introduces UniSite-DS, the first UniProt-centric dataset for ligand binding sites, which contains more multi-site and overall data than previous datasets and addresses the statistical bias of data centered on individual complexes. It also presents UniSite, an end-to-end detection framework trained with a set prediction loss and bijective matching, and proposes IoU-based Average Precision as an evaluation metric that better reflects prediction quality. Experiments show that UniSite outperforms current state-of-the-art methods on UniSite-DS and benchmark datasets.
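
As a rough illustration of the two ideas named in this entry, one-to-one (bijective) matching between predicted and reference sites and IoU as the overlap measure, the sketch below represents each binding site as a set of residue indices. That representation, the function names `iou` and `match_sites`, and the brute-force search are assumptions made for the example, not details taken from UniSite; a practical implementation would more likely use an assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations


def iou(a: set, b: set) -> float:
    """Intersection over Union of two residue-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_sites(pred: list, truth: list):
    """Find the one-to-one pairing of predicted and true sites with maximal total IoU.

    Brute force over permutations, so only suitable for a handful of sites.
    Returns (pairs, total_iou), where pairs is a list of (pred_idx, truth_idx).
    """
    k = min(len(pred), len(truth))
    swap = len(pred) > len(truth)  # always permute over the longer list
    short, long_ = (truth, pred) if swap else (pred, truth)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(long_)), k):
        score = sum(iou(short[i], long_[j]) for i, j in enumerate(perm))
        if score > best_score:
            pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
            best_pairs, best_score = pairs, score
    return best_pairs, best_score


# Hypothetical example: two predicted sites and two reference sites.
pred = [{1, 2, 3, 4}, {10, 11, 12}]
truth = [{10, 11, 12, 13}, {1, 2, 3}]
pairs, total = match_sites(pred, truth)
print(pairs, total)
```

Under an IoU-based Average Precision metric, a matched pair would typically count as a true positive only when its IoU exceeds a chosen threshold; the threshold value is not given in the abstract.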

SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Authors: Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu

First: 2025-10-30T17:59:39+00:00 · Latest: 2025-10-30T17:59:39+00:00

Comments: 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/

Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.

Summary

SEE4D is a pose-free 4D generation method that uses an auto-regressive video inpainting model to synthesize 4D content from casual videos. Instead of relying on manually annotated camera poses, it renders to a bank of fixed virtual cameras, separating camera control from scene modeling. On cross-view video generation and sparse reconstruction benchmarks, the method achieves better generalization and performance than pose- or trajectory-conditioned baselines.
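
The overlapping-window extension mentioned in the abstract can be pictured as a sliding schedule over frame indices. The sketch below covers only that schedule; the window length, the overlap, and the function name `overlapping_windows` are illustrative assumptions, and the inpainting model itself is not represented.

```python
def overlapping_windows(num_frames: int, window: int, overlap: int) -> list:
    """Return (start, end) frame ranges, end exclusive, covering num_frames.

    Consecutive windows share `overlap` frames, so each step can condition on
    frames produced by the previous step.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return ranges


# Hypothetical sizes: 20 frames, 8-frame windows, 2 shared frames.
print(overlapping_windows(num_frames=20, window=8, overlap=2))
```

Because every step handles at most `window` frames, the per-step cost stays bounded regardless of the total video length, which is the property the abstract highlights.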

The Quest for Generalizable Motion Generation: Data, Model, and Evaluation

Authors: Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, Ziwei Liu

First: 2025-10-30T17:59:27+00:00 · Latest: 2025-10-30T17:59:27+00:00

Abstract

Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.

Summary

This paper addresses the limited generalization of 3D human motion generation models by drawing on insights from video generation. It introduces ViMoGen-228K, a large dataset combining high-fidelity motion capture data with semantically annotated motions from web videos and synthesized samples, and proposes ViMoGen, a diffusion transformer that integrates priors from motion capture and video generation, along with ViMoGen-light, a more efficient variant. It also presents MBench, a hierarchical benchmark for evaluating motion quality, prompt fidelity, and generalization. Experiments show that the framework outperforms existing methods in both automatic and human evaluations.

Defeating the Training-Inference Mismatch via FP16

Authors: Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

First: 2025-10-30T17:58:11+00:00 · Latest: 2025-10-30T17:58:11+00:00

Abstract

Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.

Summary

This paper addresses the instability of reinforcement learning fine-tuning of large language models caused by the numerical mismatch between training and inference policies. It traces the root cause to BF16 floating point precision, which introduces significant rounding errors. The authors show that switching to FP16 resolves the mismatch with minimal code changes, yielding more stable optimization, faster convergence, and better performance across various tasks and frameworks.
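
The rounding-error claim follows from the bit layouts: BF16 keeps 8 exponent bits and 7 mantissa bits, while FP16 keeps 5 exponent bits and 10 mantissa bits. The sketch below, which uses only the Python standard library, rounds a few probability-like values to each format and compares relative errors. It is an illustration of the formats, not a reproduction of the paper's experiments, and the helper names are assumptions made for the example.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round to IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round to bfloat16 (8 exponent bits, 7 mantissa bits), nearest-even.

    bfloat16 is the top 16 bits of a float32, so we round the float32 bit
    pattern and zero the low 16 bits. NaN and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbour
    bits = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float, rounder) -> float:
    """Relative rounding error of x under the given rounding function."""
    return abs(rounder(x) - x) / abs(x)


for p in (0.1, 0.3, 0.7):
    print(p, relative_error(p, round_to_fp16), relative_error(p, round_to_bf16))
```

With three extra mantissa bits, FP16 has a worst-case relative rounding error roughly eight times smaller than BF16 for values inside its normal range; the cost is a much narrower dynamic range.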

Remote Labor Index: Measuring AI Automation of Remote Work

Authors: Mantas Mazeika, Alice Gatti, Cristina Menghini, Udari Madhushani Sehwag, Shivam Singhal, Yury Orlovskiy, Steven Basart, Manasi Sharma, Denis Peskoff, Elaine Lau, Jaehyuk Lim, Lachlan Carroll, Alice Blair, Vinaya Sivakumar, Sumana Basu, Brad Kenstler, Yuntao Ma, Julian Michael, Xiaoke Li, Oliver Ingebretsen, Aditya Mehta, Jean Mottola, John Teichmann, Kevin Yu, Zaina Shaik, Adam Khoja, Richard Ren, Jason Hausenloy, Long Phan, Ye Htet, Ankit Aich, Tahseen Rabbani, Vivswan Shah, Andriy Novykov, Felix Binder, Kirill Chugunov, Luis Ramirez, Matias Geralnik, Hernán Mesura, Dean Lee, Ed-Yeremai Hernandez Cardona, Annette Diamond, Summer Yue, Alexandr Wang, Bing Liu, Ernesto Hernandez, Dan Hendrycks

Venue: www

First: 2025-10-30T17:58:04+00:00 · Latest: 2025-10-30T17:58:04+00:00

Comments: Website: https://www.remotelabor.ai

Abstract

AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.

HEIR: Learning Graph-Based Motion Hierarchies

Authors: Cheng Zheng, William Koch, Baiang Li, Felix Heide

Venue: NeurIPS 2025

First: 2025-10-30T17:57:40+00:00 · Latest: 2025-10-30T17:57:40+00:00

Comments: Code link: https://github.com/princeton-computational-imaging/HEIR

Abstract

Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/

Summary

The research develops a method for learning hierarchical motion structures directly from data, addressing the limited generalizability of manually defined or heuristic hierarchies. The method uses graph neural networks to infer parent-child relationships among motion elements and decomposes global motions into parent-inherited patterns and local residuals. Evaluated on 1D, 2D, and 3D motion scenarios, it reconstructs the intrinsic motion hierarchy in the 1D and 2D cases and produces more realistic and interpretable deformations in dynamic 3D scenes than a baseline method.
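
The decomposition into parent-inherited motion and a local residual can be made concrete for the simplest case, 1D translation, where inheritance is plain addition. The sketch below assumes the hierarchy is already known and stored as a parent-index list; learning that hierarchy is the paper's actual contribution and is not shown. The function names and the additive composition rule are assumptions made for the example.

```python
def absolute_motions(parents: list, residuals: list) -> list:
    """Compose each node's absolute motion: parent's absolute motion + local residual.

    parents[i] is the parent index of node i, or -1 for a root.
    The hierarchy is assumed to be acyclic.
    """
    absolute = [None] * len(parents)

    def resolve(i: int) -> float:
        if absolute[i] is None:
            inherited = 0.0 if parents[i] < 0 else resolve(parents[i])
            absolute[i] = inherited + residuals[i]
        return absolute[i]

    return [resolve(i) for i in range(len(parents))]


def local_residuals(parents: list, absolute: list) -> list:
    """Inverse step: subtract the parent's absolute motion from each node's."""
    return [a if p < 0 else a - absolute[p] for p, a in zip(parents, absolute)]


# Hypothetical chain: node 0 is the root, node 1 follows node 0, node 2 follows node 1.
parents = [-1, 0, 1]
residuals = [1.0, 0.5, -0.25]
absolute = absolute_motions(parents, residuals)
print(absolute)
print(local_residuals(parents, absolute))
```

A correct hierarchy tends to leave small, simple residuals, which is one intuition for why the structure can be inferred from observed absolute motions.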

Clone Deterministic 3D Worlds with Geometrically-Regularized World Models

Authors: Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

First: 2025-10-30T17:56:43+00:00 · Latest: 2025-10-30T17:56:43+00:00

Abstract

A world model is an internal model that simulates how the world evolves. Given past observations and actions, it predicts the future of both the embodied agent and its environment. Accurate world models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We argue that a central cause is representation quality: exteroceptive inputs (e.g., images) are high-dimensional, and lossy or entangled latents make dynamics learning unnecessarily hard. We therefore ask whether improving representation learning alone can substantially improve world-model performance. In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone and overfit to a deterministic 3D world. We propose Geometrically-Regularized World Models (GRWM), which enforces that consecutive points along a natural sensory trajectory remain close in latent representation space. This approach yields significantly improved latent representations that align closely with the true topology of the environment. GRWM is plug-and-play, requires only minimal architectural modification, scales with trajectory length, and is compatible with diverse latent generative backbones. Across deterministic 3D settings and long-horizon prediction tasks, GRWM significantly increases rollout fidelity and stability. Analyses show that its benefits stem from learning a latent manifold with superior geometric structure. These findings support a clear takeaway: improving representation learning is a direct and useful path to robust world models, delivering reliable long-horizon predictions without enlarging the dynamics module.

Summary

This paper addresses the challenge of building accurate world models that predict the future state of a 3D environment. The authors propose Geometrically-Regularized World Models (GRWM), which improve representation learning by ensuring that consecutive points along sensory trajectories remain close in latent space. The method significantly improves latent representations, rollout fidelity, and long-horizon stability in deterministic 3D settings, supporting the view that better representation learning is a key route to robust world models.
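
The abstract states the constraint (consecutive points on a trajectory stay close in latent space) but not the exact loss. One simple way to express such a constraint is a penalty on the squared distance between neighbouring latent vectors, sketched below with plain Python lists. The function name and the choice of mean squared Euclidean distance are assumptions, not GRWM's actual formulation.

```python
def temporal_closeness_penalty(latents: list) -> float:
    """Mean squared Euclidean distance between consecutive latent vectors."""
    if len(latents) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(latents, latents[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(latents) - 1)


# Hypothetical 2D latent trajectories of three steps each.
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]]
jumpy = [[0.0, 0.0], [2.0, -1.0], [0.1, 0.3]]
print(temporal_closeness_penalty(smooth))
print(temporal_closeness_penalty(jumpy))
```

Used alone, a penalty of this kind is minimized by mapping every observation to the same point, so it would have to be combined with another objective, such as reconstruction, to keep the latents informative.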

ChartAB: A Benchmark for Chart Grounding & Dense Alignment

Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou

First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00

Abstract

Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.

Summary

The paper introduces ChartAB, a benchmark for evaluating vision-language models on chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing chart attributes. It uses a JSON template to compute task-specific evaluation metrics and a two-stage inference workflow to assess models' ability to align and compare elements across charts. Evaluations of recent VLMs reveal perception biases, weaknesses, and hallucinations, pointing to specific skills that current models need to strengthen.

Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance

Authors: Valentyna Starodub, Mantas Lukoševičius

First: 2025-10-30T17:55:46+00:00 · Latest: 2025-10-30T17:55:46+00:00

Abstract

Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge, the most comprehensive research competition and open dataset to date for AMD detection from RGB fundus images, serve as a benchmark for our evaluation. Taking the U-Net connectivity as the basis of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.

SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models

Authors: Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas

First: 2025-10-30T17:52:39+00:00 · Latest: 2025-10-30T17:52:39+00:00

Abstract

This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.

中文标题/摘要

标题：SteerVLM：通过轻量级激活转向实现视觉语言模型稳健的模型控制

本工作介绍了SteerVLM，这是一种轻量级的转向模块，旨在引导视觉语言模型(VLMs)生成更符合所需指令的输出。我们的方法通过学习配对提示的潜在嵌入，编码目标和相反行为，动态调整语言模态与图像上下文之间的激活连接。这允许在不修改模型权重的情况下，在推理时对复杂的输出语义进行精细控制，同时保持对离目标任务的性能。我们的转向模块的学习参数量仅为原始VLM大小的0.14%。我们的转向模块通过维度上的激活调制和跨层自适应转向获得模型控制，无需预先提取的静态向量或手动调整干预点。此外，我们还引入了VNIA（视觉叙事意图对齐）多模态数据集，专门用于促进VLM转向技术的发展和评估。我们的方法在VLM转向和幻觉缓解基准测试中优于现有干预技术，并提出了一种通过激活工程实现多模态模型控制的稳健解决方案。

Summary / 总结

SteerVLM is a lightweight module that guides VLMs to produce outputs more aligned with desired instructions by dynamically adjusting activations. It learns from paired prompts and requires minimal parameters, enhancing control without affecting off-target performance. SteerVLM outperforms existing techniques in steering and hallucination mitigation benchmarks, demonstrating effective multimodal model control through activation modulation.

SteerVLM 是一个轻量级模块，通过动态调整激活来引导 VLM 生成更符合期望指令的输出。它通过配对提示学习，并且只需要少量参数，既增强了控制能力又不影响非目标任务的表现。SteerVLM 在引导和幻觉缓解基准测试中优于现有技术，展示了通过激活工程实现多模态模型控制的有效性。

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Authors: Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, Shuang Zhou

First: 2025-10-30T17:52:02+00:00 · Latest: 2025-10-30T17:52:02+00:00

Comments: 14 pages, 9 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
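
Because each problem needs only a final answer, grading can in principle be automated by comparing answers after normalization. The sketch below is a minimal illustration under strong assumptions: it handles only plain numeric answers (integers, decimals, fractions), and the function names `normalize_answer` and `grade` are hypothetical, not part of the AMO-Bench release. The benchmark's actual grader may treat expressions, sets, and other answer types differently.

```python
from fractions import Fraction


def normalize_answer(text):
    """Parse a plain numeric answer ('3/4', '0.75', ' 12 ') into an exact Fraction.

    Returns None when the text is not a plain number (hypothetical fallback).
    """
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def grade(predictions, references):
    """Return the fraction of predictions whose value equals the reference."""
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = normalize_answer(pred), normalize_answer(ref)
        # An unparsable prediction never counts as correct.
        if p is not None and p == r:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    # '3/4' matches '0.75'; the other two predictions are wrong or unparsable.
    print(grade(["3/4", "0.5", "x"], ["0.75", "1/3", "2"]))
```

Exact rational comparison avoids floating-point tolerance choices, which is one reason final-answer benchmarks are easier to grade robustly than proof-based ones.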

中文标题/摘要

标题：AMO-Bench：大型语言模型在高中数学竞赛中仍表现不佳

我们提出了AMO-Bench，这是一个高级数学推理基准，包含奥林匹克级别甚至更高的难度，共有50个人工设计的问题。现有的基准测试广泛利用高中数学竞赛来评估大型语言模型（LLMs）的数学推理能力。然而，许多现有的数学竞赛由于性能饱和（例如AIME24/25）而变得不再有效评估顶级LLMs。为了解决这一问题，AMO-Bench通过确保所有50个问题（1）由专家交叉验证以达到至少国际数学奥林匹克（IMO）的难度标准，（2）完全原创以防止潜在的性能泄漏来自数据记忆，引入了更严格的挑战。此外，AMO-Bench中的每个问题只需要最终答案，而不是证明，这使得自动和稳健的评分成为可能。在AMO-Bench上对26个LLM进行的实验结果显示，即使表现最好的模型在AMO-Bench上的准确率也只有52.4%，大多数LLM的得分低于40%。除了这些糟糕的表现，我们进一步的分析揭示了AMO-Bench上随着测试时计算量增加的有希望的扩展趋势。这些结果突显了当前LLMs在数学推理方面改进的巨大空间。我们发布AMO-Bench以促进进一步研究，以提高语言模型的推理能力。

Summary / 总结

AMO-Bench is a new benchmark for evaluating mathematical reasoning capabilities of large language models (LLMs) with problems at or above the International Mathematical Olympiad (IMO) difficulty level. It includes 50 original problems cross-validated by experts. Experiments on 26 LLMs show that even the best model achieves only 52.4% accuracy, indicating significant room for improvement in mathematical reasoning. Further analysis suggests a scaling trend with increased test-time compute.

AMO-Bench 是一个用于评估大型语言模型 (LLM) 数学推理能力的新基准，包含50个国际数学奥林匹克竞赛级别的原创问题，防止数据记忆。实验显示，即使表现最好的模型在 AMO-Bench 上也只能达到52.4% 的准确率，表明在数学推理方面有很大的改进空间。研究还揭示了随着测试时计算量增加，性能有所提升的趋势。

Smoothing Slot Attention Iterations and Recurrences

Authors: Rongzhen Zhao, Wenyan Yang, Juho Kannala, Joni Pajarinen

First: 2025-08-07T14:09:33+00:00 · Latest: 2025-10-30T17:46:35+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Slot Attention (SA) and its variants lie at the heart of mainstream Object-Centric Learning (OCL). Objects in an image can be aggregated into respective slot vectors by iteratively refining cold-start query vectors, typically three times, via SA on image features. For video, such aggregation is recurrently shared across frames, with queries cold-started on the first frame and transitioned from the previous frame's slots on non-first frames. However, the cold-start queries lack sample-specific cues and thus hinder precise aggregation on the image or the video's first frame. Moreover, non-first frames' queries are already sample-specific and thus require transforms different from the first frame's aggregation. We address these issues for the first time with our SmoothSA: (1) To smooth SA iterations on the image or the video's first frame, we preheat the cold-start queries with rich information from the input features, via a tiny module self-distilled inside OCL; (2) To smooth SA recurrences across all video frames, we differentiate the homogeneous transforms on the first and non-first frames, by using full and single iterations respectively. Comprehensive experiments on object discovery, recognition and downstream benchmarks validate our method's effectiveness. Further analyses intuitively illuminate how our method smooths SA iterations and recurrences. Our source code, model checkpoints and training logs are available on https://github.com/Genera1Z/SmoothSA.
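
The iterate-then-recur scheme described in this abstract can be sketched in a few lines. This is a simplified illustration, not the SmoothSA implementation: the learned query/key/value projections and the GRU update of standard Slot Attention are replaced by a plain attention-weighted mean, the transition between frames is the identity, and the preheating module is omitted. The function names and the `first_iters` / `later_iters` parameters are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def slot_attention_step(slots, feats):
    """One simplified refinement step: slots compete for each feature.

    Assumption: no learned projections and no GRU; each slot becomes the
    attention-weighted mean of the features.
    """
    dim = len(feats[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[n][k]: softmax over slots k for feature n (competition among slots).
    attn = [
        softmax([scale * sum(a * b for a, b in zip(s, f)) for s in slots])
        for f in feats
    ]
    new_slots = []
    for k in range(len(slots)):
        weights = [attn[n][k] for n in range(len(feats))]
        total = sum(weights) + 1e-8
        new_slots.append(
            [sum(w * f[d] for w, f in zip(weights, feats)) / total for d in range(dim)]
        )
    return new_slots, attn


def aggregate_video(frames, init_queries, first_iters=3, later_iters=1):
    """Full iterations on the first frame, a single iteration on later frames."""
    slots_per_frame = []
    queries = init_queries
    for t, feats in enumerate(frames):
        iters = first_iters if t == 0 else later_iters
        slots = queries
        for _ in range(iters):
            slots, _ = slot_attention_step(slots, feats)
        slots_per_frame.append(slots)
        queries = slots  # identity transition (assumption for illustration)
    return slots_per_frame
```

The split between `first_iters` and `later_iters` mirrors point (2) of the abstract: queries inherited from the previous frame are already sample-specific, so a single refinement step may suffice there.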

中文标题/摘要

标题：平滑槽注意迭代和递归

槽注意（SA）及其变体是主流对象中心学习（OCL）的核心。图像中的对象可以被聚合为相应的槽向量，通过SA迭代地细化冷启动查询向量，通常在图像特征上进行三次迭代。对于视频，这种聚合在帧间是递归共享的，查询在第一帧冷启动，而在非第一帧则从前一帧的槽过渡。然而，冷启动查询缺乏样本特定的线索，从而妨碍了对图像或视频第一帧的精确聚合；此外，非第一帧的查询已经是样本特定的，因此需要与第一帧聚合不同的变换。我们首次通过我们的SmoothSA解决了这些问题：（1）为了平滑图像或视频第一帧的SA迭代，我们使用输入特征的丰富信息预热冷启动查询，通过OCL内部自学习的小模块；（2）为了平滑所有视频帧间的SA递归，我们通过分别使用完整和单一迭代来区分第一帧和非第一帧的同质变换。全面的实验验证了我们方法的有效性。进一步的分析直观地阐明了我们的方法如何平滑SA迭代和递归。我们的源代码、模型检查点和训练日志可在https://github.com/Genera1Z/SmoothSA上获得。

Summary / 总结

The paper addresses two limitations of Slot Attention (SA) and its variants: cold-start queries on the first frame lack sample-specific cues, and queries on later video frames need a different transform from the first frame's aggregation. SmoothSA preheats the cold-start queries with rich input feature information and differentiates the transforms on the first and non-first frames by using full and single iterations respectively. Experiments on object discovery, recognition, and downstream benchmarks validate the method, and further analyses illustrate how it smooths SA iterations and recurrences.

论文通过提出SmoothSA来改进SA及其变体中的冷启动查询和跨帧查询的局限性。SmoothSA通过在输入特征信息中丰富冷启动查询来预热第一帧，并在第一帧和非第一帧之间使用完整和单次迭代分别区分变换。实验表明，SmoothSA提高了对象发现、识别和下游基准的性能，并直观地解释了如何平滑SA迭代和递归。

Predicting Video Slot Attention Queries from Random Slot-Feature Pairs

Authors: Rongzhen Zhao, Jian Li, Juho Kannala, Joni Pajarinen

First: 2025-08-02T12:48:04+00:00 · Latest: 2025-10-30T17:43:18+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Unsupervised video Object-Centric Learning (OCL) is promising as it enables object-level scene representation and dynamics modeling, as humans do. Mainstream video OCL methods adopt a recurrent architecture: an aggregator aggregates the current video frame into object features, termed slots, under some queries; a transitioner transits current slots to queries for the next frame. This is an effective architecture, but all existing implementations both (i1) neglect to incorporate next-frame features, the most informative source for query prediction, and (i2) fail to learn transition dynamics, the knowledge essential for query prediction. To address these issues, we propose Random Slot-Feature pair for learning Query prediction (RandSF.Q): (t1) We design a new transitioner to incorporate both slots and features, which provides more information for query prediction; (t2) We train the transitioner to predict queries from slot-feature pairs randomly sampled from available recurrences, which drives it to learn transition dynamics. Experiments on scene representation demonstrate that our method surpasses existing video OCL methods significantly, e.g., by up to 10 points on object discovery, setting a new state of the art. Such superiority also benefits downstream tasks like dynamics modeling. Our core source code, model checkpoints and training logs are available on https://github.com/Genera1Z/RandSF.Q.

中文标题/摘要

标题：从随机槽-特征对预测视频槽注意力查询

无监督视频对象中心学习（OCL）很有前景，因为它能够像人类一样实现对象级别的场景表示和动力学建模。主流的视频OCL方法采用递归架构：聚合器在给定查询下将当前视频帧聚合为对象特征，称为槽；转换器将当前槽转换为下一帧的查询。这是一种有效的架构，但所有现有的实现都（i1）忽略了下一帧特征，这是查询预测最丰富的来源，且（i2）未能学习转换动力学，这是查询预测所需的重要知识。为解决这些问题，我们提出了随机槽-特征对学习查询预测（RandSF.Q）：（t1）我们设计了一种新的转换器，使其同时包含槽和特征，从而为查询预测提供更多信息；（t2）我们训练转换器从可用递归中随机采样的槽-特征对预测查询，这促使它学习转换动力学。场景表示实验表明，我们的方法显著优于现有视频OCL方法，例如在对象发现上最多提高了10个点，从而建立了新的最先进的水平。这种优越性也惠及了动力学建模等下游任务。我们的核心源代码、模型检查点和训练日志可在https://github.com/Genera1Z/RandSF.Q/获得。

A General Incentives-Based Framework for Fairness in Multi-agent Resource Allocation

Authors: Ashwin Kumar, William Yeoh

First: 2025-10-30T17:37:51+00:00 · Latest: 2025-10-30T17:37:51+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce the General Incentives-based Framework for Fairness (GIFF), a novel approach for fair multi-agent resource allocation that infers fair decision-making from standard value functions. In resource-constrained settings, agents optimizing for efficiency often create inequitable outcomes. Our approach leverages the action-value (Q-)function to balance efficiency and fairness without requiring additional training. Specifically, our method computes a local fairness gain for each action and introduces a counterfactual advantage correction term to discourage over-allocation to already well-off agents. This approach is formalized within a centralized control setting, where an arbitrator uses the GIFF-modified Q-values to solve an allocation problem. Empirical evaluations across diverse domains, including dynamic ridesharing, homelessness prevention, and a complex job allocation task, demonstrate that our framework consistently outperforms strong baselines and can discover far-sighted, equitable policies. The framework's effectiveness is supported by a theoretical foundation; we prove its fairness surrogate is a principled lower bound on the true fairness improvement and that its trade-off parameter offers monotonic tuning. Our findings establish GIFF as a robust and principled framework for leveraging standard reinforcement learning components to achieve more equitable outcomes in complex multi-agent systems.

中文标题/摘要

标题：一种基于激励的公平性框架在多智能体资源分配中的应用

我们提出了基于激励的公平性框架（GIFF），这是一种新颖的公平多智能体资源分配方法，可以从标准价值函数中推断出公平的决策。在资源受限的环境中，优化效率的智能体往往会创造出不公平的结果。我们的方法利用行动价值（Q-）函数来平衡效率和公平，而无需额外的训练。具体来说，我们的方法为每个行动计算局部公平收益，并引入一个反事实优势修正项，以防止过度分配给已经富裕的智能体。该方法在集中控制的环境中进行了形式化，仲裁者使用GIFF修改后的Q值来解决分配问题。在包括动态拼车、预防无家可归和复杂的工作分配任务在内的多种领域中进行的实证评估表明，我们的框架始终优于强大的基线，并能够发现远见卓识、公平的政策。该框架的有效性得到了理论基础的支持；我们证明其公平性代理是真实公平改进的原理性下界，并且其权衡参数提供了单调调整。我们的研究结果确立了GIFF作为一种稳健且原理性的框架，可以利用标准强化学习组件在复杂的多智能体系统中实现更公平的结果。

Summary / 总结

The research introduces GIFF, a framework for fair multi-agent resource allocation that uses standard action-value (Q-)functions to balance efficiency and fairness without additional training. It computes a local fairness gain for each action and adds a counterfactual advantage correction term to discourage over-allocation to already well-off agents. Empirical evaluations on dynamic ridesharing, homelessness prevention, and job allocation show that GIFF outperforms strong baselines and discovers far-sighted, equitable policies. On the theoretical side, the fairness surrogate is proven to be a lower bound on the true fairness improvement, and the trade-off parameter offers monotonic tuning.

研究引入了基于激励的公平性框架（GIFF），利用Q函数平衡多智能体资源分配中的效率与公平性。该方法计算局部公平收益，并引入反事实优势修正项以防止向已富裕的智能体过度分配资源。在多个领域的实证评估表明，GIFF优于强基线，并能发现公平政策。理论分析支持该框架的公平性和单调调参能力。

ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Authors: Chihan Huang, Hao Tang

First: 2025-07-08T15:17:24+00:00 · Latest: 2025-10-30T17:35:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.

中文标题/摘要

标题：ScoreAdv：基于扩散模型的自然型目标生成对抗性示例

尽管深度学习在各个领域取得了成功，但它仍然容易受到对抗攻击的影响。尽管许多现有的对抗攻击方法能够实现高成功率，但它们通常依赖于$\ell_{p}$范数扰动约束，这并不符合人类的感知能力。因此，研究人员将重点转向生成自然的、无限制的对抗性示例（UAEs）。基于GAN的方法存在固有的局限性，如由于不稳定性导致的图像质量差和模式崩溃。同时，扩散模型已被用于生成UAEs，但它们仍然依赖于迭代的PGD扰动注入，而未能充分利用其核心去噪能力。在本文中，我们提出了一种基于扩散模型生成UAEs的新方法，称为ScoreAdv。该方法结合了一种可解释的对抗指导机制，逐步将采样分布向对抗分布偏移，同时使用可解释的显著性图将参考图像的视觉信息注入生成样本中。值得注意的是，我们的方法能够生成无限数量的自然对抗性示例，并且不仅可以攻击分类模型，还可以攻击检索模型。我们在ImageNet和CelebA数据集上进行了广泛的实验，在黑盒和白盒设置下，验证了ScoreAdv在十个目标模型上的性能。我们的结果表明，ScoreAdv在攻击成功率和图像质量方面达到了最先进的水平，同时保持了推理效率。此外，去噪和对抗扰动之间的动态平衡使ScoreAdv即使在防御措施下也能保持稳健性。

Summary / 总结

ScoreAdv is a diffusion-based approach for generating natural, unrestricted adversarial examples. An interpretable adversarial guidance mechanism gradually shifts the sampling distribution towards the adversarial distribution, while an interpretable saliency map injects the visual information of a reference image into the generated samples. ScoreAdv can generate an unlimited number of natural adversarial examples and attack both classification and retrieval models. Experiments on ImageNet and CelebA across ten target models show state-of-the-art attack success rates and image quality, while maintaining inference efficiency and robustness under defensive measures.

ScoreAdv 是一种使用扩散模型生成自然对抗样本的新方法。它结合了对抗指导机制和可解释的显著图，以逐步将采样分布向对抗分布偏移。在 ImageNet 和 CelebA 数据集上的实验表明，ScoreAdv 达到了较高的攻击成功率和图像质量，优于现有方法且保持了高效性。它可以生成无限数量的自然对抗样本，并攻击分类和检索模型。

Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models

Authors: J. de Curtò, I. de Zarzà, Pablo García, Jordi Cabot

First: 2025-10-30T17:31:03+00:00 · Latest: 2025-10-30T17:31:03+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.

Summary / 总结

This paper evaluates the reasoning capabilities of 15 foundation models on 79 problems across three computational platforms: HPC supercomputing, cloud platforms, and university clusters. Through three experimental phases, the study establishes an infrastructure-agnostic benchmark spanning eight academic domains. The findings challenge conventional scaling assumptions, indicate that training data quality matters more than model size, and provide actionable guidelines for model selection.

该研究评估了15个基础模型在三个计算平台上（HPC超级计算机、云平台和大学集群）的推理能力，通过三个实验阶段建立了涵盖八个学术领域的基准。主要发现包括挑战了传统的规模假设，强调了训练数据质量的重要性超过模型大小，并提供了跨教育、生产和研究领域的模型选择指南。

Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis

Authors: Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang

First: 2025-10-30T17:22:22+00:00 · Latest: 2025-10-30T17:22:22+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
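
The quantitative comparison mentioned above relies on the Jensen-Shannon divergence, which for two discrete distributions P and Q is the average KL divergence of each from their mixture M = (P + Q) / 2. The sketch below shows that computation in base 2, so the value lies in [0, 1]. How the paper estimates distributions from key vectors is not stated in the abstract; the `histogram` helper over one-dimensional values is an assumption made purely for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def histogram(values, bins, lo, hi):
    """Normalized histogram of scalar values (hypothetical preprocessing step)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]


if __name__ == "__main__":
    # Toy stand-ins for 1-D projections of text keys and visual keys.
    text_keys = [0.1, 0.2, 0.15, 0.3]
    visual_keys = [0.8, 0.9, 0.85, 0.7]
    p = histogram(text_keys, bins=4, lo=0.0, hi=1.0)
    q = histogram(visual_keys, bins=4, lo=0.0, hi=1.0)
    print(js_divergence(p, q))
```

Comparing an inter-modal value (text versus visual keys) against an intra-modal one (two samples of text keys) is the kind of contrast the abstract describes.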

中文标题/摘要

标题：通过注意力键空间分析揭示多模态大型语言模型中的固有文本偏见

多模态大型语言模型（MLLMs）在处理视觉语言数据时表现出对文本输入的明显偏好，限制了它们从视觉证据中有效推理的能力。不同于以往研究将这种文本偏见归因于外部因素如数据不平衡或指令调优，我们提出这种偏见源自模型内部架构。具体来说，我们假设视觉键向量（视觉键）在语言仅预训练期间学习的文本键空间中是离群的（OOD）。因此，在注意力计算过程中，这些视觉键会系统地获得较低的相似度分数，导致它们在上下文表示中的利用率较低。为了验证这一假设，我们从LLaVA和Qwen2.5-VL中提取键向量，并使用定性（t-SNE）和定量（Jensen-Shannon散度）方法分析其分布结构。结果直接证明了视觉键和文本键在注意力空间中占据明显不同的子空间。跨模态的差异具有统计显著性，远超同模态内的变化量级。这些发现揭示了文本偏见源自注意力键空间内的固有不匹配，而不仅仅是外部数据因素。

Summary / 总结

The study investigates the intrinsic text bias in multimodal large language models (MLLMs) by analyzing attention key-space. It hypothesizes that visual key vectors are out-of-distribution relative to the text key space learned during language-only pretraining, leading to their under-utilization. Key vectors from LLaVA and Qwen2.5-VL were extracted and analyzed using t-SNE and Jensen-Shannon divergence, showing that visual and textual keys occupy distinct subspaces, with inter-modal divergence significantly exceeding intra-modal variation.

研究通过分析注意力键空间，探讨了多模态大型语言模型（MLLMs）的内在文本偏见。研究假设视觉键向量在语言仅预训练期间学习的文本键空间中是离群的，导致其在上下文表示中的利用率较低。通过使用t-SNE和Jensen-Shannon散度分析LLaVA和Qwen2.5-VL的键向量，结果显示视觉和文本键在注意力空间中占据不同的子空间，跨模态的差异显著大于同模态的差异。

Controlling Thinking Speed in Reasoning Models

Authors: Zhengkai Lin, Zhihang Fu, Ze Chen, Chao Chen, Liang Xie, Wenxiao Wang, Deng Cai, Zheng Wang, Jieping Ye

Venue: NeurIPS 2025 Spotlight

First: 2025-07-04T16:41:06+00:00 · Latest: 2025-10-30T17:13:35+00:00

Comments: NeurIPS 2025 Spotlight

Abs · PDF · Code1 · Code2

Abstract

Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs' representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
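
The two ingredients named above, a steering vector in representation space and a difficulty signal that decides when to apply it, can be illustrated with a toy sketch. The abstract does not say how the vector is identified; the difference of mean hidden states used here is a common representation-editing heuristic and is an assumption, as are the function names, the threshold, and the sign convention (positive `alpha` pushes toward fast thinking).

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(fast_states, slow_states):
    """Difference of means: direction pointing from slow to fast thinking.

    Assumption: the paper's actual extraction procedure may differ.
    """
    fast_mean = mean_vector(fast_states)
    slow_mean = mean_vector(slow_states)
    return [f - s for f, s in zip(fast_mean, slow_mean)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def choose_alpha(difficulty, threshold=0.5, strength=1.0):
    """Hypothetical policy: speed up on easy segments, slow down on hard ones."""
    return strength if difficulty < threshold else -strength


if __name__ == "__main__":
    v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]])
    print(steer([0.0, 0.0], v, choose_alpha(difficulty=0.2)))
```

In a real system the edit would be applied to a model's hidden states at inference time; here plain lists stand in for those activations.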

中文标题/摘要

标题：在推理模型中控制思考速度

人类认知被认为分为两种模式：快速直觉的系统1思考和缓慢审慎的系统2思考。虽然当前的大规模推理模型（LRMs）在系统2思考方面表现出色，但它们无法进行快速思考的能力导致了高计算开销和延迟。在本研究中，我们通过动态调整思考速度使LRMs能够近似人类智能，优化准确性和效率之间的权衡。我们的方法解决了两个关键问题：（1）如何在LRMs中控制思考速度，（2）何时调整以获得最佳性能。对于第一个问题，我们确定了控制LRMs表示空间中慢速和快速思考转换的引导向量。利用这个向量，我们实现了第一个基于表示编辑的测试时缩放效果，优于现有的基于提示的缩放方法。对于第二个问题，我们应用实时难度估计来信号处理不同复杂度的推理段落。结合这些技术，我们提出了第一个能够快速处理简单步骤并深入分析复杂推理的推理策略。无需任何训练或额外成本，我们的插件模块在领先的大规模推理模型和高级推理基准上分别实现了平均+1.3%的准确性和-8.6%的标记使用量。我们的所有算法都是基于vLLM实现的，并预计支持更广泛的应用并启发未来的研究。

Summary / 总结

This work aims to enhance Large Reasoning Models (LRMs) by dynamically adjusting their thinking speed. The authors identify a steering vector that governs slow-fast thinking transitions and use it to achieve a representation editing-based test-time scaling effect that outperforms existing prompt-based methods. They also apply real-time difficulty estimation to signal reasoning segments of varying complexity. Without any training or additional cost, the resulting plug-in module delivers an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks.

本文旨在通过动态调整思考速度来增强大型推理模型（LRMs），使其更接近人类认知。作者识别了一个控制LRMs缓慢和快速思考转换的引导向量，并提出了一种基于表示编辑的测试时缩放方法，优于现有的基于提示的方法。他们还引入了实时难度估计来优化推理片段的处理。所提出的模块在各种LRMs和基准测试中平均提高了1.3%的准确性，同时减少了8.6%的标记使用量，且无需额外的训练成本。

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

First: 2025-09-25T16:19:06+00:00 · Latest: 2025-10-30T17:09:54+00:00

Comments: Added link to access models: https://huggingface.co/collections/nvidia/reward-models-10-2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
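
To make the contrast with Bradley-Terry models concrete: a Bradley-Terry reward model assigns each response a scalar score and models the probability that one response is preferred as the logistic function of the score difference, with no explicit criteria attached. A principle-based model instead produces a yes/no verdict per principle. The sketch below illustrates both. The `principle_reward` aggregation (fraction of selected principles satisfied) is a hypothetical choice made for illustration, not the paper's training objective.

```python
import math


def bradley_terry_prob(score_a, score_b):
    """Probability that response A is preferred over B under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def principle_reward(verdicts, principles_of_interest):
    """Fraction of the selected principles that the response satisfies.

    verdicts maps a principle name to a binary judgment (True / False).
    The averaging rule is an assumption for illustration only.
    """
    selected = [verdicts[p] for p in principles_of_interest]
    return sum(selected) / len(selected)


if __name__ == "__main__":
    print(bradley_terry_prob(2.0, 1.0))
    verdicts = {"accuracy of information": True, "code readability": False}
    # The caller picks which principles matter at inference time.
    print(principle_reward(verdicts, ["accuracy of information"]))
    print(principle_reward(verdicts, list(verdicts)))
```

Selecting principles at call time is what lets the same set of verdicts yield different rewards for different users, which a single scalar Bradley-Terry score cannot express.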

中文标题/摘要

标题：RLBFF：二元灵活反馈以实现人类反馈与可验证奖励之间的桥梁

在大型语言模型（LLM）后训练中，强化学习与人类反馈（RLHF）和强化学习与可验证奖励（RLVR）是主要的强化学习范式，各自具有独特的优势。然而，RLHF在可解释性和奖励作弊方面存在困难，因为它依赖于通常缺乏明确标准的人类判断；而RLVR则因其关注基于正确性的验证器而受到限制。我们提出了强化学习与二元灵活反馈（RLBFF），它结合了人类驱动的偏好灵活性与基于规则验证的精确性，使奖励模型能够捕捉响应质量的细微方面，而不仅仅是正确性。RLBFF从自然语言反馈中提取可以二元回答的原则（例如：信息准确性：是，或代码可读性：否），然后将这些原则用于奖励模型训练作为蕴含任务（响应是否满足任意原则）。我们展示了这种训练方式的奖励模型在数据匹配的情况下可以超越布雷德利-特里模型，并在RM-Bench（86.2%）和JudgeBench（81.4%，截至2025年9月24日排行榜第一）上达到顶级性能。此外，用户可以在推理时指定感兴趣的原则以定制奖励模型的焦点，这与布雷德利-特里模型不同。最后，我们提供了一个完全开源的方案（包括数据），使用RLBFF和我们的奖励模型对Qwen3-32B进行对齐，以匹配或超越o3-mini和DeepSeek R1在MT-Bench、WildBench和Arena Hard v2通用对齐基准上的性能（推理成本不到5%）。模型：https://huggingface.co/collections/nvidia/reward-models-10-2025

Summary / 总结

The paper introduces RLBFF, which combines the versatility of human feedback with the precision of verifiable rewards in LLM post-training. It extracts principles that can be answered in a binary fashion from natural language feedback and trains reward models as an entailment task. These reward models can outperform Bradley-Terry models when matched for data, reaching 86.2% on RM-Bench and 81.4% on JudgeBench. Users can specify principles of interest at inference time, and an open-source recipe aligns Qwen3-32B to match or exceed o3-mini and DeepSeek R1 on MT-Bench, WildBench, and Arena Hard v2 at less than 5% of the inference cost.

该论文提出了一种名为RLBFF的方法，结合了人类反馈和可验证奖励，以提高强化学习中奖励模型的可解释性和性能。RLBFF从自然语言反馈中提取二元原则，并将其作为蕴含任务用于训练奖励模型。实验结果显示，RLBFF训练的奖励模型在RM-Bench和JudgeBench上表现优于Bradley-Terry模型。此外，RLBFF允许用户在推理时自定义奖励模型的关注点，并且该方法是完全开源的，提供了可访问的数据和模型。

Value Drifts: Tracing Value Alignment During LLM Post-Training

Authors: Mehar Bhatia, Shravan Nayak, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, Vered Shwartz, Siva Reddy

First: 2025-10-30T17:09:09+00:00 · Latest: 2025-10-30T17:09:09+00:00

Abs · PDF · Code1 · Code2

Abstract

As LLMs occupy an increasingly important role in society, they are more and more confronted with questions that require them not only to draw on their general knowledge but also to align with certain human value systems. Therefore, studying the alignment of LLMs with human values has become a crucial field of inquiry. Prior work, however, mostly focuses on evaluating the alignment of fully trained models, overlooking the training dynamics by which models learn to express human values. In this work, we investigate how and at which stage value alignment arises during the course of a model's post-training. Our analysis disentangles the effects of post-training algorithms and datasets, measuring both the magnitude and time of value drifts during training. Experimenting with Llama-3 and Qwen-3 models of different sizes and popular supervised fine-tuning (SFT) and preference optimization datasets and algorithms, we find that the SFT phase generally establishes a model's values, and subsequent preference optimization rarely re-aligns these values. Furthermore, using a synthetic preference dataset that enables controlled manipulation of values, we find that different preference optimization algorithms lead to different value alignment outcomes, even when preference data is held constant. Our findings provide actionable insights into how values are learned during post-training and help to inform data curation, as well as the selection of models and algorithms for preference optimization to improve model alignment to human values.

中文标题/摘要

标题：价值漂移：LLM后训练中的价值对齐追踪

随着大语言模型（LLM）在社会中的作用越来越重要，它们越来越多地面临需要不仅依靠其一般知识，还要与某些人类价值体系保持一致的问题。因此，研究LLM与人类价值观的对齐已成为一个关键的研究领域。然而，先前的工作主要集中在评估完全训练好的模型的价值对齐情况，而忽略了模型在学习表达人类价值观过程中的训练动态。在本研究中，我们探讨了模型在后训练过程中价值对齐是如何以及在哪个阶段出现的。我们的分析将后训练算法和数据集的影响因素分离出来，测量了训练过程中价值漂移的大小和时间。通过使用不同规模的Llama-3和Qwen-3模型以及流行的监督微调（SFT）和偏好优化数据集和算法，我们发现SFT阶段通常建立了模型的价值，而随后的偏好优化很少重新对齐这些价值。此外，使用一个可以控制价值变化的合成偏好数据集，我们发现不同的偏好优化算法即使在偏好数据保持不变的情况下也会导致不同的价值对齐结果。我们的研究结果提供了关于后训练过程中如何学习价值的可操作见解，并有助于指导数据收集，以及选择偏好优化的模型和算法以提高模型与人类价值观的对齐。

Summary / 总结

This study investigates how and when language models align with human values during post-training, using Llama-3 and Qwen-3 models and various fine-tuning and preference optimization datasets and algorithms. The research finds that the supervised fine-tuning phase primarily establishes a model's values, while subsequent preference optimization rarely re-aligns these values. Different preference optimization algorithms can lead to different value alignment outcomes even with the same preference data. These findings offer insights into value learning during post-training and guide data curation and algorithm selection for improving model alignment with human values.

研究探讨了LLM在后训练过程中如何以及何时与人类价值观对齐，使用了Llama-3和Qwen-3模型以及多种细调和偏好优化数据集和算法。研究发现，监督细调阶段主要建立模型的价值观，而后续的偏好优化很少重新对齐这些价值观。即使偏好数据保持一致，不同的偏好优化算法也会导致不同的价值对齐结果。这些发现为理解后训练期间的价值学习提供了见解，并指导数据收集和偏好优化算法的选择，以提高模型与人类价值观的对齐。

ProstNFound+: A Prospective Study using Medical Foundation Models for Prostate Cancer Detection

Authors: Paul F. R. Wilson, Mohamed Harmanani, Minh Nguyen Nhat To, Amoon Jamzad, Tarek Elghareb, Zhuoxin Guo, Adam Kinnaird, Brian Wodlinger, Purang Abolmaesumi, Parvin Mousavi

First: 2025-10-30T17:07:04+00:00 · Latest: 2025-10-30T17:07:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Purpose: Medical foundation models (FMs) offer a path to build high-performance diagnostic systems. However, their application to prostate cancer (PCa) detection from micro-ultrasound (μUS) remains untested in clinical settings. We present ProstNFound+, an adaptation of FMs for PCa detection from μUS, along with its first prospective validation. Methods: ProstNFound+ incorporates a medical FM, adapter tuning, and a custom prompt encoder that embeds PCa-specific clinical biomarkers. The model generates a cancer heatmap and a risk score for clinically significant PCa. Following training on multi-center retrospective data, the model is prospectively evaluated on data acquired five years later from a new clinical site. Model predictions are benchmarked against standard clinical scoring protocols (PRI-MUS and PI-RADS). Results: ProstNFound+ shows strong generalization to the prospective data, with no performance degradation compared to retrospective evaluation. It aligns closely with clinical scores and produces interpretable heatmaps consistent with biopsy-confirmed lesions. Conclusion: The results highlight its potential for clinical deployment, offering a scalable and interpretable alternative to expert-driven protocols.

中文标题/摘要

标题：ProstNFound+:一项使用医疗基础模型进行前列腺癌检测的前瞻性研究

目的：医疗基础模型（FMs）提供了一条构建高性能诊断系统的途径。然而，它们在从微超声（μUS）检测前列腺癌（PCa）方面的应用尚未在临床环境中得到验证。我们介绍了ProstNFound+，这是一种针对μUS的PCa检测的FMs适应性方法，并附带了其首次前瞻性验证。方法：ProstNFound+结合了医疗FM、适配器调优和一个嵌入PCa特异性临床生物标志物的自定义提示编码器。该模型生成癌症热图和临床显著PCa的风险评分。在多中心回顾性数据上进行训练后，该模型在五年后从新临床站点获取的数据上进行前瞻性评估。模型预测结果与标准临床评分协议（PRI-MUS和PI-RADS）进行了基准测试。结果：ProstNFound+在前瞻性数据上的泛化能力强，与回顾性评估相比没有性能下降。它与临床评分高度一致，并生成与活检证实的病变一致的可解释热图。结论：结果突显了其在临床部署中的潜力，提供了一种可扩展且可解释的专家驱动协议的替代方案。

Summary / 总结

ProstNFound+ adapts a medical foundation model to prostate cancer detection from micro-ultrasound. It combines adapter tuning with a custom prompt encoder that embeds clinical biomarkers. After training on multi-center retrospective data, the model was prospectively validated on data from a new clinical site, where it generalized without performance degradation and aligned closely with clinical scores, indicating its potential for clinical deployment.

ProstNFound+ 是一种针对微超声前列腺癌检测的医疗基础模型，结合了适配器调优和包含临床生物标志物的自定义提示编码器。该模型在多中心回顾性数据上训练后，在新的临床站点进行了前瞻性验证，显示了良好的泛化能力和与临床评分的一致性，生成的热图与活检确认的病灶一致。

Delegated Authorization for Agents Constrained to Semantic Task-to-Scope Matching

Authors: Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervé Muyal, Marcelo Yannuzzi

First: 2025-10-30T17:07:00+00:00 · Latest: 2025-10-30T17:07:00+00:00

Comments: Paper page at https://outshift-open.github.io/ASTRA

Abs · PDF · Code1 · Code2 · Project1

Abstract

Authorizing Large Language Model driven agents to dynamically invoke tools and access protected resources introduces significant risks, since current methods for delegating authorization grant overly broad permissions and give access to tools allowing agents to operate beyond the intended task scope. We introduce and assess a delegated authorization model enabling authorization servers to semantically inspect access requests to protected resources, and issue access tokens constrained to the minimal set of scopes necessary for the agents' assigned tasks. Given the unavailability of datasets centered on delegated authorization flows, particularly including both semantically appropriate and inappropriate scope requests for a given task, we introduce ASTRA, a dataset and data generation pipeline for benchmarking semantic matching between tasks and scopes. Our experiments show both the potential and current limitations of model-based matching, particularly as the number of scopes needed for task completion increases. Our results highlight the need for further research into semantic matching techniques enabling intent-aware authorization for multi-agent and tool-augmented applications, including fine-grained control, such as Task-Based Access Control (TBAC).
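
The core idea of constraining a token to the minimal set of scopes can be illustrated with a small sketch. Everything named here is hypothetical: `match_scopes`, `keyword_relevance`, and the scope names are illustrative only, and the keyword-overlap check is a toy stand-in for the model-based semantic matcher the abstract describes, not the authors' method.

```python
def keyword_relevance(task, description):
    # Toy stand-in (assumption) for a model-based semantic matcher:
    # a scope counts as relevant if its description shares a word with the task.
    task_words = set(task.lower().split())
    return bool(task_words & set(description.lower().split()))


def match_scopes(task, requested, scope_descriptions, is_relevant):
    """Split the requested scopes into granted and denied lists for a task."""
    granted = [
        scope for scope in requested
        if is_relevant(task, scope_descriptions.get(scope, ""))
    ]
    denied = [scope for scope in requested if scope not in granted]
    return granted, denied


if __name__ == "__main__":
    # Hypothetical scope catalogue, for illustration only.
    scopes = {
        "calendar.read": "read calendar events",
        "mail.send": "send email messages",
    }
    print(match_scopes("read calendar events", list(scopes), scopes, keyword_relevance))
```

In a real deployment the `is_relevant` callable would be replaced by whatever semantic matcher the authorization server uses; the sketch only shows where that decision sits in the token-issuance flow.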

中文标题/摘要

标题：受语义任务-范围匹配约束的智能体委托授权

授权大型语言模型驱动的代理动态调用工具和访问受保护资源会带来重大风险，因为当前的授权委托方法授予了过于广泛的权限，并允许代理超出预期任务范围操作。我们提出并评估了一种委托授权模型，使授权服务器能够语义地检查对受保护资源的访问请求，并发放仅限于代理分配任务所需最小范围的访问令牌。由于缺乏专注于委托授权流程的数据集，特别是包括给定任务中语义上适当和不适当的范围请求的数据集，我们引入了ASTRA数据集和数据生成管道，用于基准测试任务和范围之间的语义匹配。我们的实验展示了基于模型的匹配的潜力和当前局限性，尤其是在完成任务所需的范围数量增加时。我们的结果强调了进一步研究语义匹配技术以实现意图感知授权的必要性，包括细粒度控制，如基于任务的访问控制（TBAC）。

The End of Manual Decoding: Towards Truly End-to-End Language Models

Authors: Zhichao Wang, Dongyang Ma, Xinting Huang, Deng Cai, Tian Lan, Jiahao Xu, Haitao Mi, Xiaoying Tang, Yan Wang

First: 2025-10-30T17:01:43+00:00 · Latest: 2025-10-30T17:01:43+00:00

Abs · PDF · Code1 · Code2

Abstract

The "end-to-end" label for LLMs is a misnomer. In practice, they depend on a non-differentiable decoding process that requires laborious hand-tuning of hyperparameters like temperature and top-p. This paper introduces AutoDeco, a novel architecture that enables truly "end-to-end" generation by learning to control its own decoding strategy. We augment the standard transformer with lightweight heads that, at each step, dynamically predict context-specific temperature and top-p values alongside the next-token logits. This approach transforms decoding into a parametric, token-level process, allowing the model to self-regulate its sampling strategy within a single forward pass. Through extensive experiments on eight benchmarks, we demonstrate that AutoDeco not only significantly outperforms default decoding strategies but also achieves performance comparable to an oracle-tuned baseline derived from "hacking the test set", a practical upper bound for any static method. Crucially, we uncover an emergent capability for instruction-based decoding control: the model learns to interpret natural language commands (e.g., "generate with low randomness") and adjusts its predicted temperature and top-p on a token-by-token basis, opening a new paradigm for steerable and interactive LLM decoding.
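
To make token-level decoding control concrete, here is a minimal sketch of sampling one token when temperature and top-p are supplied per step rather than fixed globally. It shows only standard temperature scaling and nucleus (top-p) filtering. The function names are hypothetical, and how AutoDeco's heads predict these values or are trained is not specified in the abstract and is not modeled here.

```python
import math
import random


def softmax(logits, temperature):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    # Keep the smallest set of highest-probability tokens whose total mass
    # reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature, top_p, rng):
    # temperature and top_p are per-step inputs here (assumed to come from
    # some predictor), not global constants.
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Each step carries its own (logits, temperature, top_p) triple.
    steps = [([1.0, 3.0, 2.0], 0.1, 0.5), ([0.2, 0.1, 0.3], 1.5, 0.95)]
    print([sample_token(l, t, p, rng) for l, t, p in steps])
```

A low temperature with a small top-p makes the step close to greedy, while a high temperature with top-p near 1 keeps most of the distribution in play.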

中文标题/摘要

标题：手动解码终结：迈向真正端到端的语言模型

对于大语言模型（LLM）来说，“端到端”标签名不副实。实际上，它们依赖于一个不可微分的解码过程，需要费力地手工调整温度和top-p等超参数。本文介绍了AutoDeco，这是一种新型架构，通过学习控制自己的解码策略，真正实现“端到端”生成。我们为标准Transformer增加了轻量级预测头，在每一步动态预测上下文相关的温度和top-p值，以及下一个词元的logits。这种方法将解码转换为参数化的、词元级别的过程，使模型能够在单次前向传播中自我调节其采样策略。通过在八个基准上的广泛实验，我们证明AutoDeco不仅显著优于默认解码策略，而且其性能与通过“破解测试集”获得的oracle调优基线相当，后者是任何静态方法的实际上限。最关键的是，我们发现了一种涌现的能力，即基于指令的解码控制：模型学会解释自然语言命令（例如，“以低随机性生成”），并逐词元调整其预测的温度和top-p，从而为可引导、可交互的LLM解码开启了新范式。

The Impact and Outlook of 3D Gaussian Splatting

Authors: Bernhard Kerbl

First: 2025-10-30T17:01:18+00:00 · Latest: 2025-10-30T17:01:18+00:00

Comments: Article written for Frontiers of Science Award, International Congress on Basic Science, 2025

Abs · PDF · Code1 · Code2

Abstract

Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.

中文标题/摘要

标题：3D 高斯泼溅的影响与展望

自引入以来，3D 高斯泼溅（3DGS）迅速改变了3D 场景表示的格局，激发了大量相关研究。后续工作包括提升3DGS 效率、可扩展性和实际应用性的分析与贡献。在本文综述中，我们概述了3DGS 出现之后涌现的几个关键方向。我们重点介绍了实现资源高效训练和渲染的进展，向动态（或四维，4DGS）表示的演变，以及对其外观建模和渲染过程背后的数学基础的更深入探索。此外，我们还考察了将3DGS 带入移动和虚拟现实平台的努力，其向超大规模环境的扩展，以及通过前馈或分布式计算实现近即时辐射场重建的最新进展。这些发展共同表明，3DGS 已从一种突破性表示演变为3D 视觉和图形学中通用的基础工具。

Summary / 总结

This overview article examines the impact and future directions of 3D Gaussian Splatting (3DGS) since its introduction. It surveys follow-up work that improves the efficiency, scalability, and real-world applicability of 3DGS. The directions covered include resource-efficient training and rendering, dynamic (4DGS) representations, the mathematical foundations of appearance modeling and rendering, deployment on mobile and virtual reality platforms, extension to massive-scale environments, and near-instant radiance field reconstruction via feed-forward or distributed computation.

这篇综述文章探讨了3D高斯泼溅（3DGS）自引入以来的影响及其未来方向，回顾了提升3DGS效率、可扩展性和实际应用性的后续工作。涵盖的方向包括资源高效的训练和渲染、动态（4DGS）表示、外观建模和渲染的数学基础、在移动和虚拟现实平台上的部署、向超大规模环境的扩展，以及通过前馈或分布式计算实现的近即时辐射场重建。

Kimi Linear: An Expressive, Efficient Attention Architecture

Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du

First: 2025-10-30T16:59:43+00:00 · Latest: 2025-10-30T16:59:43+00:00

Comments: Kimi Linear tech report

Abs · PDF · Code1 · Code2

Abstract

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
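
For readers unfamiliar with the classical delta rule that the abstract refers to, the following is a minimal, naive recurrent sketch of a delta-rule memory update with a per-channel decay gate. It is an illustration under stated assumptions, not the KDA parameterization or the chunkwise DPLR algorithm: the abstract does not give those details, and the function names, the placement of the gate on the key dimension, and the use of plain Python lists are all choices made here for clarity.

```python
def delta_rule_step(S, k, v, beta, decay):
    """One recurrent update of a d_v x d_k memory matrix S.

    k: key vector (length d_k), v: value vector (length d_v),
    beta: write strength in [0, 1],
    decay: per-key-channel forget gate in [0, 1] (assumed form of
           "finer-grained gating"; a scalar gate would use one shared value).
    """
    d_v, d_k = len(S), len(k)
    # 1) Channel-wise decay of the old memory.
    S = [[S[i][j] * decay[j] for j in range(d_k)] for i in range(d_v)]
    # 2) Delta-rule correction: S <- S - beta * (S k - v) k^T,
    #    i.e. move the value currently stored under k toward v.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [
        [S[i][j] - beta * (pred[i] - v[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    # Query the memory: o = S q.
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    S = delta_rule_step(S, k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0, decay=[1.0, 1.0])
    S = delta_rule_step(S, k=[1.0, 0.0], v=[5.0, 6.0], beta=1.0, decay=[1.0, 1.0])
    print(read(S, [1.0, 0.0]))  # the second write replaces the first
```

The point of the delta rule is visible in the usage example: writing a new value under an existing unit-norm key with `beta = 1` overwrites the old association instead of accumulating on top of it, which is what lets a fixed-size state be used more effectively.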

中文标题/摘要

标题：Kimi Linear：一种表达能力强、效率高的注意力架构

我们介绍了Kimi Linear，这是一种混合线性注意力架构，在各种场景下（包括短上下文、长上下文和强化学习（RL）扩展阶段）首次在公平比较中超越了全注意力机制。其核心是Kimi Delta注意力（KDA），这是一种扩展了门控DeltaNet的表达性线性注意力模块，通过更精细的门控机制，更有效地利用有限的有限状态RNN内存。我们定制的块状算法通过一种特殊的对角线加低秩（DPLR）过渡矩阵变体，实现了高硬件效率，相比通用的DPLR形式，计算量显著减少，同时更符合经典的delta规则。我们预训练了一个具有30亿激活参数和480亿总参数的Kimi Linear模型，基于KDA和多头潜在注意力（MLA）的分层混合。我们的实验表明，在相同的训练方案下，Kimi Linear在所有评估任务中都显著优于全MLA，同时将KV缓存使用量降低高达75%，并实现高达6倍的100万上下文解码吞吐量。这些结果表明，Kimi Linear可以作为全注意力架构的直接替代品，具有更优的性能和效率，包括更长的输入和输出长度的任务。为了支持进一步的研究，我们开源了KDA内核和vLLM实现，并发布了预训练和指令调优的模型检查点。

Summary / 总结

Kimi Linear is a hybrid linear attention architecture that surpasses full attention in various scenarios, including short-context, long-context, and reinforcement learning scaling regimes. It features Kimi Delta Attention (KDA), which extends Gated DeltaNet with a finer-grained gating mechanism, and a bespoke chunkwise algorithm that reduces computation through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices. Experiments show that a Kimi Linear model with 3B activated parameters (48B total) outperforms full Multi-Head Latent Attention (MLA) across all evaluated tasks, with up to 75% less KV cache usage and up to 6 times the decoding throughput for a 1M context.

Kimi Linear 是一种混合线性注意力架构，在各种场景下超越了全注意力机制，包括短上下文、长上下文以及强化学习。它包含 Kimi Delta Attention (KDA)，该机制以更精细的门控机制扩展了 Gated DeltaNet，以及一种定制的分块算法，通过特殊的 Diagonal-Plus-Low-Rank (DPLR) 转移矩阵变体减少了计算量。实验表明，具有 30 亿激活参数（480 亿总参数）的 Kimi Linear 模型在所有评估任务中都优于全 Multi-Head Latent Attention (MLA)，KV 缓存使用量最多减少 75%，在 100 万上下文下解码吞吐量最高可达 6 倍。

Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill

Authors: Vaibhav Kurrey, Sivakalyan Pujari, Gagan Raj Gupta

First: 2025-10-30T16:54:16+00:00 · Latest: 2025-10-30T16:54:16+00:00

Abs · PDF · Code1 · Code2

Abstract

We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.

中文标题/摘要

标题：过程集成计算机视觉在钢铁轧制厂实时故障预测中的应用

我们展示了在钢铁轧制厂基于机器视觉的异常检测系统长期部署的研究。该系统集成工业摄像头，沿工艺线实时监控设备运行、对齐和热坯运动情况。实时视频流在集中式视频服务器上使用深度学习模型处理，实现设备故障和过程中断的早期预测，从而减少非计划停机成本。基于服务器的推理最大限度地降低了工业过程控制系统（PLC）的计算负担，支持以极少的额外资源在各生产线上进行可扩展部署。通过联合分析来自数据采集系统的传感器数据和视觉输入，该系统识别故障的位置和可能的根本原因，为主动维护提供可操作的见解。这种集成方法增强了工业制造环境中的操作可靠性、生产率和盈利能力。

Summary / 总结

The study presents a long-term deployment of a machine vision-based anomaly detection system for predicting failures in a steel rolling mill. The system uses industrial cameras to monitor equipment and hot bar motion in real time, processing live video streams on a centralized server with deep learning models. Key findings include early prediction of equipment failures, reduced unplanned breakdown costs, and enhanced operational reliability and productivity through proactive maintenance. The server-based inference approach minimizes computational load on PLCs, supporting scalable deployment across production lines with minimal additional resources.

研究展示了在钢铁轧制厂部署基于机器视觉的异常检测系统，用于预测设备故障。该系统使用工业摄像头实时监控设备和热钢棒的运动，并在集中式服务器上使用深度学习模型处理实时视频流。这使得能够提前预测设备故障，减少意外停机成本。系统能够识别故障位置和可能的原因，提供主动维护的行动建议。主要发现包括提前预测故障以及提高工业制造环境中的操作可靠性、生产率和盈利能力。

Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models

Authors: Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang, Lei Liang, Wen Zhang

First: 2025-10-30T16:53:45+00:00 · Latest: 2025-10-30T16:53:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpus hinders LLMs' adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.

中文标题/摘要

标题：Evontree：基于本体规则的大型语言模型自我进化框架

大型语言模型（LLMs）通过大规模预训练和精心策划的微调数据，在多个领域展示了卓越的能力。然而，在如医疗保健等数据敏感领域，缺乏高质量的领域特定训练语料库阻碍了LLMs的适应性应用。同时，领域专家将领域智慧提炼为本体规则，这些规则形式化了概念之间的关系，并确保了知识管理仓库的完整性。将LLMs视为人类知识的隐式存储库，我们提出了Evontree，一种新颖的框架，利用少量高质量的本体规则系统地提取、验证和增强LLMs中的领域知识，而无需大量外部数据集。具体而言，Evontree从原始模型中提取领域本体，使用两个核心本体规则检测不一致性，并通过自我提炼微调强化精炼知识。在医疗问答基准测试中，使用Llama3-8B-Instruct和Med42-v2进行的广泛实验表明，Evontree在准确率上持续优于未修改的模型和领先监督基线，最高可提高3.7%。这些结果证实了我们方法在LLMs低资源领域适应性方面的有效性、高效性和鲁棒性。

Summary / 总结

The research aims to enhance the adaptability of large language models (LLMs) in data-sensitive fields like healthcare, where high-quality domain-specific training data is scarce. Evontree, a novel framework, leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs. Experiments on medical QA benchmarks show that Evontree outperforms both unmodified models and supervised baselines, achieving up to a 3.7% improvement in accuracy.

研究旨在增强大型语言模型（LLMs）在如医疗保健等数据敏感领域的适应性，这些领域缺乏高质量的领域特定训练数据。Evontree 是一种新颖的框架，利用少量高质量的领域本体规则来系统地提取、验证和增强 LLM 中的领域知识。实验表明，Evontree 在医疗问答基准测试中优于未修改的模型和监督基线，准确率提高了多达 3.7%。

Resource Efficient Multi-stain Kidney Glomeruli Segmentation via Self-supervision

Authors: Zeeshan Nisar, Friedrich Feuerhake, Thomas Lampert

First: 2024-12-19T20:43:22+00:00 · Latest: 2025-10-30T16:42:33+00:00

Comments: 39 pages, 10 figures, 4 Tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

Semantic segmentation under domain shift remains a fundamental challenge in computer vision, particularly when labelled training data is scarce. This challenge is particularly exemplified in histopathology image analysis, where the same tissue structures must be segmented across images captured under different imaging conditions (stains), each representing a distinct visual domain. Traditional deep learning methods like UNet require extensive labels, which is both costly and time-consuming, particularly when dealing with multiple domains (or stains). To mitigate this, various unsupervised domain adaptation based methods such as UDAGAN have been proposed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging. This article shows that through self-supervised pre-training -- including SimCLR, BYOL, and a novel approach, HR-CS-CO -- the performance of these segmentation methods (UNet, and UDAGAN) can be retained even with 95% fewer labels. Notably, with self-supervised pre-training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains, compared to their respective fully supervised counterparts (without pre-training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets. Implementations and pre-trained models are publicly available online at https://github.com/zeeshannisar/resource-effecient-multi-stain-kidney-glomeruli-segmentation.git.

中文标题/摘要

标题：通过自我监督实现资源高效多染色肾脏肾小球分割

在领域转移下进行语义分割仍然是计算机视觉中的一个基本挑战，尤其是在标注训练数据稀缺的情况下。这一挑战在组织病理学图像分析中尤为明显，同一组织结构必须在不同成像条件（染色）下进行分割，每种染色代表一个不同的视觉领域。传统的深度学习方法如UNet需要大量的标签，这既昂贵又耗时，尤其是在处理多个领域（或染色）时。为缓解这一问题，已经提出了各种基于无监督领域适应的方法，如UDAGAN，这些方法通过只需要一个（源）染色进行标注来减少对标签的需求。然而，获取源染色标签仍然具有挑战性。本文表明，通过包括SimCLR、BYOL和一种新颖方法HR-CS-CO在内的自我监督预训练，即使在标签减少95%的情况下，这些分割方法（UNet和UDAGAN）的性能也能保持不变。值得注意的是，通过自我监督预训练并仅使用5%的标签，UNet和UDAGAN的性能下降分别仅为5.9%和6.2%，平均而言，与各自的完全监督版本（无预训练，使用100%标签）相比。此外，这些发现还展示了其在公共基准数据集上的泛化能力。相关实现和预训练模型已公开在线。

Summary / 总结

This study addresses the challenge of semantic segmentation under domain shift in histopathology images, where labeled training data is scarce. It proposes using self-supervised pre-training with methods like SimCLR, BYOL, and a novel approach, HR-CS-CO, to reduce the need for labeled data. With pre-training and only 5% of the labels, the performance of UNet and UDAGAN drops by just 5.9% and 6.2%, respectively (averaged over all stains), relative to their fully supervised counterparts trained without pre-training on 100% of the labels. These findings are validated on public benchmark datasets, showing the approach's generalizability beyond the training distribution.

该研究解决了在病理图像中语义分割领域转移的问题，特别是在缺乏标注训练数据的情况下。通过使用自监督预训练方法，如SimCLR、BYOL和一种新型方法HR-CS-CO，减少对标注数据的需求。仅使用5%的标注数据，UNet和UDAGAN的性能分别下降5.9%和6.2%，与使用100%标注数据的完全监督方法相比。这些发现已在公共基准数据集上得到验证，展示了该方法在训练分布之外的泛化能力。

Action-Driven Processes for Continuous-Time Control

Authors: Ruimin He, Shaowei Lin

First: 2025-10-30T16:42:09+00:00 · Latest: 2025-10-30T16:42:09+00:00

Abs · PDF · Code1 · Code2

Abstract

At the heart of reinforcement learning are actions -- decisions made in response to observations of the environment. Actions are equally fundamental in the modeling of stochastic processes, as they trigger discontinuous state transitions and enable the flow of information through large, complex systems. In this paper, we unify the perspectives of stochastic processes and reinforcement learning through action-driven processes, and illustrate their application to spiking neural networks. Leveraging ideas from control-as-inference, we show that minimizing the Kullback-Leibler divergence between a policy-driven true distribution and a reward-driven model distribution for a suitably defined action-driven process is equivalent to maximum entropy reinforcement learning.
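
The stated equivalence can be illustrated with the standard discrete-time control-as-inference calculation. This is a sketch under assumptions that are not taken from the paper: a finite horizon $T$, a uniform action prior, and the symbols $q_\pi$, $p_r$, $r$, and $Z$ are introduced here for illustration. The paper's own result concerns continuous-time action-driven processes, which this simplified version does not capture.

```latex
\begin{align*}
q_\pi(\tau) &= p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \\
p_r(\tau)   &= \frac{1}{Z}\, p(s_0) \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t)\, \exp\!\big(r(s_t, a_t)\big), \\
D_{\mathrm{KL}}\!\left(q_\pi \,\|\, p_r\right)
  &= \mathbb{E}_{q_\pi}\!\left[\sum_{t=0}^{T-1} \big(\log \pi(a_t \mid s_t) - r(s_t, a_t)\big)\right] + \log Z \\
  &= -\,\mathbb{E}_{q_\pi}\!\left[\sum_{t=0}^{T-1} \Big(r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right] + \log Z .
\end{align*}
```

The initial-state and transition terms cancel inside the logarithm of the ratio, and $\log Z$ does not depend on $\pi$, so minimizing the divergence over $\pi$ is the same as maximizing expected reward plus policy entropy, which is the maximum entropy reinforcement learning objective.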

中文标题/摘要

标题：基于行动的连续时间控制过程

强化学习的核心在于行动——对环境观察做出的决策。行动在随机过程建模中同样至关重要，因为它们触发状态的不连续转换，并使信息在大型复杂系统中流动。在本文中，我们通过基于行动的过程统一了随机过程和强化学习的视角，并展示了其在脉冲神经网络中的应用。借鉴控制即推理的思想，我们证明了对于适当定义的基于行动的过程，最小化由策略驱动的真实分布与由奖励驱动的模型分布之间的Kullback-Leibler散度等价于最大熵强化学习。

CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling

Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang

First: 2025-06-24T17:30:27+00:00 · Latest: 2025-10-30T16:38:19+00:00

Comments: 39 pages, 24 figures

Abs · PDF · Code1 · Code2

Abstract

Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.

中文标题/摘要

标题：CronusVLA：通过多帧视觉-语言-动作建模实现高效稳健操作

基于预训练视觉-语言模型（VLMs）的近期视觉-语言-动作（VLA）模型在机器人操作方面表现出强大的性能。然而，这些模型仍然受限于单帧图像范式，未能充分利用多帧历史提供的时间信息，因为直接将多帧输入到VLM主干中会带来巨大的计算开销和推理延迟。我们提出了一种名为CronusVLA的统一框架，将单帧VLA模型扩展到多帧范式。CronusVLA遵循两阶段过程：（1）在大规模具身数据集上进行单帧预训练，通过自回归预测动作标记，建立有效的具身视觉-语言基础；（2）多帧后训练，将视觉-语言主干的预测从离散标记调整为可学习特征，并通过特征分块聚合历史信息。CronusVLA有效解决了多帧建模的现有挑战，同时提高了性能和观测鲁棒性。为了评估在时间和空间扰动下的鲁棒性，我们引入了SimplerEnv-OR基准，该基准包含24种观测扰动类型和120种严重程度级别。在模拟和真实环境中对三种具身形态进行的实验表明，CronusVLA取得了领先的性能和更优的鲁棒性：在SimplerEnv上的成功率为70.9%，在LIBERO上比OpenVLA提升26.8%，并在SimplerEnv-OR上获得了最高的鲁棒性评分。这些结果突显了VLA模型中高效多帧适应的潜力，有助于实现更强大、更鲁棒的实际部署。

Summary / 总结

CronusVLA is a unified framework that extends single-frame vision-language-action models to a multi-frame paradigm, addressing computational overhead and inference latency. It involves single-frame pretraining with autoregressive action token prediction and multi-frame post-training for feature learning and historical information aggregation. Experiments show CronusVLA outperforms existing models with a 70.9% success rate on SimplerEnv and a 26.8% improvement on LIBERO, demonstrating superior robustness.

CronusVLA 是一个统一框架，将单帧视觉-语言-动作模型扩展到多帧范式，解决了计算开销和延迟问题。它包括两个阶段：单帧预训练，使用自回归动作令牌预测，以及多帧后训练，适应视觉语言主干并聚合历史信息。实验表明，CronusVLA 在 SimplerEnv 上的成功率为 70.9%，在 LIBERO 上的性能提高了 26.8%，显示出增强的鲁棒性和性能。

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Authors: Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu

First: 2025-08-03T01:56:03+00:00 · Latest: 2025-10-30T16:32:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the initial answer over the refinement, indicating no further improvements. This produces sequences of increasing quality, preference-labeled responses ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
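
The refine-then-judge loop described above can be sketched as follows. The `refine` and `judge` callables are hypothetical placeholders for calls to the same LLM, and `max_iters` is a safety cap added here as an assumption; the abstract only specifies that iteration stops once the judge prefers the previous answer.

```python
def refine_n_judge(prompt, initial, refine, judge, max_iters=5):
    """Build a chain of responses of increasing judged quality.

    refine(prompt, answer) -> a candidate improved answer (placeholder for an LLM call)
    judge(prompt, old, new) -> True if `new` is preferred over `old` (placeholder)
    """
    chain = [initial]
    for _ in range(max_iters):
        candidate = refine(prompt, chain[-1])
        if not judge(prompt, chain[-1], candidate):
            break  # the judge prefers the previous answer: stop refining
        chain.append(candidate)
    return chain


def preference_pairs(chain):
    # Adjacent responses give (chosen, rejected) pairs for preference tuning.
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]


if __name__ == "__main__":
    # Toy stand-ins: "refining" appends a character, and the judge accepts
    # candidates of up to three characters.
    toy_refine = lambda prompt, answer: answer + "!"
    toy_judge = lambda prompt, old, new: len(new) <= 3
    chain = refine_n_judge("q", "a", toy_refine, toy_judge)
    print(chain, preference_pairs(chain))
```

Because every accepted step was explicitly preferred by the judge, each adjacent pair in the chain carries a preference label without extra annotation.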

中文标题/摘要

标题：Refine-n-Judge：为LLM微调精炼高质量偏好链

大型语言模型（LLMs）通过基于偏好的微调取得了显著进展，这在很大程度上依赖于训练数据的质量。虽然人类反馈对于提高数据质量至关重要，但成本高昂且难以扩展。在本文中，我们介绍了Refine-n-Judge，这是一种自动迭代方法，利用单一LLM作为精炼者和评判者来提升数据集质量。与现有的迭代精炼方法不同，Refine-n-Judge 使用LLM生成精炼并明确评估每次改进，确保每次迭代都能实质性地提升数据集，而无需额外的人工注释或单独的奖励模型。在每一步中，LLM 精炼一个响应并判断该精炼是否优于之前的答案。这一过程将持续进行，直到LLM更喜欢初始答案而非精炼，表明没有进一步改进的空间。这产生了质量逐步提升、带有偏好评价的响应序列，非常适合微调。我们展示了Refine-n-Judge在涵盖五个语料库的多种公开数据集上的有效性，针对编码、数学和对话等任务。使用Refine-n-Judge增强的数据集微调的模型（Llama 3.1-8B和Llama 3.3-70B）在超过74%的比较中被LLM评判者更偏好，与使用GPT-4微调原始数据集的模型相比。此外，我们还报告了性能提升：在AlpacaEval和AlpacaEval 2.0上分别提高了5%，在MT-Bench上提高了19%。我们的结果表明，Refine-n-Judge生成了高质量的数据集并实现了可扩展的模型改进。

Summary / 总结

The paper introduces Refine-n-Judge, an automated iterative method that uses a single LLM to both refine and judge responses, improving dataset quality for LLM fine-tuning. This approach enhances dataset quality without additional human annotation or a separate reward model. Experiments across various datasets showed that models fine-tuned with Refine-n-Judge were preferred in over 74% of comparisons against models fine-tuned on original datasets, and achieved performance gains of up to 19% on MT-Bench.

论文提出了一种名为Refine-n-Judge的自动化迭代方法，利用单一LLM进行响应的改进和评判，以提高数据集质量用于LLM微调。该方法无需额外的人工注释或独立的奖励模型即可提升数据集质量。实验结果显示，使用Refine-n-Judge增强的数据集微调的模型在超过74%的比较中优于使用原始数据集微调的模型，并在MT-Bench上实现了高达19%的性能提升。

BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI

Authors: Alya Almsouti, Ainur Khamitova, Darya Taratynova, Mohammad Yaqub

First: 2025-10-30T16:29:09+00:00 · Latest: 2025-10-30T16:29:09+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Our experiments show that no single architecture performs best across all artifact types, emphasizing the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.

中文标题/摘要

标题：BRIQA：儿科脑部MRI图像质量评估中的平衡重权

评估儿科脑部磁共振成像（MRI）中的伪影严重程度对于诊断准确性至关重要，特别是在低场系统中，信噪比降低。手动质量评估耗时且主观，推动了稳健的自动化解决方案的需求。在本工作中，我们提出了BRIQA（平衡重权图像质量评估），以解决伪影严重程度的类别不平衡问题。BRIQA 使用基于梯度的损失重权来动态调整各类别的贡献，并采用旋转批处理方案以确保模型持续接触到代表性不足的类别。实验表明，没有单一架构在所有伪影类型中表现最佳，这强调了架构多样性的重要性。旋转批处理配置与交叉熵损失结合使用时，通过促进平衡学习提高了各项指标的性能。BRIQA 将平均宏F1分数从0.659提高到0.706，在噪声（0.430）、拉链（Zipper，0.098）、定位（0.097）、对比度（0.217）、运动（0.022）和条纹（0.012）伪影严重程度分类方面取得了显著进步。代码可在 https://github.com/BioMedIA-MBZUAI/BRIQA 获取。

CompoST: A Benchmark for Analyzing the Ability of LLMs To Compositionally Interpret Questions in a QALD Setting

Authors: David Maria Schmidt, Raoul Schubert, Philipp Cimiano

First: 2025-07-28T18:20:41+00:00 · Latest: 2025-10-30T16:25:15+00:00

Comments: Research Track, 24th International Semantic Web Conference (ISWC 2025), November 2-6, 2025, Nara, Japan

Abs · PDF · Code1 · Code2

Abstract

Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts. Large language models possess remarkable language interpretation capabilities and have been successfully applied to interpret questions by mapping them to SPARQL queries. An open question is how systematic this interpretation process is. Toward this question, in this paper, we propose a benchmark for investigating to what extent the abilities of LLMs to interpret questions are actually compositional. For this, we generate three datasets of varying difficulty based on graph patterns in DBpedia, relying on Lemon lexica for verbalization. Our datasets are created in a very controlled fashion in order to test the ability of LLMs to interpret structurally complex questions, given that they have seen the atomic building blocks. This allows us to evaluate to what degree LLMs are able to interpret complex questions for which they "understand" the atomic parts. We conduct experiments with models of different sizes using both various prompt and few-shot optimization techniques as well as fine-tuning. Our results show that performance in terms of macro $F_1$ degrades from $0.45$ over $0.26$ down to $0.09$ with increasing deviation from the samples optimized on. Even when all necessary information was provided to the model in the input, the $F_1$ scores do not exceed $0.57$ for the dataset of lowest complexity. We thus conclude that LLMs struggle to systematically and compositionally interpret questions and map them into SPARQL queries.

中文标题/摘要

标题：CompoST：一种分析大语言模型在QALD环境中组合性解释问题能力的基准

语言解释是一个组合过程，在这个过程中，更复杂的语言结构的意义是从其组成部分的意义中推断出来的。大型语言模型具备显著的语言解释能力，并成功应用于将问题映射为SPARQL查询。一个悬而未决的问题是这种解释过程是否系统化。为了解决这个问题，本文提出了一种基准，用于研究大语言模型解释问题的能力是否真正具有组合性。为此，我们基于DBpedia中的图模式生成了三个不同难度的数据集，并依赖Lemon词典进行口头化。我们的数据集以非常受控的方式创建，以测试大语言模型在已见过原子构建块的情况下解释结构复杂问题的能力。这使我们能够评估大语言模型在“理解”原子部分的情况下解释复杂问题的程度。我们使用不同规模的模型进行了实验，使用了各种提示和少量优化技术以及微调。结果显示，随着与优化样本偏差的增加，宏$F_1$性能从$0.45$下降到$0.26$，再下降到$0.09$。即使在输入中提供了所有必要信息，最低复杂度数据集的$F_1$分数也不超过$0.57$。因此，我们得出结论，大语言模型在系统性和组合性地解释问题并将它们映射为SPARQL查询方面存在困难。

Summary / 总结

This paper introduces CompoST, a benchmark to evaluate the compositional interpretation abilities of large language models (LLMs) in converting questions to SPARQL queries. The authors generate three datasets with varying complexity based on DBpedia graph patterns and Lemon lexica. Experiments with models of different sizes using various techniques show that LLMs' performance in macro $F_1$ decreases as the deviation from optimized samples increases, with $F_1$ scores not exceeding 0.57 for the least complex dataset. This indicates that LLMs struggle to systematically interpret complex questions and map them into SPARQL queries.

本文提出了CompoST基准，用于评估大型语言模型（LLMs）在将问题转换为SPARQL查询时的组合解释能力。作者基于DBpedia图模式和Lemon词典生成了三个具有不同复杂度的数据集。使用不同大小的模型和各种技术进行的实验表明，随着与优化样本偏差的增加，LLMs的宏$F_1$分数下降，对于最简单的数据集，$F_1$分数不超过0.57。这表明LLMs在系统地解释复杂问题并将它们映射为SPARQL查询方面存在困难。
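
The macro $F_1$ figures quoted above can be made concrete with a small sketch. It assumes the common question-answering convention in which $F_1$ is computed per question between the predicted and gold answer sets and then averaged over questions; the abstract does not spell out the exact protocol, and the function names and example resources below are hypothetical.

```python
from typing import Iterable, Set, Tuple


def answer_set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between the predicted and gold answer sets of one question."""
    if not predicted and not gold:
        return 1.0  # assumption: two empty answer sets count as a perfect match
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(pairs: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Unweighted mean of the per-question F1 scores."""
    scores = [answer_set_f1(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical query results: one exact match, one partial match, one miss.
example_pairs = [
    ({"dbr:Berlin"}, {"dbr:Berlin"}),
    ({"dbr:Berlin"}, {"dbr:Berlin", "dbr:Hamburg"}),
    ({"dbr:Paris"}, {"dbr:Rome"}),
]
example_score = macro_f1(example_pairs)
```

Because every question contributes equally, a handful of completely wrong queries pulls the macro average down sharply, which is one reason scores such as $0.09$ can arise even when some questions are answered perfectly.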

The Era of Agentic Organization: Learning to Organize with Language Models

Authors: Zewen Chi, Li Dong, Qingxiu Dong, Yaru Hao, Xun Wu, Shaohan Huang, Furu Wei

First: 2025-10-30T16:25:10+00:00 · Latest: 2025-10-30T16:25:10+00:00

Abs · PDF · Code1 · Code2

Abstract

We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.

中文标题/摘要

标题：自主组织时代：学习利用语言模型进行组织

我们设想了一个新的AI时代，称为自主组织时代，在这个时代中，代理通过协作和并行工作解决复杂问题，从而实现超越个体智能的结果。为了实现这一愿景，我们引入了异步思考（AsyncThink）作为与大规模语言模型推理的新范式，将内部思考过程组织成可并发执行的结构。具体来说，我们提出了一种思考协议，其中组织者动态分配子查询给工人，合并中间知识，并生成连贯的解决方案。更重要的是，该协议中的思考结构可以通过强化学习进一步优化。实验表明，与并行思考相比，AsyncThink 的推理延迟降低了28%，同时在数学推理方面提高了准确性。此外，AsyncThink 能够将其学习到的异步思考能力泛化，有效应对未见过的任务而无需额外训练。

Summary / 总结

The paper introduces the concept of agentic organization, where AI agents collaborate to solve complex problems. It proposes AsyncThink, a new paradigm for reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Experiments show that AsyncThink reduces inference latency by 28% compared to parallel thinking while improving accuracy in mathematical reasoning. Additionally, AsyncThink demonstrates the ability to generalize its learned capabilities to tackle unseen tasks without further training.

论文提出了代理组织的概念，即AI代理协作解决复杂问题。它提出了AsyncThink，一种新的大型语言模型推理范式，将内部思考组织成可并发执行的结构。实验表明，AsyncThink相比并行思考将推理延迟降低了28%，同时在数学推理方面提高了准确性。此外，AsyncThink能够泛化其学习到的异步思考能力，无需额外训练即可处理未见过的任务。
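
The organizer and worker roles described above follow a fork-join shape, which the sketch below illustrates with Python's `asyncio`. It is only a structural illustration: in AsyncThink the organizer and workers are language models and the fork/join decisions are learned, whereas here `worker` is a hypothetical stub and the sub-queries are given by hand.

```python
import asyncio
from typing import List


async def worker(sub_query: str) -> str:
    """Hypothetical worker: stands in for a model call on one sub-query."""
    await asyncio.sleep(0.01)  # simulated latency of the call
    return f"answer({sub_query})"


async def organizer(query: str, sub_queries: List[str]) -> str:
    """Fork sub-queries to concurrent workers, then join and merge the results."""
    tasks = [asyncio.create_task(worker(q)) for q in sub_queries]  # fork
    partial_answers = await asyncio.gather(*tasks)  # join, order preserved
    return f"{query}: " + "; ".join(partial_answers)  # merge


def run_demo() -> str:
    return asyncio.run(organizer("total", ["part A", "part B"]))
```

Since the workers run concurrently, the wall-clock time of `run_demo()` is close to that of the slowest worker rather than the sum of all workers, which is the kind of latency saving the abstract refers to.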

Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems

Authors: Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy

First: 2025-10-30T16:23:46+00:00 · Latest: 2025-10-30T16:23:46+00:00

Abs · PDF · Code1 · Code2

Abstract

In robotics, likelihood-free inference (LFI) can provide the domain distribution that adapts a learnt agent in a parametric set of deployment conditions. LFI assumes an arbitrary support for sampling, which remains constant as the initial generic prior is iteratively refined to more descriptive posteriors. However, a potentially misspecified support can lead to suboptimal, yet falsely certain, posteriors. To address this issue, we propose three heuristic LFI variants: EDGE, MODE, and CENTRE. Each interprets the posterior mode shift over inference steps in its own way and, when integrated into an LFI step, adapts the support alongside posterior inference. We first expose the support misspecification issue and evaluate our heuristics using stochastic dynamical benchmarks. We then evaluate the impact of heuristic support adaptation on parameter inference and policy learning for a dynamic deformable linear object (DLO) manipulation task. Inference results in a finer length and stiffness classification for a parametric set of DLOs. When the resulting posteriors are used as domain distributions for sim-based policy learning, they lead to more robust object-centric agent performance.

中文标题/摘要

标题：潜在指定错误领域支持的启发式适应：随机动力学系统中无似然推断的适应

在机器人学中，无似然推断（LFI）可以提供适应学习代理在参数部署条件集中的领域分布。LFI 假设一个任意的支持用于采样，该支持在整个初始通用先验逐步细化为更具描述性的后验过程中保持不变。然而，潜在指定错误的支持可能导致次优但错误自信的后验。为了解决这一问题，我们提出了三种启发式LFI变体：EDGE、MODE和CENTRE。每种变体以自己的方式解释推理步骤中的后验模式偏移，并在LFI步骤中将支持与后验推断一起进行适应。我们首先揭示了支持指定错误的问题，并使用随机动力学基准评估我们的启发式方法。然后，我们评估启发式支持适应对动态可变形线性对象（DLO）操作任务中的参数推断和策略学习的影响。对于参数化的DLO集合，推断结果提供了更精细的长度和刚度分类。当使用这些后验作为基于模拟的策略学习的领域分布时，它们会导致更稳健的对象中心代理性能。

Summary / 总结

The paper addresses the issue of potentially misspecified support in likelihood-free inference (LFI) for robotics, which can lead to suboptimal and falsely certain posteriors. To tackle this, three heuristic LFI methods—EDGE, MODE, and CENTRE—are proposed. These methods adapt the support during inference by interpreting the posterior mode shift differently. The effectiveness of these heuristics is evaluated using stochastic dynamical benchmarks and a dynamic deformable linear object manipulation task, resulting in improved parameter inference and more robust agent performance.

论文解决了机器人领域中潜在的采样支持不准确问题，这可能导致次优且虚假确定的后验分布。为此，提出了三种启发式LFI方法——EDGE、MODE和CENTRE，每种方法以不同方式解释后验模式的变化，并在推理过程中调整支持。这些方法在随机动力学基准和动态可变形线性物体操作任务中进行了评估，显示出改进的参数推理和更稳健的物体中心代理性能。
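
The idea of adapting the sampling support as the posterior mode moves can be sketched in one dimension. The rule below is a hypothetical illustration only and is not the EDGE, MODE, or CENTRE heuristic, whose definitions the abstract does not give: it assumes a uniform support `[low, high]` and widens the side that the posterior mode is pressing against. The function name and both fractions are assumptions.

```python
from typing import Tuple


def adapt_support(
    support: Tuple[float, float],
    posterior_mode: float,
    margin_fraction: float = 0.1,  # hypothetical: how close to an edge counts as "near"
    expand_fraction: float = 0.2,  # hypothetical: how much to widen, relative to width
) -> Tuple[float, float]:
    """Widen a 1-D support on the side where the posterior mode sits near the edge."""
    low, high = support
    width = high - low
    margin = margin_fraction * width
    if posterior_mode - low < margin:
        low -= expand_fraction * width  # mode hugs the lower edge: extend downwards
    if high - posterior_mode < margin:
        high += expand_fraction * width  # mode hugs the upper edge: extend upwards
    return low, high
```

For example, `adapt_support((0.0, 10.0), 9.8)` extends the upper bound, while a mode of `5.0` leaves the support unchanged. A mode that keeps landing on a boundary is a hint that the true parameter may lie outside the initial support.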

Towards Reliable Sea Ice Drift Estimation in the Arctic: Deep Learning Optical Flow on RADARSAT-2

Authors: Daniela Martin, Joseph Gallego

First: 2025-10-30T16:20:28+00:00 · Latest: 2025-10-30T16:20:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel-wise motion between consecutive images, has advanced rapidly, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep-learning-based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer accuracy (EPE of 6 to 8 pixels, or 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep-learning-based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.

中文标题/摘要

标题：朝可靠的北极海冰漂移估计迈进——基于RADARSAT-2扫描SAR图像的深度学习光流

准确估计海冰漂移对于北极航行、气候研究和操作性预报至关重要。尽管光流，一种计算机视觉技术，用于估计连续图像之间的像素级运动，已经在计算机视觉领域取得了快速进展，但其在地球物理问题和卫星SAR图像中的应用仍处于探索阶段。经典光流方法依赖于数学模型和对运动的强假设，这限制了它们在复杂场景中的准确性。最近基于深度学习的方法显著提高了性能，已成为计算机视觉的标准，因此它们被应用于海冰漂移估计。我们首次在RADARSAT 2扫描SAR海冰图像上对48种深度学习光流模型进行了大规模基准测试，使用端点误差（EPE）和Fl所有指标与GNSS跟踪浮标进行评估。多种模型实现了亚千米精度（EPE 6到8像素，300到400米），相对于海冰运动的空间尺度和北极航行的典型要求，这是一个较小的误差。我们的结果表明，这些模型能够捕捉到一致的区域漂移模式，并且最近基于深度学习的光流方法，其运动估计准确性显著优于经典方法，可以有效地应用于极地遥感。光流产生连续的空间漂移场，为每个图像像素提供运动估计，而不是仅在稀疏的浮标位置，为航行和气候建模提供了新的机会。

Summary / 总结

The research aims to improve the accuracy of sea ice drift estimation in the Arctic using deep learning optical flow on RADARSAT-2 imagery. The study evaluates 48 deep learning optical flow models against GNSS-tracked buoys and finds that several models achieve sub-kilometer accuracy, demonstrating the capability to capture consistent regional drift patterns. This method offers spatially continuous drift fields, enhancing navigation and climate modeling in the Arctic.

研究旨在利用RADARSAT-2图像上的深度学习光流技术提高北极海冰漂移估计的准确性。研究评估了48种深度学习光流模型，并与GNSS跟踪的浮标进行对比，发现多个模型实现了亚千米级的精度，展示了捕捉一致的区域漂移模式的能力。该方法提供了空间连续的漂移场，增强了北极航行和气候建模。
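
The endpoint error (EPE) reported above is the mean Euclidean distance between predicted and reference displacement vectors. The sketch below computes it in pixels and converts it to metres. The pixel spacing of 50 m is an assumption inferred from the abstract's own figures (6 to 8 pixels corresponding to 300 to 400 m) and is not stated explicitly; the function and constant names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Tuple[float, float]

# Assumption: about 50 m per pixel, inferred from "6 to 8 pixels, 300 to 400 m".
METERS_PER_PIXEL = 50.0


def endpoint_error(predicted: Sequence[Vector], reference: Sequence[Vector]) -> float:
    """Mean Euclidean distance (in pixels) between paired displacement vectors."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(
        math.hypot(pu - ru, pv - rv)
        for (pu, pv), (ru, rv) in zip(predicted, reference)
    )
    return total / len(predicted)


# Hypothetical drift vectors (pixels) at two buoy locations.
flow_at_buoys = [(3.0, 4.0), (1.0, 1.0)]
buoy_drift = [(0.0, 0.0), (1.0, 1.0)]
epe_pixels = endpoint_error(flow_at_buoys, buoy_drift)
epe_meters = epe_pixels * METERS_PER_PIXEL
```

In a buoy-based evaluation the dense flow field would first be sampled at each buoy's image position, so the comparison covers only those sparse locations even though the model predicts motion at every pixel.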

Hybrid DQN-TD3 Reinforcement Learning for Autonomous Navigation in Dynamic Environments

Authors: Xiaoyi He, Danggui Chen, Zhenshuo Zhang, Zimeng Bai

First: 2025-10-30T16:12:01+00:00 · Latest: 2025-10-30T16:12:01+00:00

Comments: 6 pages, 5 figures; ROS+Gazebo (TurtleBot3) implementation; evaluation with PathBench metrics; code (primary): https://github.com/MayaCHEN-github/HierarchicalRL-robot-navigation; mirror (for reproducibility): https://github.com/ShowyHe/DRL-robot-navigation

Abs · PDF · Code1 · Code2 · Code3

Abstract

This paper presents a hierarchical path-planning and control framework that combines a high-level Deep Q-Network (DQN) for discrete sub-goal selection with a low-level Twin Delayed Deep Deterministic Policy Gradient (TD3) controller for continuous actuation. The high-level module selects behaviors and sub-goals; the low-level module executes smooth velocity commands. We design a practical reward shaping scheme (direction, distance, obstacle avoidance, action smoothness, collision penalty, time penalty, and progress), together with a LiDAR-based safety gate that prevents unsafe motions. The system is implemented in ROS + Gazebo (TurtleBot3) and evaluated with PathBench metrics, including success rate, collision rate, path efficiency, and re-planning efficiency, in dynamic and partially observable environments. Experiments show improved success rate and sample efficiency over single-algorithm baselines (DQN or TD3 alone) and rule-based planners, with better generalization to unseen obstacle configurations and reduced abrupt control changes. Code and evaluation scripts are available at the project repository.

中文标题/摘要

标题：混合DQN-TD3强化学习在动态环境中的自主导航

本文提出了一种层次化的路径规划与控制框架，该框架结合了高层的深度Q网络（DQN）进行离散子目标选择和低层的双延迟深度确定性策略梯度（TD3）控制器进行连续动作执行。高层模块选择行为和子目标；低层模块执行平滑的速度命令。我们设计了一种实用的奖励塑造方案（方向、距离、障碍物避免、动作平滑性、碰撞惩罚、时间惩罚和进度），并结合了基于LiDAR的安全门，以防止不安全的运动。该系统在ROS + Gazebo（TurtleBot3）中实现，并在动态和部分可观测环境中使用PathBench指标进行评估，包括成功率、碰撞率、路径效率和重新规划效率。实验表明，与单一算法基线（仅DQN或仅TD3）和基于规则的规划器相比，该系统的成功率和样本效率均有提升，对未见过的障碍配置的泛化能力更强，且突变的控制变化更少。代码和评估脚本可在项目仓库中获得。

Summary / 总结

This paper introduces a hierarchical reinforcement learning framework combining DQN for high-level sub-goal selection and TD3 for low-level actuation. It uses a practical reward shaping scheme and a LiDAR-based safety gate. Experiments in dynamic environments show improved success rate and sample efficiency compared to single-algorithm baselines and rule-based planners, with better generalization to unseen obstacles and reduced abrupt control changes. Evaluation metrics include success rate, collision rate, path efficiency, and re-planning efficiency.

本文提出了一种结合高阶DQN进行离散子目标选择和低阶TD3进行连续动作控制的层次化强化学习框架。使用奖励塑造方案和基于LiDAR的安全门来确保安全导航。在动态和部分可观测环境中的实验表明，与单算法基线和基于规则的规划器相比，该方法在成功率和样本效率方面有所提高，具有更好的对未见过的障碍物的泛化能力和减少的突变控制变化。评估指标包括成功率、碰撞率、路径效率和重新规划效率。
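
The reward terms and the LiDAR-based safety gate listed above can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the abstract names the terms but gives no formulas, so every weight, threshold, and field name here is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StepInfo:
    heading_error: float      # rad, angle between robot heading and goal direction
    dist_to_goal: float       # m, after the step
    prev_dist_to_goal: float  # m, before the step
    min_obstacle_dist: float  # m, smallest LiDAR range
    action_delta: float       # magnitude of change from the previous action
    collided: bool


def shaped_reward(step: StepInfo) -> float:
    """Sum of the seven shaping terms; all weights are hypothetical."""
    direction = 0.5 * math.cos(step.heading_error)
    distance = -0.1 * step.dist_to_goal
    progress = 5.0 * (step.prev_dist_to_goal - step.dist_to_goal)
    obstacle = -1.0 * max(0.0, 0.5 - step.min_obstacle_dist)  # penalised inside 0.5 m
    smoothness = -0.2 * step.action_delta
    time_penalty = -0.01
    collision = -10.0 if step.collided else 0.0
    return (direction + distance + progress + obstacle
            + smoothness + time_penalty + collision)


def safety_gate(linear_velocity: float, min_lidar_range: float,
                stop_distance: float = 0.3) -> float:
    """Block forward motion when the nearest LiDAR return is too close."""
    if min_lidar_range < stop_distance and linear_velocity > 0.0:
        return 0.0
    return linear_velocity
```

Keeping the gate outside the learned policy means unsafe forward commands are blocked regardless of what the policy outputs, while the shaping terms only guide learning.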

Curly Flow Matching for Learning Non-gradient Field Dynamics

Authors: Katarina Petrović, Lazar Atanackovic, Viggo Moro, Kacper Kapuśniak, İsmail İlkan Ceylan, Michael Bronstein, Avishek Joey Bose, Alexander Tong

Venue: NeurIPS 2025

First: 2025-10-30T16:11:39+00:00 · Latest: 2025-10-30T16:11:39+00:00

Comments: Accepted to NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modeling the transport dynamics of natural processes from population-level observations is a ubiquitous problem in the natural sciences. Such models rely on key assumptions about the underlying process in order to enable faithful learning of governing dynamics that mimic the actual system behavior. The de facto assumption in current approaches relies on the principle of least action that results in gradient field dynamics and leads to trajectories minimizing an energy functional between two probability measures. However, many real-world systems, such as cell cycles in single-cell RNA, are known to exhibit non-gradient, periodic behavior, which fundamentally cannot be captured by current state-of-the-art methods such as flow and bridge matching. In this paper, we introduce Curly Flow Matching (Curly-FM), a novel approach that is capable of learning non-gradient field dynamics by designing and solving a Schrödinger bridge problem with a non-zero drift reference process -- in stark contrast to typical zero-drift reference processes -- which is constructed using inferred velocities in addition to population snapshot data. We showcase Curly-FM by solving the trajectory inference problems for single cells, computational fluid dynamics, and ocean currents with approximate velocities. We demonstrate that Curly-FM can learn trajectories that better match both the reference process and population marginals. Curly-FM expands flow matching models beyond the modeling of populations and towards the modeling of known periodic behavior in physical systems. Our code repository is accessible at: https://github.com/kpetrovicc/curly-flow-matching.git

中文标题/摘要

标题：卷曲流匹配学习非梯度场动力学

从群体水平的观测数据建模自然过程的传输动力学是一个在自然科学中普遍存在的问题。此类模型依赖于对底层过程的关键假设，以实现对实际系统行为的忠实学习。当前方法的默认假设是基于最小作用原理，导致梯度场动力学，并导致轨迹最小化两个概率测度之间的能量泛函。然而，许多实际系统，如单细胞RNA中的细胞周期，已知表现出非梯度、周期性行为，这从根本上无法被当前最先进的方法，如流匹配和桥匹配所捕捉。在本文中，我们引入了卷曲流匹配（Curly-FM），这是一种能够通过设计和求解具有非零漂移参考过程的薛定谔桥问题来学习非梯度场动力学的新方法——与典型的零漂移参考过程形成鲜明对比，该参考过程是使用推断出的速度和群体快照数据构建的。我们通过解决单细胞、计算流体动力学和洋流的轨迹推断问题，使用近似速度展示了卷曲流匹配。我们证明卷曲流匹配可以学习更好地匹配参考过程和群体边缘分布的轨迹。卷曲流匹配扩展了流匹配模型，从群体建模扩展到物理系统中已知的周期性行为建模。我们的代码库可在以下地址获取：https://github.com/kpetrovicc/curly-flow-matching.git

Summary / 总结

This paper addresses the challenge of modeling non-gradient field dynamics in natural processes by introducing Curly Flow Matching (Curly-FM), which uses a non-zero drift reference process to capture periodic behavior. Curly-FM improves trajectory inference for single cells, computational fluid dynamics, and ocean currents. It learns trajectories that better match both the reference process and population marginals, expanding flow matching models to periodic behavior in physical systems.

本文解决了当前方法由于依赖梯度场假设而无法捕捉非梯度场动力学的挑战。作者提出了Curly Flow Matching (Curly-FM) 方法，该方法使用非零漂移参考过程来学习非梯度动力学。Curly-FM 在单细胞轨迹推断、计算流体动力学和洋流等应用中表现出色，能够更好地匹配参考过程和群体边缘分布。
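
The distinction between gradient and non-gradient dynamics above can be checked numerically on a toy example. The sketch below is not the Curly-FM algorithm; it only illustrates the underlying fact that a gradient field has zero curl, so a purely rotational (periodic) velocity field cannot be written as the gradient of a potential. The helper names and test point are hypothetical.

```python
from typing import Callable, Tuple

Field = Callable[[float, float], Tuple[float, float]]


def curl_2d(field: Field, x: float, y: float, h: float = 1e-5) -> float:
    """Scalar curl d(v_y)/dx - d(v_x)/dy, estimated by central differences."""
    dvy_dx = (field(x + h, y)[1] - field(x - h, y)[1]) / (2 * h)
    dvx_dy = (field(x, y + h)[0] - field(x, y - h)[0]) / (2 * h)
    return dvy_dx - dvx_dy


def rotation_field(x: float, y: float) -> Tuple[float, float]:
    """Circular motion around the origin: a non-gradient field."""
    return (-y, x)


def gradient_field(x: float, y: float) -> Tuple[float, float]:
    """Gradient of the potential (x**2 + y**2) / 2."""
    return (x, y)


curl_rotation = curl_2d(rotation_field, 0.3, -0.7)
curl_gradient = curl_2d(gradient_field, 0.3, -0.7)
```

The rotational field has curl 2 everywhere while the gradient field has curl 0, which is why methods restricted to gradient dynamics cannot reproduce cyclic trajectories such as a cell cycle.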