arXiv Paper Digest

TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

Venue: Transactions on Machine Learning Research, 2025

First: 2025-05-29T17:59:59+00:00 · Latest: 2025-11-06T18:59:57+00:00

Comments: Published in TMLR, with a J2C Certification

Abstract

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
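As a rough illustration of what a text-aligned region token could look like, the sketch below averages patch features inside each segmentation mask and labels each region by cosine similarity to text embeddings. This is a simplified stand-in, not the authors' pooling scheme: the function names, array shapes, and plain mask averaging are all assumptions made for illustration, and in practice the patch features would come from a frozen image-text model and the masks from SAM2.

```python
import numpy as np

def region_tokens(patch_feats, masks):
    """Average patch features inside each mask (hypothetical helper).

    patch_feats: (H, W, D) array of patch features from an image-text model.
    masks:       (R, H, W) boolean array, one mask per region.
    Returns an (R, D) array with one token per region.
    """
    num_regions = masks.shape[0]
    flat_feats = patch_feats.reshape(-1, patch_feats.shape[-1])            # (H*W, D)
    flat_masks = masks.reshape(num_regions, -1).astype(patch_feats.dtype)  # (R, H*W)
    # Guard against empty masks so the division is always defined.
    counts = np.maximum(flat_masks.sum(axis=1, keepdims=True), 1.0)
    return flat_masks @ flat_feats / counts

def classify_regions(tokens, text_embeds):
    """Assign each region the index of the most similar text embedding."""
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    e = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    return np.argmax(t @ e.T, axis=1)  # cosine similarity, best class per region
```

Because the region tokens live in the same space as the text embeddings, swapping the class list changes the vocabulary without any retraining, which is the open-vocabulary property the abstract emphasizes.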

Summary

The research aims to improve detailed visual understanding by combining the strengths of image-text models and segmentation models like SAM2. The proposed TextRegion framework generates text-aligned region tokens, which enable detailed visual understanding while maintaining open-vocabulary capabilities. Extensive evaluations show that TextRegion achieves superior or competitive performance in various downstream tasks such as open-world semantic segmentation, referring expression comprehension, and grounding compared to state-of-the-art training-free methods.

GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

Authors: Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao, C. Karen Liu

First: 2025-11-06T18:59:33+00:00 · Latest: 2025-11-06T18:59:33+00:00

Comments: Home page: https://gentle-humanoid.axell.top

Abstract

Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.
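The spring-based idea with an adjustable force threshold can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's formulation: the function name, the linear spring law, and the norm-based clipping are hypothetical choices, and the anchor stands in for either a surface contact point (resistive contact) or a target sampled from motion data (guiding contact).

```python
import numpy as np

def spring_force(anchor, position, stiffness, max_force):
    """Linear spring pulling a link toward an anchor, capped in magnitude.

    anchor, position: 3D points (array-like), in metres.
    stiffness:        spring constant in N/m (hypothetical parameter).
    max_force:        task-adjustable force threshold in N.
    """
    force = stiffness * (np.asarray(anchor, dtype=float) - np.asarray(position, dtype=float))
    magnitude = np.linalg.norm(force)
    if magnitude > max_force:
        # Rescale so the direction is preserved but the norm equals the threshold.
        force = force * (max_force / magnitude)
    return force
```

A lower `max_force` would suit a task such as gentle hugging, while a higher value would suit sit-to-stand assistance; the specific values are left open here because the abstract does not state them.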

Summary

GentleHumanoid is a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance in human and object interaction. It uses a unified spring-based formulation to model resistive and guiding contacts, ensuring kinematically consistent forces across the shoulder, elbow, and wrist. Experimental results show that the policy reduces peak contact forces while maintaining task success in various scenarios, including gentle hugging and object manipulation, leading to smoother and more natural interactions compared to baselines.

Residual Kolmogorov-Arnold Network for Enhanced Deep Learning

Authors: Ray Congrui Yu, Sherry Wu, Jiang Gui

First: 2024-10-07T21:12:32+00:00 · Latest: 2025-11-06T18:59:32+00:00

Comments: Code is available at https://github.com/withray/residualKAN.git

Abstract

Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.

Summary

The paper addresses the challenges of optimizing and training deep convolutional neural networks (CNNs) due to their large number of layers and linear nature. To overcome these issues, the authors introduce the Residual Kolmogorov-Arnold Network (RKAN), which integrates polynomial feature transformations into existing CNNs. The RKAN module enhances the network's ability to learn meaningful patterns, leading to improved performance across various vision tasks and benchmarks.

Tracking and Understanding Object Transformations

Authors: Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan

Venue: NeurIPS 2025

First: 2025-11-06T18:59:30+00:00 · Latest: 2025-11-06T18:59:30+00:00

Comments: NeurIPS 2025

Abstract

Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.

Summary

The paper addresses the challenge of tracking objects through significant transformations, such as an apple being cut or a butterfly emerging. It introduces the Track Any State task and a new benchmark dataset, VOST-TAS. The proposed TubeletGraph system identifies and integrates potentially overlooked tracks based on semantic and proximity priors, and generates a state graph to describe object transformations. TubeletGraph outperforms existing methods in tracking and demonstrates deeper understanding of object transformations and capabilities in temporal grounding and semantic reasoning.

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Authors: Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, Zehuan Yuan

Venue: NeurIPS 2025 Oral

First: 2025-11-06T18:58:03+00:00 · Latest: 2025-11-06T18:58:03+00:00

Comments: NeurIPS 2025 Oral

Abstract

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos

Authors: Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li

First: 2025-06-18T17:59:38+00:00 · Latest: 2025-11-06T18:57:44+00:00

Comments: Project page: https://kywind.github.io/pgnd

Abstract

Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects (such as ropes, cloths, stuffed animals, and paper bags) from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd.

Summary

This paper addresses the challenge of modeling the dynamics of deformable objects by proposing a particle-grid neural dynamics framework. The model combines object particles and spatial grids to capture global shape and motion information, and it predicts dense particle movements to handle varied shapes and materials. Experiments show that the model can learn from sparse RGB-D recordings and generalize to unseen instances, outperforming existing simulators in scenarios with limited camera views. The learned models are also useful for model-based planning and goal-conditioned object manipulation in various tasks.

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Authors: Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

First: 2025-11-06T18:56:30+00:00 · Latest: 2025-11-06T18:56:30+00:00

Abstract

Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.
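The noise-gating rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `human_prob` is a hypothetical classifier callable returning the probability that a noisy action came from a human, the 0.6 threshold is an arbitrary placeholder, and the standard forward-diffusion form with a cumulative schedule `alpha_bars` is assumed.

```python
import numpy as np

def add_noise(action, alpha_bar_t, rng):
    """Standard forward diffusion: scale the clean action and add Gaussian noise."""
    noise = rng.standard_normal(action.shape)
    return np.sqrt(alpha_bar_t) * action + np.sqrt(1.0 - alpha_bar_t) * noise

def min_training_step(action, alpha_bars, human_prob, rng, threshold=0.6):
    """Smallest diffusion step at which the classifier can no longer flag the action.

    alpha_bars: cumulative noise schedule, decreasing with the step index.
    human_prob: hypothetical callable (noisy_action, step) -> probability in [0, 1].
    Returns the step index, or None if the action is always identified as human.
    """
    for t, alpha_bar_t in enumerate(alpha_bars):
        noisy = add_noise(action, alpha_bar_t, rng)
        if human_prob(noisy, t) <= threshold:
            return t
    return None
```

Under this reading, a human action whose gate is step `t` would supervise denoising only at steps `t` and above, while robot actions would supervise every step, including the low-noise ones that determine fine-grained execution.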

Summary

The research aims to leverage human demonstrations for robot learning despite the embodiment mismatch between humans and robots. The method exploits the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. X-Diffusion incorporates a human action into policy training only at noise levels where a classifier can no longer tell whether a human or a robot executed it. Across five manipulation tasks, it achieves a 16% higher average success rate than the best baseline, whereas naive co-training under execution mismatch degrades policy performance.

Cambrian-S: Towards Spatial Supersensing in Video

Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

First: 2025-11-06T18:55:17+00:00 · Latest: 2025-11-06T18:55:17+00:00

Comments: Website: https://cambrian-mllm.github.io/

Abstract

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
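The surprise-driven segmentation idea can be illustrated with a small sketch. It is a toy version under stated assumptions, not the paper's system: `predict_next` is a hypothetical stand-in for a learned next-latent-frame predictor, surprise is measured as the Euclidean norm of the prediction error, and the fixed threshold is an illustrative choice.

```python
import numpy as np

def segment_by_surprise(latents, predict_next, threshold):
    """Return the frame indices where prediction error marks a new event.

    latents:      sequence of latent frame vectors, shape (T, D).
    predict_next: hypothetical callable mapping the latent at t-1 to a guess for t.
    threshold:    surprise level above which an event boundary is declared.
    """
    boundaries = []
    for t in range(1, len(latents)):
        # Surprise is the distance between the observed and predicted latent.
        surprise = float(np.linalg.norm(latents[t] - predict_next(latents[t - 1])))
        if surprise > threshold:
            boundaries.append(t)
    return boundaries
```

With an identity predictor, a slowly drifting sequence produces no boundaries and an abrupt jump produces one, which is the behavior a memory module could use to decide which frames to keep or compress.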

Summary

The research aims to advance spatial supersensing in video by proposing a paradigm beyond linguistic-only understanding, organized into four stages: semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. The study introduces VSI-SUPER, a benchmark comprising the VSR and VSC tasks, which require arbitrarily long video inputs yet resist brute-force context expansion. Training Cambrian-S on the curated VSI-590K dataset yields a +30% absolute improvement on VSI-Bench without compromising general capabilities. However, performance on VSI-SUPER remains limited, suggesting that scale alone is insufficient. As a path forward, the study proposes predictive sensing: a proof-of-concept next-latent-frame predictor that uses surprise (prediction error) to drive memory and event segmentation substantially outperforms leading proprietary baselines on VSI-SUPER.

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Authors: Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

First: 2025-11-06T18:53:31+00:00 · Latest: 2025-11-06T18:53:31+00:00

Comments: Project page: https://ellisbrown.github.io/sims-v

Abstract

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

Summary

The research aims to enhance multimodal language models' spatial reasoning capabilities with SIMS-V, a data-generation framework that uses the privileged information of 3D simulators to create spatially-rich video training data. The study identifies a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that are most effective for developing transferable spatial intelligence. A 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms a larger 72B baseline and is competitive with proprietary models on real-world spatial reasoning benchmarks, while maintaining general video understanding performance and showing substantial improvements on embodied and real-world spatial tasks.

Forgetting is Everywhere

Authors: Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

First: 2025-11-06T18:52:57+00:00 · Latest: 2025-11-06T18:52:57+00:00

Comments: Project page: https://ben-sanati.github.io/forgetting-is-everywhere-project/

Abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

Summary

The paper addresses the challenge of forgetting in learning algorithms, proposing a theory that characterizes forgetting as a lack of self-consistency in a learner's predictive distribution. This theory provides a general measure of an algorithm's tendency to forget and is validated through experiments across various learning settings, including classification, regression, generative modeling, and reinforcement learning. The results show that forgetting is prevalent in all learning scenarios and significantly impacts learning efficiency, offering a principled understanding of the issue and a foundation for improving information retention in general learning algorithms.

Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions

Authors: Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li

First: 2025-11-06T18:52:08+00:00 · Latest: 2025-11-06T18:52:08+00:00

Comments: Website: https://real2sim-eval.github.io/

Abstract

Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/

CREA: A Collaborative Multi-Agent Framework for Creative Image Editing and Generation

Authors: Kavana Venkatesh, Connor Dunlop, Pinar Yanardag

Venue: NeurIPS

First: 2025-04-07T17:59:51+00:00 · Latest: 2025-11-06T18:46:28+00:00

Comments: Published at NeurIPS'25 Main Conference

Abstract

Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt-based modifications, creative image editing requires an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi-agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. To the best of our knowledge, this is the first work to introduce the task of creative editing.

Summary

The research aims to enhance AI's ability to create and edit images creatively, balancing originality, coherence, and artistic intent. It introduces CREA, a multi-agent collaborative framework in which specialized AI agents work together to conceptualize, generate, critique, and enhance images. Qualitative and quantitative evaluations show that CREA outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. To the authors' knowledge, this is the first work to introduce the task of creative editing.

CREA 是一个多智能体协作框架，旨在提升创意图像编辑和生成。该框架由一组专门的AI智能体组成，它们协同工作以构思、生成、评价和改进图像，模仿人类的创造性过程。CREA 在多样性、语义对齐和创造性转换方面显著优于现有方法，是首个处理创意编辑任务的工作。

Nowcast3D: Reliable precipitation nowcasting via gray-box learning

Authors: Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song, Yunfan Yang, Jie Tian, Yang Liu, Ji-Rong Wen, Xiaoye Zhang, Xueshun Shen, Hao Sun

First: 2025-11-06T18:44:35+00:00 · Latest: 2025-11-06T18:44:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Extreme precipitation nowcasting demands high spatiotemporal fidelity and extended lead times, yet existing approaches remain limited. Numerical Weather Prediction (NWP) and its deep-learning emulations are too slow and coarse for rapidly evolving convection, while extrapolation and purely data-driven models suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce a gray-box, fully three-dimensional nowcasting framework that directly processes volumetric radar reflectivity and couples physically constrained neural operators with data-driven learning. The model learns vertically varying 3D advection fields under a conservative advection operator, parameterizes spatially varying diffusion, and introduces a Brownian-motion-inspired stochastic term to represent unresolved motions. A residual branch captures small-scale convective initiation and microphysical variability, while a diffusion-based stochastic module estimates uncertainty. The framework achieves more accurate forecasts up to three-hour lead time across precipitation regimes and ranked first in 57% of cases in a blind evaluation by 160 meteorologists. By restoring full 3D dynamics with physical consistency, it offers a scalable and robust pathway for skillful and reliable nowcasting of extreme precipitation.

中文标题/摘要

标题：Nowcast3D：通过灰箱学习实现可靠的降水现在预报

极端降水现在预报需要高时空保真度和延长的提前量，但现有方法仍有限制。数值天气预报（NWP）及其深度学习模拟过于缓慢和粗糙，无法应对快速演变的对流，而外推和纯数据驱动模型则会积累误差并过度平滑。基于二维雷达的混合方法会丢弃关键的垂直信息，无法准确重建高度依赖的动力学。我们提出了一种灰箱、完全三维的现在预报框架，直接处理体积雷达反射率，并结合物理约束的神经算子与数据驱动学习。模型在保守的对流算子下学习垂直变化的三维输送场，参数化空间变化的扩散，并引入基于布朗运动的随机项来表示未解决的运动。残差分支捕捉小尺度对流的初始和微物理变异性，而基于扩散的随机模块估计不确定性。该框架在不同降水模式下的预报更为准确，提前量可达三小时，并在160名气象学家的盲测中以57%的胜率排名第一。通过恢复完整的三维动力学并保持物理一致性，它提供了一条可扩展且稳健的路径，用于实现极端降水的准确和可靠现在预报。

Summary / 总结

Nowcast3D addresses the limitations of existing precipitation nowcasting methods by introducing a gray-box, fully three-dimensional framework. It processes volumetric radar reflectivity and combines physically constrained neural operators with data-driven learning. The model achieves more accurate forecasts up to three-hour lead time, ranking first in 57% of cases in a blind evaluation by meteorologists.

Nowcast3D通过引入一个灰箱、全三维框架，处理体积雷达反射率并结合物理约束的神经运算符与数据驱动学习，解决了现有降水现在预报方法的局限性。该模型学习垂直变化的三维平流场，参数化空间变化的扩散，并包含一个随机项来表示未解决的运动。它还包含一个残差分支来捕捉小尺度对流的初始和微物理变异性，以及一个基于扩散的随机模块来估计不确定性。Nowcast3D在三小时内的预报更为准确，并在160名气象学家进行的盲测中，有57%的情况下排名第一。
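
As a rough illustration of what a "conservative advection operator" means, the sketch below advances a 1D field with a flux-form upwind step on a periodic grid. It is not the Nowcast3D implementation: the function name `advect_conservative`, the 1D periodic setup, and the upwind scheme are all assumptions chosen for brevity. The point it shows is that when every cell's outflow is its neighbour's inflow, the total amount of the advected quantity is preserved.

```python
def advect_conservative(q, u, dt, dx):
    """One flux-form upwind step for dq/dt + d(u*q)/dx = 0 on a periodic 1D grid.

    q: cell-averaged values; u: velocities at the right face of each cell.
    Illustrative only: a toy 1D stand-in for a learned 3D advection field.
    """
    n = len(q)
    # Flux through the right face of cell i, taking q from the upwind side.
    flux = [u[i] * (q[i] if u[i] >= 0 else q[(i + 1) % n]) for i in range(n)]
    # Each cell gains what enters through its left face and loses what leaves
    # through its right face, so the sum over all cells is unchanged.
    return [q[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n)]


if __name__ == "__main__":
    q0 = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
    u = [0.5, 0.8, 1.0, 0.6, 0.4, 0.5]  # spatially varying face velocities
    q1 = advect_conservative(q0, u, dt=0.5, dx=1.0)
    print(sum(q0), sum(q1))
```

Writing the update in terms of face fluxes, rather than as a pointwise gradient, is what makes conservation hold regardless of how the velocity field varies in space.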

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Authors: Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie

First: 2025-11-06T18:43:21+00:00 · Latest: 2025-11-06T18:43:21+00:00

Comments: Project page: https://cambrian-mllm.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.

中文标题/摘要

标题：基准设计者应“在测试集上训练”以揭示可利用的非视觉捷径

稳健的基准对于评估多模态大型语言模型（MLLMs）至关重要。然而，我们发现许多多模态基准模型可以在没有强大视觉理解的情况下通过，而是利用了偏差、语言先验和表面模式。对于旨在需要视觉输入的视觉中心基准，这尤其成问题。我们采用了一种基准设计的诊断原则：如果一个基准可以被利用，它就会被利用。因此，设计者应该首先尝试“利用”他们自己的基准，使用诊断和去偏见程序系统地识别和缓解非视觉偏差。有效的诊断需要直接“在测试集上训练”——探测测试集固有的、可利用的模式。我们通过两个组件来实现这一标准。首先，我们使用“测试集压力测试”（TsT）方法诊断基准的易利用性。我们的主要诊断工具是通过k折交叉验证对测试集的非视觉、文本输入进行微调，揭示捷径性能并为每个样本分配偏差分数s(x)。我们还通过基于手工特征的轻量级随机森林诊断程序进行快速、可解释的审计。其次，我们通过“迭代偏差修剪”（IBP）程序过滤高偏差样本来去偏基准。将这一框架应用于四个基准——VSI-Bench、CV-Bench、MMMU和VideoMME，我们发现了普遍存在的非视觉偏差。作为案例研究，我们将整个框架应用于创建VSI-Bench-Debiased，展示了降低非视觉可解性和扩大视觉盲性能差距。
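
The two steps described in the abstract (held-out, text-only scoring followed by iterative pruning) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the "model" is a simple frequency table standing in for the fine-tuned LLM or Random Forest, folds are assigned by index, and the names `bias_scores`, `iterative_bias_pruning`, `threshold`, and `batch` are hypothetical.

```python
from collections import Counter, defaultdict


def bias_scores(samples, k=3):
    """Assign each sample a held-out, text-only bias score s(x) in [0, 1].

    samples: list of (text_feature, answer) pairs; no visual input is used.
    The "model" is a frequency table, a deliberately trivial stand-in for the
    fine-tuned LLM or Random Forest mentioned in the abstract.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        table = defaultdict(Counter)
        # "Train" on every sample outside the current fold.
        for i, (feat, ans) in enumerate(samples):
            if i % k != fold:
                table[feat][ans] += 1
        # Score held-out samples: how often did this feature co-occur with
        # the correct answer in the training folds?
        for i, (feat, ans) in enumerate(samples):
            if i % k == fold:
                total = sum(table[feat].values())
                scores[i] = table[feat][ans] / total if total else 0.0
    return scores


def iterative_bias_pruning(samples, threshold=0.8, batch=1, max_rounds=100, k=3):
    """Repeatedly drop the highest-bias samples, then re-score the remainder."""
    kept = list(samples)
    for _ in range(max_rounds):
        scores = bias_scores(kept, k)
        worst = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)
        drop = {i for i in worst[:batch] if scores[i] > threshold}
        if not drop:
            break
        kept = [s for i, s in enumerate(kept) if i not in drop]
    return kept
```

Re-scoring after every removal matters because deleting a biased sample changes the statistics that made its neighbours predictable; a single one-shot filter would not account for that.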

CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Authors: Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon

First: 2025-09-09T01:49:29+00:00 · Latest: 2025-11-06T18:38:30+00:00

Abs · PDF · Code1 · Code2

Abstract

The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.

中文标题/摘要

标题：CancerGUIDE：通过内部分歧估计理解癌症指南

国家综合癌症网络（NCCN）提供了基于证据的癌症治疗指南。将复杂的患者表现转化为符合指南的治疗建议既耗时又需要专门的专长，并且容易出错。大型语言模型（LLM）能力的进步有望减少生成治疗建议所需的时间并提高准确性。我们提出了一种基于LLM代理的方法，自动为非小细胞肺癌（NSCLC）患者生成符合指南的治疗轨迹。我们的贡献有三个方面。首先，我们构建了一个包含121例NSCLC患者的纵向数据集，其中包括临床会诊、诊断结果和医疗史，并由认证肿瘤学家专家标注了相应的NCCN指南轨迹。其次，我们证明现有的LLM具有领域特定的知识，能够生成高质量的代理基准，用于模型开发和评估，与专家标注基准的相关性（斯皮尔曼系数r=0.88，RMSE=0.08）很强。第三，我们开发了一种结合昂贵的人工标注和模型一致性信息的混合方法，创建了预测患者相关指南的代理框架，以及一个元分类器，通过校准的信心分数验证治疗建议的准确性（AUROC=0.800），这是传达输出准确性的关键能力，定制性能权衡，支持监管合规。这项工作建立了一个平衡准确度、可解释性和监管要求的临床可行的LLM基于指南遵从性系统框架，同时降低了标注成本，提供了一条自动临床决策支持的可扩展途径。

Summary / 总结

The research aims to automate the generation of guideline-compliant treatment recommendations for non-small cell lung cancer (NSCLC) patients using large language models (LLMs). The study constructs a longitudinal dataset of 121 NSCLC cases, expertly annotated with NCCN guidelines, and demonstrates that existing LLMs can generate high-quality proxy benchmarks. The key findings include a strong correlation (Spearman coefficient r=0.88, RMSE=0.08) with expert benchmarks and the development of a hybrid approach that combines human annotations with model consistency to predict and verify treatment recommendations with calibrated confidence scores (AUROC=0.800).

研究旨在利用大型语言模型（LLM）自动化生成非小细胞肺癌（NSCLC）患者的指南一致治疗建议。研究构建了一个包含121例NSCLC病例的纵向数据集，并由认证肿瘤学家专家注释了相应的NCCN指南。研究结果表明，现有的LLM可以生成高质量的代理基准。关键发现包括与专家基准的强相关性（Spearman系数r=0.88，RMSE=0.08），以及结合人类注释与模型一致性开发的混合方法，可以预测并验证治疗建议，并提供校准的信心分数（AUROC=0.800）。
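
To make "model consistency information" concrete, the sketch below turns repeated predictions for one patient into a majority label plus an agreement rate. This is only a guess at the general idea: the function `consistency_confidence` and the label strings are hypothetical, and the paper's meta-classifier and calibration procedure are not reproduced here.

```python
from collections import Counter


def consistency_confidence(predictions):
    """Return (majority_label, agreement) for repeated predictions on one case.

    agreement is the fraction of runs that match the majority label; low values
    signal internal disagreement. A hypothetical feature, not the paper's
    meta-classifier.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, count = Counter(predictions).most_common(1)[0]
    return label, count / len(predictions)


if __name__ == "__main__":
    runs = ["path_A", "path_A", "path_B", "path_A"]  # made-up guideline labels
    print(consistency_confidence(runs))
```

An agreement rate like this could serve as one input to a downstream verifier; whether it is well calibrated would still have to be checked against expert annotations.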

DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

Authors: Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong

First: 2025-11-06T18:37:18+00:00 · Latest: 2025-11-06T18:37:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Cooperative multi-agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi-agent planning. Cooperation unfolds through a two-phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step-level alignment and enables higher-level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block-push tasks show that agents adapt across episodes: the dynamic world model captures reusable patterns and, through negotiation and self-refinement, improves task completion rates and efficiency, trading a time overhead for evolving, more efficient collaboration strategies.

中文标题/摘要

标题：DR. WELL：基于符号世界模型的动态推理与学习在体态多智能体协作中的应用

协同多智能体规划需要智能体在部分信息和有限通信的情况下做出联合决策。在轨迹层面的协调往往失败，因为时间或动作上的微小偏差会引发冲突。符号规划通过提高抽象层次和提供一种最小化的行动词汇表来缓解这一挑战，从而实现同步和集体进步。我们提出了DR. WELL，一种去中心化的神经符号框架，用于协同多智能体规划。合作通过两阶段谈判协议展开：智能体首先提出候选角色并进行推理，然后在达成共识和环境约束下承诺联合分配。在承诺之后，每个智能体独立生成并执行其角色的符号计划，而不透露详细的轨迹。计划通过共享世界模型进行接地，该模型编码当前状态并在智能体行动时更新。通过在符号计划而非原始轨迹上进行推理，DR. WELL 避免了脆弱的步骤级对齐，并使高级操作变得可重用、可同步和可解释。在协同推块任务中的实验表明，智能体在不同回合中能够适应，动态世界模型捕捉到可重用的模式并提高了任务完成率和效率。在协同推块任务中的实验表明，通过谈判和自我完善，我们的动态世界模型提高了任务完成率和效率，以时间开销换取了更高效的协作策略。

Summary / 总结

DR. WELL is a decentralized neurosymbolic framework for cooperative multi-agent planning, addressing coordination challenges through a two-phase negotiation protocol. Agents propose roles and commit to a joint allocation under constraints, then generate and execute symbolic plans independently. The dynamic world model, grounded in execution outcomes, captures reusable patterns, enhancing task completion rates and efficiency. Experiments on block-push tasks demonstrate agents' adaptability and improved performance through negotiation and self-refinement.

DR. WELL 是一个去中心化的神经符号框架，用于协同多智能体规划，通过两阶段协商协议解决协调挑战。智能体提出角色并进行推理，然后在约束条件下达成共识，独立生成并执行符号计划。动态世界模型基于执行结果，捕捉可重用的模式，提高任务完成率和效率。实验表明，通过协商和自我优化，智能体在积木推移任务中表现出更强的适应性和高效的协作策略。

Efficient probabilistic surrogate modeling techniques for partially-observed large-scale dynamical systems

Authors: Hans Harder, Abhijeet Vishwasrao, Luca Guastoni, Ricardo Vinuesa, Sebastian Peitz

First: 2025-11-06T18:35:01+00:00 · Latest: 2025-11-06T18:35:01+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper is concerned with probabilistic techniques for forecasting dynamical systems described by partial differential equations (for example, the Navier-Stokes equations). In particular, it investigates and compares various extensions to the flow matching paradigm that reduce the number of sampling steps. In this regard, it compares direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs and rectified flows. Moreover, experiments are conducted on a set of challenging systems. In particular, we also address the challenge of directly predicting 2D slices of large-scale 3D simulations, paving the way for efficient inflow generation for solvers.

中文标题/摘要

标题：高效概率代理建模技术用于部分观测的大规模动力系统

本文关注用于预报由偏微分方程（例如，纳维-斯托克斯方程）描述的动力系统的概率技术。特别地，它研究并比较了减少采样步骤的各种流匹配范式的扩展。在这方面，它比较了直接蒸馏、渐进蒸馏、对抗扩散蒸馏、Wasserstein GAN 和修正流。此外，在一系列具有挑战性的系统上进行了实验。特别是，我们还直接预测了大规模3D模拟的2D切片，为求解器生成高效的入流铺平了道路。

Summary / 总结

This paper explores probabilistic methods for forecasting dynamical systems governed by partial differential equations, focusing on reducing the number of sampling steps through various extensions to the flow matching paradigm. The study compares techniques such as direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows. It also addresses the direct prediction of 2D slices of large-scale 3D simulations, which could enable efficient inflow generation for solvers.

该论文探讨了用于预报由偏微分方程描述的动力系统的一种高效概率技术，重点关注通过流匹配范式的各种扩展来减少采样步骤的数量。研究比较了直接蒸馏、逐步蒸馏、对抗扩散蒸馏、Wasserstein GANs 和修正流。研究还探讨了直接预测大规模3D模拟的2D切片，为求解器的高效入流生成铺平了道路。
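
The link between straight trajectories and fewer sampling steps can be seen in a toy setting. The sketch below is an assumption-laden illustration, not anything from the paper: `euler_sample` is a generic fixed-step integrator, and `straight_velocity` is an idealised field whose trajectories are straight lines to a single target point, which is the situation rectified flows try to approximate.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def straight_velocity(target):
    """Idealised field whose trajectories are straight lines ending at `target`."""
    return lambda x, t: (target - x) / (1.0 - t)


if __name__ == "__main__":
    field = straight_velocity(target=3.0)
    # With perfectly straight paths, a single Euler step already lands on the
    # target; more steps give the same answer at higher cost.
    print(euler_sample(-1.0, field, num_steps=1))
    print(euler_sample(-1.0, field, num_steps=4))
```

Learned velocity fields are generally curved, so real samplers need several steps; distillation and rectification are ways of reducing that number.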

NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

Authors: Kylie Cancilla, Alexander Moore, Amar Saini, Carmen Carrano

First: 2025-11-06T18:23:55+00:00 · Latest: 2025-11-06T18:23:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.

中文标题/摘要

标题：NovisVQ：一种用于无参考无意见感知帧质量评估的流式卷积神经网络

视频质量评估（VQA）对于计算机视觉任务至关重要，但现有方法面临重大限制：全参考（FR）指标需要干净的参考视频，而大多数无参考（NR）模型依赖于昂贵的人类意见标签。此外，大多数无意见的NR方法是基于图像的，忽略了视频对象检测中至关重要的时间上下文。在本工作中，我们提出了一种可扩展的基于流的VQA模型，该模型既是无参考的又是无意见的。我们的模型利用DAVIS数据集的合成退化，训练一种具有时间感知的卷积架构，直接从退化视频中预测FR指标（LPIPS、PSNR、SSIM），无需在推理时使用参考。我们展示了我们的流式方法在泛化到多种退化方面优于我们自己的基于图像的基线，突显了时间建模在实际视觉系统中可扩展VQA的价值。此外，我们证明了我们的模型与全参考指标的相关性高于BRISQUE，这是一种广泛使用的基于意见的图像质量评估基线，验证了我们的时间、无意见方法的有效性。

Summary / 总结

This paper introduces NovisVQ, a streaming convolutional neural network for no-reference opinion-unaware frame quality assessment. The model uses synthetic degradations of the DAVIS dataset to train a temporal-aware architecture that predicts full-reference metrics (LPIPS, PSNR, SSIM) directly from degraded video without needing references at inference. The study shows that NovisVQ outperforms an image-based baseline and achieves higher correlation with full-reference metrics compared to BRISQUE, highlighting the benefits of temporal modeling for scalable VQA in real-world vision systems.

本文介绍了NovisVQ，这是一种基于流式卷积神经网络的无参考无意见视频质量评估模型。该模型使用DAVIS数据集的合成降质训练一个时序感知的架构，可以直接从降质视频中预测全参考指标（LPIPS、PSNR、SSIM），无需在推理时使用参考视频。研究显示，NovisVQ在性能上优于基于图像的基线，并且与广泛使用的意见感知图像质量评估基准BRISQUE相比，具有更高的全参考指标相关性，突显了时序建模在实际视觉系统中进行可扩展VQA的价值。
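
For readers unfamiliar with the full-reference targets, PSNR is the simplest of the three to compute. The sketch below shows the standard formula on flat pixel sequences; the function name `psnr` and the 8-bit default `max_value=255.0` are assumptions for illustration. A training target of this kind requires the clean frame, which is why it is only available when degradations are synthesised.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [10, 20, 30, 40]
    noisy = [11, 19, 31, 39]  # every pixel off by one grey level
    print(round(psnr(clean, noisy), 2))
```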

Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Authors: Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

First: 2025-09-10T14:02:18+00:00 · Latest: 2025-11-06T18:15:30+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data? In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

中文标题/摘要

标题：医学中大型语言模型的记忆现象：普遍性、特征及影响

大型语言模型（LLMs）在医学领域展现了显著的潜力。到目前为止，LLMs 已被广泛应用于诊断辅助、医学问答和临床信息综合等任务。然而，一个关键的开放问题是：LLMs 在多大程度上记忆了医学训练数据。在本研究中，我们首次全面评估了医学中 LLMs 的记忆现象，评估其普遍性（发生频率）、特征（记忆内容）、体积（记忆内容量）以及潜在的下游影响（记忆如何影响医学应用）。我们系统分析了常见的适应场景：（1）继续在医学语料库上进行预训练，（2）在标准医学基准上进行微调，以及（3）在真实世界临床数据上进行微调，包括来自耶鲁纽黑文健康系统的超过 13,000 份独特的住院记录。结果表明，记忆现象在所有适应场景中普遍存在，并且显著高于一般领域报告的水平。记忆现象影响了 LLMs 在医学中的开发和应用，并可分类为三种类型：有益的（如准确回忆临床指南和生物医学参考）、无信息的（如重复的免责声明或模板化的医学文档语言）和有害的（如再生特定数据集或敏感的临床内容）。基于这些发现，我们提出了实用的建议，以促进有益的记忆现象，增强领域特定推理和事实准确性，减少无信息的记忆现象以促进更深层次的学习，避免表面模式，以及减轻有害的记忆现象以防止敏感或可识别患者信息的泄露。

Summary / 总结

This study evaluates the memorization of medical training data in Large Language Models (LLMs) across different adaptation scenarios, including continued pretraining, fine-tuning on benchmarks, and real-world clinical data. The research finds that memorization is prevalent and significantly higher in the medical domain compared to general domains. Memorization can be categorized into beneficial, uninformative, and harmful types, with implications for the development and adoption of LLMs in medicine. Practical recommendations are provided to enhance beneficial memorization, minimize uninformative memorization, and mitigate harmful memorization.

本研究评估了不同适应场景下（包括持续预训练、基准数据微调和真实临床数据微调）大型语言模型（LLM）对医学训练数据的记忆情况。研究发现，医学LLM的记忆现象普遍且显著高于通用领域模型。记忆可以分为有益、无信息和有害三种类型，对医学应用的发展和采用有重要影响。研究建议采取措施增强有益记忆、减少无信息记忆、减轻有害记忆，以提高医学应用的准确性和安全性。

Dynamic causal discovery in Alzheimer's disease through latent pseudotime modelling

Authors: Natalia Glazman, Jyoti Mangal, Pedro Borges, Sebastien Ourselin, M. Jorge Cardoso

Venue: NeurIPS 2025

First: 2025-11-06T18:12:09+00:00 · Latest: 2025-11-06T18:12:09+00:00

Comments: Accepted to the NeurIPS 2025 Workshop on CauScien: Uncovering Causality in Science

Abs · PDF · Code1 · Code2

Abstract

The application of causal discovery to diseases like Alzheimer's (AD) is limited by the static graph assumptions of most methods; such models cannot account for an evolving pathophysiology, modulated by a latent disease pseudotime. We propose to apply an existing latent variable model to real-world AD data, inferring a pseudotime that orders patients along a data-driven disease trajectory independent of chronological age, then learning how causal relationships evolve. Pseudotime outperformed age in predicting diagnosis (AUC 0.82 vs 0.59). Incorporating minimal, disease-agnostic background knowledge substantially improved graph accuracy and orientation. Our framework reveals dynamic interactions between novel (NfL, GFAP) and established AD markers, enabling practical causal discovery despite violated assumptions.

中文标题/摘要

标题：阿尔茨海默病中的动态因果发现通过潜在潜时间建模

将因果发现应用于阿尔茨海默病（AD）受到大多数方法静态图假设的限制；此类模型无法解释由潜在疾病潜时间调节的不断演变的病理生理学。我们提出应用现有的潜在变量模型到实际的AD数据中，推断出一个潜时间来按数据驱动的疾病轨迹对患者进行排序，独立于实际年龄，然后学习因果关系如何演变。潜时间在预测诊断方面优于年龄（AUC 0.82 vs 0.59）。结合最少的、与疾病无关的背景知识显著提高了图的准确性和方向性。我们的框架揭示了新型（NfL、GFAP）和已确立的AD标记之间的动态相互作用，尽管违反了假设条件，仍能实现实际的因果发现。

Summary / 总结

This study addresses the limitation of static causal models in Alzheimer's disease by proposing a latent pseudotime model. The model infers a disease trajectory independent of chronological age and learns evolving causal relationships. Pseudotime outperformed age in predicting diagnosis, and incorporating minimal background knowledge improved graph accuracy. The framework revealed dynamic interactions between novel and established AD markers.

该研究通过提出潜时间模型解决了阿尔茨海默病中静态因果模型的局限性。该模型独立于实际年龄推断疾病轨迹，并学习因果关系的变化。潜时间在预测诊断方面优于实际年龄，且融入少量背景知识提高了图的准确性。该框架揭示了新型和已确立的AD标记物之间的动态相互作用。

Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

Authors: Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen

First: 2025-11-06T18:09:09+00:00 · Latest: 2025-11-06T18:09:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Fréchet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC-positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.

中文标题/摘要

标题：在虚拟免疫组化中建立信任：基于自动评估的图像质量

深度学习模型可以从苏木精和伊红（H&E）图像生成虚拟免疫组化（IHC）染色，提供一种可扩展且低成本的实验室IHC替代方案。然而，可靠地评估图像质量仍然是一个挑战，因为当前基于纹理和分布的指标量化的是图像保真度而非IHC染色的准确性。在这里，我们介绍了一种基于自动和准确性框架来确定十六个配对或非配对图像翻译模型的图像质量。通过颜色反卷积，我们生成了每个虚拟IHC模型预测的棕色（即IHC阳性）像素的掩码。我们使用真实和虚拟IHC的分割掩码来计算染色准确性指标（Dice、IoU、Hausdorff距离），直接量化正确的像素级标签，无需专家手动注释。我们的结果表明，传统的图像保真度指标，包括弗雷切特入射距离（FID）、峰值信噪比（PSNR）和结构相似性（SSIM），与染色准确性及病理学家评估的相关性较差。配对模型如金字塔Pix2Pix和自适应NCE获得最高的染色准确性，而基于扩散和GAN的非配对模型在提供准确的IHC阳性像素标签方面可靠性较低。此外，整个切片图像（WSI）在基于斑块的评估中无法揭示的性能下降强调了需要WSI级别的基准。总体而言，该框架定义了一种可重复的方法来评估虚拟IHC模型的质量，这是加速向病理学家常规使用的关键步骤。

Summary / 总结

The research aims to evaluate the quality of virtual immunohistochemistry (IHC) generated from hematoxylin and eosin (H&E) images using deep learning models. An automated framework was developed to assess the accuracy of IHC staining by comparing real and virtual IHC images using Dice, IoU, and Hausdorff distance metrics. The study found that conventional image fidelity metrics like FID, PSNR, and SSIM poorly correlate with stain accuracy and pathologist assessment. Paired models like PyramidPix2Pix and AdaptiveNCE showed the highest accuracy, while unpaired diffusion- and GAN-based models were less reliable. Whole-slide images revealed performance declines not visible in patch-based evaluations, highlighting the importance of WSI-level benchmarks.

研究旨在评估由hematoxylin和eosin（H&E）图像生成的虚拟免疫组织化学（IHC）图像的质量，使用了深度学习模型。研究引入了一种基于颜色反卷积的自动化框架，用于计算Dice、IoU和Hausdorff距离等标记准确度指标，直接衡量像素级标签的准确性。研究发现，传统的图像保真度指标与标记准确度和病理学家评估的相关性较差。配对模型如PyramidPix2Pix和AdaptiveNCE表现最佳，而无配对的扩散-和GAN基模型在提供准确的IHC阳性像素标签方面可靠性较低。全切片图像揭示了在基于块的评估中看不到的性能下降，突显了全切片图像级别基准的重要性。
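
Two of the stain accuracy metrics named above, Dice and IoU, reduce to simple overlap counts once the IHC-positive masks exist. The sketch below assumes the masks have already been produced (the color deconvolution step is not shown) and are given as nested lists of 0/1 values; the function name `dice_and_iou` and the convention of returning 1.0 for two empty masks are assumptions, not details from the paper.

```python
def dice_and_iou(real_mask, virtual_mask):
    """Dice and IoU between two binary masks given as equal-shaped nested lists."""
    real = [bool(p) for row in real_mask for p in row]
    virt = [bool(p) for row in virtual_mask for p in row]
    if len(real) != len(virt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(r and v for r, v in zip(real, virt))
    total = sum(real) + sum(virt)
    union = total - inter
    if total == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / total, inter / union


if __name__ == "__main__":
    real = [[1, 1], [0, 0]]     # IHC-positive pixels in the laboratory stain
    virtual = [[1, 0], [0, 0]]  # IHC-positive pixels in the virtual stain
    print(dice_and_iou(real, virtual))
```

Unlike FID or SSIM, both scores depend only on which pixels are labelled positive, which is why they can diverge from fidelity metrics computed on the full image.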

Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning

Authors: Dongkwan Lee, Junhoo Lee, Nojun Kwak

First: 2025-10-13T07:56:55+00:00 · Latest: 2025-11-06T18:08:46+00:00

Comments: NeurIPS2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.

中文标题/摘要

标题：深度边缘滤波器：深度学习中的人工构建层的回归

我们介绍了深度边缘滤波器，这是一种新颖的方法，通过在深度神经网络特征上应用高通滤波来提高模型的泛化能力。我们的方法受到这样一个假设的启发，即神经网络在深度特征的高频分量中编码任务相关的语义信息，而在低频分量中存储领域特定的偏差。通过从原始特征中减去低通滤波输出，我们的方法隔离了可泛化的表示，同时保持了架构的完整性。实验结果表明，无论模型架构和数据模态如何，该方法在视觉、文本、3D和音频等多个领域都表现出一致的性能提升。分析表明，我们的方法导致了特征稀疏化，并有效地隔离了高频分量，为我们的核心假设提供了实证验证。代码可在https://github.com/dongkwani/DeepEdgeFilter 获取。

Summary / 总结

The research introduces the Deep Edge Filter, which applies high-pass filtering to deep neural network features to enhance model generalizability. Motivated by the hypothesis that neural networks encode task-relevant information in high-frequency components and domain-specific biases in low-frequency components, the method subtracts low-pass filtered outputs from original features to isolate generalizable representations. Experiments across various domains show consistent performance improvements, and analysis indicates that the method induces feature sparsification and isolates high-frequency components, which supports the hypothesis.

研究引入了Deep Edge Filter，该方法通过对深度神经网络特征进行高通滤波来提升模型的泛化能力。该方法假设高频成分包含任务相关的语义信息，而低频成分存储了领域特定的偏见。通过从原始特征中减去低通滤波输出，该方法隔离了通用表示。实验结果显示在多个领域中均有一致的性能提升，并且分析证实了特征的稀疏化和高频成分的有效隔离，验证了核心假设。
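
The core operation, subtracting a low-pass filtered copy from the original features, can be sketched on a 1D feature vector. This is a minimal sketch under stated assumptions rather than the released code: the moving-average kernel, the border replication, and the name `high_pass_features` are all illustrative choices.

```python
def high_pass_features(features, kernel_size=3):
    """Return features minus their moving-average (low-pass) version.

    Borders are handled by replicating the edge value so the output keeps the
    input length. The mean kernel and 1D layout are illustrative assumptions.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    n, half = len(features), kernel_size // 2
    out = []
    for i in range(n):
        # Clamp indices to the valid range to replicate the border values.
        window = [features[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        out.append(features[i] - sum(window) / kernel_size)
    return out


if __name__ == "__main__":
    print(high_pass_features([2.0, 2.0, 2.0, 2.0, 2.0]))  # flat input
    print(high_pass_features([0.0, 0.0, 3.0, 0.0, 0.0]))  # isolated spike
```

A flat input is removed entirely while a sharp change survives, which is the sense in which the filter keeps "edges" in feature space.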

evomap: A Toolbox for Dynamic Mapping in Python

Authors: Maximilian Matthe

First: 2025-11-06T18:02:58+00:00 · Latest: 2025-11-06T18:02:58+00:00

Comments: Accepted for publication by the Journal of Statistical Software

Abs · PDF · Code1 · Code2

Abstract

This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among objects as spatial representations, or maps. However, most existing statistical software supports only static mapping, which captures objects' relationships at a single point in time and lacks tools to analyze how these relationships evolve. evomap fills this gap by implementing the dynamic mapping framework EvoMap, originally proposed by Matthe, Ringel, and Skiera (2023), which adapts traditional static mapping methods for dynamic analyses. The package supports multiple mapping techniques, including variants of Multidimensional Scaling (MDS), Sammon Mapping, and t-distributed Stochastic Neighbor Embedding (t-SNE). It also includes utilities for data preprocessing, exploration, and result evaluation, offering a comprehensive toolkit for dynamic mapping applications. This paper outlines the foundations of static and dynamic mapping, describes the architecture and functionality of evomap, and illustrates its application through an extensive usage example.

中文标题/摘要

标题：evomap：Python中的动态映射工具箱

本文介绍了evomap，这是一个用于动态映射的Python软件包。映射方法在各个学科中被广泛使用，用于将对象之间的关系可视化为空间表示或地图。然而，现有的大多数统计软件仅支持静态映射，只能捕捉对象在某一时间点的关系，缺乏分析这些关系如何演变的工具。evomap通过实现Matthe、Ringel和Skiera（2023）提出的动态映射框架EvoMap来填补这一空白，该框架将传统的静态映射方法适应于动态分析。该软件包支持多种映射技术，包括多维尺度（MDS）的变体、Sammon映射和t分布随机邻域嵌入（t-SNE）。它还包括数据预处理、探索和结果评估的工具，为动态映射应用提供了一个全面的工具包。本文概述了静态和动态映射的基础，描述了evomap的架构和功能，并通过一个详尽的应用示例进行了说明。

Summary / 总结

This paper introduces evomap, a Python package designed for dynamic mapping, which addresses the limitation of existing software that only supports static mapping. By implementing the dynamic mapping framework EvoMap, evomap adapts traditional static mapping methods such as MDS, Sammon Mapping, and t-SNE for dynamic analyses. The package includes utilities for data preprocessing, exploration, and result evaluation, providing a comprehensive toolkit for dynamic mapping applications.

本文介绍了evomap，这是一个用于动态映射的Python包。它解决了现有统计软件的局限性，提供了可视化和分析随时间变化的对象间关系的工具。该包实现了EvoMap动态映射框架，支持多种映射技术如MDS、Sammon Mapping和t-SNE，还包含数据预处理和评估的工具。通过一个广泛的使用示例展示了其应用。

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Authors: Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang

First: 2025-11-06T17:54:12+00:00 · Latest: 2025-11-06T17:54:12+00:00

Abs · PDF · Code1 · Code2

Abstract

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.

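Chinese title / abstract (translated)

Title: PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Although the Contrastive Language-Image Pretraining (CLIP) model has been highly successful across downstream vision-language understanding tasks, strengthening its fine-grained image-text alignment remains an active research focus. Most existing work explicitly increases the granularity of visual processing, for example by introducing visual prompts that direct the model to specific local regions of an image. Research on Multimodal Large Language Models (MLLMs) has also shown that training with long, detailed text descriptions effectively improves fine-grained vision-language alignment. However, the inherent token length limit of CLIP's text encoder fundamentally restricts CLIP's ability to process the finer textual information embedded in long text sequences. To benefit from finer granularity on both the visual and the textual side, the authors propose PixCLIP, a new framework that accepts visual prompt inputs and processes long text descriptions at the same time. They first build an automated annotation pipeline that generates pixel-level localized, long-form text descriptions for images, and use it to construct LongGRIT, a high-quality dataset of nearly 1.5 million samples. They then replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework that promotes fine-grained alignment between image regions and their text descriptions at arbitrary granularity. Experiments show that PixCLIP achieves breakthroughs in pixel-level interaction and long-text handling, reaching state-of-the-art performance.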

Summary / 总结

The research aims to improve CLIP's fine-grained vision-language understanding by increasing both visual and textual granularity. The authors build an automated annotation pipeline that generates pixel-level localized, long-form textual descriptions and use it to construct LongGRIT, a dataset of nearly 1.5 million samples. PixCLIP itself replaces CLIP's original text encoder with an LLM and is trained with a three-branch pixel-text alignment learning framework. According to the abstract, PixCLIP achieves state-of-the-art performance in pixel-level interaction and in handling long-form text.

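The study aims to improve CLIP's fine-grained vision-language understanding by combining visual prompts with long text descriptions, and proposes the PixCLIP framework. PixCLIP builds an automated annotation pipeline that generates pixel-level localized text descriptions and adopts a three-branch pixel-text alignment learning framework. Experimental results show that PixCLIP achieves breakthroughs in pixel-level interaction and long-text handling, reaching state-of-the-art performance.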
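
For orientation, the standard CLIP-style symmetric contrastive objective is written out below for N paired region embeddings v_i and text embeddings t_i. This is an assumption for illustration only: the abstract does not specify the losses used in PixCLIP's three branches, and the symbols v_i, t_i, s_ij, and the temperature τ are introduced here, not taken from the paper.

```latex
s_{ij} = \frac{v_i^{\top} t_j}{\lVert v_i \rVert \, \lVert t_j \rVert},
\qquad
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N}
\left[
  \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)}
  +
  \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)}
\right]
```

The first term pulls each region toward its own description among all descriptions in the batch; the second does the same from the text side.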

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

Authors: Hampus Åström, Elin Anna Topp, Jacek Malec

First: 2025-11-06T17:51:11+00:00 · Latest: 2025-11-06T17:51:11+00:00

Comments: 8 pages without cover, references and supplementary materials, 11 with. Submitted to RLC 2025's workshop RLBrew and IMOL 2025

Abs · PDF · Code1 · Code2

Abstract

In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.

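Chinese title / abstract (translated)

Title: Environment-Agnostic Goal-Conditioning: A Study of Reward-Free Autonomous Learning

This paper studies how to convert regular reinforcement learning environments into goal-conditioned environments so that agents can learn to solve tasks autonomously and without rewards. The authors show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, with training times comparable to externally guided reinforcement learning. The method is independent of the underlying off-policy learning algorithm. Because the method is environment-agnostic, the agent does not rank any goal above the others, which makes performance on individual goals unstable. In the experiments, however, the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observation made in the environment, which enables generic agent training ahead of specific use cases.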

Summary / 总结

This paper investigates transforming regular reinforcement learning environments into goal-conditioned ones to enable autonomous, reward-free learning. The agent learns by selecting its own goals in an environment-agnostic way, with training times comparable to externally guided reinforcement learning, and the method is independent of the underlying off-policy learning algorithm. Because no goal is valued above the others, performance on individual goals is unstable, but the average goal success rate improves and stabilizes. A trained agent can be instructed to seek any observation made in the environment, which allows generic training before specific use cases.

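This paper explores converting reinforcement learning environments into goal-directed environments to enable autonomous, reward-free learning. The method lets the agent learn by setting its own goals; it is environment-agnostic and compatible with various off-policy learning algorithms. Although performance on individual goals may be unstable, the average success rate improves and stabilizes. An agent trained this way can be instructed to seek any observation in the environment, enabling generic pre-training ahead of specific tasks.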
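
The idea of turning a regular environment into a goal-conditioned one can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ChainEnv`, `GoalConditioned`, `toward_goal`, and `run_episode` are hypothetical names, the toy environment is invented for the example, and a hand-written policy stands in for the off-policy learner the paper would train.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: positions 0..size-1, actions -1 or +1."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Clamp the position to the ends of the chain.
        self.pos = min(self.size - 1, max(0, self.pos + action))
        return self.pos


class GoalConditioned:
    """Wrapper (assumed design): each episode targets a previously seen observation."""

    def __init__(self, env, seed=0):
        self.env = env
        self.rng = random.Random(seed)
        self.seen = []  # distinct observations, in order of discovery
        self.goal = None

    def _remember(self, obs):
        if obs not in self.seen:
            self.seen.append(obs)

    def reset(self):
        obs = self.env.reset()
        self._remember(obs)
        # Uniform sampling: no goal is valued above any other.
        self.goal = self.rng.choice(self.seen)
        return obs, self.goal

    def step(self, action):
        obs = self.env.step(action)
        self._remember(obs)
        # Internal success signal; no external reward is involved.
        reached = obs == self.goal
        return obs, self.goal, reached


def toward_goal(obs, goal):
    """Placeholder policy: a learned goal-conditioned policy would go here."""
    return 1 if goal > obs else -1


def run_episode(wrapped, policy, max_steps=20):
    """Run one self-selected-goal episode; return True if the goal was reached."""
    obs, goal = wrapped.reset()
    for _ in range(max_steps):
        obs, goal, reached = wrapped.step(policy(obs, goal))
        if reached:
            return True
    return False
```

In this sketch the wrapper never consults a reward from the wrapped environment: the only learning signal would be whether the current observation matches the self-selected goal, which is what makes the construction independent of the environment's own task.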