arXiv Paper Digest

TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models

Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem

Venue: Transactions on Machine Learning Research, 2025

First: 2025-05-29T17:59:59+00:00 · Latest: 2025-11-06T18:59:57+00:00

Comments: Published in TMLR, with a J2C Certification

Abstract

Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

Summary

The research aims to enhance detailed visual understanding by combining the strengths of image-text models and segmentation models like SAM2. The proposed TextRegion framework generates text-aligned region tokens without requiring additional training, achieving superior or competitive performance in various downstream tasks such as open-world semantic segmentation and referring expression comprehension. This framework is compatible with multiple image-text models, making it practical and extensible for future advancements.

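TextRegion is a framework that combines the vision-language alignment of image-text models with the spatial precision of SAM2 to generate text-aligned region tokens. The approach improves detailed visual understanding while preserving open-vocabulary capability. Extensive evaluations show that TextRegion matches or outperforms state-of-the-art training-free methods on open-world semantic segmentation, referring expression comprehension, and grounding. The framework is compatible with many image-text models, which makes it practical and easy to extend. Code is available at https://github.com/avaxiao/TextRegion.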
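
To make the idea of a text-aligned region token concrete, here is a minimal sketch. It assumes the simplest possible pooling, a plain average of patch features inside each mask, which may differ from the pooling the paper actually uses. The function names `region_tokens` and `classify_regions` and the array shapes are illustrative assumptions, not part of the TextRegion codebase.

```python
import numpy as np

def region_tokens(patch_feats, masks):
    """Average patch features inside each mask (hypothetical helper).

    patch_feats: (H, W, D) patch embeddings from a frozen image-text model.
    masks:       (R, H, W) boolean region masks, already resized to the patch grid.
    Returns (R, D) L2-normalised region tokens.
    """
    num_regions = masks.shape[0]
    dim = patch_feats.shape[-1]
    flat_feats = patch_feats.reshape(-1, dim)                   # (H*W, D)
    flat_masks = masks.reshape(num_regions, -1).astype(float)   # (R, H*W)
    # Guard against empty masks so the division is always defined.
    area = np.maximum(flat_masks.sum(axis=1, keepdims=True), 1.0)
    tokens = flat_masks @ flat_feats / area
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-12)
    return tokens / norms

def classify_regions(tokens, text_embeds):
    """Assign each region the index of the most similar text embedding (cosine)."""
    text = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return (tokens @ text.T).argmax(axis=1)
```

In use, `masks` would come from a segmenter such as SAM2 and `text_embeds` from the text encoder of the same image-text model, one row per candidate class name.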

GentleHumanoid: Learning Upper-body Compliance for Contact-rich Human and Object Interaction

Authors: Qingzhou Lu, Yao Feng, Baiyu Shi, Michael Piseno, Zhenan Bao, C. Karen Liu

First: 2025-11-06T18:59:33+00:00 · Latest: 2025-11-06T18:59:33+00:00

Comments: Home page: https://gentle-humanoid.axell.top

Abstract

Humanoid robots are expected to operate in human-centered environments where safe and natural physical interaction is essential. However, most recent reinforcement learning (RL) policies emphasize rigid tracking and suppress external forces. Existing impedance-augmented approaches are typically restricted to base or end-effector control and focus on resisting extreme forces rather than enabling compliance. We introduce GentleHumanoid, a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance. At its core is a unified spring-based formulation that models both resistive contacts (restoring forces when pressing against surfaces) and guiding contacts (pushes or pulls sampled from human motion data). This formulation ensures kinematically consistent forces across the shoulder, elbow, and wrist, while exposing the policy to diverse interaction scenarios. Safety is further supported through task-adjustable force thresholds. We evaluate our approach in both simulation and on the Unitree G1 humanoid across tasks requiring different levels of compliance, including gentle hugging, sit-to-stand assistance, and safe object manipulation. Compared to baselines, our policy consistently reduces peak contact forces while maintaining task success, resulting in smoother and more natural interactions. These results highlight a step toward humanoid robots that can safely and effectively collaborate with humans and handle objects in real-world environments.

Summary

GentleHumanoid is a framework that integrates impedance control into a whole-body motion tracking policy to achieve upper-body compliance in human and object interaction. It uses a unified spring-based formulation to model resistive and guiding contacts, ensuring kinematically consistent forces across the shoulder, elbow, and wrist. The approach reduces peak contact forces while maintaining task success in various scenarios, such as gentle hugging and object manipulation, leading to smoother and more natural interactions.

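GentleHumanoid is a framework that integrates impedance control into a whole-body motion tracking policy, giving humanoid robots upper-body compliance in human-centered environments. It uses a unified spring-based formulation to model resistive and guiding contacts, which keeps forces consistent and exposes the policy to diverse interaction scenarios. Experiments show that GentleHumanoid reduces peak contact forces relative to baselines while maintaining task success, yielding smoother and more natural interactions.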
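
The abstract does not give the spring formulation itself, so the following is only a generic sketch of a spring-damper interaction force with a task-adjustable cap. The function name, parameters, and the choice to cap the force magnitude are assumptions made for illustration and should not be read as the authors' formulation.

```python
import numpy as np

def compliant_force(x, v, x_ref, stiffness, damping, force_limit):
    """Spring-damper force pulling a link toward x_ref, capped in magnitude.

    x, v:        current link position and velocity, shape (3,).
    x_ref:       spring anchor (e.g. a reference pose), shape (3,).
    force_limit: hypothetical task-dependent safety threshold in newtons.
    """
    force = stiffness * (x_ref - x) - damping * v
    magnitude = np.linalg.norm(force)
    if magnitude > force_limit:
        # Rescale so the direction is kept but the magnitude equals the limit.
        force = force * (force_limit / magnitude)
    return force
```

Lowering `force_limit` for a task such as hugging and raising it for sit-to-stand assistance is one way to picture what "task-adjustable force thresholds" could mean in practice.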

Residual Kolmogorov-Arnold Network for Enhanced Deep Learning

Authors: Ray Congrui Yu, Sherry Wu, Jiang Gui

First: 2024-10-07T21:12:32+00:00 · Latest: 2025-11-06T18:59:32+00:00

Comments: Code is available at https://github.com/withray/residualKAN.git

Abstract

Despite their immense success, deep convolutional neural networks (CNNs) can be difficult to optimize and costly to train due to hundreds of layers within the network depth. Conventional convolutional operations are fundamentally limited by their linear nature along with fixed activations, where many layers are needed to learn meaningful patterns in data. Because of the sheer size of these networks, this approach is simply computationally inefficient, and poses overfitting or gradient explosion risks, especially in small datasets. As a result, we introduce a "plug-in" module, called Residual Kolmogorov-Arnold Network (RKAN). Our module is highly compact, so it can be easily added into any stage (level) of traditional deep networks, where it learns to integrate supportive polynomial feature transformations to existing convolutional frameworks. RKAN offers consistent improvements over baseline models in different vision tasks and widely tested benchmarks, accomplishing cutting-edge performance on them.

Summary

This paper addresses the challenges of optimizing and training deep convolutional neural networks (CNNs) by introducing the Residual Kolmogorov-Arnold Network (RKAN), a compact module that can be integrated into existing CNNs. RKAN enhances the network by learning to integrate polynomial feature transformations, which improves performance in various vision tasks and benchmarks, achieving state-of-the-art results.

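This paper tackles the difficulty of optimizing and training deep convolutional neural networks (CNNs) by introducing the Residual Kolmogorov-Arnold Network (RKAN). RKAN is a compact module that can be added to existing CNNs, where it learns to integrate polynomial feature transformations. It yields consistent improvements over baseline models across vision tasks and benchmarks and reaches cutting-edge performance on them.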

Tracking and Understanding Object Transformations

Authors: Yihong Sun, Xinyu Yang, Jennifer J. Sun, Bharath Hariharan

Venue: NeurIPS 2025

First: 2025-11-06T18:59:30+00:00 · Latest: 2025-11-06T18:59:30+00:00

Comments: NeurIPS 2025

Abstract

Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after transformation, due to significant changes in object appearance. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph first identifies potentially overlooked tracks, and determines whether they should be integrated based on semantic and proximity priors. Then, it reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations, while demonstrating deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.

Summary

The paper addresses the challenge of tracking objects through significant state transformations, such as an apple being cut or a butterfly emerging. It introduces the Track Any State task and a new benchmark dataset, VOST-TAS. The proposed TubeletGraph system identifies and integrates potentially overlooked tracks based on semantic and proximity priors, and generates a state graph to describe transformations. TubeletGraph outperforms existing methods in tracking performance and demonstrates deeper understanding of object transformations and capabilities in temporal grounding and semantic reasoning.

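The paper addresses object tracking under significant state transformations, such as an apple being cut or a butterfly emerging from its cocoon, and introduces the Track Any State task together with a new benchmark dataset, VOST-TAS. The authors propose TubeletGraph, a zero-shot system that recovers tracks lost after a transformation and maps how object states evolve over time. TubeletGraph uses semantic and proximity priors to decide which tracks to integrate and generates a state graph describing each transformation. It achieves state-of-the-art tracking performance under transformations and shows a deeper understanding of complex object transformations, along with promise in temporal grounding and semantic reasoning.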

InfinityStar: Unified Spacetime AutoRegressive Modeling for Visual Generation

Authors: Jinlai Liu, Jian Han, Bin Yan, Hui Wu, Fengda Zhu, Xing Wang, Yi Jiang, Bingyue Peng, Zehuan Yuan

Venue: NeurIPS 2025 Oral

First: 2025-11-06T18:58:03+00:00 · Latest: 2025-11-06T18:58:03+00:00

Comments: NeurIPS 2025 Oral

Abstract

We introduce InfinityStar, a unified spacetime autoregressive framework for high-resolution image and dynamic video synthesis. Building on the recent success of autoregressive modeling in both vision and language, our purely discrete approach jointly captures spatial and temporal dependencies within a single architecture. This unified design naturally supports a variety of generation tasks such as text-to-image, text-to-video, image-to-video, and long interactive video synthesis via straightforward temporal autoregression. Extensive experiments demonstrate that InfinityStar scores 83.74 on VBench, outperforming all autoregressive models by large margins, even surpassing some diffusion competitors like HunyuanVideo. Without extra optimizations, our model generates a 5s, 720p video approximately 10x faster than leading diffusion-based methods. To our knowledge, InfinityStar is the first discrete autoregressive video generator capable of producing industrial level 720p videos. We release all code and models to foster further research in efficient, high-quality video generation.

Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos

Authors: Kaifeng Zhang, Baoyu Li, Kris Hauser, Yunzhu Li

First: 2025-06-18T17:59:38+00:00 · Latest: 2025-11-06T18:57:44+00:00

Comments: Project page: https://kywind.github.io/pgnd

Abstract

Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects (such as ropes, cloths, stuffed animals, and paper bags) from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks. The project page is available at https://kywind.github.io/pgnd.

Summary

This research aims to model the dynamics of deformable objects by addressing the challenges of their diverse physical properties and limited visual information. The method combines particle and grid representations to capture global shape and motion while predicting dense particle movements. Experiments show that the model can learn the dynamics of various objects from sparse RGB-D videos and generalize to unseen instances, outperforming existing simulators in scenarios with limited camera views. The learned models are also useful for model-based planning and goal-conditioned object manipulation in different tasks.

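The work proposes a particle-grid neural dynamics framework for modeling deformable-object dynamics. The framework combines object particles with a spatial grid to capture global shape and motion information while predicting dense particle movements. Experiments show that the model learns the dynamics of diverse objects from sparse-view RGB-D recordings, generalizes to unseen instances, and outperforms existing simulators when camera views are limited. The learned models also support model-based planning and goal-conditioned object manipulation across a range of tasks.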

X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Authors: Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

First: 2025-11-06T18:56:30+00:00 · Latest: 2025-11-06T18:56:30+00:00

Abstract

Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.

Summary

The research aims to leverage human demonstrations for robot learning despite embodiment differences. The method exploits the forward diffusion process: adding noise to actions preserves high-level task guidance while low-level execution mismatches fade, so a human action enters policy training only once a classifier can no longer tell its embodiment. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.

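The work addresses the challenge of training robots from human demonstrations given the fundamental embodiment differences between humans and robots. It uses the forward diffusion process to add noise to actions, which preserves high-level task guidance while washing out low-level execution differences. X-Diffusion trains a classifier to distinguish human from robot actions and incorporates a human action into policy training only after adding enough noise that the classifier can no longer tell its embodiment. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline.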
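
The noise-gating idea can be sketched in a few lines. This is a hedged illustration only: it assumes a standard DDPM-style forward process and a classifier that outputs the probability that a noised action came from a human. The names `forward_diffuse` and `min_indistinguishable_step`, the `threshold` value, and the rule for choosing the step are assumptions, not the authors' implementation.

```python
import numpy as np

def forward_diffuse(action, t, alpha_bar, rng):
    """Sample a noised action a_t from a_0 under a DDPM-style schedule.

    alpha_bar[t] is the cumulative signal-retention factor at step t.
    """
    noise = rng.standard_normal(action.shape)
    return np.sqrt(alpha_bar[t]) * action + np.sqrt(1.0 - alpha_bar[t]) * noise

def min_indistinguishable_step(human_probs, threshold=0.6):
    """Smallest step t from which P(human) stays at or below `threshold`.

    human_probs[t] is a (hypothetical) classifier's P(human | a_t) for one
    human action. Returns len(human_probs) if no such step exists, meaning
    the action would not be used at any noise level.
    """
    for t in range(len(human_probs)):
        if max(human_probs[t:]) <= threshold:
            return t
    return len(human_probs)
```

Under this reading, a human action whose gating step is `t_min` would supervise denoising only at steps `t >= t_min`, while robot-consistent actions supervise all steps, including the low-noise ones.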

Cambrian-S: Towards Spatial Supersensing in Video

Authors: Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

First: 2025-11-06T18:55:17+00:00 · Latest: 2025-11-06T18:55:17+00:00

Comments: Website: https://cambrian-mllm.github.io/

Abstract

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.

Summary

The research aims to advance spatial supersensing in video by proposing a new paradigm beyond linguistic understanding, focusing on semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. The authors introduce VSI-SUPER, a benchmark comprising VSR and VSC tasks, which require long video inputs and resist brute-force context expansion. Training Cambrian-S on the curated VSI-590K dataset yields a +30% absolute improvement on VSI-Bench, but performance on VSI-SUPER remains limited, suggesting that scale alone is insufficient. The study proposes predictive sensing as a promising approach, demonstrating that models must not only perceive but also anticipate, select, and organize experiences to achieve true spatial supersensing.

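The work aims to advance spatial supersensing in video by proposing a paradigm that goes beyond linguistic understanding, centered on semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. The authors introduce the VSI-SUPER benchmark, comprising the VSR and VSC tasks, which require long video inputs and resist brute-force context expansion. Training Cambrian-S on the curated VSI-590K dataset yields a +30% absolute improvement on VSI-Bench, yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient. The study proposes predictive sensing as a way forward and argues that models must anticipate, select, and organize experience, not just perceive it.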
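
A toy version of surprise-driven event segmentation is sketched below. It assumes that "surprise" is the L2 distance between the predicted and the observed latent frame and that a fixed threshold marks an event boundary. Both choices, and the function name `segment_by_surprise`, are illustrative assumptions; the proof-of-concept in the paper relies on a learned next-latent-frame predictor.

```python
import numpy as np

def segment_by_surprise(latents, predictions, threshold):
    """Split a latent-frame sequence into events at high-surprise frames.

    latents:     (T, D) observed latent frames.
    predictions: (T, D) predictions for the same frames, made from earlier ones.
    Returns (boundaries, surprise): start indices of events and per-frame error.
    """
    surprise = np.linalg.norm(latents - predictions, axis=1)
    boundaries = [0]  # the first frame always opens an event
    for t in range(1, len(surprise)):
        if surprise[t] > threshold:
            boundaries.append(t)
    return boundaries, surprise
```

A quick way to try it without a trained model is a persistence predictor, which predicts each frame as a copy of the previous one, so surprise spikes exactly where the latent changes abruptly.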

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Authors: Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

First: 2025-11-06T18:53:31+00:00 · Latest: 2025-11-06T18:53:31+00:00

Comments: Project page: https://ellisbrown.github.io/sims-v

Abstract

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V, a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

Summary

The research aims to enhance multimodal language models' spatial reasoning capabilities by addressing the challenge of obtaining diverse spatially annotated video data. SIMS-V, a data-generation framework, uses 3D simulators to create rich spatial video training data. Experiments show that a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) significantly improves transfer learning, outperforming larger models on real-world spatial reasoning benchmarks with fewer training examples. This approach enables efficient and robust generalization across various spatial tasks.

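The work aims to strengthen the spatial reasoning of multimodal language models by using 3D simulators to generate spatially rich video training data. Through systematic ablations, it examines which properties of simulated data drive real-world transfer and identifies three key question categories that are most effective for building spatial intelligence. A 7B-parameter video LLM fine-tuned on only 25K simulated examples outperforms the 72B baseline and is competitive with proprietary models on rigorous real-world spatial reasoning benchmarks, demonstrating robust generalization and training efficiency.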

Forgetting is Everywhere

Authors: Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

First: 2025-11-06T18:52:57+00:00 · Latest: 2025-11-06T18:52:57+00:00

Comments: Project page: https://ben-sanati.github.io/forgetting-is-everywhere-project/

Abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution over future experiences, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions

Authors: Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li

First: 2025-11-06T18:52:08+00:00 · Latest: 2025-11-06T18:52:08+00:00

Comments: Website: https://real2sim-eval.github.io/

Abstract

Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/

Summary

The research aims to evaluate robotic manipulation policies more efficiently by using simulation, addressing the challenges of real-world evaluation, especially for tasks involving deformable objects. The method involves constructing soft-body digital twins from real-world videos and rendering them with photorealistic fidelity using 3D Gaussian Splatting. Key findings show that simulated rollouts strongly correlate with real-world performance and reveal important behavioral patterns of learned policies, suggesting the approach enables reproducible, scalable, and accurate evaluations.

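The work aims to evaluate robot manipulation policies efficiently through simulation, addressing the cost of real-world evaluation. The method constructs soft-body digital twins from real-world videos and renders them with photorealistic fidelity using 3D Gaussian Splatting. On plush toy packing, rope routing, and T-block pushing, simulated and real-world performance correlate strongly, supporting the accuracy and scalability of the approach for policy evaluation.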
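
One common way to quantify how well simulated rollouts track real-world performance is the Pearson correlation between per-policy success rates. The sketch below assumes that metric. The abstract does not state which statistic the authors report, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(sim_success, real_success):
    """Pearson correlation between simulated and real per-policy success rates."""
    sim = np.asarray(sim_success, dtype=float)
    real = np.asarray(real_success, dtype=float)
    sim_c = sim - sim.mean()
    real_c = real - real.mean()
    return float((sim_c @ real_c) / np.sqrt((sim_c @ sim_c) * (real_c @ real_c)))
```

Each input holds one number per evaluated policy checkpoint. A value near 1 would mean the simulator ranks and spaces the policies the same way real-world trials do.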

CREA: A Collaborative Multi-Agent Framework for Creative Image Editing and Generation

Authors: Kavana Venkatesh, Connor Dunlop, Pinar Yanardag

Venue: NeurIPS

First: 2025-04-07T17:59:51+00:00 · Latest: 2025-11-06T18:46:28+00:00

Comments: Published at NeurIPS'25 Main Conference

Abstract

Creativity in AI imagery remains a fundamental challenge, requiring not only the generation of visually compelling content but also the capacity to add novel, expressive, and artistically rich transformations to images. Unlike conventional editing tasks that rely on direct prompt-based modifications, creative image editing requires an autonomous, iterative approach that balances originality, coherence, and artistic intent. To address this, we introduce CREA, a novel multi-agent collaborative framework that mimics the human creative process. Our framework leverages a team of specialized AI agents who dynamically collaborate to conceptualize, generate, critique, and enhance images. Through extensive qualitative and quantitative evaluations, we demonstrate that CREA significantly outperforms state-of-the-art methods in diversity, semantic alignment, and creative transformation. To the best of our knowledge, this is the first work to introduce the task of creative editing.

Summary

The research motivation is to enhance AI's ability to create visually compelling and artistically rich images through creative image editing. The main method involves a collaborative multi-agent framework called CREA, which consists of specialized AI agents working together to conceptualize, generate, critique, and enhance images. Key experimental findings show that CREA outperforms existing methods in terms of diversity, semantic alignment, and creative transformation, marking the first work to introduce the task of creative editing.

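The motivation is to strengthen AI's ability to produce visually compelling and artistically rich images through creative image editing. The method is CREA, a collaborative multi-agent framework in which specialized AI agents work together to conceptualize, generate, critique, and enhance images. Experiments show that CREA outperforms existing methods in diversity, semantic alignment, and creative transformation. According to the authors, this is the first work to introduce the task of creative editing.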

Nowcast3D: Reliable precipitation nowcasting via gray-box learning

Authors: Huaguan Chen, Wei Han, Haofei Sun, Ning Lin, Xingtao Song, Yunfan Yang, Jie Tian, Yang Liu, Ji-Rong Wen, Xiaoye Zhang, Xueshun Shen, Hao Sun

First: 2025-11-06T18:44:35+00:00 · Latest: 2025-11-06T18:44:35+00:00

Abstract

Extreme precipitation nowcasting demands high spatiotemporal fidelity and extended lead times, yet existing approaches remain limited. Numerical Weather Prediction (NWP) and its deep-learning emulations are too slow and coarse for rapidly evolving convection, while extrapolation and purely data-driven models suffer from error accumulation and excessive smoothing. Hybrid 2D radar-based methods discard crucial vertical information, preventing accurate reconstruction of height-dependent dynamics. We introduce a gray-box, fully three-dimensional nowcasting framework that directly processes volumetric radar reflectivity and couples physically constrained neural operators with data-driven learning. The model learns vertically varying 3D advection fields under a conservative advection operator, parameterizes spatially varying diffusion, and introduces a Brownian-motion-inspired stochastic term to represent unresolved motions. A residual branch captures small-scale convective initiation and microphysical variability, while a diffusion-based stochastic module estimates uncertainty. The framework achieves more accurate forecasts up to three-hour lead time across precipitation regimes and ranked first in 57% of cases in a blind evaluation by 160 meteorologists. By restoring full 3D dynamics with physical consistency, it offers a scalable and robust pathway for skillful and reliable nowcasting of extreme precipitation.
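
To make the phrase "conservative advection operator" concrete, here is a minimal 1D sketch. It is not the paper's model: Nowcast3D learns vertically varying 3D fields, whereas this toy uses a fixed, made-up velocity on a periodic grid. The function name `advect_conservative`, the face-velocity layout, and all numbers are assumptions chosen for illustration. The point is only that a flux-form update moves the field around without changing its total.

```python
def advect_conservative(field, velocity, dt=0.1, dx=1.0):
    """One flux-form upwind step on a periodic 1D grid.

    velocity[i] is the velocity at the face between cell i and cell i+1
    (a hypothetical layout chosen for this sketch).
    """
    n = len(field)
    flux = []
    for i in range(n):
        u = velocity[i]
        # Upwind: take the value from the cell the flow is coming from.
        donor = field[i] if u >= 0 else field[(i + 1) % n]
        flux.append(u * donor)
    # Each cell loses the flux through its right face and gains the flux
    # through its left face, so the grid total is unchanged.
    return [field[i] - dt / dx * (flux[i] - flux[i - 1]) for i in range(n)]


# Made-up reflectivity values and face velocities.
reflectivity = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
winds = [0.5, 0.8, 1.0, 0.6, 0.3, 0.4]
stepped = advect_conservative(reflectivity, winds)
```

Because every face flux is subtracted from one cell and added to its neighbor, `sum(stepped)` matches `sum(reflectivity)` up to floating-point rounding, whatever velocities are supplied.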

中文标题/摘要

标题：Nowcast3D：通过灰箱学习实现可靠的降水现在预报

极端降水现在预报需要高时空保真度和延长的提前量，但现有方法仍有限制。数值天气预报（NWP）及其深度学习模拟过于缓慢和粗糙，无法应对快速演变的对流，而外推和纯数据驱动模型则会积累误差并过度平滑。基于二维雷达的混合方法会丢弃关键的垂直信息，无法准确重建高度依赖的动力学。我们提出了一种灰箱、完全三维的现在预报框架，直接处理体积雷达反射率，并结合物理约束的神经算子与数据驱动学习。模型在保守对流算子下学习垂直变化的三维输送场，参数化空间变化的扩散，并引入基于布朗运动的随机项来表示未解决的运动。残差分支捕捉小尺度对流的初始和微物理变异性，而基于扩散的随机模块估计不确定性。该框架在降水模式下实现了更准确的预报，提前量可达三小时，并在160名气象学家进行的盲测中以57%的胜率排名第一。通过恢复完整的三维动力学并保持物理一致性，它提供了一条可扩展且稳健的路径，用于实现极端降水的准确和可靠现在预报。

Summary / 总结

Nowcast3D addresses the limitations of existing precipitation nowcasting methods by introducing a gray-box, fully three-dimensional framework. It processes volumetric radar reflectivity and combines physically constrained neural operators with data-driven learning. The model achieves more accurate forecasts up to three-hour lead time and ranked first in 57% of cases in a blind evaluation by meteorologists.

Nowcast3D 通过将物理约束的神经运算与数据驱动学习结合在完全三维框架中，解决了现有降水现在预报方法的局限性。该方法处理体积雷达反射率，学习三维湍流场，并引入随机项来表示未解决的运动。该模型在三小时预报中实现了更高的准确性，并在气象学家的盲测中57%的情况下排名第一。

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts

Authors: Ellis Brown, Jihan Yang, Shusheng Yang, Rob Fergus, Saining Xie

First: 2025-11-06T18:43:21+00:00 · Latest: 2025-11-06T18:43:21+00:00

Comments: Project page: https://cambrian-mllm.github.io

Abstract

Robust benchmarks are crucial for evaluating Multimodal Large Language Models (MLLMs). Yet we find that models can ace many multimodal benchmarks without strong visual understanding, instead exploiting biases, linguistic priors, and superficial patterns. This is especially problematic for vision-centric benchmarks that are meant to require visual inputs. We adopt a diagnostic principle for benchmark design: if a benchmark can be gamed, it will be. Designers should therefore try to "game" their own benchmarks first, using diagnostic and debiasing procedures to systematically identify and mitigate non-visual biases. Effective diagnosis requires directly "training on the test set" -- probing the released test set for its intrinsic, exploitable patterns. We operationalize this standard with two components. First, we diagnose benchmark susceptibility using a "Test-set Stress-Test" (TsT) methodology. Our primary diagnostic tool involves fine-tuning a powerful Large Language Model via k-fold cross-validation on exclusively the non-visual, textual inputs of the test set to reveal shortcut performance and assign each sample a bias score s(x). We complement this with a lightweight Random Forest-based diagnostic operating on hand-crafted features for fast, interpretable auditing. Second, we debias benchmarks by filtering high-bias samples using an "Iterative Bias Pruning" (IBP) procedure. Applying this framework to four benchmarks -- VSI-Bench, CV-Bench, MMMU, and VideoMME -- we uncover pervasive non-visual biases. As a case study, we apply our full framework to create VSI-Bench-Debiased, demonstrating reduced non-visual solvability and a wider vision-blind performance gap than the original.
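
The k-fold "train on the test set" diagnostic can be sketched as follows. This is a toy stand-in, not the authors' procedure: a majority-answer-per-question-type lookup replaces the fine-tuned LLM and the Random Forest, and the score is a plain 0/1 indicator rather than the paper's s(x). The function name `blind_bias_scores` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def blind_bias_scores(samples, k=3):
    """Assign each sample a 0/1 bias score via k-fold cross-validation.

    samples: list of (question_type, answer) pairs; no visual input is used.
    A sample scores 1.0 when a predictor fit on the other folds guesses its
    answer from the question type alone.
    """
    scores = [0.0] * len(samples)
    for fold in range(k):
        held_out = [i for i in range(len(samples)) if i % k == fold]
        train = [i for i in range(len(samples)) if i % k != fold]
        # "Training": record the most common answer per question type.
        by_type = defaultdict(Counter)
        for i in train:
            q_type, answer = samples[i]
            by_type[q_type][answer] += 1
        for i in held_out:
            q_type, answer = samples[i]
            if by_type[q_type]:
                guess = by_type[q_type].most_common(1)[0][0]
                scores[i] = 1.0 if guess == answer else 0.0
    return scores


# Made-up text-only view of a small test set.
toy_test_set = [
    ("count", "2"), ("count", "2"), ("count", "2"),
    ("count", "2"), ("count", "5"), ("direction", "left"),
    ("direction", "right"), ("direction", "back"), ("count", "2"),
]
scores = blind_bias_scores(toy_test_set, k=3)
blind_accuracy = sum(scores) / len(scores)
```

In this toy data the "count" questions are answered "2" so often that a vision-blind predictor gets them right; those high-scoring samples are the kind a pruning step would target.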

中文标题/摘要

标题：基准设计者应“在测试集上训练”以揭示可利用的非视觉捷径

稳健的基准对于评估多模态大型语言模型（MLLMs）至关重要。然而，我们发现许多多模态基准模型可以在没有强大视觉理解的情况下通过，而是利用了偏差、语言先验和表面模式。对于旨在需要视觉输入的视觉中心基准，这尤其成问题。我们采用了一种基准设计的诊断原则：如果一个基准可以被利用，它就会被利用。因此，设计者应该首先尝试“利用”他们自己的基准，使用诊断和去偏见程序系统地识别和缓解非视觉偏差。有效的诊断需要直接“在测试集上训练”——探测测试集固有的、可利用的模式。我们通过两个组件来实现这一标准。首先，我们使用“测试集压力测试”（TsT）方法诊断基准的易利用性。我们的主要诊断工具是通过k折交叉验证对测试集的非视觉、文本输入进行微调，揭示捷径性能并为每个样本分配偏差分数s(x)。我们还通过基于手工特征的轻量级随机森林诊断程序进行快速、可解释的审计。其次，我们通过“迭代偏差修剪”（IBP）程序过滤高偏差样本来去偏基准。将这一框架应用于四个基准——VSI-Bench、CV-Bench、MMMU和VideoMME，我们发现了普遍存在的非视觉偏差。作为案例研究，我们将整个框架应用于创建VSI-Bench-Debiased，展示了降低非视觉可解性和扩大视觉盲性能差距。

CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Authors: Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon

First: 2025-09-09T01:49:29+00:00 · Latest: 2025-11-06T18:38:30+00:00

Abstract

The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.
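
The agreement figures quoted above (Spearman coefficient and RMSE between proxy and expert-annotated benchmarks) are standard statistics. A minimal sketch of how such numbers are computed is given below; the scores are made up for five hypothetical models, and the rank helper assumes no ties for simplicity.

```python
import math


def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def rmse(x, y):
    """Root-mean-square error between paired scores."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up accuracies for five hypothetical models.
expert_scores = [0.40, 0.55, 0.62, 0.70, 0.81]
proxy_scores = [0.45, 0.50, 0.66, 0.64, 0.85]
agreement = spearman(expert_scores, proxy_scores)
error = rmse(expert_scores, proxy_scores)
```

A high rank correlation means the proxy benchmark orders models the same way the expert benchmark does, while RMSE measures how far the absolute scores drift.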

中文标题/摘要

标题：CancerGUIDE：通过内部分歧估计理解癌症指南

国家综合癌症网络（NCCN）提供了基于证据的癌症治疗指南。将复杂的患者表现转化为符合指南的治疗建议既耗时又需要专门的专长，且容易出错。大型语言模型（LLM）能力的进步有望减少生成治疗建议所需的时间并提高准确性。我们提出了一种基于LLM代理的方法，自动为非小细胞肺癌（NSCLC）患者生成符合指南的治疗轨迹。我们的贡献有三个方面。首先，我们构建了一个包含121例NSCLC患者的纵向数据集，其中包括临床会诊、诊断结果和医疗历史，每个数据点都由认证肿瘤学家用相应的NCCN指南轨迹进行了专家注释。其次，我们证明现有的LLM具有领域特定的知识，能够生成高质量的代理基准，用于模型开发和评估，与专家注释基准的相关性（斯皮尔曼系数r=0.88，均方根误差RMSE=0.08）很强。第三，我们开发了一种结合昂贵的人工注释和模型一致性信息的混合方法，创建了一个代理框架来预测患者的相关指南，以及一个元分类器来通过校准的信心分数验证治疗建议的准确性（AUROC=0.800），这是传达输出准确性的关键能力，定制性能权衡，支持监管合规。这项工作建立了一个平衡准确性、可解释性和监管要求的临床可行的LLM基于指南遵从性系统框架，同时降低了注释成本，提供了一条自动临床决策支持的可扩展途径。

Summary / 总结

CancerGUIDE aims to automate the generation of guideline-concordant treatment recommendations for non-small cell lung cancer patients using large language models. The approach involves creating a longitudinal dataset of 121 NSCLC cases, developing a hybrid model combining human annotations and model consistency, and achieving strong correlation with expert benchmarks. The system provides calibrated confidence scores for treatment recommendations, enhancing accuracy and regulatory compliance.

CancerGUIDE 使用大型语言模型自动化生成非小细胞肺癌患者的指南一致治疗建议。该方法包括创建121例NSCLC病例的纵向数据集、结合人工注释和模型一致性开发混合模型，并与专家基准实现强相关性。系统提供治疗建议的校准置信分数，提高准确性和监管合规性。

DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

Authors: Narjes Nourzad, Hanqing Yang, Shiyu Chen, Carlee Joe-Wong

First: 2025-11-06T18:37:18+00:00 · Latest: 2025-11-06T18:37:18+00:00

Abstract

Cooperative multi-agent planning requires agents to make joint decisions with partial information and limited communication. Coordination at the trajectory level often fails, as small deviations in timing or movement cascade into conflicts. Symbolic planning mitigates this challenge by raising the level of abstraction and providing a minimal vocabulary of actions that enable synchronization and collective progress. We present DR. WELL, a decentralized neurosymbolic framework for cooperative multi-agent planning. Cooperation unfolds through a two-phase negotiation protocol: agents first propose candidate roles with reasoning and then commit to a joint allocation under consensus and environment constraints. After commitment, each agent independently generates and executes a symbolic plan for its role without revealing detailed trajectories. Plans are grounded in execution outcomes via a shared world model that encodes the current state and is updated as agents act. By reasoning over symbolic plans rather than raw trajectories, DR. WELL avoids brittle step-level alignment and enables higher-level operations that are reusable, synchronizable, and interpretable. Experiments on cooperative block-push tasks show that agents adapt across episodes: the dynamic world model captures reusable patterns and, through negotiation and self-refinement, improves task completion rates and efficiency, trading a time overhead for evolving, more efficient collaboration strategies.

中文标题/摘要

标题：DR. WELL：基于符号世界模型的动态推理与学习在体态多智能体协作中的应用

协同多智能体规划需要智能体在部分信息和有限通信的情况下做出联合决策。在轨迹层面的协调往往失败，因为时间或动作上的微小偏差会引发冲突。符号规划通过提高抽象层次和提供一种最小化的行动词汇表来缓解这一挑战，从而实现同步和集体进步。我们提出了DR. WELL，一种去中心化的神经符号框架，用于协同多智能体规划。合作通过两阶段协商协议展开：智能体首先提出候选角色并进行推理，然后在共识和环境约束下承诺联合分配。在承诺之后，每个智能体独立生成并执行其角色的符号计划，而不透露详细的轨迹。计划通过共享世界模型进行接地，该模型编码当前状态并在智能体行动时更新。通过在符号计划而非原始轨迹上进行推理，DR. WELL 避免了脆弱的步骤级对齐，并使高级操作变得可重用、可同步和可解释。在协同推块任务中的实验表明，智能体在不同回合中能够适应，动态世界模型捕捉到可重用的模式并提高了任务完成率和效率。在协同推块任务中的实验表明，通过协商和自我完善，我们的动态世界模型提高了任务完成率和效率，以时间开销换取了更高效的协作策略。

Summary / 总结

DR. WELL is a decentralized neurosymbolic framework for cooperative multi-agent planning, addressing coordination challenges through a two-phase negotiation protocol. Agents propose roles with reasoning and commit to a joint allocation under constraints, then generate and execute symbolic plans independently. The dynamic world model, grounded in execution outcomes, captures reusable patterns, enhancing task completion rates and efficiency. Experiments on block-push tasks demonstrate that agents adapt across episodes, with improved performance due to negotiation and self-refinement.

DR. WELL 是一种分布式神经符号框架，用于协同多智能体规划，通过两阶段协商协议解决协调问题。智能体提出角色并根据约束达成共识，然后独立生成和执行符号计划。动态世界模型基于执行结果，捕捉可重用的模式，提高任务完成率和效率。实验表明，智能体在多个回合中适应并改进了协作策略，通过协商和自我完善提升了性能。

Efficient probabilistic surrogate modeling techniques for partially-observed large-scale dynamical systems

Authors: Hans Harder, Abhijeet Vishwasrao, Luca Guastoni, Ricardo Vinuesa, Sebastian Peitz

First: 2025-11-06T18:35:01+00:00 · Latest: 2025-11-06T18:35:01+00:00

Abstract

This paper is concerned with probabilistic techniques for forecasting dynamical systems described by partial differential equations (such as the Navier-Stokes equations). In particular, it investigates and compares various extensions to the flow matching paradigm that reduce the number of sampling steps. In this regard, it compares direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs and rectified flows. Moreover, experiments are conducted on a set of challenging systems. In particular, we also address the challenge of directly predicting 2D slices of large-scale 3D simulations, paving the way for efficient inflow generation for solvers.
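
Why the number of sampling steps matters can be seen in a toy setting. Flow-matching samplers integrate an ODE dx/dt = v(x, t) from t = 0 to t = 1; the sketch below uses plain Euler integration with two hand-written velocity fields (both are assumptions for illustration, not learned models from the paper). When the path is a straight line, as rectified flows aim for, a single Euler step is already exact; a time-varying field needs many steps.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x


def straight_velocity(x, t, target=3.0):
    # Straight path toward a fixed target: x(t) = x0 + t * (target - x0),
    # so (target - x) / (1 - t) stays equal to target - x0 along the path.
    return (target - x) / (1.0 - t)


def curved_velocity(x, t):
    # A time-dependent field whose exact solution is x0 + 3 * t**2.
    return 6.0 * t


one_step = euler_sample(straight_velocity, 0.0, steps=1)
many_steps = euler_sample(straight_velocity, 0.0, steps=50)
coarse = euler_sample(curved_velocity, 0.0, steps=1)
fine = euler_sample(curved_velocity, 0.0, steps=1000)
```

Both fields carry x from 0 to 3 at t = 1. With the straight field, `one_step` and `many_steps` agree; with the curved field, the one-step result `coarse` misses the endpoint entirely while `fine` gets close to it.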

中文标题/摘要

标题：高效概率代理建模技术用于部分观测的大规模动力系统

本文关注用于预报由偏微分方程（例如，纳维-斯托克斯方程）描述的动力系统的概率技术。特别地，它研究并比较了减少采样步骤的各种流匹配范式的扩展方法。在这方面，它比较了直接蒸馏、渐进蒸馏、对抗扩散蒸馏、Wasserstein GAN 和修正流。此外，在一组具有挑战性的系统上进行了实验。特别是，我们还直接预测了大规模3D模拟的2D切片，为求解器生成高效的入流铺平了道路。

Summary / 总结

This paper explores probabilistic techniques for forecasting dynamical systems described by partial differential equations, focusing on extensions of the flow matching paradigm that reduce the number of sampling steps. The study compares direct distillation, progressive distillation, adversarial diffusion distillation, Wasserstein GANs, and rectified flows on a set of challenging systems. It also addresses the direct prediction of 2D slices of large-scale 3D simulations, a step toward efficient inflow generation for solvers.

该论文探讨了用于预报由偏微分方程描述的动力系统的方法，重点关注通过扩展流匹配范式来减少采样步骤的数量。研究比较了直接蒸馏、逐步蒸馏、对抗扩散蒸馏、Wasserstein GANs 和修正流等方法。关键实验结果表明，这些技术可以有效地预测大规模3D模拟的2D切片，从而为求解器生成有效的入流。

NovisVQ: A Streaming Convolutional Neural Network for No-Reference Opinion-Unaware Frame Quality Assessment

Authors: Kylie Cancilla, Alexander Moore, Amar Saini, Carmen Carrano

First: 2025-11-06T18:23:55+00:00 · Latest: 2025-11-06T18:23:55+00:00

Abstract

Video quality assessment (VQA) is vital for computer vision tasks, but existing approaches face major limitations: full-reference (FR) metrics require clean reference videos, and most no-reference (NR) models depend on training on costly human opinion labels. Moreover, most opinion-unaware NR methods are image-based, ignoring temporal context critical for video object detection. In this work, we present a scalable, streaming-based VQA model that is both no-reference and opinion-unaware. Our model leverages synthetic degradations of the DAVIS dataset, training a temporal-aware convolutional architecture to predict FR metrics (LPIPS, PSNR, SSIM) directly from degraded video, without references at inference. We show that our streaming approach outperforms our own image-based baseline by generalizing across diverse degradations, underscoring the value of temporal modeling for scalable VQA in real-world vision systems. Additionally, we demonstrate that our model achieves higher correlation with full-reference metrics compared to BRISQUE, a widely-used opinion-aware image quality assessment baseline, validating the effectiveness of our temporal, opinion-unaware approach.
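
Training on synthetic degradations works because the full-reference target can be computed for free whenever the clean frame is available. As a minimal sketch of one such target, the standard PSNR formula is shown below for small grayscale frames stored as nested lists; the frame values are made up, and how the authors compute their targets is not specified in the abstract.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-sized frames."""
    squared_errors = [
        (r - d) ** 2
        for ref_row, deg_row in zip(reference, degraded)
        for r, d in zip(ref_row, deg_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 2x2 frames: a clean frame and a synthetically degraded copy.
clean = [[100, 120], [140, 160]]
noisy = [[110, 110], [150, 150]]
target_psnr = psnr(clean, noisy)
```

At training time a value like `target_psnr` would serve as the regression label for the degraded frame; at inference the model sees only the degraded input.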

中文标题/摘要

标题：NovisVQ：一种用于无参考无意见感知帧质量评估的流式卷积神经网络

视频质量评估（VQA）对于计算机视觉任务至关重要，但现有方法面临重大限制：全参考（FR）指标需要干净的参考视频，而大多数无参考（NR）模型依赖于昂贵的人类意见标签的训练。此外，大多数无意见的NR方法是基于图像的，忽略了对视频对象检测至关重要的时间上下文。在本工作中，我们提出了一种可扩展的基于流的VQA模型，该模型既是无参考的又是无意见的。我们的模型利用DAVIS数据集的合成退化，训练一种具有时间感知的卷积架构，直接从退化视频中预测FR指标（LPIPS、PSNR、SSIM），无需在推理时使用参考。我们展示了我们的流式方法在泛化到多种退化方面优于我们自己的基于图像的基线，突显了时间建模在实际视觉系统中可扩展VQA的价值。此外，我们证明了我们的模型与全参考指标的相关性高于BRISQUE，这是一种广泛使用的基于意见的图像质量评估基线，验证了我们的时间、无意见方法的有效性。

Summary / 总结

This paper introduces NovisVQ, a streaming convolutional neural network for no-reference opinion-unaware frame quality assessment. The model uses synthetic degradations of the DAVIS dataset to train a temporal-aware architecture that predicts full-reference metrics (LPIPS, PSNR, SSIM) directly from degraded video without needing references at inference. The model outperforms an image-based baseline and shows higher correlation with full-reference metrics compared to BRISQUE, validating the effectiveness of its temporal, opinion-unaware approach.

研究旨在开发一种可扩展的无参考且无意见的视频质量评估模型，以解决现有全参考和意见导向方法的局限性。该模型使用流式卷积神经网络，通过合成的DAVIS数据集退化训练，直接从退化视频中预测全参考指标（LPIPS、PSNR、SSIM），无需在推理时使用参考。该模型优于基于图像的基线，并且与全参考指标的关联性高于BRISQUE，验证了其在实际视觉系统中可扩展的VQA的有效性。

Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Authors: Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

First: 2025-09-10T14:02:18+00:00 · Latest: 2025-11-06T18:15:30+00:00

Abstract

Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.

中文标题/摘要

标题：医学中大型语言模型的记忆现象：普遍性、特征及影响

大型语言模型（LLMs）在医学领域展现了显著的潜力。到目前为止，LLMs 已被广泛应用于诊断辅助、医学问答和临床信息综合等任务。然而，一个关键的开放问题是：LLMs 在多大程度上记忆了医学训练数据。在本研究中，我们首次全面评估了医学中 LLMs 的记忆现象，评估了其普遍性（发生频率）、特征（记忆内容）、体积（记忆内容量）以及潜在的下游影响（记忆如何影响医学应用）。我们系统分析了常见的适应场景：（1）继续在医学语料库上进行预训练，（2）在标准医学基准上进行微调，以及（3）在真实世界临床数据上进行微调，包括来自耶鲁纽黑文健康系统的超过 13,000 份独特的住院记录。结果表明，记忆现象在所有适应场景中普遍存在，并且显著高于一般领域报告的水平。记忆现象影响了 LLMs 在医学中的开发和应用，并可分类为三种类型：有益的（如准确回忆临床指南和生物医学参考）、无信息的（如重复免责声明或模板化医学文档语言）和有害的（如再生特定数据集或敏感的临床内容）。基于这些发现，我们提出了实用的建议，以促进有益的记忆现象，增强领域特定推理和事实准确性，减少无信息的记忆现象以促进更深层次的学习，避免表面模式，以及减轻有害的记忆现象以防止敏感或可识别患者信息的泄露。

Summary / 总结

This study evaluates the memorization of medical training data in Large Language Models (LLMs) across various adaptation scenarios, including continued pretraining, fine-tuning on standard benchmarks, and real-world clinical data. The research finds that memorization is prevalent and significantly higher in the medical domain compared to general domains. It categorizes memorization into beneficial, uninformative, and harmful types, and provides recommendations to enhance beneficial memorization, minimize uninformative memorization, and mitigate harmful memorization to improve the development and adoption of LLMs in medicine.

本研究评估了应用于医学领域的大型语言模型（LLM）的记忆现象，考察其发生频率、特征、规模及其影响。在各种适应场景下，研究发现记忆现象普遍存在且高于通用领域。记忆现象被分为有益、无信息和有害三种类型，影响LLM的发展和应用。研究建议采取措施增强有益记忆、减少无信息记忆、减轻有害记忆，以提高LLM在医学应用中的性能。

Dynamic causal discovery in Alzheimer's disease through latent pseudotime modelling

Authors: Natalia Glazman, Jyoti Mangal, Pedro Borges, Sebastien Ourselin, M. Jorge Cardoso

Venue: NeurIPS 2025 Workshop on CauScien: Uncovering Causality in Science

First: 2025-11-06T18:12:09+00:00 · Latest: 2025-11-06T18:12:09+00:00

Comments: Accepted to the NeurIPS 2025 Workshop on CauScien: Uncovering Causality in Science

Abstract

The application of causal discovery to diseases like Alzheimer's (AD) is limited by the static graph assumptions of most methods; such models cannot account for an evolving pathophysiology, modulated by a latent disease pseudotime. We propose to apply an existing latent variable model to real-world AD data, inferring a pseudotime that orders patients along a data-driven disease trajectory independent of chronological age, then learning how causal relationships evolve. Pseudotime outperformed age in predicting diagnosis (AUC 0.82 vs 0.59). Incorporating minimal, disease-agnostic background knowledge substantially improved graph accuracy and orientation. Our framework reveals dynamic interactions between novel (NfL, GFAP) and established AD markers, enabling practical causal discovery despite violated assumptions.
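
The comparison "AUC 0.82 vs 0.59" asks how well a single scalar (pseudotime or age) separates diagnosed from undiagnosed patients. A minimal sketch of that metric follows, using the pairwise definition of AUC. The patient values are invented and do not reproduce the paper's numbers.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up values for six hypothetical patients (1 = diagnosed).
diagnosis = [0, 0, 0, 1, 1, 1]
pseudotime = [0.1, 0.3, 0.6, 0.5, 0.8, 0.9]
age = [70, 82, 65, 68, 75, 80]
pseudotime_auc = auc(pseudotime, diagnosis)
age_auc = auc(age, diagnosis)
```

An AUC near 0.5 means the variable orders patients no better than chance, which is why a value like 0.59 for age indicates a weak predictor.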

中文标题/摘要

标题：阿尔茨海默病中的动态因果发现通过潜在潜时间建模

将因果发现应用于阿尔茨海默病(AD)受到大多数方法静态图假设的限制；此类模型无法解释由潜在疾病潜时间调节的不断演变的病理生理学。我们提出应用现有的潜在变量模型到实际的AD数据中，推断出一个潜时间来按数据驱动的疾病轨迹对患者进行排序，独立于实际年龄，然后学习因果关系如何演变。潜时间在预测诊断方面优于年龄(AUC 0.82 vs 0.59)。结合少量、疾病无关的背景知识显著提高了图的准确性和方向性。我们的框架揭示了新型(NfL, GFAP)和已确立的AD标记之间的动态相互作用，尽管违反了假设条件，仍能实现实用的因果发现。

Summary / 总结

This study addresses the limitation of static causal models in Alzheimer's disease by proposing a latent pseudotime model. The model infers a disease trajectory independent of chronological age and learns evolving causal relationships. Results show that pseudotime outperforms age in predicting diagnosis (AUC 0.82 vs 0.59), and incorporating minimal background knowledge improves graph accuracy. The framework reveals dynamic interactions between novel and established AD markers.

该研究通过提出潜时间模型解决了阿尔茨海默病中静态因果模型的局限性。该模型独立于实际年龄推断疾病轨迹，并学习因果关系随时间的变化。结果显示，潜时间在预测诊断方面优于实际年龄（AUC 0.82 vs 0.59），并且加入少量背景知识可以提高图的准确性。该框架揭示了新型和已知阿尔茨海默病标记物之间的动态相互作用。

Building Trust in Virtual Immunohistochemistry: Automated Assessment of Image Quality

Authors: Tushar Kataria, Shikha Dubey, Mary Bronner, Jolanta Jedrzkiewicz, Ben J. Brintz, Shireen Y. Elhabian, Beatrice S. Knudsen

First: 2025-11-06T18:09:09+00:00 · Latest: 2025-11-06T18:09:09+00:00

Abstract

Deep learning models can generate virtual immunohistochemistry (IHC) stains from hematoxylin and eosin (H&E) images, offering a scalable and low-cost alternative to laboratory IHC. However, reliable evaluation of image quality remains a challenge as current texture- and distribution-based metrics quantify image fidelity rather than the accuracy of IHC staining. Here, we introduce an automated and accuracy-grounded framework to determine image quality across sixteen paired or unpaired image translation models. Using color deconvolution, we generate masks of pixels stained brown (i.e., IHC-positive) as predicted by each virtual IHC model. We use the segmented masks of real and virtual IHC to compute stain accuracy metrics (Dice, IoU, Hausdorff distance) that directly quantify correct pixel-level labeling without needing expert manual annotations. Our results demonstrate that conventional image fidelity metrics, including Fréchet Inception Distance (FID), peak signal-to-noise ratio (PSNR), and structural similarity (SSIM), correlate poorly with stain accuracy and pathologist assessment. Paired models such as PyramidPix2Pix and AdaptiveNCE achieve the highest stain accuracy, whereas unpaired diffusion- and GAN-based models are less reliable in providing accurate IHC positive pixel labels. Moreover, whole-slide images (WSI) reveal performance declines that are invisible in patch-based evaluations, emphasizing the need for WSI-level benchmarks. Together, this framework defines a reproducible approach for assessing the quality of virtual IHC models, a critical step to accelerate translation towards routine use by pathologists.
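
Two of the stain accuracy metrics named above, Dice and IoU, reduce to simple counts once the IHC-positive masks exist. The sketch below assumes the masks have already been produced (the color deconvolution step is omitted) and are stored as nested 0/1 lists; the function name `stain_overlap` and the example masks are hypothetical.

```python
def stain_overlap(real_mask, virtual_mask):
    """Return (dice, iou) for two binary masks given as nested 0/1 lists."""
    real = [p for row in real_mask for p in row]
    virtual = [p for row in virtual_mask for p in row]
    intersection = sum(1 for r, v in zip(real, virtual) if r and v)
    total = sum(real) + sum(virtual)
    union = total - intersection
    if total == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    return 2 * intersection / total, intersection / union


# Made-up 2x3 masks of IHC-positive (brown) pixels.
real_ihc = [[1, 1, 0], [0, 1, 0]]
virtual_ihc = [[1, 0, 0], [0, 1, 1]]
dice, iou = stain_overlap(real_ihc, virtual_ihc)
```

Unlike FID, PSNR, or SSIM, both scores depend only on whether the positive pixels land in the right place, which is the property the abstract argues fidelity metrics miss.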

中文标题/摘要

标题：在虚拟免疫组化中建立信任：基于自动评估的图像质量

深度学习模型可以从苏木精和伊红（H&E）图像生成虚拟免疫组化（IHC）染色，提供一种可扩展且低成本的实验室IHC替代方案。然而，可靠地评估图像质量仍然是一个挑战，因为当前基于纹理和分布的指标量化的是图像保真度而非IHC染色的准确性。在这里，我们介绍了一种基于自动和准确性框架来确定十六个配对或非配对图像翻译模型的图像质量。通过颜色反卷积，我们生成了每个虚拟IHC模型预测的棕色（即IHC阳性）像素的掩码。我们使用真实和虚拟IHC的分割掩码来计算染色准确性指标（Dice、IoU、Hausdorff距离），直接量化正确的像素级标签，无需专家手动注释。我们的结果表明，传统的图像保真度指标，包括弗雷切特入射距离（FID）、峰值信噪比（PSNR）和结构相似性（SSIM），与染色准确性及病理学家评估的相关性较差。配对模型如金字塔Pix2Pix和自适应NCE获得最高的染色准确性，而基于扩散和GAN的非配对模型在提供准确的IHC阳性像素标签方面可靠性较低。此外，整个切片图像（WSI）在基于斑块的评估中无法揭示的性能下降强调了需要WSI级别的基准。总体而言，该框架定义了一种可重复的方法来评估虚拟IHC模型的质量，这是加速向病理学家常规使用的关键步骤。

Summary / 总结

This study aims to evaluate the quality of virtual immunohistochemistry (IHC) generated from hematoxylin and eosin (H&E) images using deep learning models. The authors introduce an automated framework that uses color deconvolution to generate masks of IHC-positive pixels and computes stain accuracy metrics such as Dice, IoU, and Hausdorff distance. The results show that conventional image fidelity metrics like FID, PSNR, and SSIM poorly correlate with stain accuracy and pathologist assessment. Paired models like PyramidPix2Pix and AdaptiveNCE outperform unpaired diffusion- and GAN-based models in providing accurate IHC positive pixel labels, and whole-slide images reveal performance declines not visible in patch-based evaluations.

该研究旨在评估从苏木精和伊红（H&E）图像生成的虚拟免疫组织化学（IHC）的质量，使用了基于颜色反卷积的自动化框架生成IHC阳性像素的掩码，并计算Dice、IoU和Hausdorff距离等染色准确性指标。结果表明，传统的图像保真度指标与染色准确性和病理学家评估的相关性较差，而配对模型如PyramidPix2Pix和AdaptiveNCE表现更好，而非配对的扩散-和GAN基模型在提供准确的IHC阳性像素标签方面可靠性较低。全切片图像揭示了在基于斑块的评估中看不到的性能下降，强调了全切片图像（WSI）级别基准的重要性。

Deep Edge Filter: Return of the Human-Crafted Layer in Deep Learning

Authors: Dongkwan Lee, Junhoo Lee, Nojun Kwak

First: 2025-10-13T07:56:55+00:00 · Latest: 2025-11-06T18:08:46+00:00

Comments: NeurIPS 2025

Abstract

We introduce the Deep Edge Filter, a novel approach that applies high-pass filtering to deep neural network features to improve model generalizability. Our method is motivated by our hypothesis that neural networks encode task-relevant semantic information in high-frequency components while storing domain-specific biases in low-frequency components of deep features. By subtracting low-pass filtered outputs from original features, our approach isolates generalizable representations while preserving architectural integrity. Experimental results across diverse domains such as Vision, Text, 3D, and Audio demonstrate consistent performance improvements regardless of model architecture and data modality. Analysis reveals that our method induces feature sparsification and effectively isolates high-frequency components, providing empirical validation of our core hypothesis. The code is available at https://github.com/dongkwani/DeepEdgeFilter.
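
The core operation, subtracting a low-pass filtered copy from the original features, can be sketched on a 1D feature vector. This is an illustration under assumptions: a moving-average filter with edge replication stands in for the low-pass filter (the abstract does not say which filter is used), `kernel_size` is assumed odd, and the function names are hypothetical rather than taken from the linked repository.

```python
def low_pass(features, kernel_size=3):
    """Moving-average filter with edge replication (same output length)."""
    half = kernel_size // 2
    last = len(features) - 1
    smoothed = []
    for i in range(len(features)):
        window = [features[min(max(j, 0), last)]
                  for j in range(i - half, i + half + 1)]
        smoothed.append(sum(window) / kernel_size)
    return smoothed


def deep_edge_filter(features, kernel_size=3):
    """High-pass the features: original minus low-pass output."""
    smoothed = low_pass(features, kernel_size)
    return [x - s for x, s in zip(features, smoothed)]


flat = deep_edge_filter([2.0, 2.0, 2.0, 2.0])
edge = deep_edge_filter([0.0, 0.0, 3.0, 0.0, 0.0])
```

A constant feature vector is removed entirely, while an isolated spike survives with negative side lobes. The output has the same shape as the input, so the operation can sit between existing layers without changing the architecture.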

中文标题/摘要

标题：深度边缘滤波器：深度学习中的人工构建层的回归

我们介绍了深度边缘滤波器，这是一种新颖的方法，通过在深度神经网络特征上应用高通滤波来提高模型的泛化能力。我们的方法受到假设的启发，即神经网络在深层特征的高频分量中编码任务相关的语义信息，而在低频分量中存储领域特定的偏差。通过从原始特征中减去低通滤波输出，我们的方法隔离了可泛化的表示，同时保持了架构的完整性。实验结果表明，无论模型架构和数据模态如何，该方法在视觉、文本、3D和音频等多个领域都表现出一致的性能提升。分析表明，我们的方法导致了特征稀疏化，并有效地隔离了高频分量，为我们的核心假设提供了实证验证。代码可在https://github.com/dongkwani/DeepEdgeFilter 获取。

Summary / 总结

The research introduces the Deep Edge Filter, which applies high-pass filtering to deep neural network features to enhance model generalizability. Motivated by the hypothesis that high-frequency components contain task-relevant information while low-frequency components store domain-specific biases, the method subtracts low-pass filtered outputs from original features. Experiments across various domains show consistent performance improvements, and analysis confirms feature sparsification and isolation of high-frequency components.

研究引入了Deep Edge Filter，该方法通过对深度神经网络特征进行高通滤波来提升模型的泛化能力。该方法基于假设，神经网络在高频成分中编码任务相关信息，在低频成分中存储领域特定的偏见，通过从原始特征中减去低通滤波输出来隔离可泛化的表示。该方法在视觉、文本、3D和音频等多个领域的一致性实验结果中表现出性能提升，并且分析显示特征稀疏化和有效隔离高频成分，验证了核心假设。代码可在GitHub上获得。

evomap: A Toolbox for Dynamic Mapping in Python

Authors: Maximilian Matthe

First: 2025-11-06T18:02:58+00:00 · Latest: 2025-11-06T18:02:58+00:00

Comments: Accepted for publication by the Journal of Statistical Software

Abstract

This paper presents evomap, a Python package for dynamic mapping. Mapping methods are widely used across disciplines to visualize relationships among objects as spatial representations, or maps. However, most existing statistical software supports only static mapping, which captures objects' relationships at a single point in time and lacks tools to analyze how these relationships evolve. evomap fills this gap by implementing the dynamic mapping framework EvoMap, originally proposed by Matthe, Ringel, and Skiera (2023), which adapts traditional static mapping methods for dynamic analyses. The package supports multiple mapping techniques, including variants of Multidimensional Scaling (MDS), Sammon Mapping, and t-distributed Stochastic Neighbor Embedding (t-SNE). It also includes utilities for data preprocessing, exploration, and result evaluation, offering a comprehensive toolkit for dynamic mapping applications. This paper outlines the foundations of static and dynamic mapping, describes the architecture and functionality of evomap, and illustrates its application through an extensive usage example.

中文标题/摘要

标题：evomap：Python中的动态映射工具箱

本文介绍了evomap，这是一个用于动态映射的Python软件包。映射方法在各个学科中被广泛使用，用于将对象之间的关系可视化为空间表示或地图。然而，现有的大多数统计软件仅支持静态映射，只能捕捉对象在某一时间点的关系，缺乏分析这些关系如何演变的工具。evomap通过实现Matthe、Ringel和Skiera（2023）提出的动态映射框架EvoMap来填补这一空白，该框架将传统的静态映射方法适应于动态分析。该软件包支持多种映射技术，包括多维尺度（MDS）的变体、Sammon映射和t分布随机邻域嵌入（t-SNE）。它还包括数据预处理、探索和结果评估的工具，为动态映射应用提供了一个全面的工具包。本文概述了静态和动态映射的基础，描述了evomap的架构和功能，并通过一个详尽的应用示例进行了说明。

Summary / 总结

The paper introduces evomap, a Python package designed for dynamic mapping, addressing the limitation of existing software that only supports static mapping. By implementing the dynamic mapping framework EvoMap, evomap adapts traditional static mapping techniques like MDS, Sammon Mapping, and t-SNE for analyzing evolving relationships over time. The package offers a suite of tools for data preprocessing, exploration, and result evaluation, providing a comprehensive toolkit for dynamic mapping applications.

本文介绍了evomap，这是一个用于动态映射的Python包，解决了现有软件只能支持静态映射的局限性。通过实现动态映射框架EvoMap，evomap将传统的静态映射方法如MDS、Sammon Mapping和t-SNE适应于动态分析。该包包括数据预处理、探索和结果评估的工具，提供了一个全面的动态映射应用工具包。

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Authors: Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang

First: 2025-11-06T17:54:12+00:00 · Latest: 2025-11-06T17:54:12+00:00

Abstract

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Second, we replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP achieves breakthroughs in pixel-level interaction and in handling long-form texts, reaching state-of-the-art performance.

中文标题/摘要

标题：PixCLIP：通过任意粒度像素-文本对齐学习实现精细粒度的视觉语言理解

尽管对比语言-图像预训练(CLIP)模型在各种下游视觉语言理解任务中取得了显著成功，但增强其对精细粒度图像-文本对齐的能力仍然是一个活跃的研究焦点。为此，大多数现有工作采用显式增加视觉信息处理粒度的策略，例如，通过引入视觉提示来引导模型关注图像中的特定局部区域。同时，多模态大型语言模型(MLLMs)的研究表明，使用长而详细的文本描述进行训练可以有效提高模型的精细粒度视觉-语言对齐能力。然而，CLIP文本编码器固有的令牌长度限制从根本上限制了CLIP处理长文本序列中嵌入的更精细文本信息的能力。为了协同利用增强视觉和文本内容处理粒度的优势，我们提出PixCLIP，一种新型框架，旨在同时接受视觉提示输入并处理长文本描述。具体而言，我们首先建立了一个自动注释流水线，能够为图像生成像素级局部化、长形式的文本描述。利用该流水线，我们构建了包含近150万个样本的高质量LongGRIT数据集。其次，我们用LLM替换CLIP的原始文本编码器，并提出了一种三支路像素-文本对齐学习框架，促进图像区域与相应文本描述在任意粒度下的精细对齐。实验表明，PixCLIP在像素级交互和处理长文本方面取得了突破，实现了最先进的性能。

Summary / 总结

PixCLIP is designed to improve fine-grained vision-language understanding by accepting visual prompt inputs and processing long textual descriptions, replacing CLIP's original text encoder with an LLM. It constructs the LongGRIT dataset, with nearly 1.5 million samples, through an automated annotation pipeline and proposes a three-branch pixel-text alignment learning framework. Experimental results show that PixCLIP achieves state-of-the-art performance in pixel-level interaction and in handling long-form texts.

PixCLIP旨在通过结合像素级局部化文本描述和长文本处理来增强细粒度的视觉语言理解。它引入了一个自动注释管道来生成详细的文本描述，并提出了一种三分支像素-文本对齐学习框架。实验结果表明，PixCLIP在像素级交互和长文本处理方面取得了突破，达到了最先进的性能。

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

Authors: Hampus Åström, Elin Anna Topp, Jacek Malec

First: 2025-11-06T17:51:11+00:00 · Latest: 2025-11-06T17:51:11+00:00

Comments: 8 pages without cover, references and supplementary materials, 11 with. Submitted to RLC 2025's workshop RLBrew and IMOL 2025

Abs · PDF · Code1 · Code2

Abstract

In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.

中文标题/摘要

标题：环境无关的目标条件化：无奖励自主学习研究

在本文中，我们研究如何将常规强化学习环境转换为目标条件化环境，从而使智能体能够自主且无奖励地学习解决任务。我们展示了智能体可以通过在环境无关的方式选择自己的目标来学习解决任务，在训练时间上与外部引导的强化学习相当。我们的方法独立于底层的离策略学习算法。由于我们的方法是环境无关的，智能体不会将任何目标的价值视为高于其他目标，这会导致单个目标的性能不稳定。然而，在我们的实验中，我们展示了平均目标成功率的提高和稳定。使用此方法训练的智能体可以被指示寻求环境中的任何观察结果，从而在特定用例之前实现通用智能体的训练。

Summary / 总结

This paper investigates transforming reinforcement learning environments into goal-conditioned ones to enable autonomous, reward-free learning. The method allows agents to learn by setting their own goals, achieving comparable training times to externally guided reinforcement learning. While individual goal performance can be unstable, the average success rate improves and stabilizes. Agents can be trained to pursue any observations from the environment, facilitating general training before specific tasks.

研究旨在通过将常规强化学习环境转换为基于目标的环境，实现自主且无需奖励的学习。该方法允许代理自主设定目标进行学习，训练时间与外部引导的强化学习相当。尽管单个目标可能存在不稳定，但总体成功率会提高并趋于稳定。代理可以被训练去寻求环境中的任何观察结果，从而在特定用途前实现通用训练。
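
The idea of turning a regular environment into a goal-conditioned, reward-free one can be sketched as a wrapper. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy `LineWorld` environment, the `GoalConditioned` class, and the choice to sample goals uniformly from previously seen observations are all hypothetical names and simplifications introduced here.

```python
import random


class LineWorld:
    """Toy environment (hypothetical): an agent walks on positions 0..size-1."""

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 or +1; the position is clipped to the valid range
        self.pos = max(0, min(self.size - 1, self.pos + action))
        return self.pos


class GoalConditioned:
    """Wraps an environment so each episode targets a self-selected goal.

    Goals are drawn uniformly from observations seen so far (an assumption),
    so no goal is valued above another and no external reward is used.
    """

    def __init__(self, env, rng=None):
        self.env = env
        self.rng = rng or random.Random(0)
        self.seen = []   # distinct observations encountered so far
        self.goal = None

    def _remember(self, obs):
        if obs not in self.seen:
            self.seen.append(obs)

    def reset(self):
        obs = self.env.reset()
        self._remember(obs)
        self.goal = self.rng.choice(self.seen)
        return obs, self.goal

    def step(self, action):
        obs = self.env.step(action)
        self._remember(obs)
        reached = obs == self.goal
        # The only learning signal is whether the chosen goal was reached.
        return obs, self.goal, float(reached), reached
```

Any off-policy learner could then be trained on the `(obs, goal, action, signal)` tuples this wrapper produces, which is consistent with the abstract's claim that the approach is independent of the underlying learning algorithm.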

UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction

Authors: Chen Shi, Shaoshuai Shi, Xiaoyang Lyu, Chunyang Liu, Kehua Sheng, Bo Zhang, Li Jiang

First: 2025-11-06T17:49:39+00:00 · Latest: 2025-11-06T17:49:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining point-anchored refinement with voxel-based generation, and maintain a persistent memory of static Gaussians to enable streaming scene completion beyond current camera coverage. Extensive experiments on real-world datasets demonstrate that UniSplat achieves state-of-the-art performance in novel view synthesis, while providing robust and high-quality renderings even for viewpoints outside the original camera coverage.

中文标题/摘要

标题：UniSplat：通过3D潜在支架实现统一时空融合的动态驾驶场景重建

面向自主驾驶的前馈3D重建技术取得了快速进展，但现有方法难以应对稀疏、不重叠的摄像机视图和复杂场景动态的联合挑战。我们提出UniSplat，这是一种通用的前馈框架，通过统一的潜在时空融合学习稳健的动态场景重建。UniSplat构建了一个3D潜在支架，这是一种结构化的表示，通过利用预训练的基础模型捕捉几何和语义场景上下文。为了有效地整合空间视图和时间帧之间的信息，我们引入了一种高效的融合机制，该机制直接在3D支架内操作，实现一致的时空对齐。为了确保完整的详细重建，我们设计了一种双分支解码器，从融合的支架中生成动态感知的高斯分布，结合点锚定细化与体素生成，同时保持静态高斯分布的持久记忆，以实现超出当前摄像机覆盖范围的流式场景完成。在真实世界数据集上的广泛实验表明，UniSplat在新颖视图合成方面达到了最先进的性能，即使对于超出原始摄像机覆盖范围的视角也能提供稳健且高质量的渲染。

Summary / 总结

UniSplat is a unified feed-forward framework for dynamic driving scene reconstruction that addresses the challenges of sparse and non-overlapping camera views and complex scene dynamics. It constructs a 3D latent scaffold to capture geometric and semantic scene context and introduces an efficient fusion mechanism for consistent spatio-temporal alignment. The dual-branch decoder generates dynamic-aware Gaussians and maintains a persistent memory of static Gaussians, enabling detailed reconstructions beyond the current camera coverage. Experiments show that UniSplat outperforms existing methods in novel view synthesis and provides robust renderings for viewpoints outside the original camera coverage.

UniSplat 是一个统一的前馈框架，旨在实现自主驾驶中的稳健动态场景重建。它构建了一个 3D 潜在支架来捕捉几何和语义场景上下文，并引入了一种高效的融合机制以实现一致的空间-时间对齐。双分支解码器生成动态感知的高斯分布，并通过保持静态高斯的持久记忆确保完整的重建。实验表明，UniSplat 在新颖视图合成中优于现有方法，并且即使在原始摄像头覆盖范围之外也能提供高质量的渲染结果。

Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems

Authors: Utkarsh U. Chavan, Prashant Trivedi, Nandyala Hemachandra

Venue: NeurIPS 2025

First: 2025-11-06T17:49:33+00:00 · Latest: 2025-11-06T17:49:33+00:00

Comments: To appear in 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abs · PDF · Code1 · Code2

Abstract

Multi-agent systems (MAS) are central to applications such as swarm robotics and traffic routing, where agents must coordinate in a decentralized manner to achieve a common objective. Stochastic Shortest Path (SSP) problems provide a natural framework for modeling decentralized control in such settings. While the problem of learning in SSP has been extensively studied in single-agent settings, the decentralized multi-agent variant remains largely unexplored. In this work, we take a step towards addressing that gap. We study decentralized multi-agent SSPs (Dec-MASSPs) under linear function approximation, where the transition dynamics and costs are represented using linear models. Applying novel symmetry-based arguments, we identify the structure of optimal policies. Our main contribution is the first regret lower bound for this setting based on the construction of hard-to-learn instances for any number of agents, $n$. Our regret lower bound of $\Omega(\sqrt{K})$, over $K$ episodes, highlights the inherent learning difficulty in Dec-MASSPs. These insights clarify the learning complexity of decentralized control and can further guide the design of efficient learning algorithms in multi-agent systems.

中文标题/摘要

标题：分散式多智能体随机最短路径问题的后悔下界

多智能体系统（MAS）在诸如群集机器人和交通路由等应用中至关重要，其中智能体必须以分散的方式协调行动以实现共同目标。随机最短路径（SSP）问题为在这些环境中建模分散控制提供了自然框架。尽管在单智能体设置中学习SSP的问题已被广泛研究，但分散式多智能体变体仍然鲜有探索。在本文中，我们朝着填补这一空白迈出了一步。我们研究了在线性函数逼近下的分散式多智能体SSP（Dec-MASSPs），其中转移动力学和成本用线性模型表示。通过新颖的对称性论证，我们确定了最优策略的结构。我们的主要贡献是在Dec-MASSPs设置中首次基于构建了任意数量智能体的难以学习实例，提出了后悔下界$\Omega(\sqrt{K})$，其中$K$表示回合数。这些见解阐明了分散控制的学习复杂性，并可进一步指导多智能体系统中高效学习算法的设计。
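
For readers unfamiliar with the quantity being bounded, the following sketch writes out regret using the standard single-agent SSP definition. Treating it as the definition used for Dec-MASSPs is an assumption, and the symbols $c$ (cost), $I_k$ (length of episode $k$), $V^{\star}$ (optimal expected cost-to-goal), $s_{\mathrm{init}}$ (initial state), and $C$ (a constant independent of $K$) are notation introduced here rather than taken from the abstract.

```latex
R_K \;=\; \sum_{k=1}^{K} \sum_{h=1}^{I_k} c\bigl(s_h^k, a_h^k\bigr) \;-\; K \, V^{\star}(s_{\mathrm{init}}),
\qquad
\mathbb{E}[R_K] \;\ge\; C \sqrt{K} \quad \text{on the hard-to-learn instances.}
```

Read this way, an $\Omega(\sqrt{K})$ lower bound says that for any learning algorithm there is an instance on which the cumulative excess cost over $K$ episodes grows at least proportionally to $\sqrt{K}$.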

Complexity as Advantage: A Regret-Based Perspective on Emergent Structure

Authors: Oshri Naparstek

Venue: Not yet published (in preparation for submission to ICML 2026)

First: 2025-11-06T17:46:53+00:00 · Latest: 2025-11-06T17:46:53+00:00

Comments: 15 pages. Under preparation for submission to ICML 2026. Feedback welcome

Abs · PDF · Code1 · Code2

Abstract

We introduce Complexity as Advantage (CAA), a framework that defines the complexity of a system relative to a family of observers. Instead of measuring complexity as an intrinsic property, we evaluate how much predictive regret a system induces for different observers attempting to model it. A system is complex when it is easy for some observers and hard for others, creating an information advantage. We show that this formulation unifies several notions of emergent behavior, including multiscale entropy, predictive information, and observer-dependent structure. The framework suggests that "interesting" systems are those positioned to create differentiated regret across observers, providing a quantitative grounding for why complexity can be functionally valuable. We demonstrate the idea through simple dynamical models and discuss implications for learning, evolution, and artificial agents.

中文标题/摘要

标题：复杂性作为优势：基于后悔的观点探讨涌现结构

我们引入了复杂性作为优势（CAA），这是一种框架，它将系统的复杂性定义为相对于一组观察者的复杂性。我们不是衡量复杂性作为一种内在属性，而是评估系统对试图对其建模的不同观察者所引起的预测后悔量。当系统对某些观察者容易而对其他观察者困难时，系统是复杂的，从而创造了一种信息优势。我们展示了这种表述统一了多种涌现行为的概念，包括多尺度熵、预测信息和观察者依赖的结构。该框架表明，“有趣”的系统是那些能够在观察者之间创造差异化后悔的位置，从而为复杂性为何具有功能价值提供了定量的基础。我们通过简单的动力学模型展示了这一概念，并讨论了其对学习、进化和人工代理的影响。

Summary / 总结

The paper introduces Complexity as Advantage (CAA), a framework that measures the complexity of a system based on the predictive regret it induces on different observers. This approach shows that a system is complex when it is easy for some observers and hard for others, creating an information advantage. The framework unifies various concepts of emergent behavior and suggests that systems that create differentiated regret across observers are functionally valuable. The authors demonstrate this through simple dynamical models and discuss its implications for learning, evolution, and artificial agents.

论文提出了复杂性作为优势（CAA）的框架，该框架根据系统对不同观察者的预测后悔来衡量复杂性。这种方法表明，当一个系统对某些观察者容易预测而对其他观察者难以预测时，系统是复杂的，从而创造了信息优势。该框架统一了各种涌现行为的概念，并表明能够跨观察者创造不同后悔的系统具有功能价值。作者通过简单的动力学模型展示了这一观点，并讨论了其对学习、进化和人工代理的含义。

Information-driven design of imaging systems

Authors: Henry Pinkard, Leyla Kabuli, Eric Markley, Tiffany Chien, Jiantao Jiao, Laura Waller

First: 2024-05-31T00:57:58+00:00 · Latest: 2025-11-06T17:33:32+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Imaging systems have traditionally been designed to mimic the human eye and produce visually interpretable measurements. Modern imaging systems, however, process raw measurements computationally before or instead of human viewing. As a result, the information content of raw measurements matters more than their visual interpretability. Despite the importance of measurement information content, current approaches for evaluating imaging system performance do not quantify it: they instead either use alternative metrics that assess specific aspects of measurement quality or assess measurements indirectly with performance on secondary tasks. We developed the theoretical foundations and a practical method to directly quantify mutual information between noisy measurements and unknown objects. By fitting probabilistic models to measurements and their noise characteristics, our method estimates information by upper bounding its true value. By applying gradient-based optimization to these estimates, we also developed a technique for designing imaging systems called Information-Driven Encoder Analysis Learning (IDEAL). Our information estimates accurately captured system performance differences across four imaging domains (color photography, radio astronomy, lensless imaging, and microscopy). Systems designed with IDEAL matched the performance of those designed with end-to-end optimization, the prevailing approach that jointly optimizes hardware and image processing algorithms. These results establish mutual information as a universal performance metric for imaging systems that enables both computationally efficient design optimization and evaluation in real-world conditions. A video summarizing this work can be found at: https://waller-lab.github.io/EncodingInformationWebsite/

Summary / 总结

This research addresses the need for a new approach to designing imaging systems that focuses on the information content of raw measurements rather than their visual interpretability. The authors developed a method that directly quantifies mutual information between noisy measurements and unknown objects by fitting probabilistic models to the measurements and their noise characteristics. Applying gradient-based optimization to these estimates yields a design technique called Information-Driven Encoder Analysis Learning (IDEAL). The information estimates captured system performance differences across four domains (color photography, radio astronomy, lensless imaging, and microscopy), and systems designed with IDEAL matched the performance of those designed with end-to-end optimization. This work establishes mutual information as a universal performance metric for imaging systems, enabling efficient design and evaluation in real-world conditions.

该研究针对传统成像系统注重视觉可解释性的问题，引入了一种直接量化原始测量信息量的方法。通过使用概率模型和梯度优化，信息驱动编码分析学习（IDEAL）技术设计的成像系统能够与端到端优化方法的性能相匹配。该方法准确地捕捉了不同成像领域的性能差异，并将互信息确立为评估和设计成像系统的一种通用指标，能够在实际条件下高效地进行性能评价和优化。

Physics-Informed Neural Networks and Neural Operators for Parametric PDEs: A Human-AI Collaborative Analysis

Authors: Zhuo Zhang, Xiong Xiong, Sen Zhang, Yuan Zhao, Xi Yang

First: 2025-11-06T17:31:59+00:00 · Latest: 2025-11-06T17:31:59+00:00

Comments: 61 pages, 3 figures. Submitted to The 1st International Conference on AI Scientists (ICAIS 2025)

Abs · PDF · Code1 · Code2

Abstract

PDEs arise ubiquitously in science and engineering, where solutions depend on parameters (physical properties, boundary conditions, geometry). Traditional numerical methods require re-solving the PDE for each parameter, making parameter space exploration prohibitively expensive. Recent machine learning advances, particularly physics-informed neural networks (PINNs) and neural operators, have revolutionized parametric PDE solving by learning solution operators that generalize across parameter spaces. We critically analyze two main paradigms: (1) PINNs, which embed physical laws as soft constraints and excel at inverse problems with sparse data, and (2) neural operators (e.g., DeepONet, Fourier Neural Operator), which learn mappings between infinite-dimensional function spaces and achieve unprecedented generalization. Through comparisons across fluid dynamics, solid mechanics, heat transfer, and electromagnetics, we show neural operators can achieve computational speedups of $10^3$ to $10^5$ times over traditional solvers for multi-query scenarios, while maintaining comparable accuracy. We provide practical guidance for method selection, discuss theoretical foundations (universal approximation, convergence), and identify critical open challenges: high-dimensional parameters, complex geometries, and out-of-distribution generalization. This work establishes a unified framework for understanding parametric PDE solvers via operator learning, offering a comprehensive, incrementally updated resource for this rapidly evolving field.

中文标题/摘要

标题：物理知情神经网络和神经算子在参数化偏微分方程中的应用：人机协作分析

偏微分方程（PDEs）在科学和工程中无处不在，其解依赖于参数（物理属性、边界条件、几何形状）。传统数值方法需要为每个参数重新求解PDE，使得参数空间探索变得极其昂贵。最近的机器学习进展，特别是物理知情神经网络（PINNs）和神经算子，通过学习能够跨参数空间泛化的解算子，彻底革新了参数化PDE求解。我们批判性地分析了两种主要范式：（1）PINNs，它将物理定律嵌入为软约束，擅长处理稀疏数据的逆问题；（2）神经算子（例如DeepONet，傅里叶神经算子），它们学习无穷维函数空间之间的映射，并实现了前所未有的泛化能力。通过流体动力学、固体力学、热传导和电磁学领域的比较，我们展示了在多查询场景中，神经算子可以比传统求解器快1000到100000倍，同时保持相当的准确性。我们提供了方法选择的实用指导，讨论了理论基础（通用逼近性、收敛性），并指出了关键的开放挑战：高维参数、复杂几何形状和离分布泛化。本研究建立了一个统一框架，通过算子学习理解参数化PDE求解器，提供了一个全面且逐步更新的资源，以应对这一快速发展的领域

Summary / 总结

The paper explores the use of physics-informed neural networks (PINNs) and neural operators for solving parametric partial differential equations (PDEs). It compares these methods in fluid dynamics, solid mechanics, heat transfer, and electromagnetics, showing that neural operators can achieve up to $10^5$ times faster computation than traditional solvers while maintaining similar accuracy. The study also discusses theoretical foundations and practical considerations for method selection, highlighting challenges in high-dimensional parameters and out-of-distribution generalization.

该论文探讨了使用物理信息神经网络（PINNs）和神经算子解决参数偏微分方程（PDEs）的方法。通过在流体动力学、固体力学、热传递和电磁学中的比较，研究显示神经算子可以在比传统求解器快10万倍的情况下实现类似精度的计算。研究还讨论了这些方法的理论基础和实际应用中的考虑因素，指出了高维参数和离分布外泛化的挑战。

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Authors: Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

First: 2025-11-05T09:40:16+00:00 · Latest: 2025-11-06T17:28:59+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

中文标题/摘要

标题：SurgViVQA：手术场景理解中的时间接地视频问答

手术领域的视频问答（VideoQA）旨在通过使AI模型能够在时间上连贯的事件中进行推理，而不是孤立的帧，来增强术中理解。当前的方法仅限于静态图像特征，可用的数据集往往缺乏时间注释，忽略了准确的程序解释所需的动态信息。我们提出了SurgViVQA，这是一种扩展视觉推理从静态图像到动态手术场景的手术视频问答模型。它使用掩蔽视频-文本编码器融合视频和问题特征，捕捉诸如运动和工具-组织交互等时间线索，然后由微调的大语言模型（LLM）解码为连贯的答案。为了评估其性能，我们编纂了REAL-Colon-VQA数据集，包括与运动相关的问题和诊断属性，以及超出模板的问题，以评估模型的鲁棒性。在REAL-Colon-VQA和公开的EndoVis18-VQA数据集上的实验验证表明，SurgViVQA在关键词准确性方面优于现有的基于图像的VQA基准模型，特别是在REAL-Colon-VQA上提高了11%，在EndoVis18-VQA上提高了9%。进一步的扰动研究还证实了其对问题表述变化的泛化能力和鲁棒性。SurgViVQA和REAL-Colon-VQA数据集为手术视频问答中的时间感知理解提供了一个框架，使AI模型能够更有效地解释动态程序背景。代码和数据集可在https://github.com/madratak/SurgViVQA/获取。

ARETE: an R package for Automated REtrieval from TExt with large language models

Authors: Vasco V. Branco, Jandó Benedek, Lidia Pivovarova, Luís Correia, Pedro Cardoso

First: 2025-11-06T17:26:48+00:00 · Latest: 2025-11-06T17:26:48+00:00

Abs · PDF · Code1 · Code2

Abstract

1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information, but their data are often not machine-readable, requiring extensive human work to be retrieved. 2. We present the ARETE R package, open-source software that aims to automate the extraction of species occurrence data using large language models, namely through the ChatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE through systematic comparison between what is modelled and the work of human annotators. 3. We demonstrate the usefulness of the approach by comparing range maps produced using GBIF data with those automatically extracted for 100 species of spiders. Newly extracted data allowed us to expand the known Extent of Occurrence by a mean of three orders of magnitude, revealing new areas where the species were found in the past, which may have important implications for spatial conservation planning and extinction risk assessments. 4. ARETE allows faster access to hitherto untapped occurrence data, a potential game changer in projects requiring such data. Researchers will be able to better prioritize resources, manually verifying selected species while maintaining automated extraction for the majority. This workflow also allows predicting available bibliographic data during project planning.

中文标题/摘要

标题：ARETE：一种基于大型语言模型的R包，用于自动从文本中检索数据

1. 严格保护措施的实施受到关键物种数据，尤其是分布数据的缺乏限制。此外，由于人类活动导致新信息的收集和处理速度加快，研究人员必须应对这一挑战。从科学论文到灰色文献的出版物中包含这些关键信息，但其数据通常不可机器读取，需要大量人力工作才能提取。2. 我们介绍了ARETE R包，这是一个开源软件，旨在利用大型语言模型（特别是使用chatGPT应用程序编程接口）自动化物种分布数据的提取过程。该R包整合了从光学字符识别到异常值检测和表格输出的所有数据提取和验证步骤。此外，我们通过系统比较模型结果和人类注释者的成果来验证ARETE。3. 我们通过将使用GBIF数据生成的分布图与自动提取的100种蜘蛛物种的分布图进行比较，展示了该方法的有效性。新提取的数据使已知分布范围扩展了三个数量级，揭示了过去物种分布的新区域，这对空间保护规划和灭绝风险评估具有重要意义。4. ARETE允许更快地访问以前未被利用的分布数据，这可能在需要此类数据的项目中成为游戏规则的改变者。研究人员将能够更好地优先分配资源，手动验证选定的物种，同时保持对大多数物种的自动化提取。此工作流程还允许在项目规划期间预测可用的文献数据。

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

First: 2025-11-06T17:25:23+00:00 · Latest: 2025-11-06T17:25:23+00:00

Comments: 36 pages, 14 figures

Abs · PDF · Code1 · Code2

Abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are separated as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning "thinking with video" as a unified multimodal reasoning paradigm.

中文标题/摘要

标题：视频思维：视频生成作为有前景的多模态推理范式

"文本思维"和"图像思维"范式显著提高了大型语言模型（LLMs）和视觉语言模型（VLMs）的推理能力。然而，这些范式存在固有的局限性。首先，图像只能捕捉单一时刻，无法表示动态过程或连续变化；其次，将文本和视觉视为不同的模态，阻碍了统一的多模态理解和生成。为克服这些局限，我们引入了“视频思维”这一新范式，利用视频生成模型（如Sora-2）在统一的时间框架内结合视觉和文本推理。为支持这一探索，我们开发了视频思维基准（VideoThinkBench）。VideoThinkBench 包含两类任务：（1）视觉中心任务（如眼力谜题），（2）文本中心任务（如GSM8K和MMMU的子集）。我们的评估表明Sora-2是一个有效的推理者。在视觉中心任务中，Sora-2通常与最先进的视觉语言模型（SOTA）相当，甚至在某些任务（如眼力游戏）上超过了VLMs。在文本中心任务中，Sora-2在MATH上的准确率为92%，在MMMU上的准确率为75.53%。此外，我们系统地分析了这些能力的来源。我们还发现，自我一致性与上下文学习可以提高Sora-2的性能。总之，我们的研究结果表明，视频生成模型可能是统一的多模态理解和生成模型，将“视频思维”定位为统一的多模态推理范式。

Are Minimal Radial Distortion Solvers Necessary for Relative Pose Estimation?

Authors: Charalambos Tzamos, Viktor Kocur, Yaqing Ding, Torsten Sattler, Zuzana Kukelova

First: 2024-10-08T12:30:29+00:00 · Latest: 2025-11-06T17:12:05+00:00

Comments: Code available at: https://github.com/kocurvik/rd or https://doi.org/10.5281/zenodo.14672694

Abs · PDF · Code1 · Code2 · Code3

Abstract

Estimating the relative pose between two cameras is a fundamental step in many applications such as Structure-from-Motion. The common approach to relative pose estimation is to apply a minimal solver inside a RANSAC loop. Highly efficient solvers exist for pinhole cameras. Yet, (nearly) all cameras exhibit radial distortion. Not modeling radial distortion leads to (significantly) worse results. However, minimal radial distortion solvers are significantly more complex than pinhole solvers, both in terms of run-time and implementation efforts. This paper compares radial distortion solvers with a simple-to-implement approach that combines an efficient pinhole solver with sampled radial distortion parameters. Extensive experiments on multiple datasets and RANSAC variants show that this simple approach performs similarly or better than the most accurate minimal distortion solvers at faster run-times while being significantly more accurate than faster non-minimal solvers. We clearly show that complex radial distortion solvers are not necessary in practice. Code and benchmark are available at https://github.com/kocurvik/rd.

中文标题/摘要

标题：最小径向失真求解器在相对姿态估计中是否必要？

在许多应用如结构从运动中，估计两台相机之间的相对姿态是一个基本步骤。相对姿态估计的常见方法是在RANSAC循环中应用最小求解器。对于针孔相机，存在高效的求解器。然而，几乎所有相机都表现出径向失真。不建模径向失真会导致显著更差的结果。然而，最小径向失真求解器在运行时间和实现方面比针孔求解器复杂得多。本文将径向失真求解器与一种简单易实现的方法进行了比较，该方法结合了高效的针孔求解器和采样的径向失真参数。在多个数据集和RANSAC变体上的大量实验表明，这种简单方法在运行速度更快的情况下，性能与最准确的最小失真求解器相当或更好，同时比更快的非最小求解器更准确。我们清楚地表明，在实践中，复杂的径向失真求解器并非必要。代码和基准可在https://github.com/kocurvik/rd 获取。

Summary / 总结

This paper investigates the necessity of using minimal radial distortion solvers for relative pose estimation. It compares these solvers with a simpler approach that combines an efficient pinhole solver with sampled radial distortion parameters. The experiments show that the simpler approach performs similarly to or better than the most accurate minimal distortion solvers at faster run-times, and that it is significantly more accurate than faster non-minimal solvers, indicating that complex radial distortion solvers are not essential in practice.

该论文探讨了最小径向失真求解器在相对位姿估计中的必要性。研究将这些求解器与一种更简单的方法进行了比较，该方法结合了高效的针孔模型求解器和采样的径向失真参数。实验结果显示，这种方法在运行速度更快且准确性更高的情况下，与最准确的最小失真求解器表现相似或更好。研究证明，在相对位姿估计任务中，复杂的径向失真求解器并非必要。

Measure-Theoretic Time-Delay Embedding

Authors: Jonah Botvinick-Greenhouse, Maria Oprea, Romit Maulik, Yunan Yang

First: 2024-09-13T12:20:41+00:00 · Latest: 2025-11-06T17:10:11+00:00

Comments: 41 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

The celebrated Takens' embedding theorem provides a theoretical foundation for reconstructing the full state of a dynamical system from partial observations. However, the classical theorem assumes that the underlying system is deterministic and that observations are noise-free, limiting its applicability in real-world scenarios. Motivated by these limitations, we formulate a measure-theoretic generalization that adopts an Eulerian description of the dynamics and recasts the embedding as a pushforward map between spaces of probability measures. Our mathematical results leverage recent advances in optimal transport. Building on the proposed measure-theoretic time-delay embedding theory, we develop a computational procedure that aims to reconstruct the full state of a dynamical system from time-lagged partial observations, engineered with robustness to handle sparse and noisy data. We evaluate our measure-based approach across several numerical examples, ranging from the classic Lorenz-63 system to real-world applications such as NOAA sea surface temperature reconstruction and ERA5 wind field reconstruction.

中文标题/摘要

标题：测度论时间延迟嵌入

著名的Takens嵌入定理为从部分观测重构动力系统完整状态提供了理论基础。然而，经典的定理假设底层系统是确定性的，并且观测是无噪声的，这限制了其在实际场景中的应用。受这些限制的启发，我们提出了一个测度论的推广，采用欧拉描述动力学，并将嵌入重新表述为概率测度空间之间的推进映射。我们的数学结果利用了最优传输领域的最新进展。基于提出的测度论时间延迟嵌入理论，我们开发了一种计算程序，旨在从时间延迟的部分观测中重构动力系统的完整状态，并设计了鲁棒性以处理稀疏和噪声数据。我们通过几个数值示例评估了基于测度的方法，从经典的Lorenz-63系统到实际应用，如NOAA海面温度重构和ERA5风场重构。

Summary / 总结

This paper addresses the limitations of Takens' embedding theorem by proposing a measure-theoretic generalization that can handle sparse and noisy data. The method formulates the embedding as a pushforward map between spaces of probability measures, leveraging recent advances in optimal transport. Key experimental findings include successful reconstruction of the full state of dynamical systems from time-lagged partial observations, demonstrated through both numerical examples and real-world applications like sea surface temperature and wind field reconstructions.

本文通过提出测度论时间延迟嵌入方法，解决了Takens嵌入定理的局限性。该方法采用动力学的欧拉描述，并利用最优运输理论处理稀疏和噪声数据。主要发现包括成功重构洛伦兹-63系统以及实际应用如海表面温度和风场重构的完整状态。
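
As background for the abstract above, the classical (deterministic, noise-free) delay-coordinate construction that Takens' theorem concerns can be sketched as follows. This is the textbook pointwise embedding, not the measure-theoretic generalization the paper proposes; the function name `delay_embed` and its parameters `dim` and `lag` are illustrative choices made here.

```python
import numpy as np


def delay_embed(x, dim, lag):
    """Return the delay-coordinate matrix of a scalar series x.

    Row t is (x[t], x[t + lag], ..., x[t + (dim - 1) * lag]).
    """
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("series too short for this dim and lag")
    # Column i holds the series shifted forward by i * lag samples.
    return np.stack([x[i * lag : i * lag + n_rows] for i in range(dim)], axis=1)
```

Each row of the result is one reconstructed state vector built from time-lagged copies of a single observed variable. The abstract's approach instead treats the embedding as a pushforward between probability measures, which this sketch does not attempt to reproduce.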

Integrating Temporal and Structural Context in Graph Transformers for Relational Deep Learning

Authors: Divyansha Lachi, Mahmoud Mohammadi, Joe Meyer, Vinam Arora, Tom Palczewski, Eva L. Dyer

First: 2025-11-06T17:08:21+00:00 · Latest: 2025-11-06T17:08:21+00:00

Abs · PDF · Code1 · Code2

Abstract

In domains such as healthcare, finance, and e-commerce, the temporal dynamics of relational data emerge from complex interactions, such as those between patients and providers, or users and products across diverse categories. To be broadly useful, models operating on these data must integrate long-range spatial and temporal dependencies across diverse types of entities, while also supporting multiple predictive tasks. However, existing graph models for relational data primarily focus on spatial structure, treating temporal information merely as a filtering constraint to exclude future events rather than a modeling signal, and are typically designed for single-task prediction. To address these gaps, we introduce a temporal subgraph sampler that enhances global context by retrieving nodes beyond the immediate neighborhood to capture temporally relevant relationships. In addition, we propose the Relational Graph Perceiver (RGP), a graph transformer architecture for relational deep learning that leverages a cross-attention-based latent bottleneck to efficiently integrate information from both structural and temporal contexts. This latent bottleneck integrates signals from different node and edge types into a common latent space, enabling the model to build global context across the entire relational system. RGP also incorporates a flexible cross-attention decoder that supports joint learning across tasks with disjoint label spaces within a single model. Experiments on RelBench, SALT, and CTU show that RGP delivers state-of-the-art performance, offering a general and scalable solution for relational deep learning with support for diverse predictive tasks.

中文标题/摘要

标题：在图变换器中结合时空上下文进行关系深度学习

在医疗保健、金融和电子商务等领域，关系数据的时间动态源自复杂的相互作用，如患者与提供者之间的相互作用，或用户与不同类别产品之间的相互作用。为了广泛适用，处理这些数据的模型必须整合跨多种实体的长距离空间和时间依赖性，同时支持多种预测任务。然而，现有的关系数据图模型主要关注空间结构，将时间信息仅视为排除未来事件的筛选约束，而不是建模信号，并且通常设计为单任务预测。为解决这些差距，我们引入了一种时间子图采样器，通过检索超出即时邻域的节点来增强全局上下文，以捕捉时间相关的关系。此外，我们提出了关系图感知机（RGP），这是一种利用交叉注意力机制的潜在瓶颈来高效整合结构和时间上下文信息的图变换器架构。这种潜在瓶颈将不同节点和边类型的信息整合到一个共同的潜在空间中，使模型能够构建整个关系系统的全局上下文。RGP 还包含一个灵活的交叉注意力解码器，支持在单一模型中联合学习具有不同标签空间的任务。在 RelBench、SALT 和 CTU 上的实验表明，RGP 提供了最先进的性能，为关系深度学习提供了通用且可扩展的解决方案，支持多种预测任务。

Summary / 总结

The paper addresses the need for models to integrate both spatial and temporal dependencies in relational data for applications like healthcare and finance. It introduces a temporal subgraph sampler and the Relational Graph Perceiver (RGP), a graph transformer that uses cross-attention to combine structural and temporal information. Experiments show that RGP outperforms existing models on RelBench, SALT, and CTU datasets, supporting multiple predictive tasks effectively.

论文旨在解决模型在处理医疗保健和金融等领域的关系数据时需要整合空间和时间依赖性的问题。文中提出了一种时间子图采样器和关系图感知器（RGP），这是一种使用交叉注意力来结合结构和时间信息的图变换器。实验结果显示，RGP在RelBench、SALT和CTU数据集上优于现有模型，能够有效支持多种预测任务。

Optimizing Sensor Placement in Urban Storm Sewers: A Data-Driven Sparse Sensing Approach

Authors: Zihang Ding, Kun Zhang

First: 2025-11-06T17:08:19+00:00 · Latest: 2025-11-06T17:08:19+00:00

Comments: 32 pages (including supplementary information), 11 figures (and 7 figures in supplementary). Submitted to Nature Water. Partially presented at HydroML 2025 Symposium, Minnesota Water Resources Conference 2025, and will be presented at AGU Fall Meeting 2025

Abstract

Urban surface water flooding, triggered by intense rainfall overwhelming drainage systems, is increasingly frequent and widespread. While flood prediction and monitoring at high spatial-temporal resolution are desired, practical constraints in time, budget, and technology hinder their full implementation. How to monitor urban drainage networks and predict flow conditions under constrained resources is a major challenge. This study presents a data-driven sparse sensing (DSS) framework, integrated with EPA-SWMM, to optimize sensor placement and reconstruct peak flowrates in a stormwater system, using the Woodland Avenue catchment in Duluth, Minnesota, as a case study. We utilized a SWMM model to generate a training dataset of peak flowrate profiles across the stormwater network. Furthermore, we applied DSS, which leverages singular value decomposition for dimensionality reduction and QR factorization for sensor allocation, to identify the optimal monitoring nodes based on the simulated training dataset. We then validated the representativeness of these identified monitoring nodes by comparing the DSS-reconstructed peak flowrate profiles with those obtained from SWMM. Three optimally placed sensors among 77 nodes achieved satisfactory reconstruction performance with Nash-Sutcliffe Efficiency (NSE) values of 0.92-0.95 (25th to 75th percentiles). In addition, the model showed good robustness to uncertainty in measurements. Its robustness to sensor failures is location-dependent and improves with the number of sensors deployed. The framework balances computational efficiency and physical interpretability, enabling high-accuracy flow reconstruction with minimal sensors. This DSS framework can be further integrated with predictive models to realize flood early warning and real-time control under limited sensing and monitoring resources.
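
The SVD-plus-pivoted-QR recipe named in the abstract can be sketched as follows. This is a minimal illustration of the general sparse sensing technique, not the authors' code: the training matrix is synthetic low-rank data standing in for SWMM output, and the function names (`place_sensors`, `reconstruct`, `nse`) are hypothetical. Only the node count (77) and sensor count (3) are taken from the abstract.

```python
import numpy as np
from scipy.linalg import qr

def place_sensors(X, rank, n_sensors):
    """X: (n_nodes, n_snapshots) training matrix. Returns (basis, sensor indices)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    basis = U[:, :rank]                       # reduced basis, (n_nodes, rank)
    # Column-pivoted QR of basis^T orders nodes by how informative they are.
    _, _, pivots = qr(basis.T, pivoting=True)
    return basis, pivots[:n_sensors]

def reconstruct(basis, sensors, y):
    """Least-squares estimate of the full profile from readings y at the sensors."""
    coeffs, *_ = np.linalg.lstsq(basis[sensors, :], y, rcond=None)
    return basis @ coeffs

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus squared error over variance of obs."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Synthetic stand-in for simulated peak-flow profiles (assumption: exactly rank 3).
rng = np.random.default_rng(0)
n_nodes, n_modes, n_events = 77, 3, 50
patterns = rng.normal(size=(n_nodes, n_modes))
X_train = patterns @ rng.normal(size=(n_modes, n_events))

basis, sensors = place_sensors(X_train, rank=n_modes, n_sensors=3)

# Unseen profile drawn from the same low-rank family, observed only at the sensors.
x_true = patterns @ rng.normal(size=n_modes)
x_hat = reconstruct(basis, sensors, x_true[sensors])
score = nse(x_true, x_hat)
```

Because the toy data is exactly low-rank and noise-free, the reconstruction here is essentially perfect; real simulated or measured flows would not be, which is why the paper reports a range of NSE values.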

中文标题/摘要

标题：城市暴雨排水系统中传感器布局优化：基于数据驱动的稀疏传感方法

由强降雨超过排水系统能力引发的城市地表水洪涝越来越频繁和广泛。虽然高空间-时间分辨率的洪水预测和监测是所需的，但在时间、预算和技术方面的实际限制阻碍了其全面实施。如何在资源受限的情况下监测城市排水网络和预测流态是一个主要挑战。本研究提出了一种基于数据驱动的稀疏传感（DSS）框架，结合EPA-SWMM，以优化传感器布局并重建暴雨排水系统中的峰值流量，以明尼苏达州杜鲁斯市伍德兰大道汇水区为例进行案例研究。我们利用SWMM模型生成暴雨排水网络中峰值流量分布的训练数据集。此外，我们应用DSS - 利用奇异值分解进行降维和QR分解进行传感器分配 - 根据模拟的训练数据集识别最优监测节点。然后，通过将DSS重建的峰值流量分布与SWMM获得的结果进行比较，验证这些识别出的监测节点的代表性。在77个节点中，3个最优放置的传感器实现了满意的重建性能，Nash-Sutcliffe效率（NSE）值为0.92-0.95（第25百分位到第75百分位）。此外，该模型对测量不确定性具有良好的鲁棒性。其对传感器故障的鲁棒性取决于位置，并随着部署的传感器数量增加而提高。该框架平衡了计算效率和物理可解释性，能够在最少的传感器下实现高精度的流量重建。该DSS框架可以进一步与预测模型集成，在有限的传感和监测资源下实现洪水早期预警和实时控制。

Summary / 总结

This study addresses the challenge of monitoring urban drainage networks under constrained resources by proposing a data-driven sparse sensing (DSS) framework integrated with EPA-SWMM. The framework uses singular value decomposition for dimensionality reduction and QR factorization for sensor allocation to optimize sensor placement and reconstruct peak flowrates. In the Woodland Avenue catchment in Duluth, Minnesota, three optimally placed sensors among 77 nodes achieved Nash-Sutcliffe Efficiency values of 0.92-0.95 (25th to 75th percentiles). The model was robust to measurement uncertainty; its robustness to sensor failures depended on sensor location and improved with the number of sensors deployed.

该研究提出了一种数据驱动的稀疏传感（DSS）框架，结合EPA-SWMM，以解决有限资源下城市排水网络的监测难题。以明尼苏达州杜鲁斯市的Woodland Avenue流域为例，研究人员生成了峰值流量率的训练数据集，并应用DSS识别最优监测节点。该框架利用奇异值分解和QR分解成功重建了峰值流量率，Nash-Sutcliffe效率值为0.92-0.95，展示了对传感器故障的鲁棒性和计算效率。

Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment

Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao

First: 2025-11-06T17:07:49+00:00 · Latest: 2025-11-06T17:07:49+00:00

Comments: Github: https://github.com/MINT-SJTU/Evo-1

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.

中文标题/摘要

标题：Evo-1：轻量级视觉-语言-行动模型，保留语义对齐

视觉-语言-行动（VLA）模型已成为一种强大的框架，统一了感知、语言和控制，使机器人能够通过多模态理解执行多种任务。然而，当前的VLA模型通常包含大量参数，并且依赖大规模机器人数据的预训练，导致训练时计算成本高，且实时推理部署能力有限。此外，大多数训练范式往往会降低视觉-语言主干的感知表示，导致过拟合和下游任务泛化能力差。在本研究中，我们提出了Evo-1，这是一种轻量级的VLA模型，减少了计算量并提高了部署效率，同时保持了强大的性能，无需使用机器人数据进行预训练。Evo-1基于原生多模态视觉-语言模型（VLM），结合了一种新颖的跨模态扩散变换器以及优化的集成模块，共同形成了有效的架构。我们还引入了一种两阶段训练范式，逐步将行动与感知对齐，保留了VLM的表示。值得注意的是，仅包含7.7亿参数的Evo-1在Meta-World和RoboTwin套件上取得了最先进的结果，分别超越了之前最佳模型12.4%和6.9%，并在LIBERO上也取得了竞争力的结果，达到94.8%。在实际评估中，Evo-1在高推理频率和低内存开销的情况下，成功率达到78%，超越了所有基线方法。我们发布了代码、数据和模型权重，以促进轻量级和高效VLA模型的未来研究。

Legal Fact Prediction: The Missing Piece in Legal Judgment Prediction

Authors: Junkai Liu, Yujie Tong, Hui Huang, Bowen Zheng, Yiran Hu, Peicheng Wu, Chuan Xiao, Makoto Onizuka, Muyun Yang, Shuyuan Zheng

Venue: EMNLP 2025

First: 2024-09-11T07:01:08+00:00 · Latest: 2025-11-06T16:53:37+00:00

Comments: Accepted for EMNLP 2025 Main Conference

Abstract

Legal judgment prediction (LJP), which enables litigants and their lawyers to forecast judgment outcomes and refine litigation strategies, has emerged as a crucial legal NLP task. Existing studies typically utilize legal facts, i.e., facts that have been established by evidence and determined by the judge, to predict the judgment. However, legal facts are often difficult to obtain in the early stages of litigation, significantly limiting the practical applicability of fact-based LJP. To address this limitation, we propose a novel legal NLP task: legal fact prediction (LFP), which takes the evidence submitted by litigants for trial as input to predict legal facts, thereby empowering fact-based LJP technologies to make predictions in the absence of ground-truth legal facts. We also propose the first benchmark dataset, LFPBench, for evaluating the LFP task. Our extensive experiments on LFPBench demonstrate the effectiveness of LFP-empowered LJP and highlight promising research directions for LFP.

中文标题/摘要

标题：法律事实预测：法律判决预测中的缺失环节

法律判决预测（LJP），使诉讼当事人及其律师能够预测判决结果并优化诉讼策略，已成为关键的法律NLP任务。现有研究通常利用法律事实，即通过证据确立并由法官确定的事实来预测判决。然而，在诉讼早期阶段，法律事实往往难以获得，极大地限制了基于事实的LJP的实际应用。为解决这一限制，我们提出了一项新的法律NLP任务：法律事实预测（LFP），该任务以诉讼当事人提交的证据作为输入，预测法律事实，从而使基于事实的LJP技术能够在没有真实法律事实的情况下进行预测。我们还提出了首个基准数据集LFPBench，用于评估LFP任务。我们在LFPBench上的广泛实验表明，LFP增强的LJP的有效性，并指出了LFP研究的有希望的研究方向。

Summary / 总结

The paper addresses the limitation of legal judgment prediction (LJP) by focusing on the difficulty of obtaining legal facts early in litigation. It introduces a new task, legal fact prediction (LFP), which predicts legal facts from evidence submitted by litigants. The authors developed the first benchmark dataset, LFPBench, and showed that LFP can effectively enable LJP even without ground-truth legal facts, opening up new research directions.

论文针对诉讼早期难以获取法律事实的问题，提出了一个新的任务——法律事实预测（LFP），该任务通过预测提交给法庭的证据中的法律事实来支持法律判断预测（LJP）。作者提出了一个基准数据集LFPBench，并证明了即使没有真实法律事实，LFP也能有效支持LJP，为LFP研究开辟了新的方向。

Rater Equivalence: Evaluating Classifiers in Human Judgment Settings

Authors: Paul Resnick, Yuqing Kong, Grant Schoenebeck, Tim Weninger

First: 2021-06-02T16:07:32+00:00 · Latest: 2025-11-06T16:52:50+00:00

Abstract

In many decision settings, the definitive ground truth is either non-existent or inaccessible. In such cases, it is helpful to compare automated classifiers to human judgment. We introduce a framework for evaluating classifiers based solely on human judgments. We quantify a classifier's performance by its rater equivalence: the smallest number of human raters whose combined judgment matches the classifier's performance. Our framework uses human-generated labels both to construct benchmark panels and to evaluate performance. We distinguish between two models of utility: one based on agreement with the assumed but inaccessible ground truth, and one based on matching individual human judgments. Using case studies and formal analysis, we demonstrate how this framework can inform the evaluation and deployment of AI systems in practice.
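
To make the idea of "the smallest number of human raters whose combined judgment matches the classifier's performance" concrete, here is a deliberately simplified sketch. It is not the paper's estimator: it assumes binary labels, majority vote as the way to combine raters, and agreement with one held-out reference rater as the score. The simulated accuracies and all function names are hypothetical.

```python
import numpy as np

def agreement(pred, reference):
    # Fraction of items on which the prediction matches the reference rater.
    return float(np.mean(pred == reference))

def majority_vote(labels):
    # labels: (n_items, k) binary array with odd k, so ties cannot occur.
    return (labels.sum(axis=1) * 2 > labels.shape[1]).astype(int)

def rater_equivalence(classifier_pred, rater_labels, reference):
    """Smallest odd panel size whose majority vote scores at least as well
    as the classifier against the reference rater (None if no panel does)."""
    clf_score = agreement(classifier_pred, reference)
    for k in range(1, rater_labels.shape[1] + 1, 2):
        panel_score = agreement(majority_vote(rater_labels[:, :k]), reference)
        if panel_score >= clf_score:
            return k, clf_score, panel_score
    return None, clf_score, None

rng = np.random.default_rng(0)
n_items = 20000
truth = rng.integers(0, 2, size=n_items)  # latent truth, used only to simulate labels

def noisy(labels, acc):
    # Flip each label independently with probability 1 - acc.
    flip = rng.random(labels.shape) > acc
    return np.where(flip, 1 - labels, labels)

reference = noisy(truth, 0.7)                                   # held-out human rater
raters = np.stack([noisy(truth, 0.7) for _ in range(9)], axis=1)  # panel candidates
classifier = noisy(truth, 0.82)                                 # simulated classifier

k, clf_score, panel_score = rater_equivalence(classifier, raters, reference)
print(k, clf_score, panel_score)
```

Note that the latent truth never enters the scoring: both the classifier and the panels are judged only against another human's labels, which is the setting the abstract describes.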

中文标题/摘要

标题：评分者等效性：在人类判断环境中评估分类器

在许多决策环境中，确定的真值要么不存在，要么不可访问。我们提出了一种仅基于人类判断来评估分类器的框架。在这种情况下，比较自动化分类器与人类判断是有帮助的。我们通过评分者等效性来量化分类器的性能：即与分类器性能相匹配的最小人类评分者的数量。我们的框架使用人类生成的标签来构建基准面板并评估性能。我们区分了两种效用模型：一种基于与假设但不可访问的真值的一致性，另一种基于与个体人类判断的匹配。通过案例研究和形式分析，我们展示了这种框架如何在实践中指导人工智能系统的评估和部署。

Summary / 总结

The paper introduces a framework for evaluating classifiers using human judgments when ground truth is unavailable. It measures a classifier's performance by its rater equivalence, which is the minimum number of human raters needed to match the classifier's performance. The framework uses human-generated labels to construct benchmark panels and evaluate performance, considering two models of utility: agreement with an assumed ground truth and matching individual human judgments. Case studies and formal analysis illustrate the practical application of this framework in evaluating and deploying AI systems.

论文提出了一种使用人类判断来评估分类器的方法，当无法获得真实基准时。通过计算需要多少人类评判者的判断与分类器性能一致来衡量分类器的表现。该框架使用人类生成的标签构建基准面板并评估性能，考虑了两种效用模型：与假设的真实基准的一致性以及与个体人类判断的匹配。研究展示了该框架在评估和部署AI系统中的实际应用。