arXiv Paper Digest

Native 3D Editing with Full Attention

Authors: Weiwei Cai, Shuangkang Fang, Weicai Ye, Xin Dong, Yunhan Yang, Xuanyang Zhang, Wei Cheng, Yanpei Cao, Gang Yu, Tao Chen

First: 2025-11-21T18:59:26+00:00 · Latest: 2025-11-21T18:59:26+00:00

Abstract

Instruction-guided 3D editing is a rapidly emerging field with the potential to broaden access to 3D content creation. However, existing methods face critical limitations: optimization-based approaches are prohibitively slow, while feed-forward approaches relying on multi-view 2D editing often suffer from inconsistent geometry and degraded visual quality. To address these issues, we propose a novel native 3D editing framework that directly manipulates 3D representations in a single, efficient feed-forward pass. Specifically, we create a large-scale, multi-modal dataset for instruction-guided 3D editing, covering diverse addition, deletion, and modification tasks. This dataset is meticulously curated to ensure that edited objects faithfully adhere to the instructional changes while preserving the consistency of unedited regions with the source object. Building upon this dataset, we explore two distinct conditioning strategies for our model: a conventional cross-attention mechanism and a novel 3D token concatenation approach. Our results demonstrate that token concatenation is more parameter-efficient and achieves superior performance. Extensive evaluations show that our method outperforms existing 2D-lifting approaches, setting a new benchmark in generation quality, 3D consistency, and instruction fidelity.

中文标题/摘要

标题：原生3D编辑与全注意力

指令引导的3D编辑是一个迅速发展的领域，有望扩大3D内容创作的访问范围。然而，现有方法面临关键限制：基于优化的方法过于缓慢，而依赖多视角2D编辑的前馈方法往往导致几何不一致和视觉质量下降。为了解决这些问题，我们提出了一种新颖的原生3D编辑框架，该框架可以在单个高效的前馈过程中直接操作3D表示。具体而言，我们创建了一个大规模的多模态数据集，用于指令引导的3D编辑，涵盖了多样化的添加、删除和修改任务。该数据集精心策划，以确保编辑对象忠实于指令变化，同时保持未编辑区域与源对象的一致性。基于此数据集，我们探索了两种不同的条件策略：传统的交叉注意力机制和新颖的3D标记连接方法。我们的结果表明，标记连接更具有参数效率，并且性能更优。广泛的评估显示，我们的方法优于现有的2D提升方法，为生成质量、3D一致性和指令忠实度设立了新的基准。

Summary / 总结

The paper addresses the limitations of existing 3D editing methods by proposing a novel native 3D editing framework that directly manipulates 3D representations in a single feed-forward pass. It introduces a large-scale multi-modal dataset for instruction-guided 3D editing and explores two conditioning strategies: cross-attention and 3D token concatenation. The results show that the 3D token concatenation approach is more parameter-efficient and achieves better performance, outperforming existing 2D-lifting methods in generation quality, 3D consistency, and instruction fidelity.

论文通过提出一种新的直接操作3D表示的单一前向传递框架，解决了现有3D编辑方法的局限性。作者创建了一个大规模的多模态数据集，用于指导指令下的3D编辑，涵盖了添加、删除和修改等多种任务。他们探索了两种条件策略：交叉注意力和3D令牌连接，发现后者更高效且性能更优。实验结果表明，所提出的方法在生成质量、3D一致性及指令忠实度方面优于现有2D提升方法。
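
The two conditioning strategies named in the abstract can be contrasted with a small sketch. It is a toy illustration, not the authors' model: the token values and the function names `concat_conditioning` and `cross_conditioning` are hypothetical, and the learned query/key/value projections that a real transformer layer would have are left out (identity projections are assumed).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def concat_conditioning(target_tokens, source_tokens):
    """Token concatenation: one self-attention over the joint sequence."""
    joint = target_tokens + source_tokens
    mixed = attention(joint, joint, joint)
    return mixed[:len(target_tokens)]  # keep only the target positions

def cross_conditioning(target_tokens, source_tokens):
    """Cross-attention: target queries attend to source keys/values only."""
    return attention(target_tokens, source_tokens, source_tokens)

# Hypothetical 2-D tokens (real 3D latents would be much larger).
target = [[1.0, 0.0], [0.0, 1.0]]               # object being generated
source = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]   # source object to edit

edited_concat = concat_conditioning(target, source)
edited_cross = cross_conditioning(target, source)
```

In this sketch the concatenation path reuses the same attention routine that the target tokens already pass through, whereas the cross-attention path is an extra module; that is one plausible reading of why the abstract describes concatenation as more parameter-efficient.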

The Loss of Control Playbook: Degrees, Dynamics, and Preparedness

Authors: Charlotte Stix, Annika Hallensleben, Alejandro Ortega, Matteo Pistillo

First: 2025-11-19T20:10:39+00:00 · Latest: 2025-11-21T18:53:03+00:00

Abstract

This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.

中文标题/摘要

标题：失控 playbook：程度、动态与准备

本研究报告通过开发新的分类法和准备框架，解决了人工智能系统中缺乏可操作的失控（LoC）定义的问题。尽管政策和研究的关注度不断增加，现有的LoC定义在范围和时间上差异显著，阻碍了有效的LoC评估和缓解。为解决这一问题，我们借鉴了广泛文献综述，并提出了一种基于严重性和持久性的分级LoC分类法，区分了偏差、边界失控和严格失控。我们建模了通向社会脆弱状态的路径，在这种状态下，足够先进的AI系统一旦出现催化剂（无论是失准还是纯粹故障），就可能获得或能够获得造成边界或严格失控的手段。我们认为，在缺乏战略干预的情况下，这种状态随着时间的推移变得越来越有可能，并提出了一种避免达到脆弱状态的策略。我们不仅关注干预可能与LoC相关的AI能力和倾向，或防止潜在催化剂，还引入了一个互补框架，强调三个外在因素：部署背景、便利条件和许可（DAP框架）。与内在因素和催化剂的工作相比，该框架今天具有可操作性的不公平优势。最后，我们提出了一项计划，以维持准备状态并防止在社会脆弱状态达到时发生LoC结果，重点关注治理措施（威胁建模、部署政策、应急响应）和技术控制（预部署测试、控制措施、监控），以维持一种持久的暂停状态。

Summary / 总结

This research develops a novel taxonomy and preparedness framework for Loss of Control (LoC) in AI systems, addressing the lack of a clear definition. By distinguishing between Deviation, Bounded LoC, and Strict LoC based on severity and persistence, the study models pathways to societal vulnerability and proposes a DAP framework focusing on Deployment context, Affordances, and Permissions to mitigate LoC risks. Key findings include the increasing likelihood of societal vulnerability without strategic intervention and the need for governance and technical measures to maintain preparedness.

研究开发了一种新的LoC分类和准备框架，以解决AI系统中缺乏明确定义的问题。通过根据严重性和持续性区分偏差、边界LoC和严格LoC，该研究模型化了通往社会脆弱性的路径，并提出了一个侧重于部署环境、便利性和许可的DAP框架以防止LoC。关键发现包括在缺乏战略干预的情况下社会脆弱性的增加可能性，以及需要治理和技术措施来维持准备状态。

EvDiff: High Quality Video with an Event Camera

Authors: Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel

First: 2025-11-21T18:49:18+00:00 · Latest: 2025-11-21T18:49:18+00:00

Abstract

As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.

中文标题/摘要

标题：EvDiff：基于事件的高质量视频

作为神经形态传感器，事件相机异步记录亮度变化，以稀疏事件流的形式，具有高时间分辨率和高动态范围的优势。从事件重建强度图像是一项高度病态的任务，由于绝对亮度的固有模糊性。早期方法通常遵循端到端回归范式，直接以确定性方式将事件映射到强度帧。虽然在一定程度上有效，但这些方法往往产生视觉上劣质的结果，并且难以在模型容量和训练数据上进行扩展。在本文中，我们提出了一种基于事件的扩散模型EvDiff，该模型遵循代理训练框架以生成高质量的视频。为了减少高帧率视频生成的高昂计算成本，我们设计了一种仅执行一次前向扩散步骤的基于事件的扩散模型，并配备了时间一致的EvEncoder。此外，我们提出的代理训练框架消除了对配对事件-图像数据集的依赖，使模型能够利用大规模图像数据集以更高的容量进行训练。所提出的EvDiff能够仅从单色事件流中生成高质量的彩色视频。在真实世界数据集上的实验表明，我们的方法在像素级和感知度量上均优于现有方法，在准确性和逼真度之间找到了一个理想的平衡。

Summary / 总结

The research aims to improve the quality of videos generated from event cameras, which record changes in brightness as sparse events. To address the ill-posed problem of reconstructing intensity images, the authors propose EvDiff, an event-based diffusion model that uses a surrogate training framework. This model performs a single forward diffusion step and includes a temporally consistent EvEncoder. By eliminating the need for paired event-image datasets, EvDiff can leverage large-scale image datasets, enhancing its capacity. Experimental results show that EvDiff generates high-quality, colorful videos from monochromatic event streams, outperforming existing methods in both pixel-level and perceptual metrics.

研究旨在通过解决从稀疏事件重建强度图像的挑战，提高由事件相机数据生成的视频质量。提出的EvDiff模型使用代理训练框架，在单次前向扩散步骤中生成高质量视频，从而降低计算成本。关键发现表明，EvDiff在像素级和感知度指标上均优于现有方法，能够从单色事件流中生成高质量彩色视频。
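
The "inherent ambiguity of absolute brightness" mentioned in the abstract can be seen in a one-pixel sketch. It assumes the usual idealized event model (an event fires each time log intensity moves by a fixed contrast threshold); the threshold value and the intensity sequences are made up for illustration and are not taken from the paper.

```python
import math

CONTRAST_THRESHOLD = 0.2  # assumed log-intensity step per event

def events_from_intensity(intensities, threshold=CONTRAST_THRESHOLD):
    """Emit +1/-1 events whenever log intensity moves by `threshold`."""
    events = []
    reference = math.log(intensities[0])
    for value in intensities[1:]:
        log_value = math.log(value)
        while log_value - reference >= threshold:
            events.append(+1)
            reference += threshold
        while reference - log_value >= threshold:
            events.append(-1)
            reference -= threshold
    return events

def integrate(events, threshold=CONTRAST_THRESHOLD):
    """Sum of polarities: a log-intensity change, not an absolute level."""
    return threshold * sum(events)

dim_scene = [1.0, 1.5, 2.5, 2.0]
bright_scene = [10.0 * v for v in dim_scene]  # same scene, ten times brighter

events_dim = events_from_intensity(dim_scene)
events_bright = events_from_intensity(bright_scene)
```

Both scenes produce the same event stream, so integrating events recovers brightness only up to an unknown offset in log space. This is why reconstruction needs a prior, which in EvDiff is supplied by a diffusion model.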

Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Authors: Yolo Yunlong Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

First: 2025-11-21T18:47:09+00:00 · Latest: 2025-11-21T18:47:09+00:00

Abstract

Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning.

中文标题/摘要

标题：视频-R4：通过视觉沉思强化文本丰富的视频推理

理解文本丰富的视频需要阅读小而短暂的文字提示，这些提示通常需要反复检查。然而，大多数视频问答模型依赖于对固定帧的单次感知，导致出现幻觉和对细微证据的失败。受人类暂停、放大和重读关键区域的启发，我们引入了Video-R4（通过视觉沉思强化文本丰富的视频推理），这是一种视频推理LMM，能够进行视觉沉思：迭代选择帧，放大到信息丰富的区域，重新编码检索到的像素，并更新其推理状态。我们构建了两个包含可执行沉思轨迹的数据集：Video-R4-CoT-17k 用于监督练习，Video-R4-RL-30k 用于强化学习。我们提出了一种多阶段沉思学习框架，逐步微调一个7B LMM，通过SFT和基于GRPO的RL学习原子和混合视觉操作。Video-R4-7B 在M4-ViteVQA 上达到了最先进的结果，并进一步泛化到多页文档问答、幻灯片问答和通用视频问答，证明了迭代沉思是像素接地多模态推理的有效范式。
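
The rumination loop described above (select a frame, zoom into a region, re-encode it, update the reasoning state) can be sketched as a toy control loop. Everything here is hypothetical: frames are plain dictionaries mapping a region name to the text visible there, "re-encoding" is reduced to a substring check, and the frame-selection policy is a fixed left-to-right scan rather than a learned one.

```python
def ruminate(frames, question_keyword, max_steps=4):
    """Iteratively select a frame, zoom into regions, re-read, update state."""
    state = {"evidence": [], "visited": set()}
    for _ in range(max_steps):
        # 1. Select the next unvisited frame (a learned policy would choose).
        candidates = [i for i in range(len(frames)) if i not in state["visited"]]
        if not candidates:
            break
        index = candidates[0]
        state["visited"].add(index)
        # 2. Zoom: inspect each region of the chosen frame separately.
        for region_name, region_text in frames[index].items():
            # 3. Re-encode: here, simply re-read the text in that region.
            if question_keyword in region_text:
                # 4. Update the reasoning state with grounded evidence.
                state["evidence"].append((index, region_name, region_text))
        if state["evidence"]:
            break  # stop once some evidence has been found
    return state

# Hypothetical three-frame "video" with two text regions per frame.
frames = [
    {"banner": "Welcome", "corner": ""},
    {"banner": "Agenda", "corner": "Room 204"},
    {"banner": "Thanks", "corner": ""},
]
result = ruminate(frames, "Room")
```

The point of the sketch is the control flow: the answer is tied to a specific frame and region instead of being produced from a single pass over fixed frames.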

Harnessing Data from Clustered LQR Systems: Personalized and Collaborative Policy Optimization

Authors: Vinay Kanakeri, Shivam Bajaj, Ashwin Verma, Vijay Gupta, Aritra Mitra

First: 2025-11-21T18:45:53+00:00 · Latest: 2025-11-21T18:45:53+00:00

Abstract

It is known that reinforcement learning (RL) is data-hungry. To improve sample-efficiency of RL, it has been proposed that the learning algorithm utilize data from 'approximately similar' processes. However, since the process models are unknown, identifying which other processes are similar poses a challenge. In this work, we study this problem in the context of the benchmark Linear Quadratic Regulator (LQR) setting. Specifically, we consider a setting with multiple agents, each corresponding to a copy of a linear process to be controlled. The agents' local processes can be partitioned into clusters based on similarities in dynamics and tasks. Combining ideas from sequential elimination and zeroth-order policy optimization, we propose a new algorithm that performs simultaneous clustering and learning to output a personalized policy (controller) for each cluster. Under a suitable notion of cluster separation that captures differences in closed-loop performance across systems, we prove that our approach guarantees correct clustering with high probability. Furthermore, we show that the sub-optimality gap of the policy learned for each cluster scales inversely with the size of the cluster, with no additional bias, unlike in prior works on collaborative learning-based control. Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies that enjoy statistical gains from collaboration but do not suffer sub-optimality due to inclusion of data from dissimilar processes. From a distributed implementation perspective, our method is attractive as it incurs only a mild logarithmic communication overhead.

中文标题/摘要

标题：利用聚类LQR系统的数据：个性化与协作策略优化

已知强化学习（RL）是数据饥渴的。为了提高RL的样本效率，提出了让学习算法利用“大致相似”的过程的数据。然而，由于过程模型未知，识别哪些过程相似是一个挑战。在本文中，我们在线性二次调节器（LQR）基准设置的背景下研究了这个问题。具体而言，我们考虑了一个多代理设置，每个代理对应一个需要控制的线性过程的副本。代理的局部过程可以根据动力学和任务的相似性进行聚类。结合顺序淘汰和零阶策略优化的思想，我们提出了一种新的算法，同时进行聚类和学习，为每个聚类输出一个个性化的策略（控制器）。在一种适合的聚类分离概念下，该概念捕捉了系统闭环性能的差异，我们证明了我们的方法以高概率保证正确的聚类。此外，我们展示了为每个聚类学习的策略的次优性差距与聚类大小成反比，没有额外的偏差，这与基于协作学习的控制的先前工作不同。我们的工作首次揭示了如何在数据驱动的控制中利用聚类来学习享受合作统计增益的个性化策略，但不会因包含不相似过程的数据而遭受次优性。从分布式实现的角度来看，我们的方法具有吸引力，因为它仅产生轻微的对数通信开销。

Summary / 总结

This paper addresses the challenge of improving the sample efficiency of reinforcement learning (RL) by utilizing data from similar processes. It proposes a new algorithm for clustering and learning personalized policies in a Linear Quadratic Regulator (LQR) setting with multiple agents. The algorithm combines sequential elimination and zeroth-order policy optimization to achieve correct clustering with high probability and sub-optimality gaps that scale inversely with cluster size. This work demonstrates that clustering can provide statistical gains from collaboration without the bias of including dissimilar data, and it requires only a mild communication overhead.

该论文旨在通过利用相似过程的数据来提高强化学习（RL）的样本效率。它提出了一种新的算法，该算法同时对多个线性过程进行聚类和学习策略，每个过程对应一个聚类。该算法使用顺序淘汰和零阶策略优化来为每个聚类输出个性化策略。理论分析表明，该方法能够以高概率正确聚类过程，并且所学策略的次优性差距随着聚类规模的增加而减小。这项工作展示了聚类如何在提供合作统计收益的同时避免因包含不相似数据而导致的次优性增加。
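
Zeroth-order policy optimization, one of the two ingredients named in the abstract, can be illustrated on a scalar LQR problem. This is a minimal sketch under stated assumptions, not the paper's algorithm: the system constants, step size, and function names are hypothetical, the cost is evaluated exactly instead of from noisy rollouts, and the clustering and sequential-elimination parts are omitted.

```python
import random

# Assumed scalar system x+ = A x + B u with stage cost Q x^2 + R u^2.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0

def lqr_cost(k, x0=1.0):
    """Infinite-horizon cost of u = -k x (finite only if |A - B k| < 1)."""
    closed_loop = A - B * k
    if abs(closed_loop) >= 1.0:
        return float("inf")
    return (Q + R * k * k) * x0 * x0 / (1.0 - closed_loop * closed_loop)

def zeroth_order_gradient(k, delta, rng):
    """Two-point estimate from cost queries only, with no model access."""
    u = rng.choice([-1.0, 1.0])  # random direction on the 1-D unit sphere
    return (lqr_cost(k + delta * u) - lqr_cost(k - delta * u)) / (2.0 * delta) * u

def optimize_policy(k0=1.0, step=0.05, delta=1e-3, iterations=500, seed=0):
    rng = random.Random(seed)
    k = k0
    for _ in range(iterations):
        k -= step * zeroth_order_gradient(k, delta, rng)
    return k

def riccati_gain(iterations=200):
    """Model-based optimal gain, used only to check the model-free result."""
    p = Q
    for _ in range(iterations):
        p = Q + A * A * p - (A * B * p) ** 2 / (R + B * B * p)
    return A * B * p / (R + B * B * p)

learned_gain = optimize_policy()
optimal_gain = riccati_gain()
```

With noisy cost evaluations, averaging such gradient estimates over the agents of one cluster would reduce their variance, which is the intuition behind the abstract's claim that the sub-optimality gap shrinks with cluster size.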

An Artificial Intelligence Framework for Measuring Human Spine Aging Using MRI

Authors: Roozbeh Bazargani, Saqib Abdullah Basar, Daniel Daly-Grafstein, Rodrigo Solis Pompa, Soojin Lee, Saurabh Garg, Yuntong Ma, John A. Carrino, Siavash Khallaghi, Sam Hashemi

First: 2025-11-21T18:40:21+00:00 · Latest: 2025-11-21T18:40:21+00:00

Comments: 17 pages, 7 figures

Abstract

The human spine is a complex structure composed of 33 vertebrae. It supports the body and is important for leading a healthy life. The spine is vulnerable to age-related degeneration that can be identified through magnetic resonance imaging (MRI). In this paper, we propose a novel computer-vision-based deep learning method to estimate spine age using images from over 18,000 MRI series. Data are restricted to subjects with only age-related spine degeneration. Eligibility criteria are created by identifying common age-based clusters of degenerative spine conditions using uniform manifold approximation and projection (UMAP) and hierarchical density-based spatial clustering of applications with noise (HDBSCAN). Model selection is determined using a detailed ablation study on data size, loss, and the effect of different spine regions. We evaluate the clinical utility of our model by calculating the difference between actual spine age and model-predicted age, the spine age gap (SAG), and examining the association between these differences and spine degenerative conditions and lifestyle factors. We find that SAG is associated with conditions including disc bulges, disc osteophytes, spinal stenosis, and fractures, as well as lifestyle factors like smoking and physically demanding work, and thus may be a useful biomarker for measuring overall spine health.

中文标题/摘要

标题：一种基于MRI测量人类脊柱衰老的人工智能框架

人类脊柱由33块椎骨组成，是支撑身体的重要结构，对健康生活至关重要。脊柱易受年龄相关退化的影响，这些退化可以通过磁共振成像（MRI）识别。本文提出了一种基于计算机视觉的深度学习方法，利用超过18,000个MRI系列图像来估算脊柱年龄。数据仅限于年龄相关脊柱退化患者。通过使用均匀流形逼近和投影（UMAP）和层次密度基于空间聚类的应用程序噪声（HDBSCAN）来识别年龄相关的退化脊柱条件的常见簇，制定了入选标准。通过详细的数据规模、损失函数和不同脊柱区域影响的消融研究来确定模型选择。通过计算实际脊柱年龄与模型预测年龄之间的差异，即脊柱年龄差距（SAG），并检查这些差异与脊柱退化状况和生活方式因素之间的关联，评估了该模型的临床用途。我们发现SAG与椎间盘膨出、椎间盘骨赘、椎管狭窄和骨折等状况以及吸烟和体力劳动等生活方式因素相关，因此可能是一个衡量整体脊柱健康的有效生物标志物。

Summary / 总结

This paper introduces a deep learning framework to estimate human spine age from MRI images, focusing on age-related degenerations. The method uses a detailed ablation study to optimize model performance and evaluates its clinical utility by assessing the spine age gap (SAG) and its association with degenerative conditions and lifestyle factors. Key findings include SAG's correlation with conditions such as disc bulges, osteophytes, spinal stenosis, and fractures, as well as with lifestyle factors like smoking and physically demanding work, suggesting its potential as a biomarker for spine health.

本文提出了一种基于深度学习的方法，用于从MRI图像估计人类脊柱年龄，重点关注与年龄相关的退化。该方法通过详细的消融研究优化模型参数，并通过评估脊柱年龄差距（SAG）及其与各种退化状况和生活方式因素的关系来评估其临床应用价值。关键发现包括SAG与椎间盘突出、骨赘、椎管狭窄和骨折等状况以及吸烟和体力劳动等生活方式因素的相关性，表明其可能作为评估脊柱健康状况的生物标志物。

Generalizable Radio-Frequency Radiance Fields for Spatial Spectrum Synthesis

Authors: Kang Yang, Yuning Chen, Wan Du

First: 2025-02-08T22:03:08+00:00 · Latest: 2025-11-21T18:40:21+00:00

Abstract

We present GRaF, Generalizable Radio-Frequency (RF) Radiance Fields, a framework that models RF signal propagation to synthesize spatial spectra at arbitrary transmitter or receiver locations, where each spectrum measures signal power across all surrounding directions at the receiver. Unlike state-of-the-art methods that adapt vanilla Neural Radiance Fields (NeRF) to the RF domain with scene-specific training, GRaF generalizes across scenes to synthesize spectra. To enable this, we prove an interpolation theory in the RF domain: the spatial spectrum from a transmitter can be approximated using spectra from geographically proximate transmitters. Building on this theory, GRaF comprises two components: (i) a geometry-aware Transformer encoder that captures spatial correlations from neighboring transmitters to learn a scene-independent latent RF radiance field, and (ii) a neural ray tracing algorithm that estimates spectrum reception at the receiver. Experimental results demonstrate that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.

中文标题/摘要

标题：通用射频辐射场用于空间频谱合成

我们提出了GRaF，通用射频（RF）辐射场，这是一种框架，用于建模RF信号传播以在任意发射器或接收器位置合成空间频谱，其中每个频谱测量接收器周围所有方向上的信号功率。与现有方法不同，这些方法通过场景特定的训练将vanilla神经辐射场（NeRF）适应到RF域，GRaF在不同场景中进行泛化以合成频谱。为了实现这一点，我们在RF域中证明了一种插值理论：发射器的频谱可以通过地理上邻近发射器的频谱进行近似。基于这一理论，GRaF 包含两个组件：（i）一个几何感知的Transformer编码器，用于从邻近发射器捕获空间相关性以学习场景独立的RF辐射场，以及（ii）一种神经射线追踪算法，用于估计接收器处的频谱接收。实验结果表明，GRaF 在单场景基准测试中优于现有方法，并在未见过的场景布局上实现了最先进的性能。

Summary / 总结

The research aims to develop a framework for synthesizing spatial RF spectra at arbitrary locations without scene-specific training. GRaF uses a geometry-aware Transformer encoder to learn a scene-independent latent RF radiance field and a neural ray tracing algorithm to estimate spectrum reception. Experiments show that GRaF outperforms existing methods on single-scene benchmarks and achieves state-of-the-art performance on unseen scene layouts.

研究旨在开发一种无需针对特定场景进行训练即可在任意位置合成空间RF光谱的框架。GRaF 使用几何感知的Transformer编码器来学习一个场景无关的RF辐射场，并使用神经射线追踪算法来估计接收端的光谱接收。实验结果表明，GRaF 在单场景基准测试中优于现有方法，并在未见过的场景布局上达到了最先进的性能。
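
The interpolation idea stated in the abstract (a transmitter's spatial spectrum can be approximated from the spectra of nearby transmitters) can be illustrated with a deliberately simple stand-in. The sketch uses inverse-distance weighting, which is a substitute technique chosen for brevity and not GRaF's geometry-aware Transformer; the positions and four-bin spectra are hypothetical.

```python
import math

def interpolate_spectrum(query_xy, neighbors, power=2.0):
    """Inverse-distance-weighted average of neighbor spectra."""
    weights = []
    for position, spectrum in neighbors:
        distance = math.dist(query_xy, position)
        if distance == 0.0:
            return list(spectrum)  # query coincides with a known transmitter
        weights.append(1.0 / distance ** power)
    total = sum(weights)
    bins = len(neighbors[0][1])
    return [sum(w * s[b] for w, (_, s) in zip(weights, neighbors)) / total
            for b in range(bins)]

# Hypothetical neighbors: (transmitter position, power per direction bin).
neighbors = [
    ((0.0, 0.0), [1.0, 0.0, 0.0, 0.0]),
    ((2.0, 0.0), [0.0, 1.0, 0.0, 0.0]),
]
midpoint_spectrum = interpolate_spectrum((1.0, 0.0), neighbors)
```

A fixed weighting like this ignores walls and reflections, which is presumably where a learned, geometry-aware encoder earns its keep.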

Radar2Shape: 3D Shape Reconstruction from High-Frequency Radar using Multiresolution Signed Distance Functions

Authors: Neel Sortur, Justin Goodwin, Purvik Patel, Luis Enrique Martinez, Tzofi Klinghoffer, Rajmonda S. Caceres, Robin Walters

First: 2025-11-21T18:40:03+00:00 · Latest: 2025-11-21T18:40:03+00:00

Abstract

Determining the shape of 3D objects from high-frequency radar signals is analytically complex but critical for commercial and aerospace applications. Previous deep learning methods have been applied to radar modeling; however, they often fail to represent arbitrary shapes or have difficulty with real-world radar signals which are collected over limited viewing angles. Existing methods in optical 3D reconstruction can generate arbitrary shapes from limited camera views, but struggle when they naively treat the radar signal as a camera view. In this work, we present Radar2Shape, a denoising diffusion model that handles a partially observable radar signal for 3D reconstruction by correlating its frequencies with multiresolution shape features. Our method consists of a two-stage approach: first, Radar2Shape learns a regularized latent space with hierarchical resolutions of shape features, and second, it diffuses into this latent space by conditioning on the frequencies of the radar signal in an analogous coarse-to-fine manner. We demonstrate that Radar2Shape can successfully reconstruct arbitrary 3D shapes even from partially-observed radar signals, and we show robust generalization to two different simulation methods and real-world data. Additionally, we release two synthetic benchmark datasets to encourage future research in the high-frequency radar domain so that models like Radar2Shape can safely be adapted into real-world radar systems.

中文标题/摘要

标题：Radar2Shape：使用多分辨率符号距离函数从高频雷达信号重建3D形状

从高频雷达信号中确定3D物体的形状在理论上是复杂的，但对于商业和航空航天应用来说是至关重要的。以前的深度学习方法已被应用于雷达建模，但它们往往无法表示任意形状，或者难以处理有限视角收集的真实雷达信号。现有的光学3D重建方法可以从有限的摄像头视角生成任意形状，但在将雷达信号视为摄像头视角时会遇到困难。在本文中，我们提出了Radar2Shape，这是一种去噪扩散模型，通过将雷达信号的频率与多分辨率形状特征相关联来处理部分可观测的雷达信号进行3D重建。我们的方法包括两阶段：首先，Radar2Shape学习一个正则化的隐空间，具有形状特征的分层分辨率；其次，它通过条件依赖雷达信号的频率以类似粗到细的方式扩散到这个隐空间中。我们证明Radar2Shape可以从部分观测的雷达信号中成功重建任意3D形状，并展示了其在两种不同的模拟方法和真实数据上的鲁棒泛化能力。此外，我们发布了两个合成基准数据集，以鼓励未来在高频雷达领域的研究，使模型如Radar2Shape能够安全地适应到实际雷达系统中。

Summary / 总结

Radar2Shape is a denoising diffusion model designed to reconstruct 3D shapes from high-frequency radar signals. It uses multiresolution signed distance functions to handle partially observable radar data, correlating signal frequencies with hierarchical shape features. The model demonstrates successful reconstruction of arbitrary 3D shapes from limited radar views and shows robust performance across different simulation and real-world datasets.

Radar2Shape 是一种去噪扩散模型，用于从高频率雷达信号中重建 3D 形状，采用多分辨率符号距离函数。该模型通过学习具有分层形状特征的正则化潜在空间，并基于雷达信号的频率进行逐级扩散，成功地从部分观测的雷达信号中重建任意 3D 形状，并在不同模拟和真实世界数据中展示了鲁棒的泛化能力。
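
For readers unfamiliar with the representation in the title, a signed distance function (SDF) returns the distance from a point to a surface, negative inside and positive outside. The sketch below samples one analytic SDF at several grid resolutions to give a feel for a coarse-to-fine hierarchy. The sphere, the dense grid, and the chosen resolutions are assumptions for illustration; the abstract does not specify how Radar2Shape stores its multiresolution features.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance: negative inside, zero on the surface, positive outside."""
    return math.dist(point, center) - radius

def sample_grid(sdf, resolution, extent=1.5):
    """Sample `sdf` on a resolution^3 grid covering [-extent, extent]^3."""
    step = 2.0 * extent / (resolution - 1)
    axis = [-extent + i * step for i in range(resolution)]
    return [[[sdf((x, y, z)) for z in axis] for y in axis] for x in axis]

# Coarse-to-fine pyramid of the same shape (resolutions chosen arbitrarily).
pyramid = {resolution: sample_grid(sphere_sdf, resolution)
           for resolution in (3, 5, 9)}
```

The coarse level captures only the overall extent of the shape, while finer levels resolve the surface more precisely, mirroring the coarse-to-fine conditioning described in the abstract.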

Counterfactual World Models via Digital Twin-conditioned Video Diffusion

Authors: Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

First: 2025-11-21T18:37:23+00:00 · Latest: 2025-11-21T18:37:23+00:00

Abstract

World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be selectively modified. This modeling choice prevents targeted interventions on specific scene properties. We introduce CWMDT, a framework to overcome those limitations, turning standard video diffusion models into effective counterfactual world models. First, CWMDT constructs digital twins of observed scenes to explicitly encode objects and their relationships, represented as structured text. Second, CWMDT applies large language models to reason over these representations and predict how a counterfactual intervention propagates through time to alter the observed scene. Third, CWMDT conditions a video diffusion model with the modified representation to generate counterfactual visual sequences. Evaluations on two benchmarks show that the CWMDT approach achieves state-of-the-art performance, suggesting that alternative representations of videos, such as the digital twins considered here, offer powerful control signals for video forward simulation-based world models.

中文标题/摘要

标题：基于数字孪生条件下的视频扩散世界的反事实世界模型

世界模型学习根据控制信号预测视觉观察的时间演变，可能使智能体通过前向模拟来推理环境。由于专注于前向模拟，当前的世界模型基于事实观察生成预测。对于许多新兴应用，如在不同条件下全面评估物理AI行为的能力，世界模型回答反事实查询（例如，“如果移除这个物体会发生什么？”）的能力变得越来越重要。我们形式化了反事实世界模型，这些模型还以干预作为显式输入，预测在假设修改观察场景属性下的时间序列。传统的世界模型直接操作纠缠的像素空间表示，其中对象属性和关系无法选择性修改。这种建模选择阻止了对特定场景属性的针对性干预。我们引入了CWMDT框架，以克服这些限制，将标准的视频扩散模型转变为有效的反事实世界模型。首先，CWMDT构建观察场景的数字孪生，明确编码对象及其关系，表示为结构化文本。其次，CWMDT应用大型语言模型来推理这些表示，并预测反事实干预如何随时间传播以改变观察场景。第三，CWMDT用修改后的表示条件化视频扩散模型以生成反事实视觉序列。在两个基准上的评估表明，CWMDT方法达到了最先进的性能，表明视频前向模拟基于世界模型的替代视频表示（如这里考虑的数字孪生）提供了强大的控制信号。

Summary / 总结

This paper addresses the limitation of current world models in handling counterfactual queries by introducing CWMDT, a framework that constructs digital twins of observed scenes and uses them to predict temporal sequences under hypothetical modifications. CWMDT first encodes objects and their relationships as structured text, then uses large language models to reason about these representations and predict the effects of counterfactual interventions. Finally, it conditions a video diffusion model to generate counterfactual visual sequences. Experiments on two benchmarks demonstrate that CWMDT outperforms existing methods in generating counterfactual predictions, highlighting the potential of alternative video representations for forward simulation-based world models.

本文通过引入CWMDT框架解决了传统世界模型在处理反事实查询方面的局限性，该框架通过构建观察场景的数字孪生并预测在假设修改下的时间序列来解决这一问题。CWMDT首先将物体及其关系编码为结构化文本，然后使用大型语言模型推理这些表示以预测反事实干预的影响。最后，它使用修改后的表示来条件化视频扩散模型以生成反事实视觉序列。实验表明，CWMDT在生成反事实视觉序列方面优于现有方法。
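
The first and third steps described above (a structured-text digital twin, then conditioning on the modified representation) can be sketched with a plain dictionary. The schema, the object names, and `apply_intervention` are hypothetical. The second step, in which a large language model predicts how the intervention propagates over time, is not modeled here; the sketch only applies the edit itself.

```python
import copy
import json

def apply_intervention(twin, intervention):
    """Return a modified copy of the scene description ('remove' only)."""
    if intervention["action"] != "remove":
        raise ValueError("this sketch only supports the 'remove' action")
    target = intervention["object"]
    edited = copy.deepcopy(twin)
    edited["objects"] = [o for o in edited["objects"] if o["name"] != target]
    edited["relations"] = [r for r in edited["relations"]
                           if target not in (r["subject"], r["object"])]
    return edited

# Hypothetical digital twin: objects plus their pairwise relations.
twin = {
    "objects": [
        {"name": "cup", "state": "upright"},
        {"name": "table", "state": "static"},
    ],
    "relations": [{"subject": "cup", "predicate": "on", "object": "table"}],
}
counterfactual = apply_intervention(twin, {"action": "remove", "object": "table"})

# Structured text that could be handed to a downstream generator as conditioning.
prompt_text = json.dumps(counterfactual, sort_keys=True)
```

Because objects and relations are explicit, the edit touches only the targeted entity, which is the kind of selective modification the abstract says entangled pixel-space representations cannot offer.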

ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation

Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong MU

First: 2025-11-01T11:29:14+00:00 · Latest: 2025-11-21T18:35:34+00:00

Abstract

Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.

中文标题/摘要

标题：ID-Crafter：基于VLM的在线强化学习多主体视频生成

在高保真视频合成方面取得了显著进展，但当前范式往往难以有效整合多个主体的身份信息，导致语义冲突和身份及互动的次优表现，限制了可控性和应用范围。为解决这一问题，我们提出了ID-Crafter，一种实现优越身份保留和语义一致性的多主体视频生成框架。ID-Crafter 结合了三个关键组件：(i) 一种分层的身份保留注意力机制，逐步在主体内、主体间和跨模态层面聚合特征；(ii) 由预训练的视觉-语言模型（VLM）驱动的语义理解模块，提供精细指导并捕捉复杂的主体间关系；(iii) 一个在线强化学习阶段，进一步细化模型以处理关键概念。此外，我们构建了一个新的数据集以促进稳健的训练和评估。大量实验表明，ID-Crafter 在多主体视频生成基准测试中建立了新的最佳性能，特别是在身份保留、时间一致性和整体视频质量方面表现出色。

Summary / 总结

ID-Crafter is a framework for multi-subject video generation that integrates a hierarchical identity-preserving attention mechanism, a semantic understanding module using a pretrained Vision-Language Model, and an online reinforcement learning phase. This approach enhances identity preservation and semantic coherence, leading to superior performance on multi-subject video generation benchmarks compared to existing methods.

ID-Crafter 是一个多主体视频生成框架，结合了层次化的身份保留注意力机制、使用预训练视觉-语言模型的语义理解模块以及在线强化学习阶段。这种方法增强了身份保留和语义一致性，使其在多主体视频生成基准测试中表现出色。

YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Authors: Ori Meiraz, Sharon Shalev, Avishai Weizman

First: 2025-11-17T13:11:11+00:00 · Latest: 2025-11-21T18:33:16+00:00

Comments: 1 figure, 1 table

Abstract

This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.

中文标题/摘要

标题：YOLO与混合专家模型结合：自适应专家路由以实现稳健的目标检测

本文提出了一种新颖的混合专家框架用于目标检测，结合了多个YOLOv9-T专家之间的自适应路由，以实现动态特征专业化，并在平均精度（mAP）和平均召回率（AR）方面优于单一的YOLOv9-T模型。

Summary / 总结

This paper introduces a Mixture-of-Experts framework for object detection that uses adaptive routing among multiple YOLOv9-T experts to achieve better performance. The method results in higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model. The approach enables dynamic feature specialization, enhancing robustness in object detection tasks.

该论文提出了一种混合专家框架用于目标检测，通过在多个YOLOv9-T专家之间进行自适应路由来增强特征专业化。该方法在mAP和AR方面优于单一的YOLOv9-T模型。
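
Adaptive routing among experts can be sketched as a softmax gate over expert outputs. This is a generic dense mixture-of-experts illustration, not the paper's architecture: the two lambda "experts" stand in for YOLOv9-T branches, and the router weights are an assumed, untrained matrix.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(features, router_weights):
    """One logit per expert from a linear router, normalized into gates."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in router_weights]
    return softmax(logits)

def mixture_output(features, experts, router_weights):
    """Gate-weighted sum of expert outputs (dense routing: all experts run)."""
    gates = route(features, router_weights)
    outputs = [expert(features) for expert in experts]
    size = len(outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(size)]

# Two toy experts standing in for detector branches (hypothetical).
experts = [lambda f: [2.0 * v for v in f], lambda f: [-1.0 * v for v in f]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]  # assumed, untrained router
features = [3.0, 3.0]

gates = route(features, router_weights)
blended = mixture_output(features, experts, router_weights)
```

Because the gates depend on the input features, different inputs lean on different experts, which is the "dynamic feature specialization" the abstract refers to.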

MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features

Authors: Sejuti Rahman, Swakshar Deb, MD. Sameer Iqbal Chowdhury, MD. Jubair Ahmed Sourov, Mohammad Shamsuddin

First: 2025-11-19T18:18:53+00:00 · Latest: 2025-11-21T18:28:30+00:00

Abstract

Depression is a prevalent global mental health disorder, characterised by persistent low mood and anhedonia. However, it remains underdiagnosed because current diagnostic methods depend heavily on subjective clinical assessments. To enable objective detection, we introduce a gold-standard dataset of 103 clinically assessed participants, collected through a tripartite data approach that uniquely integrates eye-tracking data with audio and video to give a comprehensive representation of depressive symptoms. Eye-tracking data quantify the attentional bias towards negative stimuli that is frequently observed in depressed groups. Audio and video data capture the affective flattening and psychomotor retardation characteristic of depression. Statistical validation confirmed their significant discriminative power in distinguishing depressed from non-depressed groups. We address a critical limitation of existing graph-based models that focus on low-frequency information and propose a Multi-Frequency Graph Convolutional Network (MF-GCN). This framework consists of a novel Multi-Frequency Filter Bank Module (MFFBM), which can leverage both low- and high-frequency signals. Extensive evaluation against traditional machine learning algorithms and deep learning frameworks demonstrates that MF-GCN consistently outperforms baselines. In binary classification, the model achieved a sensitivity of 0.96 and an F2 score of 0.94. For the three-class classification task, the proposed method achieved a sensitivity of 0.79 and a specificity of 0.87 and significantly surpassed other models. To validate generalizability, the model was also evaluated on the Chinese Multimodal Depression Corpus (CMDC) dataset and achieved a sensitivity of 0.95 and an F2 score of 0.96. These results confirm that our trimodal, multi-frequency framework effectively captures cross-modal interaction for accurate depression detection.

中文标题/摘要

标题：MF-GCN：一种用于利用眼动追踪、面部和声学特征进行三模态抑郁检测的多频图卷积网络

抑郁是一种普遍的全球性精神健康障碍，特征为持续的低情绪和快感缺乏。然而，由于当前诊断方法主要依赖主观临床评估，抑郁往往被低估。为了实现客观检测，我们引入了一个包含103名临床评估参与者的黄金标准数据集，该数据集通过三重数据方法独特地将眼动追踪数据与音频和视频结合，提供抑郁症状的全面表现。眼动追踪数据量化了抑郁组中对负面刺激的注意力偏向。音频和视频数据捕捉了抑郁中的情感平淡和心理运动迟缓。统计验证证实了它们在区分抑郁与非抑郁组方面具有显著的鉴别能力。我们解决了现有基于图的模型集中在低频信息上的关键局限性，提出了一种多频图卷积网络（MF-GCN）。该框架包括一种新颖的多频滤波器模块（MFFBM），可以利用低频和高频信号。与传统的机器学习算法和深度学习框架的广泛评估表明，MF-GCN 一贯优于基线。在二分类中，该模型的灵敏度为0.96，F2分数为0.94。对于三分类任务，所提出的方法的灵敏度为0.79，特异性为0.87，并显著超越其他模型。为了验证泛化能力，该模型还在中文多模态抑郁语料库（CMDC）数据集上进行了评估，灵敏度为0.95，F2分数为0.96。这些结果证实了我们三模态、多频框架有效地捕捉了跨模态交互，以实现准确的抑郁检测。

Summary / 总结

The paper introduces MF-GCN, a Multi-Frequency Graph Convolutional Network, for detecting depression using tri-modal data (eye-tracking, facial, and acoustic features). It addresses the limitation of existing models by incorporating both low and high frequency signals. The model was evaluated against traditional machine learning and deep learning methods and showed superior performance, achieving high sensitivity and F2 scores in both binary and three-class classification tasks. It also generalized well on a Chinese dataset.

论文提出了一个名为MF-GCN的多频图卷积网络，用于利用眼动追踪、面部和声学特征的三模态数据检测抑郁。该模型通过结合低频和高频信号解决了现有模型的局限性。MF-GCN在二分类和三分类任务中均优于传统机器学习和深度学习方法，实现了较高的敏感性和F2分数。该模型还对不同的数据集（中国多模态抑郁语料库）具有良好的泛化能力。
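
The sensitivity, specificity, and F2 figures reported for MF-GCN follow from standard confusion-matrix definitions. The sketch below is a minimal Python illustration of those formulas; the counts in the usage section are hypothetical and are not taken from the paper. F2 is the F-beta score with beta = 2, which weights recall (sensitivity) more heavily than precision, a common choice when missed cases are costlier than false alarms.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Recall / true-positive rate: share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn: int, fp: int) -> float:
    """True-negative rate: share of actual negatives correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actual positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_beta(prec: float, rec: float, beta: float = 2.0) -> float:
    """F-beta score; beta > 1 favours recall, beta = 1 gives the usual F1."""
    b2 = beta * beta
    denom = b2 * prec + rec
    return (1.0 + b2) * prec * rec / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    tp, fp, tn, fn = 48, 8, 42, 2
    rec = sensitivity(tp, fn)
    prec = precision(tp, fp)
    print("sensitivity:", round(rec, 3))
    print("specificity:", round(specificity(tn, fp), 3))
    print("F2:", round(f_beta(prec, rec, beta=2.0), 3))
```

Because beta squared multiplies precision in the denominator, a model with high sensitivity and moderate precision can still score a high F2, which is worth keeping in mind when comparing F2 against F1 results from other work.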

Physically Interpretable World Models via Weakly Supervised Representation Learning

Authors: Zhenjiang Mao, Mrinall Eashaan Umasudhan, Ivan Ruchkin

First: 2024-12-17T12:51:24+00:00 · Latest: 2025-11-21T18:24:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety-critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground-truth physical annotations, PIWM employs weak distribution-based supervision that captures state uncertainty naturally arising from real-world sensing pipelines. The architecture integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.

中文标题/摘要

标题：通过弱监督表示学习获得物理可解释的世界模型

从高维感官观察中学习预测模型是网络物理系统中的基本问题，但标准世界模型学习到的潜在表示缺乏物理可解释性，这限制了它们的可靠性和泛化能力，以及在关键安全任务中的应用。我们提出了物理可解释的世界模型（PIWM），这是一种框架，它将潜在表示与现实世界的物理量对齐，并通过部分已知的物理动力学约束其演变。PIWM 中的物理可解释性由两个互补的属性定义：(i) 学习到的潜在状态对应于有意义的物理变量，(ii) 其时间演变遵循物理上一致的动力学。为了在不需要真实物理注释的情况下实现这一点，PIWM 使用弱分布监督，这种监督自然捕捉到实际传感管道中固有的状态不确定性。该架构结合了基于 VQ 的视觉编码器、基于变换器的物理编码器和基于已知物理方程的可学习动力学模型。在三个案例研究（Cart Pole、Lunar Lander 和 Donkey Car）中，PIWM 实现了准确的长期预测，恢复了真实的系统参数，并显著提高了物理关联性，优于纯数据驱动的模型。这些结果表明，在弱监督下直接从图像中学习物理可解释的世界模型的可行性和优势。

Summary / 总结

The research aims to develop physically interpretable world models for cyber-physical systems by aligning latent representations with real-world physical quantities. PIWM uses weak distribution-based supervision to learn from high-dimensional sensory data without requiring ground-truth physical annotations. The model integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model. Experimental results show that PIWM can achieve accurate long-horizon predictions, recover true system parameters, and improve physical grounding compared to purely data-driven models in three case studies: Cart Pole, Lunar Lander, and Donkey Car.

研究旨在通过将潜在表示与现实世界的物理量对齐来开发物理可解释的世界模型。方法使用弱分布监督来捕捉状态不确定性，并结合了基于VQ的视觉编码器、基于Transformer的物理编码器和基于已知物理方程的可学习动力学模型。关键发现表明，PIWM在三个案例研究中实现了准确的长时预测、恢复了真实系统参数，并且与纯数据驱动模型相比，提高了物理关联性。

Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Authors: Zhen Wang, Zhifeng Gao, Guolin Ke

First: 2025-11-21T18:23:04+00:00 · Latest: 2025-11-21T18:23:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, RLVR's scalability is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level SFT often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets where only outcomes are verifiable. We implement MR-RLVR on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B, and evaluate on AIME24, AIME25, AMC23, and MATH500. Under a fixed sampling and decoding budget, MR-RLVR achieves average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8. These results indicate that incorporating process-aware self-supervised signals can effectively enhance RLVR's scalability and performance in only outcome-verifiable settings.

中文标题/摘要

标题：带掩蔽和重排序的自我监督强化学习从可验证奖励

测试时的扩展已被证明可以显著提高大型语言模型（LLMs）的数学推理能力。然而，对于大量数学语料库，尤其是定理证明，RLVR的扩展性受到限制：中间推理至关重要，而最终答案难以直接和可靠地验证。同时，基于token的SFT往往退化为机械记忆，而不是诱导更长的思维链。受BERT的自我监督任务启发，我们提出了MR-RLVR（带掩蔽和重排序的RLVR），通过“掩蔽-填充”和“步骤重排序”构建过程级自我监督奖励，从中间推理中提取可学习的信号。我们的训练管道包括两个阶段：首先在采样的数学计算和证明数据上进行自我监督训练；然后在只有结果可验证的数学计算数据集上进行RLVR微调。我们在Qwen2.5-3B和DeepSeek-R1-Distill-Qwen-1.5B上实现了MR-RLVR，并在AIME24、AIME25、AMC23和MATH500上进行评估。在固定采样和解码预算下，MR-RLVR在Pass@1、Pass@5和Pass@8上的平均相对增益分别为+9.86%、+5.27%和+4.00%。这些结果表明，在仅结果可验证的设置中，结合过程感知的自我监督信号可以有效增强RLVR的扩展性和性能。

Summary / 总结

The paper proposes MR-RLVR, a method that uses masked-and-reordered self-supervision to improve reinforcement learning from verifiable rewards (RLVR), especially for mathematical theorem proving, where intermediate reasoning is crucial but final answers are hard to verify. The method constructs process-level self-supervised rewards through 'masked-then-fill' and 'step reordering' to extract learnable signals from intermediate reasoning. Experiments on Qwen2.5-3B and DeepSeek-R1-Distill-Qwen-1.5B show average relative gains over the original RLVR of +9.86% Pass@1, +5.27% Pass@5, and +4.00% Pass@8, indicating enhanced scalability and performance in settings where only outcomes are verifiable.

论文提出了一种名为MR-RLVR的方法，通过掩码和重排序的自监督任务来改进基于可验证结果的强化学习，特别是在数学定理证明中，中间推理至关重要但最终答案难以直接验证。该方法通过‘掩码-填空’和‘步骤重排序’构造过程级的自监督奖励，以从中间推理中提取可学习的信号。实验表明，MR-RLVR在Qwen2.5-3B和DeepSeek-R1-Distill-Qwen-1.5B上的Pass@1、Pass@5和Pass@8分别提高了9.86%、5.27%和4.00%，表明在仅可验证结果的环境中增强了可扩展性和性能。
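
The two self-supervised tasks named in the abstract, "masked-then-fill" and "step reordering", can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names, the `<MASK>` token, the exact-match fill reward, and the position-wise reordering reward are hypothetical and are not the paper's reward definitions.

```python
import random
from typing import List, Sequence, Tuple


def mask_step(steps: Sequence[str], idx: int, mask_token: str = "<MASK>") -> Tuple[List[str], str]:
    """Hide one reasoning step; return the masked chain and the hidden target."""
    masked = list(steps)
    target = masked[idx]
    masked[idx] = mask_token
    return masked, target


def fill_reward(predicted: str, target: str) -> float:
    """Hypothetical reward: 1.0 on exact match after whitespace stripping, else 0.0."""
    return 1.0 if predicted.strip() == target.strip() else 0.0


def shuffle_steps(steps: Sequence[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle steps; order[k] is the original index of the k-th shuffled step."""
    order = list(range(len(steps)))
    rng.shuffle(order)
    return [steps[i] for i in order], order


def reorder_reward(predicted_order: Sequence[int], true_order: Sequence[int]) -> float:
    """Hypothetical reward: fraction of shuffled positions whose original index is recovered."""
    if not true_order or len(predicted_order) != len(true_order):
        return 0.0
    hits = sum(p == t for p, t in zip(predicted_order, true_order))
    return hits / len(true_order)


if __name__ == "__main__":
    chain = ["Let x = 3.", "Then x^2 = 9.", "So x^2 + 1 = 10."]
    masked, target = mask_step(chain, 1)
    print(masked, "->", fill_reward("Then x^2 = 9.", target))
    shuffled, order = shuffle_steps(chain, random.Random(0))
    print(shuffled, "->", reorder_reward(order, order))
```

Both rewards are computed from the reasoning chain itself, so they need no external answer checker. That is the property the abstract relies on for proof-style data where final answers are hard to verify.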

Can AI Perceive Physical Danger and Intervene?

Authors: Abhishek Jindal, Dmitry Kalashnikov, R. Alex Hofer, Oscar Chang, Divya Garikapati, Anirudha Majumdar, Pierre Sermanet, Vikas Sindhwani

First: 2025-09-25T22:09:17+00:00 · Latest: 2025-11-21T18:22:41+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

When AI interacts with the physical world -- as a robot or an assistive agent -- new safety challenges emerge beyond those of purely "digital AI". In such interactions, the potential for physical harm is direct and immediate. How well do state-of-the-art foundation models understand common-sense facts about physical safety, e.g. that a box may be too heavy to lift, or that a hot cup of coffee should not be handed to a child? In this paper, our contributions are three-fold: first, we develop a highly scalable approach to continuous physical safety benchmarking of Embodied AI systems, grounded in real-world injury narratives and operational safety constraints. To probe multi-modal safety understanding, we turn these narratives and constraints into photorealistic images and videos capturing transitions from safe to unsafe states, using advanced generative models. Secondly, we comprehensively analyze the ability of major foundation models to perceive risks, reason about safety, and trigger interventions; this yields multi-faceted insights into their deployment readiness for safety-critical agentic applications. Finally, we develop a post-training paradigm to teach models to explicitly reason about embodiment-specific safety constraints provided through system instructions. The resulting models generate thinking traces that make safety reasoning interpretable and transparent, achieving state-of-the-art performance in constraint satisfaction evaluations. The benchmark is released at https://asimov-benchmark.github.io/v2

中文标题/摘要

标题：AI能否感知物理危险并干预？

当AI与物理世界互动——作为机器人或辅助代理时，新的安全挑战随之而来，超越了纯粹的“数字AI”所面临的挑战。在这种互动中，物理伤害的潜在风险是直接且即时的。最先进的基础模型对物理安全常识的理解如何？例如，它们是否知道一个箱子可能太重而无法提起，或者一杯热咖啡不应递给儿童？在本文中，我们的贡献有三个方面：首先，我们开发了一种高度可扩展的方法，用于持续评估具身AI系统的物理安全性，该方法基于实际伤害案例和操作安全约束。为了探究多模态安全理解，我们将这些案例和约束转化为捕捉从安全状态到不安全状态过渡的逼真图像和视频，使用先进的生成模型。其次，我们全面分析了主要基础模型感知风险、推理安全以及触发干预的能力；这为它们在关键安全应用中的部署准备提供了多方面的见解。最后，我们开发了一种后训练范式，通过系统指令提供特定于具身的安全约束来教授模型进行显式推理。生成的思维轨迹使安全推理变得可解释和透明，实现了约束满足评估中的最佳性能。基准已发布于https://asimov-benchmark.github.io/v2

Summary / 总结

This paper addresses the safety challenges faced by AI systems when interacting with the physical world. It introduces a scalable method for benchmarking physical safety in embodied AI systems using real-world injury narratives and operational safety constraints. The study evaluates major foundation models' ability to perceive risks, reason about safety, and trigger interventions, revealing their deployment readiness for safety-critical applications. The research also develops a post-training paradigm to teach models to reason about embodiment-specific safety constraints, improving interpretability and performance in constraint satisfaction tasks.

本文探讨了AI系统在与物理世界交互时面临的安全挑战。研究提出了一种基于真实世界伤害案例和操作安全约束的可扩展方法，用于对具身AI系统进行物理安全基准测试。研究评估了主要基础模型感知风险、安全推理和触发干预的能力，从而对其在安全关键型应用中的部署就绪程度给出了多方面的见解。研究还开发了一种后训练范式，以提高模型对安全约束的推理能力，从而在约束满足评估中取得了最先进的性能。

Topology Aware Neural Interpolation of Scalar Fields

Authors: Mohamed Kissi, Keanu Sisouk, Joshua A. Levine, Julien Tierny

First: 2025-08-25T13:04:21+00:00 · Latest: 2025-11-21T18:16:26+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

This paper presents a neural scheme for the topology-aware interpolation of time-varying scalar fields. Given a time-varying sequence of persistence diagrams, along with a sparse temporal sampling of the corresponding scalar fields, denoted as keyframes, our interpolation approach aims at "inverting" the non-keyframe diagrams to produce plausible estimations of the corresponding, missing data. For this, we rely on a neural architecture which learns the relation from a time value to the corresponding scalar field, based on the keyframe examples, and reliably extends this relation to the non-keyframe time steps. We show how augmenting this architecture with specific topological losses exploiting the input diagrams improves both the geometrical and topological reconstruction of the non-keyframe time steps. At query time, given an input time value for which an interpolation is desired, our approach instantaneously produces an output, via a single propagation of the time input through the network. Experiments interpolating 2D and 3D time-varying datasets show our approach's superiority, both in terms of data and topological fitting, with regard to reference interpolation schemes. Our implementation is available at this GitHub link: https://github.com/MohamedKISSI/Topology-Aware-Neural-Interpolation-of-Scalar-Fields.git.

中文标题/摘要

标题：拓扑感知标量场神经插值

本文提出了一种神经方案，用于时间变化的标量场的拓扑感知插值。给定一个时间变化的持久图序列以及相应的稀疏时间采样的标量场关键帧，我们的插值方法旨在“反转”非关键帧的图，以生成相应的缺失数据的合理估计。为此，我们依赖于一种神经架构，该架构基于关键帧示例学习从时间值到相应标量场的关系，并可靠地将这种关系扩展到非关键帧的时间步。我们展示了通过将特定拓扑损失与输入图结合，增强该架构如何提高非关键帧时间步的几何和拓扑重建。在查询时间，给定一个需要插值的输入时间值，我们的方法通过网络中时间输入的一次传播即时生成输出。实验表明，与参考插值方案相比，我们的方法在数据和拓扑拟合方面均表现出优越性。我们的实现可在以下GitHub链接获取：https://github.com/MohamedKISSI/Topology-Aware-Neural-Interpolation-of-Scalar-Fields.git。

Summary / 总结

This paper proposes a neural method for interpolating time-varying scalar fields while preserving their topological features. Given keyframes and persistence diagrams, the approach learns to estimate missing scalar field data at non-keyframe time steps. Topological losses are used to enhance the reconstruction, improving both geometric and topological accuracy. Experiments demonstrate superior performance compared to existing methods in both data and topological fitting for 2D and 3D datasets.

该论文提出了一种神经方法，用于在保持拓扑性质的同时插值时间变化的标量场。给定关键帧及其持久图，该方法学习在非关键帧时间步估计缺失的数据。通过引入拓扑损失，模型在几何和拓扑准确性方面均提高了插值数据的表现。实验表明，在2D和3D数据集中，该方法优于传统插值方法。

PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

Authors: Siqi Liang, Yudi Zhang, Yue Guo

First: 2025-11-21T18:15:47+00:00 · Latest: 2025-11-21T18:15:47+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a novel framework for a persona-based language model system, motivated by the need for personalized AI agents that adapt to individual user preferences. In our approach, the agent embodies the user's "persona" (e.g. user profile or taste) and is powered by a large language model (LLM). To enable the agent to leverage rich contextual information, we introduce a Knowledge-Graph-enhanced Retrieval-Augmented Generation (Graph RAG) mechanism that constructs an LLM-derived graph index of relevant documents and summarizes communities of related information. Our framework generates personalized prompts by combining: (1) a summary of the user's historical behaviors and preferences extracted from the knowledge graph, and (2) relevant global interaction patterns identified through graph-based community detection. This dynamic prompt engineering approach allows the agent to maintain consistent persona-aligned behaviors while benefiting from collective knowledge. On the LaMP benchmark, our method improves news categorization F1 by 11.1%, movie tagging F1 by 56.1%, and reduces product rating MAE by 10.4% over prior methods. Our code is available at https://anonymous.4open.science/r/PersonaAgentwGraphRAG-DE6F

中文标题/摘要

标题：PersonaAgent与GraphRAG：基于社区意识的知识图谱个性化LLM系统

我们提出了一种基于人设的语言模型系统框架，旨在满足个性化AI代理适应个体用户偏好的需求。在我们的方法中，代理承载用户的“人设”（例如用户档案或品味），并由大规模语言模型（LLM）驱动。为了使代理能够利用丰富的上下文信息，我们引入了一种知识图谱增强的检索增强生成（Graph RAG）机制，该机制构建了一个由LLM生成的相关文档图索引，并总结了相关信息的社区。我们的框架通过结合以下内容生成个性化提示：(1) 从知识图谱中提取的用户历史行为和偏好的总结，以及(2) 通过基于图的社区检测识别的相关全球交互模式。这种动态提示工程方法使代理能够在保持一致的人设行为的同时受益于集体知识。在LaMP基准测试中，我们的方法在新闻分类F1上提高了11.1%，在电影标记F1上提高了56.1%，并且在产品评分MAE上降低了10.4%。我们的代码可在https://anonymous.4open.science/r/PersonaAgentwGraphRAG-DE6F获取

Summary / 总结

This paper introduces PersonaAgent with GraphRAG, a framework for personalized AI agents that adapt to individual user preferences. The method uses a large language model and a knowledge-graph-enhanced retrieval-augmented generation mechanism to construct an LLM-derived graph index and summarize related information communities. The framework generates personalized prompts by combining user historical behaviors and global interaction patterns, enabling consistent persona-aligned behaviors. Experimental results on the LaMP benchmark show an 11.1% gain in news categorization F1, a 56.1% gain in movie tagging F1, and a 10.4% reduction in product rating MAE over previous methods.

本文提出了一个名为PersonaAgent的框架，结合GraphRAG机制，旨在创建能够适应个体用户偏好的个性化AI代理。该方法使用大型语言模型来体现用户的人格，并利用知识图谱增强的检索增强生成机制构建相关文档的LLM衍生图索引。这种方法在LaMP基准上提高了新闻分类F1分数11.1%，电影标签F1分数56.1%，并减少了产品评分MAE误差10.4%。
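
The prompt-assembly step described in the abstract combines (1) a summary of the user's own history with (2) community-level patterns. The Python sketch below shows one way to do that; the function name, section headings, and the `max_communities` cap are hypothetical choices made for illustration and do not come from the paper's code.

```python
from typing import Sequence


def build_personalized_prompt(
    task: str,
    user_summary: str,
    community_summaries: Sequence[str],
    max_communities: int = 2,
) -> str:
    """Assemble a prompt from a user-level summary and a few community summaries.

    The inputs are assumed to be produced upstream (for example by a
    knowledge-graph query and a community-detection pass); this function only
    handles the final text assembly.
    """
    parts = ["[User persona]", user_summary.strip(), "[Community patterns]"]
    # Keep only the top-ranked communities so the prompt stays short.
    for summary in list(community_summaries)[:max_communities]:
        parts.append("- " + summary.strip())
    parts.extend(["[Task]", task.strip()])
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_personalized_prompt(
        task="Assign one tag to the movie 'Example Film'.",
        user_summary="Mostly rates science-fiction titles; rarely tags comedies.",
        community_summaries=[
            "Users with similar history favour the tag 'dystopia'.",
            "This community often co-tags 'space' and 'adventure'.",
            "Low-activity users in this cluster prefer generic tags.",
        ],
    )
    print(prompt)
```

Capping the number of community summaries is a design choice in this sketch: it trades some collective context for a shorter prompt, and the right cap would depend on the model's context budget.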

Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction

Authors: Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka

First: 2025-11-04T13:19:58+00:00 · Latest: 2025-11-21T18:11:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

中文标题/摘要

标题：预测未来解剖结构：纵向脑MRI到MRI预测

从基线磁共振成像（MRI）预测未来脑状态是神经影像学中的一个核心挑战，对于研究阿尔茨海默病（AD）等神经退行性疾病具有重要意义。现有大多数方法预测未来认知评分或临床结果，如从轻度认知障碍转化为痴呆。相反，我们在此研究纵向MRI图像到图像的预测，以预测参与者几年后的整个脑MRI，内在地建模复杂的、空间分布的神经退行性模式。我们在两个纵向队列（ADNI和AIBL）上实现并评估了五种深度学习架构（UNet、U2-Net、UNETR、时间嵌入UNet和ODE-UNet）。预测的随访MRI直接与实际随访扫描进行比较，使用能够捕捉全局相似性和局部差异的指标。表现最佳的模型实现了高保真预测，并且所有模型在独立外部数据集上泛化良好，展示了稳健的跨队列性能。我们的结果表明，深度学习可以可靠地在体素水平上预测参与者的脑MRI，为个体化预后提供了新的机会。

Summary / 总结

This study aims to predict future brain MRI scans from a baseline MRI to understand neurodegenerative diseases like Alzheimer's. Five deep learning models (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) were evaluated on two cohorts (ADNI and AIBL). The best models achieved high-fidelity predictions and generalized well to an independent dataset, showing robust cross-cohort performance in forecasting participant-specific brain MRI at the voxel level.

本研究旨在从基线影像预测未来脑MRI，以理解阿尔茨海默病等神经退行性疾病。五个深度学习模型在两个队列中进行了评估，最佳模型实现了高保真预测，并且能够很好地泛化到独立的外部数据集。这表明深度学习可以可靠地预测个体脑MRI在体素水平的变化，为个性化预后提供了新的见解。

Automated Interpretable 2D Video Extraction from 3D Echocardiography

Authors: Milos Vukadinovic, Hirotaka Ieki, Yuki Sahashi, David Ouyang, Bryan He

First: 2025-11-20T00:40:43+00:00 · Latest: 2025-11-21T18:09:32+00:00

Comments: 12 pages, 5 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as their ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.

中文标题/摘要

标题：从3D超声心动图中自动提取可解释的2D视频

尽管心脏具有复杂的三维（3D）解剖结构，但传统的医学成像技术——心脏超声，依赖于一系列显示心脏结构的2D视频。3D超声心动图是一种正在发展的成像技术，现在提供了足够的图像质量用于临床使用，有可能简化数据采集并改善对非轴向特征的评估。我们提出了一种自动方法，从3D心脏超声体积中选择标准的2D视图，使医生能够在他们习惯的格式下解释数据，同时受益于3D扫描的速度和易用性。通过应用深度学习视图分类器和基于解剖标志的下游启发式方法，结合心脏病专家提供的启发式方法，我们重建了标准的超声心动图视图。该方法在盲评中得到了三位心脏病专家的验证（1,600个视频来自2家医院的准确率为96%）。下游的2D视频还通过人工智能超声心动图模型（EchoPrime和PanEcho）验证了其检测心脏异常的能力，以及生成心脏解剖学临床标准测量值（EchoNet-Measurement）的能力。我们证明了提取的2D视频保留了空间校准和诊断特征，使临床医生能够从3D体积中获得准确的现实世界解释。我们发布了代码和29个3D超声心动图视频的数据集https://github.com/echonet/3d-echo。

Summary / 总结

The research aims to streamline the interpretation of 3D echocardiography by automatically extracting standard 2D views. It uses a deep learning view classifier and heuristics based on anatomical landmarks and cardiologists' input to reconstruct these views. The method achieved 96% accuracy in a blinded evaluation by three cardiologists and successfully detected cardiac abnormalities and generated clinical-grade measurements, preserving spatial calibration and diagnostic features.

研究旨在通过自动提取标准2D视图来简化3D超声心动图的解读。采用深度学习视图分类器和解剖学启发式方法重建这些视图，在盲评中准确率达到96%。提取的2D视频能够有效检测心脏异常并生成临床级别的测量结果，同时保持空间校准和诊断特征。

SRA-CP: Spontaneous Risk-Aware Selective Cooperative Perception

Authors: Jiaxi Liu, Chengyuan Ma, Hang Zhou, Weizhe Tang, Shixiao Liang, Haoyang Ding, Xiaopeng Li, Bin Ran

First: 2025-11-21T18:03:48+00:00 · Latest: 2025-11-21T18:03:48+00:00

Abs · PDF · Code1 · Code2

Abstract

Cooperative perception (CP) offers significant potential to overcome the limitations of single-vehicle sensing by enabling information sharing among connected vehicles (CVs). However, existing generic CP approaches need to transmit large volumes of perception data that are irrelevant to driving safety, exceeding the available communication bandwidth. Moreover, most CP frameworks rely on pre-defined communication partners, making them unsuitable for dynamic traffic environments. This paper proposes a Spontaneous Risk-Aware Selective Cooperative Perception (SRA-CP) framework to address these challenges. SRA-CP introduces a decentralized protocol where connected agents continuously broadcast lightweight perception coverage summaries and initiate targeted cooperation only when risk-relevant blind zones are detected. A perceptual risk identification module enables each CV to locally assess the impact of occlusions on its driving task and determine whether cooperation is necessary. When CP is triggered, the ego vehicle selects appropriate peers based on shared perception coverage and engages in selective information exchange through a fusion module that prioritizes safety-critical content and adapts to bandwidth constraints. We evaluate SRA-CP on a public dataset against several representative baselines. Results show that SRA-CP achieves less than 1% average precision (AP) loss for safety-critical objects compared to generic CP, while using only 20% of the communication bandwidth. Moreover, it improves the perception performance by 15% over existing selective CP methods that do not incorporate risk awareness.

中文标题/摘要

标题：SRA-CP：自发风险感知选择性协同感知

协同感知（CP）通过使联网车辆（CVs）之间共享信息，能够克服单车辆传感器的局限性，具有巨大的潜力。然而，现有的通用CP方法需要传输大量与驾驶安全无关的感知数据，超出可用的通信带宽。此外，大多数CP框架依赖预定义的通信伙伴，使其不适合动态交通环境。本文提出了一种自发风险感知选择性协同感知（SRA-CP）框架来解决这些问题。SRA-CP引入了一种去中心化的协议，其中联网代理持续广播轻量级的感知覆盖摘要，并仅在检测到风险相关的盲区时发起有针对性的合作。感知风险识别模块使每辆CV能够本地评估遮挡对其驾驶任务的影响，并确定是否需要合作。当CP被触发时，ego车辆根据共享的感知覆盖选择合适的伙伴，并通过融合模块进行选择性信息交换，该模块优先考虑安全关键内容并适应带宽限制。我们在公共数据集上将SRA-CP与几个代表性基线进行了评估。结果显示，与通用CP相比，SRA-CP在安全关键对象上的平均精度（AP）损失不到1%，同时仅使用20%的通信带宽。此外，它比不包含风险意识的现有选择性CP方法提高了15%的感知性能。

Summary / 总结

SRA-CP addresses the limitations of existing cooperative perception (CP) methods by introducing a decentralized protocol that selectively shares perception data only when necessary, based on risk assessment. This approach reduces unnecessary data transmission and improves safety by focusing on critical information. Experimental results demonstrate that SRA-CP maintains high precision for safety-critical objects with minimal bandwidth usage and enhances perception performance by 15% compared to other selective CP methods.

SRA-CP通过引入一个分散化的协议，在基于风险评估的情况下仅在必要时选择性地分享感知数据，解决了现有合作感知（CP）方法的局限性。这种方法减少了不必要的数据传输，并通过关注关键信息提高了安全性。实验结果表明，SRA-CP在使用极低带宽的情况下，能够保持对安全关键对象的高精度，并且与不包含风险意识的其他选择性CP方法相比，感知性能提高了15%。
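
The protocol outlined in the abstract has two decisions: whether to trigger cooperation at all, and which peers to ask. The Python sketch below illustrates both on a toy occupancy grid. It is an assumption-laden illustration: the grid-cell representation, the `CoverageSummary` type, and the greedy peer selection are hypothetical stand-ins, not the paper's risk identification or fusion modules.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple

Cell = Tuple[int, int]


@dataclass(frozen=True)
class CoverageSummary:
    """Lightweight broadcast: which grid cells a vehicle can currently perceive."""
    vehicle_id: str
    visible_cells: FrozenSet[Cell]


def risky_blind_cells(path_cells: Iterable[Cell], ego_visible: Iterable[Cell]) -> Set[Cell]:
    """Cells on the ego's planned path that the ego cannot see itself."""
    return set(path_cells) - set(ego_visible)


def select_peers(blind: Set[Cell], summaries: List[CoverageSummary], max_peers: int) -> List[str]:
    """Greedily pick peers that cover the most still-uncovered blind cells."""
    remaining = set(blind)
    chosen: List[str] = []
    while remaining and len(chosen) < max_peers:
        best = max(summaries, key=lambda s: len(remaining & s.visible_cells), default=None)
        if best is None or not (remaining & best.visible_cells):
            break  # no peer can help with what is left
        chosen.append(best.vehicle_id)
        remaining -= best.visible_cells
    return chosen


if __name__ == "__main__":
    path = [(0, 0), (1, 0), (2, 0)]
    ego_view = [(0, 0)]
    peers = [
        CoverageSummary("cv_a", frozenset({(1, 0)})),
        CoverageSummary("cv_b", frozenset({(1, 0), (2, 0)})),
    ]
    blind = risky_blind_cells(path, ego_view)
    # Cooperation is requested only when the blind set is non-empty.
    print(select_peers(blind, peers, max_peers=2) if blind else [])
```

The `max_peers` cap plays the role of a crude bandwidth limit in this sketch; a real system would budget bytes rather than peer counts.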

Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift

Authors: Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty

First: 2025-11-21T17:57:43+00:00 · Latest: 2025-11-21T17:57:43+00:00

Comments: Accepted at BMVC 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.

中文标题/摘要

标题：改进多模态蒸馏以应对激光雷达领域偏移的3D语义分割

在全监督下训练的语义分割网络对不同类型的激光雷达无法泛化，除非进行干预。为了减少领域偏移下的性能差距，最近的趋势是利用提供跨域鲁棒特征的视觉基础模型（VFMs）。在本研究中，我们进行了详尽的研究，以确定在无监督领域适应中利用VFMs进行激光雷达点云语义分割的配方。基于无监督的图像到激光雷达知识蒸馏，我们的研究揭示了以下几点：(1) 激光雷达主干网络的架构对在目标域上最大化泛化性能至关重要；(2) 可以一次预训练一个主干网络，并用于解决多个领域偏移问题；(3) 最佳结果是保持预训练的主干网络冻结，并训练一个MLP头用于语义分割。该方法在四个广泛认可和具有挑战性的设置中达到了最先进的结果。代码将发布于：https://github.com/valeoai/muddos。

Summary / 总结

This study aims to improve the generalization of semantic segmentation networks trained on one type of lidar to unseen lidars. By leveraging vision foundation models, the research identifies key factors such as the architecture of the lidar backbone and the use of a frozen pretrained backbone with an MLP head for semantic segmentation. The proposed method achieves state-of-the-art results in four challenging settings for lidar point cloud segmentation under domain shifts. The code will be available at https://github.com/valeoai/muddos.

本研究旨在通过利用视觉基础模型来提高lidar点云语义分割网络在域转移下的泛化能力。研究发现，lidar骨干网络的架构、一次预训练一个骨干网络并用于解决多个域转移问题、以及使用冻结的预训练骨干网络和MLP头进行语义分割是关键因素，能够显著减少在四个具有挑战性的设置下的性能差距。

Illustrator's Depth: Monocular Layer Index Prediction for Image Decomposition

Authors: Nissim Maruani, Peiying Zhang, Siddhartha Chaudhuri, Matthew Fisher, Nanxuan Zhao, Vladimir G. Kim, Pierre Alliez, Mathieu Desbrun, Wang Yifan

First: 2025-11-21T17:56:43+00:00 · Latest: 2025-11-21T17:56:43+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce Illustrator's Depth, a novel definition of depth that addresses a key challenge in digital content creation: decomposing flat images into editable, ordered layers. Inspired by an artist's compositional process, illustrator's depth assigns a layer index to each pixel, forming an interpretable image decomposition through a discrete, globally consistent ordering of elements optimized for editability. We also propose and train a neural network using a curated dataset of layered vector graphics to predict layering directly from raster inputs. Our layer index inference unlocks a range of powerful downstream applications. In particular, it significantly outperforms state-of-the-art baselines for image vectorization while also enabling high-fidelity text-to-vector-graphics generation, automatic 3D relief generation from 2D images, and intuitive depth-aware editing. By reframing depth from a physical quantity to a creative abstraction, illustrator's depth prediction offers a new foundation for editable image decomposition.

中文标题/摘要

标题：Illustrator的深度：用于图像分解的单目图层索引预测

我们提出了Illustrator的深度，这是一种新颖的深度定义，解决了数字内容创作中的关键挑战：将扁平图像分解为可编辑、有序的图层。受艺术家构图过程的启发，Illustrator的深度推断每个像素的图层索引，通过离散的、全局一致的元素排序形成可解释的图像分解，优化了可编辑性。我们还提出并使用一个精心策划的分层矢量图形数据集训练了一个神经网络，直接从位图输入预测图层。我们的图层索引推断解锁了一系列强大的下游应用。特别是，它在图像矢量化方面显著优于最先进的基线，同时支持高保真度的文本到矢量图形生成、从二维图像自动生成3D浮雕以及直观的深度感知编辑。通过将深度从物理量重新定义为创意抽象，Illustrator的深度预测为可编辑图像分解提供了一个新的基础。

Summary / 总结

The paper introduces Illustrator's Depth, a new concept of depth aimed at decomposing flat images into editable layers. The method uses a neural network trained on a dataset of layered vector graphics to predict layer indices for each pixel, enabling various applications such as image vectorization, text-to-vector graphics generation, and 3D relief creation. The approach significantly outperforms existing methods and supports intuitive depth-aware editing.

研究引入了Illustrator的深度，这是一种新的深度定义，用于将平面图像分解为可编辑的图层，灵感来源于艺术家的创作过程。该方法使用一个基于层叠矢量图形数据集训练的神经网络来预测每个像素的图层索引。关键发现表明，这种方法在图像矢量化方面优于现有方法，并能够实现文本到矢量图形生成、自动从2D图像生成3D浮雕以及深度感知编辑等高级应用。通过提供一个离散且全局一致的图层排序，Illustrator的深度预测增强了图像的可编辑性，并为数字内容创作开辟了新的可能性。

Planning with Sketch-Guided Verification for Physics-Aware Video Generation

Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal

First: 2025-11-21T17:48:02+00:00 · Latest: 2025-11-21T17:48:02+00:00

Comments: website: https://sketchverify.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement which requires multiple calls to the video generator, incurring high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.

中文标题/摘要

标题：基于草图引导验证的物理感知视频生成规划

近期的视频生成方法越来越多地依赖于规划中间控制信号（如物体轨迹），以提高时间连贯性和运动保真度。然而，这些方法大多采用单次规划方案，通常仅限于简单的运动，或者需要多次调用视频生成器进行迭代优化，导致计算成本高昂。为克服这些限制，我们提出了一种无需训练的、基于草图验证的规划框架——SketchVerify，该框架通过引入测试时的采样和验证循环，在进行完整的视频生成之前，以更动态一致的轨迹（即物理上合理且指令一致的运动）来提高运动规划的质量。给定提示和参考图像，我们的方法预测多个候选运动计划，并使用结合指令语义对齐和物理合理性评估的视觉语言验证器对其进行排名。为了高效地评分候选运动计划，我们通过将对象合成到静态背景上生成轻量级视频草图，从而绕过了昂贵的重复扩散合成过程，同时保持了相当的性能。我们迭代优化运动计划，直到找到一个满意的计划，然后将其传递给轨迹条件生成器进行最终合成。在WorldModelBench和PhyWorldBench上的实验表明，与竞争性基线相比，我们的方法在运动质量、物理真实性和长期一致性方面显著提高，且效率更高。进一步的消融研究还表明，增加轨迹候选的数量可以一致地提高整体性能。

Summary / 总结

The research aims to enhance the temporal coherence and motion fidelity in video generation by improving motion planning quality. SketchVerify proposes a training-free framework that uses a sketch-verification loop to predict and rank multiple candidate motion plans, which are then refined iteratively. Experiments show that this method outperforms existing approaches in terms of motion quality, physical realism, and long-term consistency, while being more efficient. The method uses lightweight video sketches for efficient scoring of candidate plans, bypassing the need for expensive repeated synthesis.

该论文提出了一种名为SketchVerify的无训练框架，用于提高视频生成中的运动规划。该方法使用草图验证循环来预测和细化多个运动计划，并根据语义对齐和物理合理性进行评估。这种方法在运动质量、物理真实性和长期一致性方面优于现有方法，同时更为高效。实验结果表明，与竞争基线相比，该方法在运动质量方面有显著改进。
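
The test-time sampling and verification loop described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `sample_and_verify`, `toy_propose`, and `toy_score` are hypothetical names, the toy scorer stands in for the vision-language verifier, and the stopping threshold is invented; none of it is taken from the SketchVerify code.

```python
import random
from typing import Callable, List, Optional, Tuple

Plan = List[Tuple[float, float]]  # one (x, y) object position per frame


def sample_and_verify(
    propose: Callable[[random.Random], Plan],
    score: Callable[[Plan], float],
    num_candidates: int = 8,
    threshold: float = 0.8,
    max_rounds: int = 3,
    seed: int = 0,
) -> Tuple[Optional[Plan], float]:
    """Sample candidate plans, score each one, and keep the best seen so far.

    Stops early once the best score reaches `threshold` (a hypothetical
    stand-in for "a satisfactory plan is identified").
    """
    rng = random.Random(seed)
    best_plan: Optional[Plan] = None
    best_score = float("-inf")
    for _ in range(max_rounds):
        for _ in range(num_candidates):
            plan = propose(rng)
            plan_score = score(plan)
            if plan_score > best_score:
                best_plan, best_score = plan, plan_score
        if best_score >= threshold:
            break
    return best_plan, best_score


def toy_propose(rng: random.Random) -> Plan:
    # Hypothetical planner: rightward motion with random vertical jitter.
    return [(float(t), rng.uniform(-1.0, 1.0)) for t in range(5)]


def toy_score(plan: Plan) -> float:
    # Hypothetical verifier: prefers small frame-to-frame vertical jumps.
    jumps = [abs(b[1] - a[1]) for a, b in zip(plan, plan[1:])]
    return 1.0 / (1.0 + sum(jumps))


best, best_value = sample_and_verify(toy_propose, toy_score, num_candidates=16)
```

With a fixed seed and a single round, a larger `num_candidates` can only keep or raise the best score, which mirrors in a toy setting the reported trend that more trajectory candidates help.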

MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Authors: Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong

First: 2025-11-21T17:46:44+00:00 · Latest: 2025-11-21T17:46:44+00:00

Comments: 10 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.

中文标题/摘要

标题：MMT-ARD：多模态多教师对抗鲁棒蒸馏方法

视觉-语言模型（VLMs）在越来越多的安全关键应用中得到部署，因此其对抗鲁棒性变得至关重要。虽然对抗知识蒸馏在从教师模型向学生模型转移鲁棒性方面显示出潜力，但传统的单教师方法存在知识多样性有限、收敛速度慢以及难以平衡鲁棒性和准确性的缺点。为了解决这些挑战，我们提出了MMT-ARD：一种多模态多教师对抗鲁棒蒸馏框架。我们的主要创新是一个双教师知识融合架构，协同优化干净特征的保留和鲁棒特征的增强。为了更好地处理具有挑战性的对抗样本，我们引入了一种基于教师置信度的动态权重分配策略，能够适应性地关注更难的样本。此外，为了减轻教师之间的偏差，我们设计了一种基于自适应Sigmoid的加权函数，以在不同模态之间平衡知识转移的强度。在ImageNet和零样本基准上的广泛实验表明，MMT-ARD在ViT-B-32模型上将鲁棒准确率提高了4.32%，零样本准确率提高了3.5%，同时训练效率比传统单教师方法提高了2.3倍。这些结果突显了MMT-ARD在增强多模态大型模型对抗鲁棒性方面的有效性和可扩展性。我们的代码可在 https://github.com/itsnotacie/MMT-ARD 获取。

Summary / 总结

MMT-ARD is a framework that enhances the adversarial robustness of vision-language models by using a dual-teacher knowledge fusion architecture and dynamic weight allocation based on teacher confidence. It improves robust accuracy by 4.32% and zero-shot accuracy by 3.5% on the ViT-B-32 model, with a 2.3x increase in training efficiency compared to traditional single-teacher methods.

MMT-ARD 是一种框架，通过使用双教师知识融合架构和动态权重分配来增强视觉-语言模型（VLMs）的对抗鲁棒性。它在 ViT-B-32 模型上将鲁棒准确率提高了 4.32%，零样本准确率提高了 3.5%，并且相比传统单教师方法训练效率提高了 2.3 倍。

CATCODER: Repository-Level Code Generation with Relevant Code and Type Context

Authors: Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

First: 2024-06-05T13:56:42+00:00 · Latest: 2025-11-21T17:41:45+00:00

Comments: Revised and extended version; To be published in ACM Transactions on Software Engineering and Methodology

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, repository-level code generation presents unique challenges, particularly due to the need to utilize information spread across multiple files within a repository. Specifically, successful generation depends on a solid grasp of both general, context-agnostic knowledge and specific, context-dependent knowledge. While LLMs are widely used for the context-agnostic aspect, existing retrieval-based approaches sometimes fall short as they are limited in obtaining a broader and deeper repository context. In this paper, we present CatCoder, a novel code generation framework designed for statically typed programming languages. CatCoder enhances repository-level code generation by integrating relevant code and type context. Specifically, it leverages static analyzers to extract type dependencies and merges this information with retrieved code to create comprehensive prompts for LLMs. To evaluate the effectiveness of CatCoder, we adapt and construct benchmarks that include 199 Java tasks and 90 Rust tasks. The results show that CatCoder outperforms the RepoCoder baseline by up to 14.44% and 17.35%, in terms of compile@k and pass@k scores. In addition, the generalizability of CatCoder is assessed using various LLMs, including both code-specialized models and general-purpose models. Our findings indicate consistent performance improvements across all models, which underlines the practicality of CatCoder. Furthermore, we evaluate the time consumption of CatCoder in a large open source repository, and the results demonstrate the scalability of CatCoder.

中文标题/摘要

标题：CATCODER：基于相关代码和类型上下文的仓库级代码生成

大型语言模型（LLMs）在代码生成任务中展现了卓越的能力。然而，仓库级代码生成面临着独特的挑战，特别是需要利用仓库中多个文件中的信息。具体来说，成功的生成依赖于对一般性、无上下文知识和具体性、有上下文知识的深刻理解。虽然LLMs广泛用于无上下文知识方面，但现有的检索式方法有时会因获取更广泛和深入的仓库上下文能力有限而表现不佳。在本文中，我们提出了CatCoder，这是一种专为静态类型编程语言设计的新型代码生成框架。CatCoder通过整合相关代码和类型上下文来增强仓库级代码生成。具体而言，它利用静态分析器提取类型依赖关系，并将此信息与检索到的代码合并，为LLMs创建全面的提示。为了评估CatCoder的有效性，我们改编并构建了基准测试，包括199个Java任务和90个Rust任务。结果显示，CatCoder在compile@k和pass@k得分上分别比RepoCoder基线最多高出14.44%和17.35%。此外，我们使用包括代码专门模型和通用模型在内的多种LLMs评估了CatCoder的泛化能力。我们的研究结果表明，CatCoder在所有模型上都表现出一致的性能提升，这突显了CatCoder的实用性。此外，我们还评估了CatCoder在大型开源仓库中的时间消耗，结果证明了CatCoder的可扩展性。

Summary / 总结

CatCoder is a novel code generation framework for statically typed programming languages that integrates relevant code and type context to address the challenges of repository-level code generation. It uses static analyzers to extract type dependencies and merges this information with retrieved code to create comprehensive prompts for large language models. The framework outperforms the RepoCoder baseline by up to 14.44% and 17.35% in compile@k and pass@k scores, respectively, and shows consistent performance improvements across various large language models. Additionally, CatCoder demonstrates scalability in a large open source repository.

CatCoder 是一种用于静态类型编程语言的仓库级代码生成框架，通过整合相关代码和类型上下文来增强大型语言模型的能力。该框架使用静态分析器提取类型依赖关系，并将其与检索到的代码合并以创建全面的提示。实验结果显示，CatCoder 在 compile@k 和 pass@k 评分上分别比 RepoCoder 基线高出最多 14.44% 和 17.35%。此外，CatCoder 在各种语言模型上表现出一致的性能改进，并在大型开源仓库中具有可扩展性。
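
For readers unfamiliar with the pass@k score mentioned above, the following is a small sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per task and c of them are correct. The abstract does not say how CatCoder computes its scores, so treating pass@k this way, and treating compile@k as the same formula with "compiles" in place of "passes the tests", is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: number of generated samples for a task
    c: number of those samples that are correct (or that compile)
    k: sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a task with 10 samples of which 3 are correct gives `pass_at_k(10, 3, 1)` of about 0.3, and the benchmark score is the mean over all tasks.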

REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing

Authors: Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir

First: 2025-11-21T17:41:26+00:00 · Latest: 2025-11-21T17:41:26+00:00

Comments: Code and data available at https://github.com/be-chen/REMSA

Abs · PDF · Code1 · Code2 · Code3

Abstract

Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.

中文标题/摘要

标题：REMSA：用于遥感基础模型选择的LLM代理

基础模型（FMs）在遥感（RS）中越来越多地用于环境监测、灾害评估和土地利用制图等任务。这些模型包括在单一数据模态上训练的单模态视觉编码器，以及在合成孔径雷达（SAR）、多光谱、高光谱和图像-文本数据的组合上训练的多模态架构。它们支持包括语义分割、图像分类、变化检测和视觉问答在内的多种RS任务。然而，由于文档分散、格式异构和部署约束多样，选择合适的遥感基础模型（RSFM）仍然很困难。我们介绍了RSFM数据库（RS-FMD），这是一个结构化的资源，涵盖了超过150个RSFM，跨越多种数据模态、分辨率和学习范式。基于RS-FMD，我们提出了REMSA，这是第一个基于LLM的代理，用于从自然语言查询中自动选择RSFM。REMSA 解释用户需求，解决缺失的约束，使用上下文学习对候选模型进行排名，并提供透明的解释。我们还提出了一个由75个专家验证的RS查询场景组成的基准，在以专家为中心的评估协议下生成了900种配置。REMSA 优于多个基线，包括朴素代理、密集检索和基于非结构化RAG的LLM。它完全基于公开的元数据运行，并不访问私人或敏感数据。

Summary / 总结

The research addresses the challenge of selecting appropriate remote sensing foundation models (RSFMs) due to scattered documentation and varied deployment constraints. It introduces REMSA, an LLM agent that interprets user requirements, ranks RSFMs using in-context learning, and provides transparent justifications. REMSA outperforms several baselines and operates solely on publicly available metadata without accessing private data.

研究解决了由于文档分散和部署约束多样而导致选择合适的遥感基础模型（RSFM）的难题。它引入了REMSA，这是一种基于LLM的代理，能够解释用户需求、使用上下文学习对RSFM进行排序，并提供透明的解释。REMSA优于多个基线方法，并且仅依赖于公开的元数据，不访问私人或敏感数据。

InTAct: Interval-based Task Activation Consolidation for Continual Learning

Authors: Patryk Krukowski, Jan Miksa, Piotr Helm, Jacek Tabor, Paweł Wawrzyński, Przemysław Spurek

First: 2025-11-21T17:36:12+00:00 · Latest: 2025-11-21T17:36:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Continual learning aims to enable neural networks to acquire new knowledge without forgetting previously learned information. While recent prompt-based methods perform strongly in class-incremental settings, they remain vulnerable under domain shifts, where the input distribution changes but the label space remains fixed. This exposes a persistent problem known as representation drift. Shared representations evolve in ways that overwrite previously useful features and cause forgetting even when prompts isolate task-specific parameters. To address this issue, we introduce InTAct, a method that preserves functional behavior in shared layers without freezing parameters or storing past data. InTAct captures the characteristic activation ranges associated with previously learned tasks and constrains updates to ensure the network remains consistent within these regions, while still allowing for flexible adaptation elsewhere. In doing so, InTAct stabilizes the functional role of important neurons rather than directly restricting parameter values. The approach is architecture-agnostic and integrates seamlessly into existing prompt-based continual learning frameworks. By regulating representation changes where past knowledge is encoded, InTAct achieves a principled balance between stability and plasticity. Across diverse domain-incremental benchmarks, including DomainNet and ImageNet-R, InTAct consistently reduces representation drift and improves performance, increasing Average Accuracy by up to 8 percentage points over state-of-the-art baselines.

中文标题/摘要

标题：InTAct：基于区间的任务激活整合在持续学习中的应用

持续学习旨在使神经网络能够获取新知识而不忘记之前学习的信息。尽管最近的基于提示的方法在类别增量设置中表现出色，但在输入分布变化但标签空间保持不变的领域转换情况下，它们仍然容易受到威胁。这暴露了一个持续存在的问题，即表示漂移。共享表示以会覆盖之前有用特征的方式演变，即使提示隔离了特定任务的参数，也会导致遗忘。为了解决这一问题，我们引入了InTAct，一种在不冻结参数或存储过去数据的情况下保持共享层功能行为的方法。InTAct捕获与之前学习的任务相关的特征激活范围，并通过确保网络在这些区域内部保持一致，同时在其他地方仍允许灵活适应来限制更新。因此，InTAct稳定了重要神经元的功能作用，而不是直接限制参数值。该方法具有架构无关性，并能无缝集成到现有的基于提示的持续学习框架中。通过调节编码过去知识的表示变化，InTAct实现了稳定性和可塑性之间的原则性平衡。在包括DomainNet和ImageNet-R在内的多种领域增量基准测试中，InTAct一致地减少了表示漂移并提高了性能，相对于最先进的基线提高了平均准确率多达8个百分点。

Summary / 总结

InTAct is a method for continual learning that addresses representation drift by preserving the activation ranges of previously learned tasks without freezing parameters or storing past data. It ensures the network remains consistent within these regions while allowing flexible adaptation elsewhere. InTAct improves performance across various domain-incremental benchmarks, increasing Average Accuracy by up to 8 percentage points over state-of-the-art baselines.

论文针对持续学习中出现的表示漂移问题，即神经网络在适应新任务时可能会忘记之前学习的信息。它提出了InTAct方法，该方法捕获之前学习任务的激活范围，并限制更新以在这些区域内保持一致性。InTAct不受架构限制，并且可以无缝集成到现有的基于提示的持续学习框架中，在DomainNet和ImageNet-R基准测试中，InTAct能够将平均准确率提高多达8个百分点，超过最先进的基线方法。
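
One way to picture "constraining updates within activation ranges" is an interval-arithmetic bound on how much a linear layer's output can change over the box of inputs seen on earlier tasks. The sketch below is an illustration under that assumption only: the abstract does not give InTAct's actual loss, and `activation_interval` and `worst_case_drift` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def activation_interval(samples: Sequence[Vector]) -> Tuple[List[float], List[float]]:
    """Per-dimension (low, high) range over recorded activation vectors."""
    dims = range(len(samples[0]))
    lows = [min(s[j] for s in samples) for j in dims]
    highs = [max(s[j] for s in samples) for j in dims]
    return lows, highs


def worst_case_drift(
    old_w: Matrix, old_b: Vector, new_w: Matrix, new_b: Vector,
    lows: Vector, highs: Vector,
) -> List[float]:
    """Upper bound on |new(x) - old(x)| per output unit for any x in the box.

    For x = center + e with |e_j| <= radius_j, the change is
    dW @ center + db + dW @ e, so its magnitude is at most
    |dW @ center + db| + |dW| @ radius.
    """
    bounds = []
    for i in range(len(old_w)):
        dw = [n - o for n, o in zip(new_w[i], old_w[i])]
        db = new_b[i] - old_b[i]
        center_shift = db + sum(
            d * (lo + hi) / 2 for d, lo, hi in zip(dw, lows, highs)
        )
        spread = sum(abs(d) * (hi - lo) / 2 for d, lo, hi in zip(dw, lows, highs))
        bounds.append(abs(center_shift) + spread)
    return bounds
```

A regularizer could penalize the sum of these bounds while training on a new task, which keeps the layer's behavior stable inside the stored box and leaves it free outside; whether InTAct does exactly this is not stated in the abstract.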

Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems

Authors: Zengyu Zou, Jingyuan Wang, Yixuan Huang, Junjie Wu

First: 2025-11-21T17:32:10+00:00 · Latest: 2025-11-21T17:32:10+00:00

Comments: 15 pages

Abs · PDF · Code1 · Code2

Abstract

This paper addresses the cooperative Multi-Vehicle Dynamic Pickup and Delivery Problem with Stochastic Requests (MVDPDPSR) and proposes an end-to-end centralized decision-making framework based on sequence-to-sequence, named Multi-Agent Pointer Transformer (MAPT). MVDPDPSR is an extension of the vehicle routing problem and a spatio-temporal system optimization problem, widely applied in scenarios such as on-demand delivery. Classical operations research methods face bottlenecks in computational complexity and time efficiency when handling large-scale dynamic problems. Although existing reinforcement learning methods have achieved some progress, they still encounter several challenges: 1) Independent decoding across multiple vehicles fails to model joint action distributions; 2) The feature extraction network struggles to capture inter-entity relationships; 3) The joint action space is exponentially large. To address these issues, we designed the MAPT framework, which employs a Transformer Encoder to extract entity representations, combines a Transformer Decoder with a Pointer Network to generate joint action sequences in an AutoRegressive manner, and introduces a Relation-Aware Attention module to capture inter-entity relationships. Additionally, we guide the model's decision-making using informative priors to facilitate effective exploration. Experiments on 8 datasets demonstrate that MAPT significantly outperforms existing baseline methods in terms of performance and exhibits substantial computational time advantages compared to classical operations research methods.

中文标题/摘要

标题：多智能体指针变换器：基于序列到序列的多车辆动态取送问题强化学习

本文针对合作的多车辆动态取送问题带有随机请求（MVDPDPSR）进行了研究，并提出了一种基于序列到序列的端到端集中决策框架，称为多智能体指针变换器（MAPT）。MVDPDPSR 是车辆路径问题的扩展，是一个时空系统优化问题，广泛应用于按需配送等场景。经典的运筹学方法在处理大规模动态问题时面临计算复杂性和时间效率的瓶颈。尽管现有的强化学习方法取得了一些进展，但仍面临几个挑战：1）多车辆之间的独立解码无法建模联合动作分布；2）特征提取网络难以捕捉实体间的关系；3）联合动作空间是指数级的。为了解决这些问题，我们设计了MAPT框架，该框架采用Transformer编码器提取实体表示，结合Transformer解码器与指针网络以自回归方式生成联合动作序列，并引入关系感知注意力模块以捕捉实体间的关系。此外，我们通过信息先验引导模型的决策，以促进有效的探索。在8个数据集上的实验表明，MAPT在性能上显著优于现有基线方法，并且在计算时间上具有显著优势，优于经典运筹学方法。

Summary / 总结

This paper tackles the MVDPDPSR problem by proposing the Multi-Agent Pointer Transformer (MAPT), which addresses the limitations of classical methods and existing reinforcement learning approaches. MAPT uses a Transformer Encoder to extract entity representations, a Transformer Decoder with a Pointer Network to generate joint action sequences, and a Relation-Aware Attention module to capture inter-entity relationships. Experiments show that MAPT outperforms existing methods and is more computationally efficient than classical operations research methods.

本文提出了Multi-Agent Pointer Transformer (MAPT)框架来解决MVDPDPSR问题，即车辆路线问题的扩展。MAPT使用Transformer编码器提取实体表示，并结合Transformer解码器和指针网络以自回归方式生成联合动作序列。此外，它还包括一个关系感知注意力模块来捕捉实体间的关系。实验表明，MAPT在性能上优于现有方法，并且比经典运筹学方法具有更高的计算效率。
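
The autoregressive pointer-style decoding described above can be sketched in a few lines. This is a toy illustration, not the MAPT implementation: plain dot products stand in for the learned Transformer encoder and decoder, decoding is greedy, and the rule that each entity can be picked only once is an assumption made to show how earlier choices condition later ones.

```python
from math import exp
from typing import List, Sequence


def masked_softmax(scores: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Softmax over entries whose mask is True; masked entries get probability 0."""
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("no selectable entity")
    top = max(valid)  # subtract the max for numerical stability
    weights = [exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return [w / total for w in weights]


def decode_joint_action(
    vehicle_queries: Sequence[Sequence[float]],
    entity_keys: Sequence[Sequence[float]],
) -> List[int]:
    """Pick one entity index per vehicle, in order, without repeats."""
    available = [True] * len(entity_keys)
    chosen: List[int] = []
    for query in vehicle_queries:
        # Pointer scores: similarity between this vehicle and every entity.
        scores = [sum(q * k for q, k in zip(query, key)) for key in entity_keys]
        probs = masked_softmax(scores, available)
        pick = max(range(len(probs)), key=probs.__getitem__)
        chosen.append(pick)
        available[pick] = False  # later vehicles see the earlier decision
    return chosen
```

Because each vehicle's choice is made after masking what earlier vehicles selected, the sequence of picks forms one joint action rather than a set of independent ones, which is the contrast with independent per-vehicle decoding drawn in the abstract.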

SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Authors: Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles

First: 2025-11-21T17:30:18+00:00 · Latest: 2025-11-21T17:30:18+00:00

Comments: 23 pages, 6 tables, 9 figures

Abs · PDF · Code1 · Code2

Abstract

Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.

中文标题/摘要

标题：SMILE：一种复合词项语义度量方法用于问答评估

传统的文本和视觉问答评估指标，如ROUGE、METEOR和精确匹配（EM），主要关注基于n-gram的词项相似性，往往忽略了准确评估所需的深层次语义理解。虽然BERTScore和MoverScore等措施利用上下文嵌入来解决这一局限性，但它们在平衡句子级和关键词级语义方面缺乏灵活性，并且忽略了词项相似性，而后者仍然很重要。基于大型语言模型（LLM）的评估器虽然强大，但也存在高成本、偏见、不一致性和幻觉等问题。为了解决这些问题，我们引入了SMILE：语义度量集成词项精确性，这是一种结合句子级语义理解和关键词级语义理解和简单关键词匹配的新方法。这种方法平衡了词项精确性和语义相关性，提供了全面的评估。广泛的基准测试显示，SMILE与人类判断高度相关，并且计算量轻，填补了词项和语义评估之间的差距。

Summary / 总结

SMILE is a novel composite metric for evaluating question-answering systems that integrates sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. It aims to balance lexical precision and semantic relevance, addressing the limitations of traditional metrics like ROUGE and METEOR. Experimental results show that SMILE is highly correlated with human judgments and is computationally efficient, making it a promising tool for evaluating question-answering systems across various tasks.

SMILE 是一种新颖的复合评价指标，结合了句级语义理解和关键词级语义理解和简单的关键词匹配。它旨在平衡词精确度和语义相关性，解决了 ROUGE 和 METEOR 等传统指标的局限性。实验结果表明，SMILE 与人类判断高度相关且计算效率高，是跨多种任务评估问答系统的一个有前景的工具。

TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints

Authors: Dongbo Shi, Shen Cao, Lubin Fan, Bojian Wu, Jinhui Guo, Ligang Liu, Renjie Chen

Venue: AAAI 2026

First: 2025-02-27T06:16:04+00:00 · Latest: 2025-11-21T17:27:18+00:00

Abs · PDF · Code1 · Code2

Abstract

We present TrackGS, a novel method to integrate global feature tracks with 3D Gaussian Splatting (3DGS) for COLMAP-free novel view synthesis. While 3DGS delivers impressive rendering quality, its reliance on accurate precomputed camera parameters remains a significant limitation. Existing COLMAP-free approaches depend on local constraints that fail in complex scenarios. Our key innovation lies in leveraging feature tracks to establish global geometric constraints, enabling simultaneous optimization of camera parameters and 3D Gaussians. Specifically, we: (1) introduce track-constrained Gaussians that serve as geometric anchors, (2) propose novel 2D and 3D track losses to enforce multi-view consistency, and (3) derive differentiable formulations for camera intrinsics optimization. Extensive experiments on challenging real-world and synthetic datasets demonstrate state-of-the-art performance, with much lower pose error than previous methods while maintaining superior rendering quality. Our approach eliminates the need for COLMAP preprocessing, making 3DGS more accessible for practical applications.

中文标题/摘要

标题：TrackGS：利用全局轨迹约束优化COLMAP-free 3D高斯泼溅

我们提出了TrackGS，这是一种将全局特征轨迹与3D高斯泼溅（3DGS）结合以实现COLMAP-free新颖视图合成的新方法。尽管3DGS提供了出色的渲染质量，但其依赖于准确的预计算相机参数仍然是一个显著的限制。现有的COLMAP-free方法依赖于局部约束，在复杂场景中会失效。我们的关键创新在于利用特征轨迹来建立全局几何约束，从而同时优化相机参数和3D高斯点。具体来说，我们：（1）引入了受轨迹约束的高斯点作为几何锚点，（2）提出了新的2D和3D轨迹损失以确保多视图一致性，（3）推导了相机内参优化的可微分公式。在具有挑战性的现实世界和合成数据集上的大量实验表明，我们的方法达到了最先进的性能，在保持卓越渲染质量的同时，姿态误差远低于以前的方法。我们的方法消除了对COLMAP预处理的需求，使3DGS更适用于实际应用。

Summary / 总结

TrackGS integrates global feature tracks with 3D Gaussian Splatting to optimize camera parameters and 3D Gaussians without relying on COLMAP. It introduces track-constrained Gaussians and novel track losses to enforce multi-view consistency, enabling simultaneous optimization. Experiments show superior performance with lower pose error and high rendering quality compared to previous methods, making 3DGS more accessible for practical use.

TrackGS 将全局特征轨迹与 3D 高斯散点结合，无需依赖 COLMAP 即可同时优化相机参数和 3D 高斯点。它引入了轨迹约束的高斯点作为几何锚点，并提出了新的二维和三维轨迹损失以确保多视图一致性。实验表明，该方法在保持卓越渲染质量的同时，具有更低的姿态误差，并且比之前的方法更易于实际应用。

Towards fully differentiable neural ocean model with Veros

Authors: Etienne Meunier, Said Ouala, Hugo Frezat, Julien Le Sommer, Ronan Fablet

First: 2025-11-21T17:24:00+00:00 · Latest: 2025-11-21T17:24:00+00:00

Comments: Accepted to Differentiable Systems and Scientific Machine Learning (workshop, EurIPS 2025)

Abs · PDF · Code1 · Code2

Abstract

We present a differentiable extension of the VEROS ocean model, enabling automatic differentiation through its dynamical core. We describe the key modifications required to make the model fully compatible with the JAX automatic differentiation framework and evaluate the numerical consistency of the resulting implementation. Two illustrative applications are then demonstrated: (i) the correction of an initial ocean state through gradient-based optimization, and (ii) the calibration of unknown physical parameters directly from model observations. These examples highlight how differentiable programming can facilitate end-to-end learning and parameter tuning in ocean modeling. Our implementation is available online.

中文标题/摘要

标题：基于Veros的全可微海洋模型

我们提出了VEROS海洋模型的可微分扩展，使其动力学核心能够通过自动微分。我们描述了使模型完全兼容JAX自动微分框架所需的关键修改，并评估了由此产生的实现的数值一致性。然后展示了两个示例应用：(i) 通过基于梯度的优化校正初始海洋状态，(ii) 直接从模型观测校准未知物理参数。这些示例突显了可微分编程如何促进海洋建模中的端到端学习和参数调整。我们的实现已在线发布。

Summary / 总结

The research aims to enhance the VEROS ocean model by integrating automatic differentiation, allowing for gradient-based optimization and parameter calibration. The method involves modifying the model to be compatible with JAX's autodifferentiation framework. Key findings include the numerical consistency of the modified model and its successful application in correcting initial ocean states and calibrating unknown parameters through model observations, demonstrating the potential of differentiable programming in ocean modeling.

研究旨在通过集成自动微分来增强VEROS海洋模型，以便进行基于梯度的优化和参数调整。方法是将模型修改为与JAX的自动微分框架兼容，确保数值一致性。关键发现包括成功纠正初始海洋状态和直接从模型观测中校准物理参数，展示了可微分编程在海洋建模中的潜力。

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Authors: Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

First: 2025-11-17T17:58:18+00:00 · Latest: 2025-11-21T17:22:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.

中文标题/摘要

标题：Live-SWE-agent：软件工程代理能否在运行时自我演化？

大型语言模型（LLMs）正在重塑几乎所有行业，包括软件工程。近年来，提出了一些LLM代理来解决实际的软件问题。这些软件代理通常配备了一套编程工具，并能自主决定下一步行动，以形成完整的解决端到端软件任务的轨迹。虽然前景广阔，但它们通常需要专门设计，且可能仍不理想，因为彻底探索整个代理架构设计空间极其困难且成本高昂。认识到软件代理本质上是软件，可以进一步细化/修改，研究人员最近提出了几种自我改进的软件代理，包括达尔文-哥德尔机（DGM）。同时，这些自我改进的代理需要在特定基准上进行昂贵的离线训练，可能在不同LLM或基准之间泛化能力不强。在本文中，我们提出了Live-SWE-agent，这是第一个在解决实际软件问题时能够自主且连续在运行时自我演化的软件代理。具体而言，Live-SWE-agent 从最基本的代理架构开始，仅具有bash工具的访问权限（例如，mini-SWE-agent），并在解决实际软件问题时自主演化其自身的架构实现。在广泛研究的SWE-bench Verified基准上的评估显示，Live-SWE-agent 在无需测试时缩放的情况下，实现了令人印象深刻的77.4%的解决率，超越了所有现有软件代理，包括最佳的专有解决方案。此外，Live-SWE-agent 在最近的SWE-Bench Pro基准上也超越了最先进的手工构建软件代理，实现了最佳已知的45.8%的解决率。

Summary / 总结

Live-SWE-agent is a software agent that can autonomously and continuously evolve itself during runtime to solve real-world software problems. Starting with a basic agent scaffold, it evolves its own implementation while solving tasks. The agent outperforms existing software agents on the SWE-bench Verified benchmark with a solve rate of 77.4%, and achieves the best-known solve rate of 45.8% on the SWE-Bench Pro benchmark, surpassing state-of-the-art manually crafted agents.

Live-SWE-agent 是一种可以在运行时自主改进以解决实际软件工程问题的自进化软件代理。它从一个基本的代理框架开始，边解决任务边进化其实现。在 SWE-bench Verified 基准上，Live-SWE-agent 达到了 77.4% 的解决率，无需测试时扩展，超越了现有代理。此外，在 SWE-Bench Pro 基准上，它也超越了最先进的手工构建代理，实现了 45.8% 的最高解决率。

Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers

Authors: Christopher Boland, Sotirios Tsaftaris, Sonia Dahdouh

Venue: Machine Learning for Biomedical Imaging 3 (2025)

First: 2025-11-21T17:18:35+00:00 · Latest: 2025-11-21T17:18:35+00:00

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020

Abs · PDF · Code1 · Code2

Abstract

Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

中文标题/摘要

标题：通过专家教师中间层知识蒸馏防止医学图像分析中的捷径学习

深度学习模型容易通过训练数据中虚假相关但无关的特征来学习捷径解决方案。在如医学图像分析等高风险应用中，这种现象可能导致模型在预测时无法使用临床有意义的特征，从而导致较差的鲁棒性和对患者的伤害。我们证明了不同类型的捷径（弥漫在整个图像中的以及局限于特定区域的）在不同网络层中表现不同，因此可以通过针对中间层的缓解策略更有效地加以应对。我们提出了一种新颖的知识蒸馏框架，利用在少量任务相关数据上微调的教师网络来缓解在大量带有偏差特征的数据集上训练的学生网络中的捷径学习。通过在CheXpert、ISIC 2017和SimBA数据集上使用各种架构（ResNet-18、AlexNet、DenseNet-121和3D CNNs）进行广泛的实验，我们展示了与传统的经验风险最小化、基于增强的偏差缓解和基于群体的偏差缓解方法相比的一致改进。在许多情况下，即使在分布外测试数据上，我们也能达到与在无偏差数据上训练的基线模型相当的性能。我们的结果表明，我们的方法在标注偏差有限且捷径特征难以先验识别的实际医学成像场景中具有实际应用价值。

Summary / 总结

The paper addresses the issue of shortcut learning in deep learning models for medical image analysis, which can lead to poor robustness. It proposes a knowledge distillation framework that uses a teacher network fine-tuned on relevant data to mitigate shortcut learning in a student network trained on a large, biased dataset. Experiments on CheXpert, ISIC 2017, and SimBA datasets show consistent improvements over traditional methods and achieve comparable performance to models trained on bias-free data, even on out-of-distribution test data.

论文针对医疗图像分析中深度学习模型容易学习捷径解决方案的问题，可能导致模型预测时使用临床意义不大的特征，从而影响鲁棒性并对患者造成伤害。提出了一种知识蒸馏框架，利用一个在相关数据上微调的教师网络来缓解学生网络在大型带偏见数据集上训练时的捷径学习问题。在CheXpert、ISIC 2017和SimBA数据集上的实验表明，该方法在多种架构下比传统方法表现出一致的改进，并且在某些情况下，其性能与在无偏见数据上训练的基线模型相当，甚至在未见过的数据上也表现良好。

DS-Span: Single-Phase Discriminative Subgraph Mining for Efficient Graph Embeddings

Authors: Yeamin Kaiser, Muhammed Tasnim Bin Anwar, Bholanath Das, Chowdhury Farhan Ahmed, Md. Tanvir Alam

First: 2025-11-21T17:17:51+00:00 · Latest: 2025-11-21T17:17:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph representation learning seeks to transform complex, high-dimensional graph structures into compact vector spaces that preserve both topology and semantics. Among the various strategies, subgraph-based methods provide an interpretable bridge between symbolic pattern discovery and continuous embedding learning. Yet, existing frequent or discriminative subgraph mining approaches often suffer from redundant multi-phase pipelines, high computational cost, and weak coupling between mined structures and their discriminative relevance. We propose DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring within one traversal of the search space. DS-Span introduces a coverage-capped eligibility mechanism that dynamically limits exploration once a graph is sufficiently represented, and an information-gain-guided selection that promotes subgraphs with strong class-separating ability while minimizing redundancy. The resulting subgraph set serves as an efficient, interpretable basis for downstream graph embedding and classification. Extensive experiments across benchmarks demonstrate that DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime. These results highlight the potential of unified, single-phase discriminative mining as a foundation for scalable and interpretable graph representation learning.

中文标题/摘要

标题：DS-Span：高效图嵌入的单阶段区分性子图挖掘

图表示学习旨在将复杂的高维图结构转换为紧凑的向量空间，同时保留拓扑结构和语义。在各种策略中，基于子图的方法为符号模式发现与连续嵌入学习之间提供了可解释的桥梁。然而，现有的频繁或区分性子图挖掘方法往往遭受冗余的多阶段管道、高计算成本以及挖掘结构与其区分相关性之间的弱耦合问题。我们提出了一种名为DS-Span的单阶段区分性子图挖掘框架，该框架在搜索空间的一次遍历中统一了模式增长、剪枝和监督驱动评分。DS-Span引入了一种覆盖率限制的资格机制，一旦图被充分表示，就动态限制探索；以及一种基于信息增益的选择机制，促进具有强类别分离能力的子图，同时最小化冗余。由此产生的子图集作为下游图嵌入和分类的高效、可解释的基础。广泛的基准实验表明，DS-Span生成的子图特征比之前的多阶段方法更紧凑且更具区分性，且具有显著减少的运行时间，从而实现了更高的或可比的准确性。这些结果突显了统一的单阶段区分性挖掘作为可扩展和可解释图表示学习基础的潜力。

Summary / 总结

The research aims to improve graph representation learning by proposing DS-Span, a single-phase discriminative subgraph mining framework that unifies pattern growth, pruning, and supervision-driven scoring within one traversal of the search space. DS-Span uses a coverage-capped eligibility mechanism to limit exploration once a graph is sufficiently represented, and an information-gain-guided selection to promote discriminative, non-redundant subgraphs. Experiments show that DS-Span generates more compact and discriminative subgraph features than prior multi-stage methods, achieving higher or comparable accuracy with significantly reduced runtime.

研究旨在通过提出单阶段判别子图挖掘框架DS-Span来改进图表示学习。DS-Span在一次遍历中统一了模式增长、剪枝和监督驱动评分，并使用覆盖率限制机制和信息增益引导选择来确保高效且判别性强的子图提取。实验结果显示，DS-Span生成的子图特征更紧凑且更具判别性，相比多阶段方法，具有更高的准确率和更短的运行时间。
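
As a rough illustration of the information-gain scoring mentioned in the DS-Span abstract, the sketch below scores one candidate subgraph pattern by how much knowing "this graph contains the pattern" reduces class-label entropy. It is the textbook information-gain formula only, not the authors' implementation; the function names and the toy data are assumptions made for this example, and the coverage cap and redundancy handling are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(contains_pattern, labels):
    """Entropy reduction from splitting graphs by presence of one subgraph pattern.

    contains_pattern[i] is True when graph i contains the candidate pattern
    (hypothetical input; a real miner would obtain it from subgraph matching).
    """
    present = [y for has, y in zip(contains_pattern, labels) if has]
    absent = [y for has, y in zip(contains_pattern, labels) if not has]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy data: six graphs, two classes.
labels = [0, 0, 0, 1, 1, 1]
separating = [True, True, True, False, False, False]   # occurs only in class 0
uninformative = [True, True, False, True, True, False]  # occurs equally in both classes

gain_separating = information_gain(separating, labels)
gain_uninformative = information_gain(uninformative, labels)
```

A pattern that occurs only in one class receives the maximum gain of one bit here, while a pattern spread evenly across classes receives zero, which is the ordering a discriminative selection step would rely on.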

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

First: 2025-11-21T17:09:43+00:00 · Latest: 2025-11-21T17:09:43+00:00

Abstract

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as $π_0$-FAST and $π_{0.5}$, while it uses 20× fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.

中文标题/摘要

标题：SPEAR-1：通过三维理解超越机器人演示的扩展

机器人基础模型（RFMs）作为通用的端到端系统，在机器人控制方面具有巨大潜力。然而，它们在新环境、任务和实体方面的泛化能力仍然有限。我们认为，主要瓶颈在于它们的基础：大多数RFMs都是通过微调互联网预训练的视觉-语言模型（VLMs）构建的。然而，这些VLMs是在2D图像-语言任务上进行训练的，缺乏在三维世界中进行实体控制所需的三维空间推理能力。直接通过大规模的机器人数据来弥合这一差距成本高昂且难以扩展。相反，我们提出为易于收集的非机器人图像数据补充三维标注，并增强预训练的VLM以具备三维理解能力。遵循这一策略，我们训练了SPEAR-VLM，这是一种能够从单张2D图像中推断出三维空间中物体坐标的三维感知视觉语言模型。基于SPEAR-VLM，我们提出了主要贡献SPEAR-1：一种将落地（grounded）的三维感知与语言指令驱动的实体控制相结合的机器人基础模型。SPEAR-1在来自24个Open X-Embodiment数据集的约4500万帧数据上进行训练，其性能优于或匹配π_0-FAST和π_{0.5}等最先进的模型，而所用的机器人演示数据仅为其二十分之一。这种精心设计的训练策略解锁了新的VLM能力，从而使实体控制的可靠性超越了仅使用机器人数据所能达到的水平。我们公开了我们的模型权重和三维标注的数据集。

Summary / 总结

The research aims to enhance the generalization capabilities of Robotic Foundation Models (RFMs) by addressing their limitations in 3D spatial reasoning. The method involves enriching non-robotic image data with 3D annotations and training a 3D-aware Vision-Language Model (SPEAR-VLM) that infers object coordinates in 3D space from a single 2D image. The main experimental finding is that the resulting SPEAR-1 model, which integrates grounded 3D perception with language-instructed control, outperforms or matches state-of-the-art models such as $π_0$-FAST and $π_{0.5}$ while using 20× fewer robot demonstrations.

研究旨在通过增强Robotic Foundation Models（RFMs）的3D空间推理能力来提升其泛化能力。方法是丰富非机器人图像数据的3D注释，并训练一个能够从2D图像中推断物体坐标的3D感知视觉语言模型（SPEAR-VLM）。最终的SPEAR-1模型结合了基于3D感知的语言指导控制，其性能优于或与最先进的模型相当，同时只需要较少的机器人演示数据。这种方法显著提升了在新环境和任务中的实体控制可靠性。

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Authors: Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu

First: 2025-11-14T22:41:27+00:00 · Latest: 2025-11-21T17:08:33+00:00

Abstract

Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-$n$ tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models, and QTSplus outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.

中文标题/摘要

标题：见林又见木：面向长视频多模态语言模型的查询感知分词器

尽管多模态大型语言模型（MLLMs）在视频理解能力方面取得了近期进展，但长视频理解仍是一个挑战。主要问题在于视觉标记的数量随着视频长度线性增长，导致注意力成本、内存和延迟爆炸性增长。为了解决这一挑战，我们提出了查询感知标记选择器（QTSplus），这是一种轻量级但强大的视觉标记选择模块，作为视觉编码器和LLMs之间的信息闸门。给定文本查询和视频标记，QTSplus通过（i）利用交叉注意力评分视觉标记，（ii）根据查询的复杂性预测实例特定的保留预算，以及（iii）在训练期间使用可微直通估计器选择Top-$n$标记，在推理期间使用硬门选择，动态选择输入文本查询最重要的视觉证据。此外，一个小的重编码器使用绝对时间信息保持时间顺序，使秒级定位成为可能，同时保持全局覆盖。将QTSplus集成到Qwen2.5-VL中，在长视频上压缩视觉流最多可达89%，并减少端到端延迟28%。在八个长视频理解基准上的评估显示，与原始Qwen模型相比，总体准确率接近一致，并在TempCompass方向和顺序准确率上分别高出+20.5和+5.6分。这些结果表明，QTSplus是一种有效的、通用的机制，可以将MLLMs扩展到现实世界的长视频场景，同时保留任务相关的证据。

Summary / 总结

The research addresses the challenge of long-video understanding by proposing Query-aware Token Selector (QTSplus), a module that dynamically selects important visual evidence for text queries. It scores visual tokens via cross-attention, predicts an instance-specific retention budget, and selects Top-$n$ tokens using a differentiable straight-through estimator during training and a hard gate at inference. QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos, while keeping accuracy near parity with the original Qwen2.5-VL models across eight long video understanding benchmarks and improving TempCompass direction and order accuracies by +20.5 and +5.6 points.

研究提出了Query-aware Token Selector (QTSplus)，该方法能够动态选择与给定文本查询最相关的视觉令牌。QTSplus 使用交叉注意力来评分视觉令牌，预测实例特定的保留预算，并在训练期间选择 Top-$n$ 令牌，在推理时使用硬门。这种方法在长视频上将视觉流压缩高达 89%，并减少端到端延迟 28%。实验结果表明，QTSplus 在长视频理解基准测试中保持或优于原始模型的准确性。
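
To make the three selection steps in the QTSplus abstract more concrete, here is a minimal inference-time sketch: it scores visual tokens against a query with a scaled dot product and softmax, then applies a hard Top-n gate that keeps the survivors in temporal order. This is a simplified stand-in, not the QTSplus module. It assumes a single query vector and a single attention head, it takes the retention budget as a fixed input (`budget_fraction` is a hypothetical parameter; the paper predicts the budget per instance), and it omits the straight-through estimator used during training.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def score_tokens(query, tokens):
    """Relevance of each visual token to the query: softmax over scaled dot products."""
    scale = math.sqrt(len(query))
    logits = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    return softmax(logits)


def hard_gate(scores, budget_fraction):
    """Keep the n highest-scoring token indices, returned in original (temporal) order."""
    n = max(1, math.ceil(budget_fraction * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sorted(top)


# Toy example: one 2-d query vector and four 2-d visual tokens in temporal order.
query = [1.0, 0.0]
tokens = [[0.1, 0.9], [2.0, 0.0], [0.0, 1.0], [1.5, 0.2]]

scores = score_tokens(query, tokens)
kept = hard_gate(scores, budget_fraction=0.5)
print(kept)  # [1, 3]
```

Sorting the kept indices before returning them is what preserves temporal order for any downstream re-encoding step.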

Beyond Multiple Choice: A Hybrid Framework for Unifying Robust Evaluation and Verifiable Reasoning Training

Authors: Yesheng Liu, Hao Li, Haiyu Xu, Baoqi Pei, Jiahao Wang, Mingxuan Zhao, Jingshu Zheng, Zheqi He, JG Yao, Bowen Qin, Xi Yang, Jiajun Zhang

First: 2025-11-21T17:06:37+00:00 · Latest: 2025-11-21T17:06:37+00:00

Comments: Project url: https://flageval-baai.github.io/ReVeL/

Abstract

Multiple-choice question answering (MCQA) has been a popular format for both evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes the accuracy metrics unreliable for indicating real capabilities and encourages explicit or implicit answer guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions according to their answer types and applies a different rewriting and verification scheme to each type. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.

中文标题/摘要

标题：超越选择题：统一稳健评估与可验证推理训练的混合框架

多项选择题回答（MCQA）一直是评估和强化微调（RFT）现代多模态语言模型的一种流行格式。其受限的输出格式允许简化和确定性的自动验证。然而，我们发现选项可能泄露可利用的信号，这使得准确度指标无法可靠地反映实际能力，并鼓励在RFT期间进行显式或隐式的答案猜测行为。我们提出了ReVeL（由LLM重写和验证），这是一种框架，将多项选择题问题重写为开放式问题，同时尽可能保持答案的可验证性。该框架根据不同的答案类型对问题进行分类，并应用不同的重写和验证方案。在应用于RFT时，我们转换了20,000个MCQA示例，并使用GRPO对Qwen2.5-VL模型进行微调。在ReVeL-OpenQA上训练的模型在多项选择基准测试中的准确度与MCQA相当，并且在开放式问题准确度上提高了约六个百分点，表明比基于MCQA的训练具有更好的数据效率和更稳健的奖励信号。在用于评估时，ReVeL还揭示了MCQA基准测试中高达20个百分点的分数膨胀（相对于开放式问题），提高了评判准确性，并减少了成本和延迟。我们将公开发布代码和数据。

Summary / 总结

The paper addresses the limitations of multiple-choice question answering (MCQA) in evaluating and training multimodal language models: the answer options can leak exploitable signals, which makes accuracy metrics unreliable and encourages guessing behaviors during reinforcement fine-tuning. It introduces ReVeL, a hybrid framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. When applied to training, models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals. For evaluation, ReVeL reveals up to 20 percentage points of score inflation in MCQA benchmarks relative to OpenQA, improves judging accuracy, and lowers cost and latency.

论文针对多选题问答（MCQA）在评估和训练多模态语言模型时存在的局限性：选项可能泄露可利用的信号，导致准确度指标不可靠，并在强化微调中助长猜测行为。它提出了ReVeL框架，将多选题重写为开放式问题，同时尽可能保持答案可验证。应用于训练时，在ReVeL-OpenQA上训练的模型在多选题基准测试中的准确度与MCQA训练相当，并将OpenQA准确度提高了约六个百分点，表明更好的数据效率和更稳健的奖励信号。在评估时，ReVeL揭示了MCQA基准测试中相对于OpenQA高达20个百分点的分数膨胀，提高了评判准确性，并降低了成本和延迟。

Minimax Statistical Estimation under Wasserstein Contamination

Authors: Patrick Chao, Edgar Dobriban

First: 2023-08-03T16:19:40+00:00 · Latest: 2025-11-21T17:03:29+00:00

Comments: A revision, including a changed title. This version extends the results to more general perturbations and loss functions, while also obtaining a new optimal rate for density estimation. Some of the techniques described in the original submission (ambiguity set minimax lower bounds, Bayes lower bounds) are not required anymore and have thus been removed

Abstract

Contaminations are a key concern in modern statistical learning, as small but systematic perturbations of all datapoints can substantially alter estimation results. Here, we study Wasserstein-$r$ contaminations ($r\ge 1$) in an $\ell_q$ norm ($q\in [1,\infty]$), in which each observation may undergo an adversarial perturbation with bounded cost, complementing the classical Huber model, corresponding to total variation norm, where only a fraction of observations is arbitrarily corrupted. We study both independent and joint (coordinated) contaminations and develop a minimax theory under $\ell_q^r$ losses. Our analysis encompasses several fundamental problems: location estimation, linear regression, and pointwise nonparametric density estimation. For joint contaminations in location estimation and for prediction in linear regression, we obtain the exact minimax risk, identify least favorable contaminations, and show that the sample mean and least squares predictor are respectively minimax optimal. For location estimation under independent contaminations, we give sharp upper and lower bounds, including exact minimaxity in the Euclidean Wasserstein contamination case, when $q=r=2$. For pointwise density estimation in any dimension, we derive the optimal rate, showing that it is achieved by kernel density estimation with a bandwidth that is possibly larger than the classical one. Our proofs leverage powerful tools from optimal transport developed over the last 20 years, including the dynamic Benamou-Brenier formulation. Taken together, our results suggest that in contrast to the Huber contamination model, for norm-based Wasserstein contaminations, classical estimators may be nearly optimally robust.

Summary / 总结

This paper studies statistical estimation under Wasserstein-$r$ contaminations, which complement the classical Huber model by letting every observation undergo an adversarial perturbation of bounded cost rather than arbitrarily corrupting only a fraction of observations. The study covers location estimation, linear regression, and pointwise nonparametric density estimation under both independent and joint contaminations. Key findings include the exact minimax risk for joint contaminations in location estimation and for prediction in linear regression, with the sample mean and least squares predictor respectively being minimax optimal. For independent contaminations in location estimation, sharp upper and lower bounds are derived, including exact minimaxity in the Euclidean Wasserstein contamination case ($q=r=2$). The paper also provides the optimal rate for pointwise density estimation in any dimension, achieved by kernel density estimation with a potentially larger bandwidth than the classical one. The analysis relies on optimal transport tools, suggesting that classical estimators can be nearly optimally robust against norm-based Wasserstein contaminations.

该研究探讨了Wasserstein-$r$污染下的统计估计，对经典的Huber模型形成补充：每个观测都可能受到代价有界的对抗扰动，而不是只有一部分观测被任意破坏。研究涵盖了独立和联合污染下的位置估计、线性回归和逐点非参数密度估计。主要发现包括联合污染下位置估计和线性回归预测的精确最小最大风险，样本均值和最小二乘预测器分别是最小最大最优的。对于独立污染下的位置估计，给出了紧的上界和下界，并在欧几里得Wasserstein污染情况（$q=r=2$）下得到精确的最小最大结果。研究还给出了任意维度下逐点密度估计的最优速率，通过带宽可能大于经典取值的核密度估计实现。分析依赖于最优传输工具，表明经典估计器对基于范数的Wasserstein污染可以近乎最优地稳健。

SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense

Authors: Patryk Krukowski, Łukasz Gorczyca, Piotr Helm, Kamil Książek, Przemysław Spurek

First: 2025-06-09T21:43:56+00:00 · Latest: 2025-11-21T16:58:45+00:00

Abstract

Continual learning under adversarial conditions remains an open problem, as existing methods often compromise either robustness, scalability, or both. We propose a novel framework that integrates Interval Bound Propagation (IBP) with a hypernetwork-based architecture to enable certifiably robust continual learning across sequential tasks. Our method, SHIELD, generates task-specific model parameters via a shared hypernetwork conditioned solely on compact task embeddings, eliminating the need for replay buffers or full model copies and keeping the method efficient over time. To further enhance robustness, we introduce Interval MixUp, a novel training strategy that blends virtual examples represented as $\ell_{\infty}$ balls centered around MixUp points. Leveraging interval arithmetic, this technique guarantees certified robustness while mitigating the wrapping effect, resulting in smoother decision boundaries. We evaluate SHIELD under strong white-box adversarial attacks, including PGD and AutoAttack, across multiple benchmarks. It consistently outperforms existing robust continual learning methods, achieving state-of-the-art average accuracy while maintaining both scalability and certification. These results represent a significant step toward practical and theoretically grounded continual learning in adversarial settings.

Summary / 总结

SHIELD is a framework that integrates Interval Bound Propagation with a hypernetwork-based architecture to enable certifiably robust continual learning. It uses a shared hypernetwork to generate task-specific model parameters and introduces Interval MixUp to enhance robustness. SHIELD outperforms existing methods in terms of average accuracy under strong white-box adversarial attacks, while maintaining scalability and certification.

SHIELD 是一个框架，将 Interval Bound Propagation (IBP) 与基于超网络的架构结合，以实现可验证鲁棒的持续学习。它使用共享超网络根据紧凑的任务嵌入生成任务特定的模型参数，避免了回放缓冲区或完整模型副本的需求。SHIELD 还引入了 Interval MixUp 来增强鲁棒性，通过混合虚拟示例来保证可验证的鲁棒性并减轻包裹效应。实验结果表明，SHIELD 在强白盒对抗攻击下优于现有方法，同时保持了可扩展性和验证性，是向对抗环境中实用且理论基础的持续学习迈出的重要一步。
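
The two ingredients named in the SHIELD abstract, an $\ell_{\infty}$ ball around a MixUp point and interval arithmetic, can be sketched in a few lines. The example below builds a MixUp point, wraps it in a box of radius `eps`, and pushes the box through one linear layer followed by ReLU using standard interval bound propagation. It is a generic IBP illustration under assumed toy weights, not the SHIELD training procedure; the hypernetwork, the loss, and the handling of the wrapping effect are not modeled, and all names are chosen for this example.

```python
def mixup(x_a, x_b, lam):
    """Convex combination of two inputs (the MixUp point)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def linear_bounds(W, b, lower, upper):
    """Propagate an axis-aligned box through y = W x + b with interval arithmetic."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                # A negative weight swaps which input endpoint gives the extreme value.
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each endpoint separately."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


# l_inf ball of radius eps centered at a MixUp point (toy values, assumed).
center = mixup([1.0, 0.0], [0.0, 1.0], lam=0.25)
eps = 0.1
lower = [c - eps for c in center]
upper = [c + eps for c in center]

# One hypothetical layer; in a real network this step repeats layer by layer.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [2.0, 0.0]
lo, hi = relu_bounds(*linear_bounds(W, b, lower, upper))
```

Every input inside the box is guaranteed to produce an output inside `[lo, hi]`, which is the property a certified training objective builds on.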

Value of Information-Enhanced Exploration in Bootstrapped DQN

Authors: Stergios Plataniotis, Charilaos Akasiadis, Georgios Chalkiadakis

First: 2025-11-04T20:22:58+00:00 · Latest: 2025-11-21T16:56:52+00:00

Abstract

Efficient exploration in deep reinforcement learning remains a fundamental challenge, especially in environments characterized by high-dimensional states and sparse rewards. Traditional exploration strategies that rely on random local policy noise, such as $ε$-greedy and Boltzmann exploration methods, often struggle to efficiently balance exploration and exploitation. In this paper, we integrate the notion of (expected) value of information (EVOI) within the well-known Bootstrapped DQN algorithmic framework, to enhance the algorithm's deep exploration ability. Specifically, we develop two novel algorithms that incorporate the expected gain from learning the value of information into Bootstrapped DQN. Our methods use value of information estimates to measure the discrepancies of opinions among distinct network heads, and drive exploration towards areas with the most potential. We evaluate our algorithms with respect to performance and their ability to exploit inherent uncertainty arising from random network initialization. Our experiments in complex, sparse-reward Atari games demonstrate increased performance, all the while making better use of uncertainty, and, importantly, without introducing extra hyperparameters.

中文标题/摘要

标题：自助DQN中基于信息价值增强的探索

在深度强化学习中高效探索仍然是一个基本挑战，尤其是在高维状态和稀疏奖励的环境中。传统的依赖于随机局部策略噪声的探索策略，如$ε$-贪婪和玻尔兹曼探索方法，往往难以在探索和利用之间进行有效平衡。在本文中，我们将在著名的自助DQN算法框架中整合（期望）信息价值（EVOI）的概念，以增强算法的深度探索能力。具体来说，我们开发了两种新的算法，将学习信息价值的预期收益纳入自助DQN。我们的方法使用信息价值估计来衡量不同网络头之间的意见分歧，并驱动探索向最有潜力的区域。我们根据性能和其利用随机网络初始化固有的不确定性能力来评估我们的算法。我们的实验在复杂的、稀疏奖励的Atari游戏中表明，性能得到了提高，同时更好地利用了不确定性，而且重要的是，没有引入额外的超参数。

Summary / 总结

This paper addresses the challenge of efficient exploration in deep reinforcement learning, particularly in high-dimensional state and sparse reward environments. It introduces two novel algorithms that integrate the concept of expected value of information (EVOI) into the Bootstrapped DQN framework to enhance exploration. The key experimental findings show improved performance in complex Atari games, better utilization of uncertainty, and no additional hyperparameters needed.

该论文针对高维状态和稀疏奖励环境下的深度强化学习中高效探索的挑战。它提出了两种新的算法，将信息期望价值（EVOI）的概念整合到Bootstrapped DQN框架中，以增强探索能力。实验结果表明，在复杂的Atari游戏中，这些算法能够更好地利用不确定性并提高性能，且无需增加额外的超参数。
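
One way to picture how disagreement among bootstrapped heads can steer exploration is the classical myopic value-of-perfect-information rule from Bayesian Q-learning, with each head's Q estimate treated as a sample. The sketch below is that generic construction, offered only as an assumption-laden illustration; it is not either of the two algorithms proposed in the paper, and the function names and toy numbers are invented for this example. For the action with the highest mean, the gain is how far a head's estimate falls below the second-best mean; for any other action, it is how far a head's estimate rises above the best mean.

```python
def mean(xs):
    return sum(xs) / len(xs)


def voi_per_action(head_q_values):
    """Return (mean Q, value-of-information estimate) for each action in one state.

    head_q_values[k][a] is head k's Q estimate for action a (assumed layout).
    Requires at least two actions.
    """
    n_actions = len(head_q_values[0])
    means = [mean([head[a] for head in head_q_values]) for a in range(n_actions)]
    order = sorted(range(n_actions), key=lambda a: means[a], reverse=True)
    best, second = order[0], order[1]
    voi = []
    for a in range(n_actions):
        if a == best:
            # Learning the best action is worse than the runner-up would change the policy.
            gains = [max(means[second] - head[a], 0.0) for head in head_q_values]
        else:
            # Learning another action beats the current best would change the policy.
            gains = [max(head[a] - means[best], 0.0) for head in head_q_values]
        voi.append(mean(gains))
    return means, voi


def select_action(head_q_values):
    """Pick the action maximizing mean Q plus its value-of-information bonus."""
    means, voi = voi_per_action(head_q_values)
    scores = [m + v for m, v in zip(means, voi)]
    return max(range(len(scores)), key=lambda a: scores[a])


# Three heads, two actions: heads agree on action 0 but disagree sharply on action 1.
heads = [[1.0, 0.0], [1.0, 2.4], [1.0, 0.0]]
means, voi = voi_per_action(heads)
greedy = max(range(len(means)), key=lambda a: means[a])
chosen = select_action(heads)
```

In this toy state the greedy choice on mean Q is action 0, while the bonus from head disagreement tips the selection to action 1. The rule adds no tunable coefficient, though whether that matches the paper's hyperparameter-free design is not something this sketch can confirm.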