arXiv Paper Digest

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

Authors: Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, Ziwei Liu

First: 2025-12-22T18:59:57+00:00 · Latest: 2025-12-22T18:59:57+00:00

Comments: Code link: https://github.com/WeichenFan/UAE

Abstract

Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.

中文标题/摘要

标题：棱镜假设：通过统一自编码器协调语义和像素表示

不同模态的深层表示本质上是交织的。在本文中，我们系统地分析了各种语义和像素编码器的频谱特性。有趣的是，我们的研究揭示了一种令人鼓舞且鲜少探索的对应关系：编码器的特征频谱与其功能作用之间存在联系：语义编码器主要捕捉低频成分，这些成分编码抽象意义，而像素编码器还保留高频信息，传达精细细节。这一启发性发现提供了一种统一视角，将编码器行为与其潜在频谱结构联系起来。我们将其定义为棱镜假设，其中每种数据模态可以被视为将自然世界投影到共享特征频谱上的过程，就像棱镜一样。基于这一洞察，我们提出了统一自编码器（UAE），该模型通过创新的频带调节器协调语义结构和像素细节，使它们无缝共存。在ImageNet和MS-COCO基准上的广泛实验验证了我们的UAE能够将语义抽象和像素级保真度统一到一个最先进的单一潜在空间中。

Summary / 总结

This paper explores the relationship between semantic and pixel encoders by analyzing their spectral characteristics. It proposes the Prism Hypothesis, which suggests that semantic encoders focus on low-frequency components for abstract meaning, while pixel encoders capture both low- and high-frequency components for fine-grained details. Based on this hypothesis, the authors introduce Unified Autoencoding (UAE), a model that harmonizes semantic and pixel representations through a frequency-band modulator. Experiments on ImageNet and MS-COCO show that UAE achieves state-of-the-art performance in unifying semantic abstraction and pixel-level fidelity.

本文通过分析语义和像素编码器的频谱特性，探讨了它们之间的关系。提出了棱镜假设，认为语义编码器主要捕捉低频成分以表达抽象意义，而像素编码器则同时保留低频和高频成分以表达细节。基于这一假设，作者提出了统一自编码器（UAE），该模型通过频带调节器协调语义和像素表示。在ImageNet和MS-COCO基准上的实验表明，UAE在统一语义抽象和像素级保真度方面达到了最先进的性能。
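
To make the low-frequency versus high-frequency distinction concrete, the sketch below splits a one-dimensional signal into two bands with a plain discrete Fourier transform. It is only an illustration of band splitting in general: the function names, the toy signal, and the cutoff are assumptions, and nothing here reproduces the paper's frequency-band modulator or its actual encoders.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform; returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split x into a low band (|frequency bin| <= cutoff) and the remaining high band."""
    n = len(x)
    spectrum = dft(x)
    low = [spectrum[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [spectrum[k] - low[k] for k in range(n)]
    return idft(low), idft(high)

# Hypothetical toy signal: a slow component plus a fast, small-amplitude one.
n = 64
coarse = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
detail = [0.3 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
signal = [c + d for c, d in zip(coarse, detail)]

low, high = split_bands(signal, cutoff=8)
print(max(abs(a - b) for a, b in zip(low, coarse)))   # reconstruction error of the slow part
```

In this toy setting the low band recovers the slow component and the high band recovers the fast one, which mirrors the abstract's description of semantic encoders keeping mostly the former while pixel encoders also retain the latter.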

Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

Authors: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

First: 2025-12-22T18:59:34+00:00 · Latest: 2025-12-22T18:59:34+00:00

Abstract

Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as "in-progress living documents" that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline -- validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.

中文标题/摘要

标题：通过医师监督可扩展地增强临床效度任务基准

自动化临床风险评分的计算提供了一个显著的机会，以减轻医师的行政负担并提升患者护理质量。当前评估这种能力的标准是MedCalc-Bench，这是一个大规模数据集，使用基于LLM的特征提取和基于规则的聚合构建而成。然而，将此类模型生成的基准视为静态的“预言机”（oracle），存在将历史模型错误固化为评估金标准的风险，当这些数据集作为强化学习（RL）的奖励信号时，这一问题会被危险地放大。在本文中，我们提出将临床评分计算等复杂任务的基准视为“进行中的活文档”，应随着创建它们的过程改进而定期重新评估。我们引入了一种系统性的、医师参与的管道，利用先进的代理验证者对MedCalc-Bench进行审计和重新标记，并利用自动化分诊将稀缺的临床医师注意力保留给最具争议的实例。我们的审计揭示，相当一部分原始标签由于提取错误、计算器逻辑不匹配和临床含糊不清而与医学真实情况不符。为了研究这种标签噪声是否对下游RL训练产生实质性影响，我们通过组相对策略优化（GRPO）微调了一个Qwen3-8B模型，并证明使用修正后的标签进行训练比原始基线提高了8.7%的绝对准确率，这验证了标签噪声对模型评估有实质性影响。这些发现强调，在安全关键领域，严格的基准维护是实现真正模型对齐的前提。

Summary / 总结

This study addresses the risk of treating model-generated benchmarks as static oracles when evaluating clinical risk score computation, which can enshrine historical model errors as gold standards. The authors propose a physician-in-the-loop pipeline that uses agentic verifiers to audit and relabel MedCalc-Bench, with automated triage reserving clinician attention for the most contentious instances. The audit finds that a notable fraction of the original labels diverge from medical ground truth because of extraction errors, calculator logic mismatches, and clinical ambiguity. Fine-tuning a Qwen3-8B model via GRPO on the corrected labels yields an 8.7% absolute accuracy improvement over the original baseline, which underlines the importance of rigorous benchmark maintenance in safety-critical domains.

该研究针对使用模型生成的基准作为临床风险评分评估标准时可能延续历史错误的问题，提出了一种医师参与的循环评估管道。研究发现，大量原始标签因提取错误、计算逻辑不符和临床模糊性而失准。使用修正后的标签微调Qwen3-8B模型后，准确率提高了8.7%，表明标签噪声对模型评估有实质性影响。
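
The abstract's concern about label noise in RL reward signals can be illustrated with the group-relative advantage at the core of GRPO: each sampled answer is scored against the label, then standardized within its group. The sketch below is a minimal illustration under stated assumptions. The exact-match reward, the sample answers, and both labels are hypothetical, and whether the paper's training setup uses this reward or this exact normalization is not stated in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def exact_match_reward(answer, label, tol=1e-9):
    """Hypothetical reward: 1 if the computed score matches the label, else 0."""
    return 1.0 if abs(answer - label) <= tol else 0.0

# Hypothetical group of four sampled answers for a single clinical-score prompt.
sampled_answers = [4.0, 4.0, 5.0, 3.0]
noisy_label, corrected_label = 5.0, 4.0

adv_noisy = group_relative_advantages(
    [exact_match_reward(a, noisy_label) for a in sampled_answers])
adv_fixed = group_relative_advantages(
    [exact_match_reward(a, corrected_label) for a in sampled_answers])

print(adv_noisy)  # answers equal to 4.0 get negative advantages under the noisy label
print(adv_fixed)  # the same answers get positive advantages once the label is corrected
```

The sign of the advantage for the same sampled answer flips when the label is corrected, so a wrong label does more than add variance: it pushes the policy away from the medically correct output.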

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Authors: Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu

First: 2025-12-22T18:59:07+00:00 · Latest: 2025-12-22T18:59:07+00:00

Abstract

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.

中文标题/摘要

标题：利用大规模多模态对应学习推动视听感知前沿

我们介绍了感知编码器视听模型（PE-AV），这是一种用于音频和视频理解的新一代编码器，通过扩展对比学习进行训练。基于PE，PE-AV 对扩展表示到音频以及原生支持跨音频-视频、音频-文本和视频-文本模态的联合嵌入做出了多项关键贡献。PE-AV 统一的跨模态嵌入使我们能够实现新的任务，如语音检索，并在标准的音频和视频基准测试中达到新的最佳性能。我们通过构建强大的视听数据引擎，为O(100M)量级（约一亿对）的音频-视频数据合成高质量的字幕，从而实现跨模态的一致大规模监督。我们的音频数据包括语音、音乐和一般声效，避免了先前工作中常见的单域限制。我们利用十个成对对比目标，表明跨模态和字幕类型对的扩展增强了对齐并提高了零样本性能。我们进一步通过使用帧级对比目标对PE-AV进行微调，开发了PE-A-Frame，使其能够实现细粒度的音频帧到文本对齐，用于声音事件检测等任务。

Summary / 总结

The research aims to enhance audiovisual perception through large-scale multimodal correspondence learning. PE-AV, a new family of encoders, is trained with scaled contrastive learning to extend representations to audio and support joint embeddings across audio-video, audio-text, and video-text modalities. Key experimental findings include setting a new state of the art across standard audio and video benchmarks and enabling novel tasks such as speech retrieval. This is achieved by synthesizing high-quality captions for a large dataset of audio-video pairs and using ten pairwise contrastive objectives to improve alignment and zero-shot performance.

研究旨在通过大规模多模态对应学习提升音频视觉感知。PE-AV 是一种新的编码器家族，通过缩放对比学习训练以扩展到音频并支持跨音频-视频、音频-文本和视频-文本模态的联合嵌入。关键实验发现包括在标准音频和视频基准测试中达到新的最先进水平，并且能够实现诸如语音检索等新任务。这通过为大量音频-视频对合成高质量的字幕并使用十个成对对比目标来提高对齐和零样本性能来实现。
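
A pairwise contrastive objective of the kind the abstract mentions can be sketched for a single modality pair as a symmetric InfoNCE loss: matching items should score higher than every mismatched pairing in the batch, in both retrieval directions. This is an assumption-laden illustration. The abstract does not specify the loss form, the temperature, or the embedding sizes, and the toy vectors below are invented.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a retrieval losses for one modality pair."""
    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

# Hypothetical 2-D embeddings for three paired audio/video clips.
audio = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.2]]
video = [[0.9, 0.0], [0.1, 1.1], [-0.8, 0.1]]

aligned = symmetric_contrastive_loss(audio, video)
shuffled = symmetric_contrastive_loss(audio, video[1:] + video[:1])
print(aligned, shuffled)
```

Training with several such objectives would amount to summing this kind of loss over each modality or caption-type pair; how PE-AV weights its ten objectives is not described in the abstract.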

Zero-shot Reconstruction of In-Scene Object Manipulation from Video

Authors: Dixuan Lin, Tianyou Wang, Zhuoyang Pan, Yufu Wang, Lingjie Liu, Kostas Daniilidis

First: 2025-12-22T18:58:29+00:00 · Latest: 2025-12-22T18:58:29+00:00

Abstract

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.

中文标题/摘要

标题：基于单目RGB视频的场景内物体操作零样本重建

我们构建了首个从单目RGB视频中重建场景内物体操作的系统。由于场景重建病态、手物深度模糊以及需要物理上合理的交互，这极具挑战性。现有方法在手为中心的坐标系中操作，忽视了场景，阻碍了度量准确性和实际应用。在我们的方法中，我们首先使用数据驱动的基础模型初始化核心组件，包括物体网格和姿态、场景点云以及手的姿态。然后我们应用两阶段优化，从抓取到交互恢复完整的手物运动，使其与输入视频中观察到的场景信息保持一致。

From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang

First: 2025-12-22T18:58:12+00:00 · Latest: 2025-12-22T18:58:12+00:00

Comments: Project page: https://harmlesssr.github.io/openbench/

Abstract

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

中文标题/摘要

标题：从室内到开放世界：揭示MLLM的空间推理差距

虽然多模态大型语言模型（MLLMs）在语义任务上取得了令人印象深刻的性能，但它们的空间智能——对于稳健和基于现实的AI系统至关重要——仍然发展不足。现有的基准测试在诊断这一局限性方面存在不足：它们要么专注于过于简化的定性推理，要么依赖于特定领域的室内数据，受限于缺乏具有可验证度量真实性的室外数据集。为了弥合这一差距，我们引入了一个基于行人视角视频的大规模基准，这些视频由同步立体相机、LiDAR和IMU/GPS传感器捕获。该数据集提供了度量精确的3D信息，使我们能够自动生成跨越层次谱系的空间推理问题——从定性的关系推理到定量的度量和运动理解。评估表明，在结构化的室内基准测试中观察到的性能提升在开放世界环境中消失。进一步使用合成异常场景和盲测分析证实，当前的MLLMs严重依赖于语言先验而非基于视觉的推理。因此，我们的基准测试为诊断这些局限性并推进物理上基于现实的空间智能提供了一个有原则的平台。

Summary / 总结

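The research aims to expose the spatial reasoning limitations of Multimodal Large Language Models (MLLMs) by introducing a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. The dataset provides metrically precise 3D information, enabling automatically generated questions that range from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations show that the performance gains observed on structured indoor benchmarks vanish in open-world settings, and further analysis with synthetic abnormal scenes and blinding tests indicates that current MLLMs rely heavily on linguistic priors rather than grounded visual reasoning. The benchmark provides a platform for diagnosing these limitations and advancing physically grounded spatial intelligence.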

研究旨在通过引入一个使用行人视角视频和同步传感器的新基准，提供精确的3D信息，来揭示MLLMs的空间推理缺陷。研究发现，MLLMs在结构化的室内环境中表现良好，但在开放世界场景中表现不佳，表明它们主要依赖于语言先验而非基于视觉的推理。该基准为诊断和推进物理上接地的空间智能提供了平台。

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Authors: Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

First: 2025-08-27T17:39:11+00:00 · Latest: 2025-12-22T18:57:39+00:00

Comments: New experiments on VL retention and new ablations. 18 pages

Abstract

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. We also provide an ablation study on vision-language ability retention on the LIBERO-OOD (Out-of-Distribution) benchmark, with our method improving over autoregressive, MLP decoder, and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our code is available at https://github.com/Liang-ZX/DiscreteDiffusionVLA/tree/libero.

中文标题/摘要

标题：离散扩散VLA：将离散扩散引入视觉-语言-行动策略的动作解码

视觉-语言-行动（VLA）模型将大型视觉-语言骨干网络适应为将图像和指令映射为机器人行动。然而，现有的VLA要么以固定从左到右的顺序自回归生成行动，要么在骨干网络之外附加单独的MLP或扩散头，导致信息路径碎片化和专门的训练要求，阻碍了统一、可扩展架构的构建。我们提出了离散扩散VLA，这是一种统一变换器策略，使用离散扩散建模离散化的行动片段。该设计保留了扩散的逐步细化范式，同时原生兼容VLM的离散标记接口。我们的方法实现了自适应解码顺序，先解决简单的行动元素，再解决困难的元素，并使用二次重新掩码在细化轮次中重新访问不确定的预测，从而提高一致性并实现稳健的错误校正。这种统一的解码器保留了预训练的视觉-语言先验，支持并行解码，打破自回归瓶颈，并减少函数评估次数。离散扩散VLA在LIBERO上的平均成功率达到了96.3%，在SimplerEnv-Fractal上的视觉匹配率为71.2%，在SimplerEnv-Bridge上的综合得分为54.2%。我们还在LIBERO-OOD（离群分布）基准上提供了视觉-语言能力保留的新消融研究，我们的方法优于自回归、MLP解码器和连续扩散基线。这些发现表明，离散扩散VLA支持精确的动作建模和一致的训练，为扩展VLA到更大规模的模型和数据集奠定了基础。我们的代码可在https://github.com/Liang-ZX/DiscreteDiffusionVLA/tree/libero/ 获取。

Summary / 总结

The research aims to address the limitations of existing Vision-Language-Action (VLA) models by proposing Discrete Diffusion VLA, which models discretized action chunks with discrete diffusion. This method retains the progressive refinement paradigm of diffusion while being compatible with the discrete token interface of vision-language models. Key findings include an adaptive decoding order that handles easy actions first and revisits uncertain predictions, leading to improved success rates on LIBERO (96.3% avg. success rates), SimplerEnv-Fractal (71.2% visual matching), and SimplerEnv-Bridge (54.2% overall). The method also outperforms autoregressive, MLP decoder, and continuous diffusion baselines in retaining vision-language abilities on the LIBERO-OOD benchmark.

论文提出了Discrete Diffusion VLA，这是一种统一变换器策略，使用离散扩散来建模动作片段，解决了现有VLA模型的局限性。该模型实现了自适应解码，重新审视不确定的预测，提高了一致性和错误修正能力。该模型在LIBERO上达到了96.3%的平均成功率，在SimplerEnv-Fractal上实现了71.2%的视觉匹配，在SimplerEnv-Bridge上达到了54.2%的整体成功率。消融研究显示该方法在保留视觉语言能力方面非常有效。该设计支持并行解码，减少了函数评估次数，为扩展VLA模型奠定了基础。
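
The adaptive decoding order and secondary re-masking described in the abstract can be pictured with a small mask-and-commit loop: start from a fully masked action chunk, commit the most confident positions first, and re-mask any committed token the model later doubts. Everything below is a toy under stated assumptions. `decode_chunk`, `toy_predict`, the confidence numbers, and the threshold are hypothetical stand-ins, not the paper's model, schedule, or code.

```python
import math

MASK = None  # placeholder for a not-yet-decoded action token

def decode_chunk(predict, length, rounds=4, remask_threshold=0.5):
    """Iteratively fill a masked chunk, most confident positions first."""
    tokens = [MASK] * length
    events = []
    for r in range(rounds):
        proposals = predict(tokens)  # one (token, confidence) pair per position
        if r < rounds - 1:
            # Secondary re-masking: reopen committed tokens the model now doubts.
            for i, (_, conf) in enumerate(proposals):
                if tokens[i] is not MASK and conf < remask_threshold:
                    tokens[i] = MASK
                    events.append(("remask", r, i))
        masked = [i for i in range(length) if tokens[i] is MASK]
        if not masked:
            break
        # Spread the remaining positions over the remaining rounds.
        quota = math.ceil(len(masked) / (rounds - r))
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:quota]:
            tokens[i] = proposals[i][0]
            events.append(("commit", r, i))
    return tokens, events

# Hypothetical stand-in for the policy: it grows more confident as context fills in,
# and makes one overconfident early mistake at position 1.
TARGET = [3, 1, 4, 1, 5, 9]
BASE_CONF = [0.9, 0.2, 0.8, 0.3, 0.7, 0.1]

def toy_predict(tokens):
    out = []
    for i, current in enumerate(tokens):
        context = sum(t is not MASK for j, t in enumerate(tokens) if j != i)
        guess = TARGET[i]
        conf = min(1.0, BASE_CONF[i] + 0.15 * context)
        if i == 1 and context < 2:
            guess, conf = 7, 0.85      # wrong token, high confidence
        if current is not MASK and current != guess:
            conf = 0.05                # the committed token no longer looks right
        out.append((guess, conf))
    return out

tokens, events = decode_chunk(toy_predict, length=len(TARGET))
print(tokens)
print(events)
```

In this toy run the wrong token committed at position 1 is re-masked once more context is available and is then replaced, which is the error-correction behaviour the abstract attributes to re-masking. All positions in a round are proposed in a single `predict` call, which is where the parallel-decoding benefit would come from.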

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Authors: Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

First: 2025-10-10T17:54:24+00:00 · Latest: 2025-12-22T18:56:01+00:00

Abstract

Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official contests of 14 Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 34 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestants, who usually place above the 90th percentile. In contrast, among open-weight reasoning models, GPT-OSS-120B reaches only the 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results are publicly available on our website.

中文标题/摘要

标题：LiveOIBench：大型语言模型能否在信息学奥林匹克竞赛中超越人类选手？

由于其复杂性和易于验证的特点，编程竞赛问题逐渐成为评估大型语言模型（LLMs）编程能力的重要基准。然而，当前的编程基准存在一些局限性，如缺乏极富挑战性的问题、测试用例覆盖不足以及依赖于限制访问性的在线平台API。为解决这些问题，我们引入了LiveOIBench，这是一个包含403个专家精选的信息学奥林匹克级别编程竞赛问题的综合基准，每个问题平均有60个专家设计的测试用例。这些问题直接来源于2023年至2025年间不同地区14场官方信息学奥林匹克竞赛的72场正式比赛。LiveOIBench通过四个关键特性脱颖而出：（1）精心挑选的高质量任务，附有详细的子任务评分标准和广泛的私有测试用例；（2）直接整合顶尖选手的表现数据，以实现与顶级人类选手的对比；（3）计划从新发布的奥林匹克竞赛问题中持续、无污染地更新；（4）一个自包含的评估系统，便于离线和易于复现的评估。对34个流行的通用和推理型LLMs进行基准测试后，我们发现GPT-5达到了显著的第81.76百分位，这是一个强大的结果，但仍低于顶级人类选手，后者通常排名在第90百分位以上。相比之下，开源推理模型GPT-OSS-120B仅达到第60百分位，突显了与前沿封闭模型之间的能力差距。详细分析表明，稳健的推理模型更倾向于精确的问题分析而非过度探索，这表明未来模型应强调结构化分析并尽量减少不必要的探索。所有数据、代码和排行榜结果均可在我们的网站上公开获取。

Summary / 总结

The paper introduces LiveOIBench, a benchmark of 403 expert-curated Olympiad-level competitive programming problems drawn from 72 official contests, to evaluate the coding capabilities of large language models (LLMs). It compares 34 popular LLMs, finding that GPT-5 reaches the 81.76th percentile, a strong result that still falls short of top human contestants, who usually place above the 90th percentile, while the open-weight GPT-OSS-120B reaches only the 60th percentile. The study suggests that models should focus on precise problem analysis rather than excessive exploration.

论文介绍了LiveOIBench基准，使用了403个专家精选的奥林匹克级别编程竞赛问题来评估大型语言模型（LLMs）。研究发现，GPT-5的表现为第81.76百分位，虽然优于许多模型，但仍不及顶尖的人类选手。研究强调模型需要专注于精确的问题分析而非过度探索。
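
Percentile figures such as those quoted above come from ranking a model's contest score against the human contestants in the same contest. The sketch below shows one common convention, the share of contestants scoring strictly below the model. The scores are hypothetical, and the abstract does not say which percentile definition LiveOIBench uses.

```python
from bisect import bisect_left

def percentile_rank(score, contestant_scores):
    """Percentage of contestants scoring strictly below `score`."""
    ranked = sorted(contestant_scores)
    return 100.0 * bisect_left(ranked, score) / len(ranked)

# Hypothetical final scores of ten human contestants in one contest.
human_scores = [12, 35, 48, 60, 61, 75, 80, 92, 100, 100]

model_percentile = percentile_rank(78, human_scores)
print(model_percentile)  # 60.0: six of the ten contestants scored below 78
```

Other conventions count ties as half or use rank-based formulas, so percentile numbers are only comparable when the same definition is applied throughout.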

WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Authors: Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang

First: 2025-12-22T18:53:50+00:00 · Latest: 2025-12-22T18:53:50+00:00

Comments: Project page: https://hyokong.github.io/worldwarp-page/

Abstract

Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: https://hyokong.github.io/worldwarp-page/

中文标题/摘要

标题：WorldWarp：异步视频扩散传播3D几何

生成长距离、几何上一致的视频面临着根本性的难题：虽然一致性要求严格遵守像素空间中的3D几何，但最先进的生成模型在相机条件下的潜在空间中运行得最有效。这种脱节导致当前方法在处理遮挡区域和复杂相机轨迹时遇到困难。为弥合这一差距，我们提出了WorldWarp框架，该框架结合了3D结构锚点和2D生成细化器。为了建立几何基础，WorldWarp通过高斯点积（3DGS）维护一个在线的3D几何缓存。通过显式地将历史内容映射到新视图，该缓存充当结构支架，确保每个新帧都尊重先前的几何结构。然而，静态映射不可避免地会因遮挡而留下空洞和伪影。我们使用一种时空扩散（ST-Diff）模型来解决这一问题，该模型旨在实现“填充和修订”的目标。我们的关键创新是一个时空变化的噪声调度：空白区域接收全噪声以触发生成，而映射区域接收部分噪声以实现细化。通过在每一步动态更新3D缓存，WorldWarp在视频片段之间保持一致性。因此，它通过确保3D逻辑指导结构而扩散逻辑完善纹理，实现了最先进的保真度。项目页面：https://hyokong.github.io/worldwarp-page/

Summary / 总结

WorldWarp addresses the challenge of generating long-range, geometrically consistent video by coupling a 3D structural anchor with a 2D generative refiner. It uses an online 3D geometric cache built via Gaussian Splatting (3DGS) to maintain structural consistency. A Spatio-Temporal Diffusion (ST-Diff) model with a spatio-temporal varying noise schedule is employed to fill in occluded areas and refine the generated content, ensuring that 3D logic guides structure while diffusion logic perfects texture. This approach achieves state-of-the-art fidelity in video generation.

WorldWarp 通过结合 3D 结构锚点和 2D 生成细化器来解决长距离、几何一致性的视频生成挑战。它使用通过高斯点积（3DGS）构建的 3D 几何缓存来保持结构一致性，并使用时空扩散（ST-Diff）模型来填补被遮挡的区域。关键发现表明，WorldWarp 通过确保 3D 逻辑引导结构而扩散逻辑完善纹理来实现最先进的保真度。
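
The "full noise for holes, partial noise for warped regions" idea can be sketched as a per-pixel noise-strength map applied to a warped frame. This is a single-frame toy under stated assumptions: the partial strength of 0.4, the variance-preserving mixing rule, and the tiny frame are all invented for illustration, and the paper's actual schedule also varies over time and is not specified in the abstract.

```python
import math
import random

def noise_levels(valid_mask, partial=0.4):
    """Per-pixel noise strength: 1.0 for holes, `partial` for warped pixels."""
    return [[partial if valid else 1.0 for valid in row] for row in valid_mask]

def apply_noise(frame, levels, noise):
    """Variance-preserving mix of frame content and noise, pixel by pixel."""
    return [
        [math.sqrt(1.0 - s * s) * x + s * e for x, s, e in zip(row, srow, erow)]
        for row, srow, erow in zip(frame, levels, noise)
    ]

# Hypothetical 2x3 warped frame; zeros mark holes left by occlusion.
warped_frame = [[0.8, 0.6, 0.0], [0.7, 0.0, 0.0]]
valid_mask = [[True, True, False], [True, False, False]]

levels = noise_levels(valid_mask)
rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in warped_frame]
noisy = apply_noise(warped_frame, levels, noise)
print(levels)
```

At strength 1.0 a pixel becomes pure noise, so the denoiser has to generate it from scratch, while at partial strength most of the warped content survives and can only be revised, which matches the "fill-and-revise" wording of the abstract.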

Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

Authors: Niclas Griesshaber, Jochen Streb

First: 2025-12-22T18:53:03+00:00 · Latest: 2025-12-22T18:53:03+00:00

Abstract

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.

中文标题/摘要

标题：基于档案图像扫描的多模态LLM构建历史数据集：德国专利（1877-1918）

我们利用多模态大型语言模型（LLMs）从9,562份档案图像扫描中构建了1877年至1918年间共计306,070份德国专利数据集，使用基于Gemini-2.5-Pro和Gemini-2.5-Flash-Lite的LLM管道。我们的基准测试提供了初步证据，表明多模态LLMs能够创建比我们的研究助理质量更高的数据集，同时在基于我们的图像语料库构建专利数据集时，速度快795倍以上，成本低205倍。每页约包含20至50项专利条目，采用哥特体和罗马体字体，双栏排版。我们的原始资料的字体和布局复杂性表明，多模态LLMs是经济史中数据集构建范式的转变。我们开源了基准测试、专利数据集以及基于LLM的数据管道，借助LLM辅助编码工具，该管道可以轻松适应其他图像语料库，降低非技术研究人员的门槛。最后，我们解释了部署LLMs构建历史数据集的经济学，并推测这对经济史领域的潜在影响。

Summary / 总结

This study uses multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans. The pipeline is powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. A benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than the authors' research assistants while being more than 795 times faster and 205 times cheaper. The benchmarking and patent datasets, along with the LLM-based data pipeline, are open-sourced to facilitate similar projects on other image corpora.

本研究利用多模态大型语言模型（LLMs）从1877年至1918年的9,562张档案图像扫描中构建了306,070项德国专利数据集。由Gemini-2.5-Pro和Gemini-2.5-Flash-Lite驱动的管道表明，LLMs可以比研究助理更快更便宜地创建高质量的数据集，速度快795倍，成本低205倍。该数据集和基于LLM的数据管道已开源，以促进其他图像语料库的类似项目。

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

First: 2025-12-22T18:51:48+00:00 · Latest: 2025-12-22T18:51:48+00:00

Comments: Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.

中文标题/摘要

标题：自底向上的策略优化：您的语言模型策略中隐含着内部策略

现有的强化学习（RL）方法将大型语言模型（LLMs）视为单一统一的策略，忽视了其内部机制。因此，理解策略如何在各层和模块中演变对于实现更精确的优化和揭示复杂的推理机制至关重要。在本文中，我们通过利用Transformer残差流的内在分割以及隐藏状态组成与未嵌入矩阵之间的等价性来分解语言模型策略，从而揭示了内部层策略和内部模块策略。通过分析内部策略的熵，我们发现：(a) 早期层保持高熵以进行探索，顶层层收敛到接近零的熵以进行细化，不同模型系列的收敛模式不同。(b) LLama在最终层的预测空间迅速收敛，而Qwen系列模型，尤其是Qwen3，表现出更接近人类的、逐步结构化的推理模式。受这些发现的启发，我们提出了自底向上的策略优化（BuPO），这是一种新的RL范式，在早期训练过程中直接优化内部层策略。通过在较低层对训练目标进行对齐，BuPO重建了基础的推理能力并实现了更好的性能。在复杂推理基准上的广泛实验表明了我们方法的有效性。我们的代码可在https://github.com/Trae1ounG/BuPO获取。

Summary / 总结

This paper addresses the limitations of existing reinforcement learning approaches that treat large language models as a single unified policy. By decomposing the language model policy, the authors reveal Internal Layer Policies and Internal Modular Policies. They find that early layers maintain high entropy for exploration, while top layers converge to low entropy for refinement. Motivated by these findings, they propose Bottom-up Policy Optimization (BuPO), which directly optimizes the internal layer policy during early training, leading to superior performance on complex reasoning benchmarks.

本文针对在强化学习中将大型语言模型视为单一统一策略的局限性，提出了底层策略优化（BuPO）方法，该方法在早期训练中直接优化内部层策略。研究发现，早期层保持高熵以进行探索，而顶层层收敛到接近零的熵以进行细化。BuPO重建了基础的推理能力，并在复杂的推理基准测试中表现出色。
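
The notion of an internal layer policy can be sketched as follows: take an intermediate hidden state, multiply it by the unembedding matrix, apply a softmax, and measure the entropy of the resulting token distribution. The numbers below are toy values chosen to contrast a diffuse early-layer state with a peaked late-layer one. They are assumptions for illustration, and real models typically also apply a final normalization layer before unembedding, which this sketch omits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def internal_layer_policy(hidden, unembed):
    """Project a hidden state through the unembedding matrix, then softmax."""
    logits = [sum(h * w for h, w in zip(hidden, token_vector)) for token_vector in unembed]
    return softmax(logits)

# Toy setup: 2-dimensional hidden states and a 3-token vocabulary (hypothetical values).
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocabulary token
early_hidden = [0.1, 0.05]   # small-norm state: nearly uniform policy
late_hidden = [6.0, -2.0]    # large-norm state: sharply peaked policy

early_entropy = entropy(internal_layer_policy(early_hidden, unembed))
late_entropy = entropy(internal_layer_policy(late_hidden, unembed))
print(early_entropy, late_entropy)
```

Tracking this entropy layer by layer is the kind of measurement behind the abstract's claim that early layers stay high-entropy while top layers converge toward near-zero entropy.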

Probing forced responses and causality in data-driven climate emulators: conceptual limitations and the role of reduced-order models

Authors: Fabrizio Falasca

First: 2025-06-27T18:04:36+00:00 · Latest: 2025-12-22T18:48:57+00:00

Abstract

A central challenge in climate science and applied mathematics is developing data-driven models of multiscale systems that capture both stationary statistics and responses to external perturbations. Current neural climate emulators aim to resolve the atmosphere-ocean system in all its complexity but often struggle to reproduce forced responses, limiting their use in causal studies such as Green's function experiments. To explore the origin of these limitations, we first examine a simplified dynamical system that retains key features of climate variability. We interpret the results through linear response theory, providing a rigorous framework to evaluate neural models beyond stationary statistics and to probe causal mechanisms. We argue that the ability of emulators of multiscale systems to reproduce perturbed statistics depends critically on (i) the choice of an appropriate coarse-grained representation and (ii) careful parameterizations of unresolved processes. These insights highlight reduced-order models, tailored to specific goals, processes, and scales, as valuable alternatives to general-purpose emulators. We next consider a real-world application by developing a neural model to investigate the joint variability of the surface temperature field and radiative fluxes. The model infers a multiplicative noise process directly from data, largely reproduces the system's probability distribution, and enables causal studies through forced responses. We discuss its limitations and outline directions for future work. Overall, these results expose key challenges in data-driven modeling of multiscale physical systems and underscore the value of coarse-grained, stochastic approaches, with response theory providing a principled framework to guide model design and enhance causal understanding.

中文标题/摘要

标题：探究数据驱动气候模拟器中的强迫响应和因果关系：概念局限性及降阶模型的作用

气候科学和应用数学中的一个核心挑战是开发能够捕捉多尺度系统中稳态统计和对外部扰动响应的数据驱动模型。当前的神经气候模拟器试图解决大气-海洋系统的全部复杂性，但往往难以再现强迫响应，限制了其在因果研究中的应用，如格林函数实验。为了探索这些局限性的来源，我们首先研究了一个简化动力系统，保留了气候变率的关键特征。我们通过线性响应理论来解释结果，提供了一个超越稳态统计的神经模型评估框架，并探究因果机制。我们认为，多尺度系统模拟器再现扰动统计的能力取决于(i) 适当粗粒化表示的选择和(ii) 未解决过程的精细参数化。这些见解突显了针对特定目标、过程和尺度定制的降阶模型作为通用模拟器的有价值替代品的重要性。我们接着考虑了一个实际应用，通过开发一个神经模型来研究地表温度场和辐射通量的联合变异性。该模型直接从数据中推断出一个乘性噪声过程，基本再现了系统的概率分布，并通过强迫响应进行因果研究。我们讨论了其局限性并指出了未来工作的方向。总体而言，这些结果揭示了多尺度物理系统数据驱动建模中的关键挑战，并强调了粗粒化、随机方法的价值，响应理论为模型设计提供了一个原则性的框架，以增强因果理解。

Summary / 总结

The study addresses the challenge of developing data-driven models for multiscale climate systems that can capture both stationary statistics and responses to external perturbations. It examines a simplified dynamical system using linear response theory to evaluate neural models and highlights the importance of appropriate coarse-grained representations and parameterizations of unresolved processes. The research demonstrates that neural models can infer multiplicative noise processes and reproduce system statistics, enabling causal studies, but also identifies limitations that need to be addressed for better model performance.

研究旨在解决开发能够同时捕捉静止统计和外部扰动响应的数据驱动气候模型的挑战。它通过分析简化动力系统并应用线性响应理论来评估神经气候模拟器。关键发现包括选择合适的粗粒化表示和参数化未解决的过程以再现扰动统计的重要性。研究展示了定制化的、随机的方法作为通用模拟器的替代品的价值，如通过神经模型直接从数据中推断出一个乘法噪声过程，并通过强迫响应进行因果研究的实世界应用示例。

Over++: Generative Video Compositing for Layer Interaction Effects

Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman

First: 2025-12-22T18:39:58+00:00 · Latest: 2025-12-22T18:39:58+00:00

Comments: Project page: https://overplusplus.github.io/

Abstract

In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.

中文标题/摘要

标题：Over++：生成视频合成以实现图层交互效果

在专业的视频合成工作流程中，艺术家必须手动创建前景主体与背景图层之间的环境交互，如阴影、反射、灰尘和水花。现有的视频生成模型难以在保留输入视频的同时添加这些效果，而当前的视频修复方法要么需要昂贵的逐帧掩模，要么产生不切实际的结果。我们引入了增强合成这一新任务，该任务根据文本提示和输入视频图层合成现实且半透明的环境效果，同时保留原始场景。为了解决这一任务，我们提出了Over++这一视频效果生成框架，该框架不假设相机姿态、场景静止或深度监督。我们构建了一个针对该任务的配对效果数据集，并引入了一种无配对增强策略，以保留文本驱动的可编辑性。我们的方法还支持可选的掩模控制和关键帧指导，而无需密集标注。尽管在有限的数据上进行训练，Over++仍能生成多样且现实的环境效果，并在效果生成和场景保留方面优于现有基线。

Summary / 总结

The research aims to automate the creation of realistic environmental interactions in video compositing, such as shadows and reflections, conditioned on text prompts and input video layers. Over++ is a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. It supports optional mask control and keyframe guidance without requiring dense annotations. Despite being trained on limited data, the method produces diverse and realistic effects while preserving the original scene, and outperforms existing baselines in both effect generation and scene preservation.

研究旨在通过文本提示和输入视频层自动创建视频合成中的现实环境交互效果，如阴影和反射。Over++是一个不需要假设相机姿态或深度监督的视频效果生成框架。该方法生成多样且真实的效果同时保留原始场景，优于现有技术在效果生成和场景保留方面的表现。它还支持可选的遮罩控制和关键帧指导，无需密集标注，尽管仅在有限数据上进行训练。

CodeTF: One-stop Transformer Library for State-of-the-art Code LLMs

Authors: Nghi D. Q. Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, Steven C. H. Hoi

First: 2023-05-31T05:24:48+00:00 · Latest: 2025-12-22T18:29:22+00:00

Abstract

Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier for the model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, key modules and components, and compare with other related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.

中文标题/摘要

标题：CodeTF：最先进的代码LLM的即用型Transformer库

代码智能在现代软件工程的转型中扮演着关键角色。最近，基于深度学习的模型，尤其是基于Transformer的大规模语言模型（LLMs），通过利用海量的开源代码数据和编程语言特性，在处理这些任务方面展现了显著的潜力。然而，开发和部署此类模型通常需要同时具备机器学习和软件工程的专业知识，这成为模型采用的障碍。在本文中，我们介绍了CodeTF，一个开源的基于Transformer的库，用于最先进的代码LLM和代码智能。遵循模块化设计和可扩展框架的原则，我们设计了CodeTF，提供统一的接口，以实现不同模型、数据集和任务的快速访问和开发。我们的库支持一系列预训练的代码LLM模型和流行的代码基准，包括标准化的接口以高效地训练和提供代码LLM，以及语言特定的解析器和用于提取代码属性的实用函数。在本文中，我们描述了设计原则、架构、关键模块和组件，并与其他相关库工具进行了比较。最后，我们希望CodeTF能够弥合机器学习/生成式AI与软件工程之间的差距，为开发人员、研究人员和实践者提供一个全面的开源解决方案。

Summary / 总结

CodeTF is an open-source Transformer-based library designed to lower the barrier to developing and deploying state-of-the-art code large language models (LLMs) through a unified interface and a modular, extensible design. It supports a collection of pretrained Code LLMs and popular code benchmarks, enabling rapid access and development across different models, datasets, and tasks. Rather than reporting experimental results, the paper describes the library's design principles, architecture, and key modules, including a standardized interface for training and serving code LLMs and language-specific parsers and utility functions for extracting code attributes, and compares CodeTF with related library tools.

CodeTF 是一个开源库，旨在通过使用 Transformer 技术来促进先进代码大型语言模型 (LLM) 的开发和部署。它通过提供统一接口和支持一系列预训练模型和代码基准来降低采用门槛。主要发现包括代码 LLM 的高效训练和提供语言特定解析器和代码属性提取的实用函数，证明了其在代码智能任务中的有效性。

AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang

First: 2025-09-16T06:16:05+00:00 · Latest: 2025-12-22T18:22:20+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

中文标题/摘要

标题：AsyMoE：利用模态不对称性增强大型视觉-语言模型专家专业化

大型视觉-语言模型（LVLMs）通过扩展架构和大量训练，在多模态任务中表现出色。然而，现有的混合专家（MoE）方法由于视觉和语言处理之间的不对称性而面临挑战。视觉信息是空间上完整的，而语言需要保持顺序上下文。因此，MoE模型难以平衡模态特定特征和跨模态交互。通过系统分析，我们观察到，深层的语言专家逐渐失去上下文定位，并更多依赖参数知识，而不是利用提供的视觉和语言信息。为了解决这个问题，我们提出了AsyMoE，这是一种新型架构，通过三个专门的专家组来建模这种不对称性。我们设计了模态内专家进行模态特定处理，双曲跨模态专家进行分层跨模态交互，并设计了证据优先的语言专家以抑制参数偏差并保持上下文定位。广泛的实验表明，与vanilla MoE和模态特定MoE相比，AsyMoE分别提高了26.58%和15.45%的准确率，且激活参数量比密集模型少25.45%。

Summary / 总结

The research aims to improve the performance of large vision-language models by addressing the asymmetry between visual and linguistic processing in existing Mixture of Experts (MoE) approaches. AsyMoE, a novel architecture, is proposed to model this asymmetry using three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to maintain contextual grounding. Despite the promising results showing 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, the authors have withdrawn the submission due to a fundamental error in the methodology that affects the validity of the main results.

论文针对现有大规模视觉-语言模型（LVLM）中视觉和语言处理之间的不对称性问题，提出了AsyMoE架构，该架构引入了三种专门的专家组：用于模态特定处理的模态内专家、用于层级跨模态交互的双曲跨模态专家，以及用于抑制参数偏见并保持上下文定位的证据优先语言专家。实验表明，AsyMoE在准确率上分别比vanilla MoE和模态特定MoE高出26.58%和15.45%，并且激活参数量比密集模型少25.45%。然而，由于方法上的根本错误影响了主要结果的有效性，该提交已被作者撤回。

GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

First: 2025-11-02T11:58:55+00:00 · Latest: 2025-12-22T18:21:18+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.

中文标题/摘要

标题：GraphGeo：基于异构图神经网络的多智能体辩论框架用于视觉地理定位

视觉地理定位需要广泛的空间知识和复杂的推理来确定图像位置，而不依赖GPS元数据。传统的检索方法受限于数据库的覆盖范围和质量。最近的大规模视觉-语言模型（LVLMs）能够直接从图像内容进行位置推理，但单个模型在处理多样化的地理区域和复杂的场景时存在困难。现有的多智能体系统通过模型协作来提高性能，但所有智能体交互均被统一处理。它们缺乏有效处理相互矛盾预测的机制。我们提出 **GraphGeo**，一种使用异构图神经网络的多智能体辩论框架，用于视觉地理定位。我们的方法通过类型化的边来建模多样的辩论关系，区分支持性的合作、竞争性的论辩以及知识转移。我们引入了一种双层辩论机制，结合节点级细化和边级论辩建模。跨层拓扑细化策略使图结构和智能体表示能够共同进化。在多个基准上的实验表明，GraphGeo 显著优于现有最佳方法。我们的框架通过结构化的辩论将智能体之间的认知冲突转化为增强的地理定位准确性。

Summary / 总结

GraphGeo is a multi-agent debate framework for visual geo-localization using heterogeneous graph neural networks. It models diverse debate relationships and introduces a dual-level debate mechanism for node-level refinement and edge-level argumentation. Experiments show that GraphGeo significantly outperforms existing methods. However, the submission was withdrawn due to a fundamental error in the methodology that affects the validity of the main results.

GraphGeo 是一种使用异质图神经网络的多代理辩论框架，用于视觉地理定位。它模型了多样的辩论关系，并引入了节点级细化和边级论辩建模的双重机制。实验表明，GraphGeo 显著优于现有方法。然而，由于方法中的根本错误影响了主要结果的有效性，提交已被作者撤回。

4D Gaussian Splatting as a Learned Dynamical System

Authors: Arnold Caleb Asiimwe, Carl Vondrick

First: 2025-12-22T18:20:29+00:00 · Latest: 2025-12-22T18:20:29+00:00

Abstract

We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as a physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches: (1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering.

中文标题/摘要

标题：四维高斯泼溅作为学习动力系统

我们将四维高斯泼溅重新解释为连续时间动力系统，在这种系统中，场景运动源自对一个学习到的神经动力学场的积分，而不是逐帧变形。我们称之为EvoGS的这种表述将高斯表示视为在学习到的运动法则下连续演变的物理系统。这解锁了基于变形的方法所不具备的能力：(1) 通过建模底层运动法则，从稀疏的时间监督中高效学习；(2) 时间外推，支持超出观测时间范围的前向和后向预测；(3) 组合动力学，允许局部动力学注入以实现可控的场景合成。在动态场景基准测试中的实验表明，EvoGS在保持实时渲染的同时，实现了比变形场基线更好的运动连贯性和时间一致性。

Summary / 总结

The research aims to enhance 4D Gaussian Splatting by interpreting it as a continuous-time dynamical system, enabling the modeling of scene motion through a learned neural field. This method, named EvoGS, allows for sample-efficient learning from sparse temporal data, temporal extrapolation, and compositional dynamics. Experiments demonstrate that EvoGS outperforms deformation-based approaches in terms of motion coherence and temporal consistency while supporting real-time rendering.

研究旨在通过将4D高斯散点视为连续时间动力系统来提升其性能，其中场景运动通过集成一个学习到的神经动力场来推导。这种方法称为EvoGS，允许从稀疏时间数据中高效学习，进行时间外推，并实现可组合的动力学。实验表明，EvoGS在运动连贯性和时间一致性方面优于基于变形场的基线方法，同时支持实时渲染。
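
The idea of obtaining motion by integrating a field, rather than storing per-frame deformations, can be sketched in a few lines. The example below is only an illustration under stated assumptions: `velocity` is a hypothetical stand-in for the learned neural dynamical field (here a fixed rotation), the state is a single 2D Gaussian center, and the RK4 integrator is a generic choice, not necessarily what EvoGS uses.

```python
def velocity(pos, t):
    """Hypothetical stand-in for a learned field: rigid rotation about the origin."""
    x, y = pos
    return (-y, x)


def rk4_step(pos, t, dt):
    """Advance one position by dt with the classical fourth-order Runge-Kutta rule."""
    def shifted(p, k, scale):
        return (p[0] + scale * k[0], p[1] + scale * k[1])

    k1 = velocity(pos, t)
    k2 = velocity(shifted(pos, k1, dt / 2), t + dt / 2)
    k3 = velocity(shifted(pos, k2, dt / 2), t + dt / 2)
    k4 = velocity(shifted(pos, k3, dt), t + dt)
    return (
        pos[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
        pos[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
    )


def integrate(pos, t0, t1, steps=100):
    """Integrate from t0 to t1; t1 < t0 gives a negative dt, i.e. backward prediction."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        pos = rk4_step(pos, t, dt)
        t += dt
    return pos
```

Because the state at any time is defined by integration, querying a time outside the observed range, or before the start time, uses the same `integrate` call with different endpoints, which is the sense in which such a formulation supports extrapolation in both directions.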

GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Authors: Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan, Jin Huang

First: 2025-10-14T02:48:50+00:00 · Latest: 2025-12-22T18:20:12+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures? We introduce GraphShaper, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.

中文标题/摘要

标题：GraphShaper：几何感知对齐以提高文本标注图的迁移学习

图基础模型代表了一种变革性的范式，用于在多种图领域中学习可迁移的表示。最近的方法利用大型语言模型通过对比学习将图和文本模态统一到共享表示空间中。然而，系统性的评估揭示了在结构边界处存在显著的性能下降，这些边界处不同的拓扑模式交汇，准确率损失超过20个百分点。这一问题源于一个关键限制：当前方法假设所有图结构都可以编码在一个单一的欧几里得空间中。实际上，树结构需要双曲几何来保持层次分支，而循环模式则依赖于球面几何来保持闭合性质。在结构边界处，节点会受到冲突的几何约束，而统一的编码空间无法解决这一问题。这提出了一个关键挑战：\textbf{能否设计对齐框架以尊重图结构的内在几何多样性？} 我们引入了\textbf{GraphShaper}，这是一种几何感知框架，通过多几何专业化增强图编码。我们的方法采用针对不同几何空间定制的专家网络，动态计算融合权重，根据局部结构特征适配性地整合几何属性。这种适应性融合在对齐到文本嵌入之前保留了结构完整性。广泛的实验表明，在零样本设置下，GraphShaper在引用网络中实现了9.47%的准确率提升，在社交网络中实现了7.63%的提升。

Summary / 总结

GraphShaper is a geometry-aware framework designed to improve transfer learning in text-attributed graphs by addressing the limitations of current methods that assume a single Euclidean space for all graph structures. It uses expert networks for different geometric spaces and dynamically computes fusion weights to adaptively integrate geometric properties based on local structural characteristics. Experiments report accuracy improvements of 9.47% on citation networks and 7.63% on social networks in zero-shot settings. However, the authors have withdrawn the submission due to a fundamental error in the methodology that affects the validity of the main results.

GraphShaper 是一种几何感知框架，旨在通过解决当前方法假设所有图结构都能在一个单一欧几里得空间中编码的问题来提高文本标注图的迁移学习效果。它使用针对不同几何空间的专家网络，并动态计算融合权重以基于局部结构特征整合几何属性。实验表明，GraphShaper 在零样本设置下使引文网络的准确率提高了 9.47%，社会网络提高了 7.63%，但方法存在根本性错误，影响了结果的有效性。
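
The "dynamically computed fusion weights" described above resemble a softmax-gated mixture over expert outputs. The sketch below shows only that generic gating step, under loud assumptions: `fuse`, the three expert vectors, and the gate logits are all hypothetical, and the actual hyperbolic and spherical encoders are not modeled. Given that the submission was withdrawn, this should not be read as a validated method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def fuse(expert_outputs, gate_logits):
    """Combine per-expert node embeddings with softmax weights.

    expert_outputs: one embedding per expert (e.g. Euclidean, hyperbolic,
        spherical), all assumed already mapped to a common vector space.
    gate_logits: one score per expert, assumed to come from local structure.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
```

With equal gate logits the result is the plain average of the expert embeddings; a strongly peaked gate effectively selects a single geometry for that node.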

The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference

Authors: Rajyasri Roy, Dibyajyoti Nayak, Somdatta Goswami

First: 2025-12-22T18:17:28+00:00 · Latest: 2025-12-22T18:17:28+00:00

Comments: 18 pages, 7 figures

Abstract

Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on four canonical PDEs: 1D and 2D Burgers', 2D Allen-Cahn, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.

中文标题/摘要

标题：兼收并蓄：结合神经算子和求解器以实现稳定的长期预测

时间依赖偏微分方程（PDE）的数值模拟是科学和工程应用的核心，但高保真求解器在长期或时间关键设置中往往代价高昂。神经算子（NO）代理可以在参数和函数输入上提供快速推理；然而，大多数自回归NO框架仍然容易累积误差，且平均误差度量对单个推理轨迹的保证有限。实践中，误差累积在训练窗口之外可能会变得不可接受，而现有方法缺乏在线监控或纠正机制。为解决这一缺口，我们提出了一种在线、实例感知的混合推理框架——ANCHOR（高保真算子滚动的自适应数值校正），用于非线性、时间依赖PDE的稳定长期预测。ANCHOR将预训练的NO作为主要推理引擎，并通过基于物理信息的残差误差估计器与经典数值求解器进行自适应耦合。受数值分析中自适应时间步长的启发，ANCHOR监控归一化PDE残差的指数移动平均（EMA）以检测累积误差并触发纠正求解器干预，而无需访问真实解。我们证明基于EMA的估计器与真实的相对L2误差高度相关，使推理期间能够实现无数据、实例感知的误差控制。在四个经典PDE上的评估（1D和2D Burger方程、2D Allen-Cahn方程和3D热传导）表明，ANCHOR能够可靠地限制长期误差增长，稳定外推滚动，并显著提高鲁棒性，同时保持比高保真数值求解器更高效的特性。

Summary / 总结

The research aims to address the computational limitations of high-fidelity solvers for long-horizon predictions of time-dependent PDEs, proposing ANCHOR, an online hybrid framework that combines a pretrained neural operator with a classical numerical solver. ANCHOR uses an exponential moving average of the normalized PDE residual to detect accumulating error and trigger solver corrections, achieving stable long-horizon predictions with improved robustness over standalone neural operators while remaining substantially more efficient than high-fidelity solvers. Evaluations on four canonical PDEs (1D and 2D Burgers', 2D Allen-Cahn, and 3D heat conduction) show that ANCHOR bounds error growth and stabilizes extrapolative rollouts.

研究提出了一种名为ANCHOR的自适应混合框架，结合预训练的神经算子和经典数值求解器，以解决时间依赖偏微分方程（PDE）的长期稳定推理问题。ANCHOR使用归一化PDE残差的指数移动平均值来监控和纠正错误，提供稳健且高效的推理。实验表明，ANCHOR有效地控制了误差增长，稳定了长期预测，鲁棒性优于单独的神经算子，同时效率显著高于高保真数值求解器。
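
The residual-monitoring loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `hybrid_rollout`, the toy dynamics, the smoothing factor `alpha`, the `threshold`, and the choice to redo a flagged step with the solver are all assumptions made for the example.

```python
def hybrid_rollout(state, n_steps, surrogate_step, solver_step, residual,
                   alpha=0.3, threshold=0.05):
    """Roll out a surrogate, calling the solver when the residual EMA is too high.

    All callables are hypothetical placeholders supplied by the caller.
    """
    ema = 0.0
    corrected_steps = []
    for step in range(n_steps):
        candidate = surrogate_step(state)
        # Exponential moving average of the normalized residual.
        ema = alpha * residual(state, candidate) + (1.0 - alpha) * ema
        if ema > threshold:
            # Discard the surrogate prediction and redo this step with the solver.
            candidate = solver_step(state)
            ema = residual(state, candidate)
            corrected_steps.append(step)
        state = candidate
    return state, corrected_steps


# Toy problem (an assumption for illustration): exact dynamics u -> 0.9 * u,
# and a "surrogate" that carries a small constant bias.
def toy_solver(u):
    return 0.9 * u


def toy_surrogate(u):
    return 0.9 * u + 0.02


def toy_residual(prev, new):
    # Normalized mismatch with the governing update; no ground-truth trajectory needed.
    return abs(new - 0.9 * prev) / max(abs(prev), 1e-8)
```

Calling `hybrid_rollout(1.0, 30, toy_surrogate, toy_solver, toy_residual)` returns the final state and the list of steps at which the solver was invoked. Setting `threshold=float("inf")` disables corrections and recovers the plain surrogate rollout for comparison.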

InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos

Authors: Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev

First: 2025-08-31T09:38:59+00:00 · Latest: 2025-12-22T18:14:12+00:00

Comments: Accepted to 3DV 2026. Project page: https://mael-zys.github.io/InterPose/

Abstract

Human motion generation has shown great advances thanks to recent diffusion models trained on large-scale motion capture data. Most existing works, however, target animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle towards generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

中文标题/摘要

标题：InterPose：从大规模网络视频中学习生成人体与物体交互

人体运动生成得益于最近在大规模运动捕捉数据上训练的扩散模型取得了巨大进展。然而，现有的大多数工作目前主要针对空旷场景中孤立人物的动画生成。同时，在复杂3D场景中合成逼真的人体与物体交互仍然是计算机图形学和机器人学中的一个关键挑战。生成多样化人体与物体交互的一个障碍是缺乏大规模包含多种物体操作的数据集。实际上，现有的运动捕捉数据通常仅限于单个人物和有限种类物体的操作。为了解决这一问题，我们提出了一种自动运动提取流水线，并使用它来收集富含交互的人体运动。我们的新数据集InterPose包含73,800个3D人体运动序列及其自动获取的文本描述，这些描述来自45,800个包含人体与物体交互的视频。我们进行了广泛的实验，并证明InterPose能够显著提高人体运动生成的最新方法的效果。此外，我们使用InterPose开发了一个基于LLM的代理，使其能够零样本动画生成与多种物体和场景互动的人。

Summary / 总结

The research aims to generate realistic human-object interactions in complex 3D scenes by addressing the lack of large-scale datasets with diverse object manipulations. The authors propose an automatic motion extraction pipeline to collect 73.8K sequences of 3D human motions from 45.8K videos, creating a new dataset called InterPose. Experiments show that InterPose significantly improves state-of-the-art methods for human motion generation and enables zero-shot animation of people interacting with various objects and scenes.

研究旨在通过解决缺乏包含多样化物体操作的大规模数据集问题，生成复杂3D场景中的真实人类物体交互。方法包括使用自动动作提取流水线从45.8K包含人类物体交互的视频中收集73.8K个3D人类动作序列及其对应的文本描述。生成的InterPose数据集显著提升了现有的人类动作生成方法，并能够实现零样本动画，展示人们与各种物体和场景的交互。

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Authors: Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Sicheng Zhao, Yimian Dai, Xiangyu Yue

Venue: ICCV 2025

First: 2024-12-15T11:08:49+00:00 · Latest: 2025-12-22T17:52:35+00:00

Comments: Accepted by ICCV 2025

Abstract

Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting embedded network performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework, which drives existing SIRST detection networks to progressively and actively recognize and learn harder samples. Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model acquire basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, we introduce a decay factor, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework achieve state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code is available at https://github.com/YuChuang1205/PAL

中文标题/摘要

标题：从易到难：基于单点监督的红外小目标检测渐进式主动学习框架

近年来，单帧红外小目标（SIRST）检测在单点监督下引起了广泛关注。然而，最新的基于单点监督的标签演化（LESPS）框架存在不稳定、过度标签演化以及难以发挥嵌入网络性能的问题。受生物体逐渐适应环境并不断积累知识的启发，我们构建了一个创新的渐进式主动学习（PAL）框架，该框架能够逐步且主动地识别和学习更难的样本。具体而言，为了避免早期低性能模型导致错误选择难样本，我们提出了一个模型预启动概念，专注于自动选择一部分易样本，帮助模型获得基本的任务特定学习能力。同时，我们提出了一个改进的双重更新策略，可以促进对更难样本的合理学习和伪标签的持续精炼。此外，为了缓解过度标签演化的风险，合理引入了一个衰减因子，有助于在目标注释的扩展和收缩之间实现动态平衡。广泛的实验表明，配备我们PAL框架的现有SIRST检测网络在多个公开数据集上取得了最先进的（SOTA）结果。此外，我们的PAL框架可以建立从全监督任务到单点监督任务的高效且稳定的桥梁。我们的代码可在https://github.com/YuChuang1205/PAL获取

Summary / 总结

The research addresses the instability and excessive label evolution issues in the latest label evolution with single point supervision (LESPS) framework for single-frame infrared small target detection. It introduces a Progressive Active Learning (PAL) framework that helps the model progressively recognize and learn harder samples. The PAL framework includes a model pre-start concept to ensure basic task-specific learning and a refined dual-update strategy to promote the learning of harder samples and pseudo-label refinement. Additionally, a decay factor is introduced to balance the expansion and contraction of target annotations. Experiments show that the PAL framework improves the performance of existing SIRST detection networks and builds an efficient bridge between full and single point supervision tasks.

本文解决了单点监督下红外小目标检测中的不稳定性和过度标签演化问题，提出了一种渐进主动学习（PAL）框架，帮助模型逐步识别和学习更难的样本。该框架包括一个模型预启动概念，确保初始的基本学习能力，以及一个精炼的双重更新策略，促进更难样本的学习和伪标签的连续优化。实验表明，PAL框架提高了现有SIRST检测网络的性能，并在多个公开数据集上达到了最先进的结果。

MapTrace: Scalable Data Generation for Route Tracing on Maps

Authors: Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal

First: 2025-12-22T17:45:39+00:00 · Latest: 2025-12-22T17:45:39+00:00

Abstract

While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.

中文标题/摘要

标题：MapTrace：地图路线跟踪的大规模数据生成

尽管多模态大型语言模型在许多视觉和文本推理任务上已经达到了人类水平的表现，但在地图上的细粒度空间理解，如路线跟踪方面的能力仍然有限。与人类能够快速学习解析和导航地图不同，当前的模型往往无法遵守基本的路径约束，部分原因是收集大规模、像素级准确的路径注解的成本和难度极高。为了解决这个问题，我们引入了一种可扩展的合成数据生成流水线，该流水线利用合成地图图像和像素级解析来自动生成这一具有挑战性任务的精确注解。使用此流水线，我们构建了一个包含4000张地图和23000条路径样本的微调数据集，使模型能够获得更接近人类的空间能力。使用此数据集，我们对开源和专有MLLM进行了微调。MapBench上的结果显示，微调显著提高了鲁棒性，成功率提高了多达6.4个百分点，同时减少了路径跟踪误差（NDTW）。这些增益表明，预训练模型中缺乏的细粒度空间推理能力可以通过合成监督明确地进行教学。

Summary / 总结

The research aims to enhance the spatial reasoning capabilities of multimodal large language models, particularly in route tracing on maps. It introduces a scalable synthetic data generation pipeline that uses synthetic map images and pixel-level parsing to automatically generate precise annotations. This pipeline produced a fine-tuning dataset of 23,000 path samples across 4,000 maps. On MapBench, fine-tuning on this dataset improved robustness, raising success rates by up to 6.4 points while also reducing path-tracing error (NDTW). The findings suggest that fine-grained spatial reasoning can be effectively taught to models through synthetic supervision.

研究旨在提升多模态大型语言模型（MLLM）在地图路线追踪中的空间推理能力。为了解决收集像素级路径注解的难题，作者开发了一种可扩展的合成数据生成流水线，生成了包含4,000张地图的23,000个路径样本数据集。在MapBench基准测试中，基于该数据集的微调提高了模型的鲁棒性，成功率最多提升6.4个百分点，同时降低了路径追踪误差（NDTW）。
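
The abstract reports path-tracing error as NDTW without defining it. The sketch below shows one plausible reading, assuming NDTW means a dynamic time warping (DTW) distance between predicted and reference paths divided by the reference length. The function names and that normalization are assumptions; the paper's exact definition may differ.

```python
import math


def dtw(pred, ref):
    """Dynamic time warping distance between two sequences of (x, y) points."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning pred[:i] with ref[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(pred[i - 1], ref[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def normalized_dtw(pred, ref):
    """DTW divided by the reference path length (assumed normalization)."""
    return dtw(pred, ref) / len(ref)
```

Under this reading a perfect trace scores 0 and larger values mean the predicted route strays further from the reference, which matches the abstract's description of the metric as an error to be reduced.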

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Authors: Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang

First: 2025-05-29T19:00:42+00:00 · Latest: 2025-12-22T17:36:54+00:00

Abstract

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment, not limited to language but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT, the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io .

中文标题/摘要

标题：InterMT：多轮交错偏好对齐与人类反馈

随着多模态大型模型（MLLMs）在各种挑战性任务中不断进步，一个关键问题出现了：还缺少哪些基本能力？人类学习的一个关键方面是与环境进行持续的互动——不仅限于语言，还包括多模态的理解和生成。为了更接近人类级别的智能，模型必须同样支持多轮、多模态的互动。特别是，它们应该理解交错的多模态上下文，并在持续的交流中做出连贯的回应。在这项工作中，我们通过InterMT进行初步探索——这是第一个用于多轮多模态互动的偏好数据集，基于真实的人类反馈。在这次探索中，我们特别强调了人类监督的重要性，引入了专家注释来指导过程，因为当前的MLLMs缺乏这种复杂的互动能力。InterMT在九个子维度上捕捉了人类的偏好，包括15600个提示、52600个多轮对话实例和32400个人类标注的偏好对。为了弥补多模态理解和生成能力的不足，我们引入了一种代理工作流程，利用工具增强的MLLMs构建多轮问答实例。为了进一步实现这一目标，我们引入了InterMT-Bench来评估MLLMs在多轮、多模态任务中协助裁判的能力。我们通过裁判调节等应用展示了InterMT的用途，并进一步揭示了裁判模型的多轮扩展规律。我们希望开源数据能够帮助促进进一步研究，使当前的MLLMs向下一步迈进。我们的项目网站可以在https://pku-intermt.github.io 查看。

Summary / 总结

The research aims to address the missing capability of multi-turn, multimodal interaction in multimodal large language models (MLLMs) by introducing InterMT, a preference dataset grounded in human feedback. The method involves creating a dataset with 15,600 prompts, 52,600 dialogue instances, and 32,400 preference pairs, and developing an agentic workflow to construct multi-turn QA instances. Key findings include the demonstration of the dataset's utility in applications like judge moderation and the revelation of the multi-turn scaling law for judge models.

该研究旨在解决多模态大型语言模型（MLLMs）需要支持多轮、多模态交互的问题，这对于实现人类级别的智能至关重要。作者引入了InterMT，这是一个基于真实人类反馈的多轮多模态交互偏好数据集，包含15.6k个提示、52.6k个对话实例和32.4k个偏好对。他们还提出了InterMT-Bench来评估MLLMs在多轮、多模态任务中协助裁判的能力，展示了该数据集在裁判调节等应用中的用途，并揭示了裁判模型在多轮交互中的扩展规律。

Shape it Up! Restoring LLM Safety during Finetuning

Authors: ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau

Venue: NeurIPS

First: 2025-05-22T18:05:16+00:00 · Latest: 2025-12-22T17:30:15+00:00

Comments: NeurIPS'25

Abs · PDF · Code1 · Code2 · Code3

Abstract

Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families, all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks. Our code is publicly available at https://github.com/poloclub/star-dss.
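
The contrast between static and dynamic shaping can be illustrated with a small sketch. This is not the authors' implementation: it assumes each response token carries a safety weight in [0, 1] (a stand-in for a STAR-like signal) and that the per-token negative log-likelihood is scaled by that weight. The function name `shaped_nll` and the weighting rule are hypothetical.

```python
import math

def shaped_nll(token_probs, safety_scores):
    """Safety-weighted mean negative log-likelihood over one response.

    token_probs: model probability of each target token (0 < p <= 1).
    safety_scores: hypothetical per-token weights in [0, 1]
                   (1 = safe, 0 = unsafe); an assumption of this sketch.
    """
    if len(token_probs) != len(safety_scores):
        raise ValueError("need one safety score per token")
    total_weight = sum(safety_scores)
    if total_weight == 0:
        return 0.0  # no safe tokens: this example contributes no loss
    weighted = sum(-math.log(p) * w for p, w in zip(token_probs, safety_scores))
    return weighted / total_weight

probs = [0.5, 0.5, 0.25, 0.25]
# Static shaping: every token of an accepted example is weighted equally.
static_loss = shaped_nll(probs, [1, 1, 1, 1])
# Dynamic shaping: the last two tokens are flagged unsafe and suppressed.
dynamic_loss = shaped_nll(probs, [1, 1, 0, 0])
```

With uniform weights the loss averages over the whole response; with token-level weights only the segment marked safe contributes to the update.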

中文标题/摘要

标题：塑形！在微调期间恢复LLM安全性

大型语言模型（LLMs）的微调允许用户特定的定制，但引入了关键的安全风险：即使是一些有害示例也可能破坏安全对齐。一种常见的缓解策略是更强烈地更新被认定为安全的示例，同时减少或排除标记为不安全的示例。然而，由于安全上下文在一个示例内部可能会发生变化，等量地更新有害和无害部分的响应是次优的——我们称之为静态安全性塑形。相反，我们提出了一种动态安全性塑形（DSS）框架，该框架利用细粒度的安全信号来强化从响应的安全部分中学习，同时抑制不安全的内容。为了在微调期间实现这种细粒度的控制，我们引入了一个关键见解：传统的用于过滤的护栏模型可以重新用于评估部分响应，跟踪响应中安全风险如何逐段演变。这导致了响应安全性轨迹评估（STAR），一种标记级信号，使塑形能够在训练序列中动态地操作。在此基础上，我们提出了STAR-DSS，它根据STAR分数进行指导，能够稳健地缓解微调风险，并在各种威胁、数据集和模型家族中实现显著的安全改进，而不会牺牲预期任务的能力。我们鼓励未来的安全研究建立在动态塑形原则之上，以更有效地应对不断演变的微调风险。我们的代码可在https://github.com/poloclub/star-dss/公开获取。

Summary / 总结

This paper addresses the safety risks associated with finetuning large language models (LLMs) by proposing dynamic safety shaping (DSS), which uses fine-grained safety signals to dynamically adjust the model's learning process. The authors introduce a token-level signal called Safety Trajectory Assessment of Response (STAR) to track safety risk evolution during the response generation. The resulting STAR-DSS framework mitigates finetuning risks and improves safety across various threats and datasets without compromising the model's capability for intended tasks.

本文提出了动态安全塑造（DSS）方法，通过使用细粒度的安全信号来选择性地更新大型语言模型（LLMs），基于响应的不同部分的安全性。作者引入了STAR，这是一种跟踪响应中安全风险的标记级信号，使安全塑造能够动态进行。实验表明，STAR-DSS能够稳健地缓解微调风险，并在各种威胁和数据集上提高安全性，同时不损害模型在预期任务上的能力。

Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Authors: Hugo Garrido-Lestache Belinchon, Jeremy Kedziora

First: 2025-07-30T15:48:38+00:00 · Latest: 2025-12-22T17:22:59+00:00

Comments: 11 pages

Abs · PDF · Code1 · Code2

Abstract

This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).
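
As a rough illustration of how one agent can query its teammates through attention, the sketch below implements single-head scaled dot-product attention in plain Python. It is the generic attention operation, not TAAC's actor or critic architecture; the function name `attend` and the toy vectors are assumptions made for this example.

```python
import math

def attend(queries, keys, values):
    """Single-head scaled dot-product attention across agents.

    queries: one query vector per querying agent.
    keys, values: one key and one value vector per teammate.
    Returns one attended value vector per query.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this agent's query to every teammate's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over teammates.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted mix of teammates' value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Two teammates with identical keys receive equal attention,
# so the result is the mean of their values.
mixed = attend([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0], [2.0]])
```

A multi-headed variant would run several such heads with separate learned projections and concatenate their outputs.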

中文标题/摘要

标题：基于注意力机制的演员-评论家策略增强多智能体协作

本文介绍了团队注意力-演员-评论家（TAAC），这是一种用于增强合作环境中多智能体协作的强化学习算法。TAAC采用集中训练/集中执行方案，并在演员和评论家中引入多头注意力机制。这种设计促进了智能体之间的动态通信，使智能体能够明确查询队友，从而有效管理联合动作空间的指数增长，同时确保高度的协作。我们还引入了一种惩罚性损失函数，以促进智能体之间多样且互补的角色。我们在模拟足球环境中将TAAC与代表其他多智能体范式的基准算法（包括近似策略优化和多智能体演员-注意力-评论家）进行评估。我们发现，TAAC在多种指标（胜率、进球差、Elo评分、智能体间连接性、平衡的空间分布以及频繁的战术互动，如球权交换）上表现出更优的性能和协作行为。

Summary / 总结

The paper presents Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm aimed at improving multi-agent collaboration in cooperative environments. TAAC uses a Centralized Training/Centralized Execution scheme with multi-headed attention mechanisms in both the actor and critic, facilitating dynamic communication among agents. The algorithm also includes a penalized loss function to encourage diverse and complementary roles. TAAC was tested in a simulated soccer environment and outperformed benchmark algorithms like Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic in various metrics, including win rates, goal differentials, and inter-agent connectivity.

论文提出了Team-Attention-Actor-Critic (TAAC) 算法，旨在提高多智能体在合作环境中的协作能力。TAAC 使用了集中训练/集中执行方案，并在演员和评论家中都采用了多头注意力机制，以促进智能体之间的动态通信。该算法还包含一个惩罚损失函数，以鼓励多样且互补的角色。TAAC 在模拟足球环境中进行了测试，并在多种指标（如胜率、进球差、智能体间连接性等）上优于基准算法，如近端策略优化和多智能体演员-注意力评论家。

Patlak Parametric Image Estimation from Dynamic PET Using Diffusion Model Prior

Authors: Ziqian Huang, Boxiao Yu, Siqi Li, Savas Ozdemir, Sangjin Bae, Jae Sung Lee, Guobao Wang, Kuang Gong

First: 2025-12-22T17:11:33+00:00 · Latest: 2025-12-22T17:11:33+00:00

Comments: 10 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

Dynamic PET enables the quantitative estimation of physiology-related parameters and is widely utilized in research and increasingly adopted in clinical settings. Parametric imaging in dynamic PET requires kinetic modeling to estimate voxel-wise physiological parameters based on specific kinetic models. However, parametric images estimated through kinetic model fitting often suffer from low image quality due to the inherently ill-posed nature of the fitting process and the limited counts resulting from non-continuous data acquisition across multiple bed positions in whole-body PET. In this work, we proposed a diffusion model-based kinetic modeling framework for parametric image estimation, using the Patlak model as an example. The score function of the diffusion model was pre-trained on static total-body PET images and served as a prior for both Patlak slope and intercept images by leveraging their patch-wise similarity. During inference, the kinetic model was incorporated as a data-consistency constraint to guide the parametric image estimation. The proposed framework was evaluated on total-body dynamic PET datasets with different dose levels, demonstrating the feasibility and promising performance of the proposed framework in improving parametric image quality.
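
For orientation, the conventional voxel-wise Patlak fit that such a framework builds on can be sketched as below. The standard Patlak graphical form, C_T(t)/C_p(t) = K_i * (integral of C_p up to t)/C_p(t) + V, is textbook material rather than something stated in the abstract, and the diffusion prior is not shown. The function name `patlak_fit` and the toy data are hypothetical, and for simplicity the sketch fits all frames, whereas in practice only frames after a steady-state time t* are used.

```python
def patlak_fit(tissue, plasma, times):
    """Ordinary least-squares Patlak fit for one voxel.

    tissue, plasma: activity samples C_T(t) and C_p(t) at the given times.
    Returns (slope K_i, intercept V).
    """
    # Cumulative trapezoidal integral of the plasma input function.
    integral = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        integral.append(integral[-1] + 0.5 * dt * (plasma[i] + plasma[i - 1]))
    # Patlak coordinates: x = integral / C_p, y = C_T / C_p.
    x = [integral[i] / plasma[i] for i in range(len(times))]
    y = [tissue[i] / plasma[i] for i in range(len(times))]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy noise-free voxel generated with K_i = 0.05 and V = 0.3.
times = [0.0, 1.0, 2.0, 3.0]
plasma = [2.0, 2.0, 2.0, 2.0]
tissue = [0.6, 0.7, 0.8, 0.9]
ki, v = patlak_fit(tissue, plasma, times)
```

On noisy, low-count data this per-voxel fit is what yields the poor slope and intercept images that a learned prior is meant to regularize.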

中文标题/摘要

标题：使用扩散模型先验从动态PET估计帕特拉克参数图像

动态PET能够定量估计生理相关参数，并广泛应用于研究并在临床中逐渐被采用。动态PET中的参数成像需要通过特定的动力学模型进行动力学建模，以基于特定的动力学模型估计体素级别的生理参数。然而，通过动力学模型拟合估计的参数图像往往由于拟合过程的固有病态性质以及由于全身PET在多个床位位置上非连续数据采集导致的有限计数而具有较低的图像质量。在本研究中，我们提出了一种基于扩散模型的动力学建模框架，以帕特拉克模型为例。扩散模型的得分函数在静态全身PET图像上进行预训练，并通过其块间相似性作为帕特拉克斜率和截距图像的先验。在推理过程中，动力学模型被作为数据一致性约束以指导参数图像的估计。所提出的方法在不同剂量水平的全身动态PET数据集上进行了评估，证明了该方法在提高参数图像质量方面的可行性和有希望的性能。

Summary / 总结

This study addresses the low image quality issue in parametric imaging from dynamic PET by proposing a diffusion model-based framework. The framework uses a pre-trained diffusion model score function as a prior for Patlak slope and intercept images, and incorporates the kinetic model as a data-consistency constraint during inference. Evaluation on dynamic PET datasets showed improved parametric image quality.

该研究旨在通过提出基于扩散模型的动态PET动力学建模框架来提高参数图像的质量。该框架使用从静态PET图像预训练的得分函数作为Patlak斜率和截距图像的先验，并在推断过程中将动力学模型作为数据一致性约束。在不同剂量水平的动态PET数据集上的评估表明，所提出的方法能够有效提高参数图像的质量。

LeLaR: The First In-Orbit Demonstration of an AI-Based Satellite Attitude Controller

Authors: Kirill Djebko, Tom Baumann, Erik Dilger, Frank Puppe, Sergio Montenegro

First: 2025-12-22T17:00:25+00:00 · Latest: 2025-12-22T17:00:25+00:00

Comments: 55 pages, 27 figures, 29 tables. The maneuver telemetry datasets generated and analyzed during this work are available in the GitHub repository https://github.com/kdjebko/lelar-in-orbit-data

Abs · PDF · Code1 · Code2 · Code3

Abstract

Attitude control is essential for many satellite missions. Classical controllers, however, are time-consuming to design and sensitive to model uncertainties and variations in operational boundary conditions. Deep Reinforcement Learning (DRL) offers a promising alternative by learning adaptive control strategies through autonomous interaction with a simulation environment. Overcoming the Sim2Real gap, which involves deploying an agent trained in simulation onto the real physical satellite, remains a significant challenge. In this work, we present the first successful in-orbit demonstration of an AI-based attitude controller for inertial pointing maneuvers. The controller was trained entirely in simulation and deployed to the InnoCube 3U nanosatellite, which was developed by the Julius-Maximilians-Universität Würzburg in cooperation with the Technische Universität Berlin, and launched in January 2025. We present the AI agent design, the methodology of the training procedure, the discrepancies between the simulation and the observed behavior of the real satellite, and a comparison of the AI-based attitude controller with the classical PD controller of InnoCube. Steady-state metrics confirm the robust performance of the AI-based controller during repeated in-orbit maneuvers.
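
To make the classical baseline concrete, here is a minimal sketch of a textbook quaternion-feedback PD law for inertial pointing. It is a generic formulation, not InnoCube's actual controller: the gains, the scalar-first quaternion convention, and the function names are all assumptions of this example.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two scalar-first quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def pd_torque(q_current, q_target, omega, kp, kd):
    """Commanded body torque: proportional on attitude error, derivative on rate."""
    qt_conj = (q_target[0], -q_target[1], -q_target[2], -q_target[3])
    qe = quat_mul(qt_conj, q_current)  # error quaternion
    # Pick the shorter rotation, since q and -q describe the same attitude.
    sign = 1.0 if qe[0] >= 0 else -1.0
    return tuple(-kp * sign * qe[i + 1] - kd * omega[i] for i in range(3))

identity = (1.0, 0.0, 0.0, 0.0)
# Satellite rotated 90 degrees about its z axis, at rest, target = identity.
rotated = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
torque = pd_torque(rotated, identity, (0.0, 0.0, 0.0), kp=2.0, kd=1.0)
```

A learned controller replaces this fixed mapping from attitude error and angular rate to torque with a policy trained in simulation.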

中文标题/摘要

标题：LeLaR：基于AI的卫星姿态控制器首次在轨演示

姿态控制对于许多卫星任务至关重要。然而，传统的控制器设计耗时且对模型不确定性及操作边界条件的变化敏感。深度强化学习（DRL）通过自主与仿真环境交互学习适应性控制策略，提供了一种有前景的替代方案。克服从仿真到现实的差距，即将在仿真中训练的代理部署到实际物理卫星上，仍然是一个重大挑战。在本文中，我们介绍了基于AI的姿态控制器在惯性指向机动中的首次成功在轨演示。该控制器完全在仿真中训练，并部署到由Julius-Maximilians-Universität Würzburg与Technische Universität Berlin合作开发的InnoCube 3U纳米卫星上，该卫星于2025年1月发射。我们介绍了AI代理的设计、训练过程的方法、仿真与实际卫星行为之间的差异，以及基于AI的姿态控制器与InnoCube的传统PD控制器的比较。稳态指标证实了基于AI的控制器在重复在轨机动中的稳健性能。

Summary / 总结

This paper presents the first in-orbit demonstration of an AI-based satellite attitude controller, addressing the challenge of Sim2Real deployment. The controller, trained entirely in simulation, was deployed to the InnoCube 3U nanosatellite. The paper compares it with InnoCube's classical PD controller, and steady-state metrics confirm its robust performance during repeated in-orbit inertial pointing maneuvers.

本文介绍了首个在轨演示的基于AI的卫星姿态控制器。该控制器通过在仿真中使用深度强化学习进行训练，并部署到了由Julius-Maximilians-Universität Würzburg与Technische Universität Berlin合作开发的InnoCube 3U纳米卫星上。论文将该AI控制器与InnoCube的传统PD控制器进行了比较，稳态指标证实了其在重复在轨惯性指向机动中的稳健性能。

Results of the 2024 CommonRoad Motion Planning Competition for Autonomous Vehicles

Authors: Yanliang Huang, Xia Yan, Peiran Yin, Zhenduo Zhang, Zeyan Shao, Youran Wang, Haoliang Huang, Matthias Althoff

First: 2025-12-22T16:46:40+00:00 · Latest: 2025-12-22T16:46:40+00:00

Abs · PDF · Code1 · Code2

Abstract

Over the past decade, a wide range of motion planning approaches for autonomous vehicles has been developed to handle increasingly complex traffic scenarios. However, these approaches are rarely compared on standardized benchmarks, limiting the assessment of relative strengths and weaknesses. To address this gap, we present the setup and results of the 4th CommonRoad Motion Planning Competition held in 2024, conducted using the CommonRoad benchmark suite. This annual competition provides an open-source and reproducible framework for benchmarking motion planning algorithms. The benchmark scenarios span highway and urban environments with diverse traffic participants, including passenger cars, buses, and bicycles. Planner performance is evaluated along four dimensions: efficiency, safety, comfort, and compliance with selected traffic rules. This report introduces the competition format and provides a comparison of representative high-performing planners from the 2023 and 2024 editions.

中文标题/摘要

标题：2024年通用道路自主车辆运动规划竞赛结果

过去十年间，开发了多种用于自主车辆的运动规划方法以应对日益复杂的交通场景。然而，这些方法很少在标准化基准上进行比较，限制了对其相对优劣的评估。为解决这一问题，我们介绍了2024年举办的第4届通用道路运动规划竞赛的设置和结果，该竞赛使用了通用道路基准套件进行。该年度竞赛提供了一个开源且可重复的框架来评估运动规划算法。基准场景涵盖了高速公路和城市环境，包括多种交通参与者，如乘用车、公交车和自行车。规划器性能从效率、安全性、舒适性和遵守选定交通规则四个方面进行评估。本报告介绍了竞赛格式，并提供了2023年和2024年版中代表性高性能规划器的比较。

Summary / 总结

The report presents the setup and results of the 4th CommonRoad Motion Planning Competition, held in 2024, which benchmarks motion planning approaches for autonomous vehicles on the CommonRoad benchmark suite. Planners are evaluated on efficiency, safety, comfort, and compliance with selected traffic rules across highway and urban scenarios. The report introduces the competition format and compares representative high-performing planners from the 2023 and 2024 editions.

该报告介绍了2024年举办的第4届CommonRoad运动规划竞赛的设置和结果，该竞赛基于CommonRoad基准套件对自主车辆的运动规划方法进行评测。竞赛在高速公路和城市场景中，从效率、安全性、舒适性和遵守选定交通规则四个方面评估规划器。报告介绍了竞赛形式，并比较了2023年和2024年两届中具有代表性的高性能规划器。

REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

Authors: Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc, Mederic Fourmy, Evangelos Kazakos, Cees G. M. Snoek, Josef Sivic, Vladimir Petrik

First: 2025-12-22T16:44:23+00:00 · Latest: 2025-12-22T16:44:23+00:00

Comments: 9 pages, 10 figures

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the π_{0}, π_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm

中文标题/摘要

标题：REALM：一种用于机器人操作泛化的实况到模拟验证基准

视觉-语言-动作（VLA）模型使机器人能够理解并执行由自然语言指令描述的任务。然而，一个关键挑战在于它们能否在超出训练环境和条件的情况下进行泛化，目前在现实世界中评估这种能力既困难又昂贵。为了解决这一差距，我们提出了REALM，一种新的模拟环境和基准，旨在评估VLA模型的泛化能力，特别强调通过高保真视觉和对齐的机器人控制建立模拟和现实世界性能之间的强相关性。我们的环境提供了一套15种扰动因素、7种操作技能以及超过3500种物体。最后，我们建立了两个任务集作为我们的基准，并评估了π_{0}、π_{0}-FAST和GR00T N1.5 VLA模型，表明泛化和鲁棒性仍然是一个开放的挑战。更广泛地说，我们还表明，模拟为我们提供了一个现实世界的宝贵代理，并使我们能够系统地探索和量化VLA的弱点和失败模式。项目页面：https://martin-sedlacek.com/realm

Summary / 总结

The research aims to evaluate the generalization capabilities of Vision-Language-Action (VLA) models in robotic manipulation, which is challenging due to the difficulty and cost of real-world testing. REALM, a new simulation environment, is introduced to bridge this gap by providing a benchmark with high-fidelity visuals and aligned robot control, featuring 15 perturbation factors, 7 manipulation skills, and over 3,500 objects. The evaluation of three VLA models (π_{0}, π_{0}-FAST, and GR00T N1.5) on two task sets revealed that generalization and robustness remain significant challenges. The study also highlights that simulation can serve as a valuable proxy for real-world performance, allowing for systematic analysis of VLA model weaknesses and failure modes.

研究旨在评估Vision-Language-Action (VLA)模型在机器人操作中的泛化能力，由于现实世界测试的难度和成本，这一领域存在挑战。REALM作为一种新的仿真环境被引入，通过模拟多种扰动和操作技能来弥补这一差距。研究评估了三种VLA模型，并发现泛化仍然是一个开放的问题，强调了进一步研究的必要性。仿真被证明是探测和量化VLA模型弱点的一个有价值的工具。

BabyFlow: 3D modeling of realistic and expressive infant faces

Authors: Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella

First: 2025-12-22T16:42:58+00:00 · Latest: 2025-12-22T16:42:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.

中文标题/摘要

标题：BabyFlow：真实且富有表情的婴儿面部3D建模

通过分析婴儿颅面形态，早期检测发育障碍可以得到帮助，但由于数据有限且婴儿面部表情频繁自发变化，婴儿面部建模具有挑战性。我们引入了BabyFlow，这是一种生成式AI模型，能够分离面部身份和表情，从而独立控制两者。通过使用归一化流（normalizing flows），BabyFlow学习到灵活的概率表示，捕捉富有表情的婴儿面部的复杂非线性变化，而不受限制性的线性假设。为了解决缺乏且不受控制的表情数据，我们进行了跨年龄段表情转移，将成人的3D扫描表情适应到婴儿数据集中，以增加具有真实且系统性表情变体的婴儿数据集。因此，BabyFlow提高了3D重建的准确性，特别是在嘴、眼睛和鼻子等高度表情区域，并支持婴儿表情的合成和修改，同时保持身份。此外，通过与扩散模型集成，BabyFlow生成了具有一致3D几何的高保真度2D婴儿图像，为数据增强和早期面部分析提供了强大的工具。

Summary / 总结

BabyFlow is a generative AI model that disentangles facial identity and expression in infants, allowing for independent control over both. It uses normalizing flows to learn flexible, probabilistic representations of expressive infant faces, improving 3D reconstruction accuracy, especially in highly expressive regions. Cross-age expression transfer is used to enrich infant datasets with realistic and systematic expressive variants, supporting the synthesis and modification of infant expressions while preserving identity. Additionally, BabyFlow integrates with diffusion models to generate high-fidelity 2D infant images with consistent 3D geometry, enhancing data augmentation and early facial analysis capabilities.

BabyFlow 是一个生成式 AI 模型，能够分离婴儿面部的身份和表情，实现两者独立控制。它使用归一化流（normalizing flows）来学习表达性婴儿面部的灵活概率表示，提高 3D 重建精度，尤其是在高度表达性区域。通过跨年龄表情转移，丰富婴儿数据集中的真实和系统性表情变体，支持保持身份的同时合成和修改婴儿表情。此外，BabyFlow 与扩散模型集成，生成具有一致 3D 几何的高保真 2D 婴儿图像，增强数据增强和早期面部分析能力。

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal

Authors: Yongxin Wang, Zhicheng Yang, Meng Cao, Mingfei Han, Haokun Lin, Yingying Zhu, Xiaojun Chang, Xiaodan Liang

First: 2025-12-22T16:34:21+00:00 · Latest: 2025-12-22T16:34:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. When all rollouts are wrong, gradients stall; when one happens to be correct, the update usually ignores why the others are close-but-wrong, and credit can be misassigned to spurious chains. We present CARE (Contrastive Anchored REflection), a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE combines: (i) an anchored-contrastive objective that forms a compact subgroup around the best rollout and a set of semantically proximate hard negatives, performs within-subgroup z-score normalization with negative-only scaling, and includes an all-negative rescue to prevent zero-signal batches; and (ii) Reflection-Guided Resampling (RGR), a one-shot structured self-repair that rewrites a representative failure and re-scores it with the same verifier, converting near-misses into usable positives without any test-time reflection. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures. On Qwen2.5-VL-7B, CARE lifts macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks; with Qwen3-VL-8B it reaches competitive or state-of-the-art results on MathVista and MMMU-Pro under an identical evaluation protocol.
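
The stalling problem described at the start of the abstract can be reproduced with plain group-relative z-score normalization, as commonly used in GRPO-style training. The sketch below shows only that baseline behaviour, not CARE's anchored-contrastive objective; the function name and the `eps` constant are assumptions of this example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each rollout's reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All four rollouts fail the verifier: every advantage is zero,
# so the group contributes no gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the three failures share one identical negative
# advantage, regardless of how close each was to being right.
one_right = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Both cases show why binary verifier rewards leave information in the failures unused: they are either ignored entirely or treated as interchangeable.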

中文标题/摘要

标题：CARE 何以失败：对比锚定反思在可验证多模态中的应用

组相对强化学习带有可验证奖励（RLVR）时常浪费最有信息的数据，即它已经知道的失败。当所有展开都是错误的，梯度会停滞；当一个恰好正确时，更新通常会忽略为什么其他的是接近但错误的，且信用可能会被错误的链路所误分配。我们提出了CARE（对比锚定反思），一种以失败为中心的后训练框架，将错误转化为监督。CARE 结合了：(i) 一个锚定对比目标，围绕最佳展开形成一个紧凑的子组，并包含一组语义上接近的硬负样本，进行子组内 z 分数归一化，仅使用负值缩放，并包括一个全负救援以防止信号为零的批次；和 (ii) 反射引导重采样 (RGR)，这是一种一次性的结构化自我修复，重写一个代表性的失败并用相同的验证器重新评分，将接近失手转化为可用的正样本，无需任何测试时的反思。CARE 提高了准确性和训练平滑度，同时显着增加了来自失败的学习信号的比例。在 Qwen2.5-VL-7B 上，CARE 在六个可验证视觉推理基准测试中将宏平均准确度提高了 4.6 个百分点，与 GRPO 相比；使用 Qwen3-VL-8B，它在相同的评估协议下达到了 MathVista 和 MMMU-Pro 的竞争或最先进结果。

Summary / 总结

CARE (Contrastive Anchored REflection) is a failure-centric framework for verifiable multimodal reasoning that transforms errors into supervision. It uses an anchored-contrastive objective to form a compact subgroup around the best rollout and includes semantically proximate hard negatives, performs within-subgroup z-score normalization, and includes an all-negative rescue mechanism. Additionally, it employs Reflection-Guided Resampling (RGR) to re-score a representative failure, turning near-misses into usable positives. CARE improves accuracy and training smoothness, especially on Qwen2.5-VL-7B, where it increases macro-averaged accuracy by 4.6 points over GRPO across six verifiable visual-reasoning benchmarks.

CARE（对比锚定反思）是一种针对可验证多模态推理的失败中心化框架，将错误转化为监督信息。它使用锚定对比目标，围绕最佳执行形成紧凑子组，并包含语义上接近的硬负样本，进行子组内z分数归一化，并包含全部负样本救援机制。此外，它还采用反思引导重采样（RGR）重新评估一个代表性的失败，将接近失败转化为可用的正样本。CARE提高了准确性和训练平滑度，特别是在Qwen2.5-VL-7B上，它在六个可验证视觉推理基准上的宏平均准确率提高了4.6个百分点，超过了GRPO。

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

First: 2024-09-30T14:00:04+00:00 · Latest: 2025-12-22T16:29:48+00:00

Comments: 16 pages, 8 figures, 1 table

Abs · PDF · Code1 · Code2

Abstract

Due to the dynamic nature of the Semantic Web, version control is necessary to capture time-varying information for widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse the similarities and differences between OM and OV and formalise the OM4OV pipeline. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be reused for OV tasks, but without necessary extensions, the current OM4OV pipeline can produce skewed measurements, poor performance in detecting update entities, and limited explainability for false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, building upon the existing alignments from OM to reduce the number of matching candidates and improve overall OV performance.

中文标题/摘要

标题：OM4OV：利用本体匹配进行本体版本化

由于语义网的动态性质，版本控制对于捕捉广泛使用的本体中的时间变化信息是必要的。尽管本体版本化（OV）作为有效本体管理的关键组成部分已有长期的认识，但许多方法将OV视为类似于本体匹配（OM），并直接重用OM系统来执行OV任务。在本研究中，我们系统地分析了OM和OV之间的相似性和差异，并形式化了OM4OV管道。该管道在最先进的OM系统Agent-OM中实现和评估。实验结果表明，OM系统可以重用于OV任务，但如果没有必要的扩展，当前的OM4OV管道会产生偏差的测量结果，检测更新实体的性能较差，并且对于错误映射的解释能力有限。为了解决这些问题，我们提出了一种优化方法，称为交叉引用（CR）机制，基于现有的OM对齐来减少匹配候选数量，从而提高整体的OV性能。

Summary / 总结

This study addresses the need for version control in the Semantic Web to manage time-varying information in ontologies. It analyzes the differences between ontology matching and ontology versioning, and proposes an OM4OV pipeline implemented in Agent-OM. Experimental results show that while OM systems can be reused for OV tasks, they require extensions to improve performance and explainability. The study introduces a cross-reference mechanism to optimize the pipeline and enhance overall ontology versioning performance.

该研究针对语义网中对时间变化信息进行版本控制的需求，分析了语义匹配与语义版本控制之间的差异，并在Agent-OM中提出了OM4OV管道。实验结果表明，虽然可以利用语义匹配系统进行版本控制任务，但需要扩展以提高性能和解释性。研究引入了交叉引用机制来优化管道，从而提升整体的语义版本控制性能。

StoryMem: Multi-shot Long Video Storytelling with Memory

Authors: Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan

First: 2025-12-22T16:23:24+00:00 · Latest: 2025-12-22T16:23:24+00:00

Comments: Project page: https://kevin-thu.github.io/StoryMem

Abs · PDF · Code1 · Code2 · Project1

Abstract

Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.

中文标题/摘要

标题：StoryMem：基于记忆的多镜头长视频叙事

视觉叙事需要生成具有电影级质量和长程一致性的多镜头视频。受人类记忆启发，我们提出StoryMem，一种将长视频叙事重新定义为基于显式视觉记忆的迭代镜头合成的范式，将预训练的单镜头视频扩散模型转化为多镜头叙事者。这通过一种新颖的记忆到视频（M2V）设计实现，该设计维护一个紧凑且动态更新的关键帧记忆库，关键帧取自历史生成的镜头。存储的记忆通过潜在拼接和负RoPE位移注入单镜头视频扩散模型，仅需LoRA微调。语义关键帧选择策略与美学偏好过滤相结合，进一步确保整个生成过程中记忆的信息性和稳定性。此外，所提出的框架自然支持平滑的镜头过渡和定制化的叙事生成应用。为了便于评估，我们引入了ST-Bench，一个多样化的多镜头视频叙事基准。大量实验表明，StoryMem在跨镜头一致性方面优于先前的方法，同时保持了高美学质量和提示遵循度，标志着向连贯的分钟级视频叙事迈出了重要一步。

Summary / 总结

The research aims to generate multi-shot videos with cinematic quality and long-range consistency. StoryMem reformulates long-form video storytelling as iterative shot synthesis conditioned on visual memory, using a Memory-to-Video (M2V) design to maintain a compact and dynamically updated memory bank of keyframes. Experiments show that StoryMem outperforms previous methods in cross-shot consistency while maintaining high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.

StoryMem 将长视频叙事重新定义为基于视觉记忆的迭代镜头合成，通过 Memory-to-Video (M2V) 设计维护一个紧凑且动态更新的关键帧记忆库。结合语义关键帧选择和审美偏好过滤，确保生成过程中具有高审美质量和跨镜头一致性。实验表明，StoryMem 在保持审美质量和遵循提示方面优于先前的方法，朝着连贯的一分钟长视频叙事迈出了重要一步。

CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez

First: 2025-12-22T16:21:39+00:00 · Latest: 2025-12-22T16:21:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .

中文标题/摘要

标题：CASA：通过自注意力实现的跨注意力高效视觉-语言融合

视觉-语言模型（VLMs）通常通过将预训练视觉编码器中的图像令牌插入语言模型的文字流中来进行训练。这使得文本和图像信息能够在模型内部完全相互注意，但对高分辨率图像、长对话或流式视频来说，这在内存和计算上都非常昂贵。利用跨注意力的VLMs是令牌插入的高效替代方案，但在涉及精细视觉细节的任务上表现出明显的性能差距。我们发现，提高此类模型的关键在于在专门的跨注意力层中也启用局部文本到文本的交互。基于此，我们提出了CASA（Cross-Attention via Self-Attention），一种简单而高效的范式，该范式在常见的图像理解基准测试中显著减少了与完整令牌插入的差距，同时在长上下文多模态任务（如流式视频字幕生成）中保持与跨注意力模型相同的可扩展性。如需样本和代码，请参见我们的项目页面 https://kyutai.org/casa 。

Summary / 总结

The research aims to address the computational inefficiency of vision-language models (VLMs) when handling high-resolution images, long conversations, or streaming videos. The proposed method, CASA (Cross-Attention via Self-Attention), introduces local text-to-text interaction in cross-attention layers to bridge the performance gap with full token insertion methods. Experiments show that CASA significantly reduces the performance gap on common image understanding benchmarks while maintaining scalability for long-context multimodal tasks like streaming video captioning.

研究旨在解决视觉语言模型（VLMs）在处理高分辨率图像、长对话或流媒体视频时的计算效率问题。提出的CASA（Cross-Attention via Self-Attention）方法在交叉注意力层中引入局部文本到文本的交互，以缩小与全token插入方法的性能差距。实验表明，CASA在常见图像理解基准测试上显著减少了性能差距，同时保持了对长上下文多模态任务（如流媒体视频字幕生成）的可扩展性。

SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates

Authors: Chi Zhang, Braedon Gunn, Andrew M. Read-Fuller

First: 2025-12-22T16:21:29+00:00 · Latest: 2025-12-22T16:21:29+00:00

Comments: 12 pages, 8 figures. Submitted to Journal of Oral and Maxillofacial Surgery. Code: https://github.com/chz31/SlicerOrbitSurgerySim/tree/main

Abstract

Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.

中文标题/摘要

标题：SlicerOrbitSurgerySim：用于预成型眶板虚拟配准和定量比较的开源平台

眶植入物适配不良仍是术后并发症和翻修手术的主要原因。尽管与定制植入物相比，预成型眶板因可降低费用和缩短手术时间而被广泛使用，但外科医生目前缺乏公开可用的工具和标准化指标来定量比较不同供应商、尺寸和患者解剖结构下眶板的贴合度。我们开发了SlicerOrbitSurgerySim，这是3D Slicer平台的一个开源扩展，可以在患者特定的虚拟规划环境中对多个预成型眶板进行交互式虚拟配准、评估和比较。该软件生成可重复的定量板到眶距离指标和可视化工具，既支持患者特定的规划，也支持对眶板适应性的群体水平统计分析。通过促进植入物设计和放置策略的客观比较，该工具旨在改善术前决策，减少术中对眶板的修改，并促进协作研究和外科教育。我们提供了试点研究、样本数据集和详细教程，以支持测试、透明度和可重复性。

Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement

Authors: Hongsheng Xing, Qiuxin Si

First: 2025-12-22T16:19:01+00:00 · Latest: 2025-12-22T16:19:01+00:00

Comments: 13 pages, 6 figures

Abstract

Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the Catechol Benchmark, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions (% B). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments. Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an MSE of 0.0039 (± 0.0003), representing a 60% error reduction over competitive baselines and a >25× improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.

中文标题/摘要

标题：从瞬态流动数据学习连续溶剂效应：邻苯二酚重排的图神经网络基准

跨连续溶剂组成范围预测反应结果仍然是有机合成和过程化学中的关键挑战。传统机器学习方法通常将溶剂视为离散的分类变量，这阻碍了在溶剂空间中系统的内插和外推。本工作引入了Catechol基准，这是一个高通量瞬态流动化学数据集，包含1,227个实验产率测量值，涉及烯丙基取代邻苯二酚在24种纯溶剂及其二元混合物中的重排，并以连续体积分数（% B）参数化。我们在严格的留一溶剂和留一混合物协议下评估各种架构，以测试对未见过的化学环境的泛化能力。我们的结果表明，经典表格方法（例如梯度提升决策树）和大型语言模型嵌入（例如Qwen-7B）在定量精度方面存在困难，均方误差（MSE）分别为0.099和0.129。相比之下，我们提出了一种基于图神经网络的混合架构，该架构将图注意力网络（GATs）与差分反应指纹（DRFP）和学习得到的混合物感知溶剂编码相结合。这种方法实现了0.0039（±0.0003）的MSE，比有竞争力的基线减少了60%的误差，比表格集成方法改善了25倍以上。消融研究证实，显式的分子图消息传递和连续混合物编码对于稳健的泛化至关重要。完整数据集、评估协议和参考实现已发布，以促进数据高效的反应预测和连续溶剂表示学习。

Summary / 总结

This study addresses the challenge of predicting reaction outcomes across continuous solvent compositions in organic synthesis. It introduces the Catechol Benchmark, a dataset of 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures. Under leave-one-solvent-out and leave-one-mixture-out evaluation, classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) reached MSEs of only 0.099 and 0.129, respectively. A hybrid Graph Neural Network (GNN) architecture, combining Graph Attention Networks with Differential Reaction Fingerprints and learned mixture-aware solvent encodings, achieved an MSE of 0.0039, more than 25 times lower than the tabular ensembles.

该研究旨在解决有机合成中在连续溶剂组成范围内预测反应结果的挑战。研究引入了Catechol基准数据集，包含1,227个实验产率测量值，涉及烯丙基取代邻苯二酚在24种纯溶剂及其二元混合物中的重排反应。在留一溶剂和留一混合物评估协议下，经典表格方法（如梯度提升决策树）和大型语言模型嵌入（如Qwen-7B）的MSE分别为0.099和0.129。一种结合图注意力网络（GATs）、差分反应指纹（DRFP）和学习得到的混合物感知溶剂编码的混合图神经网络（GNN）架构实现了0.0039的MSE，比表格集成方法低25倍以上。
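
The leave-one-solvent-out protocol and the ">25×" figure can be illustrated with a short sketch. This is not the benchmark's released code: the record layout, the solvent labels `A`, `B`, `C`, and the rule that any mixture containing the held-out solvent goes to the test side are all assumptions made for illustration. Only the two MSE values come from the abstract.

```python
def leave_one_solvent_out(records):
    """Yield (held_out, train, test) splits, one per solvent.

    Assumption: every record whose solvent tuple contains the held-out
    solvent (pure or as part of a mixture) is moved to the test side.
    """
    solvents = sorted({s for r in records for s in r["solvents"]})
    for held_out in solvents:
        test = [r for r in records if held_out in r["solvents"]]
        train = [r for r in records if held_out not in r["solvents"]]
        yield held_out, train, test


# Hypothetical placeholder records: solvent names and fractions are made up.
records = [
    {"solvents": ("A",), "frac_b": 0.0},
    {"solvents": ("B",), "frac_b": 0.0},
    {"solvents": ("A", "B"), "frac_b": 0.5},
    {"solvents": ("C",), "frac_b": 0.0},
]

for held_out, train, test in leave_one_solvent_out(records):
    print(held_out, len(train), len(test))

# Ratio of the reported tabular MSE to the reported GNN MSE.
improvement = 0.099 / 0.0039
print(round(improvement, 1))
```

The ratio of the two reported errors is consistent with the abstract's claim of a more than 25-fold improvement over tabular ensembles.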

Multi-Modal Soccer Scene Analysis with Masked Pre-Training

Authors: Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo

Venue: WACV 2026

First: 2025-12-22T16:18:45+00:00 · Latest: 2025-12-22T16:18:45+00:00

Comments: 10 pages, 2 figures. WACV 2026

Abstract

In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.

中文标题/摘要

标题：基于掩码预训练的多模态足球场景分析

在本文中，我们提出了一种多模态架构，用于从战术摄像机录像中分析足球场景，重点关注三个核心任务：球轨迹推断、球状态分类和球拥有者识别。为此，我们的解决方案将三种不同的输入模态（球员轨迹、球员类型和单个球员的图像剪辑）整合到一个统一框架中，使用一系列社会时空变换器块处理空间和时间动态。与依赖于准确的球跟踪或手工设计启发式方法的先前方法不同，我们的方法在没有直接访问其过去或未来位置的情况下推断球轨迹，并在嘈杂或遮挡条件下从实际顶级联赛比赛中稳健地识别球状态和球拥有者。我们还引入了CropDrop，这是一种特定模态的掩码预训练策略，防止过度依赖图像特征，并鼓励模型在预训练期间依赖跨模态模式。我们在大规模数据集上展示了我们方法的有效性，在所有任务上均优于最先进的基线方法。我们的结果突显了在基于变换器的架构中结合结构化和视觉线索的好处，以及在多模态学习中使用现实的掩码策略的重要性。

Summary / 总结

This work proposes a multi-modal architecture for analyzing soccer scenes using player trajectories, player types, and image crops. It integrates these inputs into a unified framework using sociotemporal transformer blocks to infer ball trajectory, classify ball state, and identify ball possessor. The approach does not rely on direct ball tracking or heuristics, and it introduces CropDrop, a modality-specific masking pre-training strategy, which improves performance on a large-scale dataset compared to state-of-the-art methods. The results demonstrate the effectiveness of combining structured and visual cues in transformer-based models and the importance of realistic masking strategies in multi-modal learning.

本文提出了一种多模态架构，利用球员轨迹、球员类型和单个球员的图像裁剪来分析足球场景，重点关注球轨迹推断、球状态分类和持球者识别。该方法使用级联的社会时空Transformer模块，并引入了CropDrop，这是一种模态特定的掩码预训练策略，用于防止模型过度依赖图像特征并促使其学习跨模态模式。在大规模数据集上的实验表明，该方法在所有任务上都显著优于现有最先进的基线，突出了在基于Transformer的架构中结合结构化和视觉线索的好处，以及在多模态学习中使用现实的掩码策略的重要性。
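
The idea of a modality-specific mask can be sketched as follows. The abstract does not describe how CropDrop selects what to mask, so this is only an illustration under an assumed rule: each player's image crop is independently replaced by a mask placeholder with probability `drop_prob`, while trajectories and player types are left untouched. The function name `crop_drop` and the dictionary layout are hypothetical.

```python
import random


def crop_drop(players, drop_prob, rng):
    """Return a copy of `players` with image crops masked at random.

    Assumption (illustrative only): only the "crop" field is ever masked,
    so the model must fall back on trajectory and type for those players.
    """
    masked = []
    for player in players:
        entry = dict(player)  # shallow copy so the input list is not modified
        if rng.random() < drop_prob:
            entry["crop"] = None  # None stands in for a learned mask token
        masked.append(entry)
    return masked


# Hypothetical toy inputs; real crops would be image tensors.
players = [
    {"trajectory": [(0.0, 0.0), (1.0, 0.5)], "type": "outfield", "crop": "crop_0"},
    {"trajectory": [(5.0, 2.0), (5.5, 2.1)], "type": "goalkeeper", "crop": "crop_1"},
]

batch = crop_drop(players, drop_prob=0.5, rng=random.Random(0))
print([p["crop"] for p in batch])
```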

Deep Learning for Unrelated-Machines Scheduling: Handling Variable Dimensions

Authors: Diego Hitzges, Guillaume Sagnol

First: 2025-12-22T16:18:29+00:00 · Latest: 2025-12-22T16:18:29+00:00

Comments: 24th IEEE International Conference on Machine Learning and Applications (ICMLA 2025) in Boca Raton, USA. Project page: https://github.com/DiegoHitzges/Deep-Learning-for-Unrelated-Machines-Scheduling . 8 pages, 4 figures, 3 tables

Abstract

Deep learning has been effectively applied to many discrete optimization problems. However, learning-based scheduling on unrelated parallel machines remains particularly difficult to design. Not only do the numbers of jobs and machines vary, but each job-machine pair has a unique processing time, dynamically altering feature dimensions. We propose a novel approach with a neural network tailored for offline deterministic scheduling of arbitrary sizes on unrelated machines. The goal is to minimize a complex objective function that includes the makespan and the weighted tardiness of jobs and machines. Unlike existing online approaches, which process jobs sequentially, our method generates a complete schedule considering the entire input at once. The key contribution of this work lies in the sophisticated architecture of our model. By leveraging various NLP-inspired architectures, it effectively processes any number of jobs and machines with varying feature dimensions imposed by unrelated processing times. Our approach enables supervised training on small problem instances while demonstrating strong generalization to much larger scheduling environments. Trained and tested on instances with 8 jobs and 4 machines, costs were only 2.51% above optimal. Across all tested configurations of up to 100 jobs and 10 machines, our network consistently outperformed an advanced dispatching rule, which incurred 22.22% higher costs on average. As our method allows fast retraining with simulated data and adaptation to various scheduling conditions, we believe it has the potential to become a standard approach for learning-based scheduling on unrelated machines and similar problem environments.

中文标题/摘要

标题：无关机器调度中的深度学习：处理可变维度

深度学习已被有效应用于许多离散优化问题。然而，基于学习的无关并行机器调度设计尤为困难。不仅作业和机器的数量会变化，而且每个作业-机器配对都有独特的处理时间，动态改变特征维度。我们提出了一种新颖的方法，使用一种针对无关机器上任意大小的离线确定性调度定制的神经网络。目标是最小化包括作业和机器的最长完工时间和加权延迟的复杂目标函数。与现有的在线方法不同，这些方法按顺序处理作业，我们的方法一次考虑整个输入生成完整的调度。本文的关键贡献在于我们模型的复杂架构。通过利用各种NLP启发式架构，它能够有效处理由无关处理时间引起的任何数量的作业和机器的可变特征维度。我们的方法允许在小问题实例上进行监督训练，并且在更大的调度环境中表现出强大的泛化能力。在8个作业和4台机器的实例上训练和测试后，成本仅比最优值高出2.51%。在所有测试配置中，最多100个作业和10台机器，我们的网络始终优于一种先进的调度规则，该规则平均成本高出22.22%。由于我们的方法允许使用模拟数据快速重新训练并适应各种调度条件，我们认为它有可能成为无关机器上基于学习的调度的标准方法以及类似问题环境。

Summary / 总结

This paper addresses the challenge of scheduling on unrelated parallel machines using deep learning, where the number of jobs and machines varies, and each job-machine pair has a unique processing time. The authors propose a neural network tailored for offline deterministic scheduling, which minimizes an objective that includes the makespan and weighted tardiness. Unlike online approaches that process jobs sequentially, their method considers the entire input at once, enabling supervised training on small instances and strong generalization to larger environments. The model's costs were only 2.51% above optimal on instances with 8 jobs and 4 machines, and it consistently outperformed an advanced dispatching rule across all tested configurations of up to 100 jobs and 10 machines; the rule incurred 22.22% higher costs on average.

本文探讨了使用深度学习在无关并行机器上进行作业调度的挑战。作者提出了一种新型神经网络，能够处理可变维度，如作业和机器的数量，以及每个作业-机器配对独特的处理时间。不同于按顺序处理作业的在线方法，他们的方法一次考虑整个输入并生成完整的调度。该模型借鉴了受NLP启发的架构，有效地处理了变化的特征维度。实验表明，在8个作业和4台机器的实例上，该方法的成本仅比最优值高出2.51%；在最多100个作业和10台机器的所有测试配置中，该方法始终优于一种先进的调度规则，后者的平均成本高出22.22%。这显示出该方法强大的泛化能力，以及成为无关机器上基于学习的调度标准方法的潜力。
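
To make the scheduling objective concrete, here is a minimal sketch of how a cost combining makespan and weighted tardiness could be evaluated for a given schedule on unrelated machines. The paper's exact objective is not given in the abstract (it also mentions machine-related terms), so the formula below, the weights `alpha` and `beta`, and the toy instance are assumptions.

```python
def schedule_cost(proc, due, weight, schedule, alpha=1.0, beta=1.0):
    """Evaluate alpha * makespan + beta * total weighted job tardiness.

    proc[j][m] is the processing time of job j on machine m, which may
    differ arbitrarily between machines (unrelated machines).
    schedule[m] lists the jobs run on machine m, in processing order.
    """
    makespan = 0.0
    tardiness = 0.0
    for machine, jobs in enumerate(schedule):
        clock = 0.0
        for job in jobs:
            clock += proc[job][machine]  # completion time of this job
            tardiness += weight[job] * max(0.0, clock - due[job])
        makespan = max(makespan, clock)
    return alpha * makespan + beta * tardiness


# Hypothetical instance: 3 jobs, 2 machines.
proc = [[2, 4], [3, 1], [4, 2]]
due = [2, 2, 3]
weight = [1, 2, 1]

balanced = schedule_cost(proc, due, weight, schedule=[[0], [1, 2]])
single = schedule_cost(proc, due, weight, schedule=[[0, 1, 2], []])
print(balanced, single)
```

In this toy instance, placing each job on the machine where it runs fastest finishes everything on time, while queueing all jobs on one machine makes two of them late and lengthens the makespan.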

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli

First: 2025-12-22T16:18:00+00:00 · Latest: 2025-12-22T16:18:00+00:00

Abstract

Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.

中文标题/摘要

标题：QuantiPhy：评估视觉语言模型物理推理能力的定量基准

理解物理世界对于通用人工智能代理至关重要。然而，尚不清楚最先进的视觉感知模型（例如大型VLM）是否能够进行定量的物理属性推理。现有的评估主要基于VQA且为定性的，无法提供这些模型能否从视频观察中推断出移动物体的动力学量的见解。为了解决这一问题，我们提出了QuantiPhy，这是第一个旨在定量测量VLM物理推理能力的基准。QuantiPhy包含超过3300个视频-文本实例，具有数值真实值，评估VLM在给定时间戳估计物体大小、速度和加速度的表现，其中一个属性作为输入先验。基准标准化了提示和评分，以评估数值准确性，从而实现模型之间的公平比较。我们在最先进的VLM上的实验揭示了它们的定性合理性与实际数值正确性之间的一致差距。我们进一步深入分析了背景噪声、反事实先验和策略性提示等关键因素，并发现最先进的VLM在进行定量动力学属性推理时，严重依赖预训练的世界知识，而不是忠实使用提供的视觉和文本输入作为参考。QuantiPhy提供了第一个严格的、可扩展的测试平台，推动VLM超越单纯的口头合理性，迈向基于数字的物理理解。

Summary / 总结

QuantiPhy is a benchmark designed to evaluate the quantitative physical reasoning abilities of vision-language models. It consists of over 3,300 video-text instances with numerical ground truth, assessing models' ability to estimate an object's size, velocity, and acceleration. Experiments show a gap between models' qualitative plausibility and numerical accuracy, indicating reliance on pre-trained knowledge rather than visual and textual inputs for quantitative reasoning.

QuantiPhy 是一个基准测试，用于评估视觉语言模型的定量物理推理能力。它包含超过 3,300 个视频-文本实例，带有数值 ground truth，以评估模型在估计物体大小、速度和加速度方面的表现。实验表明，在质性合理性与数值准确性之间存在差距，表明模型在推理动力学属性时更依赖预训练知识，而不是视觉和文本输入。
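
The kind of quantitative reasoning the benchmark targets can be illustrated with a small worked example. This is a sketch under assumptions, not QuantiPhy's protocol or scoring: it assumes a known object size is the input prior, uses it to convert pixels to meters, and scores a prediction by relative error. All function names and numbers are hypothetical.

```python
def meters_per_pixel(known_size_m, apparent_size_px):
    """Scale factor derived from a size prior (assumed constant across frames)."""
    return known_size_m / apparent_size_px


def estimate_velocity(x_start_px, x_end_px, elapsed_s, scale_m_per_px):
    """Average velocity in m/s from a pixel displacement over elapsed time."""
    return (x_end_px - x_start_px) * scale_m_per_px / elapsed_s


def relative_error(predicted, ground_truth):
    """One possible numerical-accuracy score; the benchmark's metric may differ."""
    return abs(predicted - ground_truth) / abs(ground_truth)


# Toy numbers: a 0.22 m wide object spans 44 px and moves 300 px in 0.5 s.
scale = meters_per_pixel(known_size_m=0.22, apparent_size_px=44)
velocity = estimate_velocity(100, 400, 0.5, scale)
print(f"{velocity:.2f} m/s")

# Score a hypothetical model answer of 3.3 m/s against the computed value.
print(f"{relative_error(3.3, velocity):.2f}")
```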

Estimating Spatially Resolved Radiation Fields Using Neural Networks

Authors: Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor

First: 2025-12-19T14:52:04+00:00 · Latest: 2025-12-22T16:13:25+00:00

Abstract

We present an in-depth analysis on how to build and train neural networks to estimate the spatial distribution of scattered radiation fields for radiation protection dosimetry in medical radiation fields, such as those found in interventional radiology and cardiology. We present three different synthetically generated datasets with increasing complexity for training, using a Monte-Carlo Simulation application based on Geant4. On those datasets, we evaluate convolutional and fully connected architectures of neural networks to demonstrate which design decisions work well for reconstructing the fluence and spectra distributions over the spatial domain of such radiation fields. All our datasets, as well as our training pipeline, are published as open source in separate repositories.

中文标题/摘要

标题：使用神经网络估算空间分辨辐射场

我们对如何构建和训练神经网络以估算医学辐射场（如介入放射学和心脏病学中的辐射场）中散射辐射场的空间分布进行了深入分析，用于辐射防护剂量学。我们使用基于Geant4的蒙特卡洛模拟应用程序，生成了三个不同复杂度的合成数据集进行训练。在这些数据集上，我们评估了卷积和全连接神经网络架构，以展示哪些设计决策适用于重建此类辐射场的空间域中的通量和能谱分布。我们所有的数据集以及训练管道都已作为开源发布在单独的仓库中。

Summary / 总结

This study analyzes how to build and train neural networks that estimate the spatial distribution of scattered radiation fields for radiation protection dosimetry in medical settings such as interventional radiology and cardiology. Three synthetic datasets of increasing complexity were generated with a Geant4-based Monte-Carlo simulation application. Convolutional and fully connected architectures were evaluated on them to show which design decisions work well for reconstructing fluence and spectra distributions over the spatial domain. All datasets and the training pipeline are published as open source in separate repositories.

研究分析了如何构建和训练神经网络，以估计介入放射学和心脏病学等医疗场景中散射辐射场的空间分布，用于辐射防护剂量学。研究使用基于Geant4的蒙特卡洛模拟应用程序生成了三个复杂度递增的合成数据集，并在这些数据集上评估了卷积和全连接神经网络架构，以展示哪些设计决策适用于重建空间域上的通量和能谱分布。所有数据集和训练管道均已作为开源发布在单独的仓库中。

LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning

Authors: Xueming Yan, Bo Yin, Yaochu Jin

First: 2025-12-22T16:08:03+00:00 · Latest: 2025-12-22T16:08:03+00:00

Abstract

Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-the-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.

中文标题/摘要

标题：LacaDM：多目标强化学习的潜在因果扩散模型

多目标强化学习（MORL）由于目标之间的固有冲突和动态环境下的适应性困难，面临着重大挑战。传统方法往往难以在大型和复杂的状态-动作空间中有效泛化。为了解决这些限制，我们提出了潜在因果扩散模型（LacaDM），这是一种新颖的方法，旨在增强MORL在离散和连续环境中的适应性。与主要解决目标之间冲突的现有方法不同，LacaDM 学习环境状态和策略之间的潜在时间因果关系，从而在各种MORL场景中实现高效的知识转移。通过在基于扩散模型的框架中嵌入这些因果结构，LacaDM 在平衡冲突目标的同时，保持了在未见过的环境中强大的泛化能力。来自MOGymnasium框架的各种任务上的实证评估表明，LacaDM 在超体积、稀疏性和期望效用最大化方面始终优于最先进的基线方法，展示了其在复杂多目标任务中的有效性。

Summary / 总结

The research addresses the challenges of multiobjective reinforcement learning (MORL) by introducing LacaDM, a Latent Causal Diffusion Model. This model learns latent causal relationships between environmental states and policies to enhance adaptability in both discrete and continuous environments. Experimental results show that LacaDM outperforms state-of-the-art baselines in hypervolume, sparsity, and expected utility maximization across various tasks, demonstrating its effectiveness in complex MORL scenarios.

研究通过引入LacaDM（潜因果扩散模型）来解决多目标强化学习（MORL）的挑战。该模型学习环境状态和策略之间的潜在因果关系，以增强在离散和连续环境中的适应性。实验结果表明，LacaDM在各种任务中在超体积、稀疏性和期望效用最大化方面优于最先进的基线方法，展示了其在复杂MORL任务中的有效性。
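
Hypervolume, one of the reported metrics, measures the region of objective space dominated by a set of solutions relative to a reference point. The following is a minimal sketch for the two-objective case only, assuming both objectives are maximized; it is not taken from the paper or from the MOGymnasium framework, and the function name `hypervolume_2d` and the sample points are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref`.

    Assumes two objectives, both maximized. Points that do not strictly
    improve on the reference point in both objectives contribute nothing.
    """
    candidates = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area = 0.0
    best_second = ref[1]
    for first, second in candidates:
        # Sweeping from the largest first objective down, a point adds a new
        # horizontal strip only if it raises the best second objective so far.
        if second > best_second:
            area += (first - ref[0]) * (second - best_second)
            best_second = second
    return area


# Hypothetical returns of three policies on two objectives.
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))
```

A larger hypervolume indicates a set of policies that covers more of the trade-off between objectives; adding a dominated policy leaves the value unchanged.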