MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Authors: Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao
First: 2025-12-16T18:59:59+00:00 · Latest: 2025-12-16T18:59:59+00:00
Comments: Project Page: https://sihuiji.github.io/MemFlow.github.io/
Abstract
The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which is hard to satisfy with fixed strategies. In this work, we propose MemFlow to address this problem. Specifically, before generating the coming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of this chunk. This design enables narrative coherence even if a new event happens or the scenario switches in future frames. In addition, during generation, we only activate the most relevant tokens in the memory bank for each query in the attention layers, which effectively preserves generation efficiency. In this way, MemFlow achieves outstanding long-context consistency with negligible computational burden (7.9% speed reduction compared with the memory-free baseline) and remains compatible with any streaming video generation model with KV cache.
中文标题/摘要
标题:MemFlow:流动的自适应内存用于一致且高效的长视频叙事
长视频流媒体生成的核心挑战在于保持长时间上下文中的内容一致性,这对内存设计提出了高要求。大多数现有解决方案通过预定义策略压缩历史帧来维持记忆。然而,不同待生成的视频片段需要参考不同的历史线索,这很难用固定策略满足。在本文中,我们提出了MemFlow来解决这个问题。具体来说,在生成即将来的片段之前,我们通过检索与该片段文本提示最相关的历史帧,动态更新内存库。这种设计即使在未来的帧中发生新事件或场景切换时,也能保持叙事连贯性。此外,在生成过程中,我们仅在注意力层的每个查询中激活内存库中最相关的令牌,从而有效保证生成效率。这样,MemFlow在几乎不增加计算负担(与无记忆基线相比,速度降低7.9%)的情况下实现了出色的长上下文一致性,并且与任何带有KV缓存的流媒体视频生成模型兼容。
Summary / 总结
MemFlow addresses the challenge of maintaining content consistency in long video narratives by dynamically updating the memory bank with relevant historical frames based on text prompts. This approach ensures narrative coherence even with new events or scenario changes. During generation, only relevant tokens are activated, maintaining efficiency. MemFlow achieves excellent long-context consistency with minimal computational overhead (7.9% speed reduction compared to a memory-free baseline) and is compatible with any streaming video generation model with KV cache.
MemFlow 通过根据当前片段的文本提示动态更新包含相关历史帧的内存库来解决长视频叙述中保持内容一致性的挑战。这种方法确保了叙述连贯性并在生成过程中保持高效,实现了显著的长上下文一致性,同时计算开销很小(与无内存基线相比,速度降低了7.9%)。
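The retrieval step described in the MemFlow abstract (choosing which historical frames enter the memory bank based on the upcoming chunk's text prompt) can be illustrated with a minimal sketch. The abstract does not state how relevance is scored, so cosine similarity over precomputed embeddings is an assumption here, and `retrieve_memory`, `prompt_emb`, and `frame_embs` are hypothetical names, not the authors' API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_memory(prompt_emb, frame_embs, k):
    """Return indices of the k history frames most similar to the prompt.

    Assumption: relevance is cosine similarity between a text-prompt embedding
    and per-frame embeddings. Indices are returned in temporal order.
    """
    ranked = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(prompt_emb, frame_embs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy history: frames 0-1 show one scene, frames 2-3 another.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompt = [0.0, 1.0]  # hypothetical embedding of the next chunk's prompt
print(retrieve_memory(prompt, history, k=2))  # [2, 3]
```

The same top-k idea would apply at the token level for the sparse activation in attention layers, though the paper's exact selection rule is not given in the abstract.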
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang
First: 2025-12-16T18:59:58+00:00 · Latest: 2025-12-16T18:59:58+00:00
Comments: Project Page: https://timelens-arc-lab.github.io/
Abstract
This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset. Building on our data foundation, we conduct in-depth explorations of algorithmic design principles, yielding a series of meaningful insights and effective yet efficient practices. These include interleaved textual encoding for time representation, a thinking-free reinforcement learning with verifiable rewards (RLVR) approach as the training paradigm, and carefully designed recipes for RLVR training. These efforts culminate in TimeLens models, a family of MLLMs that achieve state-of-the-art VTG performance among open-source models and even surpass proprietary models such as GPT-5 and Gemini-2.5-Flash. All code, data, and models will be released to facilitate future research.
中文标题/摘要
标题:TimeLens:利用多模态大语言模型重新思考视频时间定位
本文并未引入新的方法,而是为视频时间定位(VTG)这一视频理解的核心能力建立了简单、渐进且必要的基准。尽管多模态大语言模型(MLLMs)在各种视频理解任务中表现出色,但针对VTG优化它们的方法仍处于探索阶段。本文介绍了TimeLens,这是一种系统性的研究,旨在构建具有强大VTG能力的MLLMs,主要从数据质量和算法设计两个维度进行。我们首先揭示了现有VTG基准中的关键质量问题,并引入了TimeLens-Bench,这是一个包含三个流行基准的严格重新注释版本。我们的分析显示,与传统基准相比,模型排名发生了巨大变化,证实了先前评估标准的不可靠性。我们还通过自动重新注释流水线解决了嘈杂的训练数据问题,生成了TimeLens-100K,这是一个大规模、高质量的训练数据集。基于我们的数据基础,我们深入探讨了算法设计原则,产生了许多有意义的见解和有效且高效的实践。这些包括交错的文本编码用于时间表示、无思考的强化学习(带有可验证奖励)作为训练范式,以及精心设计的RLVR训练食谱。这些努力最终产生了TimeLens模型,这是一个具有开源模型中最佳VTG性能的MLLM家族,甚至超越了GPT-5和Gemini-2.5-Flash等专有模型。所有代码、数据和模型都将发布,以促进未来的研究。
Summary / 总结
This paper establishes a baseline for video temporal grounding (VTG) by addressing data quality issues and algorithmic design. It introduces TimeLens-Bench, a re-annotated version of existing VTG benchmarks, and TimeLens-100K, a high-quality training dataset. The authors explore algorithmic design principles, leading to TimeLens models with state-of-the-art VTG performance, surpassing proprietary models like GPT-5 and Gemini-2.5-Flash.
本文介绍了TimeLens,这是一种使用多模态大型语言模型(MLLMs)来增强视频时间定位(VTG)的方法。通过引入TimeLens-Bench和TimeLens-100K,解决了现有VTG基准中的关键问题,提供了高质量的数据。研究探索了算法设计原则,提出了有效的实践,如交错文本编码和无思考的带有可验证奖励的强化学习(RLVR)方法。最终,TimeLens模型在VTG上达到了最先进的性能,超越了开源和专有模型。
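For the RLVR training mentioned in the TimeLens abstract, a verifiable reward needs to be computable directly from a predicted time span and the annotation. The abstract does not specify the reward, so the sketch below uses temporal IoU, a standard VTG metric, purely as an assumed example; `temporal_iou` is a hypothetical helper name.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time spans in seconds.

    Assumption: this stands in for a verifiable reward; the reward actually
    used for TimeLens is not stated in the abstract.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


# Prediction overlaps the annotation by 5 s out of a 15 s union.
print(round(temporal_iou((10.0, 20.0), (15.0, 25.0)), 3))  # 0.333
```

A reward of this kind is cheap to check and needs no learned judge, which is what makes it "verifiable" in the RLVR sense.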
CRISP: Contact-Guided Real2Sim from Monocular Video with Planar Scene Primitives
Authors: Zihan Wang, Jiashun Wang, Jeff Tan, Yiwen Zhao, Jessica Hodgins, Shubham Tulsiani, Deva Ramanan
First: 2025-12-16T18:59:50+00:00 · Latest: 2025-12-16T18:59:50+00:00
Comments: Project page: https://crisp-real2sim.github.io/CRISP-Real2Sim/
Abstract
We introduce CRISP, a method that recovers simulatable human motion and scene geometry from monocular video. Prior work on joint human-scene reconstruction relies on data-driven priors and joint optimization with no physics in the loop, or recovers noisy geometry with artifacts that cause motion tracking policies with scene interactions to fail. In contrast, our key insight is to recover convex, clean, and simulation-ready geometry by fitting planar primitives to a point cloud reconstruction of the scene, via a simple clustering pipeline over depth, normals, and flow. To reconstruct scene geometry that might be occluded during interactions, we make use of human-scene contact modeling (e.g., we use human posture to reconstruct the occluded seat of a chair). Finally, we ensure that human and scene reconstructions are physically plausible by using them to drive a humanoid controller via reinforcement learning. Our approach reduces motion tracking failure rates from 55.2% to 6.9% on human-centric video benchmarks (EMDB, PROX), while delivering a 43% faster RL simulation throughput. We further validate it on in-the-wild videos including casually-captured videos, Internet videos, and even Sora-generated videos. This demonstrates CRISP's ability to generate physically valid human motion and interaction environments at scale, greatly advancing real-to-sim applications for robotics and AR/VR.
中文标题/摘要
标题:CRISP:基于接触引导的单目视频从模拟到现实的平面场景原语方法
我们介绍了CRISP,一种从单目视频中恢复可模拟的人体运动和场景几何的方法。先前的人体-场景联合重建工作依赖于数据驱动的先验知识和联合优化,但没有物理约束,或者恢复出带有噪声和伪影的几何结构,导致与场景交互的运动跟踪策略失效。相比之下,我们的关键洞察是通过使用深度、法线和流的简单聚类流水线拟合平面原语,来恢复凸形、干净且可用于模拟的几何结构。为了重建在交互过程中可能被遮挡的场景几何结构,我们利用人体-场景接触建模(例如,使用人体姿态重建椅子被遮挡的座位)。最后,我们通过强化学习使用人体和场景重建来驱动一个类人控制器,确保它们的物理合理性。我们的方法在以人体为中心的视频基准(EMDB,PROX)上将运动跟踪失败率从55.2%降低到6.9%,同时实现了43%更快的RL模拟吞吐量。我们还在野外视频中进一步验证了它,包括随意拍摄的视频、互联网视频,甚至Sora生成的视频。这表明CRISP能够大规模生成物理有效的真人运动和交互环境,极大地推进了从现实到模拟的应用,特别是在机器人技术和AR/VR领域。
Summary / 总结
CRISP is a method that recovers human motion and scene geometry from monocular video by fitting planar primitives to a point cloud reconstruction of the scene. It uses human-scene contact modeling to reconstruct occluded parts and ensures physical plausibility through reinforcement learning. CRISP reduces motion tracking failure rates from 55.2% to 6.9% on human-centric benchmarks and increases RL simulation throughput by 43%. It is validated on various video types, showing its ability to generate physically valid environments for real-to-sim applications.
CRISP 通过将平面原语拟合到场景的点云重建上来从单目视频中恢复人体运动和场景几何。它使用人体-场景接触建模来重建被遮挡的部分,并通过强化学习确保物理合理性。CRISP 将人体中心基准上的运动跟踪失败率从 55.2% 降低到 6.9%,并使 RL 模拟吞吐量提高 43%。它在各种视频类型上进行了验证,展示了其生成物理有效环境的能力,极大地推进了真实到模拟的应用。
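The core geometric primitive in the CRISP abstract is a plane fitted to part of a point cloud. The sketch below shows only that basic building block: defining a plane from three points and collecting the points that lie close to it. It is not the authors' clustering pipeline over depth, normals, and flow; the function names and the tolerance value are illustrative assumptions.

```python
import math


def plane_from_points(p0, p1, p2):
    """Return (unit normal n, offset d) of the plane n.x + d = 0 through three points."""
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    norm = math.sqrt(sum(c * c for c in n))
    if norm == 0.0:
        raise ValueError("points are collinear; no unique plane")
    n = [c / norm for c in n]
    d = -sum(n[i] * p0[i] for i in range(3))
    return n, d


def plane_inliers(points, n, d, tol):
    """Keep points whose perpendicular distance to the plane is at most tol."""
    return [p for p in points if abs(sum(n[i] * p[i] for i in range(3)) + d) <= tol]


# Toy "floor": three points at z = 0 define the plane.
normal, offset = plane_from_points((0, 0, 0), (1, 0, 0), (0, 1, 0))
cloud = [(2.0, 3.0, 0.01), (1.0, 1.0, 0.5), (-1.0, 4.0, -0.02)]
print(normal, offset)
print(plane_inliers(cloud, normal, offset, tol=0.05))
```

Repeating this over candidate point triples and keeping the plane with the most inliers gives a RANSAC-style fit, one common way to extract planar primitives from noisy reconstructions.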
Love First, Know Later: Persona-Based Romantic Compatibility Through LLM Text World Engines
Authors: Haoyang Shang, Zhengyang Yan, Xuan Liu
Venue: NeurIPS 2025 Workshop on LLM Persona Modeling (Oral)
First: 2025-12-04T02:07:05+00:00 · Latest: 2025-12-16T18:59:14+00:00
Comments: NeurIPS 2025 Workshop: First Workshop on LLM Persona Modeling (Oral)
Abstract
We propose Love First, Know Later: a paradigm shift in computational matching that simulates interactions first, then assesses compatibility. Instead of comparing static profiles, our framework leverages LLMs as text world engines that operate in a dual capacity: as persona-driven agents following behavioral policies and as the environment modeling interaction dynamics. We formalize compatibility assessment as a reward-modeling problem: given observed matching outcomes, we learn to extract signals from simulations that predict human preferences. Our key insight is that relationships hinge on responses to critical moments; we translate this observation from relationship psychology into mathematical hypotheses, enabling effective simulation. Theoretically, we prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching. Empirically, we validate on speed dating data for initial chemistry and divorce prediction for long-term stability. This paradigm enables interactive, personalized matching systems where users iteratively refine their agents, unlocking future possibilities for transparent and interactive compatibility assessment.
中文标题/摘要
标题:先爱后知:基于人设的浪漫兼容性通过LLM文本世界引擎
我们提出Love First, Know Later:一种计算匹配的范式转变,先模拟互动,再评估兼容性。我们框架利用LLM作为文本世界引擎,兼具人设驱动的代理和环境建模互动动态的双重能力。我们将兼容性评估形式化为奖励建模问题:给定匹配结果,我们学习从模拟中提取预测人类偏好的信号。我们的核心洞察是,关系取决于对关键时刻的反应——我们将这一观察从关系心理学转化为数学假设,从而实现有效的模拟。理论上,我们证明,随着LLM策略更好地逼近人类行为,诱导的匹配将收敛到最优稳定匹配。实验上,我们在速配数据上验证了初始化学反应,在离婚预测上验证了长期稳定性。这一范式使交互式、个性化匹配系统成为可能,用户可以迭代优化其代理,从而解锁未来透明和交互式的兼容性评估的可能性。
Summary / 总结
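The work aims to improve computational matching by using LLMs as text world engines that simulate interactions, focusing on dynamic responses to critical moments rather than static profiles. LLMs act as persona-driven agents in simulated interactions, and signals learned from these simulations are used to predict human preferences and assess compatibility. The authors prove that as LLM policies better approximate human behavior, the induced matching converges to optimal stable matching, and they validate the approach on speed dating data for initial chemistry and on divorce prediction for long-term stability.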
研究旨在通过使用LLM作为文本世界引擎来模拟互动,以动态响应关键时刻而非静态档案来改进计算匹配。方法包括使用LLM作为角色驱动的代理进行互动模拟,然后从这些模拟中学习以预测人类偏好并评估兼容性。关键发现表明,随着LLM策略更好地模拟人类行为,匹配结果将趋向于最优稳定匹配,通过速配数据验证了初始化学反应,并通过离婚预测验证了长期稳定性。
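The theoretical claim above refers to stable matching. As background, the sketch below shows the classic Gale-Shapley deferred-acceptance algorithm, which produces a stable matching from two sets of ranked preferences. It is not the authors' method; in their setting the preference lists would presumably be derived from simulation-based compatibility signals, which is an assumption here, and all names are illustrative.

```python
def stable_matching(proposer_prefs, receiver_prefs):
    """Gale-Shapley deferred acceptance.

    Both arguments map each participant to a complete, ranked list of the
    other side (most preferred first). Assumes equal-sized sides.
    Returns a dict mapping each proposer to their matched receiver.
    """
    free = list(proposer_prefs)
    next_choice = {p: 0 for p in proposer_prefs}
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    engaged = {}  # receiver -> proposer currently held

    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:
            free.append(engaged[r])  # receiver trades up; old partner is free again
            engaged[r] = p
        else:
            free.append(p)  # rejected; p will try the next choice

    return {p: r for r, p in engaged.items()}


proposers = {"a": ["x", "y"], "b": ["x", "y"]}
receivers = {"x": ["b", "a"], "y": ["a", "b"]}
print(stable_matching(proposers, receivers))  # {'b': 'x', 'a': 'y'}
```

In the result no proposer and receiver both prefer each other over their assigned partners, which is the definition of stability.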
Native and Compact Structured Latents for 3D Generation
Authors: Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang
First: 2025-12-16T18:58:28+00:00 · Latest: 2025-12-16T18:58:28+00:00
Comments: Project Page: https://microsoft.github.io/TRELLIS.2/
Abstract
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
中文标题/摘要
标题:3D生成中的原生紧凑结构潜变量
近期在3D生成建模方面的进展显著提高了生成的真实感,但该领域仍受限于现有的表示方法,难以捕捉具有复杂拓扑和详细外观的资产。本文提出了一种从原生3D数据中学习结构化潜变量的方法,以应对这一挑战。其核心是一种新的稀疏体素结构O-体素,这是一种全体素表示,同时编码几何和外观。O-体素可以稳健地建模任意拓扑,包括开放、非流形和完全封闭的表面,同时捕捉超越纹理颜色的全面表面属性,如基于物理的渲染参数。基于O-体素,我们设计了一种稀疏压缩VAE,提供高空间压缩率和紧凑的潜变量空间。我们使用多种公开的3D资产数据集训练了包含40亿参数的大规模流匹配模型进行3D生成。尽管规模庞大,但推理仍然非常高效。同时,我们生成的资产的几何和材料质量远超现有模型。我们认为我们的方法在3D生成建模方面提供了显著的进步。
Summary / 总结
This paper addresses the challenge of representing complex 3D assets with detailed appearance by introducing O-Voxel, a new sparse voxel structure that captures both geometry and appearance. Based on O-Voxel, a Sparse Compression VAE is designed to achieve high spatial compression and a compact latent space. The authors train large-scale flow-matching models with 4B parameters on diverse 3D asset datasets, resulting in highly efficient inference and superior geometry and material quality in generated assets compared to existing models.
该论文通过引入新的稀疏体素结构O-Voxel和稀疏压缩VAE来解决3D生成模型在捕捉复杂拓扑和详细外观方面的挑战。该方法使用包含40亿参数的大规模流匹配模型生成3D资产,实现了高空间压缩和紧凑的潜在空间。生成的资产在几何和材料质量方面明显优于现有模型,显著推进了3D生成建模的发展。
MMGR: Multi-Modal Generative Reasoning
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
First: 2025-12-16T18:58:04+00:00 · Latest: 2025-12-16T18:58:04+00:00
Comments: work in progress
Abstract
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
中文标题/摘要
标题:MMGR:多模态生成推理
视频基础模型生成视觉上逼真且时间上连贯的内容,但它们作为世界模拟器的可靠性取决于是否捕捉了物理、逻辑和空间约束。现有指标如弗雷切视频距离(FVD)强调感知质量,而忽视了推理失败,包括因果关系、物理法则和全局一致性的问题。我们引入了MMGR(多模态生成推理评估与基准),一个基于五种推理能力的原理性评估框架:物理、逻辑、三维空间、二维空间和时间。MMGR在抽象推理(ARC-AGI、数独)、体感导航(现实世界三维导航和定位)和物理常识(体育和组合交互)三个领域评估生成推理。MMGR应用细粒度指标,要求视频和图像生成的整体正确性。我们对领先视频模型(Veo-3、Sora-2、Wan-2.2)和图像模型(Nano-banana、Nano-banana Pro、GPT-4o-image、Qwen-image)进行了基准测试,揭示了不同领域的性能差距。模型在物理常识任务上表现出适度的成功,但在抽象推理(ARC-AGI准确率低于10%)和体感设置中的长期空间规划方面表现不佳。我们的分析指出了当前模型的关键局限性,包括过度依赖感知数据、全局状态一致性较弱以及奖励视觉合理性而非因果正确性的目标。MMGR提供了一个统一的诊断基准,并为推理感知生成世界模型指明了方向。
Summary / 总结
MMGR is a new evaluation framework for video and image generation models that targets five reasoning abilities: physical, logical, 3D spatial, 2D spatial, and temporal. It evaluates models across abstract reasoning, embodied navigation, and physical commonsense tasks using fine-grained metrics that require holistic correctness. The study benchmarks several leading models and finds significant performance gaps across domains: models show moderate success on physical commonsense but perform poorly on abstract reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning.
MMGR 是一个用于评估视频和图像生成模型推理能力的新框架,重点关注其在物理、逻辑和空间领域的表现。该研究使用细粒度的指标评估模型在抽象推理、体感导航和物理常识任务上的表现,并发现显著的性能差距,尤其是在抽象推理和长时空间规划方面。主要限制包括对感知数据的过度依赖和全局状态一致性较弱。
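The MMGR abstract stresses metrics that require holistic correctness, meaning a generated output only counts if it is right as a whole. For the Sudoku task this can be illustrated with an all-or-nothing check on a decoded grid. This is a sketch under assumptions: the benchmark's actual scoring code is not described in the abstract, and `holistic_correct` and `make_valid_grid` are hypothetical names.

```python
def make_valid_grid():
    """Build a known-valid 9x9 Sudoku solution using a standard shift pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def holistic_correct(grid, clues):
    """Return True only if the whole grid is a valid solution that keeps every clue.

    grid:  9x9 list of ints decoded from a model's output (decoding not shown).
    clues: dict mapping (row, col) to the digit given in the puzzle.
    """
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    units_ok = all(unit == digits for unit in rows + cols + boxes)
    clues_ok = all(grid[r][c] == v for (r, c), v in clues.items())
    return units_ok and clues_ok


solution = make_valid_grid()
print(holistic_correct(solution, {(0, 0): 1}))  # True

# Swapping two cells keeps the row valid but breaks two columns.
broken = [row[:] for row in solution]
broken[0][0], broken[0][1] = broken[0][1], broken[0][0]
print(holistic_correct(broken, {}))  # False
```

A per-cell accuracy would score the broken grid at about 97 percent, which is exactly the kind of partial credit a holistic metric is meant to deny.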
CHIP: Adaptive Compliance for Humanoid Control through Hindsight Perturbation
Authors: Sirui Chen, Zi-ang Cao, Zhengyi Luo, Fernando Castañeda, Chenran Li, Tingwu Wang, Ye Yuan, Linxi "Jim" Fan, C. Karen Liu, Yuke Zhu
First: 2025-12-16T18:56:04+00:00 · Latest: 2025-12-16T18:56:04+00:00
Comments: The first two authors contributed equally. Project page: https://nvlabs.github.io/CHIP/
Abstract
Recent progress in humanoid robots has unlocked agile locomotion skills, including backflipping, running, and crawling. Yet it remains challenging for a humanoid robot to perform forceful manipulation tasks such as moving objects, wiping, and pushing a cart. We propose Adaptive Compliance for Humanoid Control through Hindsight Perturbation (CHIP), a plug-and-play module that enables controllable end-effector stiffness while preserving agile tracking of dynamic reference motions. CHIP is easy to implement and requires neither data augmentation nor additional reward tuning. We show that a generalist motion-tracking controller trained with CHIP can perform a diverse set of forceful manipulation tasks that require different end-effector compliance, such as multi-robot collaboration, wiping, box delivery, and door opening.
中文标题/摘要
标题:CHIP:通过前瞻扰动实现类人控制的自适应顺应性
近年来,类人机器人在敏捷运动技能方面取得了进展,包括后空翻、跑步和爬行。然而,类人机器人在执行诸如移动物体、擦拭和推车等力量型操作任务方面仍然面临挑战。我们提出了通过前瞻扰动实现自适应顺应性的人类控制(CHIP),这是一个即插即用模块,能够在保持对动态参考运动的敏捷跟踪的同时,实现可控的末端执行器刚度。CHIP 实现简单,无需数据增强或额外的奖励调整。我们展示了使用 CHIP 训练的一般运动跟踪控制器可以执行多种需要不同末端执行器顺应性的力量型操作任务,例如多机器人协作、擦拭、箱子递送和开门。
Summary / 总结
The research aims to enhance humanoid robots' ability to perform forceful manipulation tasks. CHIP, a module for adaptive compliance control, is proposed to enable controllable end-effector stiffness while maintaining agile tracking of dynamic motions. The method involves using hindsight perturbation and does not require data augmentation or additional reward tuning. Key findings show that a generalist motion-tracking controller with CHIP can execute various forceful manipulation tasks, including multi-robot collaboration, wiping, box delivery, and door opening, with different end-effector compliance requirements.
论文提出了一种名为CHIP的方法,通过回顾性扰动实现人形机器人的适应性刚度控制,能够在保持动态运动跟踪的同时控制末端执行器的刚度。该方法不需要数据增强或额外的奖励调整,并允许通用的运动跟踪控制器执行多种力量操作任务,包括多机器人协作、擦地、箱子递送和开门。
Misspecification-robust amortised simulation-based inference using variational methods
Authors: Matthew O'Callaghan, Kaisey S. Mandel, Gerry Gilmore
First: 2025-09-06T14:10:49+00:00 · Latest: 2025-12-16T18:48:04+00:00
Comments: Latex edits, fixed typos
Abstract
Recent advances in neural density estimation have enabled powerful simulation-based inference (SBI) methods that can flexibly approximate Bayesian inference for intractable stochastic models. Although these methods have demonstrated reliable posterior estimation when the simulator accurately represents the underlying data generative process (DGP), recent work has shown that they perform poorly in the presence of model misspecification. This poses a significant issue for their use in real-world problems, due to simulators always misrepresenting the true DGP to a certain degree. In this paper, we introduce robust variational neural posterior estimation (RVNP), a method which addresses the problem of misspecification in amortised SBI by bridging the simulation-to-reality gap using variational inference and error modelling. We test RVNP on multiple benchmark tasks, including using real data from astronomy, and show that it can recover robust posterior inference in a data-driven manner without adopting hyperparameters or priors governing the misspecification influence.
中文标题/摘要
标题:使用变分方法的鲁棒自适应模拟基于推理
神经密度估计的最新进展使强大的模拟基于推理(SBI)方法得以实现,这些方法可以灵活地近似不可解随机模型的贝叶斯推理。尽管这些方法在模拟器准确表示底层数据生成过程(DGP)时能够可靠地估计后验分布,但最近的研究表明,它们在模型失配的情况下表现不佳。这在实际问题中是一个重大问题,因为模拟器总是以某种程度偏离真正的DGP。在本文中,我们引入了鲁棒变分神经后验估计(RVNP)方法,该方法通过使用变分推理和误差建模来弥合模拟到现实的差距,从而解决自适应SBI中的模型失配问题。我们对多个基准任务进行了测试,包括使用来自天文学的真实数据,并展示了它可以在数据驱动的方式下恢复鲁棒的后验推理,而无需采用控制失配影响的超参数或先验。
Summary / 总结
This paper addresses the issue of model misspecification in simulation-based inference (SBI) methods, which can lead to poor posterior estimation. The authors introduce RVNP, a robust variational neural posterior estimation method that uses variational inference and error modelling to bridge the simulation-to-reality gap. Experimental results on benchmark tasks, including real data from astronomy, demonstrate that RVNP can provide robust posterior inference without relying on hyperparameters or priors for misspecification influence.
本文解决了模拟基于推理(SBI)方法中模型错配导致后验估计不佳的问题。作者提出了RVNP,一种鲁棒的变分神经后验估计方法,通过变分推理和误差建模来弥合模拟与现实之间的差距。实验结果表明,RVNP可以在不需要依赖错配影响的超参数或先验的情况下,提供鲁棒的后验估计。
TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks
Authors: Johannes Kiechle, Stefan M. Fischer, Daniel M. Lang, Cosmin I. Bercea, Matthew J. Nyflot, Lina Felsner, Julia A. Schnabel, Jan C. Peeken
First: 2025-11-12T16:30:34+00:00 · Latest: 2025-12-16T18:46:50+00:00
Comments: Preprint submitted to Medical Image Analysis (MedIA)
Abstract
The sharp rise in medical tomography examinations has created a demand for automated systems that can reliably extract informative features for downstream tasks such as tumor characterization. Although 3D volumes contain richer information than individual slices, effective 3D classification remains difficult: volumetric data encode complex spatial dependencies, and the scarcity of large-scale 3D datasets has constrained progress toward 3D foundation models. As a result, many recent approaches rely on 2D vision foundation models trained on natural images, repurposing them as feature extractors for medical scans with surprisingly strong performance. Despite their practical success, current methods that apply 2D foundation models to 3D scans via slice-based decomposition remain fundamentally limited. Standard slicing along axial, sagittal, and coronal planes often fails to capture the true spatial extent of a structure when its orientation does not align with these canonical views. More critically, most approaches aggregate slice features independently, ignoring the underlying 3D geometry and losing spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. Instead of restricting the model to axial, sagittal, or coronal planes, our method samples both canonical and non-canonical cross-sections generated from uniformly distributed points on a sphere enclosing the volume. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.
中文标题/摘要
标题:TomoGraphView:使用全方位切片表示和图神经网络的3D医学图像分类
医学断层扫描检查的急剧增加催生了对能够可靠提取用于下游任务(如肿瘤表征)的特征的自动化系统的市场需求。尽管3D体素包含比单个切片更多的信息,但有效的3D分类仍然具有挑战性:体素数据编码复杂的空间依赖性,大规模3D数据集的稀缺性限制了3D基础模型的发展。因此,许多最近的方法依赖于在自然图像上训练的2D视觉基础模型,将其重新用于医学扫描的特征提取,表现出令人惊讶的性能。尽管这些方法在实践中取得了成功,但通过切片分解将2D基础模型应用于3D扫描的方法仍然从根本上受到限制。沿轴向、矢状和冠状平面的标准切片往往无法捕捉到结构的真实空间范围,尤其是当其方向不与这些标准视图对齐时。更严重的是,大多数方法独立地聚合切片特征,忽略了底层的3D几何结构,导致切片之间的空间连贯性丧失。为了克服这些限制,我们提出了一种名为TomoGraphView的新框架,该框架将全方位体积切片与基于球面图的特征聚合相结合。我们的方法不仅采样沿包围体积的均匀分布点生成的标准和非标准横截面,而且不限制模型沿轴向、矢状或冠状平面。我们公开分享了我们的可访问代码库http://github.com/compai-lab/2025-MedIA-kiechle,并提供了一个用户友好的全方位体积切片库https://pypi.org/project/OmniSlicer。
Summary / 总结
TomoGraphView is a novel framework for 3D medical image classification that uses omnidirectional slice representations and graph neural networks. It addresses the limitations of traditional slice-based approaches by sampling both canonical and non-canonical cross-sections generated from uniformly distributed points on a sphere enclosing the volume. Slice features are then aggregated over a spherical graph, which is intended to preserve spatial coherence across slices and to capture the true spatial extent of structures that do not align with the axial, sagittal, or coronal planes.
该研究提出了TomoGraphView框架,结合了全向切片表示和图神经网络,用于3D医学图像分类。该方法通过球形图基特征聚合,克服了传统切片方法的限制,能够捕捉到标准和非标准切面。主要发现表明,与传统切片方法相比,该方法在空间连贯性和肿瘤表征准确性方面有所提升。
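The abstract says slicing directions come from uniformly distributed points on a sphere enclosing the volume. One common way to generate approximately uniform points is the Fibonacci lattice, sketched below. Whether TomoGraphView or the OmniSlicer library uses this particular construction is not stated, so it is an assumption, and `fibonacci_sphere` is a hypothetical name rather than part of that library.

```python
import math


def fibonacci_sphere(n):
    """Return n approximately evenly spaced unit vectors on the sphere.

    Each vector could serve as the normal of one slicing plane through the
    volume centre (an assumed use; the paper's exact scheme is not given).
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n  # evenly spaced heights in (-1, 1)
        radius = math.sqrt(1.0 - z * z)  # radius of the circle at height z
        theta = golden_angle * i
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points


normals = fibonacci_sphere(8)
for x, y, z in normals:
    print(f"{x:+.3f} {y:+.3f} {z:+.3f}")
```

Because neighbouring points on such a lattice are roughly equidistant, connecting each point to its nearest neighbours gives a natural graph over slices, which fits the spherical graph aggregation the abstract describes.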
GraphBench: Next-generation graph learning benchmarking
Authors: Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris
First: 2025-12-04T05:30:31+00:00 · Latest: 2025-12-16T18:45:56+00:00
Abstract
Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See www.graphbench.io for further details.
中文标题/摘要
标题:GraphBench:下一代图学习基准测试
图上的机器学习在分子性质预测和芯片设计等多个领域已经取得了令人印象深刻的进展。然而,基准测试实践仍然支离破碎,通常依赖于狭窄的任务特定数据集和不一致的评估协议,这阻碍了可重复性和更广泛的进步。为了解决这个问题,我们引入了GraphBench,这是一个涵盖多个领域和预测任务的综合基准测试套件,包括节点级、边级、图级和生成设置。GraphBench提供了标准化的评估协议——具有一致的数据集划分和考虑离分布泛化的性能指标——以及一个统一的超参数调优框架。此外,我们使用消息传递神经网络和图变换器模型对GraphBench进行了基准测试,提供了原则性的基线并建立了参考性能。更多信息请参见www.graphbench.io。
Summary / 总结
GraphBench is designed to improve benchmarking practices in graph learning by providing a comprehensive suite that covers diverse domains and node-level, edge-level, graph-level, and generative tasks. It includes standardized evaluation protocols, with consistent dataset splits and metrics that account for out-of-distribution generalization, and a unified hyperparameter tuning framework. The authors also evaluate message-passing neural networks and graph transformer models on GraphBench to provide baselines and establish reference performance.
GraphBench旨在通过提供涵盖各种领域和任务的全面套件来改进图上机器学习的基准测试。它包括标准化的评估协议和统一的超参数调整框架。关键实验发现表明,GraphBench为消息传递神经网络和图变压器模型建立了参考性能,从而提高了可重复性和领域内的进展。
VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
Authors: Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Steve Lin, Baining Guo
Venue: NeurIPS 2025
First: 2025-12-16T18:44:00+00:00 · Latest: 2025-12-16T18:44:00+00:00
Comments: NeurIPS 2025 paper. Project webpage: https://www.microsoft.com/en-us/research/project/vasa-3d/
Abstract
We propose VASA-3D, an audio-driven, single-shot 3D head avatar generator. This research tackles two major challenges: capturing the subtle expression details present in real human faces, and reconstructing an intricate 3D head avatar from a single portrait image. To accurately model expression details, VASA-3D leverages the motion latent of VASA-1, a method that yields exceptional realism and vividness in 2D talking heads. A critical element of our work is translating this motion latent to 3D, which is accomplished by devising a 3D head model that is conditioned on the motion latent. Customization of this model to a single image is achieved through an optimization framework that employs numerous video frames of the reference head synthesized from the input image. The optimization takes various training losses robust to artifacts and limited pose coverage in the generated training data. Our experiment shows that VASA-3D produces realistic 3D talking heads that cannot be achieved by prior art, and it supports the online generation of 512x512 free-viewpoint videos at up to 75 FPS, facilitating more immersive engagements with lifelike 3D avatars.
中文标题/摘要
标题:VASA-3D:从单张图像生成逼真音频驱动的高斯头部avatar
我们提出了VASA-3D,一种音频驱动的单张图像3D头部avatar生成器。这项研究解决了两个主要挑战:捕捉真实人类面部存在的微妙表情细节,以及从单张肖像图像中重建复杂的3D头部avatar。为了准确建模表情细节,VASA-3D利用了VASA-1的方法,该方法在2D说话头部中产生了极高的真实感和生动性。我们工作的关键要素是将这种运动潜在变量转换为3D,这通过设计一个条件依赖于运动潜在变量的3D头部模型来实现。通过使用从输入图像合成的参考头部的大量视频帧,将该模型定制到单张图像上,是通过一个优化框架实现的,该框架采用了生成训练数据中抗伪影和姿态覆盖有限的各种训练损失。我们的实验表明,VASA-3D生成了无法通过现有技术实现的逼真3D说话头部,并支持以高达75 FPS的速度在线生成512x512的自由视角视频,促进了与逼真3D avatar的更沉浸式互动。
Summary / 总结
VASA-3D is an audio-driven 3D head avatar generator that addresses the challenges of capturing subtle facial expressions and reconstructing 3D avatars from a single image. It uses the motion latent from VASA-1 to model expression details and translates this latent to 3D by conditioning a 3D head model on it. The model is customized to a single image through an optimization framework using synthesized video frames. Experiments show that VASA-3D generates highly realistic 3D talking heads and supports online generation of 512x512 free-viewpoint videos at up to 75 FPS.
VASA-3D 是一个基于音频的 3D 头部avatar生成器,旨在捕捉细微的表情细节并从单张图像中重建复杂的 3D 头部。它利用 VASA-1 的运动隐含特征来建模表情细节,并通过一个基于运动隐含特征的 3D 头部模型将其转换为 3D。该模型通过使用从输入图像合成的多个视频帧进行优化来针对单张图像进行定制。实验表明,VASA-3D 生成了高度逼真的 3D 说话头部,并支持以高帧率生成高分辨率的自由视角视频。
VIBE: Can a VLM Read the Room?
Authors: Tania Chakraborty, Eylon Caplan, Dan Goldwasser
Venue: EMNLP
First: 2025-06-11T19:07:35+00:00 · Latest: 2025-12-16T18:42:51+00:00
Comments: Findings of EMNLP, 2025
Abstract
Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap, however their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.
中文标题/摘要
标题:VIBE:VLM能否读懂房间里的社交信号?
理解人类社会行为,如识别情绪及其背后的社会动态,是一个重要且具有挑战性的问题。尽管语言模型取得了显著进展,但它们仅限于文本领域,无法解释非言语线索在理解社交情境中的重要作用。视觉语言模型(VLMs)可能能够弥补这一差距,但它们在推理此类社会线索方面的能力尚未受到广泛关注。在本文中,我们探讨了VLM在社会推理方面的能力。我们发现VLM的一个先前未被注意到的局限性:视觉社会-语用推理差距。为解决这一差距,我们为VLM提出了一项新任务:视觉社会-语用推理。我们构建了一个高质量的数据集来测试VLM在该任务上的能力,并在该数据集上对几种VLM进行了基准测试。
Summary / 总结
The paper aims to explore the capabilities of Vision Language Models (VLMs) in social reasoning, particularly in understanding non-verbal cues that are crucial for social dynamics. The authors introduce a new task called Visual Social-Pragmatic Inference to address the Visual Social-Pragmatic Inference gap, and they construct a high-quality dataset to benchmark several VLMs on this task. The key finding is that current VLMs struggle with visual social-pragmatic inference, highlighting the need for improvement in this area.
本文探讨了视觉语言模型(VLM)在社会推理方面的能力,指出了一个被称为视觉社会语用推理差距的局限性。为了解决这一问题,作者提出了一项新任务并构建了一个高质量的数据集来评估VLM。主要实验发现是当前的VLM在视觉社会语用推理方面存在困难,这表明需要在这一领域进行改进。
Beyond Lipschitz Continuity and Monotonicity: Fractal and Chaotic Activation Functions in Echo State Networks
Authors: Rae Chipera, Jenny Du, Irene Tsapara
First: 2025-12-16T18:41:01+00:00 · Latest: 2025-12-16T18:41:01+00:00
Comments: 50 pages, 21 figures. Extended version with full proofs, parameter sweeps, and appendices
Abstract
Contemporary reservoir computing relies heavily on smooth, globally Lipschitz continuous activation functions, limiting applications in defense, disaster response, and pharmaceutical modeling where robust operation under extreme conditions is critical. We systematically investigate non-smooth activation functions, including chaotic, stochastic, and fractal variants, in echo state networks. Through comprehensive parameter sweeps across 36,610 reservoir configurations, we demonstrate that several non-smooth functions not only maintain the Echo State Property (ESP) but outperform traditional smooth activations in convergence speed and spectral radius tolerance. Notably, the Cantor function (continuous everywhere and flat almost everywhere) maintains ESP-consistent behavior up to spectral radii of rho ~ 10, an order of magnitude beyond typical bounds for smooth functions, while achieving 2.6x faster convergence than tanh and ReLU. We introduce a theoretical framework for quantized activation functions, defining a Degenerate Echo State Property (d-ESP) that captures stability for discrete-output functions and proving that d-ESP implies traditional ESP. We identify a critical crowding ratio Q=N/k (reservoir size / quantization levels) that predicts failure thresholds for discrete activations. Our analysis reveals that preprocessing topology, rather than continuity per se, determines stability: monotone, compressive preprocessing maintains ESP across scales, while dispersive or discontinuous preprocessing triggers sharp failures. While our findings challenge assumptions about activation function design in reservoir computing, the mechanism underlying the exceptional performance of certain fractal functions remains unexplained, suggesting fundamental gaps in our understanding of how geometric properties of activation functions influence reservoir dynamics.
中文标题/摘要
标题:超越Lipschitz连续性和单调性:回声状态网络中的分形和混沌激活函数
当前的回声状态网络依赖于平滑的、全局Lipschitz连续的激活函数,这限制了其在防御、灾害响应和制药建模等关键领域中的应用,这些领域需要在极端条件下保持稳健运行。我们系统地研究了非平滑激活函数,包括混沌、随机和分形变体,在回声状态网络中的应用。通过在36,610种回声状态配置中进行全面的参数扫描,我们证明了几种非平滑函数不仅保持了回声状态特性(ESP),而且在收敛速度和谱半径耐受性方面优于传统的平滑激活函数。值得注意的是,康托函数(处处连续但几乎处处平坦)在谱半径ρ~10时仍保持ESP一致的行为,这比平滑函数的典型界限高出一个数量级,同时实现2.6倍于tanh和ReLU的更快收敛速度。我们提出了量化激活函数的理论框架,定义了退化回声状态特性(d-ESP),以捕捉离散输出函数的稳定性,并证明d-ESP意味着传统ESP。我们确定了一个关键的拥挤比Q=N/k(回声状态大小/量化级别),预测了离散激活函数的失效阈值。我们的分析表明,预处理拓扑结构而非连续性本身决定了稳定性:单调、压缩的预处理在不同尺度上保持ESP,而发散或不连续的预处理会触发尖锐的失效。尽管我们的发现挑战了回声状态网络中激活函数设计的假设,但某些分形函数的出色性能背后的机制尚未解释,这表明我们对激活函数的几何特性如何影响回声状态动力学的理解存在根本性缺口。
Summary / 总结
This study investigates non-smooth activation functions in echo state networks, aiming to enhance robustness under extreme conditions. Through extensive parameter sweeps, the research demonstrates that certain non-smooth functions, such as the Cantor function, maintain the Echo State Property and outperform traditional smooth activations in terms of convergence speed and spectral radius tolerance. Notably, the Cantor function achieves 2.6 times faster convergence than tanh and ReLU while maintaining stability up to a spectral radius of rho ~ 10, an order of magnitude beyond typical bounds for smooth functions.
研究系统地探讨了混沌、随机和分形等非光滑激活函数在回声状态网络中的应用,以增强其在极端条件下的鲁棒性。通过广泛的参数扫描,研究发现某些非光滑函数,如康托函数,不仅保持了回声状态性质(ESP),还在收敛速度和谱半径容限方面优于传统光滑激活函数。特别是康托函数的收敛速度比tanh和ReLU快2.6倍,且在谱半径rho ~ 10时仍保持ESP,这一数值远超光滑函数的典型界限。
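The Echo State Property discussed in this entry can be illustrated with a small numerical check: two reservoirs started from different states but driven by the same input should converge to the same trajectory. The following is a minimal sketch under stated assumptions, not the authors' experimental setup. It uses plain `tanh`, and it rescales the recurrent matrix by its maximum absolute row sum (a stricter, easily computed sufficient condition) rather than by the spectral radius the abstract refers to. The function names, reservoir size, and input signal are all hypothetical. `crowding_ratio` simply evaluates the abstract's Q = N/k.

```python
import math
import random


def make_reservoir(n, target_norm=0.9, seed=0):
    """Random n x n recurrent matrix rescaled so its max absolute row sum is target_norm."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    norm = max(sum(abs(v) for v in row) for row in w)
    return [[v * target_norm / norm for v in row] for row in w]


def step(w, w_in, state, u, act=math.tanh):
    """One reservoir update: x' = act(W x + w_in * u)."""
    return [
        act(sum(wij * xj for wij, xj in zip(row, state)) + wi * u)
        for row, wi in zip(w, w_in)
    ]


def state_gaps(n=20, steps=50, seed=0):
    """Sup-norm distance between two trajectories that share the same input."""
    rng = random.Random(seed + 1)
    w = make_reservoir(n, seed=seed)
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # first initial state
    y = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # second initial state
    gaps = []
    for t in range(steps):
        u = math.sin(0.3 * t)  # hypothetical scalar driving input
        x, y = step(w, w_in, x, u), step(w, w_in, y, u)
        gaps.append(max(abs(a - b) for a, b in zip(x, y)))
    return gaps


def crowding_ratio(n, k):
    """Q = N / k: reservoir size divided by the number of quantization levels."""
    return n / k
```

Because `tanh` is 1-Lipschitz and the rescaled matrix contracts the sup norm by a factor of 0.9, the gap shrinks by at least that factor at every step. Swapping `act` for a non-smooth function removes this guarantee, which is the regime the paper studies empirically.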
COMMA: A Communicative Multimodal Multi-Agent Benchmark
Authors: Timothy Ossowski, Danyal Maqbool, Jixuan Chen, Zefan Cai, Tyler Bradshaw, Junjie Hu
Venue: Transactions on Machine Learning Research, 2025
First: 2024-10-10T02:49:47+00:00 · Latest: 2025-12-16T18:36:40+00:00
Abstract
The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain of thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.
中文标题/摘要
标题:COMMA:一种交流多模态多智能体基准
基于大型基础模型构建的多模态智能体的迅速发展很大程度上忽视了它们在协作任务中通过语言进行智能体间交流的潜力。这一忽视在理解其在实际部署中的有效性方面,尤其是在与人类交流时,形成了一个关键缺口。现有的智能体基准未能解决智能体间交流和协作的关键方面,特别是在智能体信息获取不平等且必须合作完成超出个体能力范围的任务的场景中。为填补这一缺口,我们引入了COMMA:一种新型的谜题基准,旨在通过语言交流评估多模态多智能体系统的协作性能。我们的基准包括多种多模态谜题,提供了在交流协作环境中对智能体能力的四个关键方面的全面评估。我们的研究结果揭示了最先进的模型,包括强大的专有模型GPT-4o和推理模型o4-mini,存在令人惊讶的弱点。许多链式推理模型如R1-Onevision和LLaVA-CoT在智能体间合作中难以超越随机基线,表明它们在交流能力方面存在潜在的增长空间。
Summary / 总结
The paper introduces COMMA, a benchmark for evaluating the collaborative performance of multimodal multi-agent systems through language communication. It addresses the gap in existing benchmarks by focusing on scenarios where agents have unequal information and must collaborate. Key findings show that state-of-the-art models, including GPT-4o and reasoning models like o4-mini, exhibit significant weaknesses in agent-agent communication, with many reasoning models performing no better than random baselines.
论文介绍了COMMA基准,用于评估多模态多智能体系统在语言通信下的协作性能。它弥补了现有基准的不足,重点关注智能体信息不均等且需要协作的场景。主要发现表明,包括GPT-4o和推理模型o4-mini在内的先进模型在智能体间通信方面存在显著缺陷,许多推理模型的表现甚至不如随机基线。
ART: Articulated Reconstruction Transformer
Authors: Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong
First: 2025-12-16T18:35:23+00:00 · Latest: 2025-12-16T18:35:23+00:00
Comments: Project Page: https://kyleleey.github.io/ART/
Abstract
We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.
中文标题/摘要
标题:ART: 关节可动重建变换器
我们引入了ART(关节可动重建变换器)——一种不依赖于特定类别的、前馈模型,可以从稀疏的多状态RGB图像中重建完整的3D关节可动物体。之前的方法要么依赖于缓慢且易碎的跨状态对应关系的优化,要么使用仅限于特定物体类别的前馈模型。相比之下,ART 将关节可动物体视为刚性部件的装配体,将重建问题表述为基于部件的预测。我们新设计的变换器架构将稀疏图像输入映射到一组可学习的部件槽位,ART 从中联合解码各个部件的统一表示,包括它们的3D几何形状、纹理和明确的关节参数。生成的重建结果具有物理可解释性,并且可以直接导出用于模拟。ART 在大规模、多样化的数据集上进行训练,每个部件都有监督,并在多种基准测试中进行评估,其性能显著优于现有基线,并在基于图像输入的关节可动物体重建中建立了新的状态。
Summary / 总结
The research introduces ART, a category-agnostic feed-forward model that reconstructs complete 3D articulated objects from sparse, multi-state RGB images. Unlike previous methods that rely on slow optimization or are limited to specific categories, ART treats articulated objects as assemblies of rigid parts and formulates reconstruction as part-based prediction. The model uses a transformer architecture to map sparse image inputs to learnable part slots, jointly decoding unified representations for individual parts including their 3D geometry, texture, and articulation parameters. ART outperforms existing baselines and sets a new state of the art in articulated object reconstruction from image inputs.
研究动机是开发一种能够从稀疏的多状态RGB图像中重建完整3D articulated物体的无类别模型。ART,Articulated Reconstruction Transformer,使用变压器架构将稀疏图像映射到可学习的部分槽位,并联合解码3D几何、纹理和关节参数。ART在现有方法中表现出色,并在从图像重建articulated物体方面建立了新的状态-of-the-art。
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Authors: Zechen Bai, Chen Gao, Mike Zheng Shou
First: 2025-12-16T18:26:38+00:00 · Latest: 2025-12-16T18:26:38+00:00
Comments: 15 pages
Abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, which is akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet remain fundamentally limited by Supervised Finetuning (SFT): requiring hundreds of demonstrations per task, rigidly memorizing trajectories, and failing to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this through a learned progress estimator providing dense feedback, and critically, we design our framework to "tame" this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism smoothing noisy point-wise estimates, and (2) a progressive horizon extension strategy enabling gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and cross-task generalization, with 20.8% success on unseen tasks without training on task-specific demonstrations (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent in demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvement.
中文标题/摘要
标题:EVOLVE-VLA:通过环境反馈进行测试时训练的视觉-语言-行动模型
实现真正适应性的体态智能需要能够通过不断与环境互动来改进的代理,而不仅仅是通过模仿静态演示来学习,这类似于人类通过练习掌握技能的方式。视觉-语言-行动(VLA)模型通过利用大型语言模型推进了机器人的操作,但仍然从根本上受到监督微调(SFT)的限制:需要每项任务数百次演示,僵化地记忆轨迹,并且在部署条件与训练条件不一致时无法适应。我们提出了EVOLVE-VLA,这是一种测试时训练框架,使VLA能够通过与环境的互动进行连续适应,而无需或几乎无需特定任务的演示。关键技术挑战是用自主反馈替换测试时不可用的先验奖励信号。我们通过一个学习的进步估计器提供密集反馈来解决这一问题,并且关键地,我们设计了框架来“驯服”这种固有的噪声信号,通过两种机制:(1)累积进步估计机制平滑噪声的点估计,(2)渐进的视野扩展策略,使策略逐步进化。EVOLVE-VLA实现了显著的改进:在长时任务上提高了8.6%,在1次学习中提高了22.0%,并且实现了跨任务泛化——在没有特定任务演示的情况下训练,实现了20.8%的成功率(而纯SFT为0%)。定性分析表明,演示中不存在的新兴能力,包括错误恢复和新颖策略。这项工作代表了向真正学习和适应的VLA迈出的关键一步,从静态模仿转向持续自我改进。
Summary / 总结
EVOLVE-VLA is a test-time training framework for Vision-Language-Action models that enables continuous adaptation through environmental interaction, reducing the need for task-specific demonstrations. It uses a learned progress estimator to provide dense feedback and two mechanisms to handle the noise in that signal: accumulative progress estimation and progressive horizon extension. Key results include gains of 8.6% on long-horizon tasks and 22.0% in 1-shot learning, plus cross-task generalization: 20.8% success on unseen tasks without task-specific demonstrations, versus 0% for pure SFT. Qualitative analysis also shows emergent capabilities not seen in demonstrations, such as error recovery.
EVOLVE-VLA 是一种测试时训练框架,使视觉-语言-行动模型能够通过环境交互持续适应,减少对特定任务演示的需求。它使用一个学习进度估计器提供密集反馈,并通过两种机制处理噪声信号:累积进度估计和渐进时间扩展。关键结果包括在长时任务上提高了8.6%的表现,在单次学习上提高了22.0%,并且在没有特定任务演示的情况下实现了跨任务泛化(20.8%的成功率,而纯监督微调为0%)。
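The two noise-handling mechanisms named in this entry can be pictured with a toy example. The abstract does not give their formulas, so the sketch below substitutes generic stand-ins: an exponential moving average for "accumulative" smoothing of point-wise progress estimates, and a geometric schedule for the rollout horizon. Every name and parameter here (`alpha`, `growth`, the noise level) is an assumption for illustration, not the authors' method.

```python
import random
import statistics


def accumulate_progress(pointwise, alpha=0.2):
    """Smooth noisy per-step progress estimates with an exponential moving average.

    Stand-in for an accumulative estimator; the paper's exact rule is not given here.
    """
    smoothed = []
    running = pointwise[0]
    for value in pointwise:
        running = (1 - alpha) * running + alpha * value
        smoothed.append(running)
    return smoothed


def horizon_schedule(start, maximum, growth, stages):
    """Rollout horizon per training stage: grows geometrically, capped at maximum."""
    horizons = []
    h = start
    for _ in range(stages):
        horizons.append(min(int(h), maximum))
        h *= growth
    return horizons


def jitter(values):
    """Standard deviation of step-to-step changes, a simple measure of noisiness."""
    return statistics.pstdev([b - a for a, b in zip(values, values[1:])])


def demo(seed=0, steps=200):
    """Compare jitter of raw and smoothed estimates on a synthetic linear task."""
    rng = random.Random(seed)
    true_progress = [t / (steps - 1) for t in range(steps)]
    noisy = [p + rng.gauss(0.0, 0.1) for p in true_progress]  # hypothetical noise level
    return jitter(noisy), jitter(accumulate_progress(noisy))
```

In this toy setting the smoothed signal changes far less erratically from step to step than the raw one, which is the property a reward signal needs before it can drive policy updates. The schedule starts with short rollouts and lengthens them stage by stage.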
Enhancing Visual Sentiment Analysis via Semiotic Isotopy-Guided Dataset Construction
Authors: Marco Blanchini, Giovanna Maria Dimitri, Benedetta Tondi, Tarcisio Lancioni, Mauro Barni
First: 2025-12-16T18:26:22+00:00 · Latest: 2025-12-16T18:26:22+00:00
Abstract
Visual Sentiment Analysis (VSA) is a challenging task due to the vast diversity of emotionally salient images and the inherent difficulty of acquiring sufficient data to capture this variability comprehensively. Key obstacles include building large-scale VSA datasets and developing effective methodologies that enable algorithms to identify emotionally significant elements within an image. These challenges are reflected in the limited generalization performance of VSA algorithms and models when trained and tested across different datasets. Starting from a pool of existing data collections, our approach enables the creation of a new larger dataset that not only contains a wider variety of images than the original ones, but also permits training new models with improved capability to focus on emotionally relevant combinations of image elements. This is achieved through the integration of the semiotic isotopy concept within the dataset creation process, providing deeper insights into the emotional content of images. Empirical evaluations show that models trained on a dataset generated with our method consistently outperform those trained on the original data collections, achieving superior generalization across major VSA benchmarks.
中文标题/摘要
标题:通过符号同构指导的数据集构建增强视觉情感分析
视觉情感分析(VSA)是一项具有挑战性的任务,因为情感显著图像的多样性极大,且获取足够数据以全面捕捉这种多样性具有内在难度。主要障碍包括构建大规模VSA数据集和开发有效的方法,使算法能够识别图像中的情感相关元素。这些挑战反映在当VSA算法和模型在不同数据集上进行训练和测试时,其泛化性能有限。从现有的数据集合中,我们的方法能够创建一个更大的新数据集,不仅包含比原始数据更广泛的图像种类,还允许训练具有更强能力的新模型,专注于情感相关的图像元素组合。这通过在数据集构建过程中整合符号同构概念得以实现,提供了对图像情感内容更深入的理解。实证评估表明,使用我们方法生成的数据集训练的模型在主要VSA基准测试中始终优于使用原始数据集训练的模型,实现了更好的泛化性能
Summary / 总结
The paper addresses the challenge of Visual Sentiment Analysis (VSA) by proposing a method to enhance dataset construction using semiotic isotopy. This method creates a larger and more diverse dataset, improving the ability of models to identify emotionally significant elements in images. Experimental results demonstrate that models trained on this new dataset perform better and generalize more effectively across various VSA benchmarks compared to those trained on original datasets.
研究旨在通过解决数据多样性和模型泛化能力的问题来提升视觉情感分析(VSA)。方法利用语义等价性构建了一个更大且更多样化的数据集,这提高了模型识别情感相关图像元素的能力。实验结果表明,使用这种方法构建的数据集训练的模型在各种VSA基准测试中的表现更好,泛化能力更强,优于使用原始数据集训练的模型。
WaveSim: A Wavelet-based Multi-scale Similarity Metric for Weather and Climate Fields
Authors: Gabriele Accarino, Viviana Acquaviva, Sara Shamekh, Duncan Watson-Parris, David Lawrence
First: 2025-12-16T18:15:53+00:00 · Latest: 2025-12-16T18:15:53+00:00
Abstract
We introduce WaveSim, a multi-scale similarity metric for the evaluation of spatial fields in weather and climate applications. WaveSim exploits wavelet transforms to decompose input fields into scale-specific wavelet coefficients. The metric is built by multiplying three orthogonal components derived from these coefficients: Magnitude, which quantifies similarities in the energy distribution of the coefficients, i.e., the intensity of the field; Displacement, which captures spatial shift by comparing the centers of mass of normalized energy distributions; and Structure, which assesses pattern organization independent of location and amplitude. Each component yields a scale-specific similarity score ranging from 0 (no similarity) to 1 (perfect similarity), which are then combined across scales to produce an overall similarity measure. We first evaluate WaveSim using synthetic test cases, applying controlled spatial and temporal perturbations to systematically assess its sensitivity and expected behavior. We then demonstrate its applicability to physically relevant case studies of key modes of climate variability in Earth System Models. Traditional point-wise metrics lack a mechanism for attributing errors to physical scales or modes of dissimilarity. By operating in the wavelet domain and decomposing the signal along independent axes, WaveSim bypasses these limitations and provides an interpretable and diagnostically rich framework for assessing similarity in complex fields. Additionally, the WaveSim framework allows users to place emphasis on a specific scale or component, and lends itself to user-specific model intercomparison, model evaluation, and calibration and training of forecasting systems. We provide a PyTorch-ready implementation of WaveSim, along with all evaluation scripts, at: https://github.com/gabrieleaccarino/wavesim.
中文标题/摘要
标题:WaveSim:基于小波的多尺度相似性度量方法及其在天气和气候场中的应用
我们介绍了WaveSim,一种用于天气和气候应用中空间场评估的多尺度相似性度量方法。WaveSim 利用小波变换将输入场分解为特定尺度的小波系数。该度量方法由从这些系数中导出的三个正交组件组成:幅度,量化系数能量分布相似性,即场的强度;位移,通过比较归一化能量分布的质心来捕捉空间位移;结构,独立于位置和幅度评估模式组织。每个组件都产生一个从0(无相似性)到1(完美相似性)的特定尺度相似性评分,然后在不同尺度上结合生成总体相似性度量。我们首先使用合成测试案例评估WaveSim,通过系统地施加控制的空间和时间扰动来评估其敏感性和预期行为。然后,我们展示了其在地球系统模型中关键气候变率模式的实际应用。传统的点度量缺乏将误差归因于物理尺度或不相似模式的机制。通过在小波域操作并沿独立轴分解信号,WaveSim 跳过了这些限制,提供了一个可解释且诊断丰富的框架来评估复杂场中的相似性。此外,WaveSim框架允许用户强调特定的尺度或组件,并适用于用户特定的模型比较、模型评估、以及预报系统的校准和训练。我们提供了一个PyTorch兼容的WaveSim实现,以及所有评估脚本:https://github.com/gabrieleaccarino/wavesim。
Summary / 总结
WaveSim is a multi-scale similarity metric for evaluating spatial fields in weather and climate applications. It uses wavelet transforms to decompose fields into scale-specific coefficients and evaluates similarity through three orthogonal components: Magnitude, Displacement, and Structure. The method provides scale-specific similarity scores and combines them to produce an overall similarity measure. WaveSim demonstrates sensitivity to controlled perturbations and is applicable to climate variability case studies, offering a rich and interpretable framework for model evaluation and forecasting system training.
WaveSim 是一种多尺度相似性度量,用于评估天气和气候应用中的空间场。它通过小波变换将字段分解为尺度特定的系数,然后通过三个组件进行评估:幅度、位移和结构。该度量提供尺度特定的评分并将其结合以给出总体相似性度量。评估显示 WaveSim 对受控扰动的敏感性及其在气候变异性研究中的适用性,通过在小波域操作并允许尺度特定分析,解决了传统点度量的局限性。
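To make the component structure of this entry concrete, here is a one-dimensional, single-scale sketch of two of the three components, Magnitude and Displacement, combined by multiplication as the abstract describes. The scoring formulas are assumptions chosen only so that each score lies in [0, 1] and equals 1 for identical fields. They are not taken from the paper or from the authors' PyTorch implementation. The Structure component and the cross-scale combination are omitted, and both fields are assumed to have even length and nonzero detail energy.

```python
import math


def haar_detail(field):
    """Single-level Haar detail coefficients of an even-length 1D field."""
    return [(field[i] - field[i + 1]) / math.sqrt(2.0) for i in range(0, len(field), 2)]


def energy(coeffs):
    """Per-coefficient energy (squared magnitude)."""
    return [c * c for c in coeffs]


def center_of_mass(e):
    """Index-weighted mean of a non-negative energy distribution."""
    total = sum(e)
    return sum(i * v for i, v in enumerate(e)) / total


def magnitude_score(ea, eb):
    """Hypothetical magnitude similarity: 1 when total energies match, lower otherwise."""
    ta, tb = sum(ea), sum(eb)
    return 2.0 * math.sqrt(ta * tb) / (ta + tb)


def displacement_score(ea, eb):
    """Hypothetical displacement similarity from the distance between centers of mass."""
    span = len(ea) - 1
    return 1.0 - abs(center_of_mass(ea) - center_of_mass(eb)) / span


def similarity(a, b):
    """Return (magnitude, displacement, product) for two 1D fields at one scale."""
    ea, eb = energy(haar_detail(a)), energy(haar_detail(b))
    m = magnitude_score(ea, eb)
    d = displacement_score(ea, eb)
    return m, d, m * d
```

Shifting a bump without changing its amplitude lowers only the displacement score, while rescaling it in place lowers only the magnitude score. That separation of error sources is what the abstract means by decomposing the signal along independent axes.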
ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Authors: Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li
First: 2025-12-16T18:13:54+00:00 · Latest: 2025-12-16T18:13:54+00:00
Comments: Code is available at https://github.com/Leon-LihongWang/ViRC
Abstract
CoT has significantly enhanced the reasoning ability of LLMs, but it faces challenges when extended to multimodal domains, particularly in mathematical tasks. Existing MLLMs typically perform textual reasoning solely from a single static mathematical image, overlooking dynamic visual acquisition during reasoning. In contrast, humans repeatedly examine the visual image and employ step-by-step reasoning to prove intermediate propositions. This strategy of decomposing the problem-solving process into key logical nodes adheres to Miller's Law in cognitive science. Inspired by this insight, we propose a ViRC framework for multimodal mathematical tasks, introducing a Reason Chunking mechanism that structures multimodal mathematical CoT into consecutive Critical Reasoning Units (CRUs) to simulate human expert problem-solving patterns. CRUs ensure intra-unit textual coherence for intermediate proposition verification while integrating visual information across units to generate subsequent propositions and support structured reasoning. To this end, we present the CRUX dataset, built using three visual tools and four reasoning patterns to provide explicitly annotated CRUs across multiple reasoning paths for each mathematical problem. Leveraging the CRUX dataset, we propose a progressive training strategy inspired by human cognitive learning, which includes Instructional SFT, Practice SFT, and Strategic RL, aimed at further strengthening the Reason Chunking ability of the model. The resulting ViRC-7B model achieves an 18.8% average improvement over baselines across multiple mathematical benchmarks. Code is available at https://github.com/Leon-LihongWang/ViRC.
中文标题/摘要
标题:ViRC:增强视觉交错数学推理链的推理片段化
推理能力显著提升了LLMs的推理能力,但在扩展到多模态领域时,特别是在数学任务中,它面临着挑战。现有的多模态LLMs通常仅从单一静态数学图像中进行文本推理,忽视了推理过程中的动态视觉获取。相比之下,人类会反复检查视觉图像,并采用逐步推理来证明中间命题。这种将问题解决过程分解为关键逻辑节点的方法遵循了认知科学中的米勒定律。受此启发,我们提出了一种ViRC框架,用于多模态数学任务,引入了一种推理片段化机制,将多模态数学推理链结构化为连续的关键推理单元(CRUs),以模拟人类专家的问题解决模式。CRUs确保了单元内的文本连贯性,用于中间命题验证,同时整合跨单元的视觉信息以生成后续命题并支持结构化推理。为此,我们使用三种视觉工具和四种推理模式,通过CRUX数据集提供了每个数学问题的多条推理路径上的显式标注CRUs。利用CRUX数据集,我们提出了一种受人类认知学习启发的渐进式训练策略,包括指令性SFT、练习性SFT和策略性RL,旨在进一步增强模型的推理片段化能力。最终,ViRC-7B模型在多个数学基准测试中平均提高了18.8%。代码可在https://github.com/Leon-LihongWang/ViRC获取。
Summary / 总结
The paper addresses the challenge of extending CoT to multimodal domains, especially for mathematical tasks, by proposing a ViRC framework with a Reason Chunking mechanism. This mechanism decomposes the problem-solving process into consecutive Critical Reasoning Units (CRUs) to simulate human reasoning patterns. The CRUX dataset, which provides explicitly annotated CRUs, is used to train the model with a progressive strategy of Instructional SFT, Practice SFT, and Strategic RL. The ViRC-7B model shows an 18.8% average improvement over baselines on multiple mathematical benchmarks.
研究旨在通过解决现有方法仅关注静态图像的问题,增强LLMs在多模态数学任务中的推理能力。提出的ViRC框架引入了Reason Chunking机制,将推理过程分解为连续的关键推理单元(CRUs),以模拟人类的解题模式。模型使用一个渐进训练策略和新创建的CRUX数据集进行训练,结果在多个数学基准测试中平均提高了18.8%。
AI-GenBench: A New Ongoing Benchmark for AI-Generated Image Detection
Authors: Lorenzo Pellegrini, Davide Cozzolino, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Luisa Verdoliva, Marco Prati, Marco Ramilli
First: 2025-04-29T15:41:13+00:00 · Latest: 2025-12-16T18:11:21+00:00
Comments: Accepted at Verimedia workshop, IJCNN 2025. 9 pages, 6 figures, 4 tables, code available: https://github.com/MI-BioLab/AI-GenBench
Abstract
The rapid advancement of generative AI has revolutionized image creation, enabling high-quality synthesis from text prompts while raising critical challenges for media authenticity. We present Ai-GenBench, a novel benchmark designed to address the urgent need for robust detection of AI-generated images in real-world scenarios. Unlike existing solutions that evaluate models on static datasets, Ai-GenBench introduces a temporal evaluation framework where detection methods are incrementally trained on synthetic images, historically ordered by their generative models, to test their ability to generalize to new generative models, such as the transition from GANs to diffusion models. Our benchmark focuses on high-quality, diverse visual content and overcomes key limitations of current approaches, including arbitrary dataset splits, unfair comparisons, and excessive computational demands. Ai-GenBench provides a comprehensive dataset, a standardized evaluation protocol, and accessible tools for both researchers and non-experts (e.g., journalists, fact-checkers), ensuring reproducibility while maintaining practical training requirements. By establishing clear evaluation rules and controlled augmentation strategies, Ai-GenBench enables meaningful comparison of detection methods and scalable solutions. Code and data are publicly available to ensure reproducibility and to support the development of robust forensic detectors to keep pace with the rise of new synthetic generators.
中文标题/摘要
标题:AI-GenBench:一种新的持续基准测试,用于检测AI生成的图像
生成式AI的迅速发展已经彻底改变了图像创作,从文本提示生成高质量合成图像,同时也带来了媒体真实性方面的关键挑战。我们提出了AI-GenBench,这是一种新型基准测试,旨在解决在现实世界场景中检测AI生成图像的迫切需求。与现有解决方案仅在静态数据集上评估模型不同,AI-GenBench 引入了一种时间评估框架,其中检测方法在按其生成模型历史顺序排列的合成图像上逐步训练,以测试其在面对新生成模型(例如从GAN到扩散模型的过渡)时的泛化能力。我们的基准测试专注于高质量、多样化的视觉内容,并克服了当前方法的关键局限性,包括任意的数据集划分、不公平的比较以及过高的计算需求。AI-GenBench 提供了一个全面的数据集、标准化的评估协议和易于使用的工具(例如记者、事实核查员),确保可再现性同时保持实际的训练需求。通过建立明确的评估规则和受控的增强策略,AI-GenBench 使检测方法的比较具有意义,并支持可扩展的解决方案。代码和数据已公开,以确保可再现性并支持开发与新合成生成器同步的稳健法医检测器。
Summary / 总结
AI-GenBench is a new ongoing benchmark for detecting AI-generated images, addressing the challenge of media authenticity in the era of generative AI. Instead of evaluating on static datasets, it uses a temporal evaluation framework: detection methods are trained incrementally on synthetic images ordered historically by their generative models, which tests how well they generalize to newer generators (for example, the shift from GANs to diffusion models). The benchmark provides a dataset, a standardized evaluation protocol, and accessible tools, and is designed to avoid arbitrary dataset splits, unfair comparisons, and excessive computational demands.
论文介绍了Ai-GenBench,这是一个新的用于检测AI生成图像的基准,旨在应对生成式AI时代媒体真实性的问题。Ai-GenBench采用了一种时间上的评估框架,逐步训练检测方法以处理不同模型生成的合成图像。主要发现包括对新生成模型的更好泛化能力和一个全面的数据集,支持可重复和可扩展的AI生成图像法医检测解决方案。
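The temporal evaluation idea in this entry can be sketched as a simple split generator: at each step the detector has seen every generator up to some point in the historical order and is scored on the generators that come next. This is an illustrative reading of the abstract, not the benchmark's actual protocol or API. The function name, the `window` parameter, and the placeholder generator labels are all hypothetical.

```python
def temporal_splits(generators_in_order, window=1):
    """Build (seen, upcoming) pairs for incremental training and evaluation.

    generators_in_order: generator labels sorted from oldest to newest.
    window: how many not-yet-seen generators to evaluate on at each step.
    """
    splits = []
    for step in range(1, len(generators_in_order)):
        seen = generators_in_order[:step]                   # used for training so far
        upcoming = generators_in_order[step:step + window]  # held out, strictly newer
        splits.append((seen, upcoming))
    return splits


# Usage with placeholder labels (not real benchmark entries):
for seen, upcoming in temporal_splits(["gen_a", "gen_b", "gen_c"]):
    print("train on", seen, "-> evaluate on", upcoming)
```

Unlike a random split, no image from a newer generator can leak into training for an earlier step, so the score at each step reflects generalization to generators the detector has not encountered.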
Understanding Sampler Stochasticity in Training Diffusion Models for RLHF
Authors: Jiayuan Sheng, Hanyang Zhao, Haoxian Chen, David D. Yao, Wenpin Tang
First: 2025-10-12T19:08:38+00:00 · Latest: 2025-12-16T18:10:07+00:00
Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
中文标题/摘要
标题:理解采样随机性在训练RLHF扩散模型中的作用
从人类反馈中进行强化学习(RLHF)越来越多地用于微调扩散模型,但一个关键挑战是训练中使用的随机采样器与推理中使用的确定性采样器之间的不匹配。实践中,模型使用随机SDE采样器进行微调以鼓励探索,而推理通常依赖于确定性ODE采样器以提高效率和稳定性。这种差异导致了奖励差距,引发了关于推理期间是否能期望高质量输出的担忧。在本文中,我们从理论上描述了这种奖励差距,并为通用扩散模型提供了非空的界,同时为发散爆炸(VE)和方差保持(VP)高斯模型提供了更尖锐的收敛速率。方法上,我们采用广义去噪扩散隐式模型(gDDIM)框架,以支持任意高的随机性,同时在整个过程中保持数据边缘。通过大规模实验,我们的发现验证了在使用去噪扩散策略优化(DDPO)和混合组相对策略优化(MixGRPO)的文本到图像模型中,奖励差距在训练过程中持续缩小,当模型使用更高随机性的SDE训练更新时,ODE采样质量也得到了提高。
Summary / 总结
This paper addresses the challenge of reward gaps in Reinforcement Learning from Human Feedback (RLHF) when training diffusion models. It theoretically characterizes the reward gap between stochastic samplers used during training and deterministic samplers used during inference, providing non-vacuous bounds for general diffusion models and sharper convergence rates for specific models. Empirically, the study shows that reward gaps consistently narrow over training, and ODE sampling quality improves when models are trained with higher-stochasticity SDE samplers.
本文探讨了在强化学习从人类反馈(RLHF)中训练扩散模型时,由训练中使用随机采样器和推理中使用确定性采样器导致的奖励差距问题。作者通过理论分析了这种奖励差距,并提出了使用广义去噪扩散隐式模型(gDDIM)框架来支持高程度的随机性,从而缩小奖励差距。大规模实验结果表明,在使用更高随机性采样器进行训练时,奖励差距在训练过程中会持续减少,推理质量也会得到提升。
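The gap between stochastic training samplers and deterministic inference samplers is easiest to see in a sampler with an explicit stochasticity knob. The sketch below uses the standard DDIM update with its `eta` parameter, where `eta = 0` gives the deterministic sampler and `eta = 1` matches DDPM-level noise. It stands in for the paper's gDDIM framework, which generalizes this idea and is not reproduced here. The scalar state, the fixed noise prediction, and the schedule values are assumptions for illustration.

```python
import math
import random


def ddim_sigma(abar_t, abar_prev, eta):
    """Per-step noise scale of the standard DDIM update (eta = 0 is deterministic)."""
    return eta * math.sqrt((1 - abar_prev) / (1 - abar_t)) * math.sqrt(1 - abar_t / abar_prev)


def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta, rng):
    """One reverse step from cumulative alpha abar_t to abar_prev on a scalar state.

    eps_pred is the model's noise prediction; here it is supplied by the caller.
    """
    sigma = ddim_sigma(abar_t, abar_prev, eta)
    if sigma ** 2 > 1 - abar_prev:
        raise ValueError("eta too large for this step in the standard DDIM update")
    # Predicted clean sample implied by the noise prediction.
    x0_pred = (x_t - math.sqrt(1 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Deterministic direction back toward x_t, shrunk to make room for fresh noise.
    direction = math.sqrt(1 - abar_prev - sigma ** 2) * eps_pred
    noise = rng.gauss(0.0, 1.0) if sigma > 0 else 0.0
    return math.sqrt(abar_prev) * x0_pred + direction + sigma * noise
```

With `eta = 0` the step ignores the random generator entirely, so repeated calls agree. With `eta > 0` each call injects fresh Gaussian noise, which is the exploration that RLHF fine-tuning relies on and that inference then drops.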
Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-guided Subtyping and Lesion-Wise Model Ensemble
Authors: Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru
Venue: MICCAI
First: 2025-12-16T18:09:48+00:00 · Latest: 2025-12-16T18:09:48+00:00
Comments: 12 pages, 5 figures, 3 tables. Algorithm presented at MICCAI BraTS 2025
Abstract
Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.
中文标题/摘要
标题:适用于多种脑肿瘤的可适应分割流水线:基于影像组学亚型指导和病灶级模型集成
在多参数磁共振成像(MRI)上对脑肿瘤进行稳健且通用的分割仍然困难,因为肿瘤类型差异很大。BraTS 2025 Lighthouse 挑战赛在成人和儿童肿瘤的多样化高质量数据集上评估分割方法:国际儿科脑肿瘤分割(PED)、术前脑膜瘤肿瘤分割(MEN)、脑膜瘤放疗分割(MEN-RT)以及术前和术后脑转移瘤分割(MET)。我们提出了一种灵活、模块化且可适应的流水线,通过选择和组合最先进的模型,并在训练前后应用肿瘤和病灶特定的处理,以提高分割性能。从MRI中提取的影像组学特征有助于检测肿瘤亚型,确保训练更平衡。自定义病灶级性能指标确定每个模型在集成中的影响,并优化进一步细化预测的后处理,使工作流程能够针对每个病例调整每一步。在BraTS测试集中,我们的流水线在多个挑战中实现了与顶级算法相当的性能。这些发现证实,自定义病灶感知处理和模型选择可以产生稳健的分割,但不会将方法锁定到特定的网络架构中。我们的方法有可能在临床实践中用于定量肿瘤测量,支持诊断和预后。
Summary / 总结
The research aims to develop a flexible segmentation pipeline for diverse brain tumors using multi-parametric MRI. The method involves selecting and combining state-of-the-art models, applying tumor- and lesion-specific processing, and using radiomic features to detect tumor subtypes. Key findings show that the pipeline achieves performance comparable to top-ranked algorithms in multiple challenges, demonstrating the effectiveness of custom lesion-aware processing and model selection without being tied to a specific network architecture.
研究通过开发一个灵活的管道,结合最先进的模型和肿瘤特异性处理,以解决在多参数MRI上对多样脑肿瘤进行稳健分割的难题。使用放射学特征来指导肿瘤亚型分类并确保训练平衡。自定义病灶级性能指标确定每个模型在集成中的影响,并优化后处理以进一步细化预测。该管道在多个BraTS挑战中取得了竞争力的表现,证明了自定义病灶感知处理和模型选择的有效性,而不局限于特定的网络架构。
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Authors: Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu
First: 2025-12-08T18:57:26+00:00 · Latest: 2025-12-16T18:04:34+00:00
Abstract
Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual representations, either by aligning them inside VAEs or directly within the generative model. However, adapting such representations remains challenging due to fundamental mismatches between understanding-oriented features and generation-friendly latent spaces. Representation encoders benefit from high-dimensional latents that capture diverse hypotheses for masked regions, whereas generative models favor low-dimensional latents that must faithfully preserve injected noise. This discrepancy has led prior work to rely on complex objectives and architectures. In this work, we propose FAE (Feature Auto-Encoder), a simple yet effective framework that adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer, while retaining sufficient information for both reconstruction and understanding. The key is to couple two separate deep decoders: one trained to reconstruct the original feature space, and a second that takes the reconstructed features as input for image generation. FAE is generic; it can be instantiated with a variety of self-supervised encoders (e.g., DINO, SigLIP) and plugged into two distinct generative families: diffusion models and normalizing flows. Across class-conditional and text-to-image benchmarks, FAE achieves strong performance. For example, on ImageNet 256x256, our diffusion model with CFG attains a near state-of-the-art FID of 1.29 (800 epochs) and 1.70 (80 epochs). Without CFG, FAE reaches the state-of-the-art FID of 1.48 (800 epochs) and 2.08 (80 epochs), demonstrating both high quality and fast learning.
中文标题/摘要
标题:一层足够:适应预训练视觉编码器进行图像生成
视觉生成模型(例如,扩散模型)通常在压缩的潜在空间中运行,以平衡训练效率和样本质量。与此同时,人们越来越关注利用高质量的预训练视觉表示,无论是通过在VAEs内部对齐它们,还是直接在生成模型内部。然而,由于理解导向的特征与生成友好的潜在空间之间存在根本性差异,使得适应这些表示仍然具有挑战性。表示编码器受益于高维潜在空间,可以捕捉到遮罩区域的多种假设,而生成模型则偏好低维潜在空间,必须忠实保留注入的噪声。这种差异导致先前的工作依赖于复杂的优化目标和架构。在本工作中,我们提出了一种简单而有效的框架FAE(特征自编码器),它仅使用一个注意力层即可将预训练的视觉表示适应为适合生成的低维潜在空间,同时保留足够的信息用于重建和理解。关键在于结合两个独立的深度解码器:一个用于重建原始特征空间,另一个则以重建的特征作为输入进行图像生成。FAE是通用的,可以与各种自监督编码器(例如,DINO,SigLIP)结合,并插入两种不同的生成家族:扩散模型和归一化流。在类别条件和文本到图像基准测试中,FAE表现出色。例如,在ImageNet 256x256上,我们的带有CFG的扩散模型达到接近最先进的FID为1.29(800个周期)和1.70(80个周期)。不使用CFG时,FAE达到最先进的FID为1.48(800个周期)和2.08(80个周期),展示了高质量和快速学习的能力。
Summary / 总结
This work addresses the challenge of adapting pre-trained visual representations for image generation by proposing FAE (Feature Auto-Encoder), which uses as little as a single attention layer to transform high-dimensional features into low-dimensional latents suitable for generation. FAE couples two deep decoders: one that reconstructs the original feature space and another that takes the reconstructed features as input for image generation. The method is versatile and can be used with various self-supervised encoders (e.g., DINO, SigLIP) and with both diffusion models and normalizing flows. On ImageNet 256x256, the diffusion model with CFG attains a near state-of-the-art FID of 1.29 at 800 epochs and 1.70 at 80 epochs, and reaches state-of-the-art FIDs of 1.48 and 2.08 without Classifier-Free Guidance, demonstrating both high quality and fast learning.
本文提出了一种名为FAE(特征自编码器)的方法,通过使用单个注意力层将高维特征转换为适合生成的低维潜变量,以解决预训练视觉表示适应图像生成的挑战。FAE 包含两个解码器:一个用于重建原始特征空间,另一个用于生成图像。该框架具有灵活性,可以与各种自监督编码器和生成模型结合使用。在ImageNet上的实验表明,FAE 能够实现强大的性能,其在800和80个epoch时的FID分数接近最佳水平,并且在没有Classifier-Free Guidance (CFG)的情况下达到最佳FID分数。
TiME: Tiny Monolingual Encoders for Efficient NLP Pipelines
Authors: David Schulmeister, Valentin Hartmann, Lars Klein, Robert West
First: 2025-12-16T18:02:58+00:00 · Latest: 2025-12-16T18:02:58+00:00
Abstract
Today, a lot of research on language models is focused on large, general-purpose models. However, many NLP pipelines only require models with a well-defined, small set of capabilities. While large models are capable of performing the tasks of those smaller models, they are simply not fast enough to process large amounts of data or offer real-time responses. Furthermore, they often use unnecessarily large amounts of energy, leading to sustainability concerns and problems when deploying them on battery-powered devices. In our work, we show how to train small models for such efficiency-critical applications. As opposed to many off-the-shelf NLP pipelines, our models use modern training techniques such as distillation, and offer support for low-resource languages. We call our models TiME (Tiny Monolingual Encoders) and comprehensively evaluate them on a range of common NLP tasks, observing an improved trade-off between benchmark performance on one hand, and throughput, latency and energy consumption on the other. Along the way, we show that distilling monolingual models from multilingual teachers is possible, and likewise distilling models with absolute positional embeddings from teachers with relative positional embeddings.
中文标题/摘要
标题:TiME: 小型单语编码器以实现高效NLP管道
如今,语言模型的研究大多集中在大型、通用的模型上。然而,许多NLP管道只需要具有明确、小型能力集的模型。虽然大型模型能够执行这些小型模型的任务,但它们通常不足以快速处理大量数据或提供实时响应。此外,它们往往使用不必要的大量能源,导致可持续性问题并在部署于电池供电设备时出现问题。在我们的研究中,我们展示了如何训练小型模型以满足这些效率关键的应用。与许多现成的NLP管道不同,我们的模型使用了现代训练技术如蒸馏,并支持低资源语言。我们将我们的模型称为TiME(小型单语编码器),并在一系列常见的NLP任务上进行了全面评估,观察到在基准性能和吞吐量、延迟和能源消耗之间取得了更好的权衡。过程中,我们展示了从多语言教师蒸馏单语言模型的可能性,以及从具有相对位置嵌入的教师蒸馏具有绝对位置嵌入的模型的可能性。
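As a rough illustration of the distillation idea mentioned in the abstract above, the sketch below computes a temperature-softened KL divergence between teacher and student output distributions. This is the classic soft-label formulation and not necessarily the objective used to train TiME; the function names, logits, and temperature value are assumptions made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class task (not taken from the paper).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(teacher, close_student))  # small: student agrees with teacher
print(distillation_loss(teacher, far_student))    # larger: student disagrees
```

A higher temperature flattens both distributions, so the student is also pushed to match the teacher's relative ranking of unlikely classes rather than only its top prediction.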
A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Authors: Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Christian Matek, Lukas Wolfseher, Rainer Spang, Ralf Huss, Johannes Raffler, Sarah Reinke, Wolfram Klapper, Katja Steiger, Kristina Schwamborn, Carsten Marr
First: 2025-12-16T17:58:03+00:00 · Latest: 2025-12-16T17:58:03+00:00
Comments: 17 pages
Abstract
Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.
中文标题/摘要
标题:HE染色全切片图像多实例学习模型淋巴瘤亚型分类的多中心基准研究
及时准确的淋巴瘤诊断对于指导癌症治疗至关重要。标准诊断方法结合HE染色全切片图像、免疫组化、流式细胞术和分子遗传学测试来确定淋巴瘤亚型,这一过程需要昂贵的设备、熟练的人员,并导致治疗延误。深度学习方法可以通过提取常规可用的HE染色切片中的诊断信息来辅助病理学家,但缺乏针对多中心数据的淋巴瘤亚型全面基准测试。在本研究中,我们首次提出了涵盖四种常见淋巴瘤亚型和健康对照组织的多中心淋巴瘤基准数据集。我们系统地评估了五种公开的病理学基础模型(H-optimus-1、H0-mini、Virchow2、UNI2、Titan)与注意力机制(AB-MIL)和变压器机制(TransMIL)多实例学习聚合器在三个放大倍数(10倍、20倍、40倍)下的表现。在同分布测试集中,模型在所有放大倍数下的多类别平衡准确率均超过80%,所有基础模型表现相似,两种聚合方法结果相当。放大倍数研究显示,40倍分辨率已足够,更高分辨率或跨放大倍数聚合无性能提升。然而,在异分布测试集中,性能显著下降至约60%,突显了显著的泛化挑战。为了推进该领域,需要涵盖更多罕见淋巴瘤亚型的更大规模多中心研究。我们提供了一个自动基准测试管道,以促进此类未来研究。
Summary / 总结
The study benchmarks deep learning models for lymphoma subtyping from HE-stained whole slide images. Five publicly available pathology foundation models, each combined with AB-MIL and TransMIL aggregators, were evaluated at three magnifications on a multicenter dataset covering four common lymphoma subtypes and healthy controls. On in-distribution test sets, models achieved multiclass balanced accuracies over 80% across all magnifications, with similar results for all foundation models and both aggregation methods. However, performance dropped to around 60% on out-of-distribution test sets, indicating challenges in generalization. The study suggests that 40x resolution is sufficient for lymphoma subtyping and calls for larger multicenter studies to include rare subtypes.
该研究旨在通过开发能够分析HE染色全切片图像的深度学习模型来改善淋巴瘤诊断。研究评估了五种公开的病理基础模型,使用了注意力基(AB-MIL)和变换器基(TransMIL)聚合器,在三种放大倍数下进行测试。模型在内分布测试集上的平衡准确率超过80%,两种聚合方法之间没有显著差异,所有模型表现相似。然而,在外分布测试集上的性能降至约60%,表明泛化能力存在挑战。未来的研究应关注包含罕见淋巴瘤亚型的更大规模多中心研究。
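To make the attention-based aggregation step more concrete, here is a minimal sketch of attention pooling over patch embeddings in the commonly used tanh form, where each patch gets a softmax-normalized score and the slide embedding is the weighted sum. It is not the benchmark's implementation: the function name, the tiny 2-D embeddings, and the hand-set weights `v` and `w` are all hypothetical, and a real pipeline would learn these parameters and use foundation-model features.

```python
import math


def attention_mil_pool(patches, v, w):
    """Aggregate patch embeddings into one slide embedding.

    patches: list of K patch embeddings (each a list of D floats)
    v: hidden projection, H rows of D floats
    w: attention vector of H floats
    Returns (slide_embedding, attention_weights).
    """
    scores = []
    for h in patches:
        hidden = [math.tanh(sum(vi * hi for vi, hi in zip(row, h))) for row in v]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    m = max(scores)  # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(patches[0])
    slide = [sum(a * h[d] for a, h in zip(attn, patches)) for d in range(dim)]
    return slide, attn


# Three hypothetical 2-D patch embeddings; weights are made up, not trained.
patches = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
slide, attn = attention_mil_pool(patches, v, w)
print(attn)   # the second patch receives the largest weight
print(slide)
```

The attention weights double as a crude interpretability signal, since they indicate which patches contributed most to the slide-level prediction.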
AMD-HookNet++: Evolution of AMD-HookNet with Hybrid CNN-Transformer Feature Enhancement for Glacier Calving Front Segmentation
Authors: Fei Wu, Marcel Dreier, Nora Gourmelon, Sebastian Wind, Jianlin Zhang, Thorsten Seehaus, Matthias Braun, Andreas Maier, Vincent Christlein
Venue: IEEE Transactions on Geoscience and Remote Sensing (2025)
First: 2025-12-16T17:57:52+00:00 · Latest: 2025-12-16T17:57:52+00:00
Abstract
The dynamics of glaciers and ice shelf fronts significantly impact the mass balance of ice sheets and coastal sea levels. To effectively monitor glacier conditions, it is crucial to consistently estimate positional shifts of glacier calving fronts. AMD-HookNet first introduced a pure two-branch convolutional neural network (CNN) for glacier segmentation. Yet, the local nature and translational invariance of convolution operations, while beneficial for capturing low-level details, restrict the model's ability to maintain long-range dependencies. In this study, we propose AMD-HookNet++, a novel advanced hybrid CNN-Transformer feature enhancement method for segmenting glaciers and delineating calving fronts in synthetic aperture radar images. Our hybrid structure consists of two branches: a Transformer-based context branch to capture long-range dependencies, which provides global contextual information in a larger view, and a CNN-based target branch to preserve local details. To strengthen the representation of the connected hybrid features, we devise an enhanced spatial-channel attention module to foster interactions between the hybrid CNN-Transformer branches by dynamically adjusting the token relationships from both spatial and channel perspectives. Additionally, we develop a pixel-to-pixel contrastive deep supervision to optimize our hybrid model by integrating pixelwise metric learning into glacier segmentation. Through extensive experiments and comprehensive quantitative and qualitative analyses on the challenging glacier segmentation benchmark dataset CaFFe, we show that AMD-HookNet++ sets a new state of the art with an IoU of 78.2 and an HD95 of 1,318 m, while maintaining a competitive MDE of 367 m. More importantly, our hybrid model produces smoother delineations of calving fronts, resolving the issue of jagged edges typically seen in pure Transformer-based approaches.
中文标题/摘要
标题:AMD-HookNet++:AMD-HookNet的混合CNN-Transformer特征增强演进,用于冰川崩解前沿分割
冰川和冰架前沿的动力学显著影响冰盖的质量平衡和沿海海平面。为了有效监测冰川状况,持续估算冰川崩解前沿的位置变化至关重要。AMD-HookNet首次引入了一个纯双分支卷积神经网络(CNN)进行冰川分割。然而,卷积操作的局部特性和平移不变性虽然有助于捕捉低级细节,但限制了模型保持长距离依赖的能力。在本研究中,我们提出AMD-HookNet++,一种新颖的混合CNN-Transformer特征增强方法,用于分割冰川和在合成孔径雷达图像中勾勒崩解前沿。我们的混合结构由两个分支组成:基于Transformer的上下文分支,用于捕捉长距离依赖性,提供更大的视图中的全局上下文信息;以及基于CNN的目标分支,用于保留局部细节。为了增强连接混合特征的表示,我们设计了一个增强的空间-通道注意模块,通过从空间和通道两个视角动态调整令牌关系,促进混合CNN-Transformer分支之间的交互。此外,我们开发了一种像素到像素的对比深度监督,通过将像素级度量学习集成到冰川分割中,优化我们的混合模型。通过在具有挑战性的冰川分割基准数据集CaFFe上进行广泛的实验和全面的定量及定性分析,我们展示了AMD-HookNet++在IoU为78.2、HD95为1,318米的情况下,保持了竞争力的MDE为367米的新最佳性能。更重要的是,我们的混合模型生成了更平滑的崩解前沿轮廓,解决了纯Transformer方法中常见的锯齿状边缘问题。
Summary / 总结
AMD-HookNet++ is an advanced hybrid CNN-Transformer model designed for glacier calving front segmentation in synthetic aperture radar images. It combines a Transformer-based context branch for capturing long-range dependencies with a CNN-based target branch for local detail preservation. An enhanced spatial-channel attention module and pixel-to-pixel contrastive deep supervision are introduced to improve feature interactions and model optimization. Experiments on the CaFFe dataset demonstrate that AMD-HookNet++ achieves a new state-of-the-art with an IoU of 78.2 and HD95 of 1,318 m, while producing smoother calving front delineations compared to pure Transformer-based approaches.
AMD-HookNet++ 是一种结合了基于 Transformer 的上下文分支和基于 CNN 的目标分支的混合模型,用于合成孔径雷达图像中的冰川崩解前沿分割。该模型通过增强的空间-通道注意模块和像素到像素的对比深监督来加强混合特征的表示。实验结果表明,AMD-HookNet++ 在 CaFFe 数据集上达到了新的最佳性能,IoU 为 78.2,HD95 为 1,318 米,同时生成的冰川崩解前沿轮廓更加平滑,解决了纯 Transformer 方法中常见的锯齿边缘问题。
Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Authors: Jungsuk Oh, Jay-Yoon Lee
First: 2025-08-25T18:36:28+00:00 · Latest: 2025-12-16T17:50:43+00:00
Abstract
Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks.
We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. LSC's lightweight forward processing of summary tokens introduces only negligible runtime overhead (at most 0.9%) on top of standard decoding of the base LLM, and requires no changes to the model architecture.
Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC, and WUCS in average performance on both short-form and long-form tasks, while adding negligible computational overhead over vanilla inference. These results position LSC as a reliable consistency-selection method that works effectively across various answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low expected calibration error across both answer formats.
中文标题/摘要
标题:潜在自我一致性在短答案和长答案推理中的可靠多数集选择
大型语言模型(LLMs)中的概率解码经常产生不一致的输出,特别是在复杂或长格式的问题上。自我一致性(SC)通过在短格式问答中对精确字符串进行多数投票来缓解这一问题,而通用自我一致性(USC)和加权Unigram一致性得分(WUCS)则扩展到长格式响应,但在短格式基准测试中会失去准确性。
我们引入了潜在自我一致性(LSC),它使用可学习的标记嵌入来选择语义上最一致的响应。LSC对摘要标记的轻量级前向处理仅在基础LLM标准解码的基础上增加了可忽略的运行时开销(最多0.9%),并且不需要更改模型架构。
在6个短格式和5个长格式推理基准测试(例如,MATH、MMLU、TruthfulQA)中,LSC在短格式和长格式上的平均性能均超过了SC、USC和WUCS,同时相对于普通推理仅增加了可忽略的计算开销。这些结果将LSC定位为一种在各种答案格式中都能有效工作的可靠的一致性选择方法。此外,LSC还提供了校准良好的置信度估计,两种答案格式的预期校准误差均较低。
Summary / 总结
The research addresses the issue of inconsistent outputs from Large Language Models (LLMs) on complex questions. It introduces Latent Self-Consistency (LSC), which uses learnable token embeddings to select the most semantically consistent response. LSC outperforms existing methods like Self-Consistency (SC), Universal Self-Consistency (USC), and Weighted Unigram Consistency Score (WUCS) on both short-form and long-form benchmarks, with minimal computational overhead. LSC also provides well-calibrated confidence estimates.
研究针对大型语言模型(LLMs)在复杂问题上产生不一致输出的问题,引入了使用可学习的标记嵌入来选择语义上最一致答案的潜在自我一致性(LSC)方法。LSC在短答案和长答案基准测试中均优于现有方法(如自我一致性SC、通用自我一致性USC和加权Unigram一致性得分WUCS),同时仅增加了极小的计算开销。该方法还能提供校准良好的置信度估计,使其在各种答案格式中都表现出色。
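The contrast between exact-string voting and embedding-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: LSC learns its summary-token embeddings, whereas here the embeddings are fixed, hypothetical 2-D vectors, and the selection rule (highest mean cosine similarity to the other samples) is a stand-in rather than the paper's exact procedure.

```python
import math
from collections import Counter


def majority_vote(answers):
    """Classic self-consistency: pick the most frequent exact string."""
    return Counter(answers).most_common(1)[0][0]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def most_consistent(embeddings):
    """Index of the response that is, on average, most similar to the others."""
    n = len(embeddings)

    def mean_sim(i):
        others = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        return sum(others) / (n - 1)

    return max(range(n), key=mean_sim)


# Short-form answers: exact-match voting works.
print(majority_vote(["42", "42", "41"]))

# Hypothetical embeddings of four sampled long-form responses;
# the last one points in a different direction from the rest.
embs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-0.2, 1.0]]
print(most_consistent(embs))
```

Exact-string voting breaks down for long-form answers because two correct responses rarely match character for character, which is why a similarity-based rule is needed there.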
MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation
Authors: Yash Vishe, Eric Xue, Xunyi Jiang, Zachary Novack, Junda Wu, Julian McAuley, Xin Xu
First: 2025-12-16T17:44:56+00:00 · Latest: 2025-12-16T17:44:56+00:00
Abstract
Music editing plays a vital role in modern music production, with applications in film, broadcasting, and game development. Recent advances in music generation models have enabled diverse editing tasks such as timbre transfer, instrument substitution, and genre transformation. However, many existing works overlook the evaluation of their ability to preserve musical facets that should remain unchanged during editing, a property we define as Music Context Preservation (MCP). While some studies do consider MCP, they adopt inconsistent evaluation protocols and metrics, leading to unreliable and unfair comparisons. To address this gap, we introduce the first MCP evaluation benchmark, MuseCPBench, which covers four categories of musical facets and enables comprehensive comparisons across five representative music editing baselines. Through systematic analysis along musical facets, methods, and models, we identify consistent preservation gaps in current music editing methods and provide insightful explanations. We hope our findings offer practical guidance for developing more effective and reliable music editing strategies with strong MCP capability.
中文标题/摘要
标题:MuseCPBench:通过音乐上下文保留进行音乐编辑方法的实证研究
音乐编辑在现代音乐制作中发挥着重要作用,应用于电影、广播和游戏开发。最近音乐生成模型的进步使多种编辑任务成为可能,如音色转移、乐器替换和风格转换。然而,许多现有工作忽视了评估其在编辑过程中保留本应保持不变的音乐特征的能力,我们将这一特性定义为音乐上下文保留(MCP)。虽然有些研究确实考虑了MCP,但它们采用不一致的评估协议和指标,导致不可靠且不公平的比较。为解决这一问题,我们引入了第一个MCP评估基准MuseCPBench,涵盖了四种音乐特征类别,并在五个代表性音乐编辑基线之间实现了全面比较。通过沿音乐特征、方法和模型的系统分析,我们识别了当前音乐编辑方法中一致的保留缺口,并提供了深入的解释。我们希望我们的发现为开发具有更强MCP能力的有效且可靠的音乐编辑策略提供实用指导。
Summary / 总结
MuseCPBench is an empirical study that evaluates music editing methods based on their ability to preserve musical facets during editing, a property termed Music Context Preservation (MCP). The study introduces a benchmark covering four categories of musical facets and compares five representative music editing methods. Key findings highlight consistent preservation gaps in current methods, offering insights for developing more reliable music editing strategies with strong MCP capability.
研究旨在评估音乐编辑方法在保持音乐要素方面的能力,这一特性被称为音乐上下文保存(MCP)。研究引入了MuseCPBench,这是一个涵盖四种音乐要素类别的评估基准,并对比了五种代表性的音乐编辑方法。主要发现揭示了当前方法在保持音乐要素方面的一致性差距,为开发具有更强MCP能力的更有效和可靠的音乐编辑策略提供了见解。
Distill Video Datasets into Images
Authors: Zhenghao Zhao, Haoxuan Wang, Kai Wang, Yuzhang Shang, Yuan Hong, Yan Yan
First: 2025-12-16T17:33:41+00:00 · Latest: 2025-12-16T17:33:41+00:00
Abstract
Dataset distillation aims to synthesize compact yet informative datasets that allow models trained on them to achieve performance comparable to training on the full dataset. While this approach has shown promising results for image data, extending dataset distillation methods to video data has proven challenging and often leads to suboptimal performance. In this work, we first identify the core challenge in video set distillation as the substantial increase in learnable parameters introduced by the temporal dimension of video, which complicates optimization and hinders convergence. To address this issue, we observe that a single frame is often sufficient to capture the discriminative semantics of a video. Leveraging this insight, we propose Single-Frame Video set Distillation (SFVD), a framework that distills videos into highly informative frames for each class. Using differentiable interpolation, these frames are transformed into video sequences and matched with the original dataset, while updates are restricted to the frames themselves for improved optimization efficiency. To further incorporate temporal information, the distilled frames are combined with sampled real videos during the matching process through a channel reshaping layer. Extensive experiments on multiple benchmarks demonstrate that SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF, thereby offering a more effective solution.
中文标题/摘要
标题:将视频数据集提炼为图像
数据集提炼旨在合成紧凑但信息丰富的数据集,使得在其中训练的模型能够达到与在完整数据集上训练相当的性能。尽管这种方法在图像数据上显示出有希望的结果,但将数据集提炼方法扩展到视频数据上却证明具有挑战性,通常会导致性能不佳。在本文中,我们首先将视频集提炼的核心挑战识别为由视频的时间维度引入的可学习参数的大幅增加,这使优化复杂化并阻碍了收敛。为了解决这一问题,我们观察到单帧通常足以捕捉视频的判别语义。利用这一见解,我们提出了单帧视频集提炼(SFVD)框架,该框架将视频提炼为每个类别的高度信息性的帧。通过可微插值,这些帧被转换为视频序列并与原始数据集匹配,同时更新仅限于帧本身以提高优化效率。为了进一步整合时间信息,在匹配过程中,提炼的帧通过通道重塑层与真实视频中的采样真实视频结合。在多个基准上的广泛实验表明,SFVD 显著优于先前的方法,在 MiniUCF 上实现了高达 5.3% 的性能提升,从而提供了一个更有效的解决方案。
Summary / 总结
This study addresses the challenge of distilling video datasets into images to reduce the size of datasets while maintaining model performance. The core issue is the increase in learnable parameters due to the temporal dimension of videos, which complicates optimization. The proposed Single-Frame Video set Distillation (SFVD) method distills videos into informative frames for each class, using differentiable interpolation to transform these frames into video sequences. The method also incorporates temporal information by combining distilled frames with real video samples. Experiments show that SFVD outperforms previous methods, achieving up to 5.3% better performance on the MiniUCF benchmark.
该研究旨在将视频数据集精简为图像,以达到与完整数据集训练相当的性能。它识别出核心问题是由于视频的时间维度导致可学习参数增加。提出的单帧视频集蒸馏(SFVD)框架将视频精简为每个类别的信息帧,并通过可微插值将这些帧转换为视频序列并与原始数据集匹配。实验表明,SFVD在MiniUCF上优于先前方法,最高可提高5.3%的性能。
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Authors: Atsuyuki Miyai, Shota Onohara, Jeonghun Baek, Kiyoharu Aizawa
First: 2025-12-16T17:33:00+00:00 · Latest: 2025-12-16T17:33:00+00:00
Comments: Project page: https://mmmu-japanese-benchmark.github.io/JMMMU_Pro/
Abstract
This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
中文标题/摘要
标题:JMMMU-Pro:基于图像的日本多学科多模态理解基准及Vibe基准构建方法
本文介绍了基于图像的日本多学科多模态理解基准JMMMU-Pro,以及Vibe基准构建方法,这是一种可扩展的构建方法。从MMMU到MMMU-Pro的演变中,JMMMU-Pro通过将问题图像和问题文本组合成单个图像,扩展了JMMMU,从而创建了一个需要通过视觉感知进行视觉-文本综合理解的基准。为了构建JMMMU-Pro,我们提出了Vibe基准构建方法,该方法使用图像生成模型(例如Nano Banana Pro)生成候选视觉问题,人类验证输出,并在必要时使用调整后的提示重新生成以确保质量。通过利用Nano Banana Pro高度逼真的图像生成能力和嵌入干净日文文本的能力,我们以较低的成本构建了一个高质量的基准,涵盖了广泛的背景和布局设计。实验结果表明,所有开源LMM在JMMMU-Pro上都面临重大挑战,突显了JMMMU-Pro作为指导开源社区未来努力的重要基准。我们认为,JMMMU-Pro为评估LMM的日本能力提供了更严格的评估工具,而我们的Vibe基准构建方法也为未来基于图像的VQA基准的发展提供了高效的指南。
Summary / 总结
JMMMU-Pro is an image-based Japanese multimodal understanding benchmark that extends JMMMU by combining question images and texts into a single image, requiring integrated visual-textual understanding. The Vibe Benchmark Construction method uses an image generative model to produce candidate visual questions, which are then verified by humans to ensure quality. Experimental results indicate that open-source LMMs struggle significantly with JMMMU-Pro, highlighting its importance for evaluating Japanese multimodal understanding capabilities.
JMMMU-Pro 是一个基于图像的日语多模态理解基准,通过将问题图像和文本合并为一个图像来扩展 JMMMU。它使用 Vibe 基准构建方法,该方法结合了图像生成模型和人工验证以确保质量。实验结果显示开源 LMM 在 JMMMU-Pro 上表现不佳,突显了其评估日语多模态理解能力的重要性。
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
Authors: Alessandro Trapasso, Luca Iocchi, Fabio Patrizi
First: 2025-12-16T17:26:24+00:00 · Latest: 2025-12-16T17:26:24+00:00
Comments: 19 pages, 32 figures, includes appendix
Abstract
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state with desired properties. Markovian Reinforcement Learning (RL) approaches are not suitable for such tasks, while RL with non-Markovian reward decision processes (NMRDPs) enables agents to tackle temporal-dependency tasks. This approach has long been known to lack formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We experimentally compare our method with modern state-of-the-art model-based RL approaches on environments of increasing complexity, showing a significant improvement in sample efficiency and increased robustness in finding optimal policies.
中文标题/摘要
标题:基于模型的强化学习在离散动作非马尔可夫奖励决策过程中的应用
许多实际的决策问题涉及那些其成功依赖于整个系统历史的任务,而不是达到具有所需属性的状态。马尔可夫强化学习(RL)方法不适用于此类任务,而基于非马尔可夫奖励决策过程(NMRDP)的RL方法使智能体能够应对具有时间依赖性的任务。这种方法长期以来一直被认为在(近)最优性和样本效率方面缺乏形式上的保证。我们通过QR-MAX,一种新颖的基于模型的离散NMRDP算法,解决了这两个问题,该算法通过奖励机器将马尔可夫转换学习与非马尔可夫奖励处理进行分解。据我们所知,这是第一个利用这种分解在多项式样本复杂性下达到ε-最优策略的基于模型的RL算法。我们还通过Bucket-QR-MAX扩展了QR-MAX,这是一种基于SimHash的离散化器,保持了相同的分解结构,并实现了快速稳定的无手动网格化或函数逼近的学习。我们在复杂度递增的环境中实验性地将我们的方法与现代基于模型的RL方法进行了比较,显示出显著提高的样本效率和在寻找最优策略时的增强鲁棒性。
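For readers unfamiliar with reward machines, the sketch below shows the basic idea the abstract relies on: a small finite-state machine tracks the relevant part of the history, so the reward can depend on what happened earlier while the environment dynamics stay Markovian. The class, the key-then-door task, and all state and event names are hypothetical illustrations, not part of QR-MAX.

```python
class RewardMachine:
    """Finite-state machine that emits rewards based on event history."""

    def __init__(self, transitions, initial_state):
        # transitions: {(rm_state, event): (next_rm_state, reward)}
        self.transitions = transitions
        self.initial_state = initial_state
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, event):
        """Advance on an observed event; unknown events leave the state unchanged."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward


# Hypothetical task: reward is paid at the door only after the key was picked up.
rm = RewardMachine(
    transitions={
        ("no_key", "key"): ("has_key", 0.0),
        ("has_key", "door"): ("done", 1.0),
    },
    initial_state="no_key",
)


def episode_return(events):
    """Total reward for a sequence of events, starting from a fresh machine."""
    rm.reset()
    return sum(rm.step(e) for e in events)


print(episode_return(["door", "key"]))  # door visited too early
print(episode_return(["key", "door"]))  # correct order
```

Because the pair (environment state, reward-machine state) is Markovian, the transition model can be learned over environment states alone while the machine handles the history-dependent reward, which is the kind of factorization the abstract refers to.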
Unitless Unrestricted Markov-Consistent SCM Generation: Better Benchmark Datasets for Causal Discovery
Authors: Rebecca J. Herman, Jonas Wahl, Urmi Ninad, Jakob Runge
Venue: Proceedings of Machine Learning Research, 2025
First: 2025-03-21T10:46:50+00:00 · Latest: 2025-12-16T17:25:08+00:00
Comments: 4th Conference on Causal Learning and Reasoning. Code published in python package "UUMCdata" (https://pypi.org/project/UUMCdata/)
Abstract
Causal discovery aims to extract qualitative causal knowledge in the form of causal graphs from data. Because causal ground truth is rarely known in the real world, simulated data plays a vital role in evaluating the performance of the various causal discovery algorithms proposed in the literature. But recent work highlighted certain artifacts of commonly used data generation techniques for a standard class of structural causal models (SCM) that may be nonphysical, including var- and R2-sortability, where the variables' variance and coefficients of determination (R2) after regressing on all other variables, respectively, increase along the causal order. Some causal methods exploit such artifacts, leading to unrealistic expectations for their performance on real-world data. Some modifications have been proposed to remove these artifacts; notably, the internally-standardized structural causal model (iSCM) avoids varsortability and largely alleviates R2-sortability on sparse causal graphs, but exhibits a reversed R2-sortability pattern for denser graphs not featured in their work. We analyze which sortability patterns we expect to see in real data, and propose a method for drawing coefficients that we argue more effectively samples the space of SCMs. Finally, we propose a novel extension of our SCM generation method to the time series setting.
中文标题/摘要
标题:无量纲不受限制的马尔可夫一致SCM生成:更好的因果发现基准数据集
因果发现旨在从数据中提取以因果图形式表示的定性因果知识。由于现实世界中很少知道因果真相,因此模拟数据在评估文献中提出的各种因果发现算法的性能方面起着至关重要的作用。但最近的工作指出了标准结构因果模型(SCM)中常用数据生成技术的一些特定特征,这些特征可能是非物理的,包括方差排序性和R2排序性,其中变量在回归到所有其他变量后的方差和决定系数(R2)沿因果顺序增加。一些因果方法利用了这些特征,导致对它们在真实数据上的性能产生了不切实际的期望。一些修改已被提出以去除这些特征;值得注意的是,内部标准化结构因果模型(iSCM)避免了方差排序性,并在稀疏因果图上大大缓解了R2排序性,但在更密集的图上表现出相反的R2排序性模式,这不在他们的工作中有所体现。我们分析了我们期望在真实数据中看到哪些排序模式,并提出了一种方法来绘制系数,我们认为这种方法更有效地抽样SCM的空间。最后,我们提出了一种将我们的SCM生成方法扩展到时间序列设置的新方法。
Summary / 总结
The paper addresses artifacts in the simulated data used to evaluate causal discovery algorithms, particularly var- and R2-sortability, which some methods exploit and which may be nonphysical. It analyzes which sortability patterns are expected in real data and proposes a method for drawing SCM coefficients that the authors argue samples the space of SCMs more effectively than existing schemes such as iSCM. The generation method is also extended to the time series setting, with the goal of providing more reliable benchmark datasets for causal discovery.
本文解决了因果发现中模拟数据中存在的不现实的artifacts问题,特别是变量方差和决定系数(R2)排序现象,这些现象可能导致因果发现方法在实际数据上的性能期望过高。作者提出了一种新的方法来生成无量纲且不受限制的马尔可夫一致结构因果模型(SCM),使其更接近真实数据。关键实验发现是,他们的方法有效地对SCM的空间进行了采样,并避免了先前方法中存在的artifacts,从而为评估因果发现算法提供了更可靠的基准数据集。
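A small worked example may help show why var-sortability arises in naively simulated data. The sketch below uses a linear chain X1 -> X2 -> X3 with independent noise, where Var(X_{k+1}) = w^2 Var(X_k) + noise variance, and then scores how often variance increases along the causal order. The chain setup, function names, and the simplified pairwise score (ties counted as one half) are assumptions for illustration; this is not the UUMCdata package or the paper's generation method.

```python
def chain_variances(weights, noise_var=1.0):
    """Marginal variances of a linear chain X1 -> X2 -> ... with independent noise."""
    variances = [noise_var]
    for w in weights:
        variances.append(w * w * variances[-1] + noise_var)
    return variances


def var_sortability_chain(variances):
    """Fraction of (ancestor, descendant) pairs with increasing variance.

    In a chain every earlier node is an ancestor of every later node.
    Ties count as one half.
    """
    score, pairs = 0.0, 0
    n = len(variances)
    for i in range(n):
        for j in range(i + 1, n):
            pairs += 1
            if variances[i] < variances[j]:
                score += 1.0
            elif variances[i] == variances[j]:
                score += 0.5
    return score / pairs


raw = chain_variances([2.0, 2.0])
print(raw, var_sortability_chain(raw))

# If every variable is rescaled to unit variance, the variance ordering
# no longer reveals the causal order.
standardized = [1.0, 1.0, 1.0]
print(var_sortability_chain(standardized))
```

A score near 1 means that simply sorting variables by variance recovers the causal order, which is the shortcut that benchmark data should avoid handing to algorithms.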
Hierarchical Persistence Velocity for Network Anomaly Detection: Theory and Applications to Cryptocurrency Markets
Authors: Omid Khormali
First: 2025-12-16T17:23:07+00:00 · Latest: 2025-12-16T17:23:07+00:00
Abstract
We introduce the Overlap-Weighted Hierarchical Normalized Persistence Velocity (OW-HNPV), a novel topological data analysis method for detecting anomalies in time-varying networks. Unlike existing methods that measure cumulative topological presence, we introduce the first velocity-based perspective on persistence diagrams, measuring the rate at which features appear and disappear, automatically downweighting noise through overlap-based weighting. We also prove that OW-HNPV is mathematically stable. It behaves in a controlled, predictable way, even when comparing persistence diagrams from networks with different feature types. Applied to Ethereum transaction networks (May 2017-May 2018), OW-HNPV demonstrates superior performance for cryptocurrency anomaly detection, achieving up to 10.4% AUC gain over baseline models for 7-day price movement predictions. Compared with established methods, including Vector of Averaged Bettis (VAB), persistence landscapes, and persistence images, velocity-based summaries excel at medium- to long-range forecasting (4-7 days), with OW-HNPV providing the most consistent and stable performance across prediction horizons. Our results show that modeling topological velocity is crucial for detecting structural anomalies in dynamic networks.
中文标题/摘要
标题:网络异常检测中的分层持久速度:理论与加密货币市场应用
我们引入了重叠加权分层归一化持久速度(OW-HNPV),这是一种新颖的拓扑数据分析方法,用于检测时间变化网络中的异常。不同于现有方法测量累积拓扑存在,我们首次从速度的角度引入了持久图的视角,测量特征出现和消失的速度,并通过基于重叠的加权自动降低噪声。我们还证明了OW-HNPV在数学上是稳定的。即使在比较不同特征类型网络的持久图时,它也表现出受控和可预测的行为。应用于以太坊交易网络(2017年5月-2018年5月),OW-HNPV在加密货币异常检测中表现出优越性能,7天价格变动预测的AUC增益高达10.4%,优于基线模型。与现有的方法,包括平均贝蒂向量(VAB)、持久景观和持久图像相比,基于速度的总结在中到长期预测(4-7天)方面表现出色,而OW-HNPV在预测时间跨度上提供了最一致和稳定的性能。我们的结果表明,建模拓扑速度对于检测动态网络中的结构性异常至关重要。
Summary / 总结
The paper introduces OW-HNPV, a novel method for detecting anomalies in time-varying networks using a velocity-based perspective on persistence diagrams. The method automatically downweights noise through overlap-based weighting and is proven to be mathematically stable. Applied to Ethereum transaction networks (May 2017-May 2018), it achieves up to a 10.4% AUC gain over baseline models for 7-day price movement predictions. Velocity-based summaries excel at medium- to long-range forecasting (4-7 days) compared with VAB, persistence landscapes, and persistence images, with OW-HNPV providing the most consistent and stable performance across prediction horizons.
论文提出了OW-HNPV,这是一种基于速度视角的持久同调图方法,用于检测时间变化网络中的异常。与累积存在度量不同,OW-HNPV通过基于重叠的加权自动降低噪声,并且是数学上稳定的。应用于以太坊交易网络,OW-HNPV在7天价格变动预测中优于基线模型,AUC增益高达10.4%。它在中到长期预测中表现出色,并且在预测时间跨度上提供了最一致和稳定的性能,优于VAB、持久景观和持久图像等现有方法。
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Authors: Wenqiang Sun, Haiyu Zhang, Haoyuan Wang, Junta Wu, Zehan Wang, Zhenwei Wang, Yunhong Wang, Jun Zhang, Tengfei Wang, Chunchao Guo
First: 2025-12-16T17:22:46+00:00 · Latest: 2025-12-16T17:22:46+00:00
Comments: project page: https://3d-models.hunyuan.tencent.com/world/, demo: https://3d.hunyuan.tencent.com/sceneTo3D
Abstract
This paper presents WorldPlay, a streaming video diffusion model that enables real-time, interactive world modeling with long-term geometric consistency, resolving the trade-off between speed and memory that limits current methods. WorldPlay draws power from three key innovations. 1) We use a Dual Action Representation to enable robust action control in response to the user's keyboard and mouse inputs. 2) To enforce long-term consistency, our Reconstituted Context Memory dynamically rebuilds context from past frames and uses temporal reframing to keep geometrically important but long-past frames accessible, effectively alleviating memory attenuation. 3) We also propose Context Forcing, a novel distillation method designed for memory-aware models. Aligning memory context between the teacher and student preserves the student's capacity to use long-range information, enabling real-time speeds while preventing error drift. Taken together, WorldPlay generates long-horizon streaming 720p video at 24 FPS with superior consistency, comparing favorably with existing techniques and showing strong generalization across diverse scenes. The project page and online demo can be found at https://3d-models.hunyuan.tencent.com/world/ and https://3d.hunyuan.tencent.com/sceneTo3D.
中文标题/摘要
标题:WorldPlay:实现实时交互世界建模的长期几何一致性
本文介绍了WorldPlay,一种流式视频扩散模型,能够实现实时、交互式的世界建模,并保持长期的几何一致性,解决了当前方法在速度和内存之间取舍的问题。WorldPlay 的三大创新点在于:1) 使用双动作表示法以应对用户的键盘和鼠标输入,实现稳健的动作控制。2) 通过重构上下文记忆动态重建过去帧的上下文,并利用时间重塑保持几何上重要的但已过去很久的帧的可访问性,有效缓解了记忆衰减。3) 我们还提出了一种名为上下文强迫的新蒸馏方法,专为记忆感知模型设计。通过在教师和学生之间对齐记忆上下文,保持学生利用长距离信息的能力,从而实现实时速度并防止误差漂移。综上所述,WorldPlay 能以 24 FPS 生成长时程的 720p 视频流,具有更优的一致性,与现有技术相比表现更佳,并且在多种场景中具有较强的泛化能力。项目页面和在线演示可以在:https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D 查看。
Summary / 总结
WorldPlay is a streaming video diffusion model that addresses the trade-off between speed and memory in real-time interactive world modeling. It introduces a Dual Action Representation for robust action control, a Reconstituted Context Memory to maintain long-term geometric consistency, and Context Forcing for memory-aware distillation. The model generates 720p video at 24 FPS with superior consistency, outperforming existing techniques and demonstrating strong generalization across various scenes.
WorldPlay 是一种流式视频扩散模型,通过引入双动作表示、重构上下文记忆和上下文强迫,解决了实时互动世界建模中的速度与内存之间的权衡问题。这些创新使模型能够保持长期几何一致性并实现实时速度,生成720p视频,帧率为24 FPS,且一致性更优,适用于多种场景。项目页面和在线演示可在 https://3d-models.hunyuan.tencent.com/world/ 和 https://3d.hunyuan.tencent.com/sceneTo3D 查看。
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Authors: Antonio Guillen-Perez
First: 2025-12-12T20:07:04+00:00 · Latest: 2025-12-16T17:15:46+00:00
Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
中文标题/摘要
标题:语义驱动:通过开放词汇接地和神经符号VLM共识民主化长尾数据整理
自主车辆(AV)的稳健开发受到“长尾”训练数据稀缺的限制。尽管车队收集了大量视频日志,但识别罕见的安全关键事件(例如,不规则的乱穿马路、施工绕行)仍然是一个手动且成本高昂的过程。现有解决方案依赖于粗略的元数据搜索,缺乏精确性,或者基于云的VLM,这侵犯了隐私并昂贵。我们提出了语义驱动,这是一种本地优先的神经符号框架,用于语义数据挖掘。我们的方法将感知分为两个阶段:(1)通过实时开放词汇检测器(YOLOE)进行符号接地,以锚定注意力;(2)通过推理VLM进行认知分析,执行法医场景分析。为了减轻幻觉,我们实现了一种“系统2”推理时对齐策略,利用多模型“法官-侦察员”共识机制。在nuScenes数据集上与Waymo开放数据集(WOD-E2E)分类法进行基准测试,语义驱动实现了召回率0.966(与CLIP相比为0.475),并将风险评估误差降低了40%。该系统完全在消费级硬件(NVIDIA RTX 3090)上运行,提供了一种隐私保护的替代方案,替代云服务。
Summary / 总结
Semantic-Drive is a local-first, neuro-symbolic framework for semantic data mining in autonomous-vehicle video logs, addressing the scarcity of long-tail training data. It uses a real-time open-vocabulary detector (YOLOE) to anchor attention and a reasoning VLM for scene analysis, with a multi-model "Judge-Scout" consensus mechanism to reduce hallucination. On the nuScenes dataset, Semantic-Drive achieves a recall of 0.966 (vs. 0.475 for CLIP) and a 40% reduction in risk assessment error compared to the best single scout models, while running entirely on consumer hardware (NVIDIA RTX 3090).
Semantic-Drive 是一种面向自主车辆的本地优先框架,用于解决长尾训练数据稀缺问题。它使用实时开放词汇检测器和推理 VLM 进行场景分析,并采用共识机制减少幻觉。在 nuScenes 数据集上,Semantic-Drive 的召回率为 0.966,风险评估误差比单个侦察模型低 40%,同时在消费级硬件上运行。
LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
Authors: Prasanjit Dubey, Aritra Guha, Zhengyi Zhou, Qiong Wu, Xiaoming Huo, Paromita Dubey
First: 2025-12-16T17:14:10+00:00 · Latest: 2025-12-16T17:14:10+00:00
Abstract
Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA) conducted in the numeric space forms the workhorse to recover primary population characteristics, and produces subject-level scores which, together with baseline static covariates, facilitate data segmentation, unsupervised anomaly detection and inference, and enable other downstream tasks. In particular, we leverage LLMs to perform dynamic keyword profiling guided by the data segments and anomalies discovered by LLmFPCA-detect, and we show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance. We support the stability of LLmFPCA-detect with experiments and evaluate it on two different applications using public datasets, Amazon customer-review trajectories, and Wikipedia talk-page comment streams, demonstrating utility across domains and outperforming state-of-the-art baselines.
中文标题/摘要
标题:LLmFPCA-detect:基于LLM的多元函数主成分分析方法用于稀疏纵向文本中的异常检测
稀疏纵向(SL)文本数据在个体在时间上反复生成文本(例如,客户评论、偶尔的社会媒体帖子、多次访问的电子医疗记录)时产生,但观察的频率和时间因个体而异。这些复杂的文本数据集具有巨大的潜力,可以为未来的政策和针对性建议提供信息。然而,由于缺乏专门的方法,SL文本数据是嘈杂的、异质的,并且容易出现异常,因此检测和推断关键模式具有挑战性。我们引入了LLmFPCA-detect,这是一种灵活的框架,将基于LLM的文本嵌入与函数数据分析相结合,以检测大型SL文本数据集中的聚类并推断异常。首先,LLmFPCA-detect使用LLM提示将每段文本嵌入到特定的应用数字空间中。在数字空间中进行稀疏多元函数主成分分析(mFPCA)形成工作马车,以恢复主要的人口特征,并生成个体水平的评分,这些评分与基线静态协变量一起,有助于数据分段、无监督异常检测和推断,并使其他下游任务成为可能。特别是,我们利用LLM进行由LLmFPCA-detect发现的数据段和异常引导的动态关键词配置文件,并证明LLmFPCA-detect生成的特定聚类功能PC评分作为现有管道中的特征有助于提高预测性能。我们通过实验支持LLmFPCA-detect的稳定性,并使用公共数据集评估它在两个不同的应用中,即亚马逊客户评论轨迹和维基百科讨论页面评论流,展示了其在不同领域的实用性和优于最先进的基线方法。
Summary / 总结
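LLmFPCA-detect is a framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in sparse longitudinal (SL) text data, where individuals generate text repeatedly over time but at irregular frequencies and timings. It embeds each piece of text into an application-specific numeric space using LLM prompts, then applies sparse multivariate functional principal component analysis (mFPCA) to recover primary population characteristics and produce subject-level scores for data segmentation and unsupervised anomaly detection. LLMs are also used for dynamic keyword profiling guided by the discovered segments and anomalies, and cluster-specific functional PC scores, used as features in existing pipelines, help boost prediction performance. Experiments on Amazon customer-review trajectories and Wikipedia talk-page comment streams demonstrate utility across domains and improvements over state-of-the-art baselines.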
LLmFPCA-detect 是一个框架,利用基于 LLM 的文本嵌入和功能性数据分析来检测稀疏纵向文本数据中的异常并推断关键模式。该框架使用 LLM 提示将文本嵌入到数值空间中,并应用稀疏多元功能性主成分分析 (mFPCA) 来恢复主要的人口特征并生成主题级别的分数以进行异常检测。该框架利用 LLM 进行动态关键词分析,并在现有管道中使用来自 LLmFPCA-detect 的特定簇的功能 PC 分数作为特征时显示出改进的预测性能。实验表明,该方法在亚马逊客户评论和维基百科讨论页面上具有有效性并优于最先进的基线方法。
Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis
Authors: Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer
First: 2025-12-16T17:12:26+00:00 · Latest: 2025-12-16T17:12:26+00:00
Comments: pre-print of the upcoming EURASIP JASM journal article
Abstract
Automatic Music Transcription (AMT) -- the task of converting music audio into note representations -- has seen rapid progress, driven largely by deep learning systems. Due to the limited availability of richly annotated music datasets, much of the progress in AMT has been concentrated on classical piano music, and even a few very specific datasets. Whether these systems can generalize effectively to other musical contexts remains an open question. Complementing recent studies on distribution shifts in sound (e.g., recording conditions), in this work we investigate the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. To this end, we introduce the MDS corpus, comprising three distinct subsets -- (1) Genre, (2) Random, and (3) MAEtest -- to emulate different axes of distribution shift. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically-informed performance metrics. Our extensive evaluation isolates and exposes varying degrees of performance degradation under specific distribution shifts. In particular, we measure a note-level F1 performance drop of 20 percentage points due to sound, and 14 due to genre. Generally, we find that dynamics estimation proves more vulnerable to musical variation than onset prediction. Musically informed evaluation metrics, particularly those capturing harmonic structure, help identify potential contributing factors. Furthermore, experiments with randomly generated, non-musical sequences reveal clear limitations in system performance under extreme musical distribution shifts. Altogether, these findings offer new evidence of the persistent impact of the Corpus Bias problem in deep AMT systems.
中文标题/摘要
标题:深度音乐转谱模型中的声音和音乐偏见:系统分析
自动音乐转谱(AMT)——将音乐音频转换为音符表示的任务——已经取得了快速进展,主要得益于深度学习系统的推动。由于丰富标注的音乐数据集的有限可用性,AMT 的大部分进展集中在古典钢琴音乐,甚至是一些非常特定的数据集上。这些系统是否能够有效地泛化到其他音乐背景下仍然是一个开放的问题。补充最近关于声音分布偏移的研究(例如,录音条件),在本研究中,我们探讨了音乐维度——具体来说,是流派、动态和多声部水平的变化。为此,我们引入了MDS语料库,包含三个不同的子集——(1)流派,(2)随机,(3)MAEtest——以模拟不同的分布偏移轴。我们使用传统的信息检索和音乐导向的性能指标,对MDS语料库上的几种最先进的AMT系统进行了评估。我们的广泛评估隔离并揭示了在特定分布偏移下的不同程度的性能下降。特别是,我们测量了由于声音导致的音符级F1性能下降20个百分点,由于流派导致的下降14个百分点。总体而言,我们发现动态估计比起始点预测更易受到音乐变化的影响。音乐导向的评估指标,尤其是那些捕捉和声结构的指标,有助于识别潜在的促成因素。此外,使用随机生成的非音乐序列的实验揭示了系统在极端音乐分布偏移下的明显局限性。总之,这些发现为深度AMT系统中持续存在的语料库偏见问题提供了新的证据。
Summary / 总结
This study investigates the biases in deep learning models for automatic music transcription (AMT) by examining variations in genre, dynamics, and polyphony levels. The authors introduce the MDS corpus to evaluate the performance of state-of-the-art AMT systems under different distribution shifts. Key findings include a 20 percentage point drop in note-level F1 performance due to sound and a 14 percentage point drop due to genre. Dynamics estimation was found to be more vulnerable to musical variation than onset prediction, and musically informed metrics helped identify contributing factors. Experiments with non-musical sequences highlighted the limitations of these systems under extreme distribution shifts.
研究探讨了自动音乐转录(AMT)深度学习模型在不同音乐背景下的偏见,包括体裁、动态和多声部水平。研究引入了MDS语料库,并使用传统和音乐导向的指标评估了多个最先进的AMT系统。关键发现包括音质导致的音符级F1性能下降20个百分点,以及体裁导致的下降14个百分点。研究指出,动态估计比起始预测更易受音乐变化的影响,而音乐导向的评估指标对于识别性能下降的原因至关重要。
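The note-level F1 drops quoted above are differences between two F1 scores, expressed in percentage points. The following is a minimal sketch of how such a score can be computed, not the paper's evaluation code. It assumes notes are `(onset_seconds, midi_pitch)` pairs, that an estimated note counts as correct when its pitch matches and its onset lies within a 50 ms tolerance, and that matching is greedy; standard toolkits use an optimal matching instead, so results may differ slightly. The names `note_f1` and `percentage_point_drop` are hypothetical.

```python
def note_f1(reference, estimated, onset_tolerance=0.05):
    """Note-level F1 for lists of (onset_seconds, midi_pitch) pairs.

    Assumption: greedy one-to-one matching on pitch and onset only.
    """
    unmatched = list(reference)  # reference notes not yet claimed
    true_positives = 0
    for onset, pitch in estimated:
        for ref in unmatched:
            if ref[1] == pitch and abs(ref[0] - onset) <= onset_tolerance:
                unmatched.remove(ref)  # each reference note matches once
                true_positives += 1
                break
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percentage_point_drop(f1_baseline, f1_shifted):
    """Difference between two F1 scores in percentage points."""
    return 100.0 * (f1_baseline - f1_shifted)
```

A drop reported in percentage points is an absolute difference between the two scores, not a relative change with respect to the baseline.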
FakeRadar: Probing Forgery Outliers to Detect Unknown Deepfake Videos
Authors: Zhaolun Li, Jichang Li, Yinqi Cai, Junye Chen, Xiaonan Luo, Guanbin Li, Rushi Lan
First: 2025-12-16T17:11:45+00:00 · Latest: 2025-12-16T17:11:45+00:00
Abstract
In this paper, we propose FakeRadar, a novel deepfake video detection framework designed to address the challenges of cross-domain generalization in real-world scenarios. Existing detection methods typically rely on manipulation-specific cues, performing well on known forgery types but exhibiting severe limitations against emerging manipulation techniques. This poor generalization stems from their inability to adapt effectively to unseen forgery patterns. To overcome this, we leverage large-scale pretrained models (e.g. CLIP) to proactively probe the feature space, explicitly highlighting distributional gaps between real videos, known forgeries, and unseen manipulations. Specifically, FakeRadar introduces Forgery Outlier Probing, which employs dynamic subcluster modeling and cluster-conditional outlier generation to synthesize outlier samples near boundaries of estimated subclusters, simulating novel forgery artifacts beyond known manipulation types. Additionally, we design Outlier-Guided Tri-Training, which optimizes the detector to distinguish real, fake, and outlier samples using proposed outlier-driven contrastive learning and outlier-conditioned cross-entropy losses. Experiments show that FakeRadar outperforms existing methods across various benchmark datasets for deepfake video detection, particularly in cross-domain evaluations, by handling the variety of emerging manipulation techniques.
中文标题/摘要
标题:FakeRadar: 探测伪造异常以检测未知深度伪造视频
在本文中,我们提出了一种名为FakeRadar的新型深度伪造视频检测框架,旨在解决实际场景中跨域泛化的挑战。现有的检测方法通常依赖于特定的篡改线索,在已知伪造类型上表现良好,但在面对新兴的篡改技术时表现出严重的局限性。这种泛化能力差源于它们无法有效适应未见过的伪造模式。为克服这一问题,我们利用大规模预训练模型(例如CLIP)主动探测特征空间,明确突出真实视频、已知伪造和未见过的篡改之间的分布差距。具体而言,FakeRadar引入了伪造异常探测,通过动态子聚类建模和聚类条件异常生成来合成接近估计子聚类边界的异常样本,模拟超出已知篡改类型的新型伪造伪影。此外,我们设计了异常引导的三教师训练,该方法通过提出的异常驱动对比学习和异常条件交叉熵损失来优化检测器,使其能够区分真实、伪造和异常样本。实验表明,FakeRadar在各种基准数据集上的深度伪造视频检测中表现优于现有方法,特别是在跨域评估中,能够处理各种新兴的篡改技术。
Summary / 总结
FakeRadar is a novel deepfake video detection framework that addresses the limitations of existing methods in handling unknown forgery techniques. It uses large-scale pretrained models to probe the feature space and highlight distributional gaps, introducing Forgery Outlier Probing to synthesize novel forgery artifacts. Additionally, it employs Outlier-Guided Tri-Training to optimize the detector for distinguishing real, fake, and outlier samples. Experimental results demonstrate that FakeRadar outperforms existing methods across various benchmark datasets, especially in cross-domain evaluations.
FakeRadar 是一种新颖的深伪视频检测框架,旨在解决现有方法在处理未知伪造类型时的局限性。它利用大规模预训练模型探测特征空间并识别分布差异,引入伪造异常探针来合成接近子簇边界的异常样本。此外,它采用异常导向的三训练方法来优化检测器以区分真实、伪造和异常样本。实验结果表明,FakeRadar 在各种基准数据集中的表现优于现有方法,特别是在跨域评估中表现出色。
Generalization performance of narrow one-hidden layer networks in the teacher-student setting
Authors: Jean Barbier, Federica Gerace, Alessandro Ingrosso, Clarissa Lauditi, Enrico M. Malatesta, Gibbs Nwemadji, Rodrigo Pérez Ortiz
First: 2025-07-01T10:18:20+00:00 · Latest: 2025-12-16T17:11:10+00:00
Comments: 35 pages, 6 figures
Abstract
Understanding the generalization abilities of neural networks for simple input-output distributions is crucial to account for their learning performance on real datasets. The classical teacher-student setting, where a network is trained from data obtained thanks to a label-generating teacher model, serves as a perfect theoretical test bed. In this context, a complete theoretical account of the performance of fully connected one-hidden layer networks in the presence of generic activation functions is lacking. In this work, we develop such a general theory for narrow networks, i.e. with a large number of hidden units, yet much smaller than the input dimension. Using methods from statistical physics, we provide closed-form expressions for the typical performance of both finite temperature (Bayesian) and empirical risk minimization estimators, in terms of a small number of summary statistics. In doing so, we highlight the presence of a transition where hidden neurons specialize when the number of samples is sufficiently large and proportional to the number of parameters of the network. Our theory accurately predicts the generalization error of neural networks trained on regression or classification tasks with either noisy full-batch gradient descent (Langevin dynamics) or full-batch gradient descent.
中文标题/摘要
标题:窄一层隐藏层网络在教师-学生设置下的泛化性能
理解神经网络对简单输入-输出分布的泛化能力对于解释其在真实数据集上的学习性能至关重要。经典的教师-学生设置,其中网络通过标签生成教师模型的数据进行训练,是一个完美的理论测试床。在此背景下,对于具有通用激活函数的完全连接的一层隐藏层网络的完整理论描述仍然缺乏。在本文中,我们为窄网络(即具有大量隐藏单元但远小于输入维度)发展了这样的通用理论。利用统计物理方法,我们提供了有限温度(贝叶斯)和经验风险最小化估计器的典型性能的闭式表达式,用少量汇总统计量表示。在此过程中,我们强调了当样本数量足够大且与网络参数数量成比例时隐藏神经元的专业化过渡。我们的理论准确预测了在回归或分类任务中使用有噪声全批量梯度下降(朗之万动力学)或全批量梯度下降训练的神经网络的泛化误差。
Summary / 总结
This study aims to understand the generalization abilities of narrow one-hidden layer neural networks in the teacher-student setting, where a network is trained from data generated by a teacher model. The authors develop a theoretical framework using statistical physics methods to provide closed-form expressions for the typical performance of both finite temperature (Bayesian) and empirical risk minimization estimators. Key findings include the presence of a transition where hidden neurons specialize when the number of samples is sufficiently large and proportional to the number of network parameters, and the theory accurately predicts the generalization error for both regression and classification tasks trained with either noisy full-batch gradient descent (Langevin dynamics) or plain full-batch gradient descent.
研究旨在理解窄一层隐藏单元神经网络在教师-学生设置下的泛化性能,其中网络使用教师模型生成的数据进行训练。作者利用统计物理方法发展了一个理论框架,提供了有限温度和经验风险最小化估计器性能的闭式表达式。主要发现包括在样本数量增加时隐藏神经元的专业化过渡,以及该理论准确预测了使用噪声批量梯度下降或批量梯度下降进行回归和分类任务的泛化误差。
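The teacher-student setup described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's model or analysis: both networks are assumed to be committee machines (tanh hidden units with a fixed uniform readout), the loss is the mean squared error, and the Langevin variant adds Gaussian noise of variance `2 * lr * temperature` to each full-batch step. The function names and all sizes are hypothetical.

```python
import numpy as np


def committee(weights, inputs):
    """One-hidden-layer network: average of tanh hidden units (assumed form)."""
    dim = inputs.shape[1]
    return np.tanh(inputs @ weights.T / np.sqrt(dim)).mean(axis=1)


def mse(weights, inputs, labels):
    """Half mean squared error of the network on a dataset."""
    return 0.5 * np.mean((committee(weights, inputs) - labels) ** 2)


def train_student(inputs, labels, hidden_units, steps=300, lr=0.5,
                  temperature=0.0, seed=0):
    """Full-batch gradient descent; temperature > 0 gives Langevin dynamics."""
    rng = np.random.default_rng(seed)
    n, dim = inputs.shape
    w = rng.standard_normal((hidden_units, dim))
    for _ in range(steps):
        act = np.tanh(inputs @ w.T / np.sqrt(dim))       # (n, hidden_units)
        error = act.mean(axis=1) - labels                # (n,)
        # Chain rule through the uniform readout and the tanh activation.
        grad_pre = error[:, None] * (1.0 - act ** 2) / hidden_units
        grad = grad_pre.T @ inputs / (np.sqrt(dim) * n)  # (hidden_units, dim)
        noise = np.sqrt(2.0 * lr * temperature) * rng.standard_normal(w.shape)
        w = w - lr * grad + noise
    return w
```

A teacher is drawn once with `rng.standard_normal((hidden_units, dim))`, labels are produced with `committee(teacher, inputs)`, and the student's generalization error is then `mse` evaluated on fresh inputs labeled by the same teacher.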
Multi-Model Ensemble and Reservoir Computing for River Discharge Prediction in Ungauged Basins
Authors: Mizuki Funato, Yohei Sawada
First: 2025-07-24T14:00:18+00:00 · Latest: 2025-12-16T17:07:02+00:00
Abstract
Despite the necessity for accurate flood prediction, many regions lack sufficient river discharge observations. Although numerous models for daily river discharge prediction exist, achieving high accuracy, interpretability, and efficiency under data-scarce conditions remains a major challenge. We address this with a novel method, HYdrological Prediction with multi-model Ensemble and Reservoir computing (HYPER). Our approach applies Bayesian model averaging (BMA) to 47 "uncalibrated" catchment-based conceptual hydrological models. A reservoir computing (RC) model, a type of machine learning model, is then trained via linear regression to correct BMA output errors, a non-iterative process ensuring computational efficiency. For ungauged basins, we infer the required BMA and RC weights by mapping them to catchment attributes from gauged basins, creating a generalizable framework. Evaluated on 87 Japanese basins, in a data-rich scenario, HYPER (median Nash Sutcliffe Efficiency, NSE, of 0.59) performed comparably to a benchmark LSTM (NSE 0.64) but required only 3 % of its computational time. In a data-scarce scenario (where only ~20 % of basins are gauged), HYPER maintained robust performance (NSE 0.51) by leveraging the physical structure of the ensemble. In contrast, the LSTM's performance degraded substantially (NSE -0.61) due to data insufficiency. These results demonstrate that calibrating individual conceptual hydrological models is unnecessary when using a sufficiently large ensemble that is assembled and combined with machine-learning-based bias correction. HYPER provides a robust, efficient, and generalizable solution for discharge prediction, particularly in ungauged basins. By eliminating basin-specific calibration, HYPER offers a scalable, interpretable framework for accurate hydrological prediction in diverse data-scarce regions.
中文标题/摘要
标题:多模型集成与水库计算在无水文站流域河流径流预测中的应用
尽管准确的洪水预测至关重要,但许多地区缺乏足够的河流径流观测数据。尽管存在许多用于每日河流径流预测的模型,但在数据稀缺条件下实现高精度、可解释性和高效性仍然是一个重大挑战。我们提出了一种名为HYdrological Prediction with multi-model Ensemble and Reservoir computing (HYPER)的新方法。该方法使用贝叶斯模型平均(BMA)对47个“未校准”的基于流域的概念性水文模型进行集成。然后,通过线性回归训练一种称为水库计算(RC)的机器学习模型来修正BMA输出误差,这是一个非迭代过程,确保了计算效率。对于无水文站流域,我们通过将所需的BMA和RC权重映射到有水文站流域的流域属性上来推断它们,从而创建了一个可推广的框架。在87个日本流域上进行评估,在数据丰富的情景下,HYPER(中位数纳什-斯图尔特效率,NSE,为0.59)与基准LSTM(NSE为0.64)表现相当,但仅需其3%的计算时间。在数据稀缺的情景下(只有约20%的流域有观测数据),HYPER通过利用集成的物理结构保持了稳健的表现(NSE为0.51),而LSTM由于数据不足表现大幅下降(NSE为-0.61)。这些结果表明,在使用足够大的集成且与基于机器学习的偏差校正相结合时,无需对个别概念性水文模型进行校准。HYPER提供了一种稳健、高效且可推广的径流预测解决方案,特别是在无水文站流域。通过消除流域特定校准,HYPER为在多样化的数据稀缺地区提供了一种可扩展且可解释的水文预测框架。
Summary / 总结
The research aims to improve river discharge prediction in data-scarce regions by developing HYPER, a method combining Bayesian model averaging (BMA) and reservoir computing (RC). HYPER applies BMA to 47 uncalibrated conceptual hydrological models and trains an RC model via linear regression to correct BMA output errors. On 87 Japanese basins in a data-rich scenario, HYPER (median NSE 0.59) performed comparably to a benchmark LSTM (NSE 0.64) while requiring only 3% of its computational time. In a data-scarce scenario with only about 20% of basins gauged, HYPER maintained robust performance (NSE 0.51) while the LSTM degraded substantially (NSE -0.61).
研究旨在通过开发一种名为HYPER的新方法,提高数据稀缺地区河流径流预测的准确性。HYPER使用47个未校准的水文模型进行贝叶斯模型平均,并使用一种机器学习模型进行偏差校正,确保计算效率。在数据丰富的情景下,HYPER的Nash Sutcliffe效率(NSE)中位数为0.59,与基准LSTM(0.64)相当,但仅需其3%的计算时间。在数据稀缺的情景下,HYPER保持了稳健的表现(NSE 0.51),而LSTM的表现显著下降(NSE -0.61)。这表明HYPER可以在无需特定流域校准的情况下提供准确预测,使其成为适用于不同数据稀缺地区的可扩展和可解释的解决方案。
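The scores above are Nash Sutcliffe Efficiency (NSE) values. As a minimal sketch of the standard definition, NSE is one minus the ratio of the squared simulation error to the variance of the observations around their mean. The function name and the array inputs are illustrative assumptions; this is not code from the paper.

```python
import numpy as np


def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    variance = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / variance
```

An NSE of 1 means a perfect simulation, 0 means the model is no better than always predicting the observed mean, and a negative value such as the LSTM's -0.61 in the data-scarce scenario means it is worse than that constant predictor.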