Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
Authors: Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
First: 2025-11-28T18:59:58+00:00 · Latest: 2025-11-28T18:59:58+00:00
Comments: Video-R2 Technical Report
Abstract
Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual-step post-training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open-sourced.
中文标题/摘要
标题:视频-R2:强化多模态语言模型的一致性和基于视觉的推理
对动态视觉内容进行推理仍然是多模态大型语言模型的核心挑战。最近的思考模型生成了明确的推理痕迹以提高可解释性;然而,它们的推理虽然看似有说服力,但往往是逻辑上不一致或与视觉证据联系不紧密。我们通过两个诊断指标识别并形式化了这些问题:推理答案一致性(TAC),衡量推理与答案之间的对齐程度;视频注意力得分(VAS),衡量推理对视觉线索与文本线索的依赖程度。在11个视频推理基准测试中的分析表明,当前模型严重依赖于语言先验而非视觉内容。为解决这一问题,我们提出了一种强化学习方法,以增强时间精度和推理一致性。该方法结合了时间戳感知的监督微调与由新颖的时间对齐奖励(TAR)引导的组相对策略优化(GRPO)。这一双重步骤的后训练阶段鼓励时间对齐和因果一致的视频推理。由此产生的模型Video R2在多个基准测试中实现了TAC、VAS和准确率的持续提高,表明时间对齐和推理一致性改进能够提高视频理解的准确性和可信度。我们的代码、数据集和模型将开源。
Summary / 总结
The paper addresses the challenge of consistent and grounded reasoning in multimodal language models when processing dynamic visual content. It introduces two diagnostic metrics, Think Answer Consistency (TAC) and Video Attention Score (VAS), to evaluate reasoning quality. The authors propose Video-R2, a model enhanced through reinforcement learning, which combines timestamp-aware fine-tuning and Group Relative Policy Optimization guided by a Temporal Alignment Reward. The model shows improved TAC, VAS, and accuracy across multiple benchmarks, indicating better temporal alignment and reasoning coherence.
论文针对多模态语言模型在处理动态视觉内容时的推理挑战。引入了两个诊断指标TAC和VAS,分别评估推理一致性和视觉接地。作者发现当前模型主要依赖于语言先验。为改进这一问题,他们提出了Video-R2,结合了时间戳感知的细调和由TAR引导的GRPO,增强了时间精度和推理一致性。Video-R2在多个基准测试中展示了在TAC、VAS和准确率上的持续改进,表明更好的时间对齐和推理连贯性有助于更准确和可信的视频理解。
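The abstract names GRPO guided by a Temporal Alignment Reward but does not give formulas. As a rough illustration only, the sketch below shows the group-relative normalization commonly associated with GRPO: each sampled response is scored against the mean and standard deviation of its own group. The `combined_reward` function, its `weight` parameter, and the sample values are hypothetical stand-ins, not the paper's TAR definition.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation is used; implementations vary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def combined_reward(answer_correct, temporal_alignment, weight=0.5):
    """Hypothetical reward: answer correctness plus a weighted alignment score.

    This is an illustrative placeholder, not the Temporal Alignment Reward
    defined in the paper.
    """
    return float(answer_correct) + weight * temporal_alignment


# Four sampled responses to the same video question (made-up values).
group = [
    combined_reward(True, 0.9),
    combined_reward(True, 0.2),
    combined_reward(False, 0.6),
    combined_reward(False, 0.0),
]
advantages = group_relative_advantages(group)
```

Responses that beat their group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to form a baseline.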
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Authors: Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan, Salman Khan
First: 2025-11-28T18:59:57+00:00 · Latest: 2025-11-28T18:59:57+00:00
Comments: Technical Report
Abstract
Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos", i.e., once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine-grained spatio-temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video-CoM-Instruct, an 18K instruction-tuning dataset curated for multi-step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning-aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step-level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large-scale models. Ablation studies demonstrate that reasoning-aware rewards improve both accuracy and interpretability. Code: https://github.com/mbzuai-oryx/Video-CoM
中文标题/摘要
标题:Video-CoM:通过链式操作进行交互式视频推理
近期的多模态大型语言模型(MLLMs)已经提升了视频理解能力,但大多数模型仍然“思考视频”,即一旦视频被编码,推理过程完全在文本中进行,将视觉输入视为静态背景。这种被动的范式造成了语义瓶颈:模型无法重新观看、重新聚焦或验证证据,导致在需要精细时空理解的任务中进行浅层视觉推理。在本文中,我们提出了交互式视频推理,这是一种新的范式,将视频转变为一个活跃的认知工作空间,使模型能够“与视频一起思考”。我们的模型Video CoM通过链式操作(CoM)进行推理,执行迭代的视觉操作以收集和提炼证据。为了支持这种行为,我们构建了Video CoM Instruct,一个包含18000条指令调优数据集,专门用于多步骤操作推理。除了监督学习,我们还通过具有推理意识的组相对策略优化(GRPO)强化学习进一步优化了操作策略。与仅依赖稀疏答案奖励的先前工作不同,我们的方法引入了步骤级推理奖励,引导模型进行基于事实且一致的推理。Video CoM在九个视频推理基准测试中取得了优异的结果,平均性能比最近的先进模型提高了3.6个百分点,而训练数据量仅为25000个SFT和3000个GRPO视频样本,远少于同类大规模模型。消融研究显示,具有推理意识的奖励提高了准确性和可解释性。代码:https://github.com/mbzuai-oryx/Video-CoM
Summary / 总结
This work introduces Interactive Video Reasoning, a new paradigm that enables models to "think with videos" by performing iterative visual actions to gather and refine evidence. The model, Video-CoM, reasons through a Chain of Manipulations (CoM). It is trained on an 18K instruction-tuning dataset and further optimized via reinforcement learning with step-level, reasoning-aware rewards. Video-CoM outperforms recent state-of-the-art models on nine video reasoning benchmarks, improving average performance by 3.6 percent while training on only 25K SFT and 3K GRPO video samples. Ablation studies show that reasoning-aware rewards enhance both accuracy and interpretability.
这项工作提出了交互式视频推理的新范式,使模型能够通过执行一系列视觉操作来收集和提炼证据来进行‘思考’。该模型Video CoM使用链式操作(CoM)进行推理。它在18K指令调优的数据集上进行训练,并通过带有推理意识的奖励进行强化学习优化。Video CoM在九个视频推理基准测试中表现出色,平均性能提高了3.6个百分点,且训练样本量远少于同类大型模型。消融研究显示,带有推理意识的奖励提高了准确性和可解释性。
Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction
Authors: Bao Shu, Yan Cai, Jianjian Sun, Chunrui Han, En Yu, Liang Zhao, Jingcheng Hu, Yinmin Zhang, Haoran Lv, Yuang Peng, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Xiangyu Yue
First: 2025-11-28T18:59:47+00:00 · Latest: 2025-11-28T18:59:47+00:00
Comments: 17 pages, 9 figures
Abstract
Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reasoning. To address these issues, we explore world-model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning, allowing the model to shape thinking directly through its doing, and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over-relying on environmental cues. Our experiments on Sokoban, Maze, and Taxi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.
中文标题/摘要
标题:通过做思考:通过多轮交互在大语言模型中构建高效的世界模型推理
开发稳健的世界模型推理对于大型语言模型(LLM)代理在复杂环境中规划和交互至关重要。虽然多轮交互通过真实的反馈提供了对环境动态的优越理解,但当前的方法往往施加了僵化的推理过程,这限制了模型的主动学习,最终阻碍了高效的世界模型推理。为了解决这些问题,我们探索了通过高效交互和主动推理(WMAct)的世界模型内化,这使模型摆脱了结构化推理的束缚,允许模型通过其行为直接塑造思考,并通过两种关键机制实现有效且高效的推理:(1)奖励重新缩放机制,根据行为效果调整结果奖励,以激励减少冗余和目的性交互;(2)交互频率退火策略,逐步减少允许的最大交互轮次,这迫使模型压缩其学习并内化环境动态,而不是过度依赖环境线索。我们在Sokoban、迷宫和出租车任务上的实验表明,WMAct能够实现有效的世界模型推理,能够一次性解决之前需要多次交互才能完成的任务,并且能够促进强大的环境复杂性转移,提高在一系列推理基准测试中的性能。
Summary / 总结
The research aims to enhance world model reasoning in large language model agents through efficient multi-turn interaction. It introduces WMAct, which involves a reward rescaling mechanism and an interaction frequency annealing strategy. These mechanisms allow the model to learn and internalize environmental dynamics more effectively, reducing the need for multiple interactions and improving performance on reasoning benchmarks.
研究旨在通过高效的多轮交互来增强大型语言模型代理的世界模型推理能力。引入了WMAct机制,包括奖励重新缩放机制和交互频率退火策略。这些机制使模型能够更有效地学习和内化环境动态,减少需要的交互次数,并在推理基准测试中提高性能。
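The two WMAct mechanisms are described only in words, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes a linear annealing schedule for the turn budget and measures "action efficacy" as the fraction of actions that had an effect; the function names, the default budgets, and both formulas are hypothetical.

```python
def max_turns(step, total_steps, start_turns=8, end_turns=1):
    """Linearly anneal the allowed interaction turns over training.

    Assumption: a linear schedule; the paper only states that the maximum
    number of turns is progressively reduced.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    budget = round(start_turns - frac * (start_turns - end_turns))
    return max(end_turns, budget)


def rescaled_reward(outcome_reward, effective_actions, total_actions):
    """Scale the outcome reward by the share of actions that had an effect.

    Assumption: efficacy is a simple ratio; redundant actions lower the reward.
    """
    if total_actions == 0:
        return 0.0  # no actions taken, so nothing to credit
    return outcome_reward * effective_actions / total_actions


# Turn budget sampled at a few points of a 100-step run (illustrative).
schedule = [max_turns(s, 100) for s in (0, 25, 50, 75, 100)]
```

Under this schedule the agent starts with several turns of environment feedback and ends with a single turn, which matches the stated goal of solving tasks in one turn that previously needed multiple interactions.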
AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
Authors: Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
First: 2025-11-28T18:59:01+00:00 · Latest: 2025-11-28T18:59:01+00:00
Comments: Homepage: https://hkust-c4g.github.io/AnyTalker-homepage
Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.
中文标题/摘要
标题:AnyTalker:通过互动精炼扩展多人大规模说话视频生成
近年来,多人大规模视频生成开始受到关注。虽然一些初步工作探索了基于音频的多人大规模说话视频生成,但它们往往面临多样化的多人大规模数据收集成本高昂以及难以驱动多个身份的连贯互动的挑战。为了解决这些挑战,我们提出了AnyTalker,一种具有可扩展多流处理架构的多人大规模生成框架。具体来说,我们扩展了Diffusion Transformer的注意力模块,引入了一种新颖的身份感知注意力机制,该机制迭代处理身份-音频对,允许任意扩展可驱动的身份数量。此外,训练多人大规模生成模型需要大量的多人大规模数据。我们提出的训练管道仅依赖单人视频来学习多人大规模说话模式,并仅使用少量真实的多人大规模片段来精炼互动性。此外,我们还贡献了一个专门的度量标准和数据集,用于评估生成的多人大规模视频的自然性和互动性。广泛的实验表明,AnyTalker实现了出色的唇同步、视觉质量和自然互动性,实现了数据成本和身份扩展性的良好平衡。
Summary / 总结
AnyTalker is a multi-person generation framework that addresses the challenges of high data costs and difficulty in driving multiple identities with coherent interactivity. It uses an identity-aware attention mechanism and a novel training pipeline that relies on single-person videos to learn multi-person speaking patterns. The framework achieves remarkable lip synchronization, visual quality, and natural interactivity, balancing data costs and identity scalability.
AnyTalker 是一个多个人生成框架,旨在解决高数据成本和难以驱动多个身份的连贯互动性的问题。它使用了身份感知的注意力机制和一个依赖单人视频来学习多人说话模式的训练流程,只需要少量的真实多人视频来优化互动性。该框架展示了出色的唇部同步、视觉质量和自然互动性,平衡了数据成本和身份可扩展性。
ThetaEvolve: Test-time Learning on Open Problems
Authors: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
First: 2025-11-28T18:58:14+00:00 · Latest: 2025-11-28T18:58:14+00:00
Comments: 30 pages, link: https://github.com/ypwang61/ThetaEvolve
Abstract
Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system, so models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. Among other components, ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals. ThetaEvolve is the first evolving framework that enables a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Moreover, across two models and four open tasks, we find that ThetaEvolve with RL at test time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both the trained target task and other unseen tasks. We release our code publicly: https://github.com/ypwang61/ThetaEvolve
中文标题/摘要
标题:ThetaEvolve:开放问题测试时学习
大型语言模型(LLMs)的最新进展使数学发现取得了突破,AlphaEvolve就是一个闭源系统,通过进化程序提高开放问题的边界。然而,它依赖于前沿LLM的集合来实现新的边界,是一个纯粹的推理系统,模型无法内化进化的策略。我们引入了ThetaEvolve,一个开源框架,简化并扩展了AlphaEvolve,使其在测试时高效地扩展上下文学习和强化学习(RL),允许模型从其经验中不断学习以改进开放优化问题。ThetaEvolve包含一个单一的LLM,一个大型程序数据库以增强探索,批量采样以提高吞吐量,懒惰惩罚以避免停滞的输出,以及可选的奖励塑造以提供稳定的训练信号等。ThetaEvolve是第一个使小型开源模型,如DeepSeek-R1-0528-Qwen3-8B,能够在AlphaEvolve提到的开放问题(圆盘堆积和第一个自相关不等式)上达到新的最佳已知边界。此外,在两个模型和四个开放任务上,我们发现ThetaEvolve在测试时使用RL始终优于仅推理的基线,并且模型确实学会了进化的功能,RL训练的检查点在训练目标任务和其他未见过的任务上都展示了更快的进步和更好的最终性能。我们已公开发布我们的代码:https://github.com/ypwang61/ThetaEvolve
Summary / 总结
ThetaEvolve is an open-source framework that extends AlphaEvolve to enable efficient scaling of in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences to improve open optimization problems. It uses a single LLM, a large program database, batch sampling, and lazy penalties. ThetaEvolve achieved new best-known bounds on open problems like circle packing and first auto-correlation inequality and outperformed inference-only baselines across two models and four open tasks, demonstrating the model's evolving capabilities through RL training.
ThetaEvolve 是一个开源框架,扩展了 AlphaEvolve,以高效地在测试时扩展上下文学习和强化学习 (RL),使模型能够从其经验中不断学习以改进开放优化问题。它使用一个单一的 LLM、一个大型程序数据库、批量采样和懒惰惩罚。ThetaEvolve 在圆填充和第一个自相关不等式等开放问题上达到了新的最佳已知界限,并且在两个模型和四个开放任务上,ThetaEvolve 通过 RL 训练超越了仅推理的基线,展示了模型通过 RL 训练的进化能力。
Visual Generation Tuning
Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
First: 2025-11-28T18:57:13+00:00 · Latest: 2025-11-28T18:57:13+00:00
Abstract
Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at https://github.com/hustvl/VGT.
中文标题/摘要
标题:视觉生成调优
大规模视觉语言模型(VLMs)通过广泛的预训练有效地弥合了模态差距,获得了与语言对齐的复杂视觉表示。然而,尚未充分探索这些优化用于多模态理解任务的表示是否内在地具有视觉生成的潜力。在本文中,我们提出了VGT(视觉生成调优),这是一种新型范式,旨在激发任何视觉语言模型中潜在的视觉生成能力。通过对充分预训练的VLMs进行高效的视觉生成调优,我们显著降低了自回归建模在连续空间中的对齐成本并加速了收敛(20倍加速)。具体而言,我们摒弃了为扩散变换器设计的纠缠像素级VAEs,并通过将预训练VLMs的语义编码器与像素解码器的潜在表示对齐来构建VGT-AE。在图像重建任务中,我们实现了26.67 PSNR和0.50 rFID,在28倍压缩比下优于专门的VAEs;在视觉生成任务中,我们在自回归模型中取得了最先进的结果,GenEval上为0.77,DPG-Bench上为78.73。此外,我们提出的VGT展示了显著的扩展潜力,并且可以赋予任何用于多模态理解训练的VLMs视觉生成能力,这开辟了探索下一代统一多模态基础模型的新途径。模型和代码可在https://github.com/hustvl/VGT/获取。
Summary / 总结
This paper introduces VGT (Visual Generation Tuning), a novel method to enhance the visual generation capabilities of pre-trained Vision Language Models (VLMs) by aligning semantic encoders with latent pixel representations. VGT significantly improves the speed of autoregressive modeling and achieves superior performance in image reconstruction and generation tasks, outperforming specialized VAEs and setting new benchmarks in autoregressive models. The approach demonstrates scalability and versatility, enabling any VLMs trained for multimodal understanding to generate visual content efficiently.
论文提出了VGT(视觉生成调优)方法,通过将语义编码器与潜在表示对齐来增强大型视觉语言模型(VLMs)的视觉生成能力。这种方法显著提高了自回归建模的速度和性能,在图像重建任务中实现了26.67 PSNR和0.50 rFID,在视觉生成任务中取得了最先进的结果,GenEval得分为0.77,DPG-Bench得分为78.73。VGT具有可扩展性,并可以应用于任何用于多模态理解的VLMs,为开发统一的多模态基础模型开辟了新途径。
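The reconstruction result above is reported in PSNR. For readers unfamiliar with the metric, here is a small sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE), applied to flat lists of pixel values. The function name and the 8-bit `max_value` default are choices made for this example; the paper's evaluation code is not shown in the abstract.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# A reconstruction that is off by one intensity level at every pixel.
value = psnr([10, 20, 30], [11, 21, 31])
```

Higher values mean the reconstruction is closer to the original; because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.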
SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments
Authors: Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Yubei Chen
First: 2025-11-28T18:56:02+00:00 · Latest: 2025-11-28T18:56:02+00:00
Abstract
Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.
中文标题/摘要
标题:SmallWorlds:在孤立环境中评估世界模型的动力学理解
当前的世界模型缺乏一个统一且可控的环境来进行系统的评估,使得难以判断它们是否真正捕捉到了支配环境动力学的底层规则。在本工作中,我们通过引入SmallWorld基准测试,解决了这一开放挑战,该基准测试旨在在孤立且精确控制的动力学环境下评估世界模型的能力,而不依赖于手工设计的奖励信号。使用此基准测试,我们在完全可观测的状态空间中对包括循环状态空间模型、变压器、扩散模型和神经ODE在内的代表性架构进行了全面实验,考察了它们在六个不同领域的表现。实验结果揭示了这些模型如何有效地捕捉环境结构,以及它们在长时间序列上的预测如何退化,突显了当前建模范式的优缺点,并为未来在表示学习和动力学建模方面的改进提供了见解。
Summary / 总结
This study introduces the SmallWorld Benchmark to evaluate world models in isolated and controlled environments, addressing the lack of a unified evaluation setting. The benchmark tests various architectures like Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE across six domains. The experiments show how well these models capture environment structure and their prediction accuracy over time, revealing both their strengths and limitations in dynamics modeling.
该研究引入了SmallWorld基准,以评估世界模型在隔离且精确控制的环境中的能力,解决了缺乏统一评估环境的问题。该基准测试了包括递归状态空间模型、变压器、扩散模型和神经ODE在内的多种架构在六个领域的表现。实验展示了这些模型在捕捉环境结构方面的效果及其长时间预测的准确性,揭示了它们在动力学建模中的优势和局限性。
The Price of Progress: Algorithmic Efficiency and the Falling Cost of AI Inference
Authors: Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson
First: 2025-11-28T18:47:33+00:00 · Latest: 2025-11-28T18:47:33+00:00
Abstract
Language models have seen enormous progress on advanced benchmarks in recent years, but much of this progress has only been possible by using more costly models. Benchmarks may therefore present a warped picture of progress in practical capabilities per dollar. To remedy this, we use data from Artificial Analysis and Epoch AI to form the largest dataset of current and historical prices to run benchmarks to date. We find that the price for a given level of benchmark performance has decreased remarkably fast, around $5\times$ to $10\times$ per year, for frontier models on knowledge, reasoning, math, and software engineering benchmarks. These reductions in the cost of AI inference are due to economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around $3\times$ per year. Finally, we recommend that evaluators both publicize and take into account the price of benchmarking as an essential part of measuring the real-world impact of AI.
中文标题/摘要
标题:进步的代价:算法效率与AI推理成本的下降
近年来,语言模型在高级基准测试中取得了巨大的进步,但这些进步很大程度上仅可能通过使用更昂贵的模型实现。因此,基准测试可能无法准确反映每美元实际能力的进步。为解决这一问题,我们利用人工分析和Epoch AI的数据,形成了迄今为止最大的基准测试成本数据集。我们发现,对于前沿模型的知识、推理、数学和软件工程基准测试,达到特定水平的基准测试性能的成本下降速度非常快,大约每年5到10倍。这些AI推理成本的降低是由经济力量、硬件效率改进和算法效率改进共同推动的。通过剔除开放模型以控制竞争效应,并除以硬件价格下降,我们估计算法效率的进步大约每年3倍。最后,我们建议评估者不仅要公布还要考虑基准测试的成本,将其作为衡量AI实际影响的重要组成部分。
Summary / 总结
The research aims to measure progress in practical AI capability per dollar by analyzing the price of running benchmarks. The study uses data from Artificial Analysis and Epoch AI to track current and historical benchmark prices. Key findings show that the price of achieving a given level of benchmark performance has fallen by roughly $5\times$ to $10\times$ per year for frontier models, driven by economic forces, hardware efficiency improvements, and algorithmic efficiency improvements. After isolating open models and dividing out hardware price declines, the study estimates algorithmic efficiency progress at around $3\times$ per year. The authors recommend that evaluators publicize and take into account the price of benchmarking when measuring the real-world impact of AI.
研究旨在通过分析运行基准的成本来评估AI进步的实际影响。研究使用来自Artificial Analysis和Epoch AI的数据追踪了运行基准成本的变化。关键发现表明,实现给定基准性能的成本大幅下降,大约每年减少$5$到$10$倍,主要归因于算法效率提升、硬件效率提升和经济因素。研究估计,仅算法效率的提升每年就贡献了$3$倍的成本减少。作者建议,基准评估者应考虑并披露基准测试的成本,以更好地衡量AI进步的实际影响。
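To make the annual rates concrete, the sketch below compounds a constant yearly price-reduction factor over several years. The $100 starting price and the two-year horizon are hypothetical values chosen for illustration; only the factors of 3, 5, and 10 per year come from the abstract.

```python
def price_after(initial_price, annual_factor, years):
    """Price after `years` if it falls by `annual_factor` times each year."""
    return initial_price / (annual_factor ** years)


# Hypothetical: a benchmark run that costs $100 today, projected two years out.
low_estimate = price_after(100.0, 5, 2)    # 5x per year
high_estimate = price_after(100.0, 10, 2)  # 10x per year
algorithmic_only = price_after(100.0, 3, 2)  # 3x per year from algorithms alone
```

Because the reductions compound, a 5x to 10x yearly decline turns into a 25x to 100x decline after two years, while the 3x algorithmic component alone would account for a 9x decline over the same period.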
Object-Centric Data Synthesis for Category-level Object Detection
Authors: Vikhyat Agarwal, Jiayi Cora Guo, Declan Hoban, Sissi Zhang, Nicholas Moran, Peter Cho, Srilakshmi Pattabiraman, Shantanu Joshi
First: 2025-11-28T18:41:46+00:00 · Latest: 2025-11-28T18:41:46+00:00
Comments: 10 pages, 10 figures
Abstract
Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, in which limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to fine-tune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.
中文标题/摘要
标题:基于对象的数据合成在类别级对象检测中的应用
深度学习方法在对象检测中已经实现了对图像中特定对象类别的可靠检测。然而,将模型的检测能力扩展到新的对象类别需要大量的标注训练数据,这在获取上既昂贵又耗时,尤其是对于现有数据集中代表性不足的长尾类。在这里,我们介绍了基于对象的数据设置,当可用数据以对象为中心的形式(多视角图像或3D模型)存在时,系统地评估了四种不同数据合成方法在该设置下对新型对象类别的对象检测模型进行微调的性能。这些方法基于简单的图像处理技术、3D渲染和图像扩散模型,使用对象为中心的数据来合成具有不同上下文连贯性和复杂度的现实且杂乱的图像。我们评估了这些方法如何使模型在现实数据中实现类别级泛化,并在这一数据受限的实验设置中展示了显著的性能提升。
Summary / 总结
The research aims to address the challenge of extending object detection models to new classes with limited annotated data. It evaluates four data synthesis methods that use object-centric data (multi-view images or 3D models) to generate realistic, cluttered images for fine-tuning object detection models. The methods are based on simple image processing, 3D rendering, and image diffusion models. The study reports significant performance gains on novel object categories in real-world data within this data-constrained setting.
论文旨在通过利用有限的对象中心数据来扩展对象检测模型到新的对象类别。它评估了四种数据合成方法,包括图像处理、3D渲染和图像扩散,以生成具有不同复杂度的现实图像。这些方法使模型能够在数据受限的环境中对新类别进行泛化,并在实验设置中显示出显著的性能提升。
DINO-Foresight: Looking into the Future with DINO
Authors: Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Venue: NeurIPS 2025
First: 2024-12-16T11:26:46+00:00 · Latest: 2025-11-28T18:40:16+00:00
Comments: NeurIPS 2025
Abstract
Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code at https://dino-foresight.github.io/.
中文标题/摘要
标题:DINO-Foresight:利用DINO预见未来
预测未来动态对于自动驾驶和机器人等应用至关重要,理解环境是关键。现有像素级方法计算成本高,且往往关注无关细节。为解决这些挑战,我们引入了DINO-Foresight,一种在预训练视觉基础模型(VFMs)语义特征空间中操作的新框架。我们的方法以自监督方式训练一个掩码特征变换器,以预测VFMs特征随时间的演变。通过预测这些特征,我们可以应用现成的任务特定头部进行各种场景理解任务。在这个框架中,VFMs特征被视为一个潜在空间,不同的头部附加到该空间以执行特定任务进行未来帧分析。大量实验表明,我们的框架具有很强的性能、鲁棒性和可扩展性。项目页面和代码见https://dino-foresight.github.io/。
Summary / 总结
DINO-Foresight is designed to predict future dynamics in applications such as autonomous driving and robotics by operating in the semantic feature space of pretrained Vision Foundation Models (VFMs). It trains a masked feature transformer in a self-supervised manner to forecast the evolution of VFM features over time, enabling the use of off-the-shelf task-specific heads for scene understanding tasks. The framework demonstrates strong performance, robustness, and scalability in extensive experiments.
DINO-Foresight 通过在预训练视觉基础模型(VFMs)的语义特征空间中操作来预测未来动态,适用于自动驾驶和机器人等应用。它通过自监督方式训练一个遮蔽特征变换器来预测VFMs特征随时间的演变,从而可以使用现成的任务特定头部进行各种场景理解任务。该框架在大量实验中展示了很强的性能、鲁棒性和可扩展性。
ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts
Authors: Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao
First: 2025-11-28T18:35:37+00:00 · Latest: 2025-11-28T18:35:37+00:00
Abstract
Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching's feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.
中文标题/摘要
标题:ASTRO:通过动力学引导轨迹展开实现自适应拼接
离线强化学习(RL)使智能体能够从预先收集的数据集中学习最优策略。然而,包含次优和碎片化轨迹的数据集会为奖励传播带来挑战,导致价值估计不准确和策略性能下降。虽然通过生成模型进行轨迹拼接提供了一种有前景的解决方案,但现有的增强方法经常生成要么局限于行为策略支持区域要么违背底层动力学的轨迹,从而限制了它们对策略改进的有效性。我们提出了一种名为ASTRO的数据增强框架,用于生成离线RL中分布新颖且动力学一致的轨迹。ASTRO首先学习时间距离表示以识别不同的且可达的拼接目标。然后,我们采用一种动力学引导的拼接规划器,通过Rollout Deviation Feedback(即目标状态序列与执行预测动作后实际到达状态序列之间的差距)自适应地生成连接动作序列,以提高轨迹拼接的可行性和可达性。这种方法通过拼接促进了有效的增强,并最终提升了策略学习。ASTRO在各种算法中均优于先前的离线RL增强方法,在具有挑战性的OGBench套件上实现了显著的性能提升,并在标准离线RL基准测试D4RL上展示了持续改进。
Summary / 总结
ASTRO is an offline reinforcement learning data augmentation framework that addresses the challenges of suboptimal and fragmented trajectories by generating distributionally novel and dynamics-consistent trajectories. It uses a temporal-distance representation to identify stitch targets and a dynamics-guided stitch planner to generate connecting action sequences, improving the feasibility and reachability of trajectory stitching. ASTRO outperforms previous methods across various algorithms and benchmarks, showing significant performance gains on the OGBench suite and consistent improvements on standard offline RL benchmarks like D4RL.
ASTRO 是一种用于离线强化学习的数据增强框架,旨在解决子最优和碎片化轨迹带来的挑战。它使用时间距离表示来识别缝合目标,并通过动态引导的缝合规划生成连接动作序列,确保轨迹既新颖又符合动力学。ASTRO 在各种算法中表现优于先前的方法,显示出在 OGBench 套件上的显著性能提升,并在标准离线强化学习基准 D4RL 上表现出一致的改进。
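The abstract defines Rollout Deviation Feedback as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions. The sketch below illustrates that definition on a toy problem; the 1-D clipped dynamics, the mean-absolute-gap measure, and all function names are assumptions made for this example and are not taken from the paper.

```python
def rollout(state, actions, dynamics):
    """Execute actions from `state` and return the visited states."""
    states = []
    for action in actions:
        state = dynamics(state, action)
        states.append(state)
    return states


def rollout_deviation(target_states, actual_states):
    """Mean absolute per-step gap between target and reached 1-D states."""
    if not target_states or len(target_states) != len(actual_states):
        raise ValueError("sequences must be non-empty and of equal length")
    gaps = (abs(t - s) for t, s in zip(target_states, actual_states))
    return sum(gaps) / len(target_states)


def toy_dynamics(state, action):
    """Hypothetical 1-D dynamics: the action is clipped to [-1, 1]."""
    return state + max(-1.0, min(1.0, action))


target = [1.0, 2.0, 4.0]   # desired connecting states
actions = [1.0, 1.0, 2.0]  # the last action exceeds what the dynamics allow
reached = rollout(0.0, actions, toy_dynamics)
deviation = rollout_deviation(target, reached)
```

A nonzero deviation signals that the planned segment is not reachable under the dynamics, which is the kind of feedback a stitch planner could use to revise its action sequence.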
Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent
Authors: Jianzhe Lin, Zeyu Pan, Yun Zhu, Ruiqi Song, Jining Yang
First: 2025-11-28T18:32:49+00:00 · Latest: 2025-11-28T18:32:49+00:00
Comments: 15 pages, 4 figures
Abstract
We introduce SuperIntelliAgent, an agentic learning framework that couples a trainable small diffusion model (the learner) with a frozen large language model (the verifier) to enable continual intelligence growth through self-supervised interaction. Unlike conventional supervised fine-tuning, SuperIntelliAgent learns autonomously without annotation: the learner generates candidate outputs, the verifier evaluates them through step-by-step reasoning, and their interaction produces chosen/rejected pairs for Direct Preference Optimization (DPO). This converts each input into a pseudo-training signal for continual improvement. The framework integrates dual-scale memory: short-term in-context memory that preserves reasoning traces across refinement cycles, and long-term memory that consolidates acquired knowledge through lightweight on-the-fly fine-tuning. A replay buffer retains samples that show verifiable progress and replays them as auxiliary supervision, reinforcing recent learning while forming adaptive curricula. SuperIntelliAgent is infrastructure-agnostic and can be plugged into existing agentic frameworks while turning ordinary inference loops into a lifelong optimization process. We posit that pairing a trainable learner with a reasoning-capable verifier forms a minimal reliable unit of growing intelligence, as paired feedback and partial-history replay yield richer learning curricula and stronger preference alignment. With a small number of automatically generated DPO pairs, the learner improves across all benchmarks, indicating that this mechanism provides a promising direction for continual intelligence accumulation and real-world deployment.
中文标题/摘要
标题:迈向连续智能增长:自我训练、持续学习与双尺度记忆在超级智能代理中的应用
我们介绍了超级智能代理,这是一种代理学习框架,将可训练的小扩散模型(学习者)与冻结的大语言模型(验证者)结合在一起,通过自我监督的交互实现持续智能增长。与传统的监督微调不同,超级智能代理无需标注即可自主学习:学习者生成候选输出,验证者通过逐步推理进行评估,他们的交互产生被选择/拒绝的对用于直接偏好优化(DPO)。这将每个输入转化为持续改进的伪训练信号。该框架整合了双尺度记忆:短期上下文记忆,保留了推理痕迹,贯穿于细化循环;长期记忆,通过轻量级的即席微调巩固获得的知识。重播缓冲区保留了显示可验证进步的样本,并将它们作为辅助监督重播,强化最近的学习,同时形成自适应课程。超级智能代理是基础设施无关的,可以插入现有的代理框架,将普通的推理循环转变为终身优化过程。我们提出,将可训练的学习者与推理能力的验证者配对形成增长智能的最小可靠单元,配对反馈和部分历史重播产生更丰富的学习课程和更强的偏好对齐。通过少量自动生成的DPO对,学习者在所有基准测试中都得到了改进,表明这种机制为持续智能积累和实际部署提供了有希望的方向。
Summary / 总结
SuperIntelliAgent is an agentic learning framework that pairs a trainable small diffusion model (the learner) with a frozen large language model (the verifier) to enable autonomous continual learning through self-supervised interaction. The learner generates candidate outputs, the verifier evaluates them through step-by-step reasoning, and the resulting chosen/rejected pairs are used for Direct Preference Optimization (DPO), converting each input into a pseudo-training signal. The framework includes dual-scale memory: short-term in-context memory that preserves reasoning traces across refinement cycles, and long-term memory that consolidates knowledge through lightweight on-the-fly fine-tuning. A replay buffer retains samples that show verifiable progress and replays them as auxiliary supervision, forming adaptive curricula. With a small number of automatically generated DPO pairs, the learner improves across all benchmarks, suggesting a promising direction for continual intelligence growth.
SuperIntelliAgent 是一个结合了可训练小型扩散模型和冻结大型语言模型的智能学习框架,通过自我监督交互实现持续智能增长。它使用直接偏好优化生成和评估候选输出,将每个输入转换为伪训练信号。该框架包括双尺度记忆:短期上下文记忆用于保留推理痕迹,长期记忆用于知识巩固。回放缓冲区保留进步样本作为辅助监督,形成适应性课程。通过少量反馈,学习者在所有基准测试中都显示出改进,表明这是一种有希望的方向,用于持续智能积累和实际部署。
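The chosen/rejected pairs described above feed a preference objective. The following is a minimal sketch of the standard DPO loss for a single pair, not the authors' implementation: the abstract does not state the DPO variant, the value of `beta`, or how log-probabilities are obtained for a diffusion learner, so the function name, its arguments, and the default `beta` are all assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair.

    All arguments are sequence log-probabilities: the first two under the
    trainable policy, the last two under the frozen reference policy.
    `beta` (hypothetical default) scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline_loss = dpo_loss(0.0, 0.0, 0.0, 0.0)
# Raising the chosen sample's likelihood relative to the rejected one lowers the loss.
improved_loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss depends only on how far the policy has moved from the reference on each sample, which is why a frozen verifier that merely ranks outputs is enough to produce a training signal.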
Uncovering Zero-Shot Generalization Gaps in Time-Series Foundation Models Using Real-World Videos
Authors: Lujun Li, Lama Sleem, Yiqun Wang, Yangjie Xu, Niccolò Gentile, Radu State
Venue: AAAI 2026
First: 2025-09-30T14:53:05+00:00 · Latest: 2025-11-28T18:31:29+00:00
Comments: This paper has been accepted by Artificial Intelligence for Time Series Analysis (AI4TS) Workshop @ AAAI 2026: Theory, Algorithms, and Applications
Abstract
Recent research on time-series foundation models (TSFMs) has underscored the scarcity of real-world data, which existing datasets often supplement with synthetic sources whose generalizability remains debated. In this work, we propose a novel benchmarking approach: we aim to build a curated dataset reflecting real-world physical temporal dynamics by extracting temporal signals from real-world videos using optical flow. To this end, we introduce REAL-V-TSFM, a novel dataset designed to capture rich and diverse time series derived from real-world videos. Experimental results on state-of-the-art TSFMs under zero-shot forecasting show that, despite strong performance on conventional benchmarks, these models exhibit performance degradation on the proposed dataset, suggesting limited generalizability to novel datasets. These findings underscore the need for novel approaches to acquiring time series data and highlight the lack of universality in recent TSFMs, while further validating the effectiveness of our video-based time series data extraction pipeline.
中文标题/摘要
标题:利用真实世界视频揭示时间序列基础模型的零样本泛化缺口
近期关于时间序列基础模型(TSFMs)的研究强调了现实世界数据的稀缺性,现有数据集中通常使用合成来源进行补充,但其泛化能力仍存在争议。因此,在这项工作中,我们提出了一种新的基准测试方法:特别是,我们旨在构建一个反映现实世界物理时间动态的精选数据集,通过光学流从真实世界视频中提取时间信号。因此,我们引入了REAL-V-TSFM,这是一个旨在捕捉来自真实世界视频的丰富多样时间序列的新数据集。在零样本预测方面,对最先进的TSFMs的实验结果表明,尽管在传统基准测试中表现出色,但这些模型在提出的数据集上的性能下降,表明其对新数据集的泛化能力有限。这些发现强调了获取时间序列数据的新方法的必要性,并突显了最近TSFMs的缺乏普适性,同时进一步验证了我们基于视频的时间序列数据提取管道的有效性。
Summary / 总结
This study examines the limited generalizability of time-series foundation models (TSFMs) through a novel benchmarking approach. It introduces REAL-V-TSFM, a curated dataset of rich and diverse time series extracted from real-world videos using optical flow. Under zero-shot forecasting, state-of-the-art TSFMs that perform strongly on conventional benchmarks show degraded performance on the proposed dataset, indicating limited zero-shot generalization. This highlights the need for new methods of acquiring time series data and suggests that recent TSFMs lack universality.
本研究通过提出一个新的基准数据集REAL-V-TSFM,该数据集从真实世界视频的光学流中捕捉到真实的时空动态,来解决时间序列基础模型(TSFMs)的泛化能力有限的问题。实验结果显示,这些模型在传统基准上的表现良好,但在新数据集上的表现却有所下降,表明它们对新型数据集的泛化能力有限。这表明需要新的方法来获取时间序列数据,并突显了最近TSFMs的缺乏普适性。
Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model
Authors: Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
First: 2025-11-28T18:26:39+00:00 · Latest: 2025-11-28T18:26:39+00:00
Comments: Technical Report, Project page: https://hunyuan-gamecraft-2.github.io/
Abstract
Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video content through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".
中文标题/摘要
标题:浑元-GameCraft-2:遵循指令的互动游戏世界模型
生成式世界模型的最新进展使创建开放性游戏环境取得了显著进步,从静态场景合成发展到动态、互动模拟。然而,当前的方法仍然受限于僵化的动作模式和高昂的标注成本,限制了其对多样化的游戏内交互和玩家驱动动态的建模能力。为了解决这些挑战,我们提出了浑元-GameCraft-2,这是一种新的基于指令的互动范式,用于生成游戏世界建模。我们的模型不再依赖固定的键盘输入,而是允许用户通过自然语言提示、键盘或鼠标信号来控制游戏视频内容,从而在生成的世界中实现灵活且语义丰富的互动。我们正式定义了互动视频数据的概念,并开发了一个自动化过程,将大规模、无结构的文本-视频对转换为因果对齐的互动数据集。基于一个14B图像到视频混合专家(MoE)基础模型,我们的模型结合了文本驱动的互动注入机制,以精细控制摄像机运动、角色行为和环境动态。我们引入了一个以互动为重点的基准测试,InterBench,以全面评估互动性能。广泛的实验表明,我们的模型生成了时间连贯且因果合理的互动游戏视频,能够忠实响应各种自由形式的用户指令,如“开门”、“画火把”或“触发爆炸”。
Summary / 总结
Hunyuan-GameCraft-2 is an instruction-driven interaction model for generative game world modeling that lets users control game video content through natural language prompts, keyboard, or mouse signals. It is built on a 14B image-to-video Mixture-of-Experts (MoE) foundation model and incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. The authors also introduce InterBench, an interaction-focused benchmark. Experiments show that the model generates temporally coherent and causally grounded interactive game videos in response to diverse, free-form user instructions.
Hunyuan-GameCraft-2 是一种新的指令驱动交互方法,用于生成游戏世界建模。它允许用户通过自然语言提示、键盘或鼠标信号来控制游戏内容,实现灵活且语义丰富的交互。该模型基于一个 14B 图像到视频的 Mixture-of-Experts 基础模型,并包含一个文本驱动的交互注入机制。实验表明,它能够生成与多样且自由形式的用户指令(如“开门”、“画火把”或“触发爆炸”)相一致的时间连贯且因果相关的交互式游戏视频。
DisMo: Disentangled Motion Representations for Open-World Motion Transfer
Authors: Thomas Ressler-Antal, Frank Fundel, Malek Ben Alaya, Stefan Andreas Baumann, Felix Krause, Ming Gui, Björn Ommer
Venue: NeurIPS 2025
First: 2025-11-28T18:25:54+00:00 · Latest: 2025-11-28T18:25:54+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Recent advances in text-to-video (T2V) and image-to-video (I2V) models have enabled the creation of visually compelling and dynamic videos from simple textual descriptions or initial frames. However, these models often fail to provide an explicit representation of motion separate from content, limiting their applicability for content creators. To address this gap, we propose DisMo, a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective. Our representation is generic and independent of static information such as appearance, object identity, or pose. This enables open-world motion transfer, allowing motion to be transferred across semantically unrelated entities without requiring object correspondences, even between vastly different categories. Unlike prior methods, which trade off motion fidelity against prompt adherence, overfit to source structure, or drift from the described action, our approach disentangles motion semantics from appearance, enabling accurate transfer and faithful conditioning. Furthermore, our motion representation can be combined with any existing video generator via lightweight adapters, allowing us to effortlessly benefit from future advancements in video models. We demonstrate the effectiveness of our method through a diverse set of motion transfer tasks. Finally, we show that the learned representations are well-suited for downstream motion understanding tasks, consistently outperforming state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester. Project page: https://compvis.github.io/DisMo
中文标题/摘要
标题:DisMo:开放世界运动转移的解耦运动表示
近年来,文本到视频(T2V)和图像到视频(I2V)模型的发展,使得从简单的文本描述或初始帧生成视觉上引人入胜且动态的视频成为可能。然而,这些模型往往未能提供与内容分离的显式运动表示,限制了其在内容创作者中的应用。为了解决这一问题,我们提出了一种名为DisMo的新范式,通过图像空间重构目标直接从原始视频数据中学习抽象的运动表示。我们的表示是通用的,并且独立于静态信息,如外观、物体身份或姿态。这使得在语义上不相关的实体之间进行开放世界运动转移成为可能,即使在不同类别之间,也不需要物体对应关系。与先前的方法不同,这些方法在运动保真度和提示一致性之间进行权衡,过度拟合源结构或偏离描述的动作,我们的方法将运动语义与外观解耦,从而实现准确的转移和忠实的条件化。此外,我们的运动表示可以与任何现有的视频生成器通过轻量级适配器结合,使我们能够轻松受益于视频模型的未来进步。我们通过一系列运动转移任务展示了该方法的有效性。最后,我们证明了所学习的表示非常适合下游运动理解任务,在基准测试Something-Something v2和Jester上的零样本动作分类任务中,持续优于最先进的视频表示模型V-JEPA。项目页面:https://compvis.github.io/DisMo
Summary / 总结
DisMo is a novel paradigm for learning abstract motion representations directly from raw video data via an image-space reconstruction objective, enabling open-world motion transfer without requiring object correspondences. It disentangles motion semantics from appearance, allowing accurate transfer and faithful conditioning, and it can be combined with existing video generators via lightweight adapters. The method is demonstrated on a diverse set of motion transfer tasks, and the learned representations outperform state-of-the-art video representation models such as V-JEPA in zero-shot action classification on benchmarks including Something-Something v2 and Jester.
DisMo 是一种直接从原始视频数据中学习解纠缠运动表示的新方法,无需对象对应即可实现开放世界中的运动转移。该方法使用图像空间重构目标来创建与静态信息无关的通用运动表示。实验表明,DisMo 可以准确地在语义上不相关的实体之间转移运动,并在 Something-Something v2 和 Jester 等基准上的零样本动作分类任务中优于现有方法。
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Authors: Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos
First: 2025-06-25T15:07:16+00:00 · Latest: 2025-11-28T18:12:20+00:00
Abstract
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
中文标题/摘要
标题:不对称REINFORCE在离策 reinforcement学习中的应用:平衡正负奖励
强化学习(RL)越来越多地用于对齐大型语言模型(LLMs)。离策方法在实现简便性和数据效率方面优于就策技术,但通常会导致性能不佳。在本研究中,我们通过分析一个简单的离策REINFORCE算法来研究离策RL和监督微调之间的中间算法范围,其中优势定义为$A=r-V$,$r$为奖励,$V$为可调基线。直观上,降低$V$会强调高奖励样本,而提高$V$会更严厉地惩罚低奖励样本。我们首先对该离策REINFORCE算法进行了理论分析,表明当基线$V$下界预期奖励时,该算法享有策略改进保证。我们的分析揭示了虽然就策更新可以安全地利用正负信号,但离策更新更受益于更多关注正奖励而不是负奖励。我们通过在受控的随机臂博弈设置中进行实验验证了这些发现,并通过在推理任务上微调最先进的LLMs进行了验证。
Summary / 总结
This paper studies a simple off-policy REINFORCE algorithm for aligning LLMs, in which the advantage is defined as $A=r-V$ with reward $r$ and a tunable baseline $V$. Lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward samples more heavily. Theoretical analysis shows that when $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee, and that off-policy updates benefit from focusing more on positive rewards than on negative ones. Experiments in a controlled stochastic bandit setting and fine-tuning of state-of-the-art LLMs on reasoning tasks support these findings.
本文通过提出一种不对称的REINFORCE算法来解决离策 Reinforcement Learning (RL) 中的次优性能问题。该方法通过调整基线 $V$ 来强调高奖励样本并更严厉地惩罚低奖励样本。理论分析表明,当 $V$ 低于预期奖励时,该算法提供了策略改进的保证。实验结果在随机臂问题设置和对最先进的LLM进行推理任务的微调中证实了离策更新更多地受益于正奖励而不是负奖励。
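The role of the baseline in $A=r-V$ can be made concrete with a small numeric sketch. It assumes binary rewards and a plain REINFORCE surrogate loss averaged over a batch of off-policy samples; the function names and the example values are hypothetical and are not taken from the paper.

```python
def advantages(rewards, baseline):
    """Per-sample advantage A = r - V for a fixed, tunable baseline V."""
    return [r - baseline for r in rewards]


def reinforce_loss(rewards, logps, baseline):
    """Surrogate loss -mean((r - V) * log pi(y)) over off-policy samples.

    `logps` are log-probabilities of the sampled outputs under the current policy.
    """
    adv = advantages(rewards, baseline)
    return -sum(a * lp for a, lp in zip(adv, logps)) / len(rewards)


rewards = [1.0, 0.0]   # one successful and one failed sample
logps = [-1.0, -2.0]   # hypothetical log-probabilities under the current policy

low_v = advantages(rewards, 0.0)    # only the successful sample is pushed up
high_v = advantages(rewards, 1.0)   # only the failed sample is pushed down
mid_v = advantages(rewards, 0.5)    # both signals are used symmetrically
loss_low_v = reinforce_loss(rewards, logps, 0.0)
```

With the baseline at the bottom of the reward range, the update only raises the likelihood of successful samples, which is the end of the spectrum closest to supervised fine-tuning; with the baseline at the top, it only lowers the likelihood of failures.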
CzechLynx: A Dataset for Individual Identification and Pose Estimation of the Eurasian Lynx
Authors: Lukas Picek, Elisa Belotti, Michal Bojda, Ludek Bufka, Vojtech Cermak, Martin Dula, Rostislav Dvorak, Luboslav Hrdy, Miroslav Jirik, Vaclav Kocourek, Josefa Krausova, Jiri Labuda, Jakub Straka, Ludek Toman, Vlado Trulik, Martin Vana, Miroslav Kutal
First: 2025-06-05T12:05:43+00:00 · Latest: 2025-11-28T18:06:29+00:00
Abstract
We introduce CzechLynx, the first large-scale, open-access dataset for individual identification, pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx contains 39,760 camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons and covers 319 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: southwest Bohemia and the Western Carpathians. In addition to the real camera trap data, we provide a large complementary set of photorealistic synthetic images and a Unity-based generation pipeline with diffusion-based text-to-texture modeling, capable of producing arbitrarily large amounts of synthetic data spanning diverse environments, poses, and coat-pattern variations. To enable systematic testing across realistic ecological scenarios, we define three complementary evaluation protocols: (i) geo-aware, (ii) time-aware open-set, and (iii) time-aware closed-set, covering cross-regional and long-term monitoring settings. With the provided resources, CzechLynx offers a unique, flexible benchmark for robust evaluation of computer vision and machine learning models across realistic ecological scenarios.
中文标题/摘要
标题:CzechLynx:欧洲野猫个体识别和姿态估计数据集
我们介绍了CzechLynx,这是首个大规模、开放访问的用于欧洲野猫(Lynx lynx)个体识别、姿态估计和实例分割的数据集。CzechLynx包含39,760张相机陷阱图像,附带分割掩模、身份标签和20点骨架标注,并覆盖了15年系统监测中的319个独特个体,分布在两个地理上不同的区域:西南波希米亚和西喀尔巴阡山脉。除了真实的相机陷阱数据,我们还提供了一组大量的真实感合成图像,并提供了一个基于Unity的生成管道,使用基于扩散的文本到纹理建模,能够生成跨越多种环境、姿态和毛皮图案变化的大量合成数据。为了在现实生态场景中进行系统测试,我们定义了三个互补的评估协议:(i) 地理意识型,(ii) 时间意识型开放集,和(iii) 时间意识型封闭集,涵盖了跨区域和长期监测的设置。通过提供的资源,CzechLynx为计算机视觉和机器学习模型在现实生态场景中的稳健评估提供了独特的、灵活的基准。
Summary / 总结
CzechLynx is the first large-scale, open-access dataset for individual identification, pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). It contains 39,760 camera trap images annotated with segmentation masks, identity labels, and 20-point skeletons, covering 319 unique individuals across 15 years of monitoring in two regions: southwest Bohemia and the Western Carpathians. It also provides photorealistic synthetic images and a Unity-based generation pipeline with diffusion-based text-to-texture modeling. Three evaluation protocols (geo-aware, time-aware open-set, and time-aware closed-set) cover cross-regional and long-term monitoring settings, making the dataset a flexible benchmark for computer vision and machine learning models in realistic ecological scenarios.
CzechLynx 是一个包含 39,760 张带有分割掩码和身份标签的相机陷阱图像的大规模数据集,覆盖了 15 年内两个地区的 319 只欧亚猞猁。该数据集包括真实和合成图像,并定义了三种评估协议以在现实生态场景中测试模型。主要发现包括该数据集在长期野生动物监测中评估计算机视觉模型的实用性。
MANTA: Physics-Informed Generalized Underwater Object Tracking
Authors: Suhas Srinath, Hemang Jamadagni, Aditya Chadrasekar, Prathosh AP
Venue: WACV 2026
First: 2025-11-28T17:59:06+00:00 · Latest: 2025-11-28T17:59:06+00:00
Comments: Accepted to the IEEE/CVF WACV 2026
Abstract
Underwater object tracking is challenging due to wavelength dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.
中文标题/摘要
标题:MANTA:基于物理的通用水下物体跟踪
水下物体跟踪由于波长依赖性衰减和散射,导致深度和水体条件下的外观严重失真,极具挑战性。现有的在陆地数据上训练的跟踪器无法泛化到这些由物理驱动的退化。我们提出了MANTA,一种结合表示学习和跟踪设计的基于物理的框架,适用于水下场景。我们提出了一种结合时间一致性和Beer-Lambert增强的双重正对比学习策略,以生成对时间和水下失真都具有鲁棒性的特征。我们进一步引入了一种多阶段流水线,通过结合几何一致性和外观相似性,增强基于运动的跟踪,以实现遮挡和漂移下的再识别。为了补充标准的IoU度量,我们提出了中心-尺度一致性(CSC)和几何对齐得分(GAS)来评估几何保真度。在四个水下基准数据集(WebUOT-1M, UOT32, UTB180, UWCOT220)上的实验表明,MANTA达到了最先进的性能,成功AUC提高了最多6个百分点,同时确保了长期稳定的泛化水下跟踪和高效的运行时间。
Summary / 总结
MANTA is a physics-informed framework for underwater object tracking that addresses the challenges of wavelength-dependent attenuation and scattering. It uses a dual-positive contrastive learning strategy and a multi-stage pipeline to improve robustness to temporal and underwater distortions. Experiments on four benchmarks show that MANTA outperforms existing methods, achieving up to 6 percent improvement in Success AUC and maintaining stable long-term tracking and efficient runtime.
MANTA 是一种针对水下对象跟踪的物理导向框架,解决了波长依赖性衰减和散射带来的挑战。它采用了一种双正对比学习策略和多阶段管道,增强了时间一致性和对水下失真的鲁棒性。实验结果显示,MANTA 在四个基准数据集上的表现优于现有方法,将 Success AUC 提高了最多 6 个百分点,并确保了长期稳定的跟踪和高效的运行时间。
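The Beer-Lambert augmentation mentioned above can be illustrated with a minimal sketch of wavelength-dependent attenuation, where each colour channel decays as $I = I_0 e^{-\beta d}$ with path length $d$. The abstract does not give MANTA's actual augmentation or coefficients, so the per-channel values of `BETA` and the function name are hypothetical.

```python
import math

# Hypothetical per-channel attenuation coefficients in 1/m (R, G, B).
# Red light is absorbed fastest in water, so it gets the largest value.
BETA = (0.60, 0.10, 0.05)


def beer_lambert_pixel(rgb, depth_m, beta=BETA):
    """Attenuate one RGB pixel (values in [0, 1]) over a path of depth_m metres."""
    return tuple(value * math.exp(-b * depth_m) for value, b in zip(rgb, beta))


white = (1.0, 1.0, 1.0)
at_surface = beer_lambert_pixel(white, 0.0)   # unchanged at zero path length
at_five_m = beer_lambert_pixel(white, 5.0)    # red fades much more than blue
```

Applying the same transform with different depths to two views of a frame would give a pair of samples that share content but differ in physically plausible colour cast, which is one way such an augmentation could serve as a second positive in contrastive learning.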
LFM2 Technical Report
Authors: Alexander Amini, Anna Banaszak, Harold Benoit, Arthur Böök, Tarek Dakhran, Song Duong, Alfred Eng, Fernando Fernandes, Marc Härkönen, Anne Harrington, Ramin Hasani, Saniya Karwa, Yuri Khrustalev, Maxime Labonne, Mathias Lechner, Valentine Lechner, Simon Lee, Zetian Li, Noel Loo, Jacob Marks, Edoardo Mosca, Samuel J. Paech, Paul Pak, Rom N. Parnichkun, Alex Quach, Ryan Rogers, Daniela Rus, Nayan Saxena, Bettina Schlager, Tim Seyde, Jimmy T. H. Smith, Aditya Tadimeti, Neehal Tumma
First: 2025-11-28T17:56:35+00:00 · Latest: 2025-11-28T17:56:35+00:00
Abstract
We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.
中文标题/摘要
标题:LFM2 技术报告
我们提出了LFM2,一种为高效设备端部署和强大任务能力设计的液态基础模型家族。通过在边缘延迟和内存约束下使用硬件在环的架构搜索,我们获得了一个紧凑的混合骨干,结合了门控短卷积和少量分组查询注意块,相比同等规模的模型,可在CPU上实现2倍以上的预填充和解码速度。LFM2家族涵盖了从3.5亿到83亿参数,包括密集模型(3.5亿,7亿,12亿,26亿)和专家混合变体(总计83亿,活跃15亿),所有模型均具有32K上下文长度。LFM2的训练管道包括一个温和的、解耦的Top-K知识蒸馏目标,避免支持不匹配;难度排序的数据递进学习;以及三个阶段的后训练食谱:监督微调、长度归一化偏好优化和模型合并。在预训练10-12万亿个令牌后,LFM2模型在多种基准测试中表现出色;例如,LFM2-26亿达到IFEval 79.56%和GSM8K 82.41%。我们进一步构建了多模态和检索变体:用于视觉语言任务的LFM2-VL,用于语音的LFM2-Audio,以及用于检索的LFM2-ColBERT。LFM2-VL通过高效的视觉处理支持可调的准确率-延迟权衡,而LFM2-Audio分离了音频输入和输出路径,以实现与大3倍的模型竞争的实时语音到语音交互。LFM2-ColBERT提供了一个低延迟的查询和文档编码器,可在多种语言中实现高性能检索。所有模型均以开放权重和部署包的形式发布,支持ExecuTorch、llama.cpp和vLLM,使LFM2成为需要快速、内存高效推理和强大任务能力的边缘应用的实用基础。
Summary / 总结
LFM2 is a family of Liquid Foundation Models designed for efficient on-device deployment and strong task performance. Using hardware-in-the-loop architecture search under edge latency and memory constraints, LFM2 achieves up to 2x faster prefill and decode on CPUs compared to similarly sized models. The family spans 350M to 8.3B parameters: dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2 models demonstrate strong results across various benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. Additionally, LFM2 includes multimodal and retrieval variants (LFM2-VL, LFM2-Audio, and LFM2-ColBERT) that offer tunable accuracy-latency tradeoffs, real-time speech-to-speech interaction, and low-latency multilingual retrieval.
LFM2 是一种旨在高效部署于设备端并具备强大任务能力的液态基础模型家族。它通过硬件在环的架构搜索实现了一种紧凑的混合骨干网络,结合了门控短卷积和少量分组查询注意力块,从而在CPU上实现比同类模型快2倍的预填充和解码。LFM2 包括从350M到8.3B参数的模型,包括密集型和混合专家变体,所有模型的上下文长度均为32K。训练流程包括温度化的Top-K知识蒸馏目标、逐级学习以及三阶段后训练食谱。LFM2 模型在各种基准测试中表现出色,例如在IFEval上达到79.56%,在GSM8K上达到82.41%。此外,LFM2 还包括针对视觉语言任务、语音和检索的多模态和检索变体,具有特定优化。
Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting
Authors: Daniil Sukhorukov, Andrei Zakharov, Nikita Glazkov, Katsiaryna Yanchanka, Vladimir Kirilin, Maxim Dubovitsky, Roman Sultimov, Yuri Maksimov, Ilya Makarov
First: 2025-11-28T17:27:06+00:00 · Latest: 2025-11-28T17:27:06+00:00
Comments: 9 pages, 4 figures
Abstract
We present the Hierarchical AI-Meteorologist, an LLM-agent system that generates explainable weather reports using hierarchical forecast reasoning and weather keyword generation. Unlike standard approaches that treat forecasts as flat time series, our framework performs multi-scale reasoning across hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends. Its core reasoning agent converts structured meteorological inputs into coherent narratives while simultaneously extracting a few keywords that effectively summarize the dominant meteorological events. These keywords serve as semantic anchors for validating the consistency, temporal coherence, and factual alignment of the generated reports. Using OpenWeather and Meteostat data, we demonstrate that hierarchical context and keyword-based validation substantially improve interpretability and robustness of LLM-generated weather narratives, offering a reproducible framework for semantic evaluation of automated meteorological reporting and advancing agent-based scientific reasoning.
中文标题/摘要
标题:分层AI气象学家:用于多尺度和可解释天气预报报告的LLM-代理系统
我们提出了分层AI气象学家,这是一种使用分层预报推理和天气关键词生成的LLM-代理系统,能够生成可解释的天气报告。与将预报视为扁平时间序列的标准方法不同,我们的框架在小时、6小时和日聚合尺度上进行多尺度推理,以捕捉短期动态和长期趋势。其核心推理代理将结构化的气象输入转换为连贯的叙述,同时提取少量关键词,有效总结主要的气象事件。这些关键词作为语义锚点,用于验证生成报告的一致性、时间连贯性和事实对齐。使用OpenWeather和Meteostat数据,我们证明了分层上下文和基于关键词的验证显著提高了LLM生成的天气叙述的可解释性和稳健性,提供了一个可重复的框架,用于评估自动气象报告的语义评估,并推进基于代理的科学推理。
Summary / 总结
The Hierarchical AI-Meteorologist is an LLM-agent system that generates explainable weather reports by performing multi-scale reasoning across hourly, 6-hour, and daily aggregations. This approach improves interpretability and robustness compared to standard flat time series methods. The system uses a core reasoning agent to convert structured meteorological inputs into coherent narratives and extract keywords for semantic validation, ensuring consistency and factual alignment. Experiments with OpenWeather and Meteostat data show that hierarchical context and keyword-based validation enhance the interpretability and robustness of LLM-generated weather narratives.
Hierarchical AI-Meteorologist 是一个 LLM-代理系统,通过在小时、6 小时和日尺度上进行多层次推理来生成可解释的天气报告,这比标准的扁平时间序列方法提高了可解释性和稳健性。系统使用核心推理代理将结构化的气象输入转换为连贯的叙述,并提取关键词进行语义验证,确保一致性和事实对齐。使用 OpenWeather 和 Meteostat 数据的实验表明,多层次上下文和基于关键词的验证增强了 LLM 生成的天气叙述的可解释性和稳健性。
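The hourly, 6-hour, and daily views described above can be sketched as a simple windowed aggregation of one hourly series. The abstract does not say which statistic is used per window, so the mean used here, together with the function names and the dictionary keys, is an assumption for illustration.

```python
from statistics import mean


def aggregate(values, window):
    """Average consecutive, non-overlapping windows of `window` hourly values.

    A trailing partial window, if any, is averaged over the values it contains.
    """
    return [mean(values[i:i + window]) for i in range(0, len(values), window)]


def multi_scale(hourly):
    """Build the three views a hierarchical reasoning step could consume."""
    return {
        "hourly": list(hourly),
        "6-hour": aggregate(hourly, 6),
        "daily": aggregate(hourly, 24),
    }


# Hypothetical input: 24 hourly temperature readings, 0..23 for easy checking.
scales = multi_scale(list(range(24)))
```

Passing all three views to the reasoning agent, rather than the flat hourly series alone, is what lets a report mention both a short-lived spike and the overall daily trend.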
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Authors: Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan
First: 2025-11-28T17:26:34+00:00 · Latest: 2025-11-28T17:26:34+00:00
Comments: 19 pages, 10 figures
Abstract
Unifying multimodal understanding, generation and reconstruction representation in a single tokenizer remains a key challenge in building unified models. Previous research predominantly attempts to address this in a dual encoder paradigm, e.g., utilizing the separate encoders for understanding and generation respectively or balancing semantic representations and low-level features with contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers the first exploration in unified representation to produce Continuous semantic features for image understanding and Discrete tokens for visual generation within a unified tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, it freezes the encoder and learns a high-dimensional semantic VQ codebook with pixel reconstruction objective; then jointly optimizes the encoder with self-distillation constraints. This design enables negligible semantic information for maintaining the ability of multimodal understanding, discrete tokens that are compatible for generation and fine-grained reconstruction. Besides, we identify the intriguing property in quantizing semantic encoders that rely on high-dimensional codebook in contrast to the previous common practice of low-dimensional codebook in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE presents competitive performance on several benchmarks of visual understanding, generation and reconstruction with promising scaling property in the autoregressive paradigm for its discrete merits.
中文标题/摘要
标题:VQRAE:多模态理解、生成和重建的表示量化自编码器
在单个分词器中统一多模态理解、生成和重建的表示仍然是构建统一模型的关键挑战。先前的研究主要试图在双编码器范式中解决这一问题,例如,分别使用单独的编码器进行理解和生成,或者通过对比损失平衡语义表示和低级特征。在本文中,我们提出了一种名为VQRAE的向量量化版本的表示自编码器,这是首次探索统一表示以产生连续语义特征用于图像理解以及离散令牌用于视觉生成。具体而言,我们基于预训练的视觉基础模型构建了一个对称的ViT解码器,并采用两阶段训练策略:首先冻结编码器,使用像素重建目标学习高维语义VQ码本;然后联合优化编码器和自我蒸馏约束。这种设计使得在保持多模态理解能力的同时,能够保留兼容生成和精细重建的离散令牌。此外,我们发现语义编码器量化具有高维码本的特性,这与图像重建中常见的低维码本做法形成了有趣的对比。语义VQ码本在1536维下可以实现100%的利用率。VQRAE在视觉理解、生成和重建的多个基准测试中表现出竞争力,并且由于其离散特性,在自回归范式中具有良好的扩展性。
Summary / 总结
VQRAE is a Vector Quantization version of Representation AutoEncoders that unifies multimodal understanding, generation, and reconstruction within a single tokenizer, producing continuous semantic features for image understanding and discrete tokens for visual generation. It builds on pretrained vision foundation models with a symmetric ViT decoder and uses a two-stage training strategy: first, the encoder is frozen and a high-dimensional semantic VQ codebook is learned with a pixel reconstruction objective; then the encoder is jointly optimized under self-distillation constraints. The semantic VQ codebook reaches a 100% utilization ratio at a dimension of 1536, and the model shows competitive performance on several benchmarks of visual understanding, generation, and reconstruction, with promising scaling behavior in the autoregressive paradigm.
VQRAE 是一种向量量化版本的表示自编码器,旨在在一个统一的分词器中统一多模态的理解、生成和重建。它使用预训练的视觉基础模型和对称的 ViT 解码器,并采用两阶段训练策略来学习用于像素重建和精细重建的高维语义 VQ 代码本。该模型在语义 VQ 代码本中实现了 1536 维下的 100% 利用率,并在各种基准测试中表现出竞争力,显示出在自回归范式中的离散优势。
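The two quantities at the centre of this entry, the discrete token assigned to a feature and the codebook utilization ratio, can be sketched with a nearest-neighbour lookup. This is generic vector quantization under squared L2 distance, not VQRAE's training procedure; the tiny 2-dimensional codebook and the function names are hypothetical stand-ins for the 1536-dimensional codebook reported in the abstract.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


def utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
features = [[0.1, 0.0], [0.9, 1.2], [0.2, 0.1]]

tokens = quantize(features, codebook)        # discrete tokens for generation
ratio = utilization(tokens, len(codebook))   # the third entry is never used here
```

A utilization ratio of 1.0, as reported for the semantic codebook, would mean every entry is selected by at least one feature, so no capacity is wasted on dead codes.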
Crowdsourcing the Frontier: Advancing Hybrid Physics-ML Climate Simulation via a $50,000 Kaggle Competition
Authors: Jerry Lin, Zeyuan Hu, Tom Beucler, Katherine Frields, Hannah Christensen, Walter Hannah, Helge Heuer, Peter Ukkonnen, Laura A. Mansfield, Tian Zheng, Liran Peng, Ritwik Gupta, Pierre Gentine, Yusef Al-Naher, Mingjiang Duan, Kyo Hattori, Weiliang Ji, Chunhan Li, Kippei Matsuda, Naoki Murakami, Shlomo Ron, Marec Serlin, Hongjian Song, Yuma Tanabe, Daisuke Yamamoto, Jianyao Zhou, Mike Pritchard
First: 2025-11-26T01:32:02+00:00 · Latest: 2025-11-28T17:24:00+00:00
Comments: Main text: 29 pages, 10 figures. SI: 47 pages, 37 figures
Abstract
Subgrid machine-learning (ML) parameterizations have the potential to introduce a new generation of climate models that incorporate the effects of higher-resolution physics without incurring the prohibitive computational cost associated with more explicit physics-based simulations. However, important issues, ranging from online instability to inconsistent online performance, have limited their operational use for long-term climate projections. To more rapidly drive progress in solving these issues, domain scientists and machine learning researchers opened up the offline aspect of this problem to the broader machine learning and data science community with the release of ClimSim, a NeurIPS Datasets and Benchmarks publication, and an associated Kaggle competition. This paper reports on the downstream results of the Kaggle competition by coupling emulators inspired by the winning teams' architectures to an interactive climate model (including full cloud microphysics, a regime historically prone to online instability) and systematically evaluating their online performance. Our results demonstrate that online stability in the low-resolution, real-geography setting is reproducible across multiple diverse architectures, which we consider a key milestone. All tested architectures exhibit strikingly similar offline and online biases, though their responses to architecture-agnostic design choices (e.g., expanding the list of input variables) can differ significantly. Multiple Kaggle-inspired architectures achieve state-of-the-art (SOTA) results on certain metrics such as zonal mean bias patterns and global RMSE, indicating that crowdsourcing the essence of the offline problem is one path to improving online performance in hybrid physics-AI climate simulation.
中文标题/摘要
标题:众包前沿:通过5万美元Kaggle竞赛推进混合物理-机器学习气候模拟
亚网格机器学习(ML)参数化有可能引入新一代气候模型,这些模型可以包含更高分辨率物理效应的影响,而不必承担更显式物理模拟相关的高昂计算成本。然而,从在线不稳定到在线性能不一致等一系列重要问题限制了它们在长期气候预测中的操作使用。为了更快地解决这些问题,领域科学家和机器学习研究人员通过发布ClimSim(NeurIPS数据集和基准发布之一)及其相关Kaggle竞赛,将此问题的离线部分向更广泛的机器学习和数据科学社区开放。本文报告了Kaggle竞赛的下游结果,通过将受获胜团队架构启发的模拟器与一个包含完整云微物理(历史上易发生在线不稳定)的交互式气候模型耦合,并系统评估其在线性能。我们的结果表明,在低分辨率、真实地理设置中实现的在线稳定性可以在多种不同的架构中重现,我们认为这是一个关键里程碑。所有测试的架构在离线和在线偏差方面表现出惊人的相似性,尽管它们对架构无关设计选择(例如,扩展输入变量列表)的响应可以有显著差异。多个Kaggle启发的架构在某些指标(如经向平均偏差模式和全球均方根误差)上达到最先进的(SOTA)结果,表明众包离线问题的本质是提高混合物理-人工智能气候模拟在线性能的一种途径。
Summary / 总结
This paper reports on the downstream results of a Kaggle competition aimed at advancing hybrid physics-ML climate simulation. The motivation is that issues such as online instability and inconsistent online performance have limited the operational use of subgrid ML parameterizations. The authors couple emulators inspired by the winning teams' architectures to an interactive climate model with full cloud microphysics and systematically evaluate their online performance. Key findings include online stability in the low-resolution, real-geography setting that is reproducible across diverse architectures, strikingly similar offline and online biases across architectures, and state-of-the-art results for several architectures on metrics such as zonal mean bias patterns and global RMSE, indicating that crowdsourcing the offline problem is one path to improving online performance in hybrid simulation.
该论文旨在通过将机器学习(ML)整合到气候模型中来提高模型的准确性,同时避免显式物理模拟的高计算成本。研究通过一项Kaggle竞赛推进了ML参数化在气候模型中的应用。通过将获胜团队的模型与交互式气候模型结合,评估其在线性能。主要发现包括在多种架构下重现在线稳定性,并在特定指标上取得最先进的结果,表明众包可以改善混合物理-人工智能气候模拟的在线性能。
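The two metrics named in the abstract above, zonal mean bias and global RMSE, can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so the grid layout (`grid[lat][lon]`), the cosine-latitude area weighting, and the function names below are assumptions made for illustration only.

```python
import math

def zonal_mean_bias(pred, truth):
    """Mean of (pred - truth) over longitude, one value per latitude row.

    Assumes both inputs are nested lists indexed as grid[lat][lon].
    """
    return [
        sum(p - t for p, t in zip(pred_row, truth_row)) / len(pred_row)
        for pred_row, truth_row in zip(pred, truth)
    ]

def global_rmse(pred, truth, lats_deg):
    """Area-weighted RMSE with cos(latitude) weights (an assumed convention)."""
    num = 0.0
    den = 0.0
    for pred_row, truth_row, lat in zip(pred, truth, lats_deg):
        w = math.cos(math.radians(lat))
        for p, t in zip(pred_row, truth_row):
            num += w * (p - t) ** 2
            den += w
    return math.sqrt(num / den)

# Toy 2 x 2 grid: the equatorial row is biased by +1, the 60-degree row is exact.
pred = [[1.0, 3.0], [2.0, 2.0]]
truth = [[0.0, 2.0], [2.0, 2.0]]
lats = [0.0, 60.0]
bias = zonal_mean_bias(pred, truth)
rmse = global_rmse(pred, truth, lats)
```

The cosine weighting reflects that grid cells near the poles cover less area than cells near the equator; an unweighted RMSE would overstate high-latitude errors on a regular latitude-longitude grid.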
DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline
Authors: Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng
First: 2025-11-28T17:22:07+00:00 · Latest: 2025-11-28T17:22:07+00:00
Comments: 13 pages, 12 figures
Abstract
Diffusion-based image editing has made semantic-level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present the Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large-scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML research. The dataset can be accessed via https://github.com/ymhzyj/DEAL-300K.
中文标题/摘要
标题:DEAL-300K:基于扩散的编辑区域定位数据集,附30万规模数据集和频率提示基线
基于扩散的图像编辑使普通用户能够轻松进行语义级别的图像操作,但也使得现实的局部伪造难以定位。现有基准主要集中在生成图像的二元检测或手动编辑区域的定位,但未能反映基于扩散编辑的特性,这些编辑通常会平滑地融入原始内容。我们提出了基于扩散的图像编辑区域定位数据集(DEAL-300K),这是一个用于基于扩散的图像操作定位(DIML)的大规模数据集,包含超过30万张标注图像。我们通过使用多模态大型语言模型生成编辑指令,使用无掩码扩散编辑器生成处理后的图像,并使用主动学习变化检测流水线获得像素级标注。在此数据集基础上,我们提出了一种定位框架,该框架结合冻结的视觉基础模型(VFM)和多频率提示调优(MFPT),以捕捉编辑区域的语义和频域线索。在DEAL-300K上训练,我们的方法在我们的测试分割上达到像素级F1分数82.56%,在外部CoCoGlide基准上达到80.97%,为未来的DIML研究提供了强大的基线和实用的基础。数据集可通过https://github.com/ymhzyj/DEAL-300K访问。
Summary / 总结
The research aims to address the challenge of localizing realistic local forgeries generated by diffusion-based image editing, which are difficult to detect. The authors developed DEAL-300K, a large-scale dataset with over 300,000 annotated images, using a multi-modal LLM for editing instructions, a mask-free diffusion editor, and an active-learning pipeline for pixel-level annotations. They also proposed a localization framework using a frozen Visual Foundation Model and Multi Frequency Prompt Tuning to capture both semantic and frequency-domain cues. The method achieved a pixel-level F1 score of 82.56% on the test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines for future research in diffusion-based image editing area localization.
研究旨在解决在基于扩散的图像编辑中定位现实局部伪造的挑战。作者开发了包含超过300,000个注释图像的DEAL-300K数据集,使用多模态大语言模型生成指令、无掩码的扩散编辑器和主动学习的变更检测管道进行像素级标注。他们还提出了一种结合冻结视觉基础模型和多频率提示调优的定位框架,其在测试分割上的像素级F1分数为82.56%,在外部CoCoGlide基准上的分数为80.97%。
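The DEAL-300K results above are reported as pixel-level F1. As a minimal sketch of that metric for one predicted mask against one ground-truth mask: the nested-list mask format and the function name are assumptions, and the abstract does not say whether the authors average per image or pool pixels over the whole test set.

```python
def pixel_f1(pred, target):
    """Pixel-level F1 between two binary masks given as nested lists of 0/1.

    Returns 0.0 when there are no true positives (an assumed convention).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # edited pixel correctly flagged
            elif p and not t:
                fp += 1          # clean pixel wrongly flagged
            elif t and not p:
                fn += 1          # edited pixel missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred_mask = [[1, 1, 0], [0, 1, 0]]
true_mask = [[1, 0, 0], [0, 1, 1]]
score = pixel_f1(pred_mask, true_mask)
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.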
SimScale: Learning to Drive via Real-World Simulation at Scale
Authors: Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li
First: 2025-11-28T17:17:38+00:00 · Latest: 2025-11-28T17:17:38+00:00
Comments: Project page: https://opendrivelab.com/SimScale
Abstract
Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in the real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code will be released.
中文标题/摘要
标题:SimScale:通过大规模现实世界模拟学习自动驾驶
实现完全自动驾驶系统需要在广泛的情景中学习合理的决策,包括安全关键和分布外的情况。然而,由人类专家收集的真实世界语料库中这些情况的代表性不足。为了补充数据多样性不足,我们引入了一种新颖且可扩展的模拟框架,能够基于现有的驾驶日志合成大量未见过的状态。我们的流水线利用先进的神经渲染和反应环境生成由扰动的自我轨迹控制的高保真多视角观察。此外,我们为这些新模拟状态开发了一种伪专家轨迹生成机制,以提供动作监督。在合成数据上,我们发现,对真实世界和模拟样本进行简单的联合训练策略可以在各种规划方法在具有挑战性的现实世界基准测试中的鲁棒性和泛化能力上取得显著提高,最高可提高6.8 EPDMS在navhard和2.9在navtest。更重要的是,通过增加模拟数据,这种策略可以平滑地扩展,即使没有额外的真实世界数据流。我们还揭示了这种模拟-现实学习系统SimScale的一些关键发现,包括伪专家的设计和不同策略架构的扩展特性。我们的模拟数据和代码将被发布。
Summary / 总结
The research aims to enhance autonomous driving systems by leveraging a scalable simulation framework to generate diverse scenarios, addressing the lack of data diversity in real-world datasets. The method involves using neural rendering and a reactive environment to create high-fidelity multi-view observations and pseudo-expert trajectories for action supervision. Key findings show that co-training with simulated and real-world data improves robustness and generalization, with up to 6.8 EPDMS improvement on navhard and 2.9 on navtest benchmarks. The system's performance scales smoothly with increased simulation data without additional real-world data. Crucial insights into the design of pseudo-experts and scaling properties for different policy architectures are also provided.
研究旨在解决现实驾驶场景中数据多样性不足的问题,这对于开发稳健的自动驾驶系统至关重要。方法是利用一个新颖的模拟框架,通过神经渲染和反应环境从现有日志中生成大规模未见过的驾驶状态。关键发现表明,使用真实和模拟数据进行联合训练可以提高规划方法的稳健性和泛化能力,分别在navhard和navtest上提高了6.8 EPDMS和2.9。该系统称为SimScale,随着模拟数据的增加,其性能可以平滑提升,无需额外的现实世界数据,并包括关于伪专家设计和不同策略架构扩展性的见解。
Distributed Dynamic Associative Memory via Online Convex Optimization
Authors: Bowen Wang, Matteo Zecchin, Osvaldo Simeone
First: 2025-11-28T16:56:18+00:00 · Latest: 2025-11-28T16:56:18+00:00
Abstract
An associative memory (AM) enables cue-response recall, and it has recently been recognized as a key mechanism underlying modern neural architectures such as Transformers. In this work, we introduce the concept of distributed dynamic associative memory (DDAM), which extends classical AM to settings with multiple agents and time-varying data streams. In DDAM, each agent maintains a local AM that must not only store its own associations but also selectively memorize information from other agents based on a specified interest matrix. To address this problem, we propose a novel tree-based distributed online gradient descent algorithm, termed DDAM-TOGD, which enables each agent to update its memory on the fly via inter-agent communication over designated routing trees. We derive rigorous performance guarantees for DDAM-TOGD, proving sublinear static regret in stationary environments and a path-length dependent dynamic regret bound in non-stationary environments. These theoretical results provide insights into how communication delays and network structure impact performance. Building on the regret analysis, we further introduce a combinatorial tree design strategy that optimizes the routing trees to minimize communication delays, thereby improving regret bounds. Numerical experiments demonstrate that the proposed DDAM-TOGD framework achieves superior accuracy and robustness compared to representative online learning baselines such as consensus-based distributed optimization, confirming the benefits of the proposed approach in dynamic, distributed environments.
中文标题/摘要
标题:基于在线凸优化的分布式动态关联记忆
关联记忆(AM)能够实现线索-响应回忆,最近被认作现代神经架构如变换器的关键机制之一。本文中,我们引入了分布式动态关联记忆(DDAM)的概念,将经典AM扩展到多代理和时间变化的数据流环境中。在DDAM中,每个代理维护一个本地AM,不仅需要存储自身的关联,还需要根据指定的兴趣矩阵有选择地记忆其他代理的信息。为了解决这一问题,我们提出了一种基于树的分布式在线梯度下降算法,称为DDAM-TOGD,该算法允许每个代理通过指定的路由树进行代理间通信,实时更新其记忆。我们为DDAM-TOGD推导了严格性能保证,证明在稳定环境中具有亚线性静态遗憾,在非稳定环境中具有路径长度依赖的动态遗憾界。这些理论结果提供了通信延迟和网络结构如何影响性能的见解。基于遗憾分析,我们进一步引入了一种组合树设计策略,优化路由树以最小化通信延迟,从而改善遗憾界。数值实验表明,提出的DDAM-TOGD框架在动态分布式环境中相比基于共识的分布式优化等代表性在线学习基准具有更高的准确性和鲁棒性,证实了所提方法的优势。
Summary / 总结
This work introduces distributed dynamic associative memory (DDAM) to enable multiple agents to maintain and update their memories based on time-varying data streams and interest matrices. The authors propose DDAM-TOGD, a tree-based distributed online gradient descent algorithm that allows agents to update their memories via inter-agent communication. Theoretical analysis shows sublinear static regret in stationary environments and dynamic regret bounds dependent on path length in non-stationary environments. Numerical experiments confirm the superior accuracy and robustness of DDAM-TOGD compared to other online learning methods.
本文提出了分布式动态关联记忆(DDAM),将其扩展到具有时间变化数据的多代理系统中。它提出了一种基于树的分布式在线梯度下降算法DDAM-TOGD,允许代理通过代理间通信更新其记忆。理论分析显示,在稳定环境中具有亚线性静态遗憾,在非稳定环境中动态遗憾依赖于路径长度。数值实验确认了该方法在动态分布式环境中的优越准确性和鲁棒性,优于基于共识的分布式优化方法。
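The DDAM-TOGD algorithm above builds on online gradient descent. A minimal single-agent sketch of that building block follows, assuming a linear associative memory `W` trained with a squared recall loss. The routing trees, communication delays, and interest matrix from the abstract are deliberately not modelled, and all names are hypothetical.

```python
def recall(W, key):
    """Linear recall: the response stored for `key` is the product W @ key."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

def ogd_step(W, key, value, lr):
    """One online gradient step on the loss 0.5 * ||W @ key - value||^2."""
    err = [p - v for p, v in zip(recall(W, key), value)]
    # The gradient with respect to W is the outer product err * key^T.
    return [
        [w - lr * e * k for w, k in zip(row, key)]
        for row, e in zip(W, err)
    ]

# A stream of cue-response pairs arrives one at a time; the memory is
# updated on the fly after each pair (two orthonormal keys, repeated).
W = [[0.0, 0.0], [0.0, 0.0]]
stream = [([1.0, 0.0], [2.0, -1.0]), ([0.0, 1.0], [0.5, 3.0])] * 50
for key, value in stream:
    W = ogd_step(W, key, value, lr=0.5)

recalled = recall(W, [1.0, 0.0])
```

With orthonormal keys and `lr=0.5`, each visit to a key halves its recall error, so after 50 passes the recalled response is numerically indistinguishable from the stored one.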
SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance
Authors: Yanshu Wang, Xichen Xu, Xiaoning Lei, Guoyang Xie
First: 2025-08-05T06:43:01+00:00 · Latest: 2025-11-28T16:45:52+00:00
Comments: Accepted by The 2025 International Conference on Machine Intelligence and Nature-InspireD Computing (MIND)
Abstract
Synthesizing realistic and spatially precise anomalies is essential for enhancing the robustness of industrial anomaly detection systems. While recent diffusion-based methods have demonstrated strong capabilities in modeling complex defect patterns, they often struggle with spatial controllability and fail to maintain fine-grained regional fidelity. To overcome these limitations, we propose SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance), a novel diffusion-based framework specifically designed for anomaly generation. Our approach introduces a Region-Constrained Diffusion (RCD) process that preserves the background by freezing it and selectively updating only the foreground anomaly regions during the reverse denoising phase, thereby effectively reducing background artifacts. Additionally, we incorporate a Discriminative Mask Guidance (DMG) module into the discriminator, enabling joint evaluation of both global realism and local anomaly fidelity, guided by pixel-level masks. Extensive experiments on the MVTec-AD and BTAD datasets show that SARD surpasses existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.
中文标题/摘要
标题:SARD:基于区域约束扩散与判别掩码引导的分割感知异常合成
生成逼真且空间精确的异常对于增强工业异常检测系统的鲁棒性至关重要。虽然基于扩散的方法在建模复杂缺陷模式方面表现出强大的能力,但它们往往难以控制空间,并且无法保持细粒度的区域保真度。为克服这些限制,我们提出了SARD(基于分割感知的区域约束扩散与判别掩码引导异常合成),这是一种专门设计用于异常生成的新型基于扩散的框架。我们的方法引入了一种区域约束扩散(RCD)过程,该过程通过冻结背景并在反向去噪阶段仅选择性地更新前景异常区域来保留背景,从而有效减少背景伪影。此外,我们还将判别掩码引导(DMG)模块集成到判别器中,使其能够通过像素级掩码联合评估全局逼真度和局部异常保真度。在MVTec-AD和BTAD数据集上的大量实验表明,SARD在分割准确性和视觉质量方面均优于现有方法,为像素级异常合成设定了新的最先进的水平。
Summary / 总结
SARD is a novel diffusion-based framework designed to synthesize realistic and spatially precise anomalies for enhancing industrial anomaly detection systems. It introduces a Region-Constrained Diffusion (RCD) process that preserves the background and selectively updates only the foreground anomaly regions, and a Discriminative Mask Guidance (DMG) module that guides the evaluation of both global realism and local anomaly fidelity. Experiments on MVTec-AD and BTAD datasets demonstrate that SARD outperforms existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.
SARD 是一种新颖的扩散基础框架,旨在合成现实且空间精确的异常,以增强工业异常检测系统。它引入了区域约束扩散(RCD)过程,保留背景并仅在前景异常区域进行选择性更新,从而减少背景伪影。此外,它还包含了一个判别性掩码引导(DMG)模块,用于评估全局真实性和局部异常保真度。在 MVTec-AD 和 BTAD 数据集上的实验表明,SARD 在分割准确性和视觉质量方面优于现有方法,为像素级异常合成设定了新的前沿。
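The SARD abstract above says the background is frozen and only foreground anomaly regions are updated during reverse denoising. One generic way to express that is mask blending after each denoising step. The sketch below is an assumed illustration of the general technique, not the authors' implementation; the function name and the nested-list image format are hypothetical.

```python
def region_constrained_update(current, proposal, mask):
    """Blend one denoising step under a binary foreground mask.

    Pixels where mask == 1 take the newly denoised `proposal` value;
    pixels where mask == 0 keep the frozen `current` (background) value.
    """
    return [
        [p if m else c for c, p, m in zip(c_row, p_row, m_row)]
        for c_row, p_row, m_row in zip(current, proposal, mask)
    ]

# Toy 2 x 2 "image": only the top-right pixel lies in the anomaly region.
background = [[0.1, 0.2], [0.3, 0.4]]
denoised = [[0.9, 0.8], [0.7, 0.6]]
anomaly_mask = [[0, 1], [0, 0]]
blended = region_constrained_update(background, denoised, anomaly_mask)
```

Applying this blend at every reverse step keeps background pixels unchanged no matter what the denoiser proposes for them, which is the property the abstract credits with reducing background artifacts.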
Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach
Authors: Shuqi Liu, Han Wu, Guanzhi Deng, Jianshu Chen, Xiaoyang Wang, Linqi Song
First: 2025-11-28T16:43:46+00:00 · Latest: 2025-11-28T16:43:46+00:00
Abstract
Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.
中文标题/摘要
标题:通过结构化知识发现方法提高语言模型生成的可解释性
知识增强的文本生成旨在通过利用内部或外部知识源来提高生成文本的质量。虽然语言模型在生成连贯和流畅的文本方面表现出色,但缺乏可解释性构成了一个重大障碍。生成文本的有限可解释性严重影响了其实际应用价值,特别是在需要可靠性和可解释性的知识增强文本生成任务中。现有方法通常使用针对特定数据特征定制的知识检索器,这限制了它们对多种数据类型和任务的通用性。为克服这一限制,我们直接利用结构化知识的两层架构,包括高层实体和低层知识三元组,来设计我们的任务无关结构化知识猎人。具体而言,我们采用局部-全局交互方案进行结构化知识表示学习,并使用分层变压器指针网络作为选择相关知识三元组和实体的基础。通过结合语言模型的强大生成能力和知识猎人的高忠实度,我们的模型实现了高可解释性,使用户能够理解模型输出生成过程。此外,我们在内部知识增强的表格到文本生成(RotoWireFG数据集)和外部知识增强的对话响应生成(KdConv数据集)中实证证明了我们模型的有效性。我们的任务无关模型优于最先进的方法和相应的语言模型,为基准设立了新的标准。
Summary / 总结
This paper addresses the interpretability issue in language model generation by proposing a structured knowledge discovery approach. The method uses a two-tier architecture of structured knowledge, combining a local-global interaction scheme and a hierarchical transformer-based pointer network to select relevant knowledge triples and entities. The model integrates the strong generative ability of language models with high faithfulness from the knowledge hunter, achieving high interpretability. Experiments on RotoWireFG and KdConv datasets show that the proposed model outperforms existing methods and language models, enhancing the reliability and explainability of generated text.
该论文通过提出一种结构化知识发现方法来解决语言模型生成中的可解释性问题。该方法利用两层结构化知识架构和基于层次变换器的指针网络来增强生成文本的可解释性。实验结果表明,所提出模型在内部和外部知识增强任务中均优于现有方法,提供了更高的可解释性和可靠性。
FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Authors: Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
First: 2025-09-24T16:28:15+00:00 · Latest: 2025-11-28T16:42:48+00:00
Abstract
Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
中文标题/摘要
标题:FAST:面向分割的自适应采样加速扩散与前景感知重构模块异常合成框架
工业异常分割严重依赖于像素级注释,但实际中的异常往往稀缺、多样且标注成本高昂。分割导向的工业异常合成(SIAS)已成为一种有前景的替代方案;然而,现有方法难以在采样效率和生成质量之间取得平衡。此外,大多数方法对所有空间区域处理一致,忽视了异常区域与背景区域之间的统计差异。这种一致处理阻碍了为分割任务定制的、结构特定的异常合成。本文提出了一种前景感知扩散框架FAST,包含两个新型模块:异常导向加速采样(AIAS)和前景感知重构模块(FARM)。AIAS 是一种无需训练的采样算法,专门针对分割导向的工业异常合成,通过从粗到细聚合加速逆过程,并能在仅10步内合成最先进的分割导向异常。同时,FARM 在每次采样步骤中自适应调整掩码前景区域内的异常感知噪声,确保在去噪轨迹中保留局部异常信号。在多个工业基准上的广泛实验表明,FAST 在下游分割任务中始终优于现有异常合成方法。我们已将代码发布在:https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis。
Summary / 总结
FAST is a foreground-aware diffusion framework designed to improve the efficiency and quality of segmentation-oriented industrial anomaly synthesis. It introduces two novel modules: AIAS, which accelerates the sampling process through coarse-to-fine aggregation, and FARM, which adjusts noise in masked foreground regions to preserve localized anomaly signals. Experiments show that FAST outperforms existing methods in segmentation tasks across multiple industrial benchmarks.
FAST 是一种前景感知的扩散框架,旨在提高工业异常分割合成的效率和质量。它引入了两个关键模块:Anomaly-Informed Accelerated Sampling (AIAS) 和 Foreground-Aware Reconstruction Module (FARM)。AIAS 通过从粗到细的信息聚合加速合成过程,而 FARM 在每个采样步骤中适应性地调整前景区域内的噪声。实验表明,FAST 在多个工业基准上的分割任务中优于现有方法。
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation
Authors: Yu Zhang, Jingyi Liu, Yiwei Shi, Qi Zhang, Duoqian Miao, Changwei Wang, Longbing Cao
First: 2025-11-28T16:42:18+00:00 · Latest: 2025-11-28T16:42:18+00:00
Abstract
Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 × 256) and decreases peak memory consumption by 83.8% (1024 × 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.
中文标题/摘要
标题:马尔可夫尺度预测:视觉自回归生成的新时代
基于下一尺度预测的视觉自回归建模(VAR)已重新激发了视觉生成的自回归建模。尽管其全上下文依赖性,即为下一尺度预测建模所有先前尺度,通过利用完整的信息流促进了更稳定和全面的表示学习,但由此产生的计算效率低下和大量开销严重阻碍了VAR的实际应用和扩展性。这促使我们开发一种具有更好性能和效率的新VAR模型,而无需全上下文依赖性。为了解决这一问题,我们将VAR重新表述为非全上下文马尔可夫过程,提出了Markov-VAR。这通过马尔可夫尺度预测实现:我们将每个尺度视为一个马尔可夫状态,并引入一个滑动窗口,将某些先前尺度压缩成一个紧凑的历史向量,以弥补由于非全上下文依赖性而导致的历史信息损失。将历史向量与马尔可夫状态结合,产生一个代表性的动态状态,在马尔可夫过程中演变。大量实验表明,Markov-VAR 极其简单且效果显著:与 ImageNet 上的 VAR 相比,Markov-VAR 在 256×256 的情况下将 FID 降低了 10.5%,在 1024×1024 的情况下将峰值内存消耗降低了 83.8%。我们相信,Markov-VAR 可以作为未来视觉自回归生成和其他下游任务研究的基础。
Summary / 总结
This paper addresses the computational inefficiency of full-context dependency in Visual AutoRegressive (VAR) models by proposing Markov-VAR, a new model that reformulates VAR as a non-full-context Markov process. Markov-VAR uses a sliding window to compress historical information into a compact history vector, which is integrated with the current scale to form a dynamic state. Experiments show that Markov-VAR significantly improves efficiency while maintaining high performance, reducing FID by 10.5% and decreasing peak memory consumption by 83.8%.
本文通过提出Markov-VAR,将Visual AutoRegressive (VAR)模型重新表述为非全上下文马尔可夫过程,以解决VAR模型的计算效率问题。Markov-VAR使用滑动窗口将历史信息压缩成紧凑的历史向量,实现了性能和效率的平衡。实验表明,与VAR在ImageNet上的表现相比,Markov-VAR将FID降低了10.5%,并且峰值内存消耗减少了83.8%。
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
First: 2025-11-28T16:40:08+00:00 · Latest: 2025-11-28T16:40:08+00:00
Comments: Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg
Abstract
Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.
中文标题/摘要
标题:UniGeoSeg:面向地理空间场景的统一开放世界分割
遥感指令驱动分割通过指导生成掩膜,具有广泛适用性和可访问性的巨大潜力。然而,现有方法因任务表述碎片化和指导数据有限,限制了有效理解和泛化。为解决这些问题,我们引入了GeoSeg-1M,这是首个百万规模的遥感指令驱动分割数据集,通过自动掩膜过滤和指令生成管道,从多个公开数据集中合成引用、交互和推理分割指令。GeoSeg-1M 包含59万张图像、117个类别和110万张图像-掩膜-指令三元组。在此基础上,我们进一步构建了GeoSeg-Bench,一个具有挑战性的基准,旨在评估在多样化的指令驱动任务和复杂地理空间场景中的上下文理解和推理能力。此外,我们提出了UniGeoSeg,一个统一框架,作为强基线,结合任务感知文本增强、潜在知识记忆和渐进式训练策略,促进多任务学习。广泛实验表明,UniGeoSeg 在GeoSeg-Bench 和多种公开基准上的性能达到最新水平,同时表现出强大的零样本泛化能力。数据集和源代码可在 https://github.com/MiliLab/UniGeoSeg 发布。
Summary / 总结
The research aims to improve instruction-driven segmentation in remote sensing by addressing fragmented task formulations and limited instruction data. To achieve this, the authors introduce GeoSeg-1M, a million-scale dataset for remote sensing instruction-driven segmentation, and GeoSeg-Bench, a challenging benchmark for evaluating contextual understanding and reasoning. They also propose UniGeoSeg, a unified framework that includes task-aware text enhancement, latent knowledge memory, and a progressive training strategy. Experiments show that UniGeoSeg outperforms existing methods and exhibits strong zero-shot generalization capabilities.
研究旨在通过解决任务碎片化和指令数据有限的问题,提高遥感中的指令驱动分割。作者引入了GeoSeg-1M,这是一个用于遥感指令驱动分割的百万级数据集,以及GeoSeg-Bench,这是一个具有挑战性的基准,用于评估上下文理解和推理能力。他们提出了UniGeoSeg,一个统一框架,增强了文本处理,记忆了潜在知识,并采用了渐进式训练策略,展示了在基准测试中的先进性能和强大的零样本泛化能力。
Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models
Authors: Jucheng Shen, Yeonju Ro
Venue: NeurIPS 2025
First: 2025-11-03T21:30:03+00:00 · Latest: 2025-11-28T16:36:54+00:00
Comments: 7 pages, NeurIPS 2025 Efficient Reasoning Workshop
Abstract
Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
中文标题/摘要
标题:超越静态阈值:用于扩散语言模型的一次性动态阈值
掩蔽扩散语言模型(MDLMs)在与自回归模型竞争的同时,通常以固定步长和顺序解码。为了加速解码,最近的工作如Fast-dLLM通过静态全局置信度阈值实现并行解码,但我们观察到强烈的块内和步长内置信度波动,并且在数据集中,输入之间的置信度轨迹通过余弦相似度测量几乎相同。受这些观察的启发,我们引入了一次性动态阈值(OSDT),它在单个序列上校准阈值并在后续输入中应用,几乎没有额外开销。在GPQA、GSM8K和HumanEval上,OSDT实现了更优的准确率-吞吐量权衡(在GSM8K上最佳准确率时每秒多24%的令牌,在GPQA上与相似准确率相比多45%,在HumanEval上与适度准确率差距相比多50%)。除了这些结果,我们的发现表明,可以利用可重用的任务级置信度签名来为扩散解码的更通用算法和系统创新提供更多机会。
Summary / 总结
The paper addresses the limitations of static confidence thresholds in masked diffusion language models (MDLMs) by proposing One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with minimal overhead. OSDT improves accuracy-throughput trade-offs on GPQA, GSM8K, and HumanEval, achieving up to 24% more tokens per second on GSM8K, 45% on GPQA, and 50% on HumanEval with only a modest accuracy gap. This method suggests potential for broader applications in confidence signature reuse for diffusion decoding advancements.
论文针对masked diffusion语言模型(MDLM)中静态置信阈值的局限性,提出了One-Shot动态阈值(OSDT)方法。OSDT在单个序列上校准阈值并应用于后续输入,减少解码时间。在GPQA、GSM8K和HumanEval上的实验表明,OSDT提高了准确率-吞吐量的权衡,实现了在GSM8K和HumanEval上最多50%的吞吐量提升,同时保持或仅略有准确率差距。
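The OSDT abstract above describes calibrating thresholds on a single sequence and reusing them for later inputs. The sketch below shows one way that could look. The abstract does not state the actual calibration rule, so the per-step quantile rule, the fallback that always unmasks the most confident token, and all names here are assumptions.

```python
def calibrate_thresholds(calib_conf, quantile=0.5):
    """Derive one threshold per decoding step from a single calibration run.

    calib_conf[step] is the list of token confidences observed at that step.
    The quantile rule is an illustrative assumption, not the paper's rule.
    """
    thresholds = []
    for step_conf in calib_conf:
        ordered = sorted(step_conf)
        idx = min(int(quantile * len(ordered)), len(ordered) - 1)
        thresholds.append(ordered[idx])
    return thresholds

def select_tokens(step, confidences, thresholds):
    """Indices to unmask in parallel at `step` for a new input."""
    t = thresholds[min(step, len(thresholds) - 1)]
    chosen = [i for i, c in enumerate(confidences) if c >= t]
    if not chosen:
        # Always make progress: fall back to the single most confident token.
        chosen = [max(range(len(confidences)), key=confidences.__getitem__)]
    return chosen

# Calibrate once, then reuse the thresholds on later inputs.
calibration_run = [[0.9, 0.5, 0.7], [0.99, 0.95, 0.8]]
thresholds = calibrate_thresholds(calibration_run)
step0 = select_tokens(0, [0.6, 0.8, 0.75], thresholds)
step1 = select_tokens(1, [0.6, 0.8, 0.75], thresholds)
```

Reusing calibrated thresholds only makes sense because of the abstract's observation that confidence trajectories within a dataset are nearly identical across inputs; the lookup itself adds almost no cost per step.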
Predicting Market Trends with Enhanced Technical Indicator Integration and Classification Models
Authors: Abdelatif Hafid, Abderazzak Mouiha, Linglong Kong, Mohamed Rahouti, Maad Ebrahim, Mohamed Adel Serhani, Mohammed Aledhari
First: 2024-10-09T14:29:50+00:00 · Latest: 2025-11-28T16:23:30+00:00
Comments: 12 pages, 8 figures, and 6 tables
Abstract
Thanks to the high potential for profit, trading has become increasingly attractive to investors as the cryptocurrency and stock markets rapidly expand. However, because financial markets are intricate and dynamic, accurately predicting prices remains a significant challenge. The volatile nature of the cryptocurrency market makes it even harder for traders and investors to make decisions. This study presents a classification-based machine learning model to forecast the direction of the cryptocurrency market, i.e., whether prices will increase or decrease. The model is trained using historical data and important technical indicators such as the Moving Average Convergence Divergence, the Relative Strength Index, and the Bollinger Bands. We illustrate our approach with an empirical study of the closing price of Bitcoin. Several simulations, including a confusion matrix and Receiver Operating Characteristic curve, are used to assess the model's performance, and the results show a buy/sell signal accuracy of over 92%. These findings demonstrate how machine learning models can assist investors and traders of cryptocurrencies in making informed decisions in a very volatile market.
中文标题/摘要
标题:利用增强技术指标集成和分类模型预测市场趋势
由于利润潜力巨大,随着加密货币和股票市场的迅速扩张,交易对投资者来说越来越具有吸引力。然而,由于金融市场复杂且动态,准确预测价格仍然是一个重大挑战。加密货币市场的波动性使得交易者和投资者在做决策时更加困难。本研究提出了一种基于分类的机器学习模型,用于预测加密货币市场的走势,即价格是上涨还是下跌。该模型使用历史数据和重要的技术指标,如移动平均收敛发散、相对强弱指数和布林带进行训练。我们通过对比特币收盘价的实证研究来说明我们的方法。通过混淆矩阵和受试者操作特征曲线等多种模拟,评估模型的性能,结果显示买入/卖出信号的准确率超过92%。这些发现表明,机器学习模型可以帮助加密货币的投资者和交易者在非常波动的市场中做出明智/知情的决策。
Summary / 总结
This study aims to predict the direction of the cryptocurrency market, particularly for Bitcoin, using a classification-based machine learning model that integrates technical indicators like Moving Average Convergence Divergence, Relative Strength Index, and Bollinger Bands. The model's performance is evaluated through simulations, including a confusion matrix and Receiver Operating Characteristic curve, achieving a buy/sell signal accuracy of over 92%. This indicates that machine learning models can effectively assist investors and traders in making informed decisions in volatile markets.
本研究旨在利用历史数据和技术指标(如移动平均收敛发散、相对强弱指数和布林带)构建分类机器学习模型,预测加密货币市场的走势。通过混淆矩阵和接收者操作特征曲线等模拟评估模型性能,结果显示买/卖信号准确率超过92%。这表明机器学习模型可以帮助投资者和交易者在波动市场中做出明智的决策。
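The three technical indicators named in the abstract above have standard textbook definitions, sketched below with the usual default periods. The abstract does not say which variants or parameters the authors use, so the simple-average RSI (rather than Wilder smoothing), the EMA seeded with the first price, and the function names are assumptions.

```python
from statistics import mean, pstdev

def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with values[0]."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA) and its signal line (EMA of the MACD line)."""
    line = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    return line, ema(line, signal)

def rsi(prices, period=14):
    """RSI over the last `period` price changes, using simple averages."""
    if len(prices) < period + 1:
        raise ValueError("need at least period + 1 prices")
    window = prices[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losses in the window (convention)
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

def bollinger(prices, window=20, k=2.0):
    """Lower band, middle band (SMA), upper band: SMA -/+ k population std devs."""
    recent = prices[-window:]
    mid = mean(recent)
    sd = pstdev(recent)
    return mid - k * sd, mid, mid + k * sd

# Toy series: a steadily rising price.
rising = [float(i) for i in range(1, 41)]
macd_line, signal_line = macd(rising)
```

In a classification setup like the one described, values such as these would typically be computed per time step and used as input features, with the next-period price direction as the label.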
A Simple yet Effective Test-Time Adaptation for Zero-Shot Monocular Metric Depth Estimation
Authors: Rémi Marsal, Alexandre Chapoutot, Philippe Xu, David Filliat
Venue: IROS 2025
First: 2024-12-18T17:50:15+00:00 · Latest: 2025-11-28T16:18:07+00:00
Comments: Published at IROS 2025 https://ieeexplore.ieee.org/document/11247168
Abstract
The recent development of foundation models for monocular depth estimation, such as Depth Anything, paved the way to zero-shot monocular depth estimation. Since it returns an affine-invariant disparity map, the favored technique to recover the metric depth consists in fine-tuning the model. However, this stage is not straightforward: it can be costly and time-consuming because of the training and the creation of the dataset. The latter must contain images captured by the camera that will be used at test time and the corresponding ground truth. Moreover, the fine-tuning may also degrade the generalizing capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalizing power of the original depth estimation model while being robust to the noise of the sparse depth, of the camera-LiDAR calibration or of the depth model. Our experiments highlight enhancements relative to zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches and a better robustness than depth completion approaches. Code available at github.com/ENSTA-U2IS-AI/depth-rescaling.
中文标题/摘要
标题:一种简单有效的零样本单目度量深度估计测试时适应方法
最近的单目深度估计基础模型如Depth Anything为零样本单目深度估计铺平了道路。由于它返回的是仿射不变的视差图,因此恢复度量深度的首选方法是微调模型。然而,这一阶段并不简单,由于训练和数据集的创建,它可能成本高昂且耗时。后者必须包含测试时将使用的相机拍摄的图像及其对应的地面真值。此外,微调也可能降低原始模型的泛化能力。相反,本文提出了一种新的方法,使用由传感器或低分辨率LiDAR或基于IMU姿态的结构光法提供的3D点来重新缩放Depth Anything的预测。这种方法避免了微调,并保留了原始深度估计模型的泛化能力,同时对稀疏深度噪声、相机-LiDAR标定或深度模型的噪声具有鲁棒性。我们的实验表明,该方法相对于零样本单目度量深度估计方法有所改进,与微调方法相比具有竞争力的结果,并且比深度补全方法具有更好的鲁棒性。代码可在github.com/ENSTA-U2IS-AI/depth-rescaling获取。
Summary / 总结
This paper addresses the challenge of zero-shot monocular metric depth estimation by proposing a simple yet effective test-time adaptation method. Instead of fine-tuning the model, which can be costly and degrade generalization, the authors suggest rescaling Depth Anything predictions using 3D points from sensors like low-resolution LiDAR or structure-from-motion with IMU poses. Experiments show improvements over zero-shot methods, competitive results with fine-tuned approaches, and better robustness compared to depth completion methods.
本文提出了一种针对Depth Anything模型的简单有效测试时调整方法,以解决单目零样本深度估计的问题。该方法不进行模型微调,而是使用低分辨率LiDAR或结构光与IMU姿态提供的3D点来调整模型预测。实验结果表明,该方法在性能上优于现有零样本方法和深度补全方法,并且在各种噪声源下表现出更好的鲁棒性。
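The rescaling idea in this entry can be illustrated with a small sketch. It assumes the metric inverse depth is an affine function of the predicted disparity and fits the two coefficients by ordinary least squares on the sparse points. The function names `fit_scale_shift` and `to_metric_depth` are hypothetical, and plain least squares is a stand-in: the abstract states that the authors' method is robust to noise but does not describe the fitting procedure.

```python
def fit_scale_shift(pred_disparity, sparse_depth):
    """Least-squares fit of 1/depth ~ scale * disparity + shift.

    pred_disparity: affine-invariant disparity values at the sparse pixels.
    sparse_depth:   metric depths at the same pixels, all strictly positive.
    Assumption: ordinary least squares, with no outlier handling.
    """
    if len(pred_disparity) != len(sparse_depth) or len(pred_disparity) < 2:
        raise ValueError("need at least two matched samples")
    target = [1.0 / z for z in sparse_depth]  # metric inverse depth
    n = len(pred_disparity)
    mean_d = sum(pred_disparity) / n
    mean_t = sum(target) / n
    var_d = sum((d - mean_d) ** 2 for d in pred_disparity)
    if var_d == 0.0:
        raise ValueError("disparity samples must not all be equal")
    cov = sum((d - mean_d) * (t - mean_t)
              for d, t in zip(pred_disparity, target))
    scale = cov / var_d
    shift = mean_t - scale * mean_d
    return scale, shift


def to_metric_depth(disparity, scale, shift, eps=1e-6):
    """Map one disparity value to metric depth using the fitted affine map."""
    # Clamp the inverse depth so that the division stays finite.
    return 1.0 / max(scale * disparity + shift, eps)
```

In practice the sparse depths would come from projected LiDAR returns or structure-from-motion points, and a robust estimator would be a natural replacement for the least-squares fit when those points are noisy.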
Emergent Coordination and Phase Structure in Independent Multi-Agent Reinforcement Learning
Authors: Azusa Yamaguchi
First: 2025-11-28T16:14:31+00:00 · Latest: 2025-11-28T16:14:31+00:00
Comments: 22 pages, 19 figures
Abstract
A clearer understanding of when coordination emerges, fluctuates, or collapses in decentralized multi-agent reinforcement learning (MARL) is increasingly sought in order to characterize the dynamics of multi-agent learning systems. We revisit fully independent Q-learning (IQL) as a minimal decentralized testbed and run large-scale experiments across environment size L and agent density rho. We construct a phase map using two axes - the cooperative success rate (CSR) and a stability index derived from TD-error variance - revealing three distinct regimes: a coordinated and stable phase, a fragile transition region, and a jammed or disordered phase. A sharp double Instability Ridge separates these regimes and corresponds to persistent kernel drift, the time-varying shift of each agent's effective transition kernel induced by others' policy updates. Synchronization analysis further shows that temporal alignment is required for sustained cooperation, and that competition between drift and synchronization generates the fragile regime. Removing agent identifiers eliminates drift entirely and collapses the three-phase structure, demonstrating that small inter-agent asymmetries are a necessary driver of drift. Overall, the results show that decentralized MARL exhibits a coherent phase structure governed by the interaction between scale, density, and kernel drift, suggesting that emergent coordination behaves as a distribution-interaction-driven phase phenomenon.
中文标题/摘要
标题:独立多智能体强化学习中的涌现协调与相位结构
在分散的多智能体强化学习(MARL)中,对协调何时涌现、波动或崩溃的更清晰理解越来越受到重视,以便描述多智能体学习系统的动力学。我们重新审视完全独立的Q学习(IQL)作为最小的分散测试平台,并在环境大小L和智能体密度ρ的不同规模下进行大规模实验。我们使用两个轴构建相图——合作成功率(CSR)和从TD误差方差导出的稳定性指数,揭示了三个不同的阶段:协调和稳定的阶段、脆弱的过渡区域和阻塞性或无序的阶段。两个不稳定的山脊将这些阶段分开,并对应于持续的核漂移,即由于其他智能体策略更新而引起的每个智能体的有效转移核的时间变化位移。进一步的同步分析表明,持续的合作需要时间上的对齐,而漂移与同步之间的竞争产生了脆弱的阶段。移除智能体标识符完全消除了漂移,并使三阶段结构崩溃,证明了小的智能体间不对称性是漂移的必要驱动因素。总体而言,结果表明,分散的MARL表现出由规模、密度和核漂移之间的相互作用控制的相结构,表明涌现的协调行为是一种由分布-相互作用驱动的相现象。
Summary / 总结
The study investigates the emergence and stability of coordination in decentralized multi-agent reinforcement learning (MARL) by revisiting fully independent Q-learning (IQL) and conducting large-scale experiments across environment size and agent density. It constructs a phase map from the cooperative success rate and a stability index derived from TD-error variance, identifying three regimes: a coordinated and stable phase, a fragile transition region, and a jammed or disordered phase. A sharp double Instability Ridge separates these regimes and corresponds to persistent kernel drift, while competition between drift and synchronization generates the fragile regime. Removing agent identifiers eliminates drift and collapses the phase structure, indicating that small inter-agent asymmetries are a necessary driver of drift. The findings suggest that coordination in decentralized MARL is a phase phenomenon governed by scale, density, and kernel drift.
研究旨在理解去中心化多智能体强化学习(MARL)系统中协调的出现、波动和崩溃。使用完全独立的Q学习(IQL)作为最小测试床,并在不同环境规模和智能体密度下进行大规模实验。研究构建了一个基于合作成功率和稳定性指数的相图,识别出三种状态:协调和稳定状态、脆弱过渡区域和混乱或无序状态。关键发现包括一个尖锐的双不稳定性岭将这些状态区分开来,由持续的内核漂移和内核漂移与同步之间的竞争驱动。移除智能体标识符完全消除了漂移并消解了三相结构,表明智能体间的微小不对称性是漂移生成的关键驱动因素。
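A minimal sketch of one independent Q-learner, for readers who want the mechanics behind this entry. Each agent runs a standard tabular Q-learning update on its own observations and treats the other agents as part of the environment. The class name `IndependentQLearner` and the windowed variance are illustrative assumptions: the abstract says only that the stability index is derived from TD-error variance and does not define it.

```python
from collections import defaultdict
from statistics import pvariance


class IndependentQLearner:
    """Tabular Q-learning agent that ignores the other agents' existence."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.95):
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.td_errors = []  # history used for the variance statistic

    def update(self, state, action, reward, next_state):
        """Apply one Q-learning step and return the TD error."""
        td_target = reward + self.gamma * max(self.q[next_state])
        td_error = td_target - self.q[state][action]
        self.q[state][action] += self.alpha * td_error
        self.td_errors.append(td_error)
        return td_error

    def td_error_variance(self, window=100):
        """Population variance of recent TD errors (hypothetical statistic)."""
        recent = self.td_errors[-window:]
        return pvariance(recent) if len(recent) > 1 else 0.0
```

In a multi-agent run, one such learner would be instantiated per agent. Because the other agents keep changing their policies, the TD errors of each learner need not settle, which is the non-stationarity the abstract calls kernel drift.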
ADNF-Clustering: An Adaptive and Dynamic Neuro-Fuzzy Clustering for Leukemia Prediction
Authors: Marco Aruta, Ciro Listone, Giuseppe Murano, Aniello Murano
First: 2025-06-23T08:30:17+00:00 · Latest: 2025-11-28T16:14:04+00:00
Comments: 6 pages, 1 figure
Abstract
Leukemia diagnosis and monitoring rely increasingly on high-throughput image data, yet conventional clustering methods lack the flexibility to accommodate evolving cellular patterns and quantify uncertainty in real time. We introduce Adaptive and Dynamic Neuro-Fuzzy Clustering, a novel streaming-capable framework that combines Convolutional Neural Network-based feature extraction with an online fuzzy clustering engine. ADNF initializes soft partitions via Fuzzy C-Means, then continuously updates micro-cluster centers, densities, and fuzziness parameters using a Fuzzy Temporal Index (FTI) that measures entropy evolution. A topology refinement stage performs density-weighted merging and entropy-guided splitting to guard against over- and under-segmentation. On the C-NMC leukemia microscopy dataset, our tool achieves a silhouette score of 0.51, demonstrating superior cohesion and separation over static baselines. The method's adaptive uncertainty modeling and label-free operation hold immediate potential for integration within the INFANT pediatric oncology network, enabling scalable, up-to-date support for personalized leukemia management.
中文标题/摘要
标题:ADNF-聚类:一种适应性和动态神经模糊聚类方法用于白血病预测
白血病诊断和监测越来越多地依赖高通量图像数据,但传统的聚类方法缺乏适应不断变化的细胞模式和实时量化不确定性的能力。我们提出了一种新的流式处理框架——自适应和动态神经模糊聚类,该框架结合了基于卷积神经网络的特征提取和在线模糊聚类引擎。ADNF通过模糊C均值初始化软分区,然后使用度量熵演化的模糊时间索引(FTI)不断更新微聚类中心、密度和模糊性参数。拓扑优化阶段执行密度加权合并和熵导向分裂,以防止过度分割和欠分割。在C-NMC白血病显微镜数据集上,我们的工具实现了0.51的轮廓分数,显示出与静态基线相比更好的凝聚性和分离性。该方法的自适应不确定性建模和无标签操作具有立即整合到INFANT儿童肿瘤网络中的潜力,能够提供可扩展的、实时的个性化白血病管理支持。
Summary / 总结
The research aims to improve leukemia diagnosis and monitoring by addressing the limitations of conventional clustering methods in handling evolving cellular patterns and real-time uncertainty quantification. The proposed ADNF-Clustering framework uses a Convolutional Neural Network for feature extraction and an online fuzzy clustering engine. It initializes soft partitions with Fuzzy C-Means and updates micro-cluster centers, densities, and fuzziness parameters using a Fuzzy Temporal Index that measures entropy evolution. The method achieves a silhouette score of 0.51 on the C-NMC leukemia microscopy dataset, showing better cohesion and separation than static baselines. The authors see potential for integrating this adaptive, label-free approach into the INFANT pediatric oncology network.
研究旨在通过解决传统聚类方法在处理细胞模式演变和实时不确定性量化方面的局限性,提高白血病的诊断和监测。提出的ADNF-聚类方法结合了基于卷积神经网络的特征提取和在线模糊聚类引擎。该方法使用模糊C-均值初始化软分区,并通过模糊时间指数更新微聚类中心、密度和模糊性参数。该方法在C-NMC白血病显微镜数据集上的轮廓得分为0.51,显示出比静态基线更好的凝聚性和分离性。该适应性和动态方法为儿科肿瘤网络中的实时、无标签支持提供了潜在的应用前景。
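The Fuzzy C-Means initialization mentioned in this entry rests on a standard membership formula, sketched below together with the Shannon entropy of a membership vector. This is the textbook rule only. The Fuzzy Temporal Index is not defined in the abstract, so it is not reproduced here, and the function names are illustrative.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster.

    Uses the standard rule u_i = d_i^(-p) / sum_j d_j^(-p) with
    p = 2 / (m - 1), where d_i is the Euclidean distance to center i
    and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def membership_entropy(memberships):
    """Shannon entropy (in nats) of a membership vector."""
    return -sum(u * math.log(u) for u in memberships if u > 0.0)
```

For example, a point at distance 1 from one center and distance 2 from another receives memberships 0.8 and 0.2 with `m=2.0`. An equidistant point receives 0.5 and 0.5, which is the maximum-entropy case for two clusters.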
Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin
Authors: Andy Huynh, João Malheiro Silva, Holger Caesar, Tong Duy Son
First: 2025-11-25T14:25:19+00:00 · Latest: 2025-11-28T16:12:54+00:00
Comments: 8 pages, 5 figures. Submitted to IEEE Intelligent Vehicles Symposium (IV) 2026 for possible publication. Revised version (v2) to correct author order
Abstract
3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
中文标题/摘要
标题:基于材料信息的高斯点云法重建数字孪生3D世界
数字孪生的3D重建通常依赖于基于LiDAR的方法,这些方法提供准确的几何结构但缺乏相机自然捕捉的语义和纹理。传统的LiDAR-相机融合方法需要复杂的校准并且仍然难以处理像玻璃这样的材料,这些材料在图像中可见但在点云中表示不佳。我们提出了一种仅使用相机的管道,通过多视图图像的3D高斯点云法重建场景,使用视觉模型提取语义材料掩码,将高斯表示转换为带有投影材料标签的网格表面,并为现代图形引擎和模拟器中的传感器仿真分配基于物理的材料属性。该方法结合了逼真的重建与基于物理的材料分配,提供的传感器仿真精度与LiDAR-相机融合相当,同时消除了硬件复杂性和校准要求。我们使用来自装备测试车辆的内部数据集验证了仅使用相机的方法,利用LiDAR作为反射率验证的基准,并结合图像相似性指标。
Summary / 总结
The paper proposes a camera-only pipeline for 3D reconstruction in Digital Twins using 3D Gaussian Splatting from multi-view images. It extracts semantic material masks with vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties. The method provides photorealistic reconstruction and sensor simulation fidelity comparable to LiDAR-camera fusion, without the hardware complexity or calibration requirements. The approach is validated on an internal dataset from an instrumented test vehicle, using LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
论文提出了一种仅使用多视角图像进行3D重建的方法,通过3D高斯点绘技术从图像中提取语义材料掩码,将高斯表示转换为带有材料标签的网格表面,并赋予基于物理的材料属性。该方法提供了逼真的3D重建和与LiDAR-相机融合相当的传感器仿真精度,无需硬件复杂性和校准。实验结果使用来自装备测试车辆的内部数据集进行验证,将反射率和图像相似性与LiDAR数据进行比较。
Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach
Authors: Haruki Sakajo, Hiroshi Takato, Hiroshi Tsutsui, Komei Soda, Hidetaka Kamigaito, Taro Watanabe
First: 2025-11-28T16:09:36+00:00 · Latest: 2025-11-28T16:09:36+00:00
Comments: Accepted to MMLoSo 2025
Abstract
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. These capabilities have promising applications in various industrial domains, such as autonomous driving. For example, LVLMs can generate safety-oriented descriptions of videos captured by road-facing cameras. However, ensuring comprehensive safety requires monitoring driver-facing views as well to detect risky events, such as the use of mobile phones while driving. Thus, the ability to process synchronized inputs from both driver-facing and road-facing cameras is necessary. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions. Nonetheless, several challenges remain, particularly in detecting subtle or complex events in the video. Our findings and error analysis provide valuable insights that can contribute to the improvement of LVLM-based systems in this domain.
中文标题/摘要
标题:迈向自动安全驾驶指导:大规模视觉语言模型方法
大规模视觉语言模型(LVLMs)在需要视觉信息的任务中表现出先进的能力,包括物体检测。这些能力在各个工业领域中具有潜在的应用前景,例如自动驾驶。例如,LVLMs可以生成由道路面向摄像头捕获的视频的安全描述。然而,确保全面的安全性还需要监控驾驶员面向的视角以检测诸如使用手机等危险事件。因此,处理来自驾驶员面向和道路面向摄像头的同步输入的能力是必要的。在本研究中,我们开发了模型并通过对数据集的构建和评估其在该数据集上的性能来研究LVLMs的能力。我们的实验结果表明,虽然预训练的LVLMs效果有限,但微调的LVLMs可以生成准确且安全意识强的驾驶指令。然而,仍存在一些挑战,特别是在检测视频中的细微或复杂事件方面。我们的研究结果和错误分析提供了有价值的见解,可以为该领域的LVLM基系统改进做出贡献。
Summary / 总结
This study aims to leverage large-scale Vision Language Models (LVLMs) for automatic safe driving instruction, focusing on the ability to process synchronized inputs from both driver-facing and road-facing cameras. The research constructs a dataset and evaluates LVLMs, finding that fine-tuned LVLMs can generate accurate and safety-aware driving instructions, although challenges remain in detecting subtle or complex events in the video. Error analysis provides insights for improving LVLM-based systems in this domain.
本研究旨在利用大规模视觉语言模型(LVLM)实现自动安全驾驶指导,重点在于处理来自驾驶员视角和道路视角的同步输入。研究构建了一个数据集并评估了LVLMs,发现微调后的LVLMs能够生成准确且安全的驾驶指令,但仍存在检测视频中细微或复杂事件的挑战。错误分析为改进基于LVLM的系统提供了有价值的见解。
OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning
Authors: Zixun Huang, Jiayi Sheng, Zeyu Zheng
First: 2025-11-28T16:09:28+00:00 · Latest: 2025-11-28T16:09:28+00:00
Comments: 19 pages, 7 figures
Abstract
Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction and naturally enhancing stability beyond existing methods. These insights motivate Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO), an algorithm that jointly adapts learning rates and baselines in a theoretically grounded manner. Experiments on Qwen3-4B-Base and Qwen3-8B-Base demonstrate consistent gains over existing policy optimization methods, validating that our theoretical contributions translate into practical improvements in large-scale post-training.
中文标题/摘要
标题:OBLR-PO:一种稳定的强化学习理论框架
基于强化学习(RL)的大语言模型后训练方法已经取得了快速进展,但其设计主要依赖于启发式方法而非系统的理论原则。这一差距限制了我们对梯度估计器统计特性和相关优化算法性质的理解,从而限制了提高训练稳定性和整体性能的机会。在本文中,我们提供了一个统一的理论框架,该框架在温和假设下表征了常用策略梯度估计器的统计特性。我们的分析建立了无偏性,推导出了精确的方差表达式,并给出了一个优化损失上界,从而能够进行关于学习动力学的原理性推理。基于这些结果,我们证明了收敛保证,并推导出由梯度信噪比(SNR)控制的自适应学习率计划。我们进一步表明,方差最优基线是一个梯度加权估计器,提供了一种新的方差减少原则,并自然增强了稳定性,超越了现有方法。这些见解促使我们提出了Optimal Baseline和Learning-Rate Policy Optimization(OBLR-PO)算法,该算法以理论为基础联合调整学习率和基线。在Qwen3-4B-Base和Qwen3-8B-Base上的实验表明,该算法在现有的策略优化方法上具有一致的改进,验证了我们的理论贡献转化为大规模后训练的实际改进。
Summary / 总结
This paper addresses the gap in theoretical understanding of reinforcement learning-based post-training methods for large language models, which have been primarily guided by heuristics. It provides a unified theoretical framework to analyze the statistical properties of policy-gradient estimators, leading to convergence guarantees and an adaptive learning-rate schedule based on the signal-to-noise ratio. The proposed OBLR-PO algorithm, which jointly adapts learning rates and baselines, shows consistent improvements over existing methods in experiments on Qwen3-4B-Base and Qwen3-8B-Base, validating the practical benefits of the theoretical insights.
本文针对大型语言模型中基于强化学习的方法缺乏理论理解的问题,提出了一个统一的理论框架(OBLR-PO),该框架描述了策略梯度估计器的统计特性。该框架允许基于梯度信号噪声比推导出自适应的学习率调度,并且提出了一个方差最优的基线。实验结果表明,与现有方法相比,该理论框架具有一致的改进效果,验证了理论洞察的实际益处。
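The statement that the variance-optimal baseline is gradient-weighted can be illustrated with the classical closed form for a scalar baseline, b* = E[||g||^2 R] / E[||g||^2], where g is the score function (the gradient of the log-policy) and R is the return. The sketch below computes this quantity from samples. It illustrates the classical result and is not claimed to be the exact estimator derived in the paper, and the function names are hypothetical.

```python
def gradient_weighted_baseline(score_sq_norms, rewards):
    """Sample estimate of b* = sum(w_i * R_i) / sum(w_i).

    score_sq_norms: w_i = ||grad log pi(sample_i)||^2 for each sample.
    rewards:        R_i, the return of each sample.
    """
    total = sum(score_sq_norms)
    if total == 0.0:
        raise ValueError("all score-function norms are zero")
    return sum(w * r for w, r in zip(score_sq_norms, rewards)) / total


def second_moment(score_sq_norms, rewards, baseline):
    """Mean of ||g_i (R_i - b)||^2 over the samples.

    When the score function has zero mean, the estimator's mean does not
    depend on b, so minimizing this quantity minimizes the variance.
    """
    n = len(rewards)
    return sum(w * (r - baseline) ** 2
               for w, r in zip(score_sq_norms, rewards)) / n
```

On any fixed sample, the gradient-weighted baseline gives a second moment no larger than that of the plain mean-reward baseline, because it is the exact minimizer of the weighted quadratic in b.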
Hard-Constrained Neural Networks with Physics-Embedded Architecture for Residual Dynamics Learning and Invariant Enforcement in Cyber-Physical Systems
Authors: Enzo Nicolás Spotorno, Josafat Leal Filho, Antônio Augusto Fröhlich
First: 2025-11-28T16:06:24+00:00 · Latest: 2025-11-28T16:06:24+00:00
Comments: 41 pages (30 pages main text + 11 pages appendices), 3 figures, 8 tables. Submitted to JMLR
Abstract
This paper presents a framework for physics-informed learning in complex cyber-physical systems governed by differential equations with both unknown dynamics and algebraic invariants. First, we formalize the Hybrid Recurrent Physics-Informed Neural Network (HRPINN), a general-purpose architecture that embeds known physics as a hard structural constraint within a recurrent integrator to learn only residual dynamics. Second, we introduce the Projected HRPINN (PHRPINN), a novel extension that integrates a predict-project mechanism to strictly enforce algebraic invariants by design. The framework is supported by a theoretical analysis of its representational capacity. We validate HRPINN on a real-world battery prognostics DAE and evaluate PHRPINN on a suite of standard constrained benchmarks. The results demonstrate the framework's potential for achieving high accuracy and data efficiency, while also highlighting critical trade-offs between physical consistency, computational cost, and numerical stability, providing practical guidance for its deployment.
中文标题/摘要
标题:具有物理嵌入架构的硬约束神经网络及其在计算物理系统中残差动力学习和不变性强制的应用
本文提出了一种框架,用于解决由微分方程支配的复杂计算物理系统中的物理信息学习问题,该系统具有未知动力学和代数不变量。首先,我们形式化了混合递归物理信息神经网络(HRPINN),这是一种通用架构,将已知的物理知识作为硬结构约束嵌入在递归积分器中,仅学习残差动力学。其次,我们引入了投影HRPINN(PHRPINN),这是一种新颖的扩展,通过集成预测-投影机制严格设计地强制执行代数不变量。该框架得到了其表示能力的理论分析支持。我们在实际电池预测DAE上验证了HRPINN,并在标准约束基准上评估了PHRPINN。结果表明,该框架具有实现高准确性和数据效率的潜力,同时也突出了物理一致性、计算成本和数值稳定性之间的关键权衡,为其实用部署提供了指导。
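The predict-project mechanism described in this entry can be sketched on a toy system. The example below integrates a planar rotation, whose exact solution stays on the unit circle, with an explicit Euler predictor and then projects each predicted state back onto the invariant x^2 + y^2 = 1. This is a hand-written illustration with no learned residual. The dynamics, the constraint, and the function names are assumptions chosen for the example and are not taken from the paper.

```python
import math


def predict(state, omega, dt):
    """Explicit Euler step for x' = -omega * y, y' = omega * x."""
    x, y = state
    return (x - dt * omega * y, y + dt * omega * x)


def project_to_unit_circle(state):
    """Orthogonal projection onto the invariant set x^2 + y^2 = 1."""
    x, y = state
    norm = math.hypot(x, y)
    if norm == 0.0:
        raise ValueError("cannot project the origin onto the circle")
    return (x / norm, y / norm)


def rollout(state, omega, dt, steps, project=True):
    """Integrate for a number of steps, optionally projecting each time."""
    for _ in range(steps):
        state = predict(state, omega, dt)
        if project:
            state = project_to_unit_circle(state)
    return state
```

Without the projection, each Euler step multiplies the norm by sqrt(1 + (omega * dt)^2), so the state spirals outward. With the projection, the invariant holds to floating-point precision at every step, at the cost of one extra operation per step.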
Multi-Modal Scene Graph with Kolmogorov-Arnold Experts for Audio-Visual Question Answering
Authors: Zijian Fu, Changsheng Lv, Mengshi Qi, Huadong Ma
First: 2025-11-28T16:03:23+00:00 · Latest: 2025-11-28T16:03:23+00:00
Abstract
In this paper, we propose a novel Multi-Modal Scene Graph with Kolmogorov-Arnold Expert Network for Audio-Visual Question Answering (SHRIKE). The task aims to mimic human reasoning by extracting and fusing information from audio-visual scenes, with the main challenge being the identification of question-relevant cues from the complex audio-visual content. Existing methods fail to capture the structural information within video, and suffer from insufficient fine-grained modeling of multi-modal features. To address these issues, we are the first to introduce a new multi-modal scene graph that explicitly models the objects and their relationships as a visually grounded, structured representation of the audio-visual scene. Furthermore, we design a Kolmogorov-Arnold Network (KAN)-based Mixture of Experts (MoE) to enhance the expressive power of the temporal integration stage. This enables more fine-grained modeling of cross-modal interactions within the question-aware fused audio-visual representation, capturing richer and more nuanced patterns and thereby improving temporal reasoning performance. We evaluate the model on the established MUSIC-AVQA and MUSIC-AVQA v2 benchmarks, where it achieves state-of-the-art performance. Code and model checkpoints will be publicly released.
中文标题/摘要
标题:基于柯尔莫哥洛夫-阿诺德专家的多模态场景图用于视听问答
在本文中,我们提出了一种新的基于柯尔莫哥洛夫-阿诺德专家网络的多模态场景图(SHRIKE),用于视听问答。该任务旨在通过从视听场景中提取和融合信息来模拟人类推理,主要挑战是从复杂的视听内容中识别与问题相关的关键线索。现有方法无法捕捉视频中的结构信息,并且在多模态特征的细粒度建模方面存在不足。为了解决这些问题,我们首次引入了一种新的多模态场景图,明确地将对象及其关系建模为视听场景的视觉基础结构化表示。此外,我们设计了一种基于柯尔莫哥洛夫-阿诺德网络(KAN)的混合专家(MoE)模型,以增强时间整合阶段的表达能力。这使得在问题感知的视听融合表示中能够更细粒度地建模跨模态交互,从而捕捉更丰富和更细腻的模式,进而提高时间推理性能。我们在建立的MUSIC-AVQA和MUSIC-AVQA v2基准上评估了该模型,其中达到了最先进的性能。代码和模型检查点将公开发布。
Summary / 总结
The research aims to improve audio-visual question answering by addressing the challenge of identifying question-relevant cues from complex audio-visual content. The proposed SHRIKE model uses a multi-modal scene graph to model objects and their relationships, and a Kolmogorov-Arnold Network-based Mixture of Experts to enhance the temporal integration stage. The model achieves state-of-the-art performance on the MUSIC-AVQA and MUSIC-AVQA v2 benchmarks.
论文提出了SHRIKE,一种新的多模态场景图与柯尔莫哥洛夫-阿诺德专家网络,用于音频-视觉问答。该模型通过将对象及其关系建模为结构化的表示,并使用KAN基混合专家来增强跨模态交互,来解决在复杂音频-视觉内容中识别问题相关线索的挑战。该模型在MUSIC-AVQA和MUSIC-AVQA v2基准上达到了最先进的性能。