SemanticGen: Video Generation in Semantic Space
Authors: Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai
First: 2025-12-23T18:59:56+00:00 · Latest: 2025-12-23T18:59:56+00:00
Comments: Project page: https://jianhongbai.github.io/SemanticGen/
Abstract
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
中文标题/摘要
标题:SemanticGen:在语义空间中生成视频
最先进的视频生成模型通常在VAE空间中学习视频潜在变量的分布,并使用VAE解码器将它们映射到像素上。虽然这种方法可以生成高质量的视频,但在生成长视频时却存在收敛速度慢且计算成本高的问题。本文中,我们提出了一种名为SemanticGen的新颖解决方案,通过在语义空间中生成视频来解决这些问题。我们的主要见解是,由于视频中固有的冗余性,生成过程应该从紧凑的高层语义空间开始进行全局规划,然后添加高频细节,而不是直接使用双向注意力模型来建模大量的低级视频令牌。SemanticGen采用两阶段生成过程。在第一阶段,扩散模型生成紧凑的语义视频特征,这些特征定义了视频的全局布局。在第二阶段,另一个扩散模型根据这些语义特征生成VAE潜在变量,以产生最终输出。我们观察到,在语义空间中生成比在VAE潜在变量空间中生成具有更快的收敛速度。当扩展到长视频生成时,我们的方法也具有有效性和计算效率。大量实验表明,SemanticGen生成高质量的视频,并优于最先进的方法和强大的基线。
Summary / 总结
SemanticGen addresses the limitations of existing video generative models by generating videos in a semantic space, which leads to faster convergence and computational efficiency, especially for long videos. It uses a two-stage process where a diffusion model first generates compact semantic video features for global planning, followed by another diffusion model to generate VAE latents conditioned on these features to produce the final video. Experiments show that SemanticGen outperforms state-of-the-art approaches in generating high-quality videos.
SemanticGen通过在语义空间生成视频来解决现有视频生成模型的局限性,这使得收敛更快且计算效率更高,尤其适用于长视频生成。其采用两阶段过程,首先使用扩散模型生成紧凑的语义视频特征进行全局规划,然后使用另一个扩散模型生成基于这些特征的VAE潜变量。实验表明,SemanticGen生成高质量的视频并优于最先进的方法。
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
First: 2025-12-23T18:59:49+00:00 · Latest: 2025-12-23T18:59:49+00:00
Abstract
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
中文标题/摘要
标题:长视频代理:多智能体长视频推理
近期多模态LLM和使用工具进行长视频问答的系统表明,可以在长达一小时的剧集中进行推理。然而,许多方法仍然将内容压缩成有损摘要,或者依赖有限的工具集,削弱了时间定位并错过了细微线索。我们提出了一种多智能体框架,在该框架中,一个主LLM协调一个定位代理来定位与问题相关的时间段,并协调一个视觉代理来提取目标文本观察。主代理在步数限制下进行规划,并通过强化学习训练以促进简洁、正确和高效的多智能体合作。这种设计有助于主代理通过定位关注相关片段,用视觉细节补充字幕,并产生可解释的轨迹。在我们提出的LongTVQA和LongTVQA+(从TVQA/TVQA+汇总而成的集水平数据集)上,我们的多智能体系统显著优于强大的非智能体基线。实验还表明,强化学习进一步增强了训练智能体的推理和规划能力。代码和数据将在https://longvideoagent.github.io/上共享。
Summary / 总结
The research aims to improve long-video question answering with a multi-agent framework in which a master LLM coordinates a grounding agent, which localizes question-relevant segments, and a vision agent, which extracts targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient cooperation. The system significantly outperforms strong non-agent baselines on the episode-level LongTVQA and LongTVQA+ datasets, and reinforcement learning further strengthens the trained agent's reasoning and planning. The authors state that code and data will be shared at https://longvideoagent.github.io/.
研究旨在通过开发一个多代理框架来提高长视频问答能力,该框架利用主语言模型协调定位代理和视觉代理。系统通过强化学习训练以增强合作并专注于相关视频片段,从而在剧集级别的数据集上比非代理基线表现出色。强化学习还提高了代理的推理和规划能力。
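The coordination pattern described in this entry (a master that plans under a step limit and delegates to a grounding agent and a vision agent) can be sketched as a small control loop. This is only an illustrative sketch: the function names `run_master`, `toy_plan`, `toy_ground`, and `toy_look`, the three action labels, and the fallback answer are all hypothetical and are not taken from the paper, which uses trained LLM agents rather than these stubs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action name, observation text)


@dataclass
class Trajectory:
    """Record of what the master did, kept so the run stays inspectable."""
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def run_master(
    question: str,
    plan: Callable[[str, List[Step]], Tuple[str, str]],
    ground: Callable[[str], str],
    look: Callable[[str], str],
    max_steps: int = 5,
) -> Trajectory:
    """Hypothetical master loop: call plan() until it answers or the budget ends."""
    traj = Trajectory()
    for _ in range(max_steps):
        action, arg = plan(question, traj.steps)
        if action == "answer":
            traj.answer = arg
            return traj
        if action == "ground":    # localize a question-relevant segment
            obs = ground(arg)
        elif action == "look":    # extract a textual observation from a segment
            obs = look(arg)
        else:
            obs = f"unknown action: {action}"
        traj.steps.append((action, obs))
    traj.answer = "unknown"       # assumed fallback when the step limit is hit
    return traj


# Toy stand-ins for the agents (assumptions for illustration only).
def toy_plan(question: str, history: List[Step]) -> Tuple[str, str]:
    if not history:
        return "ground", question
    if len(history) == 1:
        return "look", history[-1][1]
    return "answer", history[-1][1]


def toy_ground(query: str) -> str:
    return "clip 12"


def toy_look(segment: str) -> str:
    return f"red mug in {segment}"


result = run_master("What is on the table?", toy_plan, toy_ground, toy_look)
```

The step limit is what bounds the cost of a run in this sketch; the recorded `steps` list plays the role of the interpretable trajectory mentioned in the abstract.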
SpatialTree: How Spatial Abilities Branch Out in MLLMs
Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
First: 2025-12-23T18:59:46+00:00 · Latest: 2025-12-23T18:59:46+00:00
Comments: webpage: https://spatialtree.github.io/
Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
中文标题/摘要
标题:SpatialTree:空间能力在多模态大语言模型中的分支发展
认知科学表明,空间能力是逐步发展的,从感知到推理和互动。然而,在多模态大语言模型(MLLMs)中,这种层次结构仍然不甚明了,因为大多数研究都集中在少数任务上。我们引入了SpatialTree,这是一种认知科学启发式的层次结构,将空间能力分为四个层次:低级感知(L1)、心理制图(L2)、模拟(L3)和能动性(L4)。基于这一分类法,我们构建了第一个能力导向的层次基准,全面评估了主流MLLMs的27个子能力。评估结果揭示了一个清晰的结构:L1技能大多相互独立,而更高层次的技能则高度相关,表明相互依赖性在增加。通过有针对性的监督微调,我们发现了一个令人惊讶的转移动态:L1内的负向转移,但低级到高级能力之间存在强大的跨层次转移,且具有显著的协同效应。最后,我们探讨了如何改进整个层次结构。我们发现,鼓励大量“思考”的简单强化学习是不可靠的:它有助于复杂推理,但损害了直观感知。我们提出了一种简单的自动思考策略,抑制不必要的思考,使强化学习能够一致地提高所有层次的性能。通过构建SpatialTree,我们提供了一个概念验证框架,用于理解和系统地扩展MLLMs中的空间能力。
Summary / 总结
The research aims to understand how spatial abilities develop in multimodal LLMs (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). The study evaluates mainstream MLLMs across 27 sub-abilities and finds that L1 skills are largely orthogonal, while higher-level skills are strongly correlated. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from low- to high-level abilities. The study also finds that naive reinforcement learning (RL) that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception, and it proposes an auto-think strategy that suppresses unnecessary deliberation so that RL improves performance across all levels.
研究旨在通过提出一个认知科学启发式的层次结构SpatialTree,来理解多模态语言模型(MLLMs)中的空间能力发展,该层次结构将空间能力分为感知、心理映射、模拟和行动能力四个层次。研究对主流MLLMs在27个子能力上的表现进行了评估,并发现较低层次的能力是相互独立的,而较高层次的能力则高度相关。通过针对性的微调发现,较低层次的能力之间存在负迁移,但较低层次能力向较高层次能力的迁移则非常强大且具有协同效应。研究还探讨了如何使用强化学习(RL)来提高这些能力,提出了一种自动思考策略来抑制不必要的思考,从而在所有层次上持续提升性能。
Active Intelligence in Video Avatars via Closed-loop World Modeling
Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen
First: 2025-12-23T18:59:16+00:00 · Latest: 2025-12-23T18:59:16+00:00
Comments: Project Page: https://xuanhuahe.github.io/ORCA/
Abstract
Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
中文标题/摘要
标题:通过闭环世界建模的视频头像中的主动智能
当前的视频头像生成方法在身份保留和运动对齐方面表现出色,但缺乏真正的自主性,它们无法通过适应性环境交互自主追求长期目标。我们通过引入L-IVA(长期交互视觉头像)任务和基准来解决这一问题,用于评估随机生成环境中的目标导向规划,以及ORCA(在线推理和认知架构),这是第一个使视频头像具备主动智能的框架。ORCA 通过两个关键创新体现了内部世界模型(IWM)能力:(1) 一个闭环OTAR循环(观察-思考-行动-反思),通过不断验证预测结果与实际生成结果来在生成不确定性下保持稳健的状态跟踪;(2) 一个分层的双系统架构,其中系统2进行战略推理并预测状态,系统1将抽象计划转化为具体的、模型特定的动作指令。通过将头像控制建模为POMDP,并实施基于结果验证的连续信念更新,ORCA 使头像能够在开放域场景中自主完成多步任务。大量实验表明,ORCA 在任务成功率和行为一致性方面显著优于开环和非反思基线,验证了我们基于IWM的设计,使视频头像智能从被动动画发展到主动、目标导向的行为。
Summary / 总结
The research addresses the lack of genuine agency in current video avatar generation methods by introducing L-IVA and ORCA. L-IVA is a task and benchmark for evaluating goal-directed planning, while ORCA is a framework enabling active intelligence in video avatars through a closed-loop OTAR cycle and a hierarchical dual-system architecture. ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, demonstrating the effectiveness of its internal world model capabilities.
研究旨在通过使视频角色能够自主追求长期目标并适应性地互动来增强其智能。研究引入了L-IVA和ORCA,其中ORCA采用闭环OTAR循环和分层双系统架构来保持状态跟踪的稳健性并实现战略推理。实验结果表明,ORCA在任务成功率和行为一致性方面优于开环和非反思基线,验证了所提框架的有效性。
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
Authors: Yedi Zhang, Andrew Saxe, Peter E. Latham
First: 2025-12-23T18:55:30+00:00 · Latest: 2025-12-23T18:55:30+00:00
Abstract
Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also illuminates the effects of data distribution and weight initialization on the duration and number of plateaus in learning, dissociating previously confounding factors. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
中文标题/摘要
标题:鞍点到鞍点动力学解释了神经网络架构中的简洁性偏见
使用梯度下降训练的神经网络通常会随着时间学习越来越复杂的解决方案,这种现象称为简洁性偏见。尽管在各种架构中广泛观察到,现有的理论处理缺乏统一框架。我们提出了一种理论框架,解释了一般类别的神经网络(包括全连接、卷积和基于注意力的架构)中由鞍点到鞍点学习动力学引起的简洁性偏见。在这里,简单是指可以用少量隐藏单元表示的,即隐藏神经元、卷积核或注意力头。具体来说,我们展示了线性网络学习秩递增的解决方案,ReLU网络学习具有递增折点数的解决方案,卷积网络学习具有递增卷积核数的解决方案,而自我注意力模型学习具有递增注意力头数的解决方案。通过分析梯度下降学习的固定点、不变流形和动力学,我们展示了鞍点到鞍点动力学通过迭代演化接近不变流形、接近鞍点并切换到另一个不变流形来运作。我们的分析还阐明了数据分布和权重初始化对学习平台持续时间和数量的影响,从而分离了以前混淆的因素。总体而言,我们的理论为理解梯度下降何时以及为何逐渐学习越来越复杂的解决方案提供了框架。
Summary / 总结
The paper presents a theoretical framework that explains a simplicity bias in neural networks trained with gradient descent as a consequence of saddle-to-saddle learning dynamics. Solutions grow in complexity over training, where the measure of complexity depends on the architecture: rank for linear networks, number of kinks for ReLU networks, number of convolutional kernels for convolutional networks, and number of attention heads for self-attention models. By analyzing fixed points, invariant manifolds, and gradient descent dynamics, the paper shows that learning repeatedly evolves near an invariant manifold, approaches a saddle, and switches to another invariant manifold. It also shows how data distribution and weight initialization affect the duration and number of plateaus in learning.
论文研究了神经网络中简单性偏见的现象,即训练过程中解决方案变得越来越复杂。它提出了一种基于鞍点到鞍点动态的理论框架来解释这种偏见在各种网络架构中的表现,包括全连接、卷积和注意力机制网络。研究表明,线性网络学习更高秩的解决方案,ReLU网络学习更多拐点的解决方案,卷积网络学习更多卷积核的解决方案,而自注意力模型学习更多注意力头的解决方案。分析显示,鞍点到鞍点动态通过在不变流形附近迭代演化、接近鞍点并切换到其他流形,影响学习解决方案的复杂性和训练中的平台期持续时间和数量。该框架有助于理解何时以及为什么梯度下降在训练过程中逐渐学习越来越复杂的解决方案。
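The claim that linear networks learn solutions of increasing rank can be illustrated with a toy experiment. The sketch below is not the paper's setup: it assumes a two-layer linear network `W2 @ W1` trained by plain gradient descent on the loss 0.5 * ||W2 W1 - T||_F^2 (equivalent to regression with whitened inputs), a small random initialization, and a hypothetical target `T` with singular values 3.0 and 0.5. All names and hyperparameters are illustrative choices.

```python
import numpy as np


def train_two_layer_linear(target, hidden=8, init_scale=1e-3, lr=0.01,
                           steps=4000, seed=0, record_every=100):
    """Gradient descent on 0.5 * ||W2 @ W1 - target||_F^2 from a small init.

    Returns a dict mapping step -> singular values of the product W2 @ W1.
    """
    rng = np.random.default_rng(seed)
    d_out, d_in = target.shape
    W1 = init_scale * rng.standard_normal((hidden, d_in))
    W2 = init_scale * rng.standard_normal((d_out, hidden))
    history = {}
    for step in range(steps + 1):
        if step % record_every == 0:
            history[step] = np.linalg.svd(W2 @ W1, compute_uv=False)
        err = W2 @ W1 - target
        grad_W2 = err @ W1.T      # dLoss/dW2
        grad_W1 = W2.T @ err      # dLoss/dW1
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return history


# Hypothetical rank-2 target with well-separated singular values.
target = np.diag([3.0, 0.5])
history = train_two_layer_linear(target)

for step in (0, 600, 4000):
    print(step, np.round(history[step], 3))
```

With these assumed settings, the larger singular value should be fitted well before the smaller one leaves the neighbourhood of zero, so the learned map passes through an approximately rank-1 stage (a plateau near a saddle) before reaching rank 2. Shrinking `init_scale` lengthens the plateau, which is in line with the entry's remark that weight initialization affects plateau duration.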
Repurposing Video Diffusion Transformers for Robust Point Tracking
Authors: Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam, Dahyun Chung, Siyoon Jin, Jung Yi, Jaewon Min, Junhwa Hur, Seungryong Kim
First: 2025-12-23T18:54:10+00:00 · Latest: 2025-12-23T18:54:10+00:00
Comments: Project Page: https://cvlab-kaist.github.io/DiTracker/
Abstract
Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence and producing unreliable matching costs under challenging conditions. Through systematic analysis, we find that video Diffusion Transformers (DiTs), pre-trained on large-scale real-world videos with spatio-temporal attention, inherently exhibit strong point tracking capability and robustly handle dynamic motions and frequent occlusions. We propose DiTracker, which adapts video DiTs through: (1) query-key attention matching, (2) lightweight LoRA tuning, and (3) cost fusion with a ResNet backbone. Despite training with an 8 times smaller batch size, DiTracker achieves state-of-the-art performance on the challenging ITTO benchmark and matches or outperforms state-of-the-art models on TAP-Vid benchmarks. Our work validates video DiT features as an effective and efficient foundation for point tracking.
中文标题/摘要
标题:重新利用视频扩散变换器进行稳健的点跟踪
点跟踪旨在跨视频帧定位对应点,是4D重建、机器人技术和视频编辑中的基本任务。现有方法通常依赖浅层卷积骨干网络,如ResNet,这些网络独立处理帧,缺乏时间连贯性,在挑战性条件下产生不可靠的匹配成本。通过系统分析,我们发现视频扩散变换器(DiTs),在大规模真实世界视频上预训练,具有时空注意力机制,本质上表现出强大的点跟踪能力,并能稳健处理动态运动和频繁遮挡。我们提出了DiTracker,通过:(1) 查询-键注意力匹配,(2) 轻量级LoRA调优,(3) 与ResNet骨干网络的成本融合,来适应视频DiTs。尽管训练时使用8倍较小的批量大小,DiTracker在具有挑战性的ITTO基准测试中达到最先进的性能,并在TAP-Vid基准测试中匹配或超越最先进的模型。我们的工作验证了视频DiT特征作为点跟踪的有效且高效的基底。
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Authors: Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherre, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
First: 2025-12-23T18:51:50+00:00 · Latest: 2025-12-23T18:51:50+00:00
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
中文标题/摘要
标题:自回归模型中涌现的时间抽象使层次强化学习成为可能
大规模的自回归模型在下一个标记预测的预训练和强化学习(RL)微调后,在许多问题领域取得了前所未有的成功。在RL过程中,这些模型通过逐个生成输出来探索。然而,逐标记采样动作会导致学习效率低下,尤其是在奖励稀疏的情况下。在这里,我们展示了通过在自回归模型的内部表示中进行动作和探索,可以克服这一问题。具体来说,为了发现时间抽象的动作,我们引入了一个高阶的非因果序列模型,其输出控制基础自回归模型的残差流激活。在具有层次结构的网格世界和MuJoCo任务中,我们发现高阶模型学会了将长时间序列激活片段压缩到内部控制器中。关键的是,每个控制器执行一系列行为上具有意义的动作,这些动作在长时间尺度上展开,并伴随有学习到的终止条件,从而通过时间上的控制器组合实现对新任务的高效探索。我们展示了直接内部控制器强化学习,即“内部RL”过程,能够在标准RL微调失败的情况下从稀疏奖励中学习。我们的结果表明了自回归模型中潜在动作生成和强化学习的优势,暗示内部RL是实现基础模型中层次RL的一个有前景的方向。
Summary / 总结
The research aims to improve the efficiency of reinforcement learning in autoregressive models by enabling them to generate temporally abstract actions. The method introduces a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model, allowing for the discovery of actions that unfold over long timescales. On grid world and MuJoCo-based tasks with hierarchical structure, reinforcing these internal controllers directly ("internal RL") enables efficient exploration and learning from sparse rewards in cases where standard RL finetuning fails.
研究旨在通过使自回归模型能够生成时间上的抽象动作来提高强化学习(RL)的效率。方法是引入一个高阶的非因果序列模型,控制基础自回归模型的残差流激活。关键发现表明,这种方法使模型能够学习压缩长时间序列片段到内部控制器中,这些控制器执行具有行为意义的动作,并具有学习到的终止条件。这导致了高效的探索和从稀疏奖励中学习,特别是在具有层次结构的任务中。
Reinforcement Learning From State and Temporal Differences
Authors: Lex Weaver, Jonathan Baxter
First: 2025-12-09T17:48:28+00:00 · Latest: 2025-12-23T18:50:53+00:00
Comments: Technical Report, Department of Computer Science, Australian National University, May 1999. New version uploaded 2025 after original source taken offline.
Abstract
TD($λ$) with function approximation has proved empirically successful for some complex reinforcement learning problems. For linear approximation, TD($λ$) has been shown to minimise the squared error between the approximate value of each state and the true value. However, as far as policy is concerned, it is error in the relative ordering of states that is critical, rather than error in the state values. We illustrate this point, both in simple two-state and three-state systems in which TD($λ$), starting from an optimal policy, converges to a sub-optimal policy, and also in backgammon. We then present a modified form of TD($λ$), called STD($λ$), in which function approximators are trained with respect to relative state values on binary decision problems. A theoretical analysis, including a proof of monotonic policy improvement for STD($λ$) in the context of the two-state system, is presented, along with a comparison with Bertsekas' differential training method [1]. This is followed by successful demonstrations of STD($λ$) on the two-state system and a variation on the well-known acrobot problem.
中文标题/摘要
标题:基于状态和时间差的强化学习
TD($λ$)结合函数近似在某些复杂的强化学习问题中已被实验证明是成功的。对于线性近似,TD($λ$)已被证明可以最小化每个状态的近似值与其真实值之间的平方误差。然而,就策略而言,关键在于状态之间的相对顺序误差,而不是状态值的误差。我们通过简单的两状态和三状态系统(其中TD($λ$)从最优策略出发却收敛到次优策略)以及西洋双陆棋(backgammon)中的例子来说明这一点。然后,我们提出了一种TD($λ$)的修改形式,称为STD($λ$),其中函数逼近器是针对二元决策问题中的状态相对值进行训练的。我们给出了理论分析,包括在两状态系统中STD($λ$)单调策略改进的证明,并与Bertsekas的微分训练方法[1]进行了比较。随后,我们在两状态系统和一个著名的acrobot问题的变体上成功演示了STD($λ$)。
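For readers unfamiliar with the baseline this entry modifies, the following is a minimal sketch of standard TD($λ$) with linear function approximation and accumulating eligibility traces. It is the textbook algorithm, not the STD($λ$) variant proposed in the report, whose update rule the abstract does not spell out. The environment (a five-state random walk with reward 1 on exiting to the right), the one-hot features, and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np


def td_lambda_random_walk(n_states=5, episodes=2000, alpha=0.02,
                          lam=0.8, gamma=1.0, seed=0):
    """Estimate state values of a symmetric random walk with TD(lambda).

    States 0..n_states-1; stepping left of state 0 ends the episode with
    reward 0, stepping right of the last state ends it with reward 1.
    Features are one-hot, so the weight vector w holds one value per state.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                  # start in the middle state
        z = np.zeros(n_states)             # eligibility trace
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:
                reward, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:
                reward, v_next, done = 1.0, 0.0, True
            else:
                reward, v_next, done = 0.0, w[s_next], False
            x = np.zeros(n_states)
            x[s] = 1.0                     # one-hot feature of current state
            delta = reward + gamma * v_next - w @ x   # TD error
            z = gamma * lam * z + x        # accumulate trace
            w = w + alpha * delta * z      # linear TD(lambda) update
            if done:
                break
            s = s_next
    return w


values = td_lambda_random_walk()
true_values = np.arange(1, 6) / 6.0        # exact values for this walk
print(np.round(values, 2), np.round(true_values, 2))
```

The update drives each estimate toward the state's true value. The report's argument is that a greedy policy depends only on how states are ordered relative to one another, so small value errors can still produce a wrong ordering when the approximator cannot represent the values exactly.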
Learning Informative Attention Weights for Person Re-Identification
Authors: Yancheng Wang, Nebojsa Jojic, Yingzhen Yang
First: 2025-05-13T21:01:53+00:00 · Latest: 2025-12-23T18:50:46+00:00
Abstract
Attention mechanisms have been widely used in deep learning, and recent efforts have been devoted to incorporating attention modules into deep neural networks (DNNs) for person Re-Identification (Re-ID) to enhance their discriminative feature learning capabilities. Existing attention modules, including self-attention and channel attention, learn attention weights that quantify the importance of feature tokens or feature channels. However, existing attention methods do not explicitly ensure that the attention weights are informative for predicting the identity of the person in the input image, and may consequently introduce noisy information from the input image. To address this issue, we propose a novel method termed Reduction of Information Bottleneck loss (RIB), motivated by the principle of the Information Bottleneck (IB). A novel distribution-free and efficient variational upper bound for the IB loss (IBB), which can be optimized by standard SGD, is derived and incorporated into the training loss of the RIB models. RIB is applied to DNNs with self-attention modules through a novel Differentiable Channel Selection Attention module, or DCS-Attention, that selects the most informative channels for computing attention weights, leading to competitive models termed RIB-DCS. RIB is also incorporated into DNNs with existing channel attention modules to promote the learning of informative channel attention weights, leading to models termed RIB-CA. Both RIB-DCS and RIB-CA are applied to fixed neural network backbones and learnable backbones with Differentiable Neural Architecture Search (DNAS). Extensive experiments on multiple person Re-ID benchmarks show that RIB significantly enhances the prediction accuracy of DNNs for person Re-ID, even for occluded person Re-ID.
中文标题/摘要
标题:学习具有信息性注意力权重的人再识别
注意力机制在深度学习中得到了广泛应用,最近的研究致力于将注意力模块融入深度神经网络(DNNs)中以增强其区分特征学习能力,特别是在人再识别(Re-ID)任务中。现有的注意力模块,包括自我注意力和通道注意力,学习注意力权重以量化特征令牌或特征通道的重要性。然而,现有的注意力方法并未明确确保注意力权重对于预测输入图像中人的身份是有信息性的,可能会引入输入图像中的噪声信息。为了解决这一问题,我们提出了一种名为信息瓶颈信息减少损失(RIB)的新方法,该方法受到信息瓶颈(IB)原理的启发。我们推导出一种新的无分布且高效的IB损失的变分上界(IBB),可以使用标准SGD进行优化,并将其整合到RIB模型的训练损失中。RIB通过一种新颖的可微通道选择注意力模块(DCS-Attention)应用于具有自我注意力模块的DNNs,该模块选择用于计算注意力权重的最具有信息性的通道,从而产生了具有竞争力的模型RIB-DCS。RIB还被整合到具有现有通道注意力模块的DNNs中,以促进学习具有信息性的通道注意力权重,产生了RIB-CA模型。RIB-DCS和RIB-CA均应用于固定神经网络骨干和具有可微神经架构搜索(DNAS)的可学习骨干。在多个再识别基准上的广泛实验表明,RIB显著提高了DNNs在人再识别任务中的预测准确性,甚至对于被遮挡的人再识别也是如此。
Summary / 总结
This paper proposes a novel method called Reduction of Information Bottleneck loss (RIB) to enhance the discriminative feature learning capabilities of deep neural networks for person re-identification (Re-ID). RIB incorporates a distribution-free and efficient variational upper bound of the Information Bottleneck loss into the training loss, which is optimized by standard SGD. RIB is applied to DNNs with self-attention and channel attention modules, leading to models termed RIB-DCS and RIB-CA, respectively. Experiments on multiple Re-ID benchmarks demonstrate that RIB significantly improves prediction accuracy, even for occluded person Re-ID scenarios.
本文提出了一种名为信息瓶颈缩减损失(RIB)的新方法,通过学习更具信息量的注意力权重来提升人再识别(Re-ID)性能。RIB 应用于具有自注意力和通道注意力模块的 DNN,增强其特征学习能力。该方法产生了名为 RIB-DCS 和 RIB-CA 的竞争性模型,并在多个 Re-ID 基准测试中显示出显著的预测准确性提升,甚至对于遮挡的人再识别也有改善。
Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs
Authors: Dhruv Anand, Ehsan Shareghi
First: 2025-12-23T18:43:05+00:00 · Latest: 2025-12-23T18:43:05+00:00
Comments: 27 pages, 5 figures, 9 tables. Cube available at https://github.com/dana-23/cube-bench
Abstract
We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predicting the outcome of a candidate move without applying it, (iv) executing multi-step plans while recovering from mistakes, and (v) detecting and revising one's own errors. Using a shared set of scrambled cube states, identical prompts and parsers, and a single distance-to-solved metric, we compare recent MLLMs side by side as a function of scramble depth. Across seven MLLMs, accuracy drops sharply with depth; once a trajectory stalls or diverges, models rarely recover, and high face-reconstruction accuracy does not guarantee competent action selection or multi-step execution. A pronounced closed- vs open-source gap emerges: the strongest closed model leads on both single-step perception tasks and multi-step control tasks, while open-weight models cluster near chance on the hardest settings; yet even the best MLLM degrades at higher cube complexity. A simple self-correction via reflective thinking yields modest gains but can also introduce overthinking. Cube Bench offers a compact, reproducible probe of sequential spatial reasoning in MLLMs.
中文标题/摘要
标题:立方体基准:多模态大语言模型空间视觉推理评估基准
我们介绍了立方体基准,这是一个魔方基准,用于评估多模态大语言模型(MLLMs)的空间和序列推理能力。基准将性能分解为五个技能:(i)从图像和文本重建魔方面,(ii)选择最佳下一步,(iii)预测候选动作的结果而不执行它,(iv)执行多步计划并从错误中恢复,以及(v)检测并修正自己的错误。使用一组混乱的魔方状态,相同的提示和解析器,以及单一的解距度量,我们按混乱深度比较了最近的MLLMs。在七个MLLM中,准确性随着深度的增加而急剧下降;一旦轨迹停滞或发散,模型很少能恢复,高面重建准确性并不保证有效的动作选择或多步执行。明显的闭源与开源差距出现:最强的闭源模型在单步感知任务和多步控制任务中均领先,而开源权重模型在最困难的设置中接近随机;然而,即使是最优秀的MLLM在更高复杂度的魔方上也会退化。简单的自我纠正通过反思思考可以带来适度的收益,但也可能导致过度思考。立方体基准提供了一个紧凑且可重复的空间序列推理探针,用于MLLMs。
Summary / 总结
Cube Bench evaluates spatial and sequential reasoning in multimodal large language models (MLLMs) through a Rubik's-cube benchmark. It decomposes performance into five skills and compares seven MLLMs using a shared set of scrambled cube states. Accuracy drops sharply with increasing complexity, and high face-reconstruction accuracy does not ensure competent action selection or multi-step execution. A significant gap is observed between closed-source and open-weight models, with closed-source models performing better on both single-step and multi-step tasks. Simple self-correction via reflective thinking yields modest gains but can also lead to overthinking. Cube Bench provides a compact and reproducible way to probe sequential spatial reasoning in MLLMs.
Cube Bench 通过一个魔方基准测试多模态大型语言模型(MLLMs)的空间和序列推理能力,将性能分解为五个技能。该基准比较了七种 MLLMs 的准确性,显示复杂度增加时性能急剧下降,并突显了闭源和开源模型之间的重要差距。即使是最优秀的 MLLM 也难以应对更高的魔方复杂度,而通过反思进行的自我纠正只能带来微小的改进。
Leveraging High-Fidelity Digital Models and Reinforcement Learning for Mission Engineering: A Case Study of Aerial Firefighting Under Perfect Information
Authors: İbrahim Oğuz Çetinkaya, Sajad Khodadadian, Taylan G. Topçu
First: 2025-12-23T18:36:07+00:00 · Latest: 2025-12-23T18:36:07+00:00
Abstract
As systems engineering (SE) objectives evolve from the design and operation of monolithic systems to complex System of Systems (SoS), the discipline of Mission Engineering (ME) has emerged and is increasingly accepted as a new line of thinking for the SE community. Moreover, mission environments are uncertain and dynamic, and mission outcomes are a direct function of how the mission assets interact with this environment. This renders static architectures brittle and calls for analytically rigorous approaches to ME. To that end, this paper proposes an intelligent mission coordination methodology that integrates digital mission models with Reinforcement Learning (RL) and specifically addresses the need for adaptive task allocation and reconfiguration. More specifically, we leverage a Digital Engineering (DE) based infrastructure composed of a high-fidelity digital mission model and agent-based simulation; we then formulate the mission tactics management problem as a Markov Decision Process (MDP) and employ an RL agent trained via Proximal Policy Optimization. By leveraging the simulation as a sandbox, we map the system states to actions, refining the policy based on realized mission outcomes. The utility of the RL-based intelligent mission coordinator is demonstrated through an aerial firefighting case study. Our findings indicate that the RL-based intelligent mission coordinator not only surpasses baseline performance but also significantly reduces the variability in mission performance. Thus, this study serves as a proof of concept demonstrating that DE-enabled mission simulations combined with advanced analytical tools offer a mission-agnostic framework for improving ME practice, one that can be extended in the future to more complicated fleet design and selection problems from a mission-first perspective.
中文标题/摘要
标题:利用高保真数字模型和强化学习进行任务工程:在完美信息下的空中灭火案例研究
随着系统工程(SE)目标从单一系统的设计和运行转变为复杂的系统集合体(System of Systems, SoS),任务工程(Mission Engineering, ME)这一学科已经出现,并逐渐被SE社区接受为一种新的思维方式。此外,任务环境是不确定的、动态的,任务结果直接取决于任务资产如何与环境互动。这表明静态架构是脆弱的,并需要对ME采取分析上严谨的方法。为此,本文提出了一种智能任务协调方法,该方法将数字任务模型与强化学习(Reinforcement Learning, RL)相结合,特别针对自适应任务分配和重新配置的需求。具体而言,我们利用基于数字工程(Digital Engineering, DE)的基础设施,该基础设施由高保真数字任务模型和基于代理的模拟组成;然后将任务战术管理问题形式化为马尔可夫决策过程(Markov Decision Process, MDP),并采用通过近端策略优化训练的RL代理。通过利用模拟作为沙盒,我们将系统状态映射到行动,并根据实现的任务结果来优化策略。通过空中灭火案例研究展示了基于RL的智能任务协调器的实用性。我们的研究结果表明,基于RL的智能任务协调器不仅超越了基线性能,还显著减少了任务性能的变异性。因此,这项研究作为概念验证,证明了DE驱动的任务模拟结合先进的分析工具为ME实践提供了一种任务无关的框架;从任务优先的角度出发,未来可以将其扩展到更复杂的舰队设计和选择问题。
Summary / 总结
This paper proposes an intelligent mission coordination methodology combining high-fidelity digital mission models and Reinforcement Learning (RL) to address adaptive task allocation and reconfiguration in complex mission environments. The methodology formulates the mission tactics management problem as a Markov Decision Process (MDP) and uses Proximal Policy Optimization to train an RL agent. The approach is validated through an aerial firefighting case study, showing that the RL-based intelligent mission coordinator outperforms baseline methods and reduces mission performance variability.
本文提出了一种将高保真数字任务模型与强化学习(RL)相结合的智能任务协调方法,以应对复杂任务环境中的自适应任务分配和重新配置。该方法利用数字工程(DE)基础设施,并将任务战术管理问题形式化为马尔可夫决策过程(MDP)。使用近端策略优化(PPO)训练RL代理,并根据任务结果优化策略。通过空中灭火案例研究展示了该方法的实用性,表明基于RL的智能任务协调不仅超越了基线方法,还显著降低了任务性能的变异性。
Performative Policy Gradient: Optimality in Performative Reinforcement Learning
Authors: Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee
First: 2025-12-23T18:20:06+00:00 · Latest: 2025-12-23T18:20:06+00:00
Abstract
Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics, an effect that standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, both with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends prior works in Performative RL that achieve performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validates that PePG outperforms standard policy gradient algorithms and the existing performative RL algorithms aiming for stability.
中文标题/摘要
标题:表演性策略梯度:表演性强化学习中的最优性
部署后的机器学习算法通常会对其作用的环境产生影响,从而改变标准强化学习(RL)方法忽略的基本动力学。虽然在监督学习中已经研究了在这种表演性设置下设计最优算法的问题,但RL的对应问题仍然未被充分探索。在本文中,我们证明了RL中的表演性版本的性能差异引理和策略梯度定理,并进一步引入了表演性策略梯度算法(PePG)。PePG是第一个专门设计来考虑RL中表演性问题的策略梯度算法。在softmax参数化下,以及在有和没有熵正则化的情况下,我们证明PePG收敛到表演性最优策略,即在由自身引起的分布变化下仍然保持最优的策略。因此,PePG显著扩展了在表演性RL中仅实现稳定性而未实现最优性的先前工作。此外,我们在标准的表演性RL环境中进行的实证分析表明,PePG在性能上优于标准策略梯度算法和旨在实现稳定性的现有表演性RL算法。
Summary / 总结
This paper addresses the setting in which deployed machine learning algorithms influence the environments they act in, known as performative reinforcement learning. It introduces the Performative Policy Gradient (PePG) algorithm, which is the first policy gradient method designed to account for performativity. The authors prove that under softmax parametrisation, with and without entropy regularisation, PePG converges to performatively optimal policies, meaning the policies remain optimal under the distribution shifts they themselves induce. Empirical results show that PePG outperforms standard policy gradient algorithms and other performative RL algorithms that focus on stability.
本文探讨了机器学习算法在其作用的环境中产生影响的问题,即执行性强化学习。提出了执行性策略梯度(PePG)算法,这是第一个设计用于考虑执行性影响的策略梯度方法。作者证明,在某些条件下,PePG能够收敛到执行性最优策略,即这些策略在自身引起的分布变化下仍然保持最优。实证分析表明,PePG在标准执行性强化学习环境中优于标准策略梯度算法和其他专注于稳定性的执行性强化学习算法。
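The difference between performative stability and performative optimality can be made concrete with a toy example. The sketch below is not PePG and not from the paper: it is a hypothetical two-action bandit in which rewards drop as an action becomes more popular (`BASE` and `C` are made-up constants). A softmax policy is trained once with a naive gradient that treats rewards as fixed, and once with a gradient that also differentiates through the policy's effect on the rewards.

```python
import numpy as np

BASE = np.array([1.0, 0.5])  # hypothetical base rewards of the two actions
C = 1.0                      # hypothetical strength of the policy's effect on rewards

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def rewards(pi):
    # The environment reacts to the deployed policy: popular actions pay less.
    return BASE - C * pi

def value(pi):
    # Expected reward of policy pi in the environment that pi itself induces.
    return float(pi @ rewards(pi))

def ascend(theta, performative, lr=0.5, steps=5000):
    theta = np.asarray(theta, dtype=float)
    for _ in range(steps):
        pi = softmax(theta)
        r = rewards(pi)
        # Gradient of the value with respect to the action probabilities.
        # Naive: rewards treated as constants. Performative: d r(pi) / d pi included.
        g_pi = r - C * pi if performative else r
        # Chain rule through the softmax: dJ/dtheta_b = pi_b * (g_b - pi . g)
        theta = theta + lr * pi * (g_pi - pi @ g_pi)
    return softmax(theta)
```

In this toy problem the naive update settles at a stable point where both actions pay the same given the deployed policy, while the performative update reaches a policy with strictly higher value in its own induced environment.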
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
Authors: Rui Pan, Zhuofu Chen, Ravi Netravali
First: 2025-12-23T18:16:58+00:00 · Latest: 2025-12-23T18:16:58+00:00
Abstract
Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
中文标题/摘要
标题:快速失败,赢得更大:通过扩散大语言模型重新思考推测性解码的起草策略
扩散大语言模型(dLLMs)提供快速并行的标记生成,但它们单独使用时存在固有的效率与质量权衡。我们表明,如果谨慎应用,dLLMs 的属性实际上可以成为起草者在自回归(AR)验证器辅助下的推测性解码中的优势。我们的核心见解是,dLLM 的并行解码速度大大降低了昂贵拒绝的风险,提供了一种实用机制,以有效实现(难以捉摸的)长篇草案,这些草案在推测性解码中能带来大量加速。我们提出了 FailFast,这是一种基于 dLLM 的推测性解码框架,通过动态调整其推测长度来实现这一方法。它“快速失败”通过在难以推测的区域花费最少的计算资源来缩短推测延迟,并在容易推测的区域“赢得更大”通过积极扩展草案长度来减少验证延迟(在许多情况下,一次推测并接受 70 个标记!)。无需任何微调,FailFast 为 AR LLM 提供无损加速,并在多种模型和工作负载上分别实现了高达 4.9 倍、1.7 倍和 1.4 倍的速度提升。我们已在 https://github.com/ruipeterpan/failfast 开源了 FailFast。
Summary / 总结
The paper shows that diffusion large language models (dLLMs), whose standalone use suffers from an efficiency-quality tradeoff, can serve as effective drafters for speculative decoding with autoregressive (AR) verifiers. It proposes FailFast, a dLLM-based framework that dynamically adapts the speculation length, spending minimal compute in hard-to-speculate regions and extending drafts aggressively in easier ones. Without fine-tuning, FailFast delivers lossless acceleration of AR LLMs, with up to 4.9 times speedup over vanilla decoding, 1.7 times over the best naive dLLM drafter, and 1.4 times over EAGLE-3 across diverse models and workloads.
论文解决了使用扩散大型语言模型(dLLMs)进行推测性解码时的效率与质量权衡问题。它提出了FailFast框架,该框架动态调整推测长度以减少昂贵的拒绝并最大化草稿长度。这种方法实现了显著的加速,最高可达4.9倍比传统解码快,1.7倍比最佳朴素dLLM草稿者快,适用于各种模型和工作负载。
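The idea of adapting the speculation length can be sketched with a toy greedy speculative decoder. This is a minimal sketch under stated assumptions, not FailFast itself: the models are stand-in functions that map a token list to the next token, and the doubling/halving rule for the draft length `k` is a hypothetical heuristic chosen for illustration. The abstract does not describe the actual adaptation rule.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[List[int]], int]  # toy stand-in for a greedy model call

def speculative_decode(target: NextToken, draft: NextToken, prompt: Sequence[int],
                       n_new: int, k_min: int = 1, k_max: int = 64) -> Tuple[List[int], int]:
    """Generate n_new tokens; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(out) + n_new
    k = k_min
    verify_rounds = 0
    while len(out) < goal:
        k_now = min(k, goal - len(out))
        # Draft k_now tokens with the cheap drafter.
        proposal: List[int] = []
        for _ in range(k_now):
            proposal.append(draft(out + proposal))
        # Verify. A real AR verifier would score all positions in one batched
        # pass; this toy loop queries the target position by position.
        verify_rounds += 1
        accepted = 0
        correction = None
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                correction = expected
                break
            accepted += 1
        out.extend(proposal[:accepted])
        if correction is None:
            k = min(2 * k, k_max)   # easy region: speculate further next round
        else:
            out.append(correction)  # keep the verifier's token, so output is lossless
            k = max(k_min, k // 2)  # hard region: fail fast with a shorter draft
    return out, verify_rounds
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target, while the number of verification rounds shrinks wherever drafts are accepted.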
Similarity Field Theory: A Mathematical Framework for Intelligence
Authors: Kei-Sing Ng
First: 2025-09-21T22:34:00+00:00 · Latest: 2025-12-23T18:09:51+00:00
Abstract
We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p=(X_p,S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_\alpha(K)=\{E\in U \mid S(E,K)\ge \alpha\}$, i.e., superlevel sets of the unary map $S_K(E):=S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. At a high level, this framework reframes intelligence and interpretability as geometric problems on similarity fields--preserving and composing level-set fibers--rather than purely statistical ones. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability implies either an anchor coordinate or asymptotic confinement to the target level (up to arbitrarily small tolerance). Together, these results constrain similarity-field evolution and motivate an interpretive lens that can be applied to large language models.
Summary / 总结
The research introduces Similarity Field Theory, a mathematical framework that formalizes the principles of similarity relations among entities and their evolution. It defines a similarity field $S$ over a universe of entities, and concepts as entities that induce superlevel sets. The theory formalizes intelligence as an operator that generates new entities belonging to the same concept's fiber. Key findings include theorems that constrain similarity-field evolution and suggest an interpretive lens for understanding intelligence and interpretability as geometric problems on similarity fields rather than purely statistical ones.
研究旨在通过相似关系提供一个动态系统的结构基础的数学框架。论文引入了相似场理论,定义了实体集上的相似场,并形式化了系统的演化通过一系列状态。关键发现包括两个定理:一是不对称性阻止相互包含;二是稳定性导致要么有锚坐标,要么渐近收敛到目标水平(误差可任意小)。这种框架将智能和可解释性重新定义为相似场上的几何问题,为刻画和构建智能系统提供了新的视角。
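The definitions in the abstract (a directed similarity field, the fiber of a concept, and the generative notion of intelligence) can be mirrored in a few lines of code. The sketch below is only an illustration: the character-overlap similarity and all function names are assumptions invented for this example, and the final check tests a single generation step rather than the paper's formal definition.

```python
from typing import Callable, Iterable, Set

Similarity = Callable[[str, str], float]

def char_overlap(e: str, k: str) -> float:
    """Toy directed similarity: share of e's distinct characters that occur in k.

    Reflexive (S(E, E) = 1) and asymmetric in general, as the abstract allows.
    """
    chars = set(e)
    return len(chars & set(k)) / len(chars) if chars else 1.0

def fiber(universe: Iterable[str], concept: str, alpha: float, sim: Similarity) -> Set[str]:
    """Superlevel set F_alpha(K) = {E in U | S(E, K) >= alpha}."""
    return {e for e in universe if sim(e, concept) >= alpha}

def stays_in_fiber(generate: Callable[[Set[str]], Set[str]], system: Iterable[str],
                   concept: str, alpha: float, sim: Similarity) -> bool:
    """Check one step: G is fed the members of the fiber and must only emit fiber members."""
    seeds = fiber(system, concept, alpha, sim)
    if not seeds:
        return False  # the definition presupposes a system with entities in the fiber
    return all(sim(e, concept) >= alpha for e in generate(seeds))
```

For instance, with the concept `"cart"` and `alpha = 0.9`, the entity `"cat"` lies in the fiber while `"dog"` does not, and a generator that only recombines letters of its seeds stays inside the fiber.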
LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
Authors: Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, Kashyap Chitta
First: 2025-12-23T18:07:43+00:00 · Latest: 2025-12-23T18:07:43+00:00
Abstract
Simulators can generate virtually unlimited driving data, yet imitation learning policies in simulation still struggle to achieve robust closed-loop performance. Motivated by this gap, we empirically study how misalignment between privileged expert demonstrations and sensor-based student observations can limit the effectiveness of imitation learning. More precisely, experts have significantly higher visibility (e.g., ignoring occlusions) and far lower uncertainty (e.g., knowing other vehicles' actions), making them difficult to imitate reliably. Furthermore, navigational intent (i.e., the route to follow) is under-specified in student models at test time via only a single target point. We demonstrate that these asymmetries can measurably limit driving performance in CARLA and offer practical interventions to address them. After careful modifications to narrow the gaps between expert and student, our TransFuser v6 (TFv6) student policy achieves a new state of the art on all major publicly available CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior performances on Longest6 v2 and Town13. Additionally, by integrating perception supervision from our dataset into a shared sim-to-real pipeline, we show consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks. Our code, data, and models are publicly available at https://github.com/autonomousvision/lead.
中文标题/摘要
标题:LEAD: 减少驾驶者与专家之间的知识不对称
模拟器可以生成几乎无限的驾驶数据,但在模拟中,模仿学习策略仍然难以实现稳健的闭环性能。受此差距的启发,我们实证研究了特权专家演示与基于传感器的学生观察之间的不一致如何限制模仿学习的有效性。具体而言,专家具有显著更高的可见性(例如,忽略遮挡)和更低的不确定性(例如,知道其他车辆的动作),这使得它们难以可靠地模仿。此外,学生模型在测试时仅通过一个目标点来指定导航意图(即,要遵循的路线)。我们证明这些不对称性会显著限制驾驶性能,并提供了实际干预措施来解决这些问题。经过仔细修改以缩小专家与学生之间的差距后,我们的TransFuser v6 (TFv6) 学生策略在所有主要的公开CARLA闭环基准测试中达到了新的最佳状态,在Bench2Drive上达到95 DS,并且在Longest6~v2和Town13上的表现比之前提高了两倍以上。此外,通过将我们数据集中的感知监督整合到共享的模拟到现实的管道中,我们在NAVSIM和Waymo基于视觉的端到端驾驶基准测试中展示了持续的收益。我们的代码、数据和模型可在https://github.com/autonomousvision/lead/ 获取。
Summary / 总结
The research studies why imitation learning policies in driving simulators struggle to achieve robust closed-loop performance. It identifies asymmetries between privileged expert demonstrations and sensor-based student observations, particularly in visibility, uncertainty, and under-specified navigational intent, and proposes practical interventions to mitigate them. By narrowing these gaps, the TransFuser v6 (TFv6) student policy achieves new state-of-the-art results on major CARLA closed-loop benchmarks, reaching 95 DS on Bench2Drive and more than doubling prior results on Longest6 v2 and Town13. Additionally, integrating perception supervision from the dataset into a shared sim-to-real pipeline yields consistent gains on the NAVSIM and Waymo Vision-Based End-to-End driving benchmarks.
该研究解决了专家演示与传感器观测之间的差距问题,这限制了模仿学习的有效性。通过缩小视野和不确定性差距,研究人员改进了驾驶策略,实现了在主要CARLA基准测试上的最新性能,并在某些任务上将先前结果翻倍。该研究还通过集成感知监督,在模拟到现实的基准测试中展示了持续的性能提升。
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang
First: 2025-12-23T18:05:43+00:00 · Latest: 2025-12-23T18:05:43+00:00
Comments: Under submission
Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment.
We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context.
Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
中文标题/摘要
标题:FlashVLM:文本引导的视觉标记选择框架用于大型多模态模型
大型视觉-语言模型(VLMs)通常处理每张图像或视频帧数百或数千个视觉标记,导致二次注意成本和大量冗余。现有的标记减少方法往往忽视了文本查询或依赖于深度注意图,这些图在激进剪枝下的不稳定性导致语义对齐下降。
我们提出了FlashVLM,一种文本引导的视觉标记选择框架,能够动态适应查询的视觉输入。FlashVLM 不依赖于嘈杂的注意权重,而是计算投影图像标记与语言模型空间中归一化文本嵌入之间的显式跨模态相似性。这种外在的相关性与内在的视觉显著性通过对数域加权和温度控制锐化进行融合。此外,保留一个最小但具有代表性的背景标记集以保持全局上下文的多样性。
在相同的标记预算和评估协议下,FlashVLM 实现了超越无损压缩的效果,即使在对LLaVA 1.5剪枝高达77.8%的情况下,仍略优于未剪枝基线,同时保持92.8%的准确率,即使在高达94.4%的压缩下也是如此。在14个图像和视频基准上的广泛实验表明,FlashVLM 在保持强大鲁棒性和泛化能力的同时,提供了最先进的效率性能权衡。
Summary / 总结
FlashVLM is a text-guided visual token selection framework that dynamically adapts visual inputs to textual queries. It computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings, fuses it with intrinsic visual saliency, and retains a minimal yet representative set of background tokens. On LLaVA 1.5, it slightly surpasses the unpruned baseline while pruning up to 77.8 percent of visual tokens, and it maintains 92.8 percent accuracy even under 94.4 percent compression. Experiments on 14 image and video benchmarks show strong robustness and generalization across mainstream VLMs.
FlashVLM 是一种文本引导的视觉标记选择框架,通过计算显式的跨模态相似性并将其与内在的视觉显著性融合,动态适应视觉输入到文本查询。它实现了超越无损压缩,即使在压缩高达 94.4% 的情况下,准确率仍保持在 92.8%。在 14 个图像和视频基准上的广泛实验表明,FlashVLM 提供了最先进的效率性能权衡,并具有强大的鲁棒性和泛化能力。
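The selection step described in the abstract (text relevance fused with visual saliency in the log domain, then sharpened with a temperature) can be sketched roughly as follows. The exact formulas are not given in the abstract, so everything here is an assumption for illustration: the max-over-text-tokens cosine relevance, the weight `w`, the softmax sharpening, and the function name. The diversity-preserving background partition is omitted.

```python
import numpy as np

def select_tokens(img_tokens, text_emb, saliency, keep, w=0.5, temperature=0.5):
    """Return sorted indices of the `keep` visual tokens with the highest fused score.

    img_tokens: (N, D) projected image tokens; text_emb: (T, D) text embeddings;
    saliency: (N,) positive visual saliency scores. w and temperature are
    hypothetical hyperparameters.
    """
    img = img_tokens / np.linalg.norm(img_tokens, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Extrinsic relevance: best cosine similarity to any text token, mapped to [0, 1].
    relevance = ((img @ txt.T).max(axis=1) + 1.0) / 2.0
    eps = 1e-8
    # Log-domain weighted fusion of relevance and saliency.
    log_score = w * np.log(relevance + eps) + (1.0 - w) * np.log(np.asarray(saliency) + eps)
    # Temperature-controlled sharpening via a softmax. It does not change the
    # top-k ranking; it only matters if the scores are thresholded or sampled.
    z = log_score / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return np.sort(np.argsort(-probs)[:keep])
```

A lower temperature concentrates the score mass on fewer tokens, which is one plausible reading of "sharpening"; the paper may define it differently.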
Compute-in-Memory Implementation of State Space Models for Event Sequence Processing
Authors: Xiaoyu Zhang, Mingtao Hu, Sen Lu, Soohyeon Kim, Eric Yeu-Jer Lee, Yuyang Liu, Wei D. Lu
First: 2025-11-17T21:06:52+00:00 · Latest: 2025-12-23T18:00:52+00:00
Comments: Xiaoyu Zhang and Mingtao Hu contributed equally to this work
Abstract
State space models (SSMs) have recently emerged as a powerful framework for long sequence processing, outperforming traditional methods on diverse benchmarks. Fundamentally, SSMs can generalize both recurrent and convolutional networks and have been shown to even capture key functions of biological systems. Here we report an approach to implement SSMs in energy-efficient compute-in-memory (CIM) hardware to achieve real-time, event-driven processing. Our work re-parameterizes the model to function with real-valued coefficients and shared decay constants, reducing the complexity of model mapping onto practical hardware systems. By leveraging device dynamics and diagonalized state transition parameters, the state evolution can be natively implemented in crossbar-based CIM systems combined with memristors exhibiting short-term memory effects. Through this algorithm and hardware co-design, we show the proposed system offers both high accuracy and high energy efficiency while supporting fully asynchronous processing for event-based vision and audio tasks.
中文标题/摘要
标题:内存计算中状态空间模型的实现及其在事件序列处理中的应用
状态空间模型(SSMs)最近已成为长序列处理的强大框架,超越了传统方法在多种基准测试中的表现。从根本上说,SSMs可以泛化为递归和卷积网络,并已被证明能够捕捉生物系统的关键功能。在这里,我们报告了一种在节能的内存计算(CIM)硬件中实现SSMs的方法,以实现实时、事件驱动的处理。我们的工作重新参数化了模型,使其能够使用实数值系数和共享衰减常数运行,从而降低了模型映射到实际硬件系统的复杂性。通过利用设备动力学和对角化状态转换参数,状态演化可以在基于交叉电极的CIM系统中与表现出短期记忆效应的忆阻器原生实现。通过这种算法和硬件协同设计,我们展示了所提出系统在保持高准确性和高能效的同时,支持完全异步处理事件驱动的视觉和音频任务。
Summary / 总结
The research aims to enhance the efficiency of state space models (SSMs) for processing long event sequences by implementing them in energy-efficient compute-in-memory (CIM) hardware. The method involves re-parameterizing SSMs to use real-valued coefficients and shared decay constants, which simplifies their mapping onto practical hardware. Key experimental findings show that the proposed system achieves high accuracy and energy efficiency, supporting real-time, asynchronous processing for event-based vision and audio tasks.
该研究旨在通过在能量高效的计算即存储(CIM)硬件中实现状态空间模型(SSMs),提高其事件序列处理的效率和实时性。方法包括重新参数化SSMs,使用实数值系数和共享衰减常数,简化其在实际硬件上的映射。关键实验结果表明,所提出系统实现了高准确性和高能效,并支持事件驱动的视觉和音频任务的完全异步处理。
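The event-driven state update that the abstract alludes to can be illustrated with a small diagonal state space model in which the state decays exponentially between events and is only updated when an event arrives. This is a minimal software sketch under assumptions: a single shared time constant `tau`, real-valued matrices `B` and `C`, and the function name are all chosen for the example, and nothing here models the actual memristor hardware.

```python
import numpy as np

def run_event_ssm(events, B, C, tau):
    """Process (timestamp, input_vector) events sorted by time.

    Between events the state decays as exp(-dt / tau), with one shared decay
    constant (an assumption); at each event it receives B @ x.
    Returns the outputs C @ h, one row per event.
    """
    h = np.zeros(B.shape[0])
    t_prev = None
    outputs = []
    for t, x in events:
        if t_prev is not None:
            # Diagonal state transition: every state element decays independently.
            h = h * np.exp(-(t - t_prev) / tau)
        h = h + B @ np.asarray(x, dtype=float)
        outputs.append(C @ h)
        t_prev = t
    return np.array(outputs)
```

Because no computation happens between events, the cost scales with the number of events rather than with elapsed time, which is the property that makes asynchronous processing attractive.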
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
First: 2025-12-23T17:56:36+00:00 · Latest: 2025-12-23T17:56:36+00:00
Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
中文标题/摘要
标题:学习在四维中推理:视觉语言模型的动态空间理解
视觉语言模型(VLM)在一般理解方面表现出色,但在动态空间推理(DSR),即在三维空间中随时间推移对物体几何形状和关系进行推理方面仍然较弱,这主要是由于缺乏可扩展的四维感知训练资源。为了在数据集、基准和模型的各个方面弥合这一差距,我们引入了DSR套件。首先,我们提出了一种自动流水线,从野外视频中生成DSR的多项选择题-答案对。通过利用现代视觉基础模型,该流水线提取了丰富的几何和运动信息,包括相机姿态、局部点云、物体掩码、方向和三维轨迹。这些几何线索使得DSR-Train的构建成为可能,并进一步构建了DSR-Bench用于评估。与以往工作相比,我们的数据强调(i)野外视频来源,(ii)物体和场景级别的三维要求,(iii)视角变换,(iv)多物体交互,以及(v)细粒度、程序化的答案。除了数据,我们还提出了一种轻量级的几何选择模块(GSM),以无缝地将几何先验整合到VLM中,该模块浓缩了问题语义,并从预训练的四维重建先验中提取与问题相关的信息,形成一组紧凑的几何标记。这种有针对性的提取避免了向模型灌输无关知识。实验表明,将DSR-Train和GSM集成到Qwen2.5-VL-7B中显著增强了其动态空间推理能力,同时在通用视频理解基准测试中保持了准确性。
Summary / 总结
The research aims to improve vision-language models' ability to reason about dynamic spatial relationships over time by addressing the scarcity of 4D-aware training resources. The method involves an automated pipeline that generates question-answer pairs from in-the-wild videos, extracting geometric and motion information. Key findings show that integrating this data and a Geometry Selection Module into Qwen2.5-VL-7B enhances its dynamic spatial reasoning capability without compromising general video understanding accuracy.
论文通过引入DSR Suite,从野生视频中生成问题-答案对,解决了视觉-语言模型(VLM)在动态空间推理(DSR)方面的局限性。该套件提取几何和运动信息,并构建用于训练和评估的数据集。实验表明,将这些数据集和几何选择模块(GSM)集成到Qwen2.5-VL-7B中,可以提高其动态空间推理能力,同时保持一般视频理解的准确性。
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li
Venue: WACV 2026
First: 2025-12-23T17:55:35+00:00 · Latest: 2025-12-23T17:55:35+00:00
Comments: Accepted to WACV 2026
Abstract
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
中文标题/摘要
标题:多粒度文本引导图像融合以应对多曝光和多焦距场景
图像融合旨在从在具有挑战性条件下拍摄的一对输入中合成一张高质量的图像,例如不同的曝光水平或焦距深度。核心挑战在于有效处理输入之间的动态范围和焦距深度差异。随着视觉语言模型的发展,最近的方法将文本描述作为辅助指导以提高融合质量。然而,简单地引入粗粒度描述会阻碍对细粒度细节的理解,并且对跨模态对齐提出了挑战。为了解决这些限制,我们提出了多粒度文本引导图像融合(MTIF),这是一种具有三个关键设计的新型融合范式。首先,它引入了多粒度的文本描述,分别捕捉细粒度细节、结构线索和语义内容,并通过分层跨模态调制模块引导图像融合。其次,它在每个粒度级别引入监督信号,以促进视觉和文本特征之间的对齐并增强辅助文本的实用性。第三,它采用了一种基于显著性的增强模块,通过密集的语义内容增强训练数据,进一步加强跨模态调制和对齐。广泛的实验表明,MTIF在多曝光和多焦距图像融合任务中始终优于先前的方法。
Summary / 总结
The paper addresses the challenge of image fusion under challenging conditions such as varying exposure and focus depths. It proposes Multi-grained Text-guided Image Fusion (MTIF), which uses multi-grained textual descriptions to guide image fusion through a hierarchical cross-modal modulation module. MTIF also includes supervision signals at each granularity and a saliency-driven enrichment module to enhance cross-modal alignment. Experiments demonstrate that MTIF outperforms previous methods in both multi-exposure and multi-focus image fusion tasks.
论文提出了一种多粒度文本引导图像融合方法(MTIF),以解决在不同曝光和焦距条件下的图像融合问题。MTIF利用多粒度的文本描述来引导融合过程,包括层次化的跨模态调制模块、每个粒度级别的监督信号以及基于显著性的增强模块。实验表明,MTIF在多曝光和多焦点场景中均优于现有方法。
Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
Authors: Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
First: 2025-10-08T16:25:04+00:00 · Latest: 2025-12-23T17:45:29+00:00
Abstract
Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in the pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
中文标题/摘要
标题:分辨率缩放决定了DINOv3在胸部X光分类中的迁移性能
自我监督学习(SSL)推进了视觉表示学习,但在胸部X光成像中,这种技术的价值尚不清楚,胸部X光是一种高容量成像模式,具有细微的发现特征。Meta的DINOv3通过Gram锚定自我蒸馏扩展了早期的SSL模型。这些设计选择是否能改善胸部X光的迁移学习尚未系统测试。我们对比了DINOv3与DINOv2和ImageNet初始化在七个数据集(>814,000)上的表现。评估了两种代表性的骨干网络:ViT-B/16和ConvNeXt-B。图像分别在224x224、512x512和1024x1024像素下进行分析。我们还评估了7B模型的冻结特征。主要结果是标签的平均AUROC。在224x224分辨率下,DINOv3和DINOv2在成人数据集上表现相当。将分辨率提高到512x512时,DINOv3在两个模型上都表现出了持续的改进,而ImageNet则没有。相比之下,儿科队列的结果在不同初始化下没有差异。在所有设置中,ConvNeXt-B优于ViT-B/16。使用冻结DINOv3-7B特征的模型在性能上不如完全微调的86-89M参数的骨干网络,突显了领域适应的重要性。将分辨率扩展到1024x1024没有进一步提高准确性。分辨率相关的收益在边界依赖性和小焦点异常中最为明显。在胸部X光成像中,更高的输入分辨率对于利用现代自我监督模型的优势至关重要。512x512像素代表了一个实际的上限,在此分辨率下,DINOv3初始化的ConvNeXt-B网络提供了最强的性能,而更大的输入则几乎没有成本效益。临床应用中,这些发现支持在512x512分辨率下使用微调的中型骨干网络进行胸部X光解读,特别是在检测细微或边界中心的病变方面,这些病变对急诊和重症监护具有重要意义。
Summary / 总结
This study evaluates the transfer learning performance of DINOv3 in chest radiograph classification, comparing it with DINOv2 and ImageNet initialization across various resolutions and backbone architectures. At 224x224 pixels, DINOv3 and DINOv2 showed comparable performance on adult datasets, but DINOv3 outperformed both alternatives at 512x512 pixels. ConvNeXt-B outperformed ViT-B/16 across all settings. Frozen DINOv3-7B features underperformed compared to fully finetuned backbones, indicating the need for domain adaptation. Increasing resolution beyond 512x512 did not further improve accuracy, and the pediatric cohort showed no differences across initializations. The study highlights the importance of input resolution for leveraging modern self-supervised models in chest radiography, with 512x512 pixels being a practical upper limit at which DINOv3-initialized ConvNeXt-B networks perform best.
研究评估了Meta的DINOv3在胸部X光分类中的迁移学习性能,将其与DINOv2和ImageNet初始化在七个数据集上进行比较。研究发现,将图像分辨率从224x224提高到512x512可以持续改善DINOv3的表现,而在儿科数据集中没有观察到这种改进。ConvNeXt-B在所有设置中均优于ViT-B/16。研究强调了分辨率对于利用自监督学习模型的重要性,512x512像素是胸部X光中实现最佳性能的实用上限。
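The primary outcome above is the mean AUROC across labels. As a rough illustration of how such a metric can be computed for a multi-label classifier, the sketch below uses the rank-based (Mann-Whitney) formulation and an unweighted mean over labels. The averaging details and the handling of single-class labels are assumptions made here, since the abstract does not specify them, and the function names are hypothetical.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count one half)."""
    y = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y], s[~y]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when only one class is present
    # Pairwise comparison; quadratic in memory, so only suitable for small examples.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def mean_auroc(Y, S):
    """Unweighted mean of per-label AUROCs for label matrix Y and score matrix S (both N x L)."""
    Y = np.asarray(Y)
    S = np.asarray(S)
    per_label = [auroc(Y[:, j], S[:, j]) for j in range(Y.shape[1])]
    return float(np.nanmean(per_label))  # labels with a single class are skipped
```

For datasets of the size reported in the abstract, a sorting-based implementation would be preferable to the pairwise comparison used here.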
Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model
Authors: Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing
First: 2025-12-23T17:42:16+00:00 · Latest: 2025-12-23T17:42:16+00:00
Abstract
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to their performative nature and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
中文标题/摘要
标题:推进多模态教师情感分析:T-MED数据集与有效的AAM-TSA模型
教师的情感状态在教育场景中至关重要,深刻影响着教学效果、学生参与度和学习成就。然而,现有研究往往由于表演性特征而未能准确捕捉教师的情感,并且忽视了教学信息对情感表达的关键影响。在本文中,我们系统地研究了教师情感分析,相应地构建了数据集和模型。我们构建了首个大规模教师多模态情感分析数据集T-MED。为了确保标注的准确性和效率,我们采用了人机协作标注过程。T-MED数据集包含来自11个学科的14,938个教师情感数据实例,涵盖从K-12到高等教育的250个真实教室,整合了多模态文本、音频、视频和教学信息。此外,我们提出了一种新颖的非对称注意力机制多模态教师情感分析模型AAM-TSA。AAM-TSA引入了非对称注意力机制和分层门控单元,以实现跨模态特征的差异化融合和精确的情感分类。实验结果表明,AAM-TSA在T-MED数据集上的准确性和可解释性显著优于现有最先进的方法。
Summary / 总结
This paper addresses the importance of teachers' emotional states in educational settings by developing the T-MED dataset and the AAM-TSA model. T-MED is a large-scale multimodal dataset including text, audio, video, and instructional information from 14,938 teacher emotional instances across 250 classrooms. The AAM-TSA model uses an asymmetric attention mechanism and hierarchical gating units to improve cross-modal feature fusion and emotional classification, achieving better accuracy and interpretability compared to existing methods.
本文通过构建T-MED数据集和AAM-TSA模型,研究教师情感状态在教育中的重要性。T-MED是一个大规模的多模态数据集,包含来自250个教室的14,938个教师情感数据实例,整合了文本、音频、视频和教学信息。AAM-TSA模型使用不对称注意力机制和层次门控单元,以提高跨模态特征融合和情感分类的准确性,在T-MED数据集上优于现有方法。
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Authors: Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, Xuyan Chi, Jian Cong, Jing Cui, Qinpeng Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chengjian Feng, Han Feng, Mingyuan Gao, Yu Gao, Dong Guo, Qiushan Guo, Boyang Hao, Qingkai Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Donglei Ji, Siqi Jiang, Wei Jiang, Yunpu Jiang, Zhuo Jiang, Ashley Kim, Jianan Kong, Zhichao Lai, Shanshan Lao, Yichong Leng, Ai Li, Feiya Li, Gen Li, Huixia Li, JiaShi Li, Liang Li, Ming Li, Shanshan Li, Tao Li, Xian Li, Xiaojie Li, Xiaoyang Li, Xingxing Li, Yameng Li, Yifu Li, Yiying Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Zhiqiang Liang, Wang Liao, Yalin Liao, Heng Lin, Kengyu Lin, Shanchuan Lin, Xi Lin, Zhijie Lin, Feng Ling, Fangfang Liu, Gaohong Liu, Jiawei Liu, Jie Liu, Jihao Liu, Shouda Liu, Shu Liu, Sichao Liu, Songwei Liu, Xin Liu, Xue Liu, Yibo Liu, Zikun Liu, Zuxi Liu, Junlin Lyu, Lecheng Lyu, Qian Lyu, Han Mu, Xiaonan Nie, Jingzhe Ning, Xitong Pan, Yanghua Peng, Lianke Qin, Xueqiong Qu, Yuxi Ren, Kai Shen, Guang Shi, Lei Shi, Yan Song, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Yan Sun, Zeyu Sun, Wenjing Tang, Yaxue Tang, Zirui Tao, Feng Wang, Furui Wang, Jinran Wang, Junkai Wang, Ke Wang, Kexin Wang, Qingyi Wang, Rui Wang, Sen Wang, Shuai Wang, Tingru Wang, Weichen Wang, Xin Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Ziyu Wang, Guoqiang Wei, Wanru Wei, Di Wu, Guohong Wu, Hanjie Wu, Jian Wu, Jie Wu, Ruolan Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Liang Xiang, Fei Xiao, XueFeng Xiao, Pan Xie, Shuangyi Xie, Shuang Xu, Jinlan Xue, Shen Yan, Bangbang Yang, Ceyuan Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yang Yang, Yihang Yang, ZhiXian Yang, Ziyan Yang, Songting Yao, Yifan Yao, Zilyu Ye, Bowen Yu, Jian Yu, Chujie Yuan, Linxiao Yuan, Sichun Zeng, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Chuntao Zhang, Heng Zhang, Jingjie Zhang, Kuo Zhang, Liang Zhang, Liying Zhang, Manlin Zhang, Ting Zhang, Weida Zhang, Xiaohe Zhang, Xinyan Zhang, Yan Zhang, Yuan Zhang, Zixiang Zhang, Fengxuan Zhao, Huating Zhao, Yang Zhao, Hao Zheng, Jianbin Zheng, Xiaozheng Zheng, Yangyang Zheng, Yijie Zheng, Jiexin Zhou, Jiahui Zhu, Kuan Zhu, Shenhan Zhu, Wenjia Zhu, Benhui Zou, Feilong Zuo
First: 2025-12-15T16:36:52+00:00 · Latest: 2025-12-23T17:38:46+00:00
Comments: Seedance 1.5 pro Technical Report
Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
中文标题/摘要
标题:Seedance 1.5 pro:一种原生音视频联合生成基础模型
近期音视频生成技术的进步为统一的音视频生成铺平了道路。在此项工作中,我们介绍了Seedance 1.5 pro,这是一种专门针对原生联合音视频生成的基础模型。该模型利用双分支扩散变换器架构,结合跨模态联合模块和专门的多阶段数据管道,实现了卓越的音视频同步和生成质量。为了确保其实用性,我们实施了精细的后训练优化,包括在高质量数据集上进行监督微调(SFT)和多维度奖励模型的人工反馈强化学习(RLHF)。此外,我们还引入了一种加速框架,将推理速度提高了超过10倍。Seedance 1.5 pro 通过精确的多语言和方言唇同步、动态电影级摄像机控制和增强的叙事连贯性,使其成为专业级内容创作的强大引擎。Seedance 1.5 pro 现已可在火山引擎上访问:https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo。
Summary / 总结
Seedance 1.5 pro is a foundational model designed for native audio-visual generation. It uses a dual-branch Diffusion Transformer with a cross-modal joint module and a specialized data pipeline to achieve high-quality synchronization and generation. Post-training optimizations include SFT and RLHF, and an acceleration framework increases inference speed by over 10X. Key features include precise multilingual lip-syncing, dynamic camera control, and enhanced narrative coherence, making it suitable for professional content creation.
Seedance 1.5 pro 是一个用于原生音视频生成的基础模型,采用双分支扩散变换器和跨模态联合模块及专门的数据流水线,实现高质量的同步和生成。后训练优化包括监督微调(SFT)和基于人类反馈的强化学习(RLHF),并引入加速框架将推理速度提升超过10倍。主要功能包括精确的多语言唇同步、动态摄像机控制和增强的叙事连贯性,使其适用于专业内容创作。
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
Authors: Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik
First: 2025-12-23T17:29:08+00:00 · Latest: 2025-12-23T17:29:08+00:00
Comments: 18 pages, 9 figures
Abstract
Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on four datasets (YCB-V, T-LESS, ITODD-MV, HouseCat6D) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.
中文标题/摘要
标题:AlignPose:通过多视图特征度量对齐实现通用的6D姿态估计
基于单视图RGB模型的对象姿态估计方法在泛化能力上表现出色,但从根本上受到深度歧义、杂乱和遮挡的限制。多视图姿态估计方法有可能解决这些问题,但现有工作依赖于精确的单视图姿态估计或缺乏对未见过的对象的泛化能力。我们通过以下三个贡献来应对这些挑战。首先,我们引入了AlignPose,这是一种6D物体姿态估计方法,可以从多个外在校准的RGB视图中聚合信息,无需任何特定于物体的训练或对称标注。其次,该方法的关键组成部分是一种新的多视图特征度量细化,专门用于物体姿态。它优化了一个一致的世界坐标系中的物体姿态,最小化了在所有视图中实时渲染的物体特征与观察到的图像特征之间的特征差异。第三,我们在四个数据集(YCB-V、T-LESS、ITODD-MV、HouseCat6D)上进行了广泛的实验,并使用BOP基准评估表明,AlignPose在泛化能力上优于其他已发表的方法,特别是在实践中多视图易于获取的工业数据集上。
Summary / 总结
The research aims to improve 6D pose estimation by addressing limitations of single-view methods, such as depth ambiguity and occlusions. AlignPose aggregates information from multiple calibrated RGB views without requiring object-specific training. It uses a multi-view feature-metric refinement to optimize a consistent object pose across all views, minimizing feature discrepancies. Experiments on four datasets show that AlignPose outperforms other methods, particularly on industrial datasets with multiple views available.
研究旨在通过提出AlignPose方法,利用多视图RGB信息解决单视图RGB模型姿态估计方法的深度模糊和遮挡等问题。AlignPose引入了一种多视图特征度量精炼方法,该方法在所有视图中优化一致的物体姿态,最小化渲染特征和观测特征之间的差异。实验结果表明,AlignPose在四个数据集上优于现有方法,特别是在具有多个可用视图的工业场景中表现出色。
SirenPose: Dynamic Scene Reconstruction via Geometric Supervision
Authors: Kaitong Cai, Jensen Zhang, Jing Yang, Keze Wang
First: 2025-12-23T17:23:21+00:00 · Latest: 2025-12-23T17:23:21+00:00
Comments: Under submission
Abstract
We introduce SirenPose, a geometry-aware loss formulation that integrates the periodic activation properties of sinusoidal representation networks with keypoint-based geometric supervision, enabling accurate and temporally consistent reconstruction of dynamic 3D scenes from monocular videos. Existing approaches often struggle with motion fidelity and spatiotemporal coherence in challenging settings involving fast motion, multi-object interaction, occlusion, and rapid scene changes. SirenPose incorporates physics inspired constraints to enforce coherent keypoint predictions across both spatial and temporal dimensions, while leveraging high frequency signal modeling to capture fine grained geometric details. We further expand the UniKPT dataset to 600,000 annotated instances and integrate graph neural networks to model keypoint relationships and structural correlations. Extensive experiments on benchmarks including Sintel, Bonn, and DAVIS demonstrate that SirenPose consistently outperforms state-of-the-art methods. On DAVIS, SirenPose achieves a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA. It also improves temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose outperforms Monst3R with lower absolute trajectory error as well as reduced translational and rotational relative pose error, highlighting its effectiveness in handling rapid motion, complex dynamics, and physically plausible reconstruction.
中文标题/摘要
标题:SirenPose:通过几何监督实现动态场景重建
我们介绍了SirenPose,一种几何感知的损失公式,将正弦表示网络的周期激活特性与基于关键点的几何监督相结合,从而能够从单目视频中准确且时间一致地重建动态3D场景。现有方法在快速运动、多对象交互、遮挡和快速场景变化等具有挑战性的情况下,往往难以保持运动保真度和时空一致性。SirenPose结合了物理启发的约束条件,以确保在空间和时间维度上的一致性关键点预测,同时利用高频信号建模来捕捉细微的几何细节。我们进一步将UniKPT数据集扩展到600,000个标注实例,并结合图神经网络来建模关键点关系和结构相关性。在包括Sintel、Bonn和DAVIS在内的基准测试中进行的大量实验表明,SirenPose始终优于最先进的方法。在DAVIS上,SirenPose的FVD降低了17.8%,FID降低了28.7%,LPIPS提高了6.0%。它还提高了时间一致性、几何准确性、用户评分和运动平滑度。在姿态估计中,SirenPose在绝对轨迹误差、平移和旋转相对姿态误差方面均优于Monst3R,突显了其在处理快速运动、复杂动态和物理上合理重建方面的有效性。
Summary / 总结
SirenPose introduces a geometry-aware loss formulation that combines sinusoidal representation networks with keypoint-based geometric supervision to achieve accurate and temporally consistent dynamic 3D scene reconstruction from monocular videos. The method incorporates physics-inspired constraints and high-frequency signal modeling, and it outperforms state-of-the-art methods on benchmarks such as Sintel, Bonn, and DAVIS. On DAVIS, it reports a 17.8 percent reduction in FVD, a 28.7 percent reduction in FID, and a 6.0 percent improvement in LPIPS compared to MoSCA, along with better temporal consistency, geometric accuracy, user score, and motion smoothness. In pose estimation, SirenPose achieves lower absolute trajectory error and lower translational and rotational relative pose error than Monst3R.
SirenPose提出了一种几何感知的损失公式,结合了正弦表示网络和基于关键点的几何监督,以实现从单目视频中准确且时间一致地重建动态3D场景。该方法结合了物理启发的约束和高频信号建模,并在Sintel、Bonn和DAVIS等基准测试中表现出优越性能,显著降低了FVD、FID和LPIPS等指标。此外,它还在时间一致性、几何准确性、用户评分和运动平滑度方面有所提升。在姿态估计方面,SirenPose在绝对轨迹误差和相对姿态误差方面优于Monst3R,使其能够有效处理快速运动、复杂动态和物理上合理的重建。
Benchmarking LLMs for Predictive Applications in the Intensive Care Units
Authors: Chehak Malhotra, Mehak Gopal, Akshaya Devadiga, Pradeep Singh, Ridam Pal, Ritwik Kashyap, Tavpritesh Sethi
First: 2025-12-23T17:08:31+00:00 · Latest: 2025-12-23T17:08:31+00:00
Abstract
With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting Shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI-index, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.
中文标题/摘要
标题:重症监护病房预测应用中大型语言模型的基准测试
随着大型语言模型(LLMs)的出现,自然语言处理领域中的各种任务都发生了转变。然而,它们在预测任务中的应用研究较少。本研究将GatorTron-Base(基于临床数据训练)、Llama 8B和Mistral 7B等大型语言模型与BioBERT、DocBERT、BioClinicalBERT、Word2Vec和Doc2Vec等模型进行比较,以建立预测重症患者休克的基准。及时预测休克可以实现早期干预,从而改善患者预后。从MIMIC III数据库中17,294例ICU住院患者的文本数据中,筛选出住院时间超过24小时且休克指数(SI)大于0.7的患者,分别得到355例正常SI指数和87例异常SI指数的患者。在微调过程中,使用焦点损失和交叉熵损失来解决类别不平衡问题。研究结果表明,虽然GatorTron Base的加权召回率最高,达到80.5%,但整体性能指标在SLMs和LLMs之间相当。这表明,尽管LLMs在文本任务上表现出色,但它们在预测未来临床事件方面并不比SLMs更具优越性。为了实现有意义的临床结果,未来在训练LLMs时应优先开发能够预测临床轨迹的模型,而不是专注于命名实体识别或表型识别等简单任务。
Summary / 总结
This study benchmarks large language models (LLMs) and small language models (SLMs) for predicting shock in critically ill patients using text data from the MIMIC III database. The LLMs, including GatorTron-Base, Llama 8B, and Mistral 7B, were compared against BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec. GatorTron Base achieved the highest weighted recall of 80.5%, but overall performance metrics were comparable between LLMs and SLMs, indicating that LLMs are not inherently superior for predicting future clinical events. The authors suggest that future LLM training should prioritize predicting clinical trajectories over simpler tasks such as named entity recognition or phenotyping.
该研究使用MIMIC III数据库中的文本数据,对比了大型语言模型(LLMs)和小型语言模型(SLMs)在预测重症患者休克方面的表现。GatorTron-Base,一种基于临床数据训练的模型,实现了最高的加权召回率80.5%,但LLMs和SLMs的整体性能指标相当。研究指出,LLMs在预测临床事件方面并不天然优于SLMs,强调未来应重点开发能够预测临床轨迹的模型,而非专注于简单的文本任务如命名实体识别或表型识别。
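The labeling threshold mentioned in the abstract (shock index > 0.7) can be illustrated with a minimal sketch. It assumes the conventional definition of shock index as heart rate divided by systolic blood pressure, which the abstract does not spell out; the function names and sample readings are hypothetical and are not taken from the paper or from MIMIC III.

```python
def shock_index(heart_rate_bpm: float, systolic_bp_mmhg: float) -> float:
    """Shock index under the conventional definition HR / SBP (an assumption here)."""
    if systolic_bp_mmhg <= 0:
        raise ValueError("systolic blood pressure must be positive")
    return heart_rate_bpm / systolic_bp_mmhg


def has_abnormal_si(heart_rate_bpm: float, systolic_bp_mmhg: float,
                    threshold: float = 0.7) -> bool:
    """Flag a reading as abnormal when SI exceeds the 0.7 threshold quoted in the abstract."""
    return shock_index(heart_rate_bpm, systolic_bp_mmhg) > threshold


# Hypothetical readings for illustration only.
print(has_abnormal_si(70, 120))   # SI is roughly 0.58
print(has_abnormal_si(110, 90))   # SI is roughly 1.22
```

How the study aggregates readings over an ICU stay before applying the threshold is not stated in the abstract, so the sketch scores a single reading only.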
Deep Reinforcement Learning Optimization for Uncertain Nonlinear Systems via Event-Triggered Robust Adaptive Dynamic Programming
Authors: Ningwei Bai, Chi Pui Chan, Qichen Yin, Tengyang Gong, Yunda Yan, Zezhi Tang
First: 2025-12-05T22:52:22+00:00 · Latest: 2025-12-23T17:06:16+00:00
Comments: 9 pages, 9 figures
Abstract
This work proposes a unified control architecture that couples a Reinforcement Learning (RL)-driven controller with a disturbance-rejection Extended State Observer (ESO), complemented by an Event-Triggered Mechanism (ETM) to limit unnecessary computations. The ESO is utilized to estimate the system states and the lumped disturbance in real time, forming the foundation for effective disturbance compensation. To obtain near-optimal behavior without an accurate system description, a value-iteration-based Adaptive Dynamic Programming (ADP) method is adopted for policy approximation. The inclusion of the ETM ensures that parameter updates of the learning module are executed only when the state deviation surpasses a predefined bound, thereby preventing excessive learning activity and substantially reducing computational load. A Lyapunov-oriented analysis is used to characterize the stability properties of the resulting closed-loop system. Numerical experiments further confirm that the developed approach maintains strong control performance and disturbance tolerance, while achieving a significant reduction in sampling and processing effort compared with standard time-triggered ADP schemes.
中文标题/摘要
标题:不确定非线性系统的事件触发鲁棒自适应动态规划优化深度强化学习控制架构
本文提出了一种统一的控制架构,将基于强化学习(RL)的控制器与扰动抑制扩展状态观测器(ESO)相结合,并通过事件触发机制(ETM)限制不必要的计算。ESO用于实时估计系统状态和综合扰动,形成有效的扰动补偿基础。为了在没有精确系统描述的情况下获得接近最优的行为,采用基于值迭代的自适应动态规划(ADP)方法进行策略近似。ETM的引入确保只有当状态偏差超过预定义界限时才执行学习模块的参数更新,从而防止过度的学习活动并显著减少计算负载。使用Lyapunov导向分析来表征闭环系统的稳定性特性。数值实验进一步证实,所开发的方法保持了强大的控制性能和扰动鲁棒性,同时与标准时间触发ADP方案相比,实现了显著的采样和处理努力减少。
Summary / 总结
This work proposes a control architecture that integrates a Reinforcement Learning (RL)-driven controller with a disturbance-rejection Extended State Observer (ESO) and an Event-Triggered Mechanism (ETM) to reduce unnecessary computations. The ESO estimates system states and the lumped disturbance in real time, while a value-iteration-based Adaptive Dynamic Programming (ADP) method approximates the policy for near-optimal control without an accurate system model. The ETM executes parameter updates only when the state deviation exceeds a predefined bound, which reduces computational load, and a Lyapunov-oriented analysis characterizes the stability of the closed-loop system. Experiments show that the approach maintains strong control performance and disturbance tolerance, with a significant reduction in sampling and processing effort compared to standard time-triggered ADP schemes.
该研究提出了一种结合了基于强化学习的控制器、扰动抑制扩展状态观测器和事件触发机制的控制架构,以减少不必要的计算。该方法采用基于值迭代的自适应动态规划方法进行策略逼近,并通过李雅普诺夫分析确保系统的稳定性。关键发现表明,所提出的方法在保持强大控制性能和扰动鲁棒性的同时,显著减少了采样和处理努力,相比标准的时间触发ADP方案。
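The event-triggered rule described above (update the learning module only when the state deviation surpasses a predefined bound) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean norm, the choice to reset the reference state at each event, and all names and values are hypothetical.

```python
def should_trigger(state, last_update_state, bound):
    """True when the Euclidean deviation from the last update state exceeds the bound."""
    deviation = sum((s - p) ** 2 for s, p in zip(state, last_update_state)) ** 0.5
    return deviation > bound


def count_event_triggered_updates(trajectory, bound):
    """Count how many learning updates the event-triggered rule would run."""
    last = trajectory[0]
    updates = 0
    for state in trajectory[1:]:
        if should_trigger(state, last, bound):
            updates += 1   # placeholder for the ADP parameter update
            last = state   # assumed: the reference state resets at each event
    return updates


# Hypothetical 1-D trajectory; a time-triggered scheme would update at all 5 steps.
trajectory = [(0.0,), (0.05,), (0.12,), (0.15,), (0.30,), (0.31,)]
print(count_event_triggered_updates(trajectory, bound=0.1))
```

On this toy trajectory only two of the five steps cross the bound, which is the kind of saving over time-triggered updating that the abstract describes.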
Recurrent Off-Policy Deep Reinforcement Learning Doesn't Have to be Slow
Authors: Tyler Clark, Christine Evers, Jonathon Hare
First: 2025-12-23T17:02:17+00:00 · Latest: 2025-12-23T17:02:17+00:00
Abstract
Recurrent off-policy deep reinforcement learning models achieve state-of-the-art performance but are often sidelined due to their high computational demands. In response, we introduce RISE (Recurrent Integration via Simplified Encodings), a novel approach that can leverage recurrent networks in any image-based off-policy RL setting without significant computational overheads via using both learnable and non-learnable encoder layers. When integrating RISE into leading non-recurrent off-policy RL algorithms, we observe a 35.6% human-normalized interquartile mean (IQM) performance improvement across the Atari benchmark. We analyze various implementation strategies to highlight the versatility and potential of our proposed framework.
中文标题/摘要
标题:循环离策略深度强化学习不必缓慢
循环离策略深度强化学习模型能够达到最先进的性能,但由于其高计算需求往往被忽视。为此,我们提出了RISE(循环集成通过简化编码)这一新颖方法,该方法能够在任何基于图像的离策略RL设置中利用循环网络,同时通过使用可学习和不可学习的编码层,不会产生显著的计算开销。将RISE整合到领先的非循环离策略RL算法中,我们发现在Atari基准测试中的人类标准化四分位均值(IQM)性能提高了35.6%。我们分析了各种实现策略,以突出我们提出框架的灵活性和潜力。
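The interquartile mean (IQM) used to report the Atari result is the average of the middle 50% of scores, which makes it less sensitive to outliers than a plain mean. A minimal sketch follows, assuming the scores are already human-normalized and pooled; the sample values are hypothetical and the function name is not from the paper.

```python
def interquartile_mean(scores):
    """Drop the lowest and highest 25% of values, then average the remainder."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))            # number of values trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical normalized scores with one large outlier.
scores = [0.1, 0.4, 0.5, 0.6, 0.7, 0.9, 1.2, 8.0]
print(interquartile_mean(scores))             # about 0.675
print(sum(scores) / len(scores))              # plain mean, about 1.55, pulled up by the outlier
```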
Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
Authors: Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang
First: 2025-12-21T14:02:53+00:00 · Latest: 2025-12-23T16:47:46+00:00
Comments: Code will be released at https://github.com/Xilluill/MAG
Abstract
Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce MAG-Bench to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
中文标题/摘要
标题:记忆与生成:实时视频生成中的长期一致性
帧级自回归(帧-AR)模型取得了显著进展,使实时视频生成与双向扩散模型相当,并成为交互式世界模型和游戏引擎的基础。然而,当前长视频生成方法通常依赖于窗口注意力,这会简单地丢弃窗口外的历史上下文,导致灾难性遗忘和场景不一致;相反,保留完整历史则会带来巨大的内存成本。为解决这一权衡,我们提出了一种名为记忆与生成(MAG)的框架,将记忆压缩和帧生成分解为独立的任务。具体而言,我们训练了一个记忆模型将历史信息压缩成紧凑的KV缓存,以及一个独立的生成模型利用此压缩表示合成后续帧。此外,我们引入了MAG-Bench严格评估历史记忆保留。大量实验表明,MAG在保持与标准视频生成基准相当性能的同时,实现了更优的历史场景一致性。
Summary / 总结
The research aims to improve long-term consistency in real-time video generation by addressing the trade-off between retaining historical context and memory efficiency. The proposed Memorize-and-Generate (MAG) framework decouples memory compression and frame generation. A memory model compresses historical information into a compact key-value cache, which is then used by a separate generator model to synthesize subsequent frames. Experiments show that MAG maintains high historical scene consistency while performing well on standard benchmarks.
这项工作的动机是提高实时视频生成中的长期一致性,解决当前方法要么丢弃历史上下文,要么产生高昂内存成本的问题。提出的Memorize-and-Generate (MAG) 方法将记忆压缩和帧生成分离为两个独立任务。记忆模型将历史信息压缩成紧凑的键值缓存,然后由生成模型利用该压缩表示生成后续帧。实验表明,MAG 在保持高历史场景一致性的同时,在标准视频生成基准上表现良好。
Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition
Authors: Gorjan Radevski
First: 2025-12-23T16:46:58+00:00 · Latest: 2025-12-23T16:46:58+00:00
Comments: Ph.D. manuscript; Supervisors/Mentors: Marie-Francine Moens and Tinne Tuytelaars
Abstract
This manuscript explores multimodal alignment, translation, fusion, and transference to enhance machine understanding of complex inputs. We organize the work into five chapters, each addressing unique challenges in multimodal machine learning.
Chapter 3 introduces Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements between clip-arts. This enables effective decoding of spatial language into visual representations, paving the way for automated scene generation aligned with human spatial understanding.
Chapter 4 presents a method for translating medical texts into specific 3D locations within an anatomical atlas. We introduce a loss function leveraging spatial co-occurrences of medical terms to create interpretable mappings, significantly enhancing medical text navigability.
Chapter 5 tackles translating structured text into canonical facts within knowledge graphs. We develop a benchmark for linking natural language to entities and predicates, addressing ambiguities in text extraction to provide clearer, actionable insights.
Chapter 6 explores multimodal fusion methods for compositional action recognition. We propose a method fusing video frames and object detection representations, improving recognition robustness and accuracy.
Chapter 7 investigates multimodal knowledge transference for egocentric action recognition. We demonstrate how multimodal knowledge distillation enables RGB-only models to mimic multimodal fusion-based capabilities, reducing computational requirements while maintaining performance.
These contributions advance methodologies for spatial language understanding, medical text interpretation, knowledge graph enrichment, and action recognition, enhancing computational systems' ability to process complex, multimodal inputs across diverse applications.
中文标题/摘要
标题:跨模态融合与知识转移:增强的多模态理解和识别
本文探讨了多模态对齐、翻译、融合和转移,以增强机器对复杂输入的理解。我们将工作分为五章,每章解决多模态机器学习中的独特挑战。
第三章介绍了空间推理Bert,用于将基于文本的空间关系翻译成剪贴画之间的2D排列。这使得空间语言的有效解码为视觉表示成为可能,为与人类空间理解对齐的自动化场景生成铺平了道路。
第四章提出了一种将医学文本翻译到解剖学图谱中特定3D位置的方法。我们引入了一个利用医学术语空间共现性的损失函数,创建可解释的映射,显著提高了医学文本的可导航性。
第五章解决了将结构化文本翻译成知识图谱中的标准事实。我们开发了一个基准,将自然语言链接到实体和谓词,解决文本提取中的歧义,提供更清晰、可操作的见解。
第六章探讨了多模态融合方法在组合动作识别中的应用。我们提出了一种融合视频帧和对象检测表示的方法,提高了识别的鲁棒性和准确性。
第七章研究了多模态知识转移在第一人称动作识别中的应用。我们展示了多模态知识蒸馏如何使仅使用RGB的模型模仿多模态融合的能力,同时减少计算需求并保持性能。
这些贡献推进了空间语言理解、医学文本解释、知识图谱丰富和动作识别的方法,增强了计算系统处理各种应用中复杂多模态输入的能力。
Summary / 总结
This manuscript focuses on enhancing machine understanding of complex multimodal inputs through spatial reasoning, medical text translation, knowledge graph linking, multimodal fusion, and knowledge transference. Key methods include Spatial-Reasoning Bert for translating text-based spatial relations into 2D arrangements, a loss function for mapping medical texts to 3D anatomical locations, a benchmark for linking natural language to entities and predicates in knowledge graphs, fusion of video frames and object detection representations for compositional action recognition, and multimodal knowledge distillation for egocentric action recognition. The manuscript reports improved medical text navigability, more robust and accurate action recognition, and RGB-only models that keep multimodal-level performance at lower computational cost.
本论文旨在通过解决空间推理、医学文本翻译、知识图谱丰富、组成动作识别和多模态知识迁移等问题,增强机器对复杂多模态输入的理解。研究引入了Spatial-Reasoning Bert用于将文本中的空间关系转化为视觉表示,提出了利用医学术语空间共现性的损失函数以实现医学文本到3D位置的映射,开发了将自然语言与实体和谓词链接的基准,提出了多模态融合方法以提高动作识别的鲁棒性和准确性,并展示了多模态知识蒸馏技术以使单模态RGB模型具备多模态融合能力。这些贡献提高了机器学习模型在各种应用中的可解释性和鲁棒性。
Deep Research Comparator: A Platform For Fine-grained Human Annotations of Deep Research Agents
Authors: Prahaladh Chandrahasan, Jiahe Jin, Zhihan Zhang, Tevin Wang, Andy Tang, Lucy Mo, Morteza Ziyadi, Leonardo F. R. Ribeiro, Zimeng Qiu, Markus Dreyer, Akari Asai, Chenyan Xiong
First: 2025-07-07T21:35:09+00:00 · Latest: 2025-12-23T16:43:12+00:00
Abstract
Effectively evaluating deep research agents that autonomously search the web, analyze information, and generate reports remains a major challenge, particularly when it comes to assessing long reports and giving detailed feedback on their intermediate steps. To address these gaps, we introduce Deep Research Comparator, a platform that offers a holistic framework for deep research agent hosting, side-by-side comparison, fine-grained human feedback collection, and ranking calculation. Given a user query, our platform displays the final reports from two different agents along with their intermediate steps during generation. Annotators can evaluate the overall quality of final reports based on side-by-side comparison, and also provide detailed feedback separately by assessing intermediate steps or specific text spans within the final report. Furthermore, we develop Simple Deepresearch, an end-to-end agent scaffold. This scaffold serves as a baseline that facilitates the easy integration of various large language models to transform them into deep research agents for evaluation. To demonstrate the platform's utility for deep research agent development, we have collected real user preference data from 17 annotators on three deep research agents. A demo video of our platform can be found at https://www.youtube.com/watch?v=g4d2dnbdseg.
中文标题/摘要
标题:深度研究比较器:一种细粒度的人工标注深度研究代理平台
有效地评估能够自主搜索网络、分析信息并生成报告的深度研究代理仍然是一项重大挑战,尤其是在评估长报告和提供详细的中间步骤反馈方面。为了解决这些差距,我们引入了深度研究比较器,这是一个提供深度研究代理托管、并排比较、细粒度的人工反馈收集和排名计算的综合框架的平台。给定用户查询,我们的平台会显示两个不同代理的最终报告及其生成过程中的中间步骤。标注者可以根据并排比较来评估最终报告的整体质量,也可以分别评估中间步骤或最终报告中的特定文本段落。此外,我们还开发了简单深度研究,这是一种端到端的代理框架。该框架作为基准,有助于各种大型语言模型的轻松集成,从而将它们转化为用于评估的深度研究代理。为了展示该平台在深度研究代理开发中的实用性,我们从17名标注者那里收集了针对三个深度研究代理的真实用户偏好数据。我们的平台演示视频可以在https://www.youtube.com/watch?v=g4d2dnbdseg找到。
mLaSDI: Multi-stage latent space dynamics identification
Authors: William Anderson, Seung Whan Chung, Robert Stephany, Youngsoo Choi
First: 2025-06-10T19:57:35+00:00 · Latest: 2025-12-23T16:20:49+00:00
Abstract
Accurately solving partial differential equations (PDEs) is essential across many scientific disciplines. However, high-fidelity solvers can be computationally prohibitive, motivating the development of reduced-order models (ROMs). Recently, Latent Space Dynamics Identification (LaSDI) was proposed as a data-driven, non-intrusive ROM framework. LaSDI compresses the training data via an autoencoder and learns user-specified ordinary differential equations (ODEs), governing the latent dynamics, enabling rapid predictions for unseen parameters. While LaSDI has produced effective ROMs for numerous problems, the autoencoder must simultaneously reconstruct the training data and satisfy the imposed latent dynamics, which are often competing objectives that limit accuracy, particularly for complex or high-frequency phenomena. To address this limitation, we propose multi-stage Latent Space Dynamics Identification (mLaSDI). With mLaSDI, we train LaSDI sequentially in stages. After training the initial autoencoder, we train additional decoders which map the latent trajectories to residuals from previous stages. This staged residual learning, combined with periodic activation functions, enables recovery of high-frequency content without sacrificing interpretability of the latent dynamics. Numerical experiments on a multiscale oscillating system, unsteady wake flow, and the 1D-1V Vlasov equation demonstrate that mLaSDI achieves significantly lower reconstruction and prediction errors, often by an order of magnitude, while requiring less training time and reduced hyperparameter tuning compared to standard LaSDI.
中文标题/摘要
标题:mLaSDI:多阶段潜在空间动力学识别
准确求解偏微分方程(PDEs)在许多科学领域中至关重要。然而,高保真求解器可能因计算成本高昂而难以实现,因此推动了降阶模型(ROMs)的发展。最近,提出了潜在空间动力学识别(LaSDI)作为数据驱动的非侵入式ROM框架。LaSDI通过自编码器压缩训练数据,并学习用户指定的常微分方程(ODEs),以管理潜在动力学,从而实现对未见参数的快速预测。尽管LaSDI在许多问题上产生了有效的ROMs,但自编码器必须同时重构训练数据并满足施加的潜在动力学,这往往是相互竞争的目标,限制了准确性,尤其是在处理复杂或高频现象时。为了解决这一限制,我们提出了多阶段潜在空间动力学识别(mLaSDI)。在mLaSDI中,我们按阶段顺序训练LaSDI。在训练初始自编码器后,我们训练额外的解码器,将潜在轨迹映射到前一阶段的残差。这种阶段残差学习,结合周期激活函数,能够在不牺牲潜在动力学可解释性的情况下恢复高频内容。在多尺度振荡系统、不稳定的尾流以及1D-1V Vlasov方程上的数值实验表明,mLaSDI在重构和预测误差方面显著降低,通常降低一个数量级,同时需要更少的训练时间和减少的超参数调整,与标准LaSDI相比。
Summary / 总结
The paper introduces mLaSDI, a multi-stage approach to improve the accuracy of Latent Space Dynamics Identification (LaSDI), a reduced-order modeling framework for partial differential equations. After training an initial autoencoder, mLaSDI sequentially trains additional decoders that map latent trajectories to residuals from previous stages, which, combined with periodic activation functions, enables better recovery of high-frequency content. Experiments show that mLaSDI achieves significantly lower reconstruction and prediction errors than standard LaSDI, often by an order of magnitude, with less training time and reduced hyperparameter tuning.
研究旨在通过改进Latent Space Dynamics Identification (LaSDI)方法来提高求解偏微分方程 (PDEs) 的降阶模型 (ROMs) 的准确性和效率,以解决LaSDI方法的局限性。主要方法是采用多阶段方法,称为多阶段Latent Space Dynamics Identification (mLaSDI),该方法在多个阶段依次训练LaSDI,从而更好地恢复高频内容同时保持可解释性。关键实验结果表明,mLaSDI在重建和预测误差方面比标准LaSDI低一个数量级以上,同时需要更少的训练时间和更少的超参数调整。
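The staged residual idea can be illustrated with a toy analog: a first stage fits a coarse model, and a second stage fits only what the first stage missed, here with a periodic basis. This is a sketch under strong assumptions and is not mLaSDI itself: there is no autoencoder or latent ODE, the signal and its frequency are hypothetical, and the second stage is given the frequency in advance.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares line y = a*x + b (stage 1: coarse model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def fit_sine_amplitude(xs, residuals, freq):
    """Least-squares amplitude c for residual ~ c*sin(freq*x) (stage 2: periodic correction)."""
    basis = [math.sin(freq * x) for x in xs]
    return sum(r * b for r, b in zip(residuals, basis)) / sum(b * b for b in basis)


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


xs = [i / 200 for i in range(201)]
ys = [2.0 * x + 0.3 * math.sin(20.0 * x) for x in xs]   # hypothetical multiscale signal

a, b = fit_line(xs, ys)
res1 = [y - (a * x + b) for x, y in zip(xs, ys)]        # what stage 1 failed to capture
c = fit_sine_amplitude(xs, res1, freq=20.0)
res2 = [r - c * math.sin(20.0 * x) for x, r in zip(xs, res1)]

print(rms(res1), rms(res2))   # the stage-2 residual is smaller than the stage-1 residual
```

Because the second stage only ever sees the first stage's residual, the coarse fit is left unchanged, which mirrors the abstract's point that later stages recover high-frequency content without disturbing the earlier stage.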
SweRank+: Multilingual, Multi-Turn Code Ranking for Software Issue Localization
Authors: Revanth Gangi Reddy, Ye Liu, Wenting Zhao, JaeHyeok Doo, Tarun Suresh, Daniel Lee, Caiming Xiong, Yingbo Zhou, Semih Yavuz, Shafiq Joty
First: 2025-12-23T16:18:39+00:00 · Latest: 2025-12-23T16:18:39+00:00
Abstract
Maintaining large-scale, multilingual codebases hinges on accurately localizing issues, which requires mapping natural-language error descriptions to the relevant functions that need to be modified. However, existing ranking approaches are often Python-centric and perform a single-pass search over the codebase. This work introduces SweRank+, a framework that couples SweRankMulti, a cross-lingual code ranking tool, with SweRankAgent, an agentic search setup, for iterative, multi-turn reasoning over the code repository. SweRankMulti comprises a code embedding retriever and a listwise LLM reranker, and is trained using a carefully curated large-scale issue localization dataset spanning multiple popular programming languages. SweRankAgent adopts an agentic search loop that moves beyond single-shot localization with a memory buffer to reason and accumulate relevant localization candidates over multiple turns. Our experiments on issue localization benchmarks spanning various languages demonstrate new state-of-the-art performance with SweRankMulti, while SweRankAgent further improves localization over single-pass ranking.
中文标题/摘要
标题:SweRank+: 多语言、多轮代码排名在软件问题本地化中的应用
维护大规模多语言代码库的关键在于准确地本地化问题,这需要将自然语言错误描述映射到需要修改的相关函数。然而,现有的排名方法往往以Python为中心,并且只在代码库中进行一次搜索。这项工作引入了SweRank+框架,该框架结合了SweRankMulti,一种跨语言代码排名工具,以及SweRankAgent,一种代理搜索设置,用于在代码库上进行迭代的多轮推理。SweRankMulti 包含一个代码嵌入检索器和一个列表级的LLM重排序器,并使用一个精心策划的跨多种流行编程语言的大规模问题本地化数据集进行训练。SweRankAgent 采用了一种代理搜索循环,超越了一次性本地化,使用记忆缓冲区在多轮次中推理和累积相关本地化候选。我们在涵盖多种语言的问题本地化基准测试上进行的实验表明,SweRankMulti 达到了新的最佳性能,而SweRankAgent 进一步提高了本地化性能,超越了一次性排名。
Summary / 总结
The research aims to improve the localization of issues in large-scale multilingual codebases by accurately mapping natural-language error descriptions to relevant functions. The study introduces SweRank+, which combines SweRankMulti, a cross-lingual code ranking tool, and SweRankAgent, an iterative search setup. SweRankMulti uses a code embedding retriever and a listwise LLM reranker, trained on a large dataset of issue localization across multiple programming languages. SweRankAgent employs an agentic search loop to iteratively reason and accumulate relevant localization candidates. Experiments show that SweRankMulti achieves new state-of-the-art performance, and SweRankAgent further enhances localization accuracy compared to single-pass ranking methods.
研究旨在通过解决现有单次排名方法的局限性,提高大规模多语言代码库中问题定位的准确性。SweRank+引入了SweRankMulti,它结合了代码嵌入检索器和列表级LLM重排序器,以及SweRankAgent,这是一种迭代的多轮推理系统。SweRankMulti在大规模多语言数据集上进行训练,而SweRankAgent使用记忆缓冲区逐步细化定位候选。实验表明,SweRankMulti在性能上超过了现有方法,而SweRankAgent进一步提高了单次排名的定位性能。
UTDesign: A Unified Framework for Stylized Text Editing and Generation in Graphic Design Images
Authors: Yiming Zhao, Yuanpeng Gao, Yuxuan Luo, Jiwei Duan, Shisong Lin, Longfei Xiong, Zhouhui Lian
Venue: SIGGRAPH Asia 2025
First: 2025-12-23T16:13:55+00:00 · Latest: 2025-12-23T16:13:55+00:00
Comments: 22 pages, 25 figures, SIGGRAPH Asia 2025, Conference Paper
Abstract
AI-assisted graphic design has emerged as a powerful tool for automating the creation and editing of design elements such as posters, banners, and advertisements. While diffusion-based text-to-image models have demonstrated strong capabilities in visual content generation, their text rendering performance, particularly for small-scale typography and non-Latin scripts, remains limited. In this paper, we propose UTDesign, a unified framework for high-precision stylized text editing and conditional text generation in design images, supporting both English and Chinese scripts. Our framework introduces a novel DiT-based text style transfer model trained from scratch on a synthetic dataset, capable of generating transparent RGBA text foregrounds that preserve the style of reference glyphs. We further extend this model into a conditional text generation framework by training a multi-modal condition encoder on a curated dataset with detailed text annotations, enabling accurate, style-consistent text synthesis conditioned on background images, prompts, and layout specifications. Finally, we integrate our approach into a fully automated text-to-design (T2D) pipeline by incorporating pre-trained text-to-image (T2I) models and an MLLM-based layout planner. Extensive experiments demonstrate that UTDesign achieves state-of-the-art performance among open-source methods in terms of stylistic consistency and text accuracy, and also exhibits unique advantages compared to proprietary commercial approaches. Code and data for this paper are available at https://github.com/ZYM-PKU/UTDesign.
中文标题/摘要
标题:UTDesign:图形设计图像中风格化文本编辑与生成的统一框架
AI辅助的图形设计已成为自动化设计元素(如海报、横幅和广告)创建和编辑的强大工具。尽管基于扩散的文本到图像模型在视觉内容生成方面表现出强大的能力,但它们在小型字体排印和非拉丁文字符集的文本呈现性能方面仍然有限。在本文中,我们提出了一种名为UTDesign的统一框架,用于设计图像中的高精度风格化文本编辑和条件文本生成,支持英文字体和中文字体。我们的框架引入了一种从合成数据集从零开始训练的新型DiT基文本风格转换模型,能够生成透明的RGBA文本前景,保留参考字符的风格。我们进一步通过在包含详细文本注释的精心策划的数据集上训练多模态条件编码器,将该模型扩展为条件文本生成框架,从而能够根据背景图像、提示和布局规范生成准确且风格一致的文本合成。最后,我们通过集成预训练的文本到图像(T2I)模型和基于MLLM的布局规划器,将我们的方法整合到一个完全自动化的文本到设计(T2D)流水线中。广泛的实验表明,UTDesign在开源方法中在风格一致性与文本准确性方面达到了最先进的性能,并且与专有的商业方法相比具有独特的优势。本文的代码和数据可在https://github.com/ZYM-PKU/UTDesign获取。
Summary / 总结
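UTDesign is a unified framework for stylized text editing and conditional text generation in graphic design images, supporting both English and Chinese scripts and addressing the limited text rendering performance of diffusion models for small-scale typography and non-Latin scripts. It introduces a DiT-based text style transfer model that generates transparent RGBA text foregrounds, and a multi-modal condition encoder for text synthesis conditioned on background images, prompts, and layout specifications. The approach is integrated into a fully automated text-to-design (T2D) pipeline using pre-trained text-to-image models and an MLLM-based layout planner. UTDesign achieves state-of-the-art stylistic consistency and text accuracy among open-source methods and shows unique advantages compared to proprietary commercial approaches.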
UTDesign 是一个统一框架,用于在图形设计图像中进行样式化文本编辑和生成,解决了小规模字体排印和非拉丁文字体渲染的限制。该框架引入了一种基于 DiT 的文本样式转移模型和多模态条件编码器,在开源方法中实现了样式一致性和文本准确性方面的领先性能。UTDesign 被整合到全自动的文本到设计(T2D)流水线中,并且与专有的商业方法相比展现出独特优势。
Snapshot 3D image projection using a diffractive decoder
Authors: Cagatay Isil, Alexander Chen, Yuhang Li, F. Onuralp Ardic, Shiqi Chen, Che-Yung Shen, Aydogan Ozcan
First: 2025-12-23T15:57:08+00:00 · Latest: 2025-12-23T15:57:08+00:00
Comments: 22 Pages, 8 Figures
Abstract
3D image display is essential for next-generation volumetric imaging; however, dense depth multiplexing for 3D image projection remains challenging because diffraction-induced cross-talk rapidly increases as the axial image planes get closer. Here, we introduce a 3D display system comprising a digital encoder and a diffractive optical decoder, which simultaneously projects different images onto multiple target axial planes with high axial resolution. By leveraging multi-layer diffractive wavefront decoding and deep learning-based end-to-end optimization, the system achieves high-fidelity depth-resolved 3D image projection in a snapshot, enabling axial plane separations on the order of a wavelength. The digital encoder leverages a Fourier encoder network to capture multi-scale spatial and frequency-domain features from input images, integrates axial position encoding, and generates a unified phase representation that simultaneously encodes all images to be axially projected in a single snapshot through a jointly-optimized diffractive decoder. We characterized the impact of diffractive decoder depth, output diffraction efficiency, spatial light modulator resolution, and axial encoding density, revealing trade-offs that govern axial separation and 3D image projection quality. We further demonstrated the capability to display volumetric images containing 28 axial slices, as well as the ability to reconfigure the axial locations of the image planes dynamically and on demand. Finally, we experimentally validated the presented approach, demonstrating close agreement between the measured results and the target images. These results establish the diffractive 3D display system as a compact and scalable framework for depth-resolved snapshot 3D image projection, with potential applications in holographic displays, AR/VR interfaces, and volumetric optical computing.
中文标题/摘要
标题:使用衍射解码器的3D图像快照投影
3D图像显示对于下一代体视成像至关重要;然而,由于衍射引起的交叉干扰迅速增加,轴向图像平面越接近,密集的深度复用3D图像投影仍然具有挑战性。在此,我们介绍了一种3D显示系统,该系统由数字编码器和衍射光学解码器组成,可以同时在多个目标轴向平面上以高轴向分辨率投影不同的图像。通过利用多层衍射波前解码和基于深度学习的端到端优化,该系统实现了高保真度的轴向分辨率3D图像快照投影,轴向间隔可达到一个波长的量级。数字编码器利用傅里叶编码网络从输入图像中捕获多尺度的空间和频域特征,集成轴向位置编码,并生成统一的相位表示,通过联合优化的衍射解码器在单个快照中同时编码所有要轴向投影的图像。我们表征了衍射解码器深度、输出衍射效率、空间光调制器分辨率和轴向编码密度的影响,揭示了影响轴向分离和3D图像投影质量的权衡。我们进一步展示了显示包含28个轴向切片的体视图像的能力,以及动态重新配置图像平面轴向位置的能力,按需进行。最后,我们通过实验验证了所提出的方法,证明了测量结果与目标图像之间的接近一致。这些结果确立了衍射3D显示系统作为紧凑且可扩展的框架,用于轴向分辨率的3D图像快照投影,具有在全息显示、AR/VR接口和体视光学计算中的潜在应用。
Summary / 总结
The research aims to address the challenge of dense depth multiplexing in 3D image projection by introducing a system that uses a digital encoder and a diffractive optical decoder. The method leverages multi-layer diffractive wavefront decoding and deep learning for end-to-end optimization, achieving high-fidelity 3D image projection in a snapshot with axial plane separations on the order of a wavelength. Key findings include the ability to project 28 axial slices and dynamically reconfigure image planes, with experimental validation showing close agreement between measured and target images.
研究旨在通过引入一种3D显示系统,解决密集深度复用在3D图像投影中的挑战,该系统使用数字编码器和衍射光学解码器。该系统采用多层衍射波前解码和基于深度学习的端到端优化,实现了高保真的深度解析3D图像投影。关键实验发现包括能够以高轴向分辨率投影28个轴向切片,并且能够动态重新配置图像平面的轴向位置。研究还分析了各种参数对轴向分离和图像质量的影响,并通过实验结果验证了该方法,实验结果与目标图像高度一致。
The Aligned Economic Index & The State Switching Model
Authors: Ilias Aarab
Venue: Financieel Forum Bank en Financiewezen 2020 3 pp 252-261
First: 2025-12-23T15:55:10+00:00 · Latest: 2025-12-23T15:55:10+00:00
Abstract
A growing empirical literature suggests that equity-premium predictability is state dependent, with much of the forecasting power concentrated around recessionary periods (Henkel et al., 2011; Dangl and Halling, 2012; Devpura et al., 2018). I study U.S. stock return predictability across economic regimes and document strong evidence of time-varying expected returns across both expansionary and contractionary states. I contribute in two ways. First, I introduce a state-switching predictive regression in which the market state is defined in real time using the slope of the yield curve. Relative to the standard one-state predictive regression, the state-switching specification increases both in-sample and out-of-sample performance for the set of popular predictors considered by Welch and Goyal (2008), improving the out-of-sample performance of most predictors in economically meaningful ways. Second, I propose a new aggregate predictor, the Aligned Economic Index, constructed via partial least squares (PLS). Under the state-switching model, the Aligned Economic Index exhibits statistically and economically significant predictive power in sample and out of sample, and it outperforms widely used benchmark predictors and alternative predictor-combination methods.
中文标题/摘要
标题:对齐经济指数与状态转换模型
越来越多的经验研究表明,股票溢价可预测性具有状态依赖性,大部分预测能力集中在衰退期 (Henkel et al., 2011; Dangl and Halling, 2012; Devpura et al., 2018)。我研究了美国股票收益在不同经济状态下的可预测性,并记录了在扩张期和收缩期预期收益时间变化的强烈证据。我有两点贡献。首先,我引入了一种状态转换预测回归模型,其中市场状态利用收益率曲线的斜率实时定义。与标准的单状态预测回归相比,状态转换模型提高了 Welch and Goyal (2008) 所考虑的一组常用预测因子的样本内和样本外表现,并以具有经济意义的方式改善了大多数预测因子的样本外表现。其次,我提出了一种新的综合预测因子,对齐经济指数,通过偏最小二乘法(PLS)构建。在状态转换模型下,对齐经济指数在样本内和样本外都表现出统计和经济上的显著预测能力,并且优于广泛使用的基准预测因子和替代预测因子组合方法。
Summary / 总结
This paper investigates U.S. stock return predictability across different economic states and introduces a state-switching predictive regression model using the slope of the yield curve. The model improves the performance of popular predictors, especially out-of-sample. Additionally, a new aggregate predictor, the Aligned Economic Index, is proposed using partial least squares, which shows statistically and economically significant predictive power both in-sample and out-of-sample, surpassing benchmark predictors and other combination methods.
该论文研究了不同经济状态下美国股票回报的可预测性,并引入了一种基于收益率曲线斜率的状态切换预测回归模型。该模型提高了常用预测因子的表现,尤其是在样本外。此外,还提出了一种新的综合预测指标——对齐经济指数,使用偏最小二乘法构建,该指标在样本内和样本外均表现出统计和经济上的显著预测能力,优于基准预测指标和其他组合方法。
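The state-switching predictive regression described in the abstract above can be illustrated with a short sketch. This is not the author's code: the function names are hypothetical, and the rule used to define the state (yield-curve slope below zero means "contraction") is an assumption made for illustration only. The abstract says only that the state is defined in real time from the slope of the yield curve.

```python
import numpy as np

def fit_state_switching(r_next, x, slope):
    """Fit one OLS predictive regression per market state.

    r_next[t] is the return realized after observing x[t] and slope[t].
    ASSUMPTION: state 1 (contraction) when the slope is negative,
    state 0 (expansion) otherwise. The paper's exact rule may differ.
    Each state needs at least two observations.
    """
    r_next, x, slope = (np.asarray(a, dtype=float) for a in (r_next, x, slope))
    state = (slope < 0).astype(int)
    coefs = {}
    for s in (0, 1):
        mask = state == s
        # Design matrix with an intercept and the lagged predictor.
        X = np.column_stack([np.ones(mask.sum()), x[mask]])
        beta, *_ = np.linalg.lstsq(X, r_next[mask], rcond=None)
        coefs[s] = beta  # [intercept, slope coefficient] for state s
    return coefs

def predict_return(coefs, x_t, slope_t):
    """Forecast the next return using the coefficients of the current state."""
    intercept, beta = coefs[int(slope_t < 0)]
    return intercept + beta * x_t

# Synthetic illustration only (no real market data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
slope = rng.normal(size=200)
r_next = np.where(slope < 0, -0.02 + 2.0 * x, 0.01 + 0.5 * x)
coefs = fit_state_switching(r_next, x, slope)
print(coefs[0], coefs[1])
```

A one-state regression would pool both regimes into a single coefficient; fitting per state lets the predictor's loading differ between expansions and contractions, which is the mechanism the abstract credits for the improved performance.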
Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding
Authors: Anh Dao, Manh Tran, Yufei Zhang, Xiaoming Liu, Zijun Cui
First: 2025-12-23T15:43:48+00:00 · Latest: 2025-12-23T15:43:48+00:00
Abstract
Human motion understanding has advanced rapidly through vision-based progress in recognition, tracking, and captioning. However, most existing methods overlook physical cues such as joint actuation forces that are fundamental in biomechanics. This gap motivates our study: if and when do physically inferred forces enhance motion understanding? By incorporating forces into established motion understanding pipelines, we systematically evaluate their impact across baseline models on 3 major tasks: gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, incorporating forces yields consistent performance gains; for example, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), with larger gains observed under challenging conditions: +2.7% when wearing a coat and +3.0% at the side view. On Gait3D, performance also increases from 46.0% to 47.3% (+1.3). In action recognition, CTR-GCN achieved +2.00% on Penn Action, while high-exertion classes like punching/slapping improved by +6.96%. Even in video captioning, Qwen2.5-VL's ROUGE-L score rose from 0.310 to 0.339 (+0.029), indicating that physics-inferred forces enhance temporal grounding and semantic richness. These results demonstrate that force cues can substantially complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
中文标题/摘要
标题:超越运动模式:基于物理力的人体运动理解实证研究
基于视觉的识别、跟踪和描述的进步使人体运动理解迅速发展。然而,大多数现有方法忽略了生物力学中至关重要的关节驱动力等物理线索。这一差距促使我们进行这项研究:物理推断的力是否以及何时能增强运动理解?通过将力整合到现有的运动理解管道中,我们系统地评估了它们在3个主要任务上的影响:步态识别、动作识别和细粒度视频描述。在8个基准测试中,整合力带来了一致的性能提升;例如,在CASIA-B上,步态识别的Rank-1准确率从89.52%提高到90.39%(+0.87),在穿着外套和侧面视角等具有挑战性条件下,性能提升更大:+2.7%和+3.0%。在Gait3D上,性能也从46.0%提高到47.3%(+1.3)。在动作识别中,CTR-GCN在Penn Action上的准确率提高了2.00%,而高用力类如拳击/拍打的准确率提高了6.96%。即使在视频描述中,Qwen2.5-VL的ROUGE-L分数也从0.310提高到0.339(+0.029),表明物理推断的力增强了时间定位和语义丰富性。这些结果表明,在动态、遮挡或外观变化条件下,力线索可以显著补充视觉和运动学特征。
Summary / 总结
This study investigates the impact of incorporating physically inferred forces into motion understanding pipelines for gait recognition, action recognition, and fine-grained video captioning. Across 8 benchmarks, the research shows consistent performance gains, with the largest improvements under challenging conditions. For instance, on CASIA-B, Rank-1 gait recognition accuracy improved from 89.52% to 90.39% (+0.87), and in action recognition, CTR-GCN gained +2.00% on Penn Action. These findings suggest that force cues can complement visual and kinematic features under dynamic, occluded, or appearance-varying conditions.
研究探讨了将物理力纳入运动理解管道的影响。受生物力学线索利用不足的驱动,研究评估了各种模型在步态识别、动作识别和细粒度视频描述任务上的表现。在多个基准测试中,整合力一致地提高了性能,在穿着外套或侧面观看等挑战性条件下尤为明显。例如,CASIA-B步态识别准确率提高了0.87%,而Penn Action动作识别提高了2.00%,特别是在高消耗动作类别中表现更佳。这些结果表明,物理力可以增强在动态、遮挡或外观变化条件下的时间定位和语义丰富性。
Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI
Authors: Muhammad Usman, Azka Rehman, Muhammad Mutti Ur Rehman, Abd Ur Rehman, Muhammad Umar Farooq
First: 2025-12-23T15:24:31+00:00 · Latest: 2025-12-23T15:24:31+00:00
Abstract
Accurate segmentation of ischemic stroke lesions from diffusion magnetic resonance imaging (MRI) is essential for clinical decision-making and outcome assessment. Diffusion-Weighted Imaging (DWI) and Apparent Diffusion Coefficient (ADC) scans provide complementary information on acute and sub-acute ischemic changes; however, automated lesion delineation remains challenging due to variability in lesion appearance.
In this work, we study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset. Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked. Based on performance, a dual-encoder TransUNet architecture is proposed to learn modality-specific representations from DWI and ADC inputs. To incorporate spatial context, adjacent slice information is integrated using a three-slice input configuration.
All models are trained under a unified framework and evaluated using the Dice Similarity Coefficient (DSC). Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set. The proposed framework offers a robust solution for automated ischemic stroke lesion segmentation from diffusion MRI.
中文标题/摘要
标题:基于双编码器变换器的多模态学习在使用弥散MRI的缺血性中风病灶分割中的应用
从弥散磁共振成像(MRI)中准确分割缺血性中风病灶对于临床决策和结果评估至关重要。弥散加权成像(DWI)和表观扩散系数(ADC)扫描提供了急性及亚急性缺血变化的互补信息;然而,由于病灶外观的变异性,自动病灶勾画仍然具有挑战性。
在这项工作中,我们研究了使用ISLES 2022数据集的多模态弥散MRI进行缺血性中风病灶分割。多种最先进的卷积和变换器架构,包括U-Net变体、Swin-UNet和TransUNet,进行了基准测试。基于性能,提出了一种双编码器TransUNet架构,从DWI和ADC输入中学习模态特定表示。为了整合空间上下文,使用三片输入配置整合了相邻切片信息。
所有模型都在统一框架下进行训练,并使用Dice相似性系数(DSC)进行评估。结果显示,基于变换器的模型优于基于卷积的基线模型,提出的双编码器TransUNet在测试集上达到85.4%的Dice分数,提供了从弥散MRI自动分割缺血性中风病灶的稳健解决方案。
Summary / 总结
This study aims to improve the accuracy of ischemic stroke lesion segmentation from diffusion MRI, which is crucial for clinical decision-making. The authors benchmark several state-of-the-art architectures and propose a dual-encoder TransUNet that learns modality-specific representations from DWI and ADC inputs, achieving a Dice score of 85.4% on the test set, outperforming convolutional baselines.
该研究旨在提高从扩散MRI中进行缺血性中风病灶分割的准确性。它评估了几种最先进的架构,并提出了一种双编码器TransUNet,从DWI和ADC输入中学习模态特定的表示。该提出的模型整合了相邻切片信息,在测试集上达到了85.4%的Dice分数,优于其他模型。
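The Dice Similarity Coefficient (DSC) used to evaluate the models in the abstract above can be written out as a short sketch. The function name is hypothetical, the inputs are assumed to be binary masks, and returning 1.0 when both masks are empty is one common convention, not something the abstract specifies.

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice Similarity Coefficient between two binary masks.

    DSC = 2 * |P intersect T| / (|P| + |T|), ranging from 0 (no overlap)
    to 1 (identical masks).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # ASSUMPTION: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denom

# Toy 2x2 masks: the prediction marks two pixels, the ground truth one of them.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(dice_coefficient(pred, target))
```

A reported Dice score of 85.4% corresponds to a value of 0.854 from a function like this, typically averaged over test cases; how the paper aggregates over slices or volumes is not stated in the abstract.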
High Dimensional Data Decomposition for Anomaly Detection of Textured Images
Authors: Ji Song, Xing Wang, Jianguo Wu, Xiaowei Yue
First: 2025-12-23T15:21:18+00:00 · Latest: 2025-12-23T15:21:18+00:00
Abstract
In the realm of diverse high-dimensional data, images play a significant role across various processes of manufacturing systems, where efficient image anomaly detection has emerged as a core technology. However, when applied to textured defect images, conventional anomaly detection methods have limitations including non-negligible misidentification, low robustness, and excessive reliance on large-scale and structured datasets. This paper proposes a texture basis integrated smooth decomposition (TBSD) approach, which is targeted at efficient anomaly detection in textured images with smooth backgrounds and sparse anomalies. A mathematical formulation of quasi-periodicity and its theoretical properties are investigated for image texture estimation. The TBSD method consists of two principal processes: the first process learns the texture basis functions to effectively extract quasi-periodic texture patterns; the subsequent anomaly detection process utilizes that texture basis as prior knowledge to prevent texture misidentification and capture potential anomalies with high accuracy. The proposed method surpasses benchmarks with less misidentification, smaller training dataset requirement, and superior anomaly detection performance on both simulation and real-world datasets.
中文标题/摘要
标题:高维数据分解在纹理图像异常检测中的应用
在多样化的高维数据领域,图像在制造系统各种过程中扮演重要角色,高效的图像异常检测已成为至关重要的核心技术。然而,当应用于纹理缺陷图像时,传统的异常检测方法存在误识别率高、鲁棒性低和对大规模结构化数据集的过度依赖等问题。本文提出了一种纹理基底整合平滑分解(TBSD)方法,旨在高效地对具有平滑背景和稀疏异常的纹理图像进行异常检测。研究了准周期性的数学公式及其理论性质,用于估计图像纹理。TBSD方法包括两个主要过程:第一个过程学习纹理基函数以有效提取准周期纹理模式;第二个异常检测过程利用这些纹理基作为先验知识,防止纹理误识别并以高精度捕捉潜在异常。所提出的方法在误识别率、训练数据集需求和异常检测性能方面均优于基准方法,适用于模拟和真实世界数据集。
Summary / 总结
This paper addresses the limitations of conventional anomaly detection methods for textured images, such as misidentification and low robustness. It introduces a texture basis integrated smooth decomposition (TBSD) approach to efficiently detect anomalies in textured images with smooth backgrounds and sparse anomalies. The method involves learning texture basis functions to extract quasi-periodic patterns and uses this knowledge for anomaly detection, achieving better performance with fewer training samples and lower misidentification rates compared to benchmarks.
本文提出了一种纹理基底集成平滑分解(TBSD)方法,以解决传统方法在检测纹理图像中的不足。该方法通过学习纹理基底函数来提取准周期纹理模式,并利用这些知识进行异常检测,从而提高鲁棒性和准确性。实验结果表明,TBSD在模拟和真实世界数据集上的表现优于现有方法,具有更少的误识别和更小的训练数据集需求。
Skin Lesion Classification Using a Soft Voting Ensemble of Convolutional Neural Networks
Authors: Abdullah Al Shafi, Abdul Muntakim, Pintu Chandra Shill, Rowzatul Zannat, Abdullah Al-Amin
First: 2025-12-23T15:20:47+00:00 · Latest: 2025-12-23T15:20:47+00:00
Comments: Authors' version of the paper published in proceedings of ECCE, DOI: https://doi.org/10.1109/ECCE64574.2025.11013422
Abstract
Skin cancer can be identified by dermoscopic examination and ocular inspection, but early detection significantly increases survival chances. Artificial intelligence (AI), using annotated skin images and Convolutional Neural Networks (CNNs), improves diagnostic accuracy. This paper presents an early skin cancer classification method using a soft voting ensemble of CNNs. In this investigation, three benchmark datasets, namely HAM10000, ISIC 2016, and ISIC 2019, were used. The process involved rebalancing, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. Accurate segmentation focused classification models on clinically significant features, reducing background artifacts and improving accuracy. Classification was performed through an ensemble of MobileNetV2, VGG19, and InceptionV3, balancing accuracy and speed for real-world deployment. The method achieved lesion recognition accuracies of 96.32%, 90.86%, and 93.92% for the three datasets. The system performance was evaluated using established skin lesion detection metrics, yielding impressive results.
中文标题/摘要
标题:使用软投票卷积神经网络集成进行皮肤病变分类
皮肤癌可以通过皮肤镜检查和眼部检查识别,但早期发现可以显著提高生存率。人工智能(AI)利用标注的皮肤图像和卷积神经网络(CNNs)提高诊断准确性。本文提出了一种使用CNNs软投票集成的早期皮肤癌分类方法。在此研究中,使用了三个基准数据集,即HAM10000、ISIC 2016和ISIC 2019。过程包括重新平衡、图像增强和过滤技术,随后通过迁移学习使用混合双编码器进行分割。准确的分割使分类模型专注于临床显著特征,减少了背景伪影并提高了准确性。分类通过MobileNetV2、VGG19和InceptionV3的集成进行,平衡了准确性和速度以适应实际部署。该方法在三个数据集上的病变识别准确率分别为96.32%,90.86%和93.92%。系统性能使用现有的皮肤病变检测指标进行评估,取得了令人印象深刻的结果。
Summary / 总结
This paper aims to improve early detection of skin cancer by developing a method using a soft voting ensemble of Convolutional Neural Networks (CNNs). The method involves rebalancing datasets, image augmentation, and filtering techniques, followed by a hybrid dual encoder for segmentation via transfer learning. The ensemble includes MobileNetV2, VGG19, and InceptionV3, achieving accuracies of 96.32%, 90.86%, and 93.92% on HAM10000, ISIC 2016, and ISIC 2019 datasets, respectively.
本文旨在通过卷积神经网络的软投票集成提高早期皮肤癌检测的准确性。该方法包括重平衡、图像增强和过滤技术,随后通过迁移学习使用混合双编码器进行分割。MobileNetV2、VGG19和InceptionV3的集成在HAM10000、ISIC 2016和ISIC 2019数据集上的皮肤病变识别准确率分别为96.32%、90.86%和93.92%,展示了在皮肤病变分类中的高性能。
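Soft voting, as named in the abstract above, averages the class probabilities produced by each model and picks the class with the highest average, rather than counting each model's top choice. The sketch below shows only that combination step, using made-up probabilities. The function name, the optional weights, and the numbers are assumptions for illustration, and no actual MobileNetV2, VGG19, or InceptionV3 model is involved.

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    Returns the predicted labels and the averaged probabilities.
    """
    probs = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))  # equal weights by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    avg = np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)
    return avg.argmax(axis=1), avg

# Hypothetical outputs of three classifiers for one image and three classes.
model_a = np.array([[0.9, 0.1, 0.0]])
model_b = np.array([[0.4, 0.6, 0.0]])
model_c = np.array([[0.4, 0.6, 0.0]])
labels, avg = soft_vote([model_a, model_b, model_c])
print(labels, avg)
```

In this toy case a hard majority vote would choose class 1 (two of three models rank it first), while soft voting chooses class 0 because one model is far more confident. That sensitivity to confidence is the usual reason for preferring soft voting when the models output reasonably calibrated probabilities.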
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2025-12-23T15:17:06+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在空间奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐非常有效。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间,而且其后期优化缺乏早期监督,仅能优化视觉质量而不能优化基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型天然适合在噪声潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的噪声潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架完全在潜在空间中进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在提高与人类偏好的对齐程度方面具有显著优势,同时与RGB ReFL相比实现了显著的内存消耗和训练时间减少。
Summary / 总结
This work addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL leverages pre-trained video generation models to optimize preferences in the noisy latent space, avoiding the need for computationally expensive VAE decoding. This approach reduces memory usage and training time while improving alignment with human preferences. Key findings include significant improvements in preference alignment and substantial reductions in memory and training time compared to traditional RGB ReFL methods.
本文提出了Process Reward Feedback Learning (PRFL) 方法来解决将奖励反馈学习 (ReFL) 应用于视频生成的问题。PRFL 利用预训练的视频生成模型在噪声的潜在空间中优化偏好,避免了昂贵的 VAE 解码。这种方法减少了内存消耗和训练时间,同时在人类偏好匹配方面优于传统的 RGB ReFL 方法。