arXiv Paper Digest

SemanticGen: Video Generation in Semantic Space

Authors: Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai

First: 2025-12-23T18:59:56+00:00 · Latest: 2025-12-23T18:59:56+00:00

Comments: Project page: https://jianhongbai.github.io/SemanticGen/

Abstract

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos. In this paper, we introduce SemanticGen, a novel solution to address these limitations by generating videos in the semantic space. Our main insight is that, due to the inherent redundancy in videos, the generation process should begin in a compact, high-level semantic space for global planning, followed by the addition of high-frequency details, rather than directly modeling a vast set of low-level video tokens using bi-directional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features, which define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space leads to faster convergence compared to the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments demonstrate that SemanticGen produces high-quality videos and outperforms state-of-the-art approaches and strong baselines.
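
The two-stage, coarse-to-fine process described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_denoise` is a stand-in that simply pulls noise toward its conditioning signal rather than a real diffusion sampler, and the function names, token counts, and upsampling factor are hypothetical and not taken from the paper.

```python
import random


def toy_denoise(x, cond, steps=10):
    """Stand-in for a diffusion sampler (assumption): moves x halfway toward cond each step."""
    for _ in range(steps):
        x = [xi + 0.5 * (ci - xi) for xi, ci in zip(x, cond)]
    return x


def stage1_semantic(prompt_embedding, num_semantic_tokens=4, seed=0):
    """Stage 1: sample a short sequence of 'semantic' features from noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0, 1) for _ in range(num_semantic_tokens)]
    # Tile the prompt embedding so it matches the semantic sequence length.
    cond = [prompt_embedding[i % len(prompt_embedding)]
            for i in range(num_semantic_tokens)]
    return toy_denoise(noise, cond)


def stage2_latents(semantic, upsample=8, seed=1):
    """Stage 2: sample a longer 'latent' sequence conditioned on the semantic features."""
    rng = random.Random(seed)
    # Each semantic feature conditions a block of `upsample` latent positions.
    cond = [s for s in semantic for _ in range(upsample)]
    noise = [rng.gauss(0, 1) for _ in cond]
    return toy_denoise(noise, cond)


def generate(prompt_embedding):
    """Run both stages; a real system would then decode the latents to pixels."""
    semantic = stage1_semantic(prompt_embedding)
    latents = stage2_latents(semantic)
    return semantic, latents
```

The point of the sketch is only the shape of the computation: the first stage works on a short sequence (4 values here), and the second stage expands it into a longer one (32 values here) while staying anchored to the first stage's output.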

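Title / Abstract (translated from Chinese)

Title: SemanticGen: Video Generation in Semantic Space

State-of-the-art video generation models typically learn the distribution of video latents in the VAE space and map them to pixels with a VAE decoder. Although this approach can generate high-quality videos, it converges slowly and is computationally expensive when generating long videos. In this paper, we propose SemanticGen, a novel solution that addresses these problems by generating videos in the semantic space. Our main insight is that, because of the inherent redundancy in videos, generation should begin in a compact, high-level semantic space for global planning and then add high-frequency details, rather than directly modeling a large number of low-level video tokens with bidirectional attention. SemanticGen adopts a two-stage generation process. In the first stage, a diffusion model generates compact semantic video features that define the global layout of the video. In the second stage, another diffusion model generates VAE latents conditioned on these semantic features to produce the final output. We observe that generation in the semantic space converges faster than generation in the VAE latent space. Our method is also effective and computationally efficient when extended to long video generation. Extensive experiments show that SemanticGen produces high-quality videos and outperforms state-of-the-art methods and strong baselines.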

Summary

SemanticGen addresses the limitations of existing video generative models by generating videos in a semantic space, which leads to faster convergence and computational efficiency, especially for long videos. It uses a two-stage process where a diffusion model first generates compact semantic video features for global planning, followed by another diffusion model to generate VAE latents conditioned on these features to produce the final video. Experiments show that SemanticGen outperforms state-of-the-art approaches in generating high-quality videos.

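SemanticGen addresses the limitations of existing video generation models by generating videos in semantic space, which yields faster convergence and higher computational efficiency, particularly for long video generation. It uses a two-stage process: a diffusion model first generates compact semantic video features for global planning, and a second diffusion model then generates VAE latents conditioned on those features. Experiments show that SemanticGen produces high-quality videos and outperforms state-of-the-art methods.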

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen

First: 2025-12-23T18:59:49+00:00 · Latest: 2025-12-23T18:59:49+00:00

Abstract

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
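
The coordination pattern in the abstract (a master agent that calls a grounding agent and a vision agent under a step limit) can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `policy`, `ground`, and `look` callables, the action names, and the `Trajectory` record are all hypothetical, and the scripted policy stands in for a trained LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    """Readable record of what the master agent did (hypothetical structure)."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    answer: str = ""


def run_master(question: str,
               policy: Callable[[str, List[Tuple[str, str]]], Tuple[str, Optional[str]]],
               ground: Callable[[str], str],
               look: Callable[[Optional[str], str], str],
               max_steps: int = 5) -> Trajectory:
    """Let the master policy call sub-agents until it answers or the step limit is hit."""
    traj = Trajectory()
    clip: Optional[str] = None
    for _ in range(max_steps):
        kind, payload = policy(question, traj.steps)
        if kind == "answer":
            traj.answer = payload or ""
            break
        if kind == "ground":
            # Grounding agent: localize the segment relevant to the question.
            clip = ground(question)
            traj.steps.append(("ground", clip))
        elif kind == "look":
            # Vision agent: return a textual observation about the current clip.
            traj.steps.append(("look", look(clip, question)))
        else:
            raise ValueError(f"unknown action: {kind}")
    return traj


def scripted_policy(question, steps):
    """Stand-in for the master LLM: ground once, look once, then answer."""
    if not steps:
        return ("ground", None)
    if len(steps) == 1:
        return ("look", None)
    return ("answer", steps[-1][1])
```

If the step limit is reached before the policy answers, the trajectory is returned with an empty answer, which makes the budget explicit and keeps the recorded steps inspectable.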

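Title / Abstract (translated from Chinese)

Title: LongVideoAgent: Multi-Agent Reasoning with Long Videos

Recent multimodal LLMs and tool-using systems for long-video question answering suggest that reasoning over hour-long episodes is possible. However, many methods still compress content into lossy summaries or rely on limited toolsets, which weakens temporal grounding and misses fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips through grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens the reasoning and planning of the trained agent. Code and data will be shared at https://longvideoagent.github.io/.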

Summary

The research aims to improve long-video question answering by developing a multi-agent framework in which a master language model coordinates a grounding agent and a vision agent. The master agent plans with a step limit and is trained with reinforcement learning to enhance multi-agent cooperation. The system significantly outperforms strong non-agent baselines on the LongTVQA and LongTVQA+ datasets, and experiments show that reinforcement learning further strengthens the trained agent's reasoning and planning. The authors state that code and data will be shared at https://longvideoagent.github.io/.

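The research aims to improve long-video question answering by developing a multi-agent framework in which a master language model coordinates a grounding agent and a vision agent. The system is trained with reinforcement learning to strengthen cooperation and to focus on relevant video clips, and it outperforms non-agent baselines on episode-level datasets. Reinforcement learning also improves the agent's reasoning and planning.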

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

First: 2025-12-23T18:59:46+00:00 · Latest: 2025-12-23T18:59:46+00:00

Comments: webpage: https://spatialtree.github.io/

Abstract

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

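Title / Abstract (translated from Chinese)

Title: SpatialTree: How Spatial Abilities Branch Out in MLLMs

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs) this hierarchy remains poorly understood, because most studies focus on a small set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we build the first capability-centric hierarchical benchmark and thoroughly evaluate mainstream MLLMs across 27 sub-abilities. The results reveal a clear structure: L1 skills are largely independent of one another, whereas higher-level skills are strongly correlated, indicating increasing interdependence. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low-level to high-level abilities, with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive reinforcement learning that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling reinforcement learning to consistently improve performance at all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.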

Summary

The research aims to understand how spatial abilities develop in multimodal LLMs (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). The study evaluates mainstream MLLMs across 27 sub-abilities and finds that L1 skills are largely orthogonal, while higher-level skills are strongly correlated. Targeted supervised fine-tuning reveals negative transfer within L1 but strong cross-level transfer from low- to high-level abilities. The research also finds that naive reinforcement learning (RL) that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception, and it proposes an auto-think strategy that suppresses unnecessary deliberation so that RL improves performance across all levels of the hierarchy.

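The research aims to understand how spatial abilities develop in multimodal language models (MLLMs) by proposing SpatialTree, a cognitive-science-inspired hierarchy that divides spatial abilities into four levels: perception, mental mapping, simulation, and agentic competence. The study evaluates mainstream MLLMs on 27 sub-abilities and finds that lower-level abilities are mutually independent, whereas higher-level abilities are strongly correlated. Targeted fine-tuning shows negative transfer among lower-level abilities but strong, synergistic transfer from lower-level to higher-level abilities. The study also explores how to use reinforcement learning (RL) to improve these abilities and proposes an auto-think strategy that suppresses unnecessary deliberation, consistently improving performance at all levels.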

Active Intelligence in Video Avatars via Closed-loop World Modeling

Authors: Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen

First: 2025-12-23T18:59:16+00:00 · Latest: 2025-12-23T18:59:16+00:00

Comments: Project Page: https://xuanhuahe.github.io/ORCA/

Abstract

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture where System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
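
The closed-loop idea behind the OTAR cycle (predict an outcome, act, then check the actual generation against the prediction before moving on) can be shown with a toy loop. This sketch is an illustration under stated assumptions, not ORCA itself: the state is just a list of completed subgoals, `make_flaky_generator` is a hypothetical stand-in for a stochastic video generator, and the dual-system split and POMDP formulation are not modeled.

```python
def run_otar(subgoals, act, max_steps=20):
    """Toy Observe-Think-Act-Reflect loop over an ordered list of subgoals."""
    belief = []   # Observe: the agent's current estimate of completed subgoals
    log = []      # (subgoal, verified) pairs, one per attempt
    idx = 0
    for _ in range(max_steps):
        if idx == len(subgoals):
            break
        goal = subgoals[idx]
        predicted = belief + [goal]        # Think: state expected if the action succeeds
        actual = act(belief, goal)         # Act: the generator returns the resulting state
        verified = actual == predicted     # Reflect: compare prediction with outcome
        log.append((goal, verified))
        belief = actual                    # Update the belief from what was generated
        if verified:
            idx += 1                       # Advance only after a verified success
    return belief, log


def make_flaky_generator():
    """Hypothetical generator that fails the first attempt at each subgoal."""
    attempts = {}

    def act(state, goal):
        attempts[goal] = attempts.get(goal, 0) + 1
        if attempts[goal] == 1:
            return list(state)             # Generation missed: state is unchanged
        return list(state) + [goal]        # Generation matched the requested action

    return act
```

An open-loop controller would assume each action succeeded and move to the next subgoal after one attempt. The reflect step here is what triggers a retry when the generated outcome does not match the prediction.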

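Title / Abstract (translated from Chinese)

Title: Active Intelligence in Video Avatars via Closed-loop World Modeling

Current video avatar generation methods perform well at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive interaction with the environment. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework that gives video avatars active intelligence. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that keeps state tracking robust under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture in which System 2 performs strategic reasoning and state prediction while System 1 translates abstract plans into concrete, model-specific action instructions. By modeling avatar control as a POMDP and implementing continuous belief updating based on outcome verification, ORCA enables avatars to complete multi-step tasks autonomously in open-domain scenarios. Extensive experiments show that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-based design and advancing video avatar intelligence from passive animation to active, goal-oriented behavior.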

Summary

The research addresses the lack of genuine agency in current video avatar generation methods by introducing L-IVA and ORCA. L-IVA is a task and benchmark for evaluating goal-directed planning, while ORCA is a framework enabling active intelligence in video avatars through a closed-loop OTAR cycle and a hierarchical dual-system architecture. ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, demonstrating the effectiveness of its internal world model capabilities.

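The research aims to make video avatars more intelligent by enabling them to pursue long-term goals autonomously and interact adaptively. It introduces L-IVA and ORCA; ORCA uses a closed-loop OTAR cycle and a hierarchical dual-system architecture to keep state tracking robust and to carry out strategic reasoning. Experiments show that ORCA outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating the effectiveness of the proposed framework.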

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

Authors: Yedi Zhang, Andrew Saxe, Peter E. Latham

First: 2025-12-23T18:55:30+00:00 · Latest: 2025-12-23T18:55:30+00:00