arXiv Paper Digest

Snapshot: 20260329_0342

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue

First: 2026-03-26T17:59:59+00:00 · Latest: 2026-03-26T17:59:59+00:00

Comments: Project Page: https://luo0207.github.io/ShotStream/ Code: https://github.com/KlingAIResearch/ShotStream

Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our

Summary

ShotStream is a causal multi-shot architecture for interactive storytelling that targets the limited interactivity and high latency of bidirectional models. It recasts the task as next-shot generation conditioned on historical context, so users can steer an ongoing narrative with streaming prompts. A dual-cache memory mechanism, together with a RoPE discontinuity indicator, maintains visual coherence, and a two-stage distillation strategy reduces error accumulation. In the reported experiments, ShotStream generates coherent multi-shot videos with sub-second latency at 16 FPS on a single GPU, and matches or exceeds the quality of slower bidirectional models.
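
As a rough illustration of the dual-cache idea in the ShotStream abstract, the Python sketch below keeps a global cache of conditional frames from earlier shots and a local cache of frames generated in the current shot. The class name `DualCache`, the `gap` offset that mimics a positional discontinuity between the two caches, and the rule of promoting only the last frame of a finished shot are all assumptions made for illustration; the abstract does not say how ShotStream implements these details.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DualCache:
    """Toy dual-cache memory: global (inter-shot) plus local (intra-shot)."""
    gap: int = 1000  # hypothetical position offset separating the two caches
    global_frames: List[str] = field(default_factory=list)
    local_frames: List[str] = field(default_factory=list)

    def add_generated(self, frame: str) -> None:
        # Frames generated within the current shot go to the local cache.
        self.local_frames.append(frame)

    def start_new_shot(self) -> None:
        # Assumption: promote the last frame of the finished shot to the
        # global cache as a conditional frame, then reset the local cache.
        if self.local_frames:
            self.global_frames.append(self.local_frames[-1])
        self.local_frames = []

    def context(self) -> List[Tuple[int, str]]:
        # Global frames get positions 0..g-1; local frames start after a gap,
        # so the two caches occupy disjoint position ranges.
        g = len(self.global_frames)
        out = [(i, f) for i, f in enumerate(self.global_frames)]
        out += [(g + self.gap + j, f) for j, f in enumerate(self.local_frames)]
        return out


cache = DualCache()
for name in ["s1_f1", "s1_f2"]:
    cache.add_generated(name)
cache.start_new_shot()
cache.add_generated("s2_f1")
print(cache.context())
```

The position gap is only a stand-in for the idea that a model should be able to tell the two caches apart; a real system would encode this inside its rotary position embedding.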

Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Authors: Yixing Lao, Xuyang Bai, Xiaoyang Wu, Nuoyuan Yan, Zixin Luo, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Shiwei Li, Hengshuang Zhao

First: 2026-03-26T17:59:59+00:00 · Latest: 2026-03-26T17:59:59+00:00

Abstract

Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/

Summary

The paper addresses a scalability limit of feed-forward 3D Gaussian Splatting: because existing methods predict pixel-aligned primitives, the primitive count grows quadratically with resolution, which makes high-resolution synthesis such as 4K intractable. The proposed LGTM framework predicts compact Gaussian primitives coupled with per-primitive textures, decoupling geometric complexity from rendering resolution. This enables high-fidelity 4K novel view synthesis without per-scene optimization while using significantly fewer Gaussian primitives.
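
The quadratic growth mentioned in the LGTM abstract can be checked with a little arithmetic. The sketch below assumes one primitive per pixel per input view and takes "4K" to mean 3840x2160; both are illustrative assumptions, and the function name is hypothetical.

```python
def pixel_aligned_primitives(width: int, height: int, views: int = 1) -> int:
    """One primitive per pixel per input view (assumed pixel-aligned setting)."""
    return width * height * views


full_hd = pixel_aligned_primitives(1920, 1080)
uhd_4k = pixel_aligned_primitives(3840, 2160)  # assumed meaning of "4K"
print(full_hd, uhd_4k, uhd_4k // full_hd)
```

Doubling each side of the image quadruples the primitive count, which is the scaling that per-primitive textures are meant to avoid.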

RefAlign: Representation Alignment for Reference-to-Video Generation

Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang

First: 2026-03-26T17:59:57+00:00 · Latest: 2026-03-26T17:59:57+00:00

Comments: 17 pages, 11 figures

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy--paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.

Summary

RefAlign is a representation alignment framework for reference-to-video generation. It explicitly aligns the reference-branch features of the diffusion Transformer (DiT) to the semantic space of a visual foundation model, which improves identity consistency and semantic discriminability. The alignment loss is applied only during training and adds no inference-time overhead. The method achieves a better balance between text controllability and reference fidelity, and outperforms current state-of-the-art methods in TotalScore on the OpenS2V-Eval benchmark.
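
The pull-together, push-apart behavior of a reference alignment loss can be sketched with a generic contrastive objective. The abstract does not give the exact form of RefAlign's loss, so the InfoNCE-style cross-entropy below, the cosine similarity, the `temperature` value, and the toy 2-D features are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def alignment_loss(ref: List[Sequence[float]], vfm: List[Sequence[float]],
                   temperature: float = 0.1) -> float:
    """Mean cross-entropy where ref[i] should match vfm[i] (same subject).

    Assumed InfoNCE-style form, not the loss defined in the paper.
    """
    total = 0.0
    for i, r in enumerate(ref):
        logits = [cosine(r, v) / temperature for v in vfm]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax of the positive pair
    return total / len(ref)


vfm_feats = [[1.0, 0.0], [0.0, 1.0]]   # toy VFM features for two subjects
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each reference near its own subject
swapped = [[0.1, 0.9], [0.9, 0.1]]     # each reference near the other subject
print(alignment_loss(aligned, vfm_feats) < alignment_loss(swapped, vfm_feats))
```

The loss is small when each reference feature sits close to the VFM feature of the same subject and far from the others, and large when subjects are confused.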

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Authors: Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li

Venue: CVPR 2026

First: 2026-03-26T17:59:54+00:00 · Latest: 2026-03-26T17:59:54+00:00

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: https://dmw-cvpr.github.io/

Abstract

Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.

Summary

Drive My Way (DMW) is a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time instructions. It learns a user embedding from a personalized driving dataset collected across multiple real drivers, conditions the driving policy on that embedding during planning, and uses natural language instructions for short-term guidance. Closed-loop evaluation on Bench2Drive shows improved adaptation to style instructions, and user studies show that the generated behaviors are recognizable as each driver's own style.

PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

Authors: Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao

Venue: CVPR 2026

First: 2026-03-26T17:59:51+00:00 · Latest: 2026-03-26T17:59:51+00:00

Comments: CVPR 2026, Project Page: https://henghuiding.com/PSDesigner/

Abstract

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose PSDesigner, an automated graphic design system that emulates the creative workflow of human designers. Building upon multiple specialized components, PSDesigner collects theme-related assets based on user instructions, and autonomously infers and executes tool calls to manipulate design files, such as integrating new assets or refining inferior elements. To endow the system with strong tool-use capabilities, we construct a design dataset, CreativePSD, which contains a large amount of high-quality PSD design files annotated with operation traces across a wide range of design scenarios and artistic styles, enabling models to learn expert design procedures. Extensive experiments demonstrate that PSDesigner outperforms existing methods across diverse graphic design tasks, empowering non-specialists to conveniently create production-quality designs.

Summary

PSDesigner is an automated graphic design system that emulates the creative workflow of human designers to translate user intentions into editable design files. It collects theme-related assets based on user instructions and autonomously infers and executes tool calls to manipulate design files, for example by integrating new assets or refining inferior elements. Its tool-use capabilities are learned from CreativePSD, a dataset of high-quality PSD design files annotated with operation traces. In the reported experiments, PSDesigner outperforms existing methods across diverse graphic design tasks.

MegaFlow: Zero-Shot Large Displacement Optical Flow

Authors: Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu

First: 2026-03-26T17:59:51+00:00 · Latest: 2026-03-26T17:59:51+00:00

Comments: Project Page: https://kristen-z.github.io/projects/megaflow Code: https://github.com/cvg/megaflow

Abstract

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.

Summary

MegaFlow targets zero-shot estimation of large-displacement optical flow, where methods that rely on iterative local search or domain-specific fine-tuning perform poorly. It formulates flow estimation as a global matching problem over pre-trained global Vision Transformer features, followed by a few lightweight iterative refinements for sub-pixel accuracy. In the reported experiments, MegaFlow achieves state-of-the-art zero-shot performance on multiple optical flow benchmarks and highly competitive zero-shot performance on long-range point tracking benchmarks, which suggests robust transferability.
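
To make the phrase "flow estimation as a global matching problem" concrete, the sketch below matches every source location against all target locations, with no local search window, and reads the flow off as the displacement to the best match. It is a minimal illustration under stated assumptions: the features are toy vectors rather than Vision Transformer features, the match is a hard argmax over dot products, and the refinement stage from the abstract is omitted. The function name is hypothetical.

```python
from typing import Dict, List, Tuple

Feature = List[float]
Grid = Dict[Tuple[int, int], Feature]  # (x, y) -> feature vector


def dot(a: Feature, b: Feature) -> float:
    return sum(x * y for x, y in zip(a, b))


def global_match_flow(src: Grid, tgt: Grid) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """For every source location, search all target locations (no local window)."""
    flow = {}
    for (sx, sy), f in src.items():
        # Hard argmax over the whole target grid (an assumed simplification).
        tx, ty = max(tgt, key=lambda p: dot(f, tgt[p]))
        flow[(sx, sy)] = (tx - sx, ty - sy)
    return flow


e = [[1.0, 0.0], [0.0, 1.0]]                    # toy per-pixel features
src = {(0, 0): e[0], (1, 0): e[1]}
tgt = {(x, 0): [0.0, 0.0] for x in range(8)}
tgt[(5, 0)], tgt[(6, 0)] = e[0], e[1]           # same content, shifted by 5 px
flow = global_match_flow(src, tgt)
print(flow)
```

Because the search covers the whole target grid, the 5-pixel shift is recovered even though it is larger than the source patch itself, which is the property that makes global matching attractive for large displacements.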

How good was my shot? Quantifying Player Skill Level in Table Tennis

Authors: Akihiro Kubota, Tomoya Hasegawa, Ryo Kawahara, Ko Nishino

First: 2026-03-26T17:59:49+00:00 · Latest: 2026-03-26T17:59:49+00:00

Abstract

Gauging an individual's skill level is crucial, as it inherently shapes their behavior. Quantifying skill, however, is challenging because it is latent to the observed actions. To explore skill understanding in human behavior, we focus on dyadic sports -- specifically table tennis -- where skill manifests not just in complex movements, but in the subtle nuances of execution conditioned on game context. Our key idea is to learn a generative model of each player's tactical racket strokes and jointly embed them in a common latent space that encodes individual characteristics, including those pertaining to skill levels. By training these player models on a large-scale dataset of 3D-reconstructed professional matches and conditioning them on comprehensive game context -- including player positioning and opponent behaviors -- the models capture individual tactical identities within their latent space. We probe this learned player space and find that it reflects distinct play styles and attributes that collectively represent skill. By training a simple relative ranking network on these embeddings, we demonstrate that both relative and absolute skill predictions can be achieved. These results demonstrate that the learned player space effectively quantifies skill levels, providing a foundation for automated skill assessment in complex, interactive behaviors.

Summary

The paper quantifies player skill in table tennis by learning a generative model of each player's tactical racket strokes and jointly embedding these models in a common latent space. The player models are trained on a large-scale dataset of 3D-reconstructed professional matches and conditioned on game context such as player positioning and opponent behavior. The learned player space reflects distinct play styles and attributes, and a simple relative ranking network trained on the embeddings yields both relative and absolute skill predictions.
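
A relative ranking model over player embeddings can be sketched as a scorer trained on "player i is stronger than player j" pairs. The abstract only says the ranking network is simple, so the linear scorer, the pairwise logistic loss, the learning rate, and the toy 2-D embeddings below are assumptions for illustration.

```python
import math
from typing import List, Tuple


def train_ranker(emb: List[List[float]], pairs: List[Tuple[int, int]],
                 lr: float = 0.5, epochs: int = 200) -> List[float]:
    """Fit w so that score(i) > score(j) for every (i, j) pair (i is stronger)."""
    w = [0.0] * len(emb[0])
    for _ in range(epochs):
        for i, j in pairs:
            diff = [a - b for a, b in zip(emb[i], emb[j])]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            # Gradient of -log(sigmoid(margin)) w.r.t. w is -(1 - sigmoid) * diff.
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            w = [wk + lr * g * dk for wk, dk in zip(w, diff)]
    return w


def score(w: List[float], x: List[float]) -> float:
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical embeddings for three players, ordered strongest to weakest.
emb = [[0.9, 0.2], [0.5, 0.4], [0.1, 0.3]]
w = train_ranker(emb, pairs=[(0, 1), (1, 2)])
print([round(score(w, x), 3) for x in emb])
```

Only relative comparisons are used for training, yet the learned score places all players on one scale, which is one way relative supervision can give rise to absolute skill estimates.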

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Authors: Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang

First: 2026-03-26T17:59:49+00:00 · Latest: 2026-03-26T17:59:49+00:00

Comments: 15 pages

Abstract

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.

Summary

The paper argues that the knowledge base of a RAG system, which is typically assembled once and never revised, should be treated as a trainable component. WriteBack-RAG uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because only the corpus changes, the method runs once as an offline preprocessing step and combines with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, it improves every evaluated setting, with gains averaging +2.14%, and the distilled knowledge also benefits RAG pipelines other than the one used to produce it.
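
The write-back loop described in the abstract (find where retrieval succeeds, isolate the relevant documents, distill them, index the result alongside the corpus) can be sketched end to end with toy components. Everything concrete here is an assumption: the word-overlap retriever, the success test (the answer string appears in a retrieved document), and the sentence filter that stands in for LLM-based distillation.

```python
from typing import Dict, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def distill(docs: List[str], answer: str) -> str:
    """Stand-in for LLM distillation: keep only sentences mentioning the answer."""
    keep = [s.strip() for d in docs for s in d.split(".") if answer.lower() in s.lower()]
    return ". ".join(keep) + "."


def write_back(corpus: List[str], labeled: List[Dict[str, str]]) -> List[str]:
    enriched = list(corpus)  # the original corpus is kept intact
    for ex in labeled:
        hits = retrieve(ex["question"], corpus)
        relevant = [d for d in hits if ex["answer"].lower() in d.lower()]
        if relevant:  # assumed success test: a retrieved document has the answer
            enriched.append(distill(relevant, ex["answer"]))
    return enriched


corpus = [
    "Paris is the capital of France. The city hosts many museums.",
    "Berlin has a large zoo. Berlin is the capital of Germany.",
    "Bananas are yellow.",
]
labeled = [{"question": "capital of France", "answer": "Paris"}]
enriched = write_back(corpus, labeled)
print(enriched[-1])
```

Because the function only returns a larger corpus, any retriever or generator can be pointed at `enriched` afterwards, which mirrors the claim that the method is an offline preprocessing step.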

Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

Authors: Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui

First: 2026-03-26T17:59:34+00:00 · Latest: 2026-03-26T17:59:34+00:00

Comments: Project Page: http://ziyinwang1.github.io/LIGHT

Abstract

Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.

Summary

The paper proposes LIGHT, a data-driven approach to human-object interaction (HOI) animation in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, LIGHT factors the representation into modality-specific components with individualized noise levels and asynchronous denoising schedules, so that cleaner components guide noisier ones through cross-attention. This guidance is contact-aware and is further enhanced by training with a broad spectrum of synthetic object geometries. Experiments report higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks than conventional classifier-free guidance.
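
One way to picture asynchronous denoising schedules is to give two components their own noise level at every sampling step, with one component running ahead of the other. The sketch below is only an illustration of that idea: the linear schedule, the `lead` offset, and the leader/follower naming are assumptions, and the abstract does not state which modality is denoised faster.

```python
from typing import List, Tuple


def async_schedule(num_steps: int, lead: int) -> List[Tuple[float, float]]:
    """Return (leader_noise, follower_noise) per step, each in [0, 1].

    The leader starts denoising immediately; the follower waits `lead` steps,
    so at every step the leader is at least as clean as the follower.
    """
    sched = []
    for step in range(num_steps + lead + 1):
        leader = max(0.0, 1.0 - step / num_steps)
        follower = max(0.0, min(1.0, 1.0 - (step - lead) / num_steps))
        sched.append((leader, follower))
    return sched


sched = async_schedule(num_steps=4, lead=2)
for leader, follower in sched:
    print(f"leader={leader:.2f} follower={follower:.2f}")
```

At every step the leader's noise level is no higher than the follower's, so the cleaner component is always available to condition the noisier one.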

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Authors: Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi

Venue: CVPR 2026 (GRAIL-V workshop)

First: 2026-03-26T17:59:31+00:00 · Latest: 2026-03-26T17:59:31+00:00

Comments: Accepted to GRAIL-V workshop at CVPR 2026

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

Summary

SlotVTG steers Multimodal Large Language Models (MLLMs) toward object-centric, input-grounded visual reasoning for Video Temporal Grounding (VTG), where task-specific fine-tuning otherwise leads to poor Out-of-Domain (OOD) generalization. It introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, with objectness priors from a self-supervised vision model encouraging semantically coherent slots. Cross-domain evaluation shows significantly improved OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
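
The core step of slot attention, in which slots compete for input tokens, can be shown in a few lines. The version below is deliberately stripped down and is not the SlotVTG adapter: it has no learned projections, GRU update, MLP, or reconstruction branch, and the toy tokens and initial slots are assumptions.

```python
import math
from typing import List

Vec = List[float]


def slot_attention(tokens: List[Vec], slots: List[Vec], iters: int = 3) -> List[Vec]:
    """Simplified slot attention: no learned projections, GRU, or MLP."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    for _ in range(iters):
        # Softmax over slots for each token, so slots compete for tokens.
        attn = []
        for t in tokens:
            logits = [scale * sum(a * b for a, b in zip(t, s)) for s in slots]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            attn.append([e / z for e in exps])
        # Each slot becomes the attention-weighted mean of the tokens.
        new_slots = []
        for k in range(len(slots)):
            w = [attn[i][k] for i in range(len(tokens))]
            total = sum(w) + 1e-8
            new_slots.append([sum(w[i] * tokens[i][d] for i in range(len(tokens))) / total
                              for d in range(dim)])
        slots = new_slots
    return slots


# Two toy "objects": tokens clustered along the x axis and along the y axis.
tokens = [[5.0, 0.0], [4.8, 0.2], [0.0, 5.0], [0.2, 4.8]]
slots = slot_attention(tokens, [[1.0, 0.0], [0.0, 1.0]])
print(slots)
```

Normalizing the attention over slots rather than over tokens is what makes each slot specialize on one group of tokens.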

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Authors: Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo

First: 2026-03-26T17:59:16+00:00 · Latest: 2026-03-26T17:59:16+00:00

Abstract

Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.

Summary

BizGenEval is a systematic benchmark for commercial visual content generation, motivated by existing benchmarks that focus mainly on natural image synthesis. It spans five document types and four capability dimensions, forming 20 evaluation tasks with 400 curated prompts and 8000 human-verified checklist questions. Benchmarking 26 popular image generation systems reveals substantial gaps between current generative models and the requirements of professional visual content creation.
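
The benchmark's task grid follows directly from the numbers in the abstract: five document types crossed with four capability dimensions. The short sketch below enumerates that grid. The per-task and per-prompt figures are averages that assume an even split, which the abstract does not state.

```python
from itertools import product

doc_types = ["slides", "charts", "webpages", "posters", "scientific figures"]
capabilities = ["text rendering", "layout control", "attribute binding",
                "knowledge-based reasoning"]

tasks = list(product(doc_types, capabilities))
prompts_per_task = 400 / len(tasks)   # average, assuming an even split
questions_per_prompt = 8000 / 400     # average checklist length
print(len(tasks), prompts_per_task, questions_per_prompt)
```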

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang

First: 2026-03-26T17:59:05+00:00 · Latest: 2026-03-26T17:59:05+00:00

Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. Code: https://github.com/ShandaAI/PackForcing

中文标题/摘要

标题：PackForcing：短视频培训足以实现长视频采样和长上下文推理

自回归视频扩散模型取得了显著进展，但仍受限于难以处理的线性KV缓存增长、时间重复和长视频生成中的累积误差。为解决这些挑战，我们提出了PackForcing，这是一种通过新颖的三部分KV缓存策略高效管理生成历史的统一框架。具体而言，我们将历史上下文分为三种类型：（1）Sink令牌，保留早期锚定帧的全分辨率以保持全局语义；（2）Mid令牌，通过结合逐层3D卷积和低分辨率VAE重新编码的双分支网络实现大规模时空压缩（32倍令牌减少）；（3）Recent令牌，保持全分辨率以确保局部时间连贯性。为了在不牺牲质量的情况下严格限制内存占用，我们引入了Mid令牌的动态top-$k$上下文选择机制，并结合连续的时空RoPE调整，无缝地重新对齐由于丢失令牌而产生的位置间隙，几乎没有额外开销。借助这种有原则的分层上下文压缩，PackForcing可以在单个H200 GPU上生成16 FPS、时长2分钟、分辨率为832x480的连贯视频。它实现了仅4 GB的KV缓存，并能够实现显著的24倍时间外推（5秒到120秒），无论是零样本还是仅训练5秒片段即可有效运行。VBench上的大量结果表明，PackForcing在时间一致性（26.07）和动态程度（56.25）方面达到了最先进的水平，证明了短视频监督足以实现高质量的长视频合成。

Summary / 总结

PackForcing addresses the challenges of long-video generation in autoregressive video diffusion models by introducing a three-partition KV-cache strategy. It categorizes historical context into sink, mid, and recent tokens to manage memory efficiently while maintaining quality. PackForcing can generate 2-minute videos at 16 FPS on a single H200 GPU with a 4 GB KV cache and achieves 24x temporal extrapolation, demonstrating that short-video training is sufficient for long-video synthesis with high temporal consistency and dynamic degree.

PackForcing通过引入三部分KV缓存策略来解决自回归视频扩散模型在长视频生成中的挑战。它将历史上下文分为sink、mid和recent三个部分以高效管理内存同时保持质量。PackForcing可以在单个H200 GPU上以16 FPS的速度生成2分钟的视频，使用4 GB的KV缓存，并实现24倍的时间外推，证明了短视频监督足以用于高质量的长视频合成，具有出色的时间一致性和动态程度。
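
The three-partition cache described in this entry can be illustrated with a small sketch. It is not the authors' implementation: the function name `select_context`, the partition sizes, and the per-frame relevance scores are all hypothetical, and the contiguous re-indexing is only a simple stand-in for the Temporal RoPE Adjustment mentioned in the abstract.

```python
from typing import List, Tuple


def select_context(
    num_frames: int,
    mid_scores: List[float],
    n_sink: int = 2,
    n_recent: int = 4,
    top_k: int = 3,
) -> Tuple[List[int], List[int]]:
    """Pick which history frames stay in a bounded cache.

    Returns (kept_frame_ids, contiguous_positions). All sizes are
    hypothetical; mid_scores holds one assumed relevance score per mid frame.
    """
    sink = list(range(min(n_sink, num_frames)))
    recent_start = max(len(sink), num_frames - n_recent)
    recent = list(range(recent_start, num_frames))
    mid = list(range(len(sink), recent_start))
    if len(mid_scores) != len(mid):
        raise ValueError("need exactly one score per mid frame")

    # Keep the top_k highest-scoring mid frames, then restore temporal order.
    ranked = sorted(mid, key=lambda f: mid_scores[f - len(sink)], reverse=True)
    kept_mid = sorted(ranked[:top_k])

    kept = sink + kept_mid + recent
    # Re-index positions so that dropped frames leave no gaps.
    positions = list(range(len(kept)))
    return kept, positions


kept, positions = select_context(
    num_frames=12,
    mid_scores=[0.1, 0.9, 0.3, 0.8, 0.2, 0.7],
)
print(kept)       # [0, 1, 3, 5, 7, 8, 9, 10, 11]
print(positions)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In this sketch the number of kept frames never exceeds `n_sink + top_k + n_recent`, however long the history grows, which is the sense in which such a cache is bounded.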

PixelSmile: Toward Fine-Grained Facial Expression Editing

Authors: Jiabin Hua, Hengyuan Xu, Aojie Li, Wei Cheng, Gang Yu, Xingjun Ma, Yu-Gang Jiang

First: 2026-03-26T17:59:04+00:00 · Latest: 2026-03-26T17:59:04+00:00

Comments: 21 Pages; Project Page: https://ammmob.github.io/PixelSmile/ Code: https://github.com/Ammmob/PixelSmile

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

中文标题/摘要

标题：PixelSmile：朝向精细粒度面部表情编辑

精细粒度的面部表情编辑长期以来受限于固有的语义重叠。为了解决这一问题，我们构建了Flex面部表情（FFE）数据集，包含连续的情感注释，并建立了FFE-Bench来评估结构混淆、编辑准确性、线性可控性和表情编辑与身份保留之间的权衡。我们提出了PixelSmile，一种通过完全对称联合训练来分离表情语义的扩散框架。PixelSmile结合强度监督与对比学习，产生更强且更可区分的表情，通过文本潜在空间插值实现精确且稳定的线性表情控制。大量实验表明，PixelSmile实现了更好的分离和稳健的身份保留，证实了其在连续、可控和精细粒度表情编辑方面的有效性，同时自然支持平滑的表情过渡。

Summary / 总结

The research addresses fine-grained facial expression editing, which has long been limited by intrinsic semantic overlap. The authors construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and develop FFE-Bench for evaluation. They propose PixelSmile, a diffusion framework that disentangles expression semantics through fully symmetric joint training, combining intensity supervision with contrastive learning to achieve precise and stable linear expression control. Experiments show that PixelSmile achieves superior disentanglement and robust identity preservation while supporting smooth expression blending and continuous, controllable editing.

研究旨在解决由于语义重叠导致的精细面部表情编辑难题。作者构建了Flex Facial Expression (FFE) 数据集，并开发了FFE-Bench来评估表情编辑的各个方面。他们提出了PixelSmile，一种通过对称联合训练来分离表情语义的扩散框架。PixelSmile 使用强度监督和对比学习来生成可区分的表情，并通过文本潜在插值实现精确的线性控制。实验表明，PixelSmile 在分离和身份保留方面表现出色，使其能够有效进行连续、精细的表情编辑和平滑过渡。

Back to Basics: Revisiting ASR in the Age of Voice Agents

Authors: Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, Alex Smola

First: 2026-03-26T17:59:03+00:00 · Latest: 2026-03-26T17:59:03+00:00

Comments: 10 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

中文标题/摘要

标题：回归基础：在语音代理时代重访ASR

自动语音识别（ASR）系统在精心策划的基准测试中已接近人类的准确性，但在当前评估未系统覆盖的现实世界语音代理条件下仍会失败。没有能够隔离特定失败因素的诊断工具，从业者无法预测哪些条件、哪种语言会导致何种程度的性能下降。我们引入了WildASR，这是一个多语言（四种语言）的诊断基准，完全源自真实的人类语音，将ASR的鲁棒性分解为三个维度：环境退化、人口变化和语言多样性。评估了七种广泛使用的ASR系统，我们发现严重的、不均匀的性能下降，模型的鲁棒性在不同语言或条件下无法转移。关键的是，模型在部分或退化输入下往往会生成可能但未言说的内容，这为下游代理行为带来了具体的安全风险。我们的结果表明，针对特定因素的分解评估对于理解并改进生产系统中的ASR可靠性至关重要。除了基准本身，我们还介绍了三种分析工具，从业者可以使用这些工具来指导部署决策。

Summary / 总结

This paper addresses the limitations of ASR systems in real-world voice agents by introducing WildASR, a multilingual diagnostic benchmark. Seven ASR systems were evaluated under three factors: environmental degradation, demographic shift, and linguistic diversity. The study found severe performance degradation across different conditions and languages, with models often generating plausible but unspoken content under degraded inputs, posing safety risks. The results emphasize the need for targeted evaluation to improve ASR reliability in production systems.

本文通过引入WildASR多语言诊断基准，解决了ASR系统在真实语音代理中的局限性问题。研究评估了七种ASR系统在环境退化、人口变化和语言多样性三个因素下的表现，发现不同条件和语言下性能严重下降，模型在退化输入下往往会生成合理的但未说出口的内容，存在安全风险。研究结果强调了在生产系统中提高ASR可靠性的目标化评估需求。

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Authors: Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

First: 2026-03-26T17:58:54+00:00 · Latest: 2026-03-26T17:58:54+00:00

Abs · PDF · Code1 · Code2

Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

中文标题/摘要

标题：AnyHand：用于RGB(-D)手部姿态估计的大规模合成数据集

我们提出了AnyHand，一个大规模合成数据集，旨在推动基于RGB和RGB-D输入的手部姿态估计技术的发展。尽管最近使用基础方法的研究表明，增加训练数据的数量和多样性可以显著提高手部姿态估计的性能和鲁棒性，但现有的现实世界收集的数据集在覆盖范围上有限，而之前的合成数据集很少能同时提供遮挡、手臂细节和对齐的深度信息。为了解决这一瓶颈，我们的AnyHand包含250万张单手和410万张手物交互的RGB-D图像，并附有丰富的几何注释。在仅RGB设置中，我们展示了将现有基线的原始训练集扩展到AnyHand，即使保持架构和训练方案不变，也能在多个基准（FreiHAND和HO-3D）上获得显著的性能提升。更令人印象深刻的是，使用AnyHand训练的模型在未进行任何微调的情况下，对域外的HO-Cap数据集具有更强的泛化能力。我们还贡献了一个轻量级的深度融合模块，可以轻松集成到现有的RGB模型中。使用AnyHand训练的RGB-D模型在HO-3D基准上表现出色，展示了深度集成的好处以及我们合成数据的有效性。

Summary / 总结

AnyHand is a large-scale synthetic dataset for 3D hand pose estimation from RGB and RGB-D inputs. It addresses gaps in existing datasets by providing occlusions, arm details, and aligned depth together at scale. The dataset contains 2.5 million single-hand and 4.1 million hand-object interaction RGB-D images. Adding AnyHand to the training sets of existing baselines yields significant gains on FreiHAND and HO-3D, and a lightweight depth fusion module trained with AnyHand further improves performance on HO-3D.

AnyHand 是一个大规模合成数据集，用于从 RGB 和 RGB-D 输入中进行 3D 手部姿态估计。它解决了现有数据集的局限性，提供了广泛的覆盖范围，包括遮挡和手臂细节。该数据集包含 250 万张单手和 410 万张手物交互的 RGB-D 图像，并带有详细的几何注释。使用 AnyHand 来增强现有模型的训练集显著提高了 FreiHAND 和 HO-3D 等基准上的性能，且在 HO-Cap 数据集上表现出强大的泛化能力，无需微调。此外，还引入了一个轻量级的深度融合模块，增强了 RGB-D 模型在 HO-3D 上的性能。

Natural-Language Agent Harnesses

Authors: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

First: 2026-03-26T17:58:15+00:00 · Latest: 2026-03-26T17:58:15+00:00

Comments: under review

Abs · PDF · Code1 · Code2

Abstract

Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

中文标题/摘要

标题：自然语言代理 harness

代理性能越来越多地取决于\emph{harness工程设计}，然而harness设计通常埋藏在控制器代码和特定运行时的约定中，使其难以转移、比较和作为科学对象进行研究。我们询问是否可以将代理 harness 的高级控制逻辑外部化为可移植的可执行软件包。我们引入了\textbf{自然语言代理 harness}（NLAHs），它用可编辑的自然语言表达 harness 行为，并引入了\textbf{智能 harness 运行时}（IHR），这是一种共享运行时，通过明确的合同、持久的软件包和轻量级适配器来执行这些 harness。在编程和计算机使用基准测试中，我们对操作可行性、模块消融和代码到文本 harness 迁移进行了受控评估。

Summary / 总结

The research asks whether agent harness behavior can be externalized as portable natural-language artifacts so that harnesses can be transferred, compared, and studied. It introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and the Intelligent Harness Runtime (IHR), a shared runtime that executes them through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, the authors run controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

研究旨在将代理控制逻辑外部化为可编辑的自然语言模块，以促进科学研究和比较。研究引入了自然语言代理控制（NLAH）和智能控制运行时（IHR），通过明确的合同和轻量级适配器执行这些控制模块。实验表明，NLAH在编码和计算机使用基准测试中的操作可行性，模块的重要性以及从代码到文本控制模块的迁移可行性。

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

Venue: CVPR 2026

First: 2026-03-26T17:58:04+00:00 · Latest: 2026-03-26T17:58:04+00:00

Comments: Accepted at CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.

中文标题/摘要

标题：无需硬负样本：概念中心学习促进组合性而不损害对比模型的零样本能力

对比视觉-语言（V&L）模型仍然是各种应用的热门选择。然而，这些模型在学习组合性表示方面的能力有限。先前的方法通常通过生成自定义训练数据来获得硬负样本来解决这一限制。硬负样本已被证明在组合性任务上可以提高性能，但它们往往是针对单一基准的，不具有一般性，并且可能导致基本的V&L能力，如零样本或检索性能的显著下降，使其变得不切实际。在本工作中，我们采取了不同的方法。我们识别了限制V&L组合性性能的两个根本原因：1）长训练描述不需要组合性表示；2）文本和图像编码器中的最终全局池化导致完全丢失了学习绑定所需的信息。作为补救措施，我们提出了两个简单的解决方案：1）我们使用标准NLP软件获得短的概念中心描述片段，并将其与图像对齐；2）我们引入了一个无参数的跨模态注意力池化，从图像编码器中获得概念中心的视觉嵌入。通过这两个改变和简单的辅助对比损失，我们在标准组合性基准上获得了SOTA性能，同时保持或提高了强大的零样本和检索能力。这不会增加推理成本。我们将在https://github.com/SamsungLabs/concept_centric_clip/发布此工作的代码。

Summary / 总结

This paper addresses the limitation of contrastive vision-language models in learning compositional representations by proposing a concept-centric learning approach. Instead of relying on hard negative samples, the authors identify two root causes and propose two simple solutions: using short concept-centric caption parts and introducing a parameter-free cross-modal attention-pooling. The model achieves state-of-the-art performance on compositionality benchmarks while maintaining strong zero-shot and retrieval capabilities without increasing inference cost.

本文提出了一种概念中心的学习方法来解决对比视觉-语言模型在学习组合表示方面的局限性。作者识别了两个根本原因，并提出了两个简单的解决方案：短的概念中心化标题片段和参数自由的跨模态注意力池化。这些改变使模型能够在保持强大的零样本和检索能力的同时，不增加推理成本，达到标准组合性基准的最先进性能。
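
A parameter-free cross-modal attention pooling can be sketched as follows. This is one plausible reading and not the authors' code: the function name `attention_pool`, the scaled dot-product scoring, and the toy vectors are assumptions. The text concept embedding acts as the query over image patch embeddings, with no learned projections.

```python
import math
from typing import List


def attention_pool(query: List[float], patches: List[List[float]]) -> List[float]:
    """Pool patch embeddings with softmax(query . patch / sqrt(d)) weights.

    No learned projections are used, so the pooling adds no parameters.
    """
    d = len(query)
    logits = [
        sum(q * p for q, p in zip(query, patch)) / math.sqrt(d)
        for patch in patches
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the patch embeddings, one output value per dimension.
    return [sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)]


# Toy 2-D example: the text concept points along the first axis,
# so the first patch should dominate the pooled embedding.
concept = [4.0, 0.0]
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled = attention_pool(concept, patches)
print(pooled)
```

Because the weights depend on the concept, each short caption part yields its own visual embedding from the same patch tokens, in contrast to a single global pooled vector.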

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

First: 2026-03-26T17:58:04+00:00 · Latest: 2026-03-26T17:58:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce R-C2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.

中文标题/摘要

标题：R-C2：循环一致强化学习提高多模态推理

稳健的感知和推理需要跨感官模态的一致性。然而，当前的多模态模型往往违背了这一原则，导致对同一概念的视觉和文本表示产生矛盾的预测。我们不通过标准的投票机制掩盖这些失败，因为这可能会放大系统性偏差，而是展示了跨模态不一致为学习提供了丰富而自然的信号。我们引入了RC2，这是一种通过强制执行跨模态循环一致性来解决内部冲突的强化学习框架。通过要求模型进行反向推理、切换模态，并可靠地通过正向推理重建答案，我们获得了一个密集的、无需标签的奖励。这种循环约束促使模型自主对齐其内部表示。优化这种结构可以减轻模态特定的错误，并将推理准确性提高多达7.6个百分点。我们的结果表明，高级推理不仅来自于数据的扩展，还来自于对世界结构一致理解的强制执行。

Summary / 总结

The research aims to improve multimodal reasoning by ensuring consistency across sensory modalities. The method involves a reinforcement learning framework called R-C2, which enforces cross-modal cycle consistency to resolve internal conflicts. This cyclic constraint encourages the model to autonomously align its internal representations, leading to improved reasoning accuracy by up to 7.6 points.

研究旨在通过解决跨模态不一致性来提高多模态推理的一致性。方法是采用一种名为R-C2的强化学习框架，通过强制执行循环一致性来解决内部冲突。这种循环约束有助于自动对齐内部表示，并将推理准确性提高7.6个百分点。关键发现是，这种方法能够自主对齐模态，并减少模态特定的错误，从而增强多模态推理的鲁棒性。
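
The backward, switch, forward cycle can be illustrated with a toy reward function. This is a simplified sketch under stated assumptions: the name `cycle_reward` and the three callables are hypothetical stand-ins for model calls, and the binary reward here is cruder than the dense reward the abstract describes.

```python
from typing import Callable


def cycle_reward(
    answer: str,
    backward: Callable[[str], str],
    switch_modality: Callable[[str], str],
    forward: Callable[[str], str],
) -> float:
    """Return 1.0 if the answer survives the backward -> switch -> forward cycle."""
    query = backward(answer)             # infer a query that would lead to the answer
    other_view = switch_modality(query)  # re-express the query in the other modality
    return 1.0 if forward(other_view) == answer else 0.0


# Toy stand-ins for model calls (hypothetical lookups, not a real model).
backward = {"4": "text: 2 + 2"}.get
switch = {"text: 2 + 2": "image: 2 + 2"}.get
consistent_forward = {"image: 2 + 2": "4"}.get
inconsistent_forward = {"image: 2 + 2": "5"}.get

print(cycle_reward("4", backward, switch, consistent_forward))    # 1.0
print(cycle_reward("4", backward, switch, inconsistent_forward))  # 0.0
```

No ground-truth label enters the computation: the reward depends only on whether the model agrees with itself across modalities.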

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Authors: Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, Akash Srivastava

First: 2026-03-26T17:57:50+00:00 · Latest: 2026-03-26T17:57:50+00:00

Abs · PDF · Code1 · Code2

Abstract

We present an empirical study of how far general-purpose coding agents -- without hardware-specific training -- can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches $N$ expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean $8.27\times$ speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds $20\times$ and kmeans reaches approximately $10\times$. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.

中文标题/摘要

标题：用于高级综合的代理工厂：通用编码代理在硬件优化中的潜力有多远？

我们对通用编码代理（无需特定硬件训练）如何从高级算法规范优化硬件设计进行了实证研究。我们引入了一个代理工厂，这是一个两阶段流水线，用于构建和协调多个自主优化代理。在第一阶段，流水线将设计分解为子内核，独立地使用pragma和代码级变换优化每个子内核，并通过面积约束下的整数线性规划（ILP）来组装全局有希望的配置。在第二阶段，它在ILP解决方案上启动N个专家代理，每个代理探索子内核分解无法捕捉的跨功能优化，如pragma重组、循环融合和内存重构。我们使用Claude Code（Opus 4.5/4.6）和AMD Vitis HLS在HLS-Eval和Rodinia-HLS的12个内核上评估了该方法。从1个到10个代理的扩展在基线上的平均加速比为8.27倍，对于更难的基准，加速比超过20倍，kmeans达到约10倍。在所有基准测试中，代理一致地重新发现了无需特定领域训练即可发现的硬件优化模式，最佳设计往往不是来自ILP候选的顶级选项，这表明全局优化揭示了子内核搜索中遗漏的改进。这些结果确立了代理扩展作为高级综合优化的实际和有效轴心。

Summary / 总结

The study investigates how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level specifications. It introduces an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. Stage 1 decomposes the design into sub-kernels, optimizes each independently, and formulates an ILP to assemble globally promising configurations under an area constraint. Stage 2 launches expert agents over the top ILP solutions to explore cross-function optimizations. On 12 kernels from HLS-Eval and Rodinia-HLS, scaling from 1 to 10 agents yields a mean 8.27x speedup over the baseline, with larger gains on harder benchmarks. The best designs often do not come from the top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search.

这项研究探讨了通用编码代理在无需特定硬件训练的情况下，从高级规格优化硬件设计的能力。一个代理工厂，一个两阶段流水线，构建并协调多个自主代理来优化硬件设计。第一阶段将设计分解为子内核，独立优化并制定整数线性规划（ILP）进行全局配置组装。第二阶段启动专家代理探索跨功能优化。在使用Claude Code对来自HLS-Eval和Rodinia-HLS的12个内核进行评估时，显示平均8.27倍的加速，特别是在更难的基准上，表明全局优化可以揭示子内核搜索中遗漏的改进，并且代理扩展是高级综合优化的一个实用和有效的方向。
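
The Stage 1 assembly step, choosing one configuration per sub-kernel under an area constraint, can be illustrated with a tiny selection problem. This is a sketch under stated assumptions rather than the authors' formulation: exhaustive enumeration stands in for an ILP solver, total latency is assumed to be the sum of sub-kernel latencies, and the kernel names, configuration labels, and numbers are invented for illustration.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Config = Tuple[str, int, int]  # (label, latency_cycles, area_units), all hypothetical


def best_assembly(
    options: Dict[str, List[Config]], area_budget: int
) -> Optional[Tuple[int, int, Dict[str, str]]]:
    """Choose one config per sub-kernel, minimizing total latency under an area budget.

    Exhaustive enumeration replaces the ILP solver; it is only practical
    for very small inputs.
    """
    kernels = list(options)
    best = None
    for choice in product(*(options[k] for k in kernels)):
        area = sum(c[2] for c in choice)
        latency = sum(c[1] for c in choice)
        if area <= area_budget and (best is None or latency < best[0]):
            best = (latency, area, {k: c[0] for k, c in zip(kernels, choice)})
    return best


options = {
    "load": [("baseline", 100, 10), ("unroll4", 40, 30)],
    "compute": [("baseline", 200, 20), ("pipeline", 80, 45), ("unroll8", 50, 90)],
}
print(best_assembly(options, area_budget=80))
# (120, 75, {'load': 'unroll4', 'compute': 'pipeline'})
```

The fastest individual configuration (`unroll8`) is excluded here because it breaks the area budget, which is why the choice has to be made jointly rather than per sub-kernel.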

Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Authors: Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, Xiang Bai

First: 2026-03-26T17:56:01+00:00 · Latest: 2026-03-26T17:56:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.

中文标题/摘要

标题：视而不见但记在心：动态视频世界模型中的混合记忆

视频世界模型在模拟物理世界方面展现了巨大的潜力，但现有的记忆机制主要将环境视为静态的画布。当动态主体暂时消失后重新出现时，当前的方法往往难以应对，导致主体出现冻结、失真或消失的情况。为了解决这个问题，我们引入了混合记忆这一新颖的范式，要求模型同时作为静态背景的精确档案员和动态主体的警惕追踪者，确保在视线之外的时段内保持运动连续性。为了促进这一方向的研究，我们构建了HM-World，这是首个专注于混合记忆的大型视频数据集，包含59000个高质量片段，具有解耦的摄像机和主体轨迹，涵盖了17个不同的场景、49个不同的主体以及精心设计的进出事件，以严格评估混合一致性。此外，我们还提出了HyDRA，这是一种专门的记忆架构，将记忆压缩成令牌，并利用时空相关性驱动的检索机制。通过选择性地关注相关的运动线索，HyDRA有效地保留了隐藏主体的身份和运动。在HM-World上的广泛实验表明，我们的方法在动态主体一致性和整体生成质量方面显著优于最先进的方法。

Summary / 总结

The research addresses the challenge of dynamic subjects disappearing from view and reappearing in video world models, which current methods often fail to handle properly. It introduces Hybrid Memory, a new approach that requires models to both archive static backgrounds and track dynamic subjects, ensuring continuous motion. The study also presents HM-World, a large-scale dataset for hybrid memory research, and HyDRA, a memory architecture that uses token compression and spatiotemporal relevance to maintain subject identity and motion. Experiments show that the proposed method outperforms existing techniques in dynamic subject consistency and overall generation quality.

论文针对视频世界模型中动态主体消失后出现的冻结、扭曲或消失问题，提出了Hybrid Memory，该方法要求模型同时记录静态背景和追踪动态主体。作者还构建了HM-World，这是一个大规模的视频数据集，并提出了HyDRA，一种利用时空相关性来保存隐藏主体身份和运动的内存架构。实验表明，HyDRA在动态主体一致性和整体生成质量方面优于现有方法。

Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

First: 2026-03-26T17:53:49+00:00 · Latest: 2026-03-26T17:53:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

中文标题/摘要

标题：视觉关注以求地基：视觉注意力在幻觉抗性MDLLMs中的应用

多模态扩散大型语言模型（MDLLMs）通过并行遮蔽解码实现高并发生成，但架构仍易受到多模态幻觉的影响。这种结构上的脆弱性源于算法缺陷：解码器根据文本可能性对候选词进行排序，而未验证局部视觉支持。我们证明这种仅语言的排序导致了目标不匹配，其中语言概率质量充当了对多模态任务的不恰当代理。因此，我们将幻觉重新解释为局部优化错误，即解码器利用语言捷径以最大化代理分数，而牺牲了视觉地基。为解决这种目标不匹配，我们引入了VISAGE，这是一种无需训练的解码框架，在推理时校准目标。VISAGE通过量化跨注意力分布的空间熵来估计代理差异。通过在注意力头之间强制执行定位共识，该方法惩罚空间均匀分布并重新排序词元承诺，以有利于视觉地基的结果。我们提供了分析的稳定性保证，表明在估计误差下VISAGE保持有界的目标损失。在幻觉敏感和通用基准上的评估表明该框架的鲁棒性，分别在MMMU-val上获得8.59%的相对增益，在HallusionBench上获得7.75%的相对增益。

Summary / 总结

The research addresses the issue of multimodal hallucinations in MDLLMs by introducing VISAGE, a decoding framework that calibrates the objective at inference time to better align with visual grounding. VISAGE quantifies the spatial entropy of cross-attention distributions to penalize spatially uniform distributions, thereby favoring visually grounded outcomes. Experiments show that VISAGE improves robustness across benchmarks, achieving relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

论文通过引入VISAGE解码框架来解决MDLLMs中的多模态幻觉问题，该框架在推理时校准目标以更好地与视觉接地对齐。VISAGE通过量化交叉注意力分布的空间熵来惩罚空间均匀分布，从而更倾向于视觉接地的结果。实验表明，VISAGE在幻觉敏感和通用基准上提高了鲁棒性，分别在MMMU-val和HallusionBench上实现了8.59%和7.75%的相对增益。
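
The idea of penalizing spatially uniform attention can be illustrated with a short sketch. It is not the VISAGE implementation: the function names, the penalty weight `lam`, the single attention map per candidate (the abstract describes a consensus across heads, omitted here), and all numbers are assumptions made for illustration.

```python
import math
from typing import Dict, List


def spatial_entropy(attn: List[float]) -> float:
    """Shannon entropy of an attention map, normalized to [0, 1]."""
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(attn))


def rerank(
    log_probs: Dict[str, float],
    attn_maps: Dict[str, List[float]],
    lam: float = 2.0,
) -> List[str]:
    """Sort candidate tokens by log-probability minus an entropy penalty."""
    scores = {
        tok: lp - lam * spatial_entropy(attn_maps[tok])
        for tok, lp in log_probs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# "dog" is more likely under the language prior, but its attention is
# spread uniformly over the image; "cat" attends to one region.
log_probs = {"dog": -0.5, "cat": -0.9}
attn_maps = {
    "dog": [0.25, 0.25, 0.25, 0.25],
    "cat": [0.85, 0.05, 0.05, 0.05],
}
print(rerank(log_probs, attn_maps))  # ['cat', 'dog']
```

With `lam = 0` the ranking falls back to the language-only order, so the penalty weight controls how strongly localized visual support overrides textual likelihood in this sketch.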

TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Authors: Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra

First: 2026-03-26T17:50:42+00:00 · Latest: 2026-03-26T17:50:42+00:00

Comments: webpage: https://trace-motion.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

中文标题/摘要

标题：TRACE: 视频中物体运动编辑的第一帧轨迹引导方法

我们研究视频中物体运动路径编辑的问题，目标是改变目标物体的轨迹同时保留原始场景内容。与之前主要操作外观或依赖基于点轨迹的路径控制的方法不同，这些方法在推断过程中用户往往难以提供，尤其是在有摄像机运动的视频中，我们提供了一种实用且易于使用的可控物体中心运动编辑方法。我们提出了Trace框架，允许用户在单个锚定帧中设计所需的轨迹，然后合成出时间上一致的编辑视频。我们的方法通过两阶段流水线来解决此任务：一个跨视图运动变换模块，将第一帧路径设计映射到摄像机运动下的帧对齐的框轨迹，以及一个基于运动条件的视频重新合成模块，遵循这些轨迹再生物体同时保留输入视频的其余内容。在多种真实世界视频上的实验表明，我们的方法生成的运动编辑更加连贯、真实且可控，优于最近的图像到视频和视频到视频方法。

Summary / 总结

The research aims to enable users to edit the motion path of objects in videos while maintaining the original scene content. The proposed Trace framework allows users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. The method uses a two-stage pipeline: a cross-view motion transformation module and a motion-conditioned video re-synthesis module. Experiments show that Trace produces more coherent, realistic, and controllable motion edits compared to recent methods.

研究旨在让用户能够在保持原始场景内容的同时编辑视频中物体的运动路径。提出的Trace方法允许用户在一个锚定帧中设计所需的轨迹，然后合成一个时间上一致的编辑视频。该方法使用两阶段管道：一个跨视图运动转换模块将第一帧路径设计映射到摄像机运动下的帧对齐的盒子轨迹，以及一个基于运动条件的视频重新合成模块，按照这些轨迹再生物体同时保留输入视频的其余内容。实验表明，Trace生成的运动编辑更加连贯、真实且可控，优于最近的方法。

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Venue: CVPR 2026

First: 2026-03-26T17:50:37+00:00 · Latest: 2026-03-26T17:50:37+00:00

Comments: CVPR 2026 Camera-ready, Webpage: https://doubiiu.github.io/projects/WanWeaver

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

中文标题/摘要

标题：Wan-Weaver: 通过解耦训练实现交错多模态生成

近期统一模型在理解和生成方面取得了前所未有的进展。然而，尽管大多数模型接受多模态输入，它们通常只生成单一模态的输出。这种交错内容生成的挑战主要源于训练数据稀缺以及建模长距离跨模态上下文的难度。为了解决这一问题，我们将交错生成分解为文本规划和视觉一致性建模，并引入了一个由规划器和视觉器组成的框架。规划器生成视觉内容的密集文本描述，而视觉器据此合成图像。在这一指导下，我们构建了大规模的文本代理交错数据（其中视觉内容以文本形式表示）来训练规划器，并整理了参考指导图像数据来训练视觉器。这些设计催生了Wan-Weaver，它展示了具有长距离文本连贯性和视觉一致性的新兴交错生成能力。同时，将多样化的理解和生成数据整合到规划器训练中，使Wan-Weaver能够实现稳健的任务推理和生成能力。为了评估模型在交错生成方面的能力，我们进一步构建了一个涵盖多个维度广泛应用场景的基准。大量实验表明，即使没有访问任何真实交错数据，Wan-Weaver也优于现有方法。

Summary / 总结

Wan-Weaver addresses the challenge of producing interleaved multi-modal content by decomposing the task into textual planning and visual consistency modeling. It uses large-scale textual-proxy interleaved data to train a planner and reference-guided image data to train a visualizer. The model demonstrates strong interleaved generation ability with long-range textual coherence and visual consistency, and outperforms existing methods even without real interleaved data.

Wan-Weaver通过将任务分解为文本规划和视觉一致性建模来解决生成交织多模态内容的挑战。它使用大规模的文本代理数据来训练规划器，并使用参考引导的图像数据来训练视觉器。该框架使Wan-Weaver能够生成具有长距离文本连贯性和视觉一致性的交织内容，即使没有实际交织数据，其性能也优于现有方法。

Hyper-Connections for Adaptive Multi-Modal MRI Brain Tumor Segmentation

Authors: Lokendra Kumar, Shubham Aggarwal

First: 2026-03-20T10:44:58+00:00 · Latest: 2026-03-26T17:40:04+00:00

Comments: 29 pages, 6 tables, 17 figures

Abs · PDF · Code1 · Code2

Abstract

We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.

中文标题/摘要

标题：超连接用于自适应多模态MRI脑肿瘤分割

我们首次研究了超连接（HC）在体多模态脑肿瘤分割中的应用，将它们作为固定残差连接的即插即用替代品集成到五个架构中：nnU-Net、SwinUNETR、VT-UNet、U-Net和U-Netpp中。动态HC在BraTS 2021数据集上一致提高了所有3D模型的表现，参数量几乎没有增加，最高可获得+1.03百分点的平均Dice增益。在增强肿瘤亚区域中，增益最为显著，反映了更精细的边界划分。模态消融进一步表明，配备HC的模型对临床主导序列具有更锐利的敏感性，特别是对于肿瘤核心和增强肿瘤的T1ce，以及对于整个肿瘤的FLAIR，这种行为在固定连接基线中不存在，并且在所有架构中都是一致的。在2D设置中，改进较小且配置敏感，表明体素空间上下文放大了自适应聚合的好处。这些结果确立了HC作为多模态特征融合的简单、高效且广泛适用机制的地位。

Summary / 总结

This study introduces Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections in five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC improves all 3D models on the BraTS 2021 dataset, achieving up to a 1.03 percent mean Dice gain with negligible parameter overhead. The gains are largest in the Enhancing Tumor sub-region, indicating better fine-grained boundary delineation. Modality ablation shows that HC-equipped models are more sensitive to clinically dominant sequences (T1ce for Tumor Core and Enhancing Tumor, FLAIR for Whole Tumor), a behavior not observed in fixed-connection baselines. In 2D settings, the improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation.

研究引入了Hyper-连接（HC）用于多模态脑肿瘤体积分割，增强了五种架构：nnU-Net、SwinUNETR、VT-UNet、U-Net和U-Netpp。HC在BraTS 2021数据集上提高了所有模型的表现，最高可获得1.03个百分点的Dice分数提升，且参数量几乎没有增加。特别是在增强肿瘤子区域，显示出更好的细粒度边界分割效果。模态消融实验表明，HC增强的模型对临床主导序列更为敏感，这一行为在固定连接模型中未见。在2D设置中，改进较小且依赖于配置，突出了体积空间上下文对自适应聚合的重要性。

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Authors: Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

First: 2026-03-26T17:36:33+00:00 · Latest: 2026-03-26T17:36:33+00:00

Comments: 18 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
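The coarse-to-fine selection described in the abstract above can be pictured with a toy sketch. It assumes each zoom step splits the current map cell into four quadrants and that a scoring function ranks the children; both are illustrative assumptions, since the abstract does not state how cells are partitioned, and `toy_score` is a hypothetical stand-in for the learned model.

```python
from typing import Callable, List, Tuple

Cell = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def split_into_quadrants(cell: Cell) -> List[Cell]:
    """Split a cell into four equal children (assumed partition scheme)."""
    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x0, y0, xm, ym), (xm, y0, x1, ym), (x0, ym, xm, y1), (xm, ym, x1, y1)]


def zoom_in(root: Cell, score: Callable[[Cell], float], steps: int):
    """Take `steps` zoom decisions, each conditioned on the cell chosen so far."""
    cell, decisions = root, []
    for _ in range(steps):
        children = split_into_quadrants(cell)
        best = max(range(len(children)), key=lambda i: score(children[i]))
        decisions.append(best)
        cell = children[best]
    return cell, decisions


def make_toy_score(target: Tuple[float, float]) -> Callable[[Cell], float]:
    """Hypothetical scorer: prefers the child whose centre is nearest the target."""
    tx, ty = target

    def toy_score(cell: Cell) -> float:
        x0, y0, x1, y1 = cell
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        return -((cx - tx) ** 2 + (cy - ty) ** 2)

    return toy_score


root = (0, 0, 1024, 1024)
terminal_cell, decisions = zoom_in(root, make_toy_score((700, 300)), steps=3)
```

In the paper the decisions come from a trained autoregressive model conditioned on the street-view image; the sketch only shows why a short decision sequence is enough to address a small terminal cell on a large map.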

中文标题/摘要

标题：只需放大：通过自回归放大进行跨视图地理定位

跨视图地理定位（CVGL）通过将街道视图图像与地理参考的航拍图像匹配来估计相机的位置，从而实现GPS受限的定位和导航。现有方法几乎都将CVGL普遍形式化为对比训练嵌入空间中的图像检索问题。这将性能与大批次和硬负样本挖掘联系在一起，并忽略了地图的几何结构以及街道视图和航拍图像之间的覆盖不匹配。特别是，从街道视图可见的显著地标可能位于固定卫星裁剪之外，使检索目标变得模糊，并限制了对地图的显式空间推理。我们提出了一种替代方案Just Zoom In，通过在城市规模的航拍地图上进行自回归放大来执行CVGL。从粗略的卫星视图开始，模型通过一系列放大决策选择目标分辨率的终端卫星单元，而无需对比损失或硬负样本挖掘。我们还引入了一个基于众包街道视图和高分辨率卫星图像的现实基准，反映了实际拍摄条件。在该基准上，Just Zoom In 达到了最先进的性能，Recall@1 在50米内的提升为5.5%，在100米内的提升为9.6%，超过最强的对比检索基线。这些结果表明，顺序的粗细空间推理对于跨视图地理定位的有效性。

Summary / 总结

The paper addresses the challenge of cross-view geo-localization (CVGL) by proposing Just Zoom In, which formulates CVGL through autoregressive zooming over a city-scale overhead map. Unlike existing methods that rely on contrastive losses and hard negative mining, Just Zoom In starts from a coarse satellite view and makes a series of zoom-in decisions to select a terminal satellite cell at the target resolution. On a new benchmark with crowd-sourced street views and high-resolution satellite imagery, Just Zoom In outperforms the strongest contrastive-retrieval baseline, improving Recall@1 within 50 meters by 5.5% and within 100 meters by 9.6%. This highlights the effectiveness of sequential coarse-to-fine spatial reasoning in CVGL.

研究通过提出Just Zoom In方法，使用城市规模的航拍地图进行自回归缩放，解决了现有跨视角地理定位方法的局限性。从粗略的卫星视图开始，模型通过一系列缩放决策选择目标分辨率的终端卫星单元。该方法避免了对比损失和硬负样本挖掘，实现了最先进的性能，在50米内的召回率提高了5.5%，在100米内的召回率提高了9.6%，优于最强的对比检索基线。

Tensor Gaussian Processes: Efficient Solvers for Nonlinear PDEs

Authors: Qiwei Yuan, Zhitong Xu, Yinghao Chen, Yiming Xu, Houman Owhadi, Shandian Zhe

First: 2025-10-15T17:23:21+00:00 · Latest: 2026-03-26T17:36:11+00:00

Comments: Accepted at AISTATS 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Machine learning solvers for partial differential equations (PDEs) have attracted growing interest. However, most existing approaches, such as neural network solvers, rely on stochastic training, which is inefficient and typically requires a great many training epochs. Gaussian process (GP)/kernel-based solvers, while mathematically principled, suffer from scalability issues when handling the large numbers of collocation points often needed for challenging or higher-dimensional PDEs. To overcome these limitations, we propose TGPS, a tensor-GP-based solver that introduces factor functions along each input dimension using one-dimensional GPs and combines them via tensor decomposition to approximate the full solution. This design reduces the task to learning a collection of one-dimensional GPs, substantially lowering computational complexity and enabling scalability to massive collocation sets. For efficient nonlinear PDE solving, we use a partial freezing strategy and Newton's method to linearize the nonlinear terms. We then develop an alternating least squares (ALS) approach that admits closed-form updates, thereby substantially enhancing the training efficiency. We establish theoretical guarantees on the expressivity of our model, together with a convergence proof and error analysis under standard regularity assumptions. Experiments on several benchmark PDEs demonstrate that our method achieves superior accuracy and efficiency compared to existing approaches. The code is released at https://github.com/BayesianAIGroup/TGPSolve-NonLinear-PDEs
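To make the "alternating least squares with closed-form updates" idea concrete, here is a minimal sketch on a much simpler problem: fitting a rank-1 separable model a(x)·b(y) to samples of a function on a 2D grid. It is not the TGPS algorithm (there are no GPs, no PDE residuals, and no Newton linearization); the function `als_rank1` and the test function are illustrative assumptions.

```python
import math


def als_rank1(F, iters=10):
    """Fit F[i][j] ~ a[i] * b[j] by alternating closed-form least-squares updates."""
    n, m = len(F), len(F[0])
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(iters):
        # With b frozen, each a[i] has a closed-form least-squares solution.
        bb = sum(v * v for v in b)
        a = [sum(F[i][j] * b[j] for j in range(m)) / bb for i in range(n)]
        # With a frozen, each b[j] has a closed-form least-squares solution.
        aa = sum(v * v for v in a)
        b = [sum(F[i][j] * a[i] for i in range(n)) / aa for j in range(m)]
    return a, b


# Toy separable target f(x, y) = sin(pi x) * exp(-y) sampled on a 10 x 10 grid.
xs = [i / 9 for i in range(10)]
ys = [j / 9 for j in range(10)]
F = [[math.sin(math.pi * x) * math.exp(-y) for y in ys] for x in xs]

a, b = als_rank1(F)
max_error = max(abs(F[i][j] - a[i] * b[j]) for i in range(10) for j in range(10))
```

Freezing one factor turns the fit for the other into a linear least-squares problem, which is why no stochastic gradient steps are needed; the abstract describes the same alternating structure applied to one-dimensional GP factors.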

中文标题/摘要

标题：张量高斯过程：非线性偏微分方程的有效求解器

机器学习求解偏微分方程（PDEs）的方法引起了广泛关注。然而，大多数现有方法，如神经网络求解器，依赖于随机训练，这效率低下，通常需要大量的训练周期。基于高斯过程（GP）/核的方法虽然具有数学上的原理，但在处理大量插值点时会遇到可扩展性问题，这些插值点常用于解决具有挑战性或高维的PDEs。为克服这些限制，我们提出了TGPS，一种基于张量高斯过程的求解器，通过在每个输入维度上引入一维GP并使用张量分解将它们结合起来，以近似完整解。这种设计将任务简化为学习一组一维GP，显著降低了计算复杂性，并使大规模插值集的处理成为可能。为了高效求解非线性PDEs，我们采用部分冻结策略和牛顿法线性化非线性项。然后，我们开发了一种交替最小二乘（ALS）方法，该方法具有闭式更新，从而显著提高了训练效率。我们建立了模型的表达能力理论保证，以及在标准正则性假设下的收敛性和误差分析证明。在几个基准PDEs上的实验表明，我们的方法在准确性和效率方面优于现有方法。代码发布在https://github.com/BayesianAIGroup/TGPSolve-NonLinear-PDEs

Summary / 总结

The research aims to improve the efficiency and scalability of solvers for partial differential equations (PDEs) by proposing TGPS, a tensor-GP-based solver that models factor functions along each input dimension with one-dimensional Gaussian processes and combines them via tensor decomposition. This reduces computational complexity and allows the solver to handle large collocation sets. Nonlinear terms are linearized with a partial freezing strategy and Newton's method, and an alternating least squares (ALS) scheme with closed-form updates makes training efficient. Experiments show that TGPS outperforms existing methods in both accuracy and efficiency on benchmark PDEs.

论文提出了一种张量高斯过程（TGPS）方法，以更高效地解决偏微分方程（PDEs）。该方法通过使用张量分解来降低计算复杂度，从而解决传统高斯过程求解器的可扩展性问题。该方法采用部分冻结策略和牛顿法处理非线性项，并使用交替最小二乘法进行高效训练。实验表明，TGPS在基准PDEs上在准确性和效率方面均优于现有方法。

Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning

Authors: Jai Bardhan, Patrik Drozdik, Josef Sivic, Vladimir Petrik

First: 2026-03-26T17:36:08+00:00 · Latest: 2026-03-26T17:36:08+00:00

Comments: 34 pages, 11 figures, 12 tables

Abs · PDF · Code1 · Code2

Abstract

Action-conditioned robot world models generate future video frames of the manipulated scene given a robot action sequence, offering a promising alternative for simulating tasks that are difficult to model with traditional physics engines. However, these models are optimized for short-term prediction and break down when deployed autoregressively: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade. We address this through the following contributions. First, we introduce a reinforcement learning (RL) post-training scheme that trains the world model on its own autoregressive rollouts rather than on ground-truth histories. We achieve this by adapting a recent contrastive RL objective for diffusion models to our setting and show that its convergence guarantees carry over exactly. Second, we design a training protocol that generates and compares multiple candidate variable-length futures from the same rollout state, reinforcing higher-fidelity predictions over lower-fidelity ones. Third, we develop efficient, multi-view visual fidelity rewards that combine complementary perceptual metrics across camera views and are aggregated at the clip level for a dense, low-variance training signal. Fourth, we show that our approach establishes a new state-of-the-art for rollout fidelity on the DROID dataset, outperforming the strongest baseline on all metrics (e.g., LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera), winning 98% of paired comparisons, and achieving an 80% preference rate in a blind human study.
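The candidate-comparison step in the abstract above can be sketched roughly as follows. The sketch assumes each candidate future already has per-frame fidelity scores for each camera view (higher is better) and that the clip-level reward is a plain mean over frames and then over views; the actual metrics, weights, and RL objective are not given in the abstract, so every name here is hypothetical.

```python
def clip_reward(per_view_frame_scores):
    """Average per-frame scores within each view, then average across views."""
    view_means = [sum(s) / len(s) for s in per_view_frame_scores.values()]
    return sum(view_means) / len(view_means)


def rank_candidates(candidates):
    """Return candidate indices sorted from highest to lowest clip-level reward."""
    rewards = [clip_reward(c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: rewards[i], reverse=True)
    return order, rewards


# Two candidate futures of different lengths rolled out from the same state.
candidates = [
    {"wrist": [0.5, 0.5], "external": [1.0, 1.0]},
    {"wrist": [0.25, 0.25, 0.25], "external": [0.5, 0.5, 0.5]},
]
order, rewards = rank_candidates(candidates)
```

Averaging at the clip level keeps candidates of different lengths comparable; a training loop would then reinforce the higher-ranked candidate over the lower-ranked one.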

中文标题/摘要

标题：持久的机器人世界模型：通过强化学习稳定多步展开

基于动作条件的机器人世界模型可以根据给定的机器人动作序列生成未来视频帧，为模拟传统物理引擎难以建模的任务提供了有前景的替代方案。然而，这些模型优化了短期预测，在自回归部署时会失效：每次预测的片段都会作为下一个片段的上下文，导致错误累积并导致视觉质量迅速下降。我们通过以下贡献解决了这一问题。首先，我们引入了一种强化学习（RL）后训练方案，该方案在自回归展开上训练世界模型，而不是在真实历史数据上。我们通过将最近的对比RL目标适应到我们的设置中，并展示了其收敛性保证完全适用。其次，我们设计了一种训练协议，从相同的展开状态生成并比较多个候选的变长未来，强化高保真预测。第三，我们开发了高效的多视图视觉保真度奖励，结合了来自不同摄像机视图的互补感知度量，并在片段级别进行聚合，以提供密集且低方差的训练信号。第四，我们展示了我们的方法在DROID数据集上建立了新的展开保真度的最新成果，所有指标上均优于最强基线（例如，外部摄像机上的LPIPS降低了14%，手腕摄像机上的SSIM提高了9.1%），赢得了98%的配对比较，并在盲测中获得了80%的偏好率。

Summary / 总结

This paper addresses the issue of visual quality degradation in autoregressive predictions from robot world models by introducing a reinforcement learning (RL) post-training scheme. The method trains the model on its own predictions and designs a training protocol that generates and compares multiple candidate futures, reinforcing higher-fidelity predictions. The approach achieves superior rollout fidelity, outperforming existing methods on metrics such as LPIPS and SSIM, and receives high preference rates in human studies.

本文解决了机器人世界模型在进行多步预测时出现的误差累积问题。提出了一种基于强化学习（RL）的后训练方案，通过自适应对比RL目标优化模型的自回归卷出。方法还包括生成和比较多个候选未来，强化更高保真度的预测。开发了高效的多视图视觉保真度奖励，以提高训练信号的密度和减少方差。实验结果表明，该方法在DROID数据集上显著优于现有方法，LPIPS和SSIM指标均有显著提升，并在盲测中获得了高偏好率。

Analysing Environmental Efficiency in AI for X-Ray Diagnosis

Authors: Liam Kearns

Venue: Journal of AI 10 (2026) 37-55

First: 2025-10-31T14:19:57+00:00 · Latest: 2026-03-26T17:32:49+00:00

Comments: Accepted for publication in Journal of AI. The final published version is available at https://doi.org/10.61969/jai.1838517

Abs · PDF · Code1 · Code2

Abstract

The integration of AI tools into medical applications has aimed to improve the efficiency of diagnosis. The emergence of large language models (LLMs), such as ChatGPT and Claude, has expanded this integration even further despite a concern for their environmental impact. Because of LLM versatility and ease of use through APIs, these larger models are often utilised even though smaller, custom models can be used instead. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. These discriminative models are also used to provide knowledge bases for LLMs to improve accuracy. This provides a benchmark study of 14 different model configurations for comparison of diagnostic accuracy and environmental impact. The findings indicated that while smaller models reduced the carbon footprint of the application, the output was biased towards a positive diagnosis and the output probabilities were lacking confidence. Meanwhile, restricting LLMs to only give probabilistic output caused poor performance in both accuracy and carbon footprint, demonstrating the risk of using LLMs as a universal AI solution. While using the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, this was still disproportionate to the discriminative models; the most efficient solution was the Covid-Net model. Although it had a larger carbon footprint than other small models, its carbon footprint was 99.9% less than when using GPT-4.5-Preview, whilst achieving an accuracy of 95.5%, the highest of all models examined. This paper contributes to knowledge by comparing generative and discriminative models in Covid-19 detection as well as highlighting the environmental risk of using generative tools for classification tasks.
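As a rough reading of the two percentages in the abstract above, the snippet below converts each reduction into the fraction of the footprint that remains. It assumes, purely for illustration, that both reductions are measured against the same baseline; the abstract compares GPT-4.1-Nano to "the larger models" and Covid-Net to GPT-4.5-Preview, so the resulting ratio is indicative only.

```python
def remaining_fraction(reduction_percent):
    """Fraction of the baseline footprint left after a percentage reduction."""
    return 1 - reduction_percent / 100


nano_remaining = remaining_fraction(94.2)      # GPT-4.1-Nano vs. larger models
covidnet_remaining = remaining_fraction(99.9)  # Covid-Net vs. GPT-4.5-Preview

# Under the shared-baseline assumption, how many times larger the Nano footprint is.
ratio = nano_remaining / covidnet_remaining
```

Under that assumption the smaller LLM would still emit tens of times more than the discriminative model, which is the sense in which the abstract calls its footprint disproportionate.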

中文标题/摘要

标题：分析AI在X射线诊断中的环境效率

将AI工具集成到医疗应用中旨在提高诊断效率。尽管大型语言模型（LLMs）如ChatGPT和Claude的出现进一步扩展了这种集成，但对其环境影响的担忧仍然存在。由于LLMs的多功能性和通过API使用的便捷性，这些较大的模型经常被使用，尽管较小的定制模型可以替代。在本文中，LLMs和小型判别模型被集成到一个Mendix应用程序中，用于检测胸部X射线中的新冠肺炎。这些判别模型还用于为LLMs提供知识库，以提高准确性。这为14种不同模型配置的诊断准确性和环境影响提供了基准研究。研究结果表明，虽然较小的模型减少了应用程序的碳足迹，但输出偏向于阳性诊断，输出概率缺乏信心。同时，限制LLMs仅提供概率输出导致在准确性和碳足迹方面表现不佳，表明使用LLMs作为通用AI解决方案的风险。虽然使用较小的LLM GPT-4.1-Nano将碳足迹减少了94.2%，但与较大的模型相比，这仍然不成比例；最有效的解决方案是Covid-Net模型。尽管它的碳足迹比其他小型模型大，但其碳足迹比使用GPT-4.5-Preview减少了99.9%，同时实现了95.5%的最高准确率。本文通过比较生成性和判别性模型在新冠肺炎检测中的应用，以及强调使用生成工具进行分类任务的环境风险，为知识做出了贡献。

Summary / 总结

This paper investigates the environmental efficiency of AI tools in X-ray diagnosis, comparing large language models (LLMs) with smaller discriminative models across 14 model configurations for diagnostic accuracy and environmental impact. Smaller models reduce the carbon footprint but are biased towards positive diagnoses and produce low-confidence output probabilities, while restricting LLMs to probabilistic output degrades both accuracy and carbon footprint. The most efficient solution was the Covid-Net model: its carbon footprint was larger than that of other small models but 99.9% less than when using GPT-4.5-Preview, and it achieved the highest accuracy of all models examined, 95.5%.

本文研究了将AI工具集成到医学应用中的环境效率，特别是大型语言模型（LLMs）和较小的判别模型在X射线诊断中的应用。研究比较了14种不同的模型配置在诊断准确性和环境影响方面的表现。主要发现表明，虽然较小的模型可以显著减少碳足迹，但它们可能会引入偏差并且缺乏信心。限制LLMs仅给出概率输出也会降低性能。最有效的解决方案是Covid-Net模型，它实现了95.5%的准确率，其碳足迹比其他较大模型低99.9%。

Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment

Authors: Roy Amoyal, Oren Freifeld, Chaim Baskin

Venue: CVPR 2026

First: 2026-03-23T12:53:57+00:00 · Latest: 2026-03-26T17:32:02+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project/
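For readers unfamiliar with similarity-transform estimation, the following is a 2D analogue of a closed-form absolute orientation solve, written with complex numbers. It assumes known point correspondences and is not the GSA method, which works on 3D Gaussian Splatting models with feature-guided, iteratively refined correspondences; `fit_similarity_2d` is a hypothetical helper.

```python
import cmath


def fit_similarity_2d(src, dst):
    """Least-squares fit of dst ~ a * src + t, with complex a encoding scale and rotation."""
    n = len(src)
    mean_src = sum(src) / n
    mean_dst = sum(dst) / n
    num = sum((d - mean_dst) * (s - mean_src).conjugate() for s, d in zip(src, dst))
    den = sum(abs(s - mean_src) ** 2 for s in src)
    a = num / den
    t = mean_dst - a * mean_src
    return a, t


# Source points and a copy that is rotated by 180 degrees, scaled 10x, and shifted.
src = [0 + 0j, 1 + 0j, 0 + 1j, 2 + 3j]
dst = [-10 * p + (3 + 4j) for p in src]

a, t = fit_similarity_2d(src, dst)
scale = abs(a)           # recovered scale factor
angle = cmath.phase(a)   # recovered rotation in radians
```

With exact correspondences the closed form recovers the transform regardless of how far apart the two inputs start, which gives some intuition for why a solver of this family can tolerate a 180 degree misalignment or a 10x scale gap once the correspondences are reliable.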

中文标题/摘要

标题：基于几何感知特征引导对齐的跨实例高斯点云配准

我们提出了高斯点云对齐（GSA），这是一种通过相似变换（旋转、平移和缩放）对齐两个独立的3D高斯点云（3DGS）模型的新方法，即使它们是同一类别的不同对象（例如，不同车型的汽车）。与现有方法只能对齐同一对象的3DGS模型（例如，同一车型的汽车）且通常需要提供真实比例不同，我们能够成功估计比例。GSA利用视角引导的球面图特征获得稳健的对应关系，并引入了一种两步优化框架，在保持3DGS模型不变的情况下对齐它们。首先，我们应用迭代特征引导绝对定向求解器作为粗略对齐，该方法对初始条件差（例如，180度错位或10倍比例差距）具有鲁棒性。接下来，我们使用多视图特征一致性约束的精细对齐步骤，灵感来源于逆辐射场公式。第一步已经达到了最先进的性能，第二步进一步提高了结果。在同一对象的情况下，GSA优于先前的工作，即使其他方法提供了真实比例，差距也很大。在更难的不同类别对象的情况下，GSA远远超过了它们，提供了第一个有效的类别级3DGS配准解决方案，并解锁了新的应用。项目网页：https://bgu-cs-vil.github.io/GSA-project/

Summary / 总结

Gaussian Splatting Alignment (GSA) is a novel method for aligning 3D Gaussian Splatting (3DGS) models of different objects in the same category using a similarity transformation. It leverages viewpoint-guided spherical map features for robust correspondences and employs a two-step optimization framework. The first step uses an iterative feature-guided absolute orientation solver, while the second enforces multi-view feature consistency. GSA outperforms previous methods in aligning 3DGS models of the same object and achieves state-of-the-art results for aligning different objects in the same category, unlocking new applications.

Gaussian Splatting Alignment (GSA) 是一种用于通过相似变换对同一类别中不同对象的 3D Gaussian Splatting (3DGS) 模型进行对齐的新方法。它利用视点导向的球面图特征来获得稳健的对应关系，并采用两步优化框架。第一步使用迭代特征导向的绝对定向求解器，第二步则强制多视图特征一致性。GSA 在对齐同一类别中的不同对象时优于先前的方法，并在该领域达到了最先进的性能，解锁了新的应用。

Can Users Specify Driving Speed? Bench2Drive-Speed: Benchmark and Baselines for Desired-Speed Conditioned Autonomous Driving

Authors: Yuqian Shao, Xiaosong Jia, Langechuan Liu, Junchi Yan

First: 2026-03-26T17:26:41+00:00 · Latest: 2026-03-26T17:26:41+00:00

Comments: Project page: https://thinklab-sjtu.github.io/Bench2Drive-Speed/

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

End-to-end autonomous driving (E2E-AD) has achieved remarkable progress. However, one practical and useful function has long been overlooked: users may wish to customize the desired speed of the policy or specify whether to allow the autonomous vehicle to overtake. To bridge this gap, we present Bench2Drive-Speed, a benchmark with metrics, dataset, and baselines for desired-speed conditioned autonomous driving. We introduce explicit inputs of users' desired target-speed and overtake/follow instructions to driving policy models. We design quantitative metrics, including Speed-Adherence Score and Overtake Score, to measure how faithfully policies follow user specifications, while remaining compatible with standard autonomous driving metrics. To enable training of speed-conditioned policies, one approach is to collect expert demonstrations that strictly follow speed requirements, an expensive and unscalable process in the real world. An alternative is to adapt existing regular driving data by treating the speed observed in future frames as the target speed for training. To investigate this, we construct CustomizedSpeedDataset, composed of 2,100 clips annotated with expert demonstrations, enabling systematic investigation of supervision strategies. Our experiments show that, under proper re-annotation, models trained on regular driving data perform comparably to those trained on expert demonstrations, suggesting that speed supervision can be introduced without additional complex real-world data collection. Furthermore, we find that while target-speed following can be achieved without degrading regular driving performance, executing overtaking commands remains challenging due to the inherent difficulty of interactive behaviors. All code, datasets and baselines are available at https://github.com/Thinklab-SJTU/Bench2Drive-Speed
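The re-annotation idea mentioned in the abstract, treating the speed observed in future frames as the training target, can be sketched in a few lines. The look-ahead `horizon` and the choice to clamp at the end of the clip are assumptions made for illustration; the abstract does not specify either.

```python
def relabel_target_speed(speeds, horizon):
    """For each frame t, use the ego speed observed `horizon` frames later as the target.

    Frames near the end of the clip reuse the last observed speed (assumed policy).
    """
    last = len(speeds) - 1
    return [speeds[min(t + horizon, last)] for t in range(len(speeds))]


# Ego speed in m/s for five consecutive frames of a regular driving log.
observed = [0, 2, 4, 6, 8]
targets = relabel_target_speed(observed, horizon=2)
```

Each (observation, target speed) pair then serves as a speed-conditioned training example without collecting new expert demonstrations.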

中文标题/摘要

标题：用户能否指定驾驶速度？Bench2Drive-Speed：基于期望速度条件的自动驾驶基准和基线

端到端自动驾驶（E2E-AD）取得了显著进展。然而，一个实用且有用的功能长期被忽视：用户可能希望自定义政策的期望速度或指定是否允许自动驾驶车辆超车。为解决这一问题，我们提出了Bench2Drive-Speed，一个基于期望速度条件的自动驾驶基准，包含指标、数据集和基线。我们引入了用户的期望目标速度和超车/跟随指令作为驾驶策略模型的显式输入。我们设计了定量指标，包括速度一致性评分和超车评分，以衡量政策如何忠实于用户规定，同时保持与标准自动驾驶指标的兼容性。为了训练基于期望速度的策略，一种方法是收集严格遵循速度要求的专家演示，这在现实世界中成本高昂且难以扩展。另一种方法是通过将未来帧中观察到的速度视为训练目标速度来适应现有的常规驾驶数据。为了研究这一点，我们构建了CustomizedSpeedDataset，包含2100个带有专家演示标注的片段，使监督策略研究系统化。我们的实验表明，在适当重新标注后，基于常规驾驶数据训练的模型与基于专家演示训练的模型表现相当，表明可以通过引入速度监督而无需额外复杂的现实世界数据收集。此外，我们发现，虽然可以实现期望速度跟随而不会降低常规驾驶性能，但执行超车命令仍然具有挑战性，因为交互行为的固有难度。所有代码、数据集和基线均可在https://github.com/Thinklab-SJTU/Bench2Drive-Speed 获取

Summary / 总结

The paper presents Bench2Drive-Speed, a benchmark for desired-speed conditioned autonomous driving, addressing the need for users to customize vehicle speed and overtake/follow instructions. It introduces explicit user inputs and metrics to evaluate policy models. The study finds that training on regular driving data, with proper re-annotation, can achieve comparable performance to training on expert demonstrations, suggesting that speed supervision can be effectively introduced without additional complex data collection. However, executing overtaking commands remains challenging despite maintaining regular driving performance.

本文通过引入Bench2Drive-Speed基准，解决了端到端自动驾驶中用户自定义速度的问题。该基准包括指标、数据集和基线。作者提出了明确的用户输入目标速度和超车/跟随指令，并引入了如速度遵从评分和超车评分等指标来衡量策略的合规性。他们还提出了由2,100个带有专家注释的片段组成的CustomizedSpeedDataset，以评估不同的监督策略。实验表明，使用常规驾驶数据训练的模型可以与使用专家演示数据训练的模型表现相当，表明可以通过有效引入速度监督而无需额外的复杂数据收集。然而，尽管保持了常规驾驶性能，执行超车命令仍然具有挑战性。

Uncertainty-Guided Label Rebalancing for CPS Safety Monitoring

Authors: John Ayotunde, Qinghua Xu, Guancheng Wang, Lionel C. Briand

First: 2026-03-26T17:26:02+00:00 · Latest: 2026-03-26T17:26:02+00:00

Comments: 10 pages (main content), 3 pages references, 5 figures, 5 tables. Under review

Abs · PDF · Code1 · Code2

Abstract

Safety monitoring is essential for Cyber-Physical Systems (CPSs). However, unsafe events are rare in real-world CPS operations, creating an extreme class imbalance that degrades safety predictors. Standard rebalancing techniques perform poorly on time-series CPS telemetry, either generating unrealistic synthetic samples or overfitting on the minority class. Meanwhile, behavioral uncertainty in CPS operations, defined as the degree of doubt or uncertainty in CPS decisions, is often correlated with safety outcomes but unexplored in safety monitoring. To that end, we propose U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance imbalanced datasets prior to training a safety predictor. U-Balance first trains a GatedMLP-based uncertainty predictor that summarizes each telemetry window into distributional kinematic features and outputs an uncertainty score. It then applies an uncertainty-guided label rebalancing (uLNR) mechanism that probabilistically relabels safe-labeled windows with unusually high uncertainty as unsafe, thereby enriching the minority class with informative boundary samples without synthesizing new data. Finally, a safety predictor is trained on the rebalanced dataset for safety monitoring. We evaluate U-Balance on a large-scale UAV benchmark with a 46:1 safe-to-unsafe ratio. Results confirm a moderate but significant correlation between behavioral uncertainty and safety. We then identify uLNR as the most effective strategy to exploit uncertainty information, compared to direct early and late fusion. U-Balance achieves a 0.806 F1 score, outperforming the strongest baseline by 14.3 percentage points, while maintaining competitive inference efficiency. Ablation studies confirm that both the GatedMLP-based uncertainty predictor and the uLNR mechanism contribute significantly to U-Balance's effectiveness.
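A minimal sketch of the relabeling step described above, assuming a fixed uncertainty threshold and a fixed flip probability. Both parameters, and the function name `ulnr_relabel`, are hypothetical; the abstract only says that safe windows with unusually high uncertainty are probabilistically relabeled as unsafe.

```python
import random

SAFE, UNSAFE = 0, 1


def ulnr_relabel(labels, uncertainty, threshold, flip_prob, seed=0):
    """Probabilistically flip high-uncertainty SAFE windows to UNSAFE.

    UNSAFE windows and low-uncertainty SAFE windows are left unchanged,
    and no new samples are synthesized.
    """
    rng = random.Random(seed)
    relabeled = []
    for label, score in zip(labels, uncertainty):
        if label == SAFE and score > threshold and rng.random() < flip_prob:
            relabeled.append(UNSAFE)
        else:
            relabeled.append(label)
    return relabeled


labels = [SAFE, SAFE, SAFE, UNSAFE, SAFE]
uncertainty = [0.1, 0.9, 0.95, 0.2, 0.4]
always_flip = ulnr_relabel(labels, uncertainty, threshold=0.8, flip_prob=1.0)
never_flip = ulnr_relabel(labels, uncertainty, threshold=0.8, flip_prob=0.0)
```

Because only existing windows change label, the minority class grows with samples that sit near the decision boundary rather than with synthetic telemetry.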

中文标题/摘要

标题：基于不确定性指导的标签再平衡方法以提高CPS安全监控

安全监控对于网络物理系统（CPS）至关重要。然而，在实际的CPS操作中，不安全事件极为罕见，导致严重的类别不平衡，从而降低安全预测器的效果。标准的再平衡技术在处理CPS时间序列遥测数据时表现不佳，要么生成不切实际的合成样本，要么过度拟合少数类。同时，CPS操作中的行为不确定性，即CPS决策中的不确定程度，通常与安全结果相关，但在安全监控中尚未被探索。为此，我们提出了一种名为U-Balance的监督方法，该方法利用行为不确定性在训练安全预测器之前对不平衡数据集进行再平衡。U-Balance首先训练一个基于门控MLP的行为不确定性预测器，将每个遥测窗口总结为分布性动力学特征，并输出不确定性评分。然后，它应用一种基于不确定性的标签再平衡机制（uLNR），以概率方式将标记为“安全”但具有异常高不确定性的遥测窗口重新标记为“不安全”，从而在不生成新数据的情况下丰富少数类，提供具有信息性的边界样本。最后，基于再平衡后的数据集训练安全预测器以进行安全监控。我们在一个具有46:1安全到不安全比例的大规模无人机基准上评估了U-Balance。结果证实了行为不确定性与安全之间的中等但显著的相关性。我们还发现，与直接早期和晚期融合相比，uLNR是最有效的策略，以利用不确定性信息。U-Balance实现了0.806的F1分数，比最强基线高出14.3个百分点，同时保持了竞争力的推理效率。消融研究证实，基于门控MLP的行为不确定性预测器和uLNR机制对U-Balance的有效性贡献显著。

Summary / 总结

The paper addresses the challenge of class imbalance in safety monitoring for Cyber-Physical Systems (CPSs) by proposing U-Balance, a supervised approach that leverages behavioral uncertainty to rebalance datasets. U-Balance first predicts uncertainty using a GatedMLP-based model and then applies an uncertainty-guided label rebalancing mechanism to enrich the minority class. Experimental results on a UAV benchmark show that U-Balance significantly improves F1 score by 14.3 percentage points compared to the strongest baseline, while maintaining efficient inference.

论文提出了一种名为U-Balance的方法，通过利用行为不确定性来解决CPS安全监控中的类别不平衡问题。U-Balance首先使用GatedMLP模型预测不确定性，然后应用不确定性引导的标签重平衡机制来丰富少数类。实验结果表明，U-Balance在UAV基准测试上将F1分数提高了14.3个百分点，同时保持了高效的推理效率。

3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Authors: Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano

Venue: CVPR 2026

First: 2025-12-28T18:59:25+00:00 · Latest: 2026-03-26T17:21:05+00:00

Comments: Accepted to CVPR 2026. Project page: https://ryosuke-yamada.github.io/lam3c/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds reconstructed from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves better performance than previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning. Our source code is available at https://ryosuke-yamada.github.io/lam3c/.
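The Sinkhorn-Knopp step named in the method title is commonly used in clustering-based self-supervised learning to turn raw similarity scores into balanced soft cluster assignments. The sketch below shows that generic procedure in plain Python; how LAM3C applies it across its multi-level clustering is not described in the abstract, so the temperature `eps`, the iteration count, and the toy scores are assumptions.

```python
import math


def sinkhorn_knopp(scores, eps=0.5, iters=100):
    """Balanced soft assignment of n samples to k clusters.

    Alternately rescales columns and rows so that every cluster receives equal
    total mass and every sample's assignment sums to 1.
    """
    n, k = len(scores), len(scores[0])
    q = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(iters):
        # Column step: each cluster gets total mass 1/k.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: each sample gets total mass 1/n.
        row = [sum(r) for r in q]
        q = [[v / (row[i] * n) for v in q[i]] for i in range(n)]
    # Rescale so each sample's assignment is a probability vector.
    return [[v * n for v in r] for r in q]


# Four points, two clusters; raw scores favour cluster 0 for most points.
scores = [[2.0, 0.0], [1.5, 0.2], [1.8, 0.1], [0.3, 1.0]]
assignments = sinkhorn_knopp(scores)
column_mass = [sum(r[j] for r in assignments) for j in range(2)]
```

The balancing constraint prevents the degenerate solution in which every point collapses into a single cluster, which is the usual reason for including this step in clustering-based pre-training.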

中文标题/摘要

标题：无需3D扫描的3D表示学习：基于视频生成点云的可扩展预训练

尽管在3D自我监督学习方面取得了进展，但收集大规模3D场景扫描仍然非常昂贵且劳动密集。在本文中，我们研究是否可以从没有任何真实3D传感器的未标记视频中学习3D表示。我们提出了Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C)，这是一种自我监督框架，可以从未标记视频中重建的点云中学习。我们首先介绍了RoomTours，这是一个由从网络上收集的房间游览视频（例如，房地产巡游）构建的视频生成点云数据集，并使用现成的前馈重建模型生成了49,219个场景。我们还提出了一种噪声正则化损失，通过确保局部几何平滑性和在噪声点云下特征的稳定性来稳定表示学习。令人惊讶的是，LAM3C在室内语义分割和实例分割方面优于之前的自我监督方法，而无需使用任何真实3D扫描。这些结果表明，未标记的视频是3D自我监督学习中一个丰富的数据来源。我们的源代码可在https://ryosuke-yamada.github.io/lam3c/获取。

Summary / 总结

This work addresses the high cost of collecting 3D scene scans by proposing a method to learn 3D representations from unlabeled videos. The authors introduce LAM3C, a self-supervised framework that uses video-generated point clouds to train models. They construct RoomTours, a dataset of 49,219 scenes derived from web videos, and develop a noise-regularized loss to improve representation learning. LAM3C outperforms previous self-supervised methods on indoor semantic and instance segmentation without using any real 3D scans, indicating the potential of unlabeled videos for 3D self-supervised learning.

该研究通过提出一个自监督学习框架LAM3C，利用从网络视频生成的点云来学习3D表示，以应对大规模3D场景扫描的高成本问题。该框架使用RoomTours数据集，包含49,219个场景，由网络视频重建而成，并引入噪声正则化损失以稳定表示学习。LAM3C在室内语义分割和实例分割任务上优于先前的方法，且未使用任何真实3D扫描数据，这表明无标签视频是3D自监督学习的丰富数据来源。

LLM4AD: Large Language Models for Autonomous Driving -- Concept, Review, Benchmark, Experiments, and Future Trends

Authors: Can Cui, Yunsheng Ma, Sung-Yeon Park, Zichong Yang, Yupeng Zhou, Peiran Liu, Juanwu Lu, Juntong Peng, Jiaru Zhang, Ruqi Zhang, Lingxi Li, Yaobin Chen, Jitesh H. Panchal, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Ziran Wang

First: 2024-10-20T04:36:19+00:00 · Latest: 2026-03-26T17:19:01+00:00

Comments: The paper was accepted by the Proceedings of the IEEE

Abs · PDF · Code1 · Code2

Abstract

With the broader adoption and highly successful development of Large Language Models (LLMs), there has been growing interest and demand for applying LLMs to autonomous driving technology. Driven by their natural language understanding and reasoning capabilities, LLMs have the potential to enhance various aspects of autonomous driving systems, from perception and scene understanding to interactive decision-making. This paper first introduces the novel concept of designing Large Language Models for Autonomous Driving (LLM4AD), followed by a review of existing LLM4AD studies. Then, a comprehensive benchmark is proposed for evaluating the instruction-following and reasoning abilities of LLM4AD systems, which includes LaMPilot-Bench, CARLA Leaderboard 1.0 Benchmark in simulation and NuPlanQA for multi-view visual question answering. Furthermore, extensive real-world experiments are conducted on autonomous vehicle platforms, examining both on-cloud and on-edge LLM deployment for personalized decision-making and motion control. Next, the future trends of integrating language diffusion models into autonomous driving are explored, exemplified by the proposed ViLaD (Vision-Language Diffusion) framework. Finally, the main challenges of LLM4AD are discussed, including latency, deployment, security and privacy, safety, trust and transparency, and personalization.

中文标题/摘要

标题：LLM4AD：大型语言模型在自动驾驶中的应用——概念、综述、基准测试、实验及未来趋势

随着大型语言模型（LLMs）的更广泛采用和高度成功的开发，人们对其应用于自动驾驶技术的兴趣和需求日益增长。受其自然语言理解和推理能力的驱动，LLMs 有望增强自动驾驶系统的各个方面，从感知和场景理解到交互式决策。本文首先介绍了为自动驾驶设计大型语言模型（LLM4AD）的新概念，随后回顾了现有的LLM4AD研究。接着，提出了一种全面的基准测试，用于评估LLM4AD系统的指令遵循和推理能力，包括LaMPilot-Bench、CARLA Leaderboard 1.0基准测试（用于模拟）和NuPlanQA（用于多视图视觉问答）。此外，在自动驾驶车辆平台上进行了广泛的实地实验，考察了云上和边缘部署LLM4AD系统进行个性化决策和运动控制的情况。接着，探讨了将语言扩散模型整合到自动驾驶中的未来趋势，以提出的ViLaD（视觉-语言扩散）框架为例。最后，讨论了LLM4AD的主要挑战，包括延迟、部署、安全和隐私、安全、信任和透明度以及个性化问题。

Summary / 总结

This paper introduces the concept of Large Language Models for Autonomous Driving (LLM4AD) and reviews existing LLM4AD studies. It proposes a benchmark for evaluating instruction-following and reasoning abilities (LaMPilot-Bench, CARLA Leaderboard 1.0 in simulation, and NuPlanQA for multi-view visual question answering) and reports real-world experiments on autonomous vehicle platforms covering both on-cloud and on-edge LLM deployment. It also explores language diffusion models for driving through the proposed ViLaD framework, and discusses open challenges including latency, deployment, security and privacy, safety, trust and transparency, and personalization.

本文介绍了面向自动驾驶的大型语言模型（LLM4AD）这一概念，并回顾了现有的LLM4AD研究。文中提出了一个用于评估指令遵循和推理能力的基准（LaMPilot-Bench、仿真环境中的CARLA Leaderboard 1.0以及用于多视图视觉问答的NuPlanQA），并在自动驾驶车辆平台上进行了涵盖云端和边缘端LLM部署的实车实验。文中还通过所提出的ViLaD框架探讨了语言扩散模型在自动驾驶中的应用，并讨论了延迟、部署、信息安全与隐私、行车安全、信任与透明度以及个性化等挑战。

Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

Authors: Wenxuan Song, Jiayi Chen, Shuai Chen, Jingbo Wang, Pengxiang Ding, Han Zhao, Yikai Qin, Xinhu Zheng, Donglin Wang, Yan Wang, Haoang Li

First: 2026-03-26T17:14:57+00:00 · Latest: 2026-03-26T17:14:57+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary tasks. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary task training within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver this goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies. The difference between the resulting model parameters can then be interpreted as capability vectors provided by auxiliary tasks. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Experimental results demonstrate that this approach is highly effective across diverse robot tasks. Project page: https://chris1220313648.github.io/Fast-dVLA/

中文标题/摘要

标题：Fast-dVLA：加速离散扩散VLA以实现实时性能

本文提出了一种新颖的方法，以解决预训练VLA模型在标准监督微调（SFT）过程中往往无法有效提高性能并降低适应成本的挑战。一些带有辅助训练目标的高级微调方法可以提高性能并减少收敛步骤。然而，它们通常由于辅助任务的附加损失而产生显著的计算开销。为了同时实现辅助训练增强的能力和标准SFT的简单性，我们在参数空间内解耦辅助任务训练的两个目标，即增强通用能力和拟合任务特定的动作分布。为了实现这一目标，我们只需要使用两种不同的训练策略对模型进行训练，使其在小型任务集上收敛。由此产生的模型参数之间的差异可以解释为由辅助任务提供的能力向量。然后将这些向量与预训练参数合并，形成能力增强的元模型。此外，当标准SFT与轻量级正交正则化损失结合时，合并后的模型在减少计算开销的情况下，性能可与辅助微调基线相媲美。实验结果表明，该方法在多种机器人任务中具有很高的有效性。项目页面：https://chris1220313648.github.io/Fast-dVLA/

Summary / 总结

The abstract listed for Fast-dVLA describes decoupling the two objectives of auxiliary-task finetuning in parameter space: enhancing general capabilities and fitting task-specific action distributions. The model is trained to convergence on a small-scale task set with two distinct training strategies; the difference between the resulting parameters is interpreted as a capability vector and merged with the pretrained parameters to form a capability-enhanced meta model. When standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model reaches performance comparable to auxiliary-finetuned baselines with reduced computational overhead across diverse robot tasks.

Fast-dVLA条目下的摘要描述了在参数空间内解耦辅助任务微调的两个目标：增强通用能力和拟合任务特定的动作分布。该方法使用两种不同的训练策略在小规模任务集上将模型训练至收敛，把所得参数之差视为能力向量，并与预训练参数合并，形成能力增强的元模型。在标准SFT中加入轻量级正交正则化损失后，合并模型在多种机器人任务上以更低的计算开销达到了与辅助微调基线相当的性能。
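
The parameter-space merge described in the abstract above can be illustrated with a minimal sketch. It assumes the capability vector is the element-wise difference between the parameters obtained from the two training strategies, and that the merge is a scaled addition onto the pretrained parameters. The names `capability_vector`, `merge`, and the `scale` coefficient are hypothetical and do not come from the paper.

```python
from typing import Dict, List

# Toy parameter store: parameter name -> flat list of weights (illustrative only).
Params = Dict[str, List[float]]


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Element-wise difference between the two trained models (assumed form)."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
    }


def merge(theta_pre: Params, vec: Params, scale: float = 1.0) -> Params:
    """Add the scaled capability vector to the pretrained parameters.

    `scale` is a hypothetical knob; the abstract does not say whether one is used.
    """
    return {
        name: [p + scale * v for p, v in zip(theta_pre[name], vec[name])]
        for name in theta_pre
    }


# Made-up numbers chosen so the arithmetic is easy to follow.
theta_pre = {"w": [0.0, 1.0], "b": [0.5]}  # pretrained model
theta_sft = {"w": [0.5, 1.0], "b": [0.5]}  # trained with standard SFT
theta_aux = {"w": [1.0, 2.0], "b": [1.0]}  # trained with auxiliary objectives

vec = capability_vector(theta_aux, theta_sft)
meta = merge(theta_pre, vec)
print(vec)
print(meta)
```

In this sketch the task-specific fit shared by both trained models cancels in the difference, which is one way to read the abstract's claim that the vector isolates what the auxiliary objectives contribute.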

ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference

Authors: Ali Hojjat, Janek Haberer, Soren Pirk, Olaf Landsiedel

Venue: CVPR

First: 2025-07-14T20:54:41+00:00 · Latest: 2026-03-26T17:14:32+00:00

Comments: Accepted at CVPR'26, please cite the conference version

Abs · PDF · Code1 · Code2 · Code3

Abstract

ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin Transformers. The source code is available at https://github.com/ds-kiel/ThinkingViT.

中文标题/摘要

标题：ThinkingViT：用于弹性推理的套娃式（Matryoshka）思考视觉Transformer

ViTs 提供了SOTA性能，但其固定的计算预算阻碍了其在异构硬件上的可扩展部署。最近的套娃式（Matryoshka）Transformer架构通过在单个模型中嵌入嵌套子网络来缓解这一问题，以实现可扩展的推理。然而，这些模型对所有输入分配相同的计算量，无论其复杂度如何，这导致了效率低下。为了解决这一问题，我们引入了ThinkingViT，这是一种嵌套ViT架构，它采用渐进思考阶段，根据输入难度动态调整推理计算。ThinkingViT首先激活最重要的注意力头中的一小部分以生成初始预测。如果预测置信度超过预定义阈值，则推理提前终止。否则，它在同一主干中激活更大的注意力头子集，并进行新的前向传播。这一过程迭代进行，直到模型达到预定义的置信水平或耗尽其最大容量。为了提高后续轮次的性能，我们引入了一种Token Recycling方法，将输入嵌入与上一阶段的嵌入融合。实验表明，在ImageNet-1K上，ThinkingViT在相同吞吐量下的准确率比嵌套基线最多高出2.0个百分点，在相同GMACs下最多高出2.9个百分点。我们展示了ThinkingViT保留主干的设计使其能够在语义分割等下游任务中作为ViTs的即插即用升级。我们还展示了ThinkingViT可以有效地迁移到Swin Transformer等其他架构。源代码可在 https://github.com/ds-kiel/ThinkingViT 获得。

Summary / 总结

ThinkingViT is designed to improve the efficiency of Vision Transformers (ViTs) for scalable deployment across different hardware by dynamically adjusting inference computation based on input complexity. It uses progressive thinking stages to activate attention heads in a nested manner, terminating early if the prediction confidence is high. Token Recycling is introduced to enhance performance in subsequent rounds. Experiments on ImageNet-1K show that ThinkingViT outperforms nested baselines by up to 2.0 percentage points at the same throughput and by up to 2.9 percentage points at equal GMACs. The backbone-preserving design allows ThinkingViT to be easily integrated into downstream tasks like semantic segmentation and other architectures such as Swin Transformers.

ThinkingViT 通过根据输入复杂度动态调整推理计算来提升 Vision Transformers（ViTs）的效率。它使用渐进的思考阶段逐步激活更多注意力头，并在预测置信度足够高时提前终止，同时通过 Token Recycling 提升后续轮次的性能。在 ImageNet-1K 上，它在相同吞吐量下比嵌套基线的准确率最多高 2.0 个百分点，在相同 GMACs 下最多高 2.9 个百分点。ThinkingViT 还可作为语义分割等下游任务中 ViT 的即插即用升级，并可迁移到 Swin Transformer 等其他架构。
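
The confidence-gated loop in the ThinkingViT abstract can be sketched as follows. This is a toy illustration of the control flow only, not the authors' implementation. `progressive_inference`, `toy_forward`, the head budgets, and the additive fusion used to imitate Token Recycling are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# forward(x, num_heads, prev_embedding) -> (class probabilities, embedding)
Forward = Callable[
    [Sequence[float], int, Optional[List[float]]],
    Tuple[List[float], List[float]],
]


def progressive_inference(
    x: Sequence[float],
    forward: Forward,
    head_budgets: Sequence[int],
    threshold: float,
) -> Tuple[int, float, int]:
    """Run stages with growing head budgets until confidence exceeds threshold.

    Returns (predicted class, confidence, heads used in the final stage).
    """
    if not head_budgets:
        raise ValueError("head_budgets must contain at least one stage")
    prev: Optional[List[float]] = None
    probs: List[float] = []
    used = 0
    for used in head_budgets:
        # Each stage is a new forward pass that also sees the previous embedding.
        probs, prev = forward(x, used, prev)
        if max(probs) > threshold:
            break  # confident enough: terminate early
    conf = max(probs)
    return probs.index(conf), conf, used


def toy_forward(
    x: Sequence[float], num_heads: int, prev: Optional[List[float]]
) -> Tuple[List[float], List[float]]:
    """Stand-in model: confidence simply grows with the number of active heads."""
    conf = min(0.5 + 0.1 * num_heads, 0.99)
    # Additive fusion is an assumption standing in for Token Recycling.
    fused = list(x) if prev is None else [a + b for a, b in zip(x, prev)]
    return [conf, 1.0 - conf], fused


hard = progressive_inference([0.1, 0.2], toy_forward, [3, 6], threshold=0.9)
easy = progressive_inference([0.1, 0.2], toy_forward, [3, 6], threshold=0.7)
print(hard, easy)
```

With the stricter threshold the toy input needs the larger stage, while the looser threshold lets it exit after the first one, which is the input-dependent compute allocation the abstract describes.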

Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Authors: Abdullah Hamdi, Changchun Yang, Xin Gao

First: 2026-03-26T16:58:43+00:00 · Latest: 2026-03-26T16:58:43+00:00

Comments: preprint

Abs · PDF · Code1 · Code2

Abstract

Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench .

中文标题/摘要

标题：Colon-Bench：用于完整结肠镜检查视频中可扩展密集病灶标注的智能体工作流

通过结肠镜检查进行早期筛查对于结肠癌预防至关重要，但由于缺乏密集标注的长序列视频数据集，该领域稳健AI系统的开发受到阻碍。现有数据集主要集中在单类息肉检测，缺乏评估现代多模态大型语言模型（MLLMs）所需的丰富空间、时间和语言标注。为填补这一关键空白，我们引入了Colon-Bench，它通过一种新颖的多阶段智能体工作流生成。我们的流程无缝集成了时间提议、边界框跟踪、AI驱动的视觉确认和人工在环审查，以可扩展的方式标注完整检查过程的视频。最终经过验证的基准在规模上前所未有，包括528个视频、14个不同的病灶类别（包括息肉、溃疡和出血）、超过30万个边界框、21.3万个分割掩码和13.3万词的临床描述。我们利用Colon-Bench严格评估最先进的MLLMs在病灶分类、开放词汇视频对象分割（OV-VOS）和视频视觉问答（VQA）方面的性能。与SAM-3相比，MLLM在医学领域表现出令人惊讶的高定位性能。最后，我们分析了MLLMs常见的VQA错误，提出了一种新的“结肠技能”（colon-skill）提示策略，使大多数MLLMs的零样本性能最多提高9.7%。数据集和代码可在 https://abdullahamdi.com/colon-bench 获取。

Summary / 总结

Colon-Bench is a benchmark of densely annotated full-procedure colonoscopy videos, addressing the lack of richly annotated datasets in this domain. Its pipeline integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to annotate 528 videos with 14 lesion categories, over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. Evaluating state-of-the-art MLLMs on the benchmark shows surprisingly high localization performance compared to SAM-3, and a new colon-skill prompting strategy improves zero-shot performance by up to 9.7% across most MLLMs.

该研究通过构建完整结肠镜检查视频的密集标注基准，服务于结肠癌预防领域稳健AI系统的开发。方法是一个多阶段的智能体工作流，整合时间提议、边界框跟踪、AI驱动的视觉确认和人工在环审查，标注了528个视频，包含14种病灶类别、超过30万个边界框、21.3万个分割掩码和13.3万词的临床描述。主要发现包括：与SAM-3相比，MLLMs表现出令人惊讶的高定位性能；新的“结肠技能”提示策略使大多数MLLMs的零样本性能最多提高9.7%。

Central Dogma Transformer III: Interpretable AI Across DNA, RNA, and Protein

Authors: Nobuyuki Ota

First: 2026-03-24T15:57:23+00:00 · Latest: 2026-03-26T16:53:09+00:00

Comments: 21 pages, 8 figures, v2: corrected mRNA-protein divergence analysis with DSB-normalized data

Abs · PDF · Code1 · Code2

Abstract

Biological AI models increasingly predict complex cellular responses, yet their learned representations remain disconnected from the molecular processes they aim to capture. We present CDT-III, which extends mechanism-oriented AI across the full central dogma: DNA, RNA, and protein. Its two-stage Virtual Cell Embedder architecture mirrors the spatial compartmentalization of the cell: VCE-N models transcription in the nucleus and VCE-C models translation in the cytosol. On five held-out genes, CDT-III achieves per-gene RNA r=0.843 and protein r=0.969. Adding protein prediction improves RNA performance (r=0.804 to 0.843), demonstrating that downstream tasks regularize upstream representations. Protein supervision sharpens DNA-level interpretability, increasing CTCF enrichment by 30%. Analysis of experimentally measured mRNA and protein responses reveals that the majority of genes with observable mRNA changes show opposite protein-level changes (66.7% at |log2FC|>0.01, rising to 87.5% at |log2FC|>0.02), exposing a fundamental limitation of RNA-only perturbation models. Despite this pervasive direction discordance, CDT-III correctly predicts both mRNA and protein responses. Applied to in silico CD52 knockdown approximating Alemtuzumab, the model predicts 29/29 protein changes correctly and rediscovers 5 of 7 known clinical side effects without clinical data. Gradient-based side effect profiling requires only unperturbed baseline data (r=0.939), enabling screening of all 2,361 genes without new experiments.

中文标题/摘要

标题：中心法则 Transformer III：贯穿DNA、RNA和蛋白质的可解释AI

生物AI模型越来越多地用于预测复杂的细胞反应，但它们学到的表示与其试图刻画的分子过程仍然脱节。我们提出了CDT-III，它将面向机制的AI扩展到完整的中心法则：DNA、RNA和蛋白质。其两阶段的虚拟细胞嵌入器（Virtual Cell Embedder）架构对应细胞的空间区室化：VCE-N对细胞核内的转录进行建模，VCE-C对胞质中的翻译进行建模。在五个留出基因上，CDT-III达到了每基因RNA r=0.843和蛋白质 r=0.969。加入蛋白质预测提高了RNA性能（r从0.804提高到0.843），表明下游任务对上游表示起到了正则化作用。蛋白质监督增强了DNA层面的可解释性，使CTCF富集提高30%。对实验测得的mRNA和蛋白质反应的分析表明，在mRNA水平上有可观察变化的基因中，大多数在蛋白质水平上表现出相反方向的变化（|log2FC|>0.01时为66.7%，|log2FC|>0.02时升至87.5%），这揭示了仅基于RNA的扰动模型的根本局限。尽管这种方向不一致普遍存在，CDT-III仍能正确预测mRNA和蛋白质反应。在近似模拟Alemtuzumab的CD52计算机模拟敲低中，该模型正确预测了29/29的蛋白质变化，并在不使用临床数据的情况下重新发现了7个已知临床副作用中的5个。基于梯度的副作用分析只需要未扰动的基线数据（r=0.939），从而无需新的实验即可筛选全部2,361个基因。

Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers

Authors: Mingmeng Geng, Yuhang Dong, Thierry Poibeau

First: 2026-03-26T16:49:00+00:00 · Latest: 2026-03-26T16:49:00+00:00

Comments: Visualization of word usage patterns in arXiv abstracts: https://llm-impact.github.io/word-usage-arxiv-abstract/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of "beyond" and "via" in titles and the decreased frequency of "the" and "of" in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.

中文标题/摘要

标题：超越Via：大型语言模型在学术论文中的影响分析与估计

通过对arXiv论文的分析，我们报告了几种可能由大型语言模型（LLMs）驱动但尚未得到足够关注的词汇使用变化，例如标题中“beyond”和“via”频率的增加以及摘要中“the”和“of”频率的减少。由于不同LLM之间的相似性，实验表明当前的分类器在多类分类任务中难以准确确定给定文本是由哪个具体模型生成的。同时，LLM之间的差异也导致了学术论文中词汇使用模式的变化。通过采用直接且高度可解释的线性方法，并考虑到模型和提示之间的差异，我们定量评估了这些影响，并表明现实世界中LLM的使用是异质且动态的。

Summary / 总结

The study analyzes arXiv papers to identify shifts in word usage likely influenced by large language models (LLMs), such as increased use of 'beyond' and 'via' in titles and decreased use of 'the' and 'of' in abstracts. Experiments show that current classifiers have difficulty distinguishing between different LLMs in multi-class classification tasks, and that LLM usage patterns evolve due to model differences. A linear approach is used to quantify these effects, revealing that LLM usage is heterogeneous and dynamic in real-world academic papers.

研究通过分析arXiv论文，发现了一些词汇使用的变化，这些变化可能由大型语言模型（LLMs）引起，例如标题中‘beyond’和‘via’的使用频率增加，而摘要中‘the’和‘of’的使用频率减少。实验表明，当前的分类器在区分不同LLM时存在困难，且LLM的使用模式因模型差异而变化。通过采用直接且可解释的线性方法，研究量化了这些影响，揭示了LLM在真实世界学术论文中的使用是异质性和动态的。

Designing Any Imaging System from Natural Language: Agent-Constrained Composition over a Finite Primitive Basis

Authors: Chengshuai Yang

First: 2026-03-26T16:47:27+00:00 · Latest: 2026-03-26T16:47:27+00:00

Comments: 28 pages, 7 figures, 8 tables, includes Supplementary Information (sections S1-S6)

Abs · PDF · Code1 · Code2

Abstract

Designing a computational imaging system -- selecting operators, setting parameters, validating consistency -- requires weeks of specialist effort per modality, creating an expertise bottleneck that excludes the broader scientific community from prototyping imaging instruments. We introduce spec.md, a structured specification format, and three autonomous agents -- Plan, Judge, and Execute -- that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. A design-to-real error theorem decomposes total reconstruction error into five independently bounded terms, each linked to a corrective action. On 6 real-data modalities spanning all 5 carrier families, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs -- composing primitives into chains from 3D to 5D -- demonstrate compositional reach beyond any single-modality tool.

中文标题/摘要

标题：从自然语言设计任何成像系统：基于有限基本单元的代理约束组合

设计计算成像系统（选择算子、设置参数、验证一致性）在每种模态上都需要专家数周的工作，由此形成的专业知识瓶颈使更广泛的科学界无法参与成像仪器的原型设计。我们引入了结构化规范格式 spec.md，以及三个自主智能体（Plan、Judge 和 Execute），它们将一句话的自然语言描述转化为经过验证、重建误差有界的前向模型。一个“设计到实际”误差定理将总重建误差分解为五个独立有界的项，每一项都对应一项纠正措施。在涵盖全部5个载体族的6种真实数据模态上，自动化流程达到了专家库的质量水平（98.1±4.2%）。十个新颖设计将基本单元组合成从3D到5D的链，展示了超越任何单一模态工具的组合能力。

Summary / 总结

The research addresses the expertise bottleneck in designing computational imaging systems, which typically requires weeks of specialist effort per modality. It introduces spec.md, a structured specification format, and three autonomous agents (Plan, Judge, and Execute) that translate a one-sentence natural-language description into a validated forward model with bounded reconstruction error. On six real-data modalities, the automated pipeline matches expert-library quality (98.1 +/- 4.2%). Ten novel designs, composed as chains of primitives from 3D to 5D, demonstrate compositional reach beyond single-modality tools.

该研究旨在自动化计算成像系统的设计，这类设计通常在每种模态上都需要专家数周的工作。它引入了结构化规范格式 spec.md 和三个自主智能体（Plan、Judge 和 Execute），将一句话的自然语言描述转换为经过验证、重建误差有界的前向模型。在六种真实数据模态上，自动化流程达到了专家库的质量水平（98.1±4.2%）。十个由基本单元组成的从3D到5D的新颖链式设计，展示了超越单一模态工具的组合能力。

Anchored-Branched Steady-state WInd Flow Transformer (AB-SWIFT): a metamodel for 3D atmospheric flow in urban environments

Authors: Armand de Villeroché, Rem-Sophia Mouradi, Vincent Le Guen, Sibo Cheng, Marc Bocquet, Alban Farchi, Patrick Armand, Patrick Massin

First: 2026-03-26T16:46:54+00:00 · Latest: 2026-03-26T16:46:54+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Air flow modeling at a local scale is essential for applications such as pollutant dispersion modeling or wind farm modeling. To circumvent costly Computational Fluid Dynamics (CFD) computations, deep learning surrogate models have recently emerged as promising alternatives. However, in the context of urban air flow, deep learning models struggle to adapt to the high variations of the urban geometry and to large mesh sizes. To tackle these challenges, we introduce Anchored Branched Steady-state WInd Flow Transformer (AB-SWIFT), a transformer-based model with an internal branched structure uniquely designed for atmospheric flow modeling. We train our model on a specially designed database of atmospheric simulations around randomised urban geometries and with a mixture of unstable, neutral, and stable atmospheric stratifications. Our model reaches the best accuracy on all predicted fields compared to state-of-the-art transformers and graph-based models. Our code and data are available at https://github.com/cerea-daml/abswift.

中文标题/摘要

标题：锚定分支稳态风场Transformer（AB-SWIFT）：一种用于城市环境三维大气流动的元模型

局部尺度上的气流建模对于污染物扩散建模或风电场建模等应用至关重要。为了避免代价高昂的计算流体动力学（CFD）计算，深度学习代理模型近来成为有前景的替代方案。然而，在城市气流的场景下，深度学习模型难以适应城市几何形状的剧烈变化和较大的网格规模。为了解决这些挑战，我们引入了锚定分支稳态风场Transformer（AB-SWIFT），这是一种具有内部分支结构、专为大气流动建模设计的基于Transformer的模型。我们在一个专门设计的大气模拟数据库上训练模型，该数据库包含围绕随机城市几何形状的模拟，并混合了不稳定、中性和稳定的大气层结。与最先进的Transformer模型和基于图的模型相比，我们的模型在所有预测场上都达到了最佳精度。我们的代码和数据可在 https://github.com/cerea-daml/abswift 获取。
