arXiv Paper Digest

HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu

First: 2025-10-23T17:59:59+00:00 · Latest: 2025-10-23T17:59:59+00:00

Comments: Project page and code: https://holo-cine.github.io/

Abstract

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.

Summary

HoloCine targets a gap in current text-to-video models, which generate isolated clips well but struggle with coherent multi-shot narratives. It generates an entire scene holistically: a Window Cross-Attention mechanism localizes each text prompt to a specific shot, and a Sparse Inter-Shot Self-Attention pattern (dense within shots, sparse between them) keeps minute-scale generation efficient. The authors report state-of-the-art narrative coherence and emergent abilities, including persistent memory for characters and scenes and a grasp of cinematic techniques.
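
The two attention patterns can be pictured as boolean masks. The sketch below is a minimal illustration under stated assumptions, not HoloCine's implementation: all helper names are hypothetical, and making inter-shot attention sparse by exposing only the first `summary_tokens` tokens of every other shot is an assumption chosen for illustration (the abstract only says attention is dense within shots and sparse between them).

```python
def shot_layout(lengths):
    """Return (segment id per token, offset within its segment per token)."""
    ids, offsets = [], []
    for segment, length in enumerate(lengths):
        for offset in range(length):
            ids.append(segment)
            offsets.append(offset)
    return ids, offsets


def sparse_inter_shot_mask(shot_lengths, summary_tokens=1):
    """mask[q][k] is True when video token q may attend to video token k."""
    ids, offsets = shot_layout(shot_lengths)
    total = len(ids)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            same_shot = ids[q] == ids[k]
            # ASSUMPTION: other shots are visible only through their first few tokens.
            is_summary = offsets[k] < summary_tokens
            mask[q][k] = same_shot or is_summary
    return mask


def window_cross_attention_mask(shot_lengths, prompt_lengths):
    """mask[q][t] is True when video token q may attend to text token t.

    Each shot sees only the text tokens of its own per-shot prompt.
    """
    video_ids, _ = shot_layout(shot_lengths)
    text_ids, _ = shot_layout(prompt_lengths)
    return [[v == t for t in text_ids] for v in video_ids]


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)
```

For three shots of four tokens each, `count_allowed(sparse_inter_shot_mask([4, 4, 4]))` gives 72 permitted pairs, against 144 for full attention. The saving grows with the number and length of shots, which is the kind of efficiency the abstract attributes to the sparse pattern.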

One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling

Authors: Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot

First: 2025-05-19T16:59:47+00:00 · Latest: 2025-10-23T17:59:57+00:00

Abstract

Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory - a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. KDM achieves highly competitive performance across standard offline distillation benchmarks.

Summary

This work addresses the computational cost of iterative sampling in diffusion-based generative models with the Koopman Distillation Model (KDM), an offline distillation approach grounded in Koopman theory, which represents nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space, propagates them forward with a learned linear operator, and decodes clean samples, enabling single-step generation while preserving semantic fidelity. The authors argue that, under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation and that proximity in the Koopman latent space correlates with semantic similarity in the generated outputs. KDM is reported to be highly competitive on standard offline distillation benchmarks.
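
The encode, propagate, decode idea can be illustrated on a textbook toy system. The sketch below is not KDM: KDM learns its encoder, linear operator, and decoder, whereas here all three are written by hand for a two-variable nonlinear map whose exact finite-dimensional Koopman lifting is known. All function names and the constants `a`, `b`, `c` are assumptions made for illustration.

```python
def nonlinear_step(state, a=0.9, b=0.5, c=0.3):
    """One step of a toy nonlinear system: x2 depends on the square of x1."""
    x1, x2 = state
    return (a * x1, b * x2 + c * x1 * x1)


def encode(state):
    """Lift (x1, x2) to (x1, x2, x1^2); the dynamics are linear in these coordinates."""
    x1, x2 = state
    return [x1, x2, x1 * x1]


def decode(z):
    """Read the original state back out of the lifted coordinates."""
    return (z[0], z[1])


def koopman_matrix(a=0.9, b=0.5, c=0.3):
    """Linear operator acting on the lifted coordinates."""
    return [[a, 0.0, 0.0],
            [0.0, b, c],
            [0.0, 0.0, a * a]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]


def matrix_power(A, n):
    size = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, A)
    return result


def iterate(state, steps):
    """Reference: run the nonlinear system step by step."""
    for _ in range(steps):
        state = nonlinear_step(state)
    return state


def one_shot(state, steps):
    """Encode, apply a single precomputed linear map, decode."""
    k_n = matrix_power(koopman_matrix(), steps)
    return decode(matvec(k_n, encode(state)))
```

Because the lifted dynamics are linear, many iterative steps collapse into one matrix, so `one_shot(s, 20)` matches `iterate(s, 20)` up to floating-point rounding. This is the property that makes a single-step generator plausible once a suitable embedding exists.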

LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas

Authors: Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang

First: 2025-10-23T17:59:55+00:00 · Latest: 2025-10-23T17:59:55+00:00

Comments: 9 pages, preprint

Abstract

Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.

Summary

LayerComposer is an interactive framework for personalized, multi-subject text-to-image generation. Its layered canvas places each subject on a distinct layer, enabling occlusion-free composition and letting users place, resize, or lock subjects through layer manipulation. Its locking mechanism preserves selected layers with high fidelity while the remaining layers adapt to the surrounding context; it requires no architectural changes, relying on positional embeddings and a complementary data sampling strategy. Experiments show better spatial control and identity preservation than state-of-the-art methods in multi-subject personalized image generation.

Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Authors: Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

First: 2025-10-23T17:59:54+00:00 · Latest: 2025-10-23T17:59:54+00:00

Abstract

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

Summary

This paper proposes the Latent Denoising Diffusion Bridge Model (LDDBM) for modality translation, that is, translating information across different sensory modalities. LDDBM operates in a shared latent space, so it can learn a bridge between arbitrary modalities without requiring aligned dimensions. A contrastive alignment loss enforces semantic consistency between paired samples, and a predictive loss guides training toward accurate cross-domain translation. Experiments show strong performance on multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis, establishing a strong baseline for general modality translation.

VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

Authors: Mateo Guaman Castro, Sidharth Rajagopal, Daniel Gorbatov, Matt Schmittle, Rohan Baijal, Octi Zhang, Rosario Scalise, Sidharth Talia, Emma Romig, Celso de Melo, Byron Boots, Abhishek Gupta

First: 2025-10-23T17:59:45+00:00 · Latest: 2025-10-23T17:59:45+00:00

Abstract

A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/

Summary

VAMOS is a hierarchical vision-language-action model for navigation that generalizes across diverse environments while respecting the physical constraints of a specific robot embodiment. It decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data and proposes candidate paths in image space, and a specialist affordance model, trained in simulation, evaluates and re-ranks them. In real-world experiments, VAMOS achieves higher success rates than state-of-the-art model-based and end-to-end learning methods in indoor and complex outdoor navigation, supports cross-embodiment navigation across legged and wheeled robots, and can be steered with natural language. Ablations indicate that the affordance model is key to embodiment grounding and yields 3X higher success rates on a single robot by rejecting physically infeasible plans.
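
The propose-then-re-rank interface can be sketched in a few lines. This is an illustration under stated assumptions, not the VAMOS code: `CandidatePath`, its `terrain` tag (a stand-in for what an affordance model would infer from an image-space path), the two toy affordance functions, the feasibility threshold, and the product used to combine scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidatePath:
    name: str
    planner_score: float  # hypothetical: planner preference, higher is better
    terrain: str          # hypothetical stand-in for image-space path features


def rerank(paths, affordance_fn, min_feasibility=0.5):
    """Drop paths the embodiment cannot traverse, then sort the rest.

    ASSUMPTION: the final score is planner_score * feasibility.
    """
    scored = []
    for path in paths:
        feasibility = affordance_fn(path)
        if feasibility < min_feasibility:
            continue  # reject physically infeasible plans
        scored.append((path.planner_score * feasibility, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in scored]


def wheeled_affordance(path):
    """Toy specialist for a wheeled robot: stairs are not traversable."""
    return 0.0 if path.terrain == "stairs" else 0.9


def legged_affordance(path):
    """Toy specialist for a legged robot: stairs are traversable."""
    return 0.8 if path.terrain == "stairs" else 0.9


candidates = [
    CandidatePath("via_stairs", planner_score=0.9, terrain="stairs"),
    CandidatePath("via_ramp", planner_score=0.6, terrain="flat"),
]
wheeled_plan = [p.name for p in rerank(candidates, wheeled_affordance)]
legged_plan = [p.name for p in rerank(candidates, legged_affordance)]
```

The same candidate list yields different plans per robot: the wheeled specialist removes the stairs route, while the legged specialist keeps it ranked first. Swapping only `affordance_fn` mirrors how a single high-level planner could serve physically distinct robots.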

KL-Regularized Reinforcement Learning is Designed to Mode Collapse

Authors: Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath

First: 2025-10-23T17:59:40+00:00 · Latest: 2025-10-23T17:59:40+00:00

Abstract

It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
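
A small numerical example makes the point about equal rewards concrete. For the reverse-KL-regularized objective, maximizing E_pi[r(y)] - beta * KL(pi || pi_ref) over a finite set of outputs, the standard closed-form optimum is pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The sketch below only evaluates that formula; it is not the algorithm proposed in the paper, and the three-outcome reference distribution and rewards are made-up values for illustration.

```python
import math


def reverse_kl_target(ref_probs, rewards, beta):
    """Optimum of E_pi[r] - beta * KL(pi || pi_ref) over a finite outcome set.

    Requires every reference probability to be strictly positive.
    """
    logits = {y: math.log(ref_probs[y]) + rewards[y] / beta for y in ref_probs}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {y: math.exp(value - top) for y, value in logits.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Made-up example: A and B are both correct (reward 1), C is wrong (reward 0).
ref = {"A": 0.90, "B": 0.09, "C": 0.01}
rewards = {"A": 1.0, "B": 1.0, "C": 0.0}

targets = {beta: reverse_kl_target(ref, rewards, beta) for beta in (1.0, 0.1, 0.01)}
ratios = {beta: t["A"] / t["B"] for beta, t in targets.items()}
```

Because A and B receive the same reward, the exponential factor cancels and the target keeps the reference ratio of 10 to 1 at every `beta`. Lowering `beta` pushes the mass on C toward zero but never balances the two correct answers, which is consistent with the abstract's claim that equal verifiable rewards with weak regularization tend to specify a target dominated by one mode.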

GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

Authors: Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, Xiaolong Wang

First: 2025-10-23T17:59:26+00:00 · Latest: 2025-10-23T17:59:26+00:00

Abstract

This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: https://3dgsworld.github.io/.

Summary

GSWorld is a photo-realistic simulator for robotic manipulation that combines 3D Gaussian Splatting with physics engines. It aims to close the loop in developing manipulation policies by supporting reproducible evaluation of policies learned from real-robot data and sim2real policy training without real robots. Scenes are stored in a new asset format, GSDF (Gaussian Scene Description File), which combines a Gaussian-on-Mesh representation with robot URDF and other objects. Demonstrated applications include zero-shot sim2real pixel-to-action policy learning, automated high-quality DAgger data collection, reproducible benchmarking of real-robot policies in simulation, simulation data collection by virtual teleoperation, and zero-shot sim2real visual reinforcement learning.

SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

Authors: Ritik Shah, Marco F Duarte

First: 2025-10-23T17:59:26+00:00 · Latest: 2025-10-23T17:59:26+00:00

Abstract

Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.

Summary

SpectraMorph is a physics-guided, self-supervised fusion framework for hyperspectral super-resolution that replaces opaque direct regression with a structured latent space. It enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution hyperspectral image (HSI), and a compact multilayer perceptron predicts abundance-like maps from the multispectral image (MSI). Spectra are reconstructed by linear mixing, and training is self-supervised through the MSI sensor's spectral response function. SpectraMorph trains in under a minute, remains robust even with a single-band MSI, and outperforms state-of-the-art unsupervised/self-supervised baselines on synthetic and real-world datasets.
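
The unmixing bottleneck and its self-supervised signal can be written down for a single pixel. The sketch below is a minimal illustration, not SpectraMorph's code: in the paper the abundances come from a multilayer perceptron and the endmembers from the low-resolution HSI, whereas here both are fixed by hand. The matrix shapes, the toy numbers, and the use of mean squared error as the loss are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mix(endmembers, abundances):
    """Linear mixing: spectrum[b] = sum over k of endmembers[b][k] * abundances[k]."""
    return matvec(endmembers, abundances)


def self_supervised_loss(endmembers, abundances, srf, observed_msi):
    """Project the reconstructed spectrum through the spectral response
    function (SRF) and compare it with the observed MSI pixel.

    ASSUMPTION: mean squared error is used as the comparison.
    """
    predicted_msi = matvec(srf, mix(endmembers, abundances))
    squared = [(p - o) ** 2 for p, o in zip(predicted_msi, observed_msi)]
    return sum(squared) / len(squared)


# Toy setup: 4 hyperspectral bands, 2 endmembers, 2 multispectral bands.
endmembers = [[0.1, 0.8],
              [0.2, 0.6],
              [0.7, 0.3],
              [0.9, 0.1]]
srf = [[0.5, 0.5, 0.0, 0.0],   # MSI band 1 averages HSI bands 1-2
       [0.0, 0.0, 0.5, 0.5]]   # MSI band 2 averages HSI bands 3-4

true_abundances = [0.25, 0.75]
observed_msi = matvec(srf, mix(endmembers, true_abundances))

good = self_supervised_loss(endmembers, true_abundances, srf, observed_msi)
bad = self_supervised_loss(endmembers, [0.75, 0.25], srf, observed_msi)
```

No high-resolution hyperspectral ground truth appears in the loss: only the MSI pixel and the sensor's response function are needed, which is what makes the training signal self-supervised. The abundances and endmembers also remain inspectable intermediates, matching the interpretability claim.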

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang

First: 2025-10-23T17:59:21+00:00 · Latest: 2025-10-23T17:59:21+00:00

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict

Summary

The paper addresses visual reasoning over information-intensive images, where large VLMs struggle to localize critical cues in dense layouts and to integrate dispersed evidence through multi-hop reasoning. It introduces Speculative Verdict (SV), a training-free framework inspired by speculative decoding: several small VLMs act as draft experts that generate diverse reasoning paths, a consensus expert selection mechanism forwards only high-agreement paths, and a strong VLM synthesizes them into the final answer. SV achieves consistent gains on InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, and the authors report both error correction and cost-efficiency compared to large proprietary models or training pipelines.
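
The draft-then-verdict flow with consensus filtering might look roughly like the sketch below. It is an illustration under stated assumptions, not the released SV code: the `Draft` type and all function names are hypothetical, agreement is measured here by exact match of normalized final answers (the paper may score agreement differently), and the call to the verdict model is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Draft:
    expert: str     # hypothetical identifier of the small VLM
    answer: str     # final answer produced by that expert
    reasoning: str  # reasoning path leading to the answer


def normalize(answer):
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def select_consensus(drafts, keep=2):
    """Keep the `keep` drafts whose answers agree with the most drafts.

    ASSUMPTION: agreement is exact match after normalization.
    Ties keep their original order because Python's sort is stable.
    """
    counts = Counter(normalize(d.answer) for d in drafts)
    ranked = sorted(drafts, key=lambda d: counts[normalize(d.answer)], reverse=True)
    return ranked[:keep]


def build_verdict_prompt(question, drafts):
    """Assemble the text a verdict model would receive (model call not shown)."""
    lines = [f"Question: {question}", "Candidate reasoning paths:"]
    for index, draft in enumerate(drafts, start=1):
        lines.append(f"{index}. [{draft.expert}] {draft.reasoning} => {draft.answer}")
    lines.append("Give the final answer.")
    return "\n".join(lines)


drafts = [
    Draft("small-a", "17%", "Read the left bar of the chart."),
    Draft("small-b", "42%", "Read the bar labelled 2023."),
    Draft("small-c", " 42% ", "Matched the legend colour, then read the axis."),
]
selected = select_consensus(drafts, keep=2)
prompt = build_verdict_prompt("What share is shown for 2023?", selected)
```

Filtering before the verdict stage keeps the large model's input short, which is where the cost saving would come from, while the verdict model still sees more than one reasoning path and can reconcile partially correct ones.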

On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?

Authors: Mingmeng Geng, Thierry Poibeau

First: 2025-10-23T17:59:06+00:00 · Latest: 2025-10-23T17:59:06+00:00

Abstract

With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
