arXiv Paper Digest

2025-10-26 03:28
Snapshot: 20251026_0328
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
Authors: Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, Yujun Shen, Huamin Qu
First: 2025-10-23T17:59:59+00:00 · Latest: 2025-10-23T17:59:59+00:00
Comments: Project page and code: https://holo-cine.github.io/
Abstract
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
中文标题/摘要
标题:HoloCine:整体生成电影多镜头长视频叙事
最先进的文本到视频模型在生成孤立片段方面表现出色,但在创建连贯的多镜头叙事方面却力有未逮,而多镜头叙事正是叙事的核心。我们通过HoloCine模型填补了这一“叙事缺口”,该模型能够整体生成整个场景,以确保从第一镜头到最后一个镜头的全局一致性。我们的架构通过窗口交叉注意力机制实现精确的导演控制,将文本提示定位到特定镜头,同时稀疏跨镜头自注意力模式(镜头内部密集但镜头之间稀疏)确保了分钟级生成所需的效率。除了在叙事连贯性方面达到新的最先进的水平外,HoloCine还发展出非凡的涌现能力:持久的人物和场景记忆,以及对电影技术的直观理解。我们的工作标志着从片段合成向自动化电影制作的关键转变,使端到端的电影创作成为可实现的未来。我们的代码可在:https://holo-cine.github.io/ 获取。
Summary / 总结
HoloCine addresses the limitation of existing text-to-video models in generating coherent multi-shot narratives by introducing a holistic generation approach. It uses a Window Cross-Attention mechanism to localize text prompts and a Sparse Inter-Shot Self-Attention pattern to maintain efficiency. Key findings include improved narrative coherence, persistent character and scene memory, and an intuitive understanding of cinematic techniques, marking a significant step towards automated filmmaking.
HoloCine通过引入整体生成方法解决了当前文本到视频模型在生成连贯多镜头叙事方面的局限性。它使用Window Cross-Attention机制来定位文本提示,并使用Sparse Inter-Shot Self-Attention模式来保持效率。该模型展示了强大的叙事连贯性和持久记忆以及对电影技术的理解等 emergent 能力,标志着向自动化电影制作迈出的重要一步。
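Sketch (illustrative only): the two attention patterns named in the abstract can be pictured as boolean masks. The abstract does not say which cross-shot connections are kept, so the rule below (every token sees all tokens of its own shot plus the first `summary_tokens` tokens of every other shot) is an assumption, as are the function names and the one-prompt-segment-per-shot layout.

```python
def token_owners(segment_lengths):
    """List (segment id, position inside segment) for every token, in order."""
    owners = []
    for seg_id, length in enumerate(segment_lengths):
        for pos in range(length):
            owners.append((seg_id, pos))
    return owners


def sparse_inter_shot_mask(shot_lengths, summary_tokens=1):
    """mask[q][k] is True when video token q may attend to video token k.

    Assumed pattern: dense inside a shot; across shots, only the first
    `summary_tokens` tokens of the other shot are visible.
    """
    owners = token_owners(shot_lengths)
    return [[q_shot == k_shot or k_pos < summary_tokens
             for (k_shot, k_pos) in owners]
            for (q_shot, _q_pos) in owners]


def window_cross_attention_mask(shot_lengths, prompt_lengths):
    """mask[q][t] is True when video token q may attend to text token t.

    Assumed layout: prompt segment i describes shot i, and a video token
    only sees the text of its own shot.
    """
    video_shot = [shot for shot, _ in token_owners(shot_lengths)]
    text_shot = [shot for shot, _ in token_owners(prompt_lengths)]
    return [[v == t for t in text_shot] for v in video_shot]


shots = [3, 2]  # two shots with 3 and 2 video tokens
self_mask = sparse_inter_shot_mask(shots)
cross_mask = window_cross_attention_mask(shots, prompt_lengths=[2, 3])
allowed = sum(sum(row) for row in self_mask)
print(f"{allowed} of {len(self_mask) ** 2} self-attention pairs kept")
```

With longer shots the share of kept pairs shrinks quickly, which is the efficiency argument the abstract makes for minute-scale generation.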
One-Step Offline Distillation of Diffusion-based Models via Koopman Modeling
Authors: Nimrod Berman, Ilan Naiman, Moshe Eliasof, Hedi Zisling, Omri Azencot
First: 2025-05-19T16:59:47+00:00 · Latest: 2025-10-23T17:59:57+00:00
Abstract
Diffusion-based generative models have demonstrated exceptional performance, yet their iterative sampling procedures remain computationally expensive. A prominent strategy to mitigate this cost is distillation, with offline distillation offering particular advantages in terms of efficiency, modularity, and flexibility. In this work, we identify two key observations that motivate a principled distillation framework: (1) while diffusion models have been viewed through the lens of dynamical systems theory, powerful and underexplored tools can be further leveraged; and (2) diffusion models inherently impose structured, semantically coherent trajectories in latent space. Building on these observations, we introduce the Koopman Distillation Model (KDM), a novel offline distillation approach grounded in Koopman theory - a classical framework for representing nonlinear dynamics linearly in a transformed space. KDM encodes noisy inputs into an embedded space where a learned linear operator propagates them forward, followed by a decoder that reconstructs clean samples. This enables single-step generation while preserving semantic fidelity. We provide theoretical justification for our approach: (1) under mild assumptions, the learned diffusion dynamics admit a finite-dimensional Koopman representation; and (2) proximity in the Koopman latent space correlates with semantic similarity in the generated outputs, allowing for effective trajectory alignment. KDM achieves highly competitive performance across standard offline distillation benchmarks.
中文标题/摘要
标题:基于Koopman建模的一步离线蒸馏扩散模型
基于扩散的生成模型已经展示了出色的表现,但其迭代采样过程仍然计算成本高昂。一种减轻这种成本的显著策略是蒸馏,离线蒸馏特别在效率、模块化和灵活性方面具有优势。在本文中,我们识别了两个关键观察结果,以指导一个有原则的蒸馏框架:(1) 尽管扩散模型被视作动力系统理论的视角,但还有强大的且未充分探索的工具可以进一步利用;(2) 扩散模型本质上会在潜在空间中施加结构化的、语义连贯的轨迹。基于这些观察结果,我们引入了Koopman蒸馏模型(KDM),这是一种基于Koopman理论的新型离线蒸馏方法——Koopman理论是一种经典框架,用于在变换空间中将非线性动力学线性表示。KDM将嘈杂的输入编码到嵌入空间中,在该空间中,一个学习到的线性算子将它们向前传播,随后通过解码器重建干净的样本。这使得单步生成同时保持语义保真度。我们为我们的方法提供了理论上的证明:(1) 在温和的假设下,学习到的扩散动力学具有有限维的Koopman表示;(2) 在Koopman潜在空间中的邻近性与生成输出中的语义相似性相关,允许有效的轨迹对齐。KDM在标准的离线蒸馏基准测试中取得了非常有竞争力的性能。
Summary / 总结
This work addresses the computational inefficiency of diffusion-based generative models by proposing the Koopman Distillation Model (KDM). Motivated by the potential of Koopman theory to linearize nonlinear dynamics and the inherent structured trajectories in latent space of diffusion models, KDM uses an offline distillation approach to encode noisy inputs into an embedded space, where a linear operator propagates them, followed by a decoder to reconstruct clean samples. Theoretical justification supports the approach, showing that the learned diffusion dynamics can be represented in a finite-dimensional Koopman space, and that proximity in this space correlates with semantic similarity in generated outputs. KDM performs competitively on standard offline distillation benchmarks.
本文旨在通过提出Koopman Distillation Model (KDM)解决基于扩散的生成模型的计算效率问题。受Koopman理论可以将非线性动力学线性化以及扩散模型在潜在空间中固有的结构化轨迹的启发,KDM利用离线蒸馏将噪声输入编码到嵌入空间中,在该空间中,一个线性算子进行传播,随后通过解码器重建干净样本。理论证明支持该方法,表明学习到的扩散动力学可以在有限维的Koopman空间中表示,并且该空间中的接近性与生成输出的语义相似性相关。KDM在标准的离线蒸馏基准测试中表现出色。
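Sketch (illustrative only): the encode, propagate linearly, decode structure described above can be shown on a textbook toy system whose nonlinear dynamics become exactly linear after lifting. This is not KDM itself: in KDM the encoder, operator, and decoder are learned, whereas the lifting, the constants `LAM` and `MU`, and all function names here are hand-picked assumptions.

```python
LAM, MU = 0.5, 0.5  # hypothetical constants of the toy system


def nonlinear_step(state):
    """Toy nonlinear dynamics: x' = LAM*x, y' = MU*y + (LAM**2 - MU)*x**2."""
    x, y = state
    return [LAM * x, MU * y + (LAM ** 2 - MU) * x ** 2]


def encode(state):
    """Lift (x, y) to (x, y, x**2); in this space the dynamics are linear."""
    x, y = state
    return [x, y, x * x]


# Linear operator acting on the lifted coordinates.
K = [
    [LAM, 0.0, 0.0],
    [0.0, MU, LAM ** 2 - MU],
    [0.0, 0.0, LAM ** 2],
]


def propagate(operator, z):
    """Matrix-vector product operator @ z."""
    return [sum(row[j] * z[j] for j in range(len(z))) for row in operator]


def decode(z):
    """Read the original coordinates back out of the lifted state."""
    return z[:2]


def matmul(a, b):
    """Plain matrix product, used to fold several steps into one operator."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


start = [2.0, 4.0]

# One application of K reproduces one nonlinear step.
one_step = decode(propagate(K, encode(start)))

# Three nonlinear steps collapse into a single linear map K @ K @ K.
K3 = matmul(K, matmul(K, K))
three_in_one = decode(propagate(K3, encode(start)))
iterated = nonlinear_step(nonlinear_step(nonlinear_step(start)))
print(one_step, three_in_one, iterated)
```

Because powers of a linear operator are again a single linear operator, a whole trajectory can be traversed in one application, which is the intuition behind single-step generation in a Koopman latent space.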
LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas
Authors: Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, Ju Hu, Sergey Tulyakov, Kuan-Chieh Jackson Wang
First: 2025-10-23T17:59:55+00:00 · Latest: 2025-10-23T17:59:55+00:00
Comments: 9 pages, preprint
Abstract
Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
中文标题/摘要
标题:LayerComposer:基于空间感知分层画布的交互式个性化T2I
尽管现有的个性化生成模型在视觉保真度方面表现出色,但它们缺乏对空间组成进行交互式控制的能力,并且在处理多个主体时扩展性较差。为了解决这些限制,我们提出了LayerComposer,这是一种交互式的多主体文本到图像生成框架。我们的方法引入了两个主要贡献:(1) 分层画布,这是一种新颖的表示方法,其中每个主体位于不同的层上,从而实现无遮挡的组合;(2) 锁定机制,该机制保持选定层的高保真度,同时允许其余层灵活适应周围环境。类似于专业的图像编辑软件,所提出的分层画布允许用户通过直观的分层操作放置、调整大小或锁定输入主体。我们的灵活锁定机制不需要进行架构更改,而是依赖于固有的位置嵌入以及一种新的互补数据采样策略。广泛的实验表明,与多主体个性化图像生成的最新方法相比,LayerComposer在空间控制和身份保留方面具有优越性。
Summary / 总结
LayerComposer is an interactive framework for generating personalized multi-subject text-to-image outputs. It introduces a layered canvas where each subject is placed on a distinct layer, enabling occlusion-free composition, and a locking mechanism that preserves selected layers while allowing others to adapt flexibly. Experiments show that LayerComposer outperforms state-of-the-art methods in terms of spatial control and identity preservation.
LayerComposer 是一个交互式框架,用于生成多主题的个性化文本到图像输出。它引入了一种分层画布,每个主题都放在一个独立的层上,允许无遮挡的组合和剩余层的灵活适应。锁定机制保留选定的层,同时根据上下文适应其他层。实验表明,LayerComposer 在多主题图像生成中的空间控制和身份保留方面优于现有方法。
Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Authors: Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot
First: 2025-10-23T17:59:54+00:00 · Latest: 2025-10-23T17:59:54+00:00
Abstract
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
中文标题/摘要
标题:通用模态翻译的对比预测潜在去噪扩散桥模型
生成建模的最新进展将去噪扩散模型定位为从复杂数据分布中采样的最先进的工具。尽管这些模型在图像和音频等单一模态领域取得了显著的成功,但将它们的能力扩展到模态翻译(MT),即在不同感官模态之间翻译信息,仍然是一个开放的挑战。现有方法通常依赖于共享维度、高斯先验和特定模态的架构等限制性假设,这限制了它们的通用性和理论基础。在本文中,我们提出了潜在去噪扩散桥模型(LDDBM),这是一种基于潜在变量扩展的去噪扩散桥模型的通用框架。通过在共享的潜在空间中操作,我们的方法可以在不需要对齐维度的情况下学习任意模态之间的桥梁。我们引入了一种对比对齐损失来强制配对样本之间的语义一致性,并设计了一种通用的编码器-解码器架构,专门用于潜在空间中的噪声预测。此外,我们提出了预测损失来指导训练以实现准确的跨域翻译,并探索了几种训练策略以提高稳定性。我们的方法支持任意模态对,并在包括多视图到3D形状生成、图像超分辨率和多视图场景合成在内的多种MT任务中表现出色。全面的实验和消融实验验证了我们框架的有效性,建立了通用模态翻译的新强基线。欲了解更多信息,请参见我们的项目页面:https://sites.google.com/view/lddbm/home.
Summary / 总结
This work addresses the challenge of modality translation by proposing the Latent Denoising Diffusion Bridge Model (LDDBM), which operates in a shared latent space to translate information across different sensory modalities without requiring aligned dimensions. The method includes a contrastive alignment loss for semantic consistency and a predictive loss for accurate cross-domain translation. Experiments show that LDDBM performs well on various tasks such as multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis, establishing a strong baseline in general modality translation.
本文通过提出Latent Denoising Diffusion Bridge Model (LDDBM)解决了不同感官模态间信息翻译的挑战,该方法在共享的潜在空间中操作以实现跨模态翻译。方法使用对比对齐损失来确保语义一致性,并使用预测损失来引导准确的跨域翻译。实验表明,LDDBM在多视图到3D形状生成、图像超分辨率和多视图场景合成等多种任务上表现良好,建立了通用模态翻译的新基准。
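Sketch (illustrative only): the abstract mentions a contrastive alignment loss between paired samples but does not give its form. A common choice for this kind of objective is InfoNCE, shown below in one direction (source to target); using InfoNCE, the temperature value, and the function names are assumptions rather than the authors' definition.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(source_latents, target_latents, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be a matched pair."""
    src = [normalize(v) for v in source_latents]
    tgt = [normalize(v) for v in target_latents]
    total = 0.0
    for i, s in enumerate(src):
        logits = [dot(s, t) / temperature for t in tgt]
        # Numerically stable log-sum-exp over all candidate targets.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(src)


image_latents = [[1.0, 0.0], [0.0, 1.0]]
shape_latents_aligned = [[0.9, 0.1], [0.1, 0.9]]
shape_latents_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(image_latents, shape_latents_aligned)
loss_swapped = info_nce(image_latents, shape_latents_swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each source latent is closest to its own partner and large when partners are swapped, which is the behavior a semantic-consistency term needs.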
VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation
Authors: Mateo Guaman Castro, Sidharth Rajagopal, Daniel Gorbatov, Matt Schmittle, Rohan Baijal, Octi Zhang, Rosario Scalise, Sidharth Talia, Emma Romig, Celso de Melo, Byron Boots, Abhishek Gupta
First: 2025-10-23T17:59:45+00:00 · Latest: 2025-10-23T17:59:45+00:00
Abstract
A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enable this separation through a carefully designed interface that lets a high-level planner propose candidate paths directly in image space, which the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodiment navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3x higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/
中文标题/摘要
标题:VAMOS:一种分层的视觉-语言-行动模型,用于能力调节和可调节导航
机器人导航中的一个基本挑战在于学习能够在多种环境中泛化的策略,同时符合特定载体的独特物理约束和能力(例如,四足机器人可以爬楼梯,但轮式机器人不能)。我们提出了VAMOS,一种分层的视觉-语言-行动(VLA)模型,将语义规划与载体接地分离:一个通用规划器从多样化的开放世界数据中学习,而一个专门的执行能力模型在安全、低成本的模拟中学习机器人的物理约束和能力。我们通过精心设计一个接口实现了这种分离,该接口允许高层规划器直接在图像空间中提出候选路径,然后由执行能力模型评估和重新排名。我们的实地实验表明,VAMOS在室内和复杂室外导航中的成功率均高于最先进的基于模型和端到端学习方法。我们还展示了这种分层设计能够跨不同载体(腿式和轮式)实现导航,并且可以通过自然语言轻松调节。实地消融实验证实,专门模型对于载体接地至关重要,使得一个高层规划器能够在物理上不同的轮式和腿式机器人上部署。最后,该模型显著提高了单个机器人的可靠性,通过拒绝物理上不可行的计划,成功率提高了3倍。
Summary / 总结
VAMOS is a hierarchical vision-language-action model designed to address the challenge of robot navigation by decoupling semantic planning from embodiment grounding. It uses a generalist planner trained on diverse data and a specialist affordance model that learns the robot's physical constraints in simulation. Experiments show VAMOS outperforms state-of-the-art methods in both indoor and outdoor navigation, and its hierarchical design allows for cross-embodied navigation and easy steering with natural language. The specialist model is crucial for embodiment grounding, enabling a single high-level planner to work across different types of robots and improving single-robot reliability by rejecting infeasible plans.
VAMOS是一种分层的视觉-语言-行动模型,旨在在多样化的环境中导航并遵守机器人的物理能力。它将语义规划与实体接地分离,通用规划器从多样化的数据中学习,而专门模型学习机器人的物理限制。实验表明,VAMOS在室内和复杂户外导航中优于最先进的方法,并且其分层设计允许跨不同类型的机器人进行导航和自然语言控制。专门模型对于实体接地至关重要,使单个规划器能够在不同类型的机器人上运行,并通过拒绝不可行的计划显著提高导航可靠性。
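Sketch (illustrative only): the planner/affordance interface described above amounts to "propose candidate paths, then filter and re-rank them per embodiment". The path format, the toy affordance functions, the rejection threshold, and the product used to combine scores are all assumptions made for this example.

```python
def rerank_paths(candidates, affordance_fn, min_affordance=0.5):
    """Drop infeasible paths, then sort the rest by planner score * affordance.

    candidates: list of (path, planner_score) pairs, where a path is a list of
    (u, v, terrain) image-space waypoints (a made-up format for this sketch).
    """
    scored = []
    for path, planner_score in candidates:
        affordance = affordance_fn(path)
        if affordance < min_affordance:
            continue  # reject plans this robot cannot physically execute
        scored.append((planner_score * affordance, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [path for _, path in scored]


def rover_affordance(path):
    """Toy wheeled robot: any stair waypoint makes the path infeasible."""
    return 0.0 if any(terrain == "stairs" for _, _, terrain in path) else 1.0


def quadruped_affordance(path):
    """Toy legged robot: stairs are traversable at a small cost."""
    return 0.9 if any(terrain == "stairs" for _, _, terrain in path) else 1.0


stairs_path = [(320, 400, "floor"), (320, 300, "stairs"), (320, 200, "floor")]
ramp_path = [(320, 400, "floor"), (450, 300, "ramp"), (320, 200, "floor")]

# The same planner proposals are shared by both robots.
proposals = [(stairs_path, 0.9), (ramp_path, 0.6)]

rover_plan = rerank_paths(proposals, rover_affordance)
quadruped_plan = rerank_paths(proposals, quadruped_affordance)
print(rover_plan[0] is ramp_path, quadruped_plan[0] is stairs_path)
```

Only the affordance function changes between robots, which mirrors the claim that a single high-level planner can be deployed across physically distinct embodiments.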
KL-Regularized Reinforcement Learning is Designed to Mode Collapse
Authors: Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath
First: 2025-10-23T17:59:40+00:00 · Latest: 2025-10-23T17:59:40+00:00
Abstract
It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
中文标题/摘要
标题:KL-正则化强化学习旨在模式塌陷
通常认为,优化逆KL散度会导致“模式寻求”,而优化正向KL则会导致“质量覆盖”,后者如果目标是从多个多样模式中采样则更受欢迎。我们通过数学和实验证明,这种直觉并不一定适用于使用逆/正向KL正则化(例如,如语言模型中常用)进行强化学习。相反,逆/正向KL的选择决定了最优目标分布的家族,参数化为正则化系数。模式覆盖主要取决于其他因素,如正则化强度,以及奖励和参考概率之间的相对比例。此外,我们证明了常用的设置,如低正则化强度和相等的验证奖励,往往会指定单模目标分布,这意味着优化目标,从本质上讲,是非多样性的。我们利用这些见解构建了一个简单、可扩展且具有理论依据的算法。它对奖励幅度的修改最小,但优化了一个将高概率集中在所有高质量采样模式上的目标分布。在实验中,这种简单的修改能够提高大型语言模型和化学语言模型的解决方案质量和多样性,无需任何多样性的外部信号,并且在使用逆/正向KL时,无论是哪种情况都有效。
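Sketch (illustrative only): for the reverse-KL-regularized objective, maximizing expected reward minus beta times KL(pi || pi_ref), the optimal target is the standard closed form pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The toy numbers below are made up; they illustrate why equal rewards leave the target as peaked as the reference policy, and this is not the algorithm proposed in the paper.

```python
import math


def reverse_kl_target(ref_probs, rewards, beta):
    """Optimal distribution for max E[r] - beta * KL(pi || pi_ref).

    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta).
    """
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate answers: two are correct (reward 1), one is wrong (reward 0).
# The reference model strongly prefers the first correct answer.
ref_probs = [0.90, 0.09, 0.01]
rewards = [1.0, 1.0, 0.0]

target = reverse_kl_target(ref_probs, rewards, beta=0.1)
ratio_between_correct = target[0] / target[1]
print(target, ratio_between_correct)
```

Both correct answers receive the same reward boost, so their ratio in the target stays at the reference ratio of 10 to 1 for any beta: the objective itself asks for a distribution dominated by one mode.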
GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation
Authors: Guangqi Jiang, Haoran Chang, Ri-Zhao Qiu, Yutong Liang, Mazeyu Ji, Jiyue Zhu, Zhao Dong, Xueyan Zou, Xiaolong Wang
First: 2025-10-23T17:59:26+00:00 · Latest: 2025-10-23T17:59:26+00:00
Abstract
This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates "closing the loop" of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: https://3dgsworld.github.io/.
中文标题/摘要
标题:GSWorld:闭环照片级真实感模拟套件用于机器人操作
本文介绍了GSWorld,这是一种结合了3D高斯点积和物理引擎的稳健的照片级真实感机器人操作模拟器。我们的框架提倡“闭环”开发操作策略,通过可重复的评估从真实机器人数据中学习的操作策略,并在无需使用真实机器人的情况下进行模拟到现实的策略训练。为了实现多样场景的照片级真实感渲染,我们提出了一种新的资产格式,我们称之为GSDF(高斯场景描述文件),它将高斯-网格表示与机器人URDF和其他对象结合在一起。通过简化重建流水线,我们整理了一个包含3个单臂和双臂操作的机器人实体以及超过40个物体的GSDF数据库。结合GSDF和物理引擎,我们展示了几个有趣的即时应用:(1)使用照片级真实感渲染学习零样本模拟到现实的像素到动作操作策略,(2)自动高质量的DAgger数据收集以适应部署环境,(3)在模拟中重复评估真实机器人操作策略,(4)通过虚拟遥控收集模拟数据,以及(5)零样本模拟到现实的视觉强化学习。网站:https://3dgsworld.github.io/
Summary / 总结
GSWorld is a robust photo-realistic simulator for robotic manipulation that integrates 3D Gaussian Splatting with physics engines. It enables the development of manipulation policies by closing the loop with reproducible evaluation and sim2real policy training without using real robots. Key applications include zero-shot sim2real pixel-to-action manipulation, automated high-quality data collection, reproducible benchmarking, virtual teleoperation data collection, and visual reinforcement learning.
GSWorld 是一个结合了 3D 高斯散点图与物理引擎的稳健照片级真实模拟器,用于机器人操作。它通过使用真实机器人数据进行评估和无需真实机器人进行 sim2real 策略训练来实现闭环开发。主要发现包括学习零样本 sim2real 像素到动作操作策略、高质量的 DAgger 数据自动收集、真实机器人操作策略的可重复基准测试、通过虚拟遥操作收集模拟数据以及零样本 sim2real 视觉强化学习。
SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution
Authors: Ritik Shah, Marco F Duarte
First: 2025-10-23T17:59:26+00:00 · Latest: 2025-10-23T17:59:26+00:00
Abstract
Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
中文标题/摘要
标题:SpectraMorph:结构化潜在学习的自监督超光谱超分辨率
超光谱传感器每像素捕获密集光谱,但存在低空间分辨率的问题,导致边界模糊和混合像素效应。配准的伴生传感器,如多光谱、RGB或全色相机,提供高分辨率的空间细节,推动了通过融合超光谱和多光谱图像(HSI-MSI)实现超光谱超分辨率。现有的基于深度学习的方法表现强大,但依赖于不透明的回归器,缺乏可解释性,且当MSI带宽很少时往往失效。我们提出了一种基于物理的自监督融合框架SpectraMorph,具有结构化的潜在空间。SpectraMorph 不直接进行回归,而是施加一个解混瓶颈:从低分辨率HSI中提取端元签名,并通过MSI预测类似丰度的图。光谱通过线性混合重建,通过MSI传感器的光谱响应函数在自监督模式下进行训练。SpectraMorph 生成可解释的中间结果,训练时间不到一分钟,并且即使在单带(全色)MSI的情况下也能保持鲁棒性。在合成和真实数据集上的实验表明,SpectraMorph 在自监督/半监督基准上始终优于最先进的方法,同时在监督基准上保持非常有竞争力。
Summary / 总结
SpectraMorph is a physics-guided self-supervised framework for hyperspectral super-resolution that uses a structured latent space to enforce an unmixing bottleneck. It extracts endmember signatures from low-resolution hyperspectral images and predicts abundance-like maps from high-resolution multispectral images using a compact multilayer perceptron. Spectra are reconstructed by linear mixing, with training performed via the multispectral sensor's spectral response function. Experiments show that SpectraMorph outperforms state-of-the-art unsupervised/self-supervised methods and remains robust even with a single-band multispectral image.
SpectraMorph 是一种自监督融合框架,用于高光谱超分辨率,使用结构化的潜在空间来强制执行解混瓶颈。它从低分辨率高光谱图像中提取端元签名,并从高分辨率多光谱图像中预测类似丰度的图。实验表明,SpectraMorph 在性能上优于现有的无监督和自监督方法,并且即使使用单波段 MSI 传感器也能保持鲁棒性。
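Sketch (illustrative only): the unmixing bottleneck described above reconstructs each high-resolution spectrum as a linear mix of endmember signatures, and the self-supervised signal comes from projecting that spectrum through the MSI sensor's spectral response function (SRF) and comparing with the observed MSI pixel. The tiny endmembers, SRF, and function names are made up; in the paper the abundance-like maps come from an MLP applied to the MSI.

```python
def mix(abundances, endmembers):
    """Linear mixing: spectrum[b] = sum_k abundances[k] * endmembers[k][b]."""
    bands = len(endmembers[0])
    return [sum(a * e[b] for a, e in zip(abundances, endmembers))
            for b in range(bands)]


def apply_srf(spectrum, srf):
    """Project a hyperspectral spectrum onto MSI bands: msi[c] = sum_b srf[c][b] * spectrum[b]."""
    return [sum(w * s for w, s in zip(row, spectrum)) for row in srf]


def self_supervised_loss(abundances, endmembers, srf, observed_msi):
    """Mean squared error between the re-projected spectrum and the observed MSI pixel."""
    predicted = apply_srf(mix(abundances, endmembers), srf)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed_msi)) / len(observed_msi)


# Two endmembers over four hyperspectral bands (hypothetical values).
endmembers = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
# A single-band (panchromatic) sensor that averages all four bands.
pan_srf = [[0.25, 0.25, 0.25, 0.25]]

abundances = [0.25, 0.75]  # abundance-like weights for one pixel
spectrum = mix(abundances, endmembers)
pan_value = apply_srf(spectrum, pan_srf)
loss = self_supervised_loss(abundances, endmembers, pan_srf, observed_msi=[0.5])
print(spectrum, pan_value, loss)
```

No high-resolution hyperspectral ground truth appears in the loss, which is what makes the training signal self-supervised.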
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang
First: 2025-10-23T17:59:21+00:00 · Latest: 2025-10-23T17:59:21+00:00
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
中文标题/摘要
标题:小草图,大裁决:基于推测的密集信息视觉推理
大型多模态视觉语言模型(VLMs)在多模态理解方面取得了显著进展,但在处理密集交织了文本注释和细粒度图形元素的信息密集型图像时,它们面临挑战。主要挑战在于在密集布局中精确定位关键线索以及进行多跳推理以整合分散的证据。我们提出了一种名为推测裁决(SV)的无训练框架,该框架受到推测解码的启发,结合了多个轻量级草图专家和一个大型裁决模型。在草图阶段,小型VLM作为草图专家生成提供多样化定位候选的推理路径;在裁决阶段,强大的VLM综合这些路径以生成最终答案,同时降低计算成本并恢复正确答案。为了进一步提高效率和准确性,SV引入了一种共识专家选择机制,仅将高一致性的推理路径转发到裁决阶段。实验证明,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等具有挑战性的信息密集型和高分辨率视觉问答基准测试中取得了持续的改进。通过综合多个部分准确推理路径中的正确见解,SV在错误纠正和成本效率方面优于大型专有模型或训练管道。代码可在https://github.com/Tinaliu0123/speculative-verdict 获取
Summary / 总结
The paper addresses the challenge of reasoning over information-intensive images with dense textual and graphical elements, which is difficult for large VLMs. It introduces Speculative Verdict (SV), a training-free framework that uses multiple lightweight draft experts to generate diverse reasoning paths, which are then synthesized by a strong VLM in the verdict stage. SV includes a consensus expert selection mechanism to forward only high-agreement paths, reducing computational cost while maintaining accuracy. Experiments show consistent improvements on benchmarks like InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, demonstrating both error correction and cost-efficiency compared to large proprietary models or training pipelines.
论文针对大型视觉语言模型在处理密集文本和图形元素交织的信息密集型图像时的推理难题。提出了一种名为Speculative Verdict (SV) 的无训练框架,该框架使用多个轻量级的草案专家生成多样化的推理路径,然后由强大的VLM在决断阶段综合这些路径,以减少计算成本同时保持准确性。SV还引入了一种共识专家选择机制,仅转发高一致性的推理路径。实验结果显示,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等基准测试中表现出一致的改进,展示了与大型专有模型或训练管道相比的错误修正和成本效率优势。
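Sketch (illustrative only): the consensus expert selection step forwards only high-agreement reasoning paths to the verdict model. The abstract does not define the agreement score, so the exact-match vote count below, the draft tuple layout, and the `keep` parameter are assumptions.

```python
def select_by_consensus(drafts, keep=2):
    """Keep the `keep` drafts whose answers agree with the most other drafts.

    drafts: list of (expert_name, answer, reasoning_path) tuples.
    Agreement is counted by exact match after trimming and lowercasing.
    """
    def canonical(answer):
        return answer.strip().lower()

    scores = []
    for i, (_, answer, _) in enumerate(drafts):
        votes = sum(1 for j, (_, other, _) in enumerate(drafts)
                    if j != i and canonical(other) == canonical(answer))
        scores.append(votes)

    # sorted() is stable, so ties keep the original draft order.
    order = sorted(range(len(drafts)), key=lambda i: scores[i], reverse=True)
    return [drafts[i] for i in order[:keep]]


drafts = [
    ("draft_a", "42%", "Read the bar labelled 2021 in the left panel."),
    ("draft_b", " 42% ", "Located the legend entry, then the matching bar."),
    ("draft_c", "17%", "Read the 2019 bar instead of 2021."),
]

forwarded = select_by_consensus(drafts, keep=2)
verdict_context = "\n".join(f"[{name}] {path}" for name, _, path in forwarded)
print(verdict_context)
```

The selected reasoning paths would then be placed in the verdict model's prompt; forwarding fewer paths is where the cost saving comes from.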
On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?
Authors: Mingmeng Geng, Thierry Poibeau
First: 2025-10-23T17:59:06+00:00 · Latest: 2025-10-23T17:59:06+00:00
Abstract
With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely "LLM-generated text". Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.
中文标题/摘要
标题:大语言模型生成文本的可检测性:什么是大语言模型生成的文本?
随着大语言模型(LLMs)的广泛应用,许多研究人员开始关注如何检测它们生成的文本。然而,对于“LLM生成的文本”这一目标并没有一致或精确的定义。不同的应用场景和LLM的多样性进一步增加了检测的难度。通常认为的检测目标通常只代表LLM可能生成文本的一部分。人类对LLM输出的编辑以及LLM对用户微妙的影响正在模糊LLM生成文本与人类撰写的文本之间的界限。现有的基准测试和评估方法未能充分解决实际应用中各种条件的问题。因此,检测器的数值结果往往被误解,其重要性也在减弱。因此,检测器在特定条件下仍然有用,但其结果应仅作为参考而非决定性指标。
Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers
Authors: Dean L Slack, G Thomas Hudson, Thomas Winterbottom, Noura Al Moubayed
Venue: IEEE Transactions on Neural Networks and Learning Systems, 36, 19106-19118, 2025
First: 2025-10-23T17:58:45+00:00 · Latest: 2025-10-23T17:58:45+00:00
Comments: 14 pages, 14 figures
Abstract
Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time, a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction that uses continuous pixel-space representations. Without the need for complex training strategies or latent feature-learning components, our approach extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments with probing models to identify network regions that encode information useful for accurately estimating PDE simulation parameters, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter-efficient, and interpretable approach.
中文标题/摘要
标题:基于像素空间时空变换器的动态物理模拟视频预测
受自回归大型语言模型(LLMs)性能和可扩展性的启发,基于变换器的模型在视觉领域取得了近期的成功。本研究探讨了一种基于变换器的视频预测方法,采用简单端到端的方法,比较了各种时空自注意力布局。专注于时间上的物理模拟因果建模;现有视频生成方法的一个常见缺点,我们试图通过物理对象跟踪指标和无监督训练在物理模拟数据集上隔离时空推理。我们引入了一种简单而有效的纯变换器模型,用于自回归视频预测,利用连续像素空间表示进行视频预测。无需复杂的训练策略或潜在特征学习组件,我们的方法在与现有潜在空间方法相比时,物理准确预测的时间范围显著延长了50%,同时在常见的视频质量指标上保持了相当的性能。此外,我们进行了可解释性实验,以识别网络区域,这些区域编码了通过探针模型进行准确的PDE模拟参数估计有用的信息,并发现这可以推广到对分布外模拟参数的估计。这项工作为通过简单、参数高效和可解释的方法进一步进行基于注意力的时空视频建模提供了平台。
Summary / 总结
This study explores transformer-based models for video prediction, focusing on causal modeling of physical simulations. Using a simple end-to-end approach and continuous pixel-space representations, the model extends the time horizon for physically accurate predictions by up to 50% compared with existing latent-space approaches, while maintaining comparable performance on video quality metrics. Probing experiments identify network regions that encode PDE simulation parameters, and this extends to out-of-distribution parameters.
该研究探索了基于变压器的模型在视频预测中的应用,专注于物理模拟的因果建模。通过采用简单的端到端方法和连续的像素空间表示,该模型在物理准确预测的时间跨度上比现有的潜在空间方法长50%以上,同时在视频质量指标上保持了相当的性能。此外,可解释性实验还表明,该模型能够有效估计出分布外的模拟参数。
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Authors: Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou
Venue: NeurIPS 2025
First: 2025-10-23T17:58:26+00:00 · Latest: 2025-10-23T17:58:26+00:00
Comments: Accepted to NeurIPS 2025, 18 pages
Abstract
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary-point representations or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLMs based on image generation, which naturally produces dense masks for target objects. We leverage the MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate the required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
中文标题/摘要
标题:ARGenSeg:基于自回归图像生成模型的图像分割
我们提出了一种新颖的基于自回归生成的图像分割(ARGenSeg)范式,在统一框架内实现多模态理解和像素级感知。以往将图像分割集成到多模态大型语言模型(MLLM)中的工作通常采用边界点表示或专用分割头。这些方法依赖于离散表示或输入任务特定解码器的语义提示,这限制了MLLM捕捉细粒度视觉细节的能力。为了解决这些挑战,我们基于图像生成引入了一种针对MLLM的分割框架,自然地生成目标对象的密集掩码。我们利用MLLM输出视觉标记,并使用通用VQ-VAE将它们解码为图像,使分割完全依赖于MLLM的像素级理解。为了减少推理延迟,我们采用下一尺度预测策略并行生成所需的视觉标记。大量实验表明,我们的方法在多个分割数据集上超越了先前的最先进方法,同时显著提高了推理速度,而保持了强大的理解能力。
Summary / 总结
The research proposes ARGenSeg, an innovative AutoRegressive Generation-based image segmentation method that integrates multimodal understanding and pixel-level perception. Unlike previous approaches that use boundary points or dedicated segmentation heads, ARGenSeg uses a unified framework to produce dense masks directly from the model's pixel-level understanding. The method employs a next-scale-prediction strategy to enhance inference speed and outperforms existing state-of-the-art techniques on multiple segmentation datasets while maintaining strong understanding capabilities.
该研究旨在通过将自回归生成模型集成到多模态大型语言模型(MLLMs)中来改进图像分割。提出的ARGenSeg框架通过使用密集掩码和像素级理解解决了先前方法的局限性,从而增强了细粒度的视觉细节。实验表明,ARGenSeg在多个数据集上优于现有方法,具有更快的推理速度同时保持强大的理解能力。
DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
First: 2025-10-02T17:39:13+00:00 · Latest: 2025-10-23T17:58:02+00:00
Comments: Preprint
Abstract
Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
中文标题/摘要
标题:DragFlow:利用区域监督释放DiT先验以进行拖动编辑
基于拖动的图像编辑长期以来一直受到目标区域失真的困扰,主要是因为早期基础模型Stable Diffusion的先验不足,无法将优化的潜在变量投影回自然图像流形。随着从基于UNet的DDPMs转向更可扩展的DiT(例如SD3.5,FLUX),生成先验变得显著更强,使各种编辑任务取得了进展。然而,基于拖动的编辑尚未从这些更强的先验中受益。本文提出了第一个框架,有效利用FLUX丰富的先验进行基于拖动的编辑,称为DragFlow,实现了相对于基线的显著改进。我们首先表明,直接将基于点的拖动编辑应用于DiT效果不佳:与高度压缩的UNet特征不同,DiT特征不足以提供可靠的点动监督指导。为克服这一限制,DragFlow引入了一种基于区域的编辑范式,其中仿射变换使特征监督更加丰富且一致。此外,我们整合了预训练的开放域个性化适配器(例如IP-Adapter)以增强主体一致性,同时通过梯度掩码的硬约束保持背景保真度。进一步使用多模态大型语言模型(MLLMs)解决任务歧义。为了评估,我们构建了一个新的基于区域的拖动基准(ReD Bench),包含区域级拖动指令。在DragBench-DR和ReD Bench上的大量实验表明,DragFlow超越了基于点和基于区域的基线,为基于拖动的图像编辑设定了新的SOTA。代码和数据集将在发表后公开。
Summary / 总结
DragFlow is a framework that leverages DiT priors for drag-based image editing, addressing the issue of distortions in target regions. It introduces a region-based editing paradigm using affine transformations to provide more reliable feature supervision. DragFlow also integrates pretrained personalization adapters and gradient mask constraints to enhance subject consistency and preserve background fidelity. Experiments on DragBench-DR and ReD Bench demonstrate that DragFlow outperforms existing point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing.
DragFlow 是一个框架,利用 DiT 模型的强大生成先验进行拖拽式图像编辑。它引入了基于区域的编辑范式,使用仿射变换提供更可靠的特征监督,并集成预训练适配器以增强主体一致性同时保留背景保真度。大量实验表明,DragFlow 在拖拽式图像编辑中超越了基于点和基于区域的基线,达到了新的技术水平。
Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature
Authors: Lei Cheng, Siyang Cao
First: 2025-10-23T17:54:57+00:00 · Latest: 2025-10-23T17:54:57+00:00
Comments: accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
Abstract
This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role--despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system--our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at https://github.com/radar-lab/Radar_Camera_MOT
中文标题/摘要
标题:雷达-摄像头融合多目标跟踪:在线校准与公共特征
本文提出了一种融合雷达和摄像头数据的多目标跟踪(MOT)框架,以提高跟踪效率并减少人工干预。与许多研究中雷达被低估并仅作为辅助角色不同,我们的方法将雷达置于关键位置。同时,本文利用雷达和摄像头数据之间的公共特征,实现在线校准,自主关联雷达和摄像头的检测结果。本文的主要贡献包括:(1)开发了一种利用在线雷达-摄像头校准的雷达-摄像头融合MOT框架,简化了来自这两种传感器的检测结果的集成;(2)利用雷达和摄像头数据之间的公共特征,准确推导出检测目标的现实位置;(3)采用特征匹配和类别一致性检查,超越了仅位置匹配的局限性,提高了传感器关联的准确性。据我们所知,我们是第一个研究雷达-摄像头公共特征的集成及其在线校准在实现MOT中的应用。通过在受控环境和实际交通场景中的实际实验,证明了我们框架能够简化雷达-摄像头映射过程并提高跟踪精度的有效性。代码可在https://github.com/radar-lab/Radar_Camera_MOT 获取。
Summary / 总结
This paper introduces a radar-camera fusion Multi-Object Tracking (MOT) framework that leverages online calibration to enhance tracking efficiency and accuracy. The method utilizes common features between radar and camera data to autonomously associate detections, improving the integration of these two sensors. Key experimental findings show that the framework simplifies the radar-camera mapping process and improves tracking precision in both controlled and real-world traffic scenarios.
本文提出了一种雷达-相机融合的多目标跟踪(MOT)框架,通过在线校准提升跟踪效率和精度。该方法利用雷达和相机数据之间的共同特征自动关联检测结果,超越了仅依靠位置匹配的传感器关联准确性。在受控环境和实际交通场景中的实验结果表明,该框架能够简化雷达-相机映射过程并提高跟踪精度,是首次将雷达-相机共同特征用于在线校准以实现MOT的研究。
BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
Authors: Liang Ye, Shengqin Chen, Jiazhu Dai
First: 2025-10-23T17:54:17+00:00 · Latest: 2025-10-23T17:54:17+00:00
Abstract
The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided, graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: a poisoning rate below 10% can achieve a 50% attack success rate, while 24% suffices for a success rate above 80%, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models' applications such as drug discovery, and underscore the need for robust defenses against the backdoor attack in such diffusion models.
中文标题/摘要
标题:BadGraph:针对文本引导图生成的潜在扩散模型后门攻击
图生成的快速发展引发了新的安全问题,特别是后门漏洞问题。尽管先前的工作已经探索了图像扩散和无条件图生成中的后门攻击,但条件生成,尤其是文本引导图生成仍然很大程度上未被研究。本文提出BadGraph,一种针对文本引导图生成的潜在扩散模型的后门攻击方法。BadGraph利用文本触发器污染训练数据,秘密植入后门,在触发器出现时诱导攻击者指定的子图,同时在干净输入上保持正常性能。在四个基准数据集(PubChem、ChEBI-20、PCDes、MoMu)上的广泛实验表明,该攻击的有效性和隐蔽性:不到10%的污染率可以实现50%的攻击成功率,而24%则足以超过80%的成功率,对良性样本的性能退化可以忽略不计。消融研究进一步表明,后门是在VAE和扩散训练过程中植入的,而不是在预训练过程中。这些发现揭示了文本引导图生成潜在扩散模型的安全漏洞,突显了模型应用中的严重风险,如药物发现,并强调了此类扩散模型中对抗后门攻击的鲁棒防御的必要性。
Summary / 总结
BadGraph is a backdoor attack method targeting latent diffusion models for text-guided graph generation. It uses textual triggers to poison training data, covertly implanting backdoors that trigger specific subgraphs during inference. Experiments on four benchmark datasets show that less than 10% poisoning rate can achieve 50% attack success rate, while 24% is sufficient for over 80% success rate with minimal impact on normal performance. This highlights the security vulnerabilities in text-guided graph generation models and the need for robust defenses against backdoor attacks.
BadGraph 是一种针对文本引导图生成的潜在扩散模型的后门攻击方法。它利用文本触发器污染训练数据,秘密植入后门,在推理过程中触发特定子图。在四个基准数据集上的实验显示,不到10%的污染率可以实现50%的攻击成功率,而24%则足以实现超过80%的成功率,且对正常样本的性能影响很小。消融研究进一步表明,后门是在VAE和扩散训练阶段植入的。这项工作揭示了文本引导图生成模型的安全漏洞,并强调了为此类扩散模型构建鲁棒后门防御的必要性。
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Authors: Mutian He, Philip N. Garner
First: 2025-10-23T17:53:03+00:00 · Latest: 2025-10-23T17:53:03+00:00
Comments: 19 pages, 5 figures
Abstract
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
中文标题/摘要
标题:通过混合稀疏注意和上下文可学习令牌驱逐缓解线性注意的健忘
线性注意模型通过将整个输入序列压缩为固定大小的递归状态,为Transformer提供了一种高效的替代方案,但其有限的内存会导致健忘,损害检索密集型任务。为缓解这一问题,我们探索了一系列混合模型,以恢复对过去令牌的直接访问。我们交替使用中间时间空间复杂度的令牌混合器,包括稀疏注意和令牌驱逐,以及查询感知的原生稀疏注意。特别地,我们提出了一种新颖的可学习令牌驱逐方法。结合滑动窗口注意,端到端可训练的轻量级CNN从过去和未来的相邻令牌中聚合信息,以自适应地保留每个头的有限关键KV对,保持线性注意的恒定时间和空间复杂性。提供了稀疏注意机制的高效Triton内核。在检索密集型基准上的实证评估支持了我们方法的有效性。
Summary / 总结
This paper addresses the issue of forgetfulness in linear-attention models, which is detrimental to retrieval-intensive tasks. The authors propose a series of hybrid models that include sparse attention with token eviction and query-aware native sparse attention. They introduce a novel learnable token eviction approach that, combined with sliding-window attention, allows for adaptive retention of critical key-value pairs while maintaining the efficiency of linear attention. Experimental results on retrieval benchmarks demonstrate the effectiveness of these approaches in mitigating forgetfulness.
本文研究线性注意力模型中的遗忘问题,该问题会损害检索密集型任务的表现。作者提出了一系列混合模型,包括带有令牌驱逐的稀疏注意力和查询感知的原生稀疏注意力。他们引入了一种新的可学习令牌驱逐方法,结合滑动窗口注意力,可以在保持线性注意力效率的同时,自适应地保留关键的键值对(KV对)。在检索基准上的实验结果表明,这些方法在减轻遗忘方面是有效的。
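The retention idea in this entry (a sliding window of recent tokens plus a small budget of older key-value pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `visible_keys`, the `scores` input, and the top-k selection rule are assumptions made for illustration. In the paper the retention decision comes from a trained lightweight CNN, which is replaced here by precomputed per-token scores.

```python
from typing import List, Sequence


def visible_keys(query_pos: int, scores: Sequence[float],
                 window: int, budget: int) -> List[int]:
    """Return the key positions a query at `query_pos` may attend to.

    Hypothetical rule: the last `window` positions (causal sliding window)
    plus at most `budget` older positions with the highest retention scores.
    """
    if window < 1 or budget < 0:
        raise ValueError("window must be >= 1 and budget must be >= 0")
    # Causal sliding window: positions [start, query_pos].
    start = max(0, query_pos - window + 1)
    recent = list(range(start, query_pos + 1))
    # Older positions compete for a fixed retention budget.
    older = range(0, start)
    ranked = sorted(older, key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])
    return kept + recent


# Example: 8 tokens, window of 3, budget of 2 retained older tokens.
example_scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.6]
example_visible = visible_keys(7, example_scores, window=3, budget=2)
```

Because at most `window + budget` keys are visible per query, the per-token cost stays bounded regardless of sequence length, which is the property the abstract describes as constant time and space complexity.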
Out-of-distribution Tests Reveal Compositionality in Chess Transformers
Authors: Anna Mészáros, Patrik Reizinger, Ferenc Huszár
First: 2025-10-23T17:51:28+00:00 · Latest: 2025-10-23T17:51:28+00:00
Abstract
Chess is a canonical example of a task that requires rigorous reasoning and long-term planning. Modern decision Transformers - trained similarly to LLMs - are able to learn competent gameplay, but it is unclear to what extent they truly capture the rules of chess. To investigate this, we train a 270M parameter chess Transformer and test it on out-of-distribution scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation: they adhere to fundamental syntactic rules of the game by consistently choosing valid moves even in situations very different from the training data. Moreover, they also generate high-quality moves for OOD puzzles. In a more challenging test, we evaluate the models on variants including Chess960 (Fischer Random Chess) - a variant of chess where starting positions of pieces are randomized. We found that while the model exhibits basic strategy adaptation, it is inferior to symbolic AI algorithms that perform explicit search, but the gap is smaller when playing against users on Lichess. Moreover, the training dynamics revealed that the model initially learns to move only its own pieces, suggesting an emergent compositional understanding of the game.
中文标题/摘要
标题:超出分布测试揭示国际象棋变换器的组合性
国际象棋是需要严谨推理和长期规划的经典任务示例。现代决策变换器(其训练方式类似于大语言模型)能够学习熟练的游戏玩法,但不清楚它们在多大程度上真正捕捉到了国际象棋的规则。为了调查这一点,我们训练了一个包含2.7亿参数的国际象棋变换器,并在超出分布的场景中对其进行测试,设计这些场景以揭示系统性泛化的失败。我们的分析表明,变换器表现出组合性泛化,这体现在强大的规则外推中:它们通过在与训练数据非常不同的情况下始终选择有效移动来遵循游戏的基本句法规则。此外,它们还为超出分布的谜题生成高质量的走法。在更具挑战性的测试中,我们评估了模型在包括Chess960(费舍尔随机国际象棋)等变体中的表现,该变体中棋子的起始位置是随机化的。我们发现,虽然模型表现出基本的策略适应能力,但它不如执行显式搜索的符号AI算法;不过在Lichess上与用户对战时,这一差距较小。此外,训练动态表明,模型最初学会只移动自己的棋子,这表明模型对游戏产生了自发的组合理解。
Summary / 总结
The study investigates whether decision Transformers, trained similarly to language models, can capture the rules of chess through out-of-distribution (OOD) tests. A 270M parameter chess Transformer was trained and tested on scenarios designed to reveal failures of systematic generalization. The model demonstrated compositional generalization by adhering to fundamental syntactic rules and generating high-quality moves in OOD situations. In Chess960, the model showed basic strategy adaptation but was outperformed by symbolic AI algorithms that perform explicit search, though the gap was smaller when playing against human users on Lichess. Training dynamics suggested the model initially learned to move only its own pieces, indicating an emergent compositional understanding of the game.
研究通过出分布测试考察了类似语言模型训练的决策Transformer是否真正掌握了国际象棋的规则。一个2.7亿参数的国际象棋Transformer被训练并在能揭示系统泛化失败的场景中进行测试。模型展示了组合泛化的表现,严格遵守游戏的基本语法规则并生成高质量的走法。然而,在Chess960(随机国际象棋)中,模型展示了基本的战略适应性,但被符号AI算法超越,尽管与Lichess上的用户对战时差距较小。训练动态表明,模型最初仅学会移动自己的棋子,这表明其对游戏的组合理解是逐步形成的。
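For readers unfamiliar with the Chess960 variant used in this entry's evaluation, the sketch below generates a legal randomized back rank under the standard Chess960 constraints: bishops on opposite-colored squares and the king between the two rooks. It only illustrates the variant and is unrelated to the paper's model or data pipeline. The function name `chess960_back_rank` is an illustrative choice.

```python
import random


def chess960_back_rank(rng: random.Random) -> str:
    """Return a random Chess960 back rank (files a to h) as an 8-letter string."""
    squares = [None] * 8
    # Bishops go on opposite-colored squares: one even index, one odd index.
    squares[rng.choice(range(0, 8, 2))] = "B"
    squares[rng.choice(range(1, 8, 2))] = "B"
    # The queen takes any remaining square.
    empty = [i for i, s in enumerate(squares) if s is None]
    squares[rng.choice(empty)] = "Q"
    # Two knights take any two of the remaining five squares.
    empty = [i for i, s in enumerate(squares) if s is None]
    for i in rng.sample(empty, 2):
        squares[i] = "N"
    # The last three squares, left to right, are rook, king, rook,
    # which places the king between the rooks.
    empty = [i for i, s in enumerate(squares) if s is None]
    for i, piece in zip(empty, "RKR"):
        squares[i] = piece
    return "".join(squares)


sample_rank = chess960_back_rank(random.Random(0))
```

The construction yields 4 x 4 x 6 x 10 = 960 distinct arrangements, which is where the variant's name comes from.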
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
Venue: NeurIPS 2025
First: 2025-10-23T17:48:36+00:00 · Latest: 2025-10-23T17:48:36+00:00
Comments: NeurIPS 2025
Abstract
Recent advancements in large reasoning models (LRMs) have introduced an intermediate "thinking" process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to "overthink" simpler instances, and have issues with scoring mechanisms leading to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.
中文标题/摘要
标题:大型推理模型是良好的机器翻译评估器吗?分析与性能提升
最近在大型推理模型(LRMs)方面的进展引入了一个中间的“思考”过程,在生成最终答案之前,这提高了它们在复杂下游任务上的推理能力。然而,LRMs作为机器翻译(MT)质量评估器的潜力尚未得到充分探索。我们提供了LRM作为评估者在MT评估中的首次系统分析。我们确定了关键挑战,揭示了LRMs需要定制的评估材料,倾向于对简单实例“过度思考”,并且评分机制存在问题导致高估。为了解决这些问题,我们建议通过在合成的类人思考轨迹上训练来校准LRM的思考。我们在WMT24指标基准上的实验表明,这种方法将思考预算大幅减少了约35倍,同时在从7B到32B的不同规模LRM上提高了评估性能(例如,R1-Distill-Qwen-7B实现了+8.7个相关性点的提升)。这些发现突显了高效校准的LRMs在推进细粒度自动MT评估方面的潜力。
CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image
Authors: Binbin Huang, Haobin Duan, Yiqun Zhao, Zibo Zhao, Yi Ma, Shenghua Gao
First: 2025-10-23T17:47:38+00:00 · Latest: 2025-10-23T17:47:38+00:00
Comments: project page at https://cupid3d.github.io
Abstract
This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
中文标题/摘要
标题:CUPID:基于姿态生成的单张图像三维重建
这项工作提出了一种新的基于生成的三维重建方法,名为Cupid,可以从单张二维图像中准确推断出相机姿态、三维形状和纹理。Cupid将三维重建视为从学习到的三维物体分布中进行条件采样的过程,并且它联合生成体素和像素-体素对应关系,能够在统一的生成框架下实现稳健的姿态和形状估计。通过将输入相机姿态和三维形状表示为共享三维潜在空间中的分布,Cupid采用两阶段流匹配管道:(1)粗略阶段生成初始三维几何结构及其相关的二维投影以恢复姿态;(2)细化阶段将姿态对齐的图像特征整合进来,以增强结构保真度和外观细节。大量实验表明,Cupid在峰值信噪比(PSNR)上比领先三维重建方法高出3dB以上,在倒角距离(Chamfer Distance)上降低超过10%,同时在姿态准确性上与单目估计器相当,并且在视觉保真度上优于基线三维生成模型。要查看Cupid生成的三维结果的沉浸式视图,请访问cupid3d.github.io。
Summary / 总结
Cupid is a generation-based 3D reconstruction method that infers the camera pose, 3D shape, and texture of an object from a single image by casting reconstruction as conditional sampling from a learned distribution of 3D objects. A two-stage flow matching pipeline first produces coarse 3D geometry with 2D projections for pose recovery, then refines it with pose-aligned image features. Experiments show that Cupid outperforms leading 3D reconstruction methods by over 3 dB PSNR and reduces Chamfer Distance by over 10%, while matching monocular estimators on pose accuracy and delivering better visual fidelity than baseline 3D generative models.
Cupid 是一种基于生成的单图像 3D 重建方法,通过将 3D 重建视为从学习到的 3D 物体分布中进行条件采样的过程,从单张图像中推断出相机姿态、3D 形状和纹理。它使用两阶段流匹配管道,首先生成初始 3D 几何形状,然后用姿态对齐的图像特征对其进行细化。实验表明,Cupid 的 PSNR 比领先的 3D 重建方法高出 3 dB 以上,倒角距离(Chamfer Distance)降低超过 10%,同时在姿态精度上与单目估计器相当,在视觉保真度上优于基线 3D 生成模型。
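The two metrics quoted for Cupid, PSNR and Chamfer Distance, can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the paper's evaluation code: the Chamfer variant here sums the mean squared nearest-neighbor distances in both directions, which is one common convention among several, and the helper names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def _squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed convention: sum of the two
    directional means of squared nearest-neighbor distances)."""
    if not a or not b:
        raise ValueError("point sets must be non-empty")
    a_to_b = sum(min(_squared_distance(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_squared_distance(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


cloud_a = [(0, 0, 0), (1, 0, 0)]
cloud_b = [(0, 0, 0), (1, 0, 1)]
example_chamfer = chamfer_distance(cloud_a, cloud_b)
example_psnr = psnr([0.0, 0.0], [0.1, 0.1])
```

Since PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving the MSE.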
CSU-PCAST: A Dual-Branch Transformer Framework for medium-range ensemble Precipitation Forecasting
Authors: Tianyi Xiong, Haonan Chen
First: 2025-10-23T17:43:38+00:00 · Latest: 2025-10-23T17:43:38+00:00
Comments: 20 pages, 12 figures, submitted to arXiv under Atmospheric and Oceanic Physics (physics.ao-ph) and Machine Learning (cs.LG)
Abstract
Accurate medium-range precipitation forecasting is crucial for hydrometeorological risk management and disaster mitigation, yet remains challenging for current numerical weather prediction (NWP) systems. Traditional ensemble systems such as the Global Ensemble Forecast System (GEFS) struggle to maintain high skill, especially for moderate and heavy rainfall at extended lead times. This study develops a deep learning-based ensemble framework for multi-step precipitation prediction through joint modeling of a comprehensive set of atmospheric variables. The model is trained on ERA5 reanalysis data at 0.25° spatial resolution, with precipitation labels from NASA's Integrated Multi-satellite Retrievals for Global Precipitation Measurement (GPM) constellation (IMERG), incorporating 57 input variables, including upper-air and surface predictors. The architecture employs a patch-based Swin Transformer backbone with periodic convolutions to handle longitudinal continuity and integrates time and noise embeddings through conditional layer normalization. A dual-branch decoder predicts total precipitation and other variables, with targeted freezing of encoder-decoder pathways for specialized training. Training minimizes a hybrid loss combining the Continuous Ranked Probability Score (CRPS) and weighted log1p mean squared error (log1pMSE), balancing probabilistic accuracy and magnitude fidelity. During inference, the model ingests real-time Global Forecast System (GFS) initial conditions to generate 15-day forecasts autoregressively. Evaluation against GEFS using IMERG data demonstrates higher Critical Success Index (CSI) scores at precipitation thresholds of 0.1 mm, 1 mm, 10 mm, and 20 mm, highlighting improved performance for moderate to heavy rainfall.
中文标题/摘要
标题:CSU-PCAST:一种用于中期集合降水预报的双分支Transformer框架
准确的中期降水预报对于水文气象风险管理和灾害减缓至关重要,但当前的数值天气预报(NWP)系统仍面临挑战,尤其是在长时间尺度上维持对中等和强降雨的高技能水平。本研究开发了一种基于深度学习的集合框架,用于多步降水预测,通过联合建模全面的气象变量集。该模型在0.25°空间分辨率的ERA5再分析数据上进行训练,使用来自NASA全球降水测量(GPM)星座(IMERG)的降水标签,包含57个输入变量,包括高空和地表预测因子。该架构采用基于补丁的Swin Transformer骨干网络,结合周期卷积处理经度连续性,并通过条件层归一化整合时间和噪声嵌入。双分支解码器预测总降水量和其他变量,通过冻结编码器-解码器路径进行专门训练。训练通过结合连续排名概率评分(CRPS)和加权对数1+均方误差(log1pMSE)的混合损失来最小化,平衡概率准确性和幅度保真度。在推理过程中,模型使用实时全球预报系统(GFS)初始条件自回归生成15天预报。使用IMERG数据与GEFS进行评估,显示出在0.1毫米、1毫米、10毫米和20毫米降水阈值下的更高成功率指数(CSI)分数,突显了对中等到强降雨的改进性能。
Summary / 总结
This study addresses the challenge of accurate medium-range precipitation forecasting by developing a deep learning-based ensemble framework called CSU-PCAST. The model uses a dual-branch transformer architecture with a Swin Transformer backbone and periodic convolutions, trained on ERA5 reanalysis data and NASA's IMERG precipitation labels. In evaluation against the Global Ensemble Forecast System (GEFS), it achieves higher Critical Success Index (CSI) scores at precipitation thresholds from 0.1 mm to 20 mm, indicating improved performance for moderate to heavy rainfall.
该研究通过开发基于深度学习的集合框架CSU-PCAST来解决中期降水预报的挑战。该模型采用具有Swin Transformer骨干和周期卷积的双分支Transformer架构,使用ERA5再分析数据和NASA的IMERG降水标签进行训练。与全球集合预报系统(GEFS)相比,该模型在0.1毫米至20毫米的降水阈值上具有更高的关键成功指数(CSI)分数,表明其对中等至强降水的预报性能有所提升。
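The Critical Success Index used in this entry's evaluation is conventionally defined as hits / (hits + misses + false alarms) for events exceeding a threshold. The sketch below computes it for paired forecast and observed values. It is a generic illustration, not the paper's evaluation code, and treating a value equal to the threshold as an event (`>=`) is an assumption.

```python
from typing import Sequence


def critical_success_index(forecast: Sequence[float],
                           observed: Sequence[float],
                           threshold: float) -> float:
    """CSI = hits / (hits + misses + false_alarms) at a given threshold."""
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have equal length")
    hits = misses = false_alarms = 0
    for f, o in zip(forecast, observed):
        forecast_event = f >= threshold  # assumed: threshold is inclusive
        observed_event = o >= threshold
        if forecast_event and observed_event:
            hits += 1
        elif observed_event:
            misses += 1
        elif forecast_event:
            false_alarms += 1
        # Correct negatives do not enter the CSI.
    denominator = hits + misses + false_alarms
    return hits / denominator if denominator else float("nan")


# Precipitation in mm at four grid points (made-up illustrative values).
forecast_mm = [0.0, 2.0, 12.0, 0.5]
observed_mm = [0.0, 1.5, 3.0, 11.0]
csi_1mm = critical_success_index(forecast_mm, observed_mm, threshold=1.0)
csi_10mm = critical_success_index(forecast_mm, observed_mm, threshold=10.0)
```

Because correct negatives are ignored, CSI is not inflated by the many dry grid points, which is one reason it is often reported for rare heavy-rainfall events.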
DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal
First: 2025-10-23T17:42:14+00:00 · Latest: 2025-10-23T17:42:14+00:00
Abstract
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
中文标题/摘要
标题:DyPE:动态位置外推在超高清扩散中的应用
扩散变换器模型能够生成具有非凡保真度和细节的图像,但由于自注意力机制与图像标记数量的平方级扩展,训练它们在超高清分辨率上仍然极其昂贵。在本文中,我们引入了一种名为动态位置外推(DyPE)的新型、无需训练的方法,该方法使预训练的扩散变换器能够在远超其训练数据的分辨率下合成图像,且无需额外的采样成本。DyPE 利用了扩散过程固有的频谱进展,其中低频结构早期收敛,而高频结构需要更多步骤才能解决。具体而言,DyPE 在每次扩散步骤中动态调整模型的位置编码,使其频谱与生成过程的当前阶段相匹配。这种方法使我们能够在远超训练分辨率的分辨率下生成图像,例如,使用 FLUX 生成 1600 万像素的图像。在多个基准测试中,DyPE 一致地提高了性能,并在超高清图像生成中达到了最先进的保真度,尤其是在更高分辨率下,性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。
Summary / 总结
DyPE is a training-free method that enables pre-trained diffusion transformers to generate images at ultra-high resolutions by dynamically adjusting positional encodings during the diffusion process. This method leverages the spectral progression of the diffusion process to match the frequency spectrum of the model's positional encoding with the current stage of generation. As a result, DyPE can generate images at resolutions far beyond the training data, achieving state-of-the-art fidelity, especially at higher resolutions, with no additional sampling cost. Project page: https://noamissachar.github.io/DyPE/.
DyPE 是一种无需训练的方法,通过在扩散过程中动态调整位置编码,使预训练的扩散变换器能够生成超高清图像。该方法利用扩散过程中的频谱进展,使模型的位置编码频谱与生成过程的当前阶段相匹配。因此,DyPE 可以生成远超训练数据分辨率的图像,尤其是在更高分辨率下实现最佳的保真度,且无需额外的采样成本。项目页面:https://noamissachar.github.io/DyPE/
MEIcoder: Decoding Visual Stimuli from Neural Activity by Leveraging Most Exciting Inputs
Authors: Jan Sobotka, Luca Baroni, Ján Antolík
Venue: NeurIPS 2025
First: 2025-10-23T17:35:34+00:00 · Latest: 2025-10-23T17:35:34+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in early visual system and provide practical insights for neuroscience and neuroengineering applications.
中文标题/摘要
标题:MEIcoder:通过利用最具兴奋性输入解码神经活动中的视觉刺激
从神经群体活动解码视觉刺激对于理解大脑和脑机接口的应用至关重要。然而,此类生物数据在灵长类或人类中往往稀缺,特别是由于高通量记录技术,如双光子成像,在这些物种中难以应用或不可能实施。这反过来又对深度学习解码技术提出了挑战。为克服这一问题,我们引入了MEIcoder,这是一种基于生物学的解码方法,利用了神经元特异的最具兴奋性输入(MEIs)、结构相似性指数度量损失和对抗训练。MEIcoder 在从初级视觉皮层(V1)单细胞活动重建视觉刺激方面达到了最先进的性能,特别是在小数据集和较少记录神经元的情况下表现尤为出色。通过消融研究,我们证明了MEIs是性能的主要驱动因素;在扩展实验中,我们展示了MEIcoder 可以从仅1,000-2,500个神经元和不到1,000个训练数据点中重建高保真度的自然图像。我们还提出了一种包含超过160,000个样本的统一基准,以促进未来的研究。我们的结果证明了在早期视觉系统中可靠解码的可行性,并为神经科学和神经工程应用提供了实用见解。
Summary / 总结
MEIcoder is a biologically informed decoding method that uses neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training to decode visual stimuli from neural activity. It achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in the primary visual cortex, especially on small datasets. Ablation studies show that MEIs are the main drivers of performance, and scaling experiments demonstrate that MEIcoder can reconstruct high-fidelity images from as few as 1,000-2,500 neurons and less than 1,000 training data points.
MEIcoder 是一种基于生物学知识的解码方法,利用神经元特定的最兴奋输入(MEIs)、结构相似性指数度量损失和对抗训练来解码神经活动中的视觉刺激。它在从初级视觉皮层单细胞活动重建视觉刺激方面达到了最先进的性能,尤其在小数据集上表现优异。消融研究显示,MEIs 是性能的主要驱动因素,而扩展实验表明,MEIcoder 仅需 1,000-2,500 个神经元和不到 1,000 个训练数据点即可重建高保真度的自然图像。
Learning Modular Exponentiation with Transformers
Authors: David Demitri Africa, Sara M. Kapoor, Theo Simon Sorg, Challenger Mishra
Venue: NeurIPS
First: 2025-06-30T10:00:44+00:00 · Latest: 2025-10-23T17:33:42+00:00
Comments: Accepted at the 5th MATH-AI Workshop, NeurIPS'25
Abstract
Modular exponentiation is crucial to number theory and cryptography, yet remains largely unexplored from a mechanistic interpretability standpoint. We train a 4-layer encoder-decoder Transformer model to perform this operation and investigate the emergence of numerical reasoning during training. Utilizing principled sampling strategies, PCA-based embedding analysis, and activation patching, we examine how number-theoretic properties are encoded within the model. We find that reciprocal operand training leads to strong performance gains, with sudden generalization across related moduli. These synchronized accuracy surges reflect grokking-like dynamics, suggesting the model internalizes shared arithmetic structure. We also find a subgraph consisting entirely of attention heads in the final layer sufficient to achieve full performance on the task of regular exponentiation. These results suggest that transformer models learn modular arithmetic through specialized computational circuits, paving the way for more interpretable and efficient neural approaches to modular exponentiation.
中文标题/摘要
标题:基于变换器的模幂运算学习
模幂运算在数论和密码学中至关重要,但在机制可解释性方面仍鲜有研究。我们训练了一个4层编码器-解码器变换器模型来执行此操作,并研究了训练过程中数值推理的出现。利用原则性的采样策略、基于PCA的嵌入分析和激活修补(activation patching),我们检查了模型中数论性质的编码方式。我们发现,倒数操作数训练带来了显著的性能提升,且在相关模数上出现了突然的泛化。这些同步的准确率突增反映了类似“顿悟”(grokking)的动态,表明模型内化了共享的算术结构。我们还发现,最终层中一个完全由注意力头组成的子图足以在常规幂运算任务上实现全部性能。这些结果表明,变换器模型通过专门的计算电路学习模算术,为更可解释和高效的模幂运算神经方法铺平了道路。
Summary / 总结
This study explores modular exponentiation using a 4-layer Transformer model, focusing on the emergence of numerical reasoning during training. By employing reciprocal operand training and analyzing attention heads, the model demonstrates strong performance gains and sudden generalization across related moduli, indicating the internalization of shared arithmetic structure. The findings suggest that transformers learn modular arithmetic through specialized computational circuits, enhancing interpretability and efficiency in neural approaches to modular exponentiation.
该研究使用4层Transformer模型探索模幂运算,关注训练过程中数值推理的涌现。通过采用倒数操作数训练和分析注意力机制,研究发现模型能够内化共享的算术结构,导致在相关模数上突然出现性能提升。此外,最终层中的一个由注意力头组成的子图足以实现全性能,表明Transformer通过专门的计算电路学习模幂运算。
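As background for this entry, the operation the Transformer is trained to perform can be computed exactly with the classic binary square-and-multiply algorithm. The sketch below is a reference implementation of that standard algorithm, not anything from the paper's model or training setup. Python's built-in three-argument `pow` computes the same result and is used here only as a cross-check.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus via binary square-and-multiply."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus  # yields 0 when modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # The current bit is set: fold this power of the base in.
            result = (result * base) % modulus
        # Square the base for the next bit of the exponent.
        base = (base * base) % modulus
        exponent >>= 1
    return result


example = mod_exp(3, 200, 13)
matches_builtin = example == pow(3, 200, 13)
```

The loop runs once per bit of the exponent, so the cost grows with the logarithm of the exponent rather than with the exponent itself.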
ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology
Authors: Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Diana Mechtcheriakova, Amirreza Mahbod
First: 2025-10-23T17:21:06+00:00 · Latest: 2025-10-23T17:21:06+00:00
Comments: 5 pages
Abstract
Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS-SegNet
中文标题/摘要
标题:ACS-SegNet:一种基于注意力机制的CNN-SegFormer组织分割网络
自动化组织病理图像分析在各种疾病的计算机辅助诊断中起着重要作用。在开发的算法中,基于深度学习的方法在多个任务中表现出色,包括组织病理图像中的语义组织分割。在本研究中,我们提出了一种基于注意力驱动的特征融合方法,结合了卷积神经网络(CNNs)和视觉变换器(ViTs)在统一的双编码器模型中,以提高语义分割性能。在两个公开可用的数据集上的评估表明,我们的模型在GCPS数据集上实现了76.79%/86.87%的μIoU/μDice分数,在PUMA数据集上实现了64.93%/76.60%的μIoU/μDice分数,优于最先进的基准和基线方法。我们的方法的实现已公开在GitHub仓库中:https://github.com/NimaTorbati/ACS-SegNet
Summary / 总结
The study aims to enhance semantic tissue segmentation in histopathological images through a novel approach combining attention-driven feature fusion of CNNs and ViTs within a dual-encoder model. The proposed ACS-SegNet model achieved superior performance, with μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, surpassing state-of-the-art benchmarks.
研究旨在通过结合基于注意力机制的CNN和ViT特征融合的双编码器模型来提升病理图像中的语义组织分割性能。该模型在两个数据集上的评估结果显示,其μIoU/μDice分数分别为GCPS上的76.79%/86.87%和PUMA上的64.93%/76.60%,优于现有基准。
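The IoU and Dice metrics reported above can be computed from pixel-level counts. The sketch below does this for a single class over flat label sequences. The helper name `iou_and_dice` is hypothetical, and because the abstract does not spell out how the μ-prefixed scores are averaged across classes or images, this example deliberately stops at the per-class definition.

```python
def iou_and_dice(pred, target, cls):
    """Return (IoU, Dice) for one class given equal-length flat label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, target) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, target) if p != cls and t == cls)
    union = tp + fp + fn
    if union == 0:
        # Class absent from both prediction and target; scoring this as a
        # perfect match is a convention chosen for this sketch.
        return 1.0, 1.0
    iou = tp / union
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


iou, dice = iou_and_dice([1, 1, 0, 0, 1], [1, 0, 0, 1, 1], cls=1)
```

Here two pixels are true positives, one is a false positive, and one is a false negative, so IoU is 2/4 and Dice is 4/6.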
Reinforcement Learning and Consumption-Savings Behavior
Authors: Brandon Kaplowitz
First: 2025-10-23T17:14:49+00:00 · Latest: 2025-10-23T17:14:49+00:00
Comments: 41 pages, 10 figures
Abstract
This paper demonstrates how reinforcement learning can explain two puzzling empirical patterns in household consumption behavior during economic downturns. I develop a model where agents use Q-learning with neural network approximation to make consumption-savings decisions under income uncertainty, departing from standard rational expectations assumptions. The model replicates two key findings from recent literature: (1) unemployed households with previously low liquid assets exhibit substantially higher marginal propensities to consume (MPCs) out of stimulus transfers compared to high-asset households (0.50 vs 0.34), even when neither group faces borrowing constraints, consistent with Ganong et al. (2024); and (2) households with more past unemployment experiences maintain persistently lower consumption levels after controlling for current economic conditions, a "scarring" effect documented by Malmendier and Shen (2024). Unlike existing explanations based on belief updating about income risk or ex-ante heterogeneity, the reinforcement learning mechanism generates both higher MPCs and lower consumption levels simultaneously through value function approximation errors that evolve with experience. Simulation results closely match the empirical estimates, suggesting that adaptive learning through reinforcement learning provides a unifying framework for understanding how past experiences shape current consumption behavior beyond what current economic conditions would predict.
中文标题/摘要
标题:强化学习与消费储蓄行为
本文展示了强化学习如何解释经济衰退期间家庭消费行为中的两个令人困惑的经验模式。我开发了一个模型,其中代理使用神经网络近似进行Q学习,以在收入不确定性下做出消费储蓄决策,从而偏离标准理性预期假设。该模型重现了近期文献中的两个关键发现:(1)失业家庭在之前拥有较低流动资产的情况下,相对于高资产家庭,其在刺激转移中的边际消费倾向(MPCs)高出许多(0.50 vs 0.34),即使两组均未面临借贷限制,这与Ganong等人(2024)的研究一致;(2)具有更多失业经历的家庭在控制当前经济状况后,持续保持较低的消费水平,这种“疤痕效应”由Malmendier和Shen(2024)记录。与基于收入风险信念更新或事前异质性的现有解释不同,强化学习机制通过随经验变化的价值函数近似误差同时产生较高的MPCs和较低的消费水平。模拟结果与经验估计高度吻合,表明通过强化学习进行的自适应学习提供了一个统一框架,以理解过去经历如何塑造当前的消费行为,超越当前经济状况的预测。
Summary / 总结
This paper uses reinforcement learning to explain two empirical patterns in household consumption behavior during economic downturns. The model, which employs Q-learning with neural network approximation, shows that unemployed households with low liquid assets have higher marginal propensities to consume compared to high-asset households, and that past unemployment experiences lead to persistently lower consumption levels. These findings align with recent literature and suggest that reinforcement learning can generate both higher marginal propensities to consume and lower consumption levels through value function approximation errors that evolve with experience, offering a unified framework for understanding consumption behavior beyond current economic conditions.
本文使用强化学习来解释经济衰退期间家庭消费行为中的两个实证模式。该模型采用基于神经网络逼近的Q学习,表明失业且之前拥有低流动资产的家庭在刺激转移中的边际消费倾向高于高资产家庭,并且过去的失业经历会导致消费水平在控制当前经济状况后持续降低。这些发现与近期文献相符,表明通过随经验变化的价值函数近似误差,强化学习可以同时生成更高的边际消费倾向和更低的消费水平,提供了一个统一的框架来理解过去经历如何影响当前的消费行为。
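To make the learning rule concrete, here is a minimal sketch of a single tabular Q-learning update. It is a simplification: the paper uses Q-learning with neural network approximation, whereas this example stores values in a dictionary. The state and action labels, the learning rate, and the discount factor are all hypothetical choices for illustration.

```python
def q_update(q, state, action, reward, next_state, actions, lr=0.1, discount=0.95):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Value of the best action available in the next state (0.0 if unseen).
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + discount * best_next
    old_value = q.get((state, action), 0.0)
    q[(state, action)] = old_value + lr * (td_target - old_value)
    return q[(state, action)]


q_table = {}
new_value = q_update(q_table, "low_assets", "consume", 1.0, "low_assets", ["consume", "save"])
```

Because each estimate moves only a fraction `lr` toward its target, the learned values depend on the history of experienced rewards, which is the kind of experience dependence the abstract attributes to value function approximation errors.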
GenLit: Reformulating Single-Image Relighting as Video Generation
Authors: Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, Michael J. Black
First: 2024-12-15T15:40:40+00:00 · Latest: 2025-10-23T17:11:15+00:00
Abstract
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. Project page: https://genlit.is.tue.mpg.de/.
中文标题/摘要
标题:GenLit: 将单张图像重新光照问题重新表述为视频生成问题
在单张图像中调整3D场景的光照代表了计算机视觉和图形学中的一个基本挑战。这一问题传统上通过逆渲染技术解决,涉及显式的3D资产重建和昂贵的光线追踪模拟。同时,视觉基础模型的最新进展表明,一种新的范式可能即将出现——用大量图像和视频数据训练的网络取代显式的物理模型。在本文中,我们利用视频扩散模型的隐式场景理解,特别是稳定视频扩散,来重新光照单张图像。我们引入了GenLit框架,将图形引擎执行光照操作的能力提炼为视频生成模型,使用户可以直接在给定图像中的3D世界中插入和操作点光源,并直接生成视频序列。我们发现,仅在一个小型合成数据集上微调的模型可以泛化到真实场景,实现具有合理且令人信服的阴影和互反射的单张图像重新光照。我们的结果突显了视频基础模型捕捉光照、材质和形状丰富信息的能力,我们的发现表明,这些模型在少量训练下可以用于重新光照,而无需显式的资产重建或光线追踪。
Summary / 总结
GenLit reformulates single-image relighting as a video generation task using a video diffusion model, particularly Stable Video Diffusion. This approach enables users to manipulate a point light in a 3D scene within a single image and generate plausible relighting effects as a video sequence. The model, fine-tuned on a small synthetic dataset, generalizes well to real-world scenes, producing convincing shadows and inter-reflections without requiring explicit 3D asset reconstruction or ray-tracing simulations.
本文提出GenLit框架,利用视频扩散模型来处理单张图像中的光照变化问题。方法是通过在小型合成数据集上微调模型,使其能够泛化到真实场景,用户可以在给定图像中插入和操纵点光源,并生成视频序列结果。主要发现包括模型能够生成逼真的阴影和相互反射,表明视频基础模型能够捕捉丰富的光照、材质和形状信息,而无需进行显式的3D资产重建或光线追踪。
Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex
Authors: Azadeh Beiranvand, Seyed Mehdi Vahidipour
First: 2025-04-16T20:25:11+00:00 · Latest: 2025-10-23T17:06:25+00:00
Comments: 26 pages, 4 figures
Abstract
Text-attributed graphs (TAGs) present unique challenges in representation learning by requiring models to capture both the semantic richness of node-associated texts and the structural dependencies of the graph. While graph neural networks (GNNs) excel at modeling topological information, they lack the capacity to process unstructured text. Conversely, large language models (LLMs) are proficient in text understanding but are typically unaware of graph structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel architecture that tightly integrates GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit allows for mutual attention between textual and structural representations, enabling information to flow in both directions, text influencing structure and structure guiding textual interpretation. The proposed architecture is trained using parameter-efficient fine-tuning (LoRA), keeping the LLM frozen while adapting to task-specific signals. Extensive experiments on five benchmark datasets demonstrate that BiGTex achieves state-of-the-art performance in node classification and generalizes effectively to link prediction. An ablation study further highlights the importance of soft prompting and bi-directional attention in the model's success.
中文标题/摘要
标题:在BiGTex中结合文本属性图的结构和语义信号
文本属性图(TAGs)在表示学习中提出了独特挑战,需要模型同时捕捉节点关联文本的语义丰富性和图的结构依赖性。虽然图神经网络(GNNs)擅长建模拓扑信息,但缺乏处理非结构化文本的能力。相反,大型语言模型(LLMs)擅长文本理解,但通常不了解图结构。在本工作中,我们提出了一种名为BiGTex(双向图文本)的新型架构,通过堆叠图-文本融合单元,将GNNs和LLMs紧密结合。每个单元允许文本和结构表示之间的相互注意,使信息可以双向流动,文本影响结构,结构指导文本解释。该提出的架构通过参数高效微调(LoRA)进行训练,保持LLM冻结,以适应特定任务信号。在五个基准数据集上的广泛实验表明,BiGTex在节点分类中达到了最先进的性能,并且在链接预测中表现出有效的泛化能力。进一步的消融研究还强调了软提示和双向注意在模型成功中的重要性。
Summary / 总结
This work addresses the challenge of integrating structural and semantic signals in text-attributed graphs by proposing BiGTex, a novel architecture that combines GNNs and LLMs through stacked Graph-Text Fusion Units. Each unit facilitates mutual attention between textual and structural representations, allowing information to flow bidirectionally. The model is trained using parameter-efficient fine-tuning (LoRA); on five benchmark datasets it achieves state-of-the-art performance in node classification and generalizes effectively to link prediction.
研究通过提出BiGTex架构,结合图神经网络(GNNs)和大型语言模型(LLMs),解决了文本属性图(TAGs)中结构和语义信号的整合问题。BiGTex使用堆叠的Graph-Text融合单元,使文本和结构表示之间能够相互注意,信息可以双向流动。该模型使用参数高效微调(LoRA)进行训练,并在五个基准数据集上实现了节点分类和链接预测的最先进性能。
Learning to Triage Taint Flows Reported by Dynamic Program Analysis in Node.js Packages
Authors: Ronghao Ni, Aidan Z. H. Yang, Min-Chien Hsu, Nuno Sabino, Limin Jia, Ruben Martins, Darion Cassel, Kevin Cheang
First: 2025-10-23T16:58:02+00:00 · Latest: 2025-10-23T16:58:02+00:00
Abstract
Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNNs and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves $F_1 = 0.915$, while the best GNN and classical ML models reach $F_1 = 0.904$. At a less than 7% false-negative rate, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives amongst all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.
中文标题/摘要
标题:学习处理动态程序分析在Node.js包中报告的污染流
程序分析工具通常会产生大量候选漏洞报告,需要昂贵的手动审查,这提出了一个实际挑战:安全分析师如何优先处理最有可能是真实漏洞的报告? 本文探讨了是否可以应用机器学习来优先处理程序分析工具报告的漏洞。我们专注于Node.js包,并收集了1,883个Node.js包的基准数据,每个包包含一个报告的ACE或ACI漏洞。我们评估了多种机器学习方法,包括经典模型、图神经网络(GNN)、大型语言模型(LLM)以及结合GNN和LLM的混合模型,这些模型基于动态程序分析工具的输出进行训练。顶级LLM的$F_{1} {=} 0.915$,而最佳GNN和经典机器学习模型的$F_{1} {=} 0.904$。在不到7%的假阴性率下,领先模型从手动审查中排除了66.9%的良性包,每包耗时约60毫秒。如果将最佳模型调整为0.8的精确度水平(即在所有警告中允许20%的假阳性),我们的方法可以检测到99.2%的可利用污染流,而仅错过0.8%,这表明在实际漏洞处理中具有很强的潜力。
Summary / 总结
This paper explores the application of machine learning to prioritize vulnerability reports from program analysis tools, focusing on Node.js packages. It evaluates classical models, graph neural networks, large language models, and hybrid GNN-LLM models; the top LLM achieves an F1 score of 0.915. At a false-negative rate below 7%, the leading model eliminates 66.9% of benign packages from manual review, taking about 60 ms per package. When tuned to a precision of 0.8, the best model detects 99.2% of exploitable taint flows.
本文研究了使用机器学习来优先处理程序分析工具从Node.js包中生成的漏洞报告。评估了包括经典模型、图神经网络(GNN)和大型语言模型(LLM)在内的多种模型,其中最佳LLM的F1得分为0.915。在7%的假阴性率下,最佳模型将需要人工审查的良性包数量减少了66.9%,每处理一个包大约需要60毫秒,并且在允许20%的误报率时,可以检测到99.2%的可利用漏洞流。
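The triage metrics quoted above are all derived from true-positive, false-positive, and false-negative counts. The sketch below shows that arithmetic; the helper names are hypothetical. The final line combines the precision (0.8) and detection rate (99.2%) quoted in the abstract purely as a worked example of the formula. The resulting value is not a figure reported by the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of flagged reports that are true vulnerabilities."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true vulnerabilities that are flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_from_rates(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Worked example at the precision-0.8 operating point described in the abstract.
f1_at_operating_point = f1_from_rates(0.8, 0.992)
```

A false-negative rate is simply one minus recall, so "missing only 0.8%" and "detecting 99.2%" describe the same operating point.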
Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process
Authors: Tsai Hor Chan, Feng Wu, Yihang Chen, Guosheng Yin, Lequan Yu
First: 2025-10-23T16:53:24+00:00 · Latest: 2025-10-23T16:53:24+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors. Ablation analysis further validates the effectiveness of DP in aligning modality distributions and its robustness to changes in key hyperparameters. Code is anonymously available at https://github.com/HKU-MedAI/DPMM.git
中文标题/摘要
标题:通过变分狄利克雷过程放大多模态学习中的显著表示
在许多实际场景中,如医疗保健和金融领域,开发有效的多模态融合方法变得越来越重要。关键挑战是如何在学习跨模态交互的同时保持每个模态的特征表达能力。以往的方法主要集中在跨模态对齐上,而过度强调模态边缘分布的对齐可能会施加过多的正则化,阻碍每个模态内有意义的表示。狄利克雷过程(DP)混合模型是一种强大的贝叶斯非参数方法,通过其“富者愈富”的特性可以放大最显著的特征,为它们分配更多的权重。受DP这一独特特性的启发,我们提出了一种新的DP驱动的多模态学习框架,该框架能够自动在显著的模内表示学习和跨模态对齐之间实现最优平衡。具体而言,我们假设每个模态遵循多元高斯分布的混合,并进一步采用DP计算所有成分的混合权重。这种范式允许DP动态分配特征的贡献并选择最显著的特征,利用其“富者愈富”的特性,从而促进多模态特征融合。在多个多模态数据集上的广泛实验表明,我们的模型在性能上优于其他竞争对手。消融分析进一步验证了DP在对齐模态分布方面的有效性及其对关键超参数变化的鲁棒性。代码匿名发布在https://github.com/HKU-MedAI/DPMM.git
Summary / 总结
This paper addresses the challenge of preserving feature expressiveness in each modality while learning cross-modal interactions by proposing a DP-driven multimodal learning framework. The method uses a Dirichlet process mixture model to dynamically allocate weights to the most prominent features, enhancing their representation. Experiments on multiple multimodal datasets show that this approach outperforms existing methods, and ablation studies confirm the effectiveness of the Dirichlet process in aligning modality distributions and its robustness to hyperparameter changes.
该论文解决了在学习跨模态交互的同时保持每个模态特征表达性的挑战。它提出了一种基于Dirichlet过程的多模态学习框架,使用Dirichlet过程放大显著特征,并在内模表示学习和跨模态对齐之间实现最优平衡。实验表明,所提出模型在多个多模态数据集上的性能优于其他竞争对手,且消融分析证实了Dirichlet过程在对齐模态分布和对关键超参数变化的鲁棒性方面的有效性。
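The "richer-gets-richer" property mentioned in the abstract is commonly illustrated with the Chinese restaurant process view of a Dirichlet process, sketched below. This is a generic illustration and an assumption on our part: the abstract does not describe how the authors compute their mixture weights (the title indicates a variational treatment), and the function name and parameters here are hypothetical.

```python
import random


def chinese_restaurant_process(num_customers: int, alpha: float, seed: int = 0):
    """Simulate table sizes under a Chinese restaurant process with concentration alpha."""
    rng = random.Random(seed)
    table_sizes = []
    for n in range(num_customers):
        # Existing table k is chosen with probability size_k / (n + alpha);
        # a new table is opened with probability alpha / (n + alpha).
        r = rng.uniform(0.0, n + alpha)
        cumulative = 0.0
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[k] += 1
                break
        else:
            table_sizes.append(1)
    return table_sizes


sizes = chinese_restaurant_process(1000, alpha=1.0)
weights = [s / sum(sizes) for s in sizes]  # empirical mixture weights
```

Because the chance of joining a table is proportional to its current size, components that are already large tend to attract still more mass, which is the amplification effect the abstract refers to.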
Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs
Authors: Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai
First: 2025-06-24T18:01:52+00:00 · Latest: 2025-10-23T16:48:37+00:00
Comments: 36 pages, 3 figures
Abstract
We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at: https://github.com/kAIto47802/Prover-Agent.
中文标题/摘要
标题:证明代理:一种基于代理的正式数学证明框架
我们提出了证明代理,这是一种将大型语言模型(LLMs)与形式证明助手Lean结合的新型AI代理,用于自动定理证明。证明代理协调了一个非正式推理LLM、一个形式证明模型以及来自Lean的反馈,同时生成辅助引理。这些辅助引理不仅限于形式证明中的子目标,还可以包括从假设中推导出的特殊情况或潜在有用的事实,这些事实有助于发现可行的证明策略。它在MiniF2F基准测试中达到了88.1%的成功率,建立了使用小型语言模型(SLMs)的新最先进的方法,且样本预算远低于先前的方法。我们还提供了理论分析和案例研究,以说明这些生成的引理如何有助于解决具有挑战性的问题。我们的代码可在以下地址公开获取:https://github.com/kAIto47802/Prover-Agent.
Summary / 总结
Prover Agent is an AI framework that combines large language models with Lean, a formal proof assistant. It uses an informal reasoning model, a formal prover, and feedback from Lean to generate auxiliary lemmas, which help in discovering proof strategies. Prover Agent achieves an 88.1% success rate on the MiniF2F benchmark, a new state of the art among methods using small language models, with a much lower sample budget than previous approaches.
Prover Agent 是一个结合了大型语言模型和形式证明助手 Lean 的 AI 框架。它使用非正式推理模型、形式证明模型和 Lean 的反馈来生成辅助引理,这些引理有助于发现证明策略。Prover Agent 在 MiniF2F 基准测试中达到了 88.1% 的成功率,超过了使用小型语言模型的先前方法,并且使用了更少的样本量。
Thought Communication in Multiagent Collaboration
Authors: Yujia Zheng, Zhuokai Zhao, Zijian Li, Yaqi Xie, Mingze Gao, Lizhu Zhang, Kun Zhang
Venue: NeurIPS 2025 Spotlight
First: 2025-10-23T16:48:02+00:00 · Latest: 2025-10-23T16:48:02+00:00
Comments: NeurIPS 2025 Spotlight
Abstract
Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
中文标题/摘要
标题:多智能体协作中的思维通信
自然语言长期以来使人类合作成为可能,但其损失性、模糊性和间接性限制了集体智能的潜力。虽然机器不受这些限制,但大多数基于LLM的多智能体系统仍然仅依赖自然语言,交换标记或其嵌入。为了超越语言,我们引入了一种新的范式——思维通信,使智能体能够直接心心相通,类似于心灵感应。为了以一种原则性的方式揭示这些潜在的想法,我们将这一过程形式化为一个通用的潜在变量模型,其中智能体状态由潜在想法的未知函数生成。我们证明,在没有辅助信息的非参数设置中,任何一对智能体之间的共享和私有潜在想法都可以被识别。此外,思维共享的全局结构,包括哪些智能体共享哪些想法以及这些关系如何结构化,也可以在理论保证下被恢复。在已建立的理论指导下,我们开发了一个框架,在通信前从所有智能体中提取潜在想法,并为每个智能体分配相关的想法及其共享模式。这一范式自然地超越了LLMs,适用于所有模态,因为大多数观测数据源自隐藏的生成过程。在合成和真实世界基准上的实验验证了理论,并展示了思维通信的协作优势。我们希望这项工作能揭示利用隐藏世界的潜力,因为许多挑战仅通过表面观察是无法解决的,无论计算量或数据规模如何。
Summary / 总结
The paper aims to enhance multi-agent collaboration by introducing thought communication, a direct mind-to-mind interaction method, to overcome the limitations of natural language. The authors formalize this process using a latent variable model, proving that shared and private thoughts between agents can be identified and the global structure of thought sharing can be recovered. Experiments show that thought communication improves collaborative performance on both synthetic and real-world benchmarks, validating the theoretical framework and highlighting the potential of leveraging hidden generative processes in multi-agent systems.
研究旨在通过引入直接的心灵感应式交互方法——思想通信,来提升多智能体的合作能力,克服自然语言的局限性。方法将思想通信形式化为潜在变量模型,能够识别智能体之间的共享和私有思想。实验表明,这种方法可以恢复思想共享的全局结构,并在合成和真实世界基准上提高协作性能。
Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
Authors: Xizhi Wu, Madeline S. Kreider, Philip E. Empey, Chenyu Li, Yanshan Wang
First: 2025-10-23T16:44:39+00:00 · Latest: 2025-10-23T16:44:39+00:00
Abstract
Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language model (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.
中文标题/摘要
标题:使用自然语言处理从临床笔记中自动化提取氟尿嘧啶治疗及治疗相关毒性
目的:氟尿嘧啶广泛用于治疗结直肠癌和乳腺癌,但与手足综合征和心脏毒性等毒性反应相关。由于毒性信息通常嵌入在临床笔记中,我们旨在开发和评估自然语言处理(NLP)方法以提取治疗和毒性信息。 材料与方法:我们构建了一个包含204,165名成人肿瘤患者236份临床笔记的黄金标准数据集。领域专家对治疗方案和毒性相关类别进行了注释。我们开发了基于规则、机器学习(随机森林、支持向量机[SVM]、逻辑回归[LR])、深度学习(BERT、ClinicalBERT)和大型语言模型(LLM)的NLP方法(零样本和错误分析提示)。模型使用80:20的训练测试分割。 结果:有足够的数据来训练和评估5个注释类别。错误分析提示在治疗和毒性提取方面实现了最佳的精确度、召回率和F1分数(F1=1.000),而零样本提示在治疗方面达到了F1=1.000,在毒性方面达到了F1=0.876。逻辑回归和支持向量机在毒性提取方面排名第二(F1=0.937)。深度学习表现不佳,BERT(治疗F1=0.873;毒性F1=0.839)和ClinicalBERT(治疗F1=0.873;毒性F1=0.886)。基于规则的方法作为基线,治疗的F1得分为0.857,毒性为0.858。 讨论:基于LLM的方法优于其他所有方法,其次是机器学习方法。机器学习和深度学习方法受限于小的训练数据,且在提取稀有类别时表现出有限的泛化能力。 结论:基于LLM的NLP最有效地从临床笔记中提取了氟尿嘧啶治疗和毒性信息,并且在支持肿瘤学研究和药物警戒方面具有强大的潜力。
Summary / 总结
The study aimed to develop natural language processing (NLP) methods to extract fluoropyrimidine treatment and related toxicities from clinical notes. A gold-standard dataset of 236 clinical notes was created, and various NLP approaches including rule-based, machine learning-based (Random Forest, SVM, LR), deep learning-based (BERT, ClinicalBERT), and large language models (LLM) were evaluated. Error-analysis prompting achieved optimal precision, recall, and F1 scores for treatment and toxicities extraction, while zero-shot prompting performed well for treatment but less so for toxicities. LLM-based approaches outperformed others, indicating their strong potential for oncology research and pharmacovigilance.
研究旨在开发自然语言处理(NLP)方法,从临床笔记中提取氟尿嘧啶治疗和毒性信息。创建了一个包含236份临床笔记的黄金标准数据集,并评估了包括基于规则、机器学习、深度学习和大型语言模型在内的多种NLP方法。错误分析提示在治疗和毒性提取方面达到了最优的精确度、召回率和F1分数,而零样本提示在治疗方面表现良好,但在毒性方面表现较差。大型语言模型表现最佳,表明其在肿瘤学研究和药物警戒方面具有强大的潜力。
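To illustrate what a rule-based baseline of the kind mentioned above might look like, here is a minimal regex sketch. The term lists are hypothetical and far smaller than any real lexicon; they are not the authors' rules. The toxicity terms come from the abstract, and the drug names are common fluoropyrimidines chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny lexicons for illustration only.
TREATMENT_PATTERN = re.compile(r"\b(?:capecitabine|fluorouracil|5-fu)\b", re.IGNORECASE)
TOXICITY_PATTERN = re.compile(r"\b(?:hand-foot syndrome|cardiotoxicity)\b", re.IGNORECASE)


def extract_mentions(note: str) -> dict:
    """Return the sorted, lower-cased treatment and toxicity terms found in a note."""
    return {
        "treatment": sorted({m.group(0).lower() for m in TREATMENT_PATTERN.finditer(note)}),
        "toxicity": sorted({m.group(0).lower() for m in TOXICITY_PATTERN.finditer(note)}),
    }


example_note = (
    "Patient started capecitabine; later switched to 5-FU. "
    "Developed grade 2 hand-foot syndrome."
)
mentions = extract_mentions(example_note)
```

A sketch like this has no handling for negation ("no cardiotoxicity"), abbreviations, or misspellings, which is one plausible reason a simple rule-based baseline would trail the other approaches.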
AutoScape: Geometry-Consistent Long-Horizon Scene Generation
Authors: Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker
Venue: ICCV 2025
First: 2025-10-23T16:44:34+00:00 · Latest: 2025-10-23T16:44:34+00:00
Comments: ICCV 2025. Project page: https://auto-scape.github.io
Abstract
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
中文标题/摘要
标题:AutoScape:几何一致的长时序场景生成
本文提出AutoScape,一种长时序驾驶场景生成框架。其核心是一种新颖的RGB-D扩散模型,该模型迭代生成稀疏、几何一致的关键帧,作为场景外观和几何结构的可靠锚点。为了保持长距离几何一致性,该模型1) 在共享的潜在空间中同时处理图像和深度,2) 显式地基于先前生成的关键帧的渲染点云条件化现有的场景几何结构,3) 用变形一致的指导来引导采样过程。给定高质量的RGB-D关键帧,视频扩散模型在它们之间进行插值以生成密集且连贯的视频帧。AutoScape生成了超过20秒的逼真且几何一致的驾驶视频,分别将长时序FID和FVD得分提高了48.6%和43.0%。
Summary / 总结
AutoScape is a framework for generating long-horizon driving scenes using a novel RGB-D diffusion model that iteratively generates geometrically consistent keyframes. The model maintains long-range geometric consistency by handling image and depth in a shared latent space, conditioning on existing scene geometry, and steering the sampling process with warp-consistent guidance. This approach generates realistic and geometrically consistent driving videos of over 20 seconds and improves long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
AutoScape 是一种使用新颖的 RGB-D 扩散模型生成长时驾驶场景的框架,该模型通过迭代生成几何上一致的关键帧来工作。模型通过在共享的潜在空间中处理图像和深度、基于现有场景几何进行条件控制以及使用变形一致的指导来保持长距离的几何一致性。这种方法生成了超过 20 秒的逼真且几何上一致的驾驶视频,显著提高了 FID 和 FVD 分数,超过了先前的方法。
No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes
Authors: Jasmine Bayrooti, Sattar Vakili, Amanda Prorok, Carl Henrik Ek
Venue: NeurIPS
First: 2025-10-23T16:44:31+00:00 · Latest: 2025-10-23T16:44:31+00:00
Comments: Appearing in NeurIPS, 2025
Abstract
Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\,\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.
中文标题/摘要
标题:有限时间马尔可夫决策过程的无遗憾汤普森采样方法与高斯过程
汤普森采样(TS)是一种强大的广泛使用的顺序决策策略,应用于从贝叶斯优化到强化学习(RL)的多个领域。尽管TS取得了成功,但其理论基础仍然有限,特别是在具有复杂时间结构的RL环境中。我们通过使用具有高斯边缘分布的模型来填补这一空白,建立了TS的无遗憾保证。具体而言,我们考虑具有奖励和转换联合高斯过程(GP)先验的RL episodic问题。我们证明了在长度为$H$的$K$个episode中,遗憾界为$\mathcal{\tilde{O}}(\sqrt{KH\Gamma(KH)})$,其中$\Gamma(\cdot)$捕获了GP模型的复杂性。我们的分析解决了包括价值函数的非高斯性质和贝尔曼更新的递归结构在内的多个挑战,并将经典的椭圆势引理扩展到多输出设置。这项工作推进了对TS在RL中的理解,并突显了结构假设和模型不确定性如何影响其在有限时间马尔可夫决策过程中的性能。
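As a much simpler illustration of the Thompson sampling idea than the episodic GP setting analyzed in the paper, the sketch below runs TS on a multi-armed bandit with Gaussian rewards and independent Gaussian priors. Everything here (function name, prior, noise level, arm means) is an assumption for illustration and does not reproduce the paper's algorithm or its regret analysis.

```python
import random


def thompson_sampling(true_means, num_rounds, noise_std=1.0, seed=0):
    """Run Gaussian Thompson sampling; return (pull counts, cumulative regret)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Posterior over each arm's mean is N(post_mean, 1 / post_prec); prior is N(0, 1).
    post_mean = [0.0] * n_arms
    post_prec = [1.0] * n_arms
    noise_prec = 1.0 / noise_std ** 2
    best_mean = max(true_means)
    pulls = [0] * n_arms
    regret = 0.0
    for _ in range(num_rounds):
        # Draw one plausible mean per arm from its posterior and act greedily on the draws.
        samples = [rng.gauss(post_mean[a], post_prec[a] ** -0.5) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], noise_std)
        # Conjugate Gaussian update with known noise variance.
        new_prec = post_prec[arm] + noise_prec
        post_mean[arm] = (post_prec[arm] * post_mean[arm] + noise_prec * reward) / new_prec
        post_prec[arm] = new_prec
        pulls[arm] += 1
        regret += best_mean - true_means[arm]
    return pulls, regret


pulls, regret = thompson_sampling([0.0, 1.0], num_rounds=200)
```

A no-regret guarantee, in the sense used by the abstract, means that cumulative regret grows sublinearly in the number of rounds, so the average per-round regret tends to zero.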
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Venue: NeurIPS 2025
First: 2025-10-13T15:25:52+00:00 · Latest: 2025-10-23T16:40:49+00:00
Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk
Abstract
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
中文标题/摘要
标题:mmWalk:迈向多模态多视角行走辅助
在极端或复杂环境中提供行走辅助仍然是盲人或低视力(BLV)人群的一大挑战,主要原因是缺乏对整体场景的理解。受BLV社区实际需求的启发,我们构建了mmWalk,这是一个模拟的多模态数据集,集成了多视角传感器和无障碍导向特征,用于户外安全导航。该数据集包含120条手动控制、场景分类的行走轨迹,共有62000帧同步图像。它包含了超过559000张全景图像,涵盖RGB、深度和语义模态。此外,为了强调现实相关性,每条轨迹都涉及户外的边缘情况和专为BLV用户设计的无障碍地标。此外,我们还生成了mmWalkVQA,这是一个包含超过69000个视觉问题-答案三元组的VQA基准,分为9个类别,旨在提供安全和知情的行走辅助。我们使用零样本和少样本设置评估了最先进的视觉-语言模型(VLMs),发现它们在我们的风险评估和导航任务中表现不佳。我们还在真实世界数据集上验证了mmWalk微调模型,并展示了该数据集在推进多模态行走辅助方面的有效性。
Summary / 总结
The research aims to address the challenges of walking assistance in extreme environments for people with blindness or low vision by developing a comprehensive multi-modal dataset called mmWalk. This dataset includes 120 walking trajectories with 62k synchronized frames and over 559k panoramic images across RGB, depth, and semantic modalities. The study evaluates state-of-the-art Vision-Language Models and finds that they struggle with risk assessment and navigational tasks, highlighting the need for further development. The mmWalk-finetuned model is validated on real-world datasets, demonstrating its effectiveness in advancing multi-modal walking assistance.
研究旨在通过开发mmWalk多模态数据集来解决盲人或低视力人士在极端环境下的行走辅助问题。该数据集包含120条行走轨迹,62k同步帧和超过559k的全景图像,涵盖RGB、深度和语义模态。关键发现表明,最先进的视觉-语言模型在风险评估和导航任务上表现不佳,而mmWalk微调模型在真实世界数据集上显示出有效性。
User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios
Authors: Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue
First: 2025-10-23T16:38:26+00:00 · Latest: 2025-10-23T16:38:26+00:00
Abstract
Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs' ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users' perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users' evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs' ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users' perceived privacy and utility.
中文标题/摘要
标题:用户对大型语言模型在隐私敏感场景中隐私保护和帮助性感知的研究
大型语言模型(LLMs)在撰写邮件、总结会议和回答健康问题等任务中得到了快速采用。在这种使用中,用户可能需要分享私人信息(如健康记录、联系方式)。为了评估LLMs识别和屏蔽此类私人信息的能力,先前的工作开发了基准测试(如ConfAIde、PrivacyLens)并使用了现实生活中的场景。使用这些基准测试,研究人员发现,当LLMs应对复杂任务时,有时会泄露私人信息(如在会议总结中泄露员工薪资)。然而,这些评估依赖于LLMs(代理LLMs)来衡量其是否遵守隐私规范,忽略了真实用户的感知。此外,先前的工作主要关注LLMs响应的隐私保护质量,而没有调查其在帮助性方面的细微差异。为了了解用户如何感知LLMs在隐私敏感场景中的隐私保护质量和帮助性,我们使用来自PrivacyLens的90个场景对94名参与者进行了用户研究。我们发现,在评估同一场景下的相同响应时,用户在隐私保护质量和帮助性方面的一致性较低。进一步,我们发现五个代理LLMs之间的一致性很高,而每个单独的LLM与用户评估的相关性较低。这些结果表明,LLMs响应的隐私性和帮助性往往因人而异,代理LLMs不能很好地估计真实用户在隐私敏感场景中对这些响应的感知。我们的结果表明,需要进行以用户为中心的研究来衡量LLMs帮助用户同时保护隐私的能力。此外,未来的研究可以探索提高代理LLMs与用户之间对齐的方法,以更好地估计用户感知到的隐私和实用性。
Summary / 总结
This study evaluates user perceptions of privacy and helpfulness in LLM responses to privacy-sensitive scenarios. In a user study with 94 participants and 90 scenarios from PrivacyLens, users showed low agreement with each other on the privacy-preservation quality and helpfulness of identical LLM responses. Five proxy LLMs agreed strongly with one another, but each individual LLM correlated poorly with users' evaluations. This suggests that the perceived privacy and helpfulness of LLM responses are specific to individuals and that proxy LLMs are poor estimates of user perceptions. The study highlights the need for user-centered evaluations to measure LLMs' ability to preserve privacy and be helpful.
本研究评估了用户对隐私敏感场景中LLM响应的隐私保护质量和帮助性感知。使用94名参与者和来自PrivacyLens的90个场景,研究发现用户在隐私保护质量和帮助性上的评价存在低一致性,而代理LLM之间则表现出高一致性。这表明隐私保护质量和帮助性是主观的,代理LLM可能无法准确反映用户的感知。研究强调了进行用户中心的评估以衡量LLM在保护隐私和提供帮助方面的能力的必要性。
Optimizing Clinical Fall Risk Prediction: A Data-Driven Integration of EHR Variables with the Johns Hopkins Fall Risk Assessment Tool
Authors: Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi
First: 2025-10-23T16:31:09+00:00 · Latest: 2025-10-23T16:31:09+00:00
Comments: 19 pages, 7 figures, 4 tables
Abstract
In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models on JHFRAT assessment data and additional electronic health record (EHR) variables. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost, AUC-ROC=0.94) improves upon the performance metrics of the knowledge-based constrained logistic regression, the CSO model demonstrates more robustness to variations in risk labelling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
中文标题/摘要
标题:基于数据驱动整合EHR变量与约翰霍普金斯跌倒风险评估工具的临床跌倒风险优化预测
本研究旨在通过数据驱动建模方法更好地使约翰霍普金斯跌倒风险评估工具(JHFRAT)的跌倒风险预测与临床有意义的指标相一致。我们对2022年3月至2023年10月期间约翰霍普金斯健康系统三家医院的54,209例住院病例进行了回顾性分析。共有20,208例住院病例被纳入高跌倒风险事件,13,941例被纳入低跌倒风险事件。为了结合临床知识并保持可解释性,我们在JHFRAT评估数据和额外的电子健康记录(EHR)变量上应用了约束评分优化(CSO)模型。该模型在预测性能上显著优于当前的JHFRAT(CSO AUC-ROC=0.91,JHFRAT AUC-ROC=0.86)。约束评分优化模型在有无EHR变量的情况下表现相似。尽管基准黑盒模型(XGBoost)在知识驱动的约束逻辑回归基础上提高了性能指标(AUC-ROC=0.94),但CSO在风险标签变化时表现出更强的鲁棒性。基于证据的方法为医疗机构提供了一个坚实的基础,以系统地增强住院跌倒预防协议和患者安全,通过数据驱动的优化技术改进风险评估和资源分配。
Summary / 总结
This study aims to improve fall risk prediction by integrating EHR variables with the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) using constrained score optimization (CSO) models. In a retrospective analysis of 54,209 inpatient admissions, the CSO models improved predictive performance (AUC-ROC=0.91) over the current JHFRAT (AUC-ROC=0.86) and performed similarly with and without the EHR variables. A benchmark black-box model (XGBoost) reached a higher AUC-ROC of 0.94, but the CSO models were more robust to variations in risk labelling.
本研究旨在通过将电子健康记录(EHR)变量与约翰霍普金斯跌倒风险评估工具(JHFRAT)结合,使用约束分数优化(CSO)模型来提高跌倒风险预测的准确性。对54,209名住院患者的回顾性分析表明,CSO模型的预测性能得到了显著提升(AUC-ROC=0.91),优于原始JHFRAT(AUC-ROC=0.86)。即使不使用EHR变量,CSO模型也保持了稳健性,并且在可解释性和稳定性方面优于基准的黑盒模型(XGBoost)。
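The AUC-ROC values quoted above (0.86, 0.91, 0.94) can be read as the probability that a randomly chosen high-risk encounter receives a higher score than a randomly chosen low-risk one. The following is a minimal sketch of that pairwise definition, not the study's evaluation code; the function name `auc_roc` and the toy labels and scores are assumptions made for illustration.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = high fall risk, 0 = low fall risk.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [9.0, 4.0, 6.0, 3.0, 1.0]
toy_auc = auc_roc(toy_labels, toy_scores)
```

In the toy data, five of the six positive/negative pairs are ordered correctly, so the sketch yields 5/6. The quadratic pair loop is chosen for readability; a rank-based computation would be preferable at the scale of tens of thousands of admissions.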
CLEVER: A Curated Benchmark for Formally Verified Code Generation
Authors: Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
First: 2025-05-20T05:15:47+00:00 · Latest: 2025-10-23T16:29:07+00:00
Abstract
We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).
Summary / 总结
CLEVER is a benchmark of 161 problems for verified code generation in Lean, designed to evaluate end-to-end formal verification. Each problem includes generating a specification and a Lean implementation that satisfies the specification. Unlike previous benchmarks, CLEVER avoids test-case supervision and specifications that reveal implementation details. All outputs are verified using Lean's type checker. Evaluations of several state-of-the-art language models show that achieving full verification is challenging, making CLEVER a rigorous benchmark for program synthesis and formal reasoning.
论文介绍了CLEVER基准,包含161个问题,用于在Lean中进行端到端的验证代码生成。每个问题包括生成一个规范和一个Lean实现,该实现能够证明满足该规范。与之前的基准不同,CLEVER避免了测试案例的监督和LLM生成的注释,确保所有输出都是可机器验证的。该基准评估了几种基于最新语言模型的少量示例和代理方法,这些方法在实现完整验证方面都遇到困难,突显了其在程序合成和形式推理方面的挑战性。
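To make the specification-plus-implementation structure described above concrete, here is a toy Lean 4 sketch. It is not taken from the CLEVER benchmark; the names `doubleSpec`, `double`, and `double_correct` are hypothetical, and the problem is far simpler than a benchmark task. The specification is a proposition about any candidate implementation, and the theorem is accepted only if Lean's type checker verifies the proof.

```lean
-- Hypothetical toy specification: the implementation must return twice its input.
def doubleSpec (impl : Nat → Nat) : Prop :=
  ∀ n, impl n = 2 * n

-- Candidate implementation, written without referring to the specification.
def double (n : Nat) : Nat :=
  n + n

-- Machine-checked proof that the implementation satisfies the specification.
theorem double_correct : doubleSpec double := by
  intro n
  show n + n = 2 * n
  omega
```

The sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available.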
Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
Authors: Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, Fei-Yue Wang
Venue: NeurIPS 2025
First: 2025-04-21T17:59:02+00:00 · Latest: 2025-10-23T16:28:10+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods in only 30% of the steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
中文标题/摘要
标题:停止求和:最小形式的信用分配是过程奖励模型进行推理所需的一切
过程奖励模型(PRMs)在大型语言模型(LLMs)在具有挑战性的推理任务上的测试时扩展方面已被证明是有效的。然而,PRMs中的奖励作弊问题限制了它们在强化微调中的成功应用。在本文中,我们确定了PRM引起的奖励作弊的主要原因:强化学习(RL)中的经典求和形式的信用分配,它将价值定义为未来折扣奖励的累积和,容易导致LLMs作弊以获取高奖励。为了解决这个问题,我们提出了PURE:过程监督强化学习。PURE的关键创新是一种最小形式的信用分配,将价值函数定义为未来奖励的最小值。这种方法通过限制价值函数的范围和更合理地分配优势,显著缓解了奖励作弊。通过在3个基础模型上进行广泛的实验,我们展示了基于PRM的方法实现最小形式信用分配的推理性能与可验证奖励方法相当,仅需30%的步骤。相比之下,经典的求和形式的信用分配在训练的开始阶段就崩溃了!此外,当我们用10%的可验证奖励补充基于PRM的微调时,我们进一步缓解了奖励作弊,并在我们的实验中基于Qwen2.5-Math-7B生成了最佳微调模型,实现了AMC23 82.5%的准确率和5个基准的53.3%的平均准确率。此外,我们总结了观察到的奖励作弊案例,并分析了训练崩溃的原因。我们已在https://github.com/CJReinforce/PURE/发布了我们的代码和模型权重。
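The contrast between sum-form and min-form credit assignment described in the abstract can be illustrated with a small sketch. This is not the PURE implementation; the function names and the toy step rewards are assumptions, and the min is assumed here to range over the current and all later steps, which the abstract does not spell out.

```python
from typing import List


def sum_form_values(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical value: gamma-discounted sum of the current and future step rewards."""
    values = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        values[t] = running
    return values


def min_form_values(rewards: List[float]) -> List[float]:
    """Min-form value: minimum over the current and future step rewards (assumed)."""
    values = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(rewards[t], running)
        values[t] = running
    return values


# Toy process rewards (hypothetical): the third reasoning step is flawed.
step_rewards = [0.75, 0.5, -0.5, 0.25]
sum_values = sum_form_values(step_rewards)
min_values = min_form_values(step_rewards)

# Padding a response with extra high-reward steps inflates the sum-form value
# but leaves the min-form value within the range of a single reward.
padded = [0.75] * 4
```

With these toy rewards, the sum-form value of the first step stays high (1.0) despite the flawed third step, while the min-form value of every step up to and including the flawed one equals -0.5. For the padded sequence, the sum-form value grows with length (3.0 for four steps) whereas the min-form value remains 0.75, which matches the abstract's point about limiting the value function range.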
Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
Authors: Wenyi Xiao, Leilei Gan
First: 2025-04-25T16:11:23+00:00 · Latest: 2025-10-23T16:25:28+00:00
Abstract
When reinforcement learning, typically through GRPO, is applied to large vision-language model reasoning, it struggles to scale reasoning length effectively or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
Summary / 总结
FAST-GRPO is a variant of GRPO that dynamically adjusts reasoning depth based on question characteristics, so that large vision-language models neither under-reason on hard questions nor produce verbose outputs on easy ones. It introduces two complementary metrics to estimate question difficulty and incorporates adaptive length-based rewards and difficulty-aware KL divergence into GRPO. Experiments across seven reasoning benchmarks show that FAST achieves over 10% relative improvement in accuracy over the base model and reduces token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
FAST-GRPO 是一种改进的 GRPO 变体,通过根据问题特性动态调整推理深度来提高大型视觉-语言模型推理的可扩展性。它引入了快慢思考指标和自适应奖励来引导模型的推理过程。实验表明,FAST-GRPO 在七个基准测试中实现了更高的准确率,最高可达 10% 的改进,并且与之前的慢思考方法相比,减少了 32.7-67.3% 的标记使用量,有效平衡了推理长度和准确率。
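The idea of combining a length-based reward with GRPO's group-relative advantages can be sketched as follows. This is a loose illustration, not the FAST-GRPO formulation: the reward shape in `length_aware_reward`, the `difficulty` score in [0, 1], and the `max_tokens` budget are all hypothetical. Only the normalization of rewards by the group mean and standard deviation follows the standard GRPO recipe.

```python
from statistics import mean, pstdev
from typing import List


def length_aware_reward(correct: bool, n_tokens: int, difficulty: float,
                        max_tokens: int = 1024) -> float:
    """Hypothetical reward: accuracy minus a length penalty that fades as difficulty rises.

    difficulty is assumed to lie in [0, 1]; 0 means easy, 1 means hard.
    """
    accuracy = 1.0 if correct else 0.0
    length_ratio = min(n_tokens / max_tokens, 1.0)
    penalty = (1.0 - difficulty) * length_ratio
    return accuracy - penalty


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: rewards standardized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to one easy question (difficulty 0.0), all correct,
# with 256, 1024, and 512 tokens respectively.
easy_rewards = [
    length_aware_reward(True, 256, 0.0),
    length_aware_reward(True, 1024, 0.0),
    length_aware_reward(True, 512, 0.0),
]
easy_advantages = group_relative_advantages(easy_rewards)
```

Under these assumptions, the shortest correct answer to an easy question receives the largest advantage, while for a hard question (difficulty 1.0) the penalty vanishes and long reasoning is not discouraged.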