arXiv 论文速递

2025-12-19 03:27
Snapshot: 20251219_0327
Spatia: Video Generation with Updatable Spatial Memory
Authors: Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu
First: 2025-12-17T18:59:59+00:00 · Latest: 2025-12-17T18:59:59+00:00
Comments: Project page: https://zhaojingjing713.github.io/Spatia/
Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
中文标题/摘要
标题:Spatia:具有可更新空间记忆的视频生成
现有的视频生成模型由于视频信号的密集和高维特性,在保持长时间的空间和时间一致性方面存在困难。为克服这一限制,我们提出了一种名为Spatia的空间记忆感知视频生成框架,该框架明确地保留了一个3D场景点云作为持久的空间记忆。Spatia根据这种空间记忆迭代生成视频片段,并通过视觉SLAM不断更新它。这种动态-静态解耦设计在整个生成过程中增强了空间一致性,同时保持了模型生成逼真动态实体的能力。此外,Spatia还支持明确的摄像机控制和3D感知交互编辑,提供了一个几何上可靠的框架,用于可扩展、基于记忆的视频生成。
Summary / 总结
Spatia is a video generation framework that addresses the challenge of maintaining long-term spatial and temporal consistency by using a 3D scene point cloud as persistent spatial memory. It iteratively generates video clips based on this spatial memory and updates it through visual SLAM, enhancing spatial consistency while allowing for realistic dynamic entities. Key findings include improved spatial consistency and support for applications like explicit camera control and 3D-aware interactive editing.
Spatia 是一种视频生成框架,通过将 3D 场景点云作为持久空间记忆来解决长期空间和时间一致性问题。它基于这种空间记忆迭代生成视频片段,并通过视觉 SLAM 不断更新该记忆,从而在提高空间一致性的同时保留生成逼真动态实体的能力。主要发现包括空间一致性的提升,以及对显式摄像机控制和 3D 感知交互编辑等应用的支持。
In Pursuit of Pixel Supervision for Visual Pre-training
Authors: Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu
First: 2025-12-17T18:59:58+00:00 · Latest: 2025-12-17T18:59:58+00:00
Comments: Project page: https://github.com/facebookresearch/pixio
Abstract
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images with a self-curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
中文标题/摘要
标题:追求像素监督以实现视觉预训练
在最基础的层面上,像素是我们感知世界所依赖的视觉信息的来源。像素包含从低级属性到高级概念的所有层级的信息。自编码器是一种经典且长期存在的范式,用于从像素或其他原始输入中学习表示。在本工作中,我们证明了基于自编码器的自监督学习在今天仍然具有竞争力,并能为下游任务生成强大的表示,同时保持简单、稳定和高效。我们的模型代号为“Pixio”,是一种增强的掩码自编码器(MAE),具有更具挑战性的预训练任务和更强大的架构。该模型在20亿张网页抓取图像上进行训练,采用一种只需极少人工筛选的自我筛选策略。Pixio在真实场景中的多种下游任务上表现具有竞争力,包括单目深度估计(例如,Depth Anything)、前馈3D重建(即,MapAnything)、语义分割和机器人学习,超越或持平于在相似规模下训练的DINOv3。我们的结果表明,像素空间的自监督学习可以作为潜在空间方法的一种有前景的替代方案和补充。
Summary / 总结
This work explores the use of pixel supervision for visual pre-training, leveraging autoencoders to learn representations from raw pixel data. The model, named Pixio, is an enhanced masked autoencoder with more challenging pre-training tasks and improved architectures, trained on 2 billion web-crawled images with minimal human curation. Pixio demonstrates strong performance across various downstream tasks, including monocular depth estimation, 3D reconstruction, semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales.
该研究探索了使用像素监督进行视觉预训练的方法,利用自编码器从原始像素数据中学习表示。模型名为Pixio,是一种增强的掩码自编码器,具有更具挑战性的预训练任务和改进的架构,在20亿张网络抓取的图像上训练,仅需极少的人工筛选。Pixio在多种下游任务中表现出色,包括单目深度估计、3D重建、语义分割和机器人学习,性能优于或持平于在相似规模下训练的DINOv3。
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
First: 2025-12-17T18:59:55+00:00 · Latest: 2025-12-17T18:59:55+00:00
Comments: 11 pages, 5 figures
Abstract
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
中文标题/摘要
标题:DiffusionVL:将任何自回归模型转化为扩散视觉语言模型
在最近的多模态研究中,扩散范式因其独特的解码优势,已成为自回归范式(AR)的有前途的替代方案。然而,由于基础扩散语言模型能力的限制,扩散视觉语言模型(dVLM)的性能仍然远远落后于主流模型。这引发了一个简单而基本的问题:是否可以基于现有的强大自回归模型构建dVLM?为此,我们提出了DiffusionVL,这是一个可以从任何强大自回归模型转换而来的dVLM家族。通过简单的微调,我们成功地将自回归预训练模型适配到扩散范式中。这种方法产生了两个关键观察结果:(1)从基于自回归的多模态模型到扩散的范式转变非常有效。(2)直接将自回归语言模型转换为dVLM也是可行的,性能与LLaVA风格的视觉指令调优相当。此外,我们在dVLM中引入了一种块解码设计,支持任意长度的生成和KV缓存重用,实现了显著的推理速度提升。我们进行了大量的实验。尽管仅使用了先前方法所需数据的不到5%进行训练,DiffusionVL仍实现了全面的性能提升:在MMMU-Pro(视觉)基准上提升34.4%,在MME(认知)基准上提升37.5%,同时实现了2倍的推理速度提升。模型和代码发布在https://github.com/hustvl/DiffusionVL。
Summary / 总结
DiffusionVL translates existing powerful autoregressive models into diffusion vision language models (dVLMs) through simple fine-tuning, demonstrating that the shift from autoregressive to diffusion paradigms is highly effective. It also shows that direct conversion of AR language models to dVLMs can achieve performance comparable to LLaVA-style visual-instruction-tuning. The model introduces a block-decoding design that supports arbitrary-length generation and KV cache reuse, leading to a 2x inference speedup. Despite using less than 5% of the data, DiffusionVL outperforms previous methods with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark.
DiffusionVL 是一种可以从现有的强大自回归(AR)模型中转换而来的扩散视觉语言模型(dVLM)家族,通过简单的微调实现性能显著提升,在 MMMU-Pro(视觉)基准上获得 34.4% 的改进,在 MME(认知)基准上获得 37.5% 的改进,同时将推理速度提高一倍。引入了一种块解码设计,以支持任意长度的生成和 KV 缓存重用,进一步提高效率。
Gaussian Pixel Codec Avatars: A Hybrid Representation for Efficient Rendering
Authors: Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Tomas Simon, Forrest Iandola, Giljoo Nam
First: 2025-12-17T18:58:50+00:00 · Latest: 2025-12-17T18:58:50+00:00
Comments: Tech report
Abstract
We present Gaussian Pixel Codec Avatars (GPiCA), photorealistic head avatars that can be generated from multi-view images and efficiently rendered on mobile devices. GPiCA utilizes a unique hybrid representation that combines a triangle mesh and anisotropic 3D Gaussians. This combination maximizes memory and rendering efficiency while maintaining a photorealistic appearance. The triangle mesh is highly efficient in representing surface areas like facial skin, while the 3D Gaussians effectively handle non-surface areas such as hair and beard. To this end, we develop a unified differentiable rendering pipeline that treats the mesh as a semi-transparent layer within the volumetric rendering paradigm of 3D Gaussian Splatting. We train neural networks to decode a facial expression code into three components: a 3D face mesh, an RGBA texture, and a set of 3D Gaussians. These components are rendered simultaneously in a unified rendering engine. The networks are trained using multi-view image supervision. Our results demonstrate that GPiCA achieves the realism of purely Gaussian-based avatars while matching the rendering performance of mesh-based avatars.
中文标题/摘要
标题:高斯像素编码头像:一种用于高效渲染的混合表示
我们提出了高斯像素编码头像(GPiCA),这是一种可以从多视角图像生成的逼真头像,并且可以在移动设备上高效渲染。GPiCA 利用了一种独特的混合表示,结合了三角网格和各向异性 3D 高斯分布。这种组合最大限度地提高了内存和渲染效率,同时保持了逼真的外观。三角网格在表示面部皮肤等表面区域方面非常高效,而 3D 高斯分布则有效地处理了非表面区域,如头发和胡须。为此,我们开发了一个统一的可微渲染管道,在 3D 高斯泼溅(3D Gaussian Splatting)的体积渲染范式中将网格视为半透明层。我们训练神经网络将面部表情代码解码为三个组件:一个 3D 面部网格、一个 RGBA 纹理和一组 3D 高斯分布。这些组件在统一的渲染引擎中同时进行渲染。网络使用多视角图像监督进行训练。我们的结果表明,GPiCA 达到了纯高斯头像的真实感,同时渲染性能与基于网格的头像相当。
Summary / 总结
Gaussian Pixel Codec Avatars (GPiCA) are photorealistic head avatars generated from multi-view images and efficiently rendered on mobile devices. GPiCA uses a hybrid representation combining a triangle mesh and anisotropic 3D Gaussians to balance memory and rendering efficiency while maintaining photorealism. The triangle mesh represents facial skin, and 3D Gaussians handle non-surface areas like hair and beard. A unified differentiable rendering pipeline is developed, and neural networks are trained to decode facial expressions into a 3D face mesh, RGBA texture, and 3D Gaussians, which are rendered simultaneously. The results show that GPiCA matches the realism of Gaussian-based avatars and the rendering performance of mesh-based avatars.
Gaussian Pixel Codec Avatars (GPiCA) 是从多视角图像生成的高保真头像,可在移动设备上高效渲染。GPiCA 使用结合三角网格和各向异性 3D 高斯分布的独特混合表示,以优化内存和渲染效率同时保持高保真度。系统通过神经网络将面部表情代码解码为 3D 面部网格、RGBA 纹理和 3D 高斯分布,并在统一渲染引擎中同时渲染。实验结果表明,GPiCA 的保真度与基于高斯的头像相当,渲染性能与基于网格的头像相当。
Artism: AI-Driven Dual-Engine System for Art Generation and Critique
Authors: Shuai Liu, Yiqing Tian, Yang Chen, Mar Canet Sola
First: 2025-12-17T18:58:42+00:00 · Latest: 2025-12-17T18:58:42+00:00
Comments: 7 pages, 3 figures, 36 references, appendix with support material and 1 introduction video
Abstract
This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
中文标题/摘要
标题:Artism:由AI驱动的双引擎系统,用于艺术创作与批评
本文提出了一种双引擎AI架构方法,旨在解决艺术进化中潜在轨迹探索的复杂问题。我们介绍了两个相互连接的组件:AIDA(人工艺术家社交网络)和Ismism Machine(批评分析系统)。核心创新在于利用深度学习和多智能体协作,实现对艺术历史发展和概念创新模式的多维度模拟。框架探索了从传统单向批评向智能、互动反思实践模式的转变。目前,我们正在将此方法应用于当代艺术概念的实验研究。本研究基于AI驱动的批评循环,提出了一种新的计算艺术分析方法。
Summary / 总结
This paper introduces a dual-engine AI system called Artism, which includes AIDA (an artificial artist social network) and the Ismism Machine for critical analysis. The system uses deep learning and multi-agent collaboration to simulate art historical developments and conceptual innovation patterns, moving towards an interactive critique mode. Initial experimental studies on contemporary art concepts demonstrate the potential of this AI-driven methodology for computational analysis of art.
该论文提出了一种名为Artism的双引擎AI系统,包括AIDA(一个人工智能艺术家社交网络)和Ismism Machine(一种批判分析系统)。该系统利用深度学习和多智能体协作来模拟艺术历史发展和概念创新模式,朝着一种互动的批判模式迈进。初步的当代艺术概念实验研究展示了这种方法在艺术计算分析中的潜力。
Multi-View Foundation Models
Authors: Leo Segre, Or Hirschorn, Shai Avidan
First: 2025-12-17T18:58:03+00:00 · Latest: 2025-12-17T18:58:03+00:00
Abstract
Foundation models are vital tools in various Computer Vision applications. They take as input a single RGB image and output a deep feature representation that is useful for various applications. However, in case we have multiple views of the same 3D scene, they operate on each image independently and do not always produce consistent features for the same 3D point. We propose a way to convert a Foundation Model into a Multi-View Foundation Model. Such a model takes as input a set of images and outputs a feature map for each image such that the features of corresponding points are as consistent as possible. This approach bypasses the need to build a consistent 3D model of the features and allows direct manipulation in the image space. Specifically, we show how to augment Transformers-based foundation models (i.e., DINO, SAM, CLIP) with intermediate 3D-aware attention layers that help match features across different views. As leading examples, we show surface normal estimation and multi-view segmentation tasks. Quantitative experiments show that our method improves feature matching considerably compared to current foundation models.
中文标题/摘要
标题:多视图基础模型
基础模型是各种计算机视觉应用中的重要工具。它们以单个RGB图像为输入,输出可用于各种应用的深层特征表示。然而,如果有同一3D场景的多个视图,它们会独立处理每张图像,并不一定能为相同的3D点生成一致的特征。我们提出了一种将基础模型转换为多视图基础模型的方法。这种模型以一组图像为输入,为每张图像输出一个特征图,使得对应点的特征尽可能一致。这种方法绕过了构建一致的3D特征模型的需要,允许直接在图像空间中进行操作。具体而言,我们展示了如何通过中间的3D感知注意力层来增强基于Transformer的基础模型(例如,DINO、SAM、CLIP),以帮助在不同视图之间匹配特征。作为主要示例,我们展示了表面法线估计和多视图分割任务。定量实验表明,与当前的基础模型相比,我们的方法显著提高了特征匹配的效果。
Summary / 总结
The research aims to enhance foundation models in computer vision by making them multi-view capable. The method involves augmenting existing foundation models like DINO, SAM, and CLIP with 3D-aware attention layers to ensure consistent feature representation across multiple views of the same 3D scene. The key experimental finding is that this approach significantly improves feature matching in tasks such as surface normal estimation and multi-view segmentation compared to standard foundation models.
研究旨在通过将基础模型转换为多视图基础模型,提高同一3D场景不同视角下的特征表示一致性。方法是通过在基于Transformer的基础模型中加入中间的3D感知注意力层,来匹配不同视角下的特征。实验表明,与传统基础模型相比,在表面法线估计和多视图分割等任务中,该方法显著提高了特征匹配的效果。
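The consistency goal described above (features of corresponding points agreeing across views) can be illustrated with a small measurement sketch. This is not the authors' evaluation code: the cosine-similarity metric, the function names, and the `matches` format (pairs of indices into two per-view feature lists) are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_correspondence_consistency(feats_a, feats_b, matches):
    """Average cosine similarity over matched points in two views.

    feats_a, feats_b: per-point feature vectors for view A and view B.
    matches: (index_in_a, index_in_b) pairs assumed to be the same 3D point.
    """
    if not matches:
        raise ValueError("at least one correspondence is required")
    sims = [cosine(feats_a[i], feats_b[j]) for i, j in matches]
    return sum(sims) / len(sims)
```

A higher score means the two views assign more similar features to the same 3D point, which is the property the 3D-aware attention layers are meant to improve.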
GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
Authors: Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall
Venue: WACV 2026
First: 2025-12-17T18:56:52+00:00 · Latest: 2025-12-17T18:56:52+00:00
Comments: accepted by WACV 2026
Abstract
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
中文标题/摘要
标题:GateFusion:分层门控跨模态融合在主动说话人检测中的应用
主动说话人检测(ASD)旨在识别视频每一帧中谁在说话。大多数最先进的方法依赖于晚期融合来结合视觉和音频特征,但晚期融合往往无法捕捉到细粒度的跨模态交互,而这对于在不受约束的场景中实现稳健性能可能至关重要。本文介绍了一种名为GateFusion的新架构,该架构结合了强大的预训练单模态编码器和分层门控融合解码器(HiGate)。HiGate在Transformer骨干网的多个层中,将一种模态的上下文特征自适应地注入另一种模态,从而实现渐进式、多深度的融合,这一过程由可学习的、以双模态为条件的门控引导。为了进一步加强多模态学习,我们提出了两个辅助目标:掩码对齐损失(MAL)以使单模态输出与多模态预测对齐,以及过度正向惩罚(OPP)以抑制虚假的仅视频激活。GateFusion在多个具有挑战性的ASD基准测试中建立了新的最先进结果,分别在Ego4D-ASD、UniTalk和WASD基准测试中实现了77.8%(+9.4%)、86.1%(+2.9%)和96.1%(+0.5%)的mAP,并在AVA-ActiveSpeaker上取得了有竞争力的性能。域外实验展示了我们模型的泛化能力,而全面的消融实验表明了每个组件的互补优势。
Summary / 总结
GateFusion is a novel architecture for Active Speaker Detection that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate) to capture fine-grained cross-modal interactions. It achieves state-of-the-art results on several benchmarks, improving mAP scores by 9.4%, 2.9%, and 0.5% on Ego4D-ASD, UniTalk, and WASD, respectively. Additional auxiliary objectives enhance multimodal learning and out-of-domain experiments show its generalization capability.
GateFusion 是一种用于主动说话人检测的新架构,结合了强大的预训练单模态编码器和分层门控融合解码器(HiGate),以捕捉细粒度的跨模态交互。该模型通过可学习的、以双模态为条件的门控,在多个层上逐步融合视觉和音频特征,并引入两个辅助目标:掩码对齐损失和过度正向惩罚。它在多个基准上取得了最新的最优结果,包括 Ego4D-ASD 的 77.8% mAP、UniTalk 的 86.1% mAP 和 WASD 的 96.1% mAP。模型在域外实验中展示了泛化能力,全面的消融实验表明了每个组件的互补优势。
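The gated injection idea in the abstract can be sketched in a few lines. This is a simplified illustration, not the HiGate implementation: the scalar gate, the residual form `target + gate * context`, and the names `gated_inject`, `weights`, and `bias` are all assumptions.

```python
import math


def gated_inject(target, context, weights, bias):
    """Inject context features into target features through a learned gate.

    The gate is conditioned on both modalities: it is a sigmoid over a linear
    function of the concatenated [target; context] vector (assumed form).
    Returns the fused vector and the gate value.
    """
    joint = list(target) + list(context)
    if len(weights) != len(joint):
        raise ValueError("weights must match the concatenated feature length")
    logit = sum(w * x for w, x in zip(weights, joint)) + bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [t + gate * c for t, c in zip(target, context)]
    return fused, gate
```

Applying such a step at several depths of the backbone, once in each direction (audio into video and video into audio), would give the progressive multi-depth fusion the abstract describes; a gate near zero leaves the target modality untouched.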
Learning Model Parameter Dynamics in a Combination Therapy for Bladder Cancer from Sparse Biological Data
Authors: Kayode Olumoyin, Lamees El Naqa, Katarzyna Rejniak
Venue: NeurIPS 2025
First: 2025-12-17T18:55:49+00:00 · Latest: 2025-12-17T18:55:49+00:00
Comments: NeurIPS 2025 Workshop on Learning from Time Series for Health
Abstract
In a mathematical model of interacting biological organisms, where external interventions may alter behavior over time, traditional models that assume fixed parameters usually do not capture the evolving dynamics. In oncology, this is further exacerbated by the fact that experimental data are often sparse and sometimes are composed of a few time points of tumor volume. In this paper, we propose to learn time-varying interactions between cells, such as those of bladder cancer tumors and immune cells, and their response to a combination of anticancer treatments in a limited data scenario. We employ the physics-informed neural network (PINN) approach to predict possible subpopulation trajectories at time points where no observed data are available. We demonstrate that our approach is consistent with the biological explanation of subpopulation trajectories. Our method provides a framework for learning evolving interactions among biological organisms when external interventions are applied to their environment.
中文标题/摘要
标题:从稀疏生物数据中学习膀胱癌联合治疗的模型参数动力学
在描述相互作用生物有机体的数学模型中,外部干预可能会随时间改变其行为,而传统假设参数固定的模型通常无法捕捉到这种演变的动力学。在肿瘤学中,由于实验数据往往稀疏,有时仅由几个时间点的肿瘤体积组成,这一问题更为严重。在本文中,我们提出了一种在数据有限的情况下学习细胞之间随时间变化的相互作用,例如膀胱癌肿瘤和免疫细胞之间的相互作用,以及它们对联合抗癌治疗的响应。我们采用物理信息神经网络(PINN)方法预测在没有观测数据的时间点可能的亚群轨迹。我们证明了我们的方法与亚群轨迹的生物解释一致。我们的方法为在外部干预作用于其环境时学习生物有机体之间演变的相互作用提供了一个框架。
Summary / 总结
This paper addresses the challenge of modeling the dynamics of bladder cancer tumor cells and immune cells in response to a combination of anticancer treatments using sparse biological data. The authors propose using a physics-informed neural network (PINN) to learn the time-varying interactions between these cell populations and to predict subpopulation trajectories at time points without observations. The predicted trajectories are consistent with the biological explanation, and the method provides a framework for learning how external interventions change interactions in biological systems.
本文旨在利用稀疏生物数据,通过物理信息神经网络(PINN)学习膀胱癌细胞与免疫细胞之间随时间变化的相互作用,并预测无观测时间点上的亚群轨迹。该方法与生物学解释一致,并为在数据有限的情况下学习外部干预下不断演变的相互作用提供了框架。
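The PINN idea of fitting sparse observations while penalizing violations of a governing equation can be sketched as a composite loss. Everything specific here is an assumption for illustration: the logistic-growth equation with a time-varying rate `r(t)`, the finite-difference derivative (a real PINN would differentiate the network with automatic differentiation), and the function and parameter names.

```python
def pinn_style_loss(times, volume, rate, observations, capacity=1.0, weight=1.0):
    """Data misfit plus ODE residual for dV/dt = r(t) * V * (1 - V / capacity).

    times: collocation time points (strictly increasing).
    volume: predicted V at each time point.
    rate: predicted time-varying rate r at each time point.
    observations: dict mapping an index into `times` to a measured volume.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    data_loss = sum(
        (volume[i] - obs) ** 2 for i, obs in observations.items()
    ) / len(observations)

    residuals = []
    for i in range(len(times) - 1):
        # Forward difference stands in for the derivative of the network output.
        dv_dt = (volume[i + 1] - volume[i]) / (times[i + 1] - times[i])
        rhs = rate[i] * volume[i] * (1.0 - volume[i] / capacity)
        residuals.append((dv_dt - rhs) ** 2)
    physics_loss = sum(residuals) / len(residuals)

    return data_loss + weight * physics_loss
```

The physics term is what lets the model constrain trajectories between the few measured time points: the data term only acts where `observations` has entries, while the residual is evaluated at every collocation point.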
Dynamic Rebatching for Efficient Early-Exit Inference with DREX
Authors: Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu
First: 2025-12-17T18:55:45+00:00 · Latest: 2025-12-17T18:55:45+00:00
Abstract
Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE LLMs, as not all requests in a batch may be ready to exit at the same time. Existing solutions either force a uniform decision on the batch, which overlooks EE opportunities, or degrade output quality by forcing premature exits. We propose Dynamic Rebatching, a solution where we dynamically reorganize the batch at each early-exit point. Requests that meet the exit criteria are immediately processed, while those that continue are held in a buffer, re-grouped into a new batch, and forwarded to deeper layers. We introduce DREX, an early-exit inference system that implements Dynamic Rebatching with two key optimizations: 1) a copy-free rebatching buffer that avoids physical data movement, and 2) an EE and SLA-aware scheduler that analytically predicts whether a given rebatching operation will be profitable. DREX also efficiently handles the missing KV cache from skipped layers using memory-efficient state-copying. Our evaluation shows that DREX improves throughput by 2-12% compared to baseline approaches while maintaining output quality. Crucially, DREX completely eliminates involuntary exits, providing a key guarantee for preserving the output quality intended by the EE model.
中文标题/摘要
标题:使用DREX进行动态重新分批以实现高效的早期退出推理
早期退出(EE)是一种大型语言模型(LLM)架构,通过允许仅使用模型的部分层来生成较容易的令牌,从而加速推理。然而,传统的批处理框架不适合EE LLM,因为批次中的请求可能不会在同一时间准备好退出。现有解决方案要么对批次做出统一决策,忽视了EE机会,要么通过强制提前退出降低输出质量。我们提出了动态重新分批这一解决方案,在每个早期退出点动态重新组织批次。满足退出条件的请求立即处理,而需要继续的请求被保留在缓冲区中,重新分组为新的批次,并转发到更深的层。我们引入了DREX,这是一种实现动态重新分批的早期退出推理系统,具有两个关键优化:1)无复制的重新分批缓冲区,避免物理数据移动;2)一种感知EE和SLA的调度器,通过解析方式预测给定的重新分批操作是否有收益。DREX还利用内存高效的状态复制,高效处理被跳过的层所缺失的KV缓存。我们的评估表明,与基线方法相比,DREX将吞吐量提高了2-12%,同时保持了输出质量。最关键的是,DREX完全消除了非自愿退出,为保持EE模型预期的输出质量提供了关键保证。
Summary / 总结
The research addresses the inefficiency of traditional batching frameworks for Early-Exit (EE) Large Language Models (LLMs), which can lead to either missed EE opportunities or degraded output quality. The proposed Dynamic Rebatching solution dynamically reorganizes batches at each early-exit point, allowing requests meeting the exit criteria to be processed immediately while others are buffered and re-grouped. DREX, the implemented system, includes optimizations such as a copy-free rebatching buffer and an EE and SLA-aware scheduler. Evaluation shows DREX improves throughput by 2-12% while maintaining output quality and completely eliminating involuntary exits.
研究针对大型语言模型早期退出推理中的效率问题,提出了动态重新分批方法,该方法在每个早期退出点动态重新组织批次。该方法允许满足退出条件的请求立即处理,而其他请求则被缓冲并重新分组。系统DREX包括无复制的重新分批缓冲区和感知早期退出与SLA的调度器等优化。评估显示,DREX可以将吞吐量提高2-12%,同时保持输出质量,并完全消除非自愿退出。
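The rebatching step described above can be sketched as follows. This is a minimal illustration rather than DREX itself: the confidence-threshold exit criterion, the fixed batch size, and the names `split_at_exit` and `RebatchBuffer` are assumptions, and the profitability-predicting scheduler and KV-cache handling are not modeled. The buffer stores request ids only, loosely mirroring the copy-free idea of regrouping without moving tensor data.

```python
from dataclasses import dataclass, field


def split_at_exit(request_ids, confidence, threshold):
    """Partition a batch at an early-exit point.

    Requests whose confidence meets the threshold exit now; the rest continue.
    No request is forced to exit, so there are no involuntary exits.
    """
    exiting = [r for r in request_ids if confidence[r] >= threshold]
    continuing = [r for r in request_ids if confidence[r] < threshold]
    return exiting, continuing


@dataclass
class RebatchBuffer:
    """Holds ids of requests waiting to be regrouped for deeper layers."""

    batch_size: int
    pending: list = field(default_factory=list)

    def add(self, request_ids):
        self.pending.extend(request_ids)

    def pop_batch(self, force=False):
        """Return a full batch, or a partial one if `force` is set."""
        if len(self.pending) >= self.batch_size or (force and self.pending):
            batch = self.pending[: self.batch_size]
            self.pending = self.pending[self.batch_size:]
            return batch
        return []
```

In a serving loop, `pop_batch(force=True)` would be the hook where a latency-aware scheduler decides that waiting for a full batch is no longer worthwhile.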
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling
Authors: Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin
First: 2025-12-17T18:53:29+00:00 · Latest: 2025-12-17T18:53:29+00:00
Comments: Project Page: https://guoyww.github.io/projects/resampling-forcing/
Abstract
Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
中文标题/摘要
标题:通过自重采样实现自回归视频扩散的端到端训练
自回归视频扩散模型在世界模拟方面具有潜力,但容易受到训练-测试不匹配导致的曝光偏差的影响。尽管最近的工作通过后训练解决了这一问题,但它们通常依赖双向教师模型或在线判别器。为了实现端到端的解决方案,我们引入了重采样强制(Resampling Forcing),这是一种无需教师模型的框架,能够从头开始并大规模训练自回归视频模型。我们方法的核心是自重采样方案,它在训练过程中模拟历史帧上的推理时模型误差。以这些退化的历史帧为条件,稀疏因果掩码强制时间因果性,同时允许使用帧级扩散损失进行并行训练。为了促进高效的长时序生成,我们还引入了历史路由,这是一种无需参数的机制,能够为每个查询动态检索最相关的前k个历史帧。实验表明,我们的方法在性能上与基于蒸馏的基线相当,同时得益于原生长度的训练,在更长的视频上表现出更优的时间一致性。
Summary / 总结
The research aims to address the exposure bias in autoregressive video diffusion models by introducing Resampling Forcing, a teacher-free framework. This method uses a self-resampling scheme to simulate inference-time errors during training and a sparse causal mask to enforce temporal causality. The approach also includes history routing to facilitate efficient long-horizon generation. Experiments show that this method achieves performance comparable to distillation-based baselines and exhibits better temporal consistency on longer videos due to native-length training.
研究旨在通过引入无教师的Resampling Forcing框架解决自回归视频扩散模型中的曝光偏差问题。该方法使用自重采样方案在训练过程中模拟推理时的错误,并使用稀疏因果掩码来强制执行时间因果性,还引入历史路由以支持高效的长时序生成。实验表明,这种方法的性能与基于蒸馏的基线相当,并且得益于原生长度的训练,在长视频上具有更好的时间一致性。
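History routing, as summarized above, retrieves the top-k most relevant history frames for each query without adding parameters. A minimal sketch follows; the dot-product relevance score, the one-vector-per-frame descriptors, and the name `route_history` are assumptions, since the digest does not specify how relevance is computed.

```python
def route_history(query, history, k):
    """Return indices of the k history frames most similar to the query.

    query: descriptor vector for the current frame.
    history: list of descriptor vectors, one per past frame.
    Indices are returned in temporal order so causality is easy to preserve.
    """
    if k <= 0:
        return []
    scores = [sum(q * h for q, h in zip(query, frame)) for frame in history]
    ranked = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Because only k frames are attended to regardless of how long the history grows, the cost per generated frame stays bounded, which is what makes long-horizon generation tractable.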
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Authors: Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang
First: 2025-12-17T18:52:55+00:00 · Latest: 2025-12-17T18:52:55+00:00
Comments: 14 pages, 8 figures
Abstract
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
中文标题/摘要
标题:VLIC:视觉-语言模型作为人类对齐的图像压缩感知裁判
纳入人类偏好的图像压缩性能评估通常发现,诸如均方误差(MSE)之类的简单失真函数与人类感知的对齐程度不足。为了使压缩模型与人类感知对齐,先前的工作使用了可微分感知损失,这些损失由在大规模人类心理视觉判断数据集上校准的神经网络构成。我们发现,令人惊讶的是,最先进的视觉-语言模型(VLMs)在被要求对成对图像之间的差异进行推理时,可以零样本地复现人类的二选一强迫选择(2AFC)判断。为了利用VLMs强大的零样本视觉推理能力,我们提出了视觉-语言模型图像压缩系统(VLIC),这是一种基于扩散的图像压缩系统,旨在利用二元VLM判断进行后训练。VLIC 利用现有的基于偏好的扩散模型后训练技术,而不是将VLM判断蒸馏为一个单独的感知损失网络。我们展示了在VLM判断上校准该系统后,根据感知指标和大规模用户研究,其在人类对齐的视觉压缩上可取得有竞争力或最先进的性能,具体取决于数据集。我们还对基于VLM的奖励设计和训练过程进行了广泛分析,并分享了重要的见解。更多视觉内容可在 https://kylesargent.github.io/vlic 获取
Summary / 总结
The research aims to improve image compression by aligning it with human perception, which traditional distortion metrics fail to do effectively. The method involves using state-of-the-art vision-language models (VLMs) to replicate human judgments on image pairs, and a new system, VLIC, is proposed to leverage these models for post-training in image compression. The key finding is that VLIC, calibrated on VLM judgments, achieves competitive or state-of-the-art performance in human-aligned visual compression, as evaluated by perceptual metrics and user studies.
该论文旨在通过利用视觉语言模型(VLMs)的零样本视觉推理能力,改进图像压缩模型与人类感知的对齐。提出的Vision-Language Models for Image Compression (VLIC) 系统是一种基于扩散的压缩方法,通过二元VLM判断进行后训练。实验结果表明,在VLM判断上校准VLIC后,根据感知指标和用户研究,其在人类对齐的视觉压缩方面达到了有竞争力或最先进的水平。
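The two uses of binary 2AFC judgments mentioned above (checking agreement with humans, then using the judgments as preferences for post-training) can be sketched with plain data structures. The encoding of a choice as 0 or 1, the function names, and the (preferred, rejected) pair format are assumptions for illustration; no VLM or diffusion model is called here.

```python
def two_afc_agreement(human_choices, model_choices):
    """Fraction of trials where the model picks the same image as the human.

    Each choice is 0 or 1, naming which of the two candidates was preferred.
    """
    if len(human_choices) != len(model_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        raise ValueError("at least one trial is required")
    matches = sum(h == m for h, m in zip(human_choices, model_choices))
    return matches / len(human_choices)


def to_preference_pairs(candidates, choices):
    """Turn binary judgments into (preferred, rejected) training pairs.

    candidates: list of (image_a, image_b) identifiers.
    choices: 0 if image_a was preferred, 1 if image_b was preferred.
    """
    pairs = []
    for (a, b), choice in zip(candidates, choices):
        pairs.append((a, b) if choice == 0 else (b, a))
    return pairs
```

Pairs in this form are the typical input to preference-based post-training objectives, which is how the judgments could steer the compression model without a separate perceptual loss network.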
FrontierCS: Evolving Challenges for Evolving Intelligence
Authors: Qiuyang Mang, Wenhao Chai, Zhifei Li, Huanzhi Mao, Shang Zhou, Alexander Du, Hanchen Li, Shu Liu, Edwin Chen, Yichuan Wang, Xieting Chu, Zerui Cheng, Yuan Xu, Tian Xia, Zirui Wang, Tianneng Shi, Jianzhu Yao, Yilong Zhao, Qizheng Zhang, Charlie Ruan, Zeyu Shen, Kaiyuan Liu, Runyuan He, Dong Xing, Zerui Li, Zirong Zeng, Yige Jiang, Lufeng Cheng, Ziyi Zhao, Youran Sun, Wesley Zheng, Meiyuwang Zhang, Ruyi Ji, Xuechang Tu, Zihan Zheng, Zexing Chen, Kangyang Zhou, Zhaozi Wang, Jingbang Chen, Aleksandra Korolova, Peter Henderson, Pramod Viswanath, Vijay Ganesh, Saining Xie, Zhuang Liu, Dawn Song, Sewon Min, Ion Stoica, Joseph E. Gonzalez, Jingbo Shang, Alvin Cheung
First: 2025-12-17T18:52:45+00:00 · Latest: 2025-12-17T18:52:45+00:00
Comments: Code with instruction: https://github.com/FrontierCS/Frontier-CS
Abstract
We introduce FrontierCS, a benchmark of 156 open-ended problems across diverse areas of computer science, designed and reviewed by experts, including CS PhDs and top-tier competitive programming participants and problem setters. Unlike existing benchmarks that focus on tasks with known optimal solutions, FrontierCS targets problems where the optimal solution is unknown, but the quality of a solution can be objectively evaluated. Models solve these tasks by implementing executable programs rather than outputting a direct answer. FrontierCS includes algorithmic problems, which are often NP-hard variants of competitive programming problems with objective partial scoring, and research problems with the same property. For each problem we provide an expert reference solution and an automatic evaluator. Combining open-ended design, measurable progress, and expert curation, FrontierCS provides a benchmark at the frontier of computer-science difficulty. Empirically, we find that frontier reasoning models still lag far behind human experts on both the algorithmic and research tracks, that increasing reasoning budgets alone does not close this gap, and that models often over-optimize for generating merely workable code instead of discovering high-quality algorithms and system designs.
中文标题/摘要
标题:FrontierCS:为不断演进的智能设计不断演进的挑战
我们介绍了FrontierCS,这是一个涵盖计算机科学多个领域的156个开放性问题的基准测试,由专家设计和审核,包括CS博士和顶级编程竞赛参与者及出题者。与专注于具有已知最优解任务的现有基准不同,FrontierCS针对的是最优解未知但解决方案质量可以客观评估的问题。模型通过实现可执行程序而不是直接输出答案来解决这些任务。FrontierCS包括算法问题,通常是具有客观部分评分的编程竞赛问题的NP难变体,以及具有相同属性的研究问题。对于每个问题,我们提供了专家参考解决方案和自动评估器。结合开放性设计、可衡量的进步和专家审核,FrontierCS提供了一个计算机科学难度前沿的基准测试。实证研究发现,前沿推理模型在算法和研究轨道上仍然远远落后于人类专家,单纯增加推理预算并不能缩小这一差距,而且模型往往过度优化生成可工作的代码,而不是发现高质量的算法和系统设计。
Summary / 总结
FrontierCS is a benchmark of 156 open-ended problems in computer science, curated by experts. Unlike existing benchmarks, FrontierCS targets problems without known optimal solutions but with objectively evaluable quality. Models solve these tasks by implementing executable programs. The benchmark includes algorithmic and research problems, with expert reference solutions and automatic evaluators. Empirical results show that current reasoning models perform poorly compared to human experts, and increasing reasoning budgets does not significantly improve performance. Models often prioritize generating workable code over discovering high-quality algorithms and designs.
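FrontierCS is an expert-designed benchmark of 156 open-ended problems that evaluates how well models solve problems whose optimal solution is unknown. Models must implement executable programs to solve them. The benchmark covers algorithmic and research problems, each accompanied by an expert reference solution and an automatic evaluator. Experiments show that models fall far short of human experts and that increasing the reasoning budget does not significantly improve performance. Models tend to produce barely workable code rather than high-quality algorithms and system designs.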
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Authors: Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu
First: 2025-12-17T18:48:26+00:00 · Latest: 2025-12-17T18:48:26+00:00
Comments: Project Page: https://github.com/JoeLeelyf/Skyra
Abstract
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
中文标题/摘要
标题:Skyra:基于接地 artifacts 推理的AI生成视频检测
AI驱动的视频生成技术的滥用引发了严重的社会关注,突显了可靠AI生成视频检测器的迫切需求。然而,大多数现有方法仅限于二元分类,缺乏供人类解释的必要说明。在本文中,我们提出了Skyra,这是一种专门的多模态大型语言模型(MLLM),用于识别AI生成视频中的人类可感知的视觉artifacts,并利用它们作为检测和解释的接地证据。为了支持这一目标,我们构建了ViF-CoT-4K用于监督微调(SFT),这是第一个具有精细粒度人类注释的大规模AI生成视频artifacts数据集。然后,我们开发了一种两阶段训练策略,系统地增强了模型的空间-时间artifacts感知、解释能力和检测准确性。为了全面评估Skyra,我们引入了ViF-Bench基准,包含由十多个最先进的视频生成器生成的3000个高质量样本。广泛的实验表明,Skyra在多个基准上超越了现有方法,而我们的评估为推进可解释的AI生成视频检测提供了宝贵的见解。
Summary / 总结
The paper addresses the need for reliable detectors for AI-generated videos, which is motivated by the misuse of such technologies. It introduces Skyra, a multimodal large language model that detects visual artifacts in AI-generated videos and provides explanations. Skyra is trained on a new dataset, ViF-CoT-4K, and uses a two-stage training strategy to improve its artifact perception and detection accuracy. Experiments show that Skyra outperforms existing methods on multiple benchmarks and provides valuable insights for explainable AI-generated video detection.
论文介绍了Skyra,这是一种多模态大型语言模型,用于检测AI生成视频中的视觉 artifacts 并提供解释。它使用大规模数据集 ViF-CoT-4K 进行监督微调,并采用两阶段训练策略以提高 artifacts 检测和解释能力。实验表明,Skyra 在多个基准测试中优于现有方法,提供了可解释的AI生成视频检测方面的有价值见解。
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Authors: Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, Elvis Nava
First: 2025-12-17T18:47:31+00:00 · Latest: 2025-12-17T18:47:31+00:00
Abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
中文标题/摘要
标题:mimic-video: 超越VLAs的通用机器人控制的视频动作模型
当前用于机器人操作的视觉-语言-动作模型(VLAs)是基于大规模但不连贯的静态网页数据预训练的视觉-语言骨干构建的。因此,尽管在语义泛化方面有所提高,策略仍需从机器人轨迹中隐式推断复杂的物理动力学和时间依赖性。这种依赖性导致了不可持续的数据负担,需要持续、大规模地收集专家数据来弥补缺乏的物理理解。我们认为,虽然视觉-语言预训练有效地捕捉了语义先验,但它对物理因果关系仍然视而不见。一种更有效的范式是利用视频在预训练期间同时捕捉语义和视觉动力学,从而将剩余的任务隔离为低级控制。为此,我们引入了\model,这是一种新颖的视频动作模型(VAM),它将一个大规模互联网视频模型与基于其潜在表示的动作解码器配对,该解码器基于流动匹配进行条件化。解码器作为逆动力学模型(IDM),从视频空间动作计划的潜在表示生成低级机器人动作。我们的广泛评估表明,我们的方法在模拟和真实世界机器人操作任务上达到了最先进的性能,与传统的VLA架构相比,样本效率提高了10倍,收敛速度提高了2倍。
Summary / 总结
The research aims to address the limitations of Vision-Language-Action Models (VLAs) for robotic manipulation by leveraging video to capture both semantics and visual dynamics during pretraining. The method introduces a Video-Action Model (VAM) that pairs a pretrained video model with an action decoder conditioned on latent representations. The decoder acts as an Inverse Dynamics Model (IDM), generating low-level robot actions. Experimental results demonstrate that this approach outperforms traditional VLA architectures, achieving state-of-the-art performance and improving sample efficiency by 10x and convergence speed by 2x.
研究针对视觉-语言-动作模型(VLAs)在机器人操作中的局限性,这些模型依赖于缺乏物理理解的预训练视觉-语言模型。为克服这一问题,研究引入了一种视频-动作模型(VAM),该模型结合了预训练的视频模型和一个基于视频空间动作计划的潜在表示的动作解码器。实验表明,VAM在模拟和真实世界机器人操作任务中表现优于传统VLAs架构,实现了10倍更好的样本效率和2倍更快的收敛速度。
Can LLMs Guide Their Own Exploration? Gradient-Guided Reinforcement Learning for LLM Reasoning
Authors: Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, Dong Yu
First: 2025-12-17T18:44:45+00:00 · Latest: 2025-12-17T18:44:45+00:00
Abstract
Reinforcement learning has become essential for strengthening the reasoning abilities of large language models, yet current exploration mechanisms remain fundamentally misaligned with how these models actually learn. Entropy bonuses and external semantic comparators encourage surface-level variation but offer no guarantee that sampled trajectories differ in the update directions that shape optimization. We propose G2RL, a gradient-guided reinforcement learning framework in which exploration is driven not by external heuristics but by the model's own first-order update geometry. For each response, G2RL constructs a sequence-level feature from the model's final-layer sensitivity, obtainable at negligible cost from a standard forward pass, and measures how each trajectory would reshape the policy by comparing these features within a sampled group. Trajectories that introduce novel gradient directions receive a bounded multiplicative reward scaler, while redundant or off-manifold updates are deemphasized, yielding a self-referential exploration signal that is naturally aligned with PPO-style stability and KL control. Across math and general reasoning benchmarks (MATH500, AMC, AIME24, AIME25, GPQA, MMLUpro) on Qwen3 base 1.7B and 4B models, G2RL consistently improves pass@1, maj@16, and pass@k over entropy-based GRPO and external embedding methods. Analyzing the induced geometry, we find that G2RL expands exploration into substantially more orthogonal and often opposing gradient directions while maintaining semantic coherence, revealing that a policy's own update space provides a far more faithful and effective basis for guiding exploration in large language model reinforcement learning.
中文标题/摘要
标题:大型语言模型能否引导自身的探索?基于梯度引导的强化学习框架
强化学习已成为增强大型语言模型推理能力的关键工具,但当前的探索机制与这些模型实际学习的方式存在根本上的不一致。熵奖励和外部语义比较器鼓励表面层次的变化,但并不能保证采样的轨迹在影响优化的方向上有所不同。我们提出了一种名为G2RL的梯度引导强化学习框架,在这种框架中,探索不是由外部启发式方法驱动,而是由模型自身的梯度更新几何驱动。对于每个响应,G2RL从模型最后一层的敏感性中构建一个序列级特征,这种特征可以从标准前向传播中以几乎不增加成本的方式获得,并通过比较这些特征来衡量每个轨迹如何重塑策略。引入新颖梯度方向的轨迹会获得一个有界乘法奖励调节器,而冗余或偏离流形的更新则会被淡化,从而产生一种自我参照的探索信号,这种信号自然与PPO风格的稳定性和KL控制相一致。在数学和一般推理基准测试(MATH500、AMC、AIME24、AIME25、GPQA、MMLUpro)上,G2RL在Qwen3基础1.7B和4B模型上的一致性改进了pass@1、maj@16和pass@k,优于基于熵的GRPO和外部嵌入方法。通过对诱导几何的分析,我们发现G2RL将探索扩展到了更多正交且往往对立的梯度方向,同时保持了语义连贯性,表明策略自身的更新空间为大型语言模型强化学习中的探索引导提供了更为忠实和有效的基础。
Summary / 总结
The paper introduces G2RL, a gradient-guided reinforcement learning framework that drives exploration based on the model's own first-order update geometry, rather than external heuristics. This method constructs a sequence-level feature from the model's final layer sensitivity and measures how each trajectory would reshape the policy. Across various reasoning benchmarks, G2RL improves performance metrics such as pass@1, maj@16, and pass@k compared to entropy-based methods and external embedding techniques. The induced geometry shows that G2RL explores more orthogonal and opposing gradient directions while maintaining semantic coherence.
论文提出了G2RL,一种基于模型自身一阶更新几何的梯度引导强化学习框架,而不是外部启发式方法。该方法从模型最终层的敏感性构建序列级特征,并衡量每个轨迹如何重塑策略。G2RL在数学和一般推理基准测试中提高了pass@1、maj@16和pass@k指标,优于基于熵的方法和外部嵌入方法。它将探索扩展到更多的正交和对立的梯度方向,同时保持语义连贯性。
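The G2RL abstract describes comparing per-response features within a sampled group and applying a bounded multiplicative reward scaler. The sketch below is one hypothetical way to do that, not the paper's formula: the names `cosine` and `reward_scalers`, the use of mean cosine similarity as the redundancy measure, and the `strength`, `low`, and `high` parameters are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def reward_scalers(features, strength=0.5, low=0.5, high=1.5):
    """Return one multiplicative scaler per response in a group (needs >= 2 responses).

    Assumption: novelty is 1 minus the mean cosine similarity to the other
    responses, so 0 means fully redundant, 1 orthogonal, 2 opposing.
    The scaler is clipped to [low, high] so it stays bounded.
    """
    scalers = []
    for i, f in enumerate(features):
        sims = [cosine(f, g) for j, g in enumerate(features) if j != i]
        novelty = 1.0 - sum(sims) / len(sims)
        scale = 1.0 + strength * (novelty - 1.0)
        scalers.append(min(high, max(low, scale)))
    return scalers

# Two responses share a direction; the third is orthogonal to both.
group = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scalers = reward_scalers(group)
```

In this toy group the two redundant responses are down-weighted relative to the orthogonal one, which is the qualitative behavior the abstract describes; the actual feature construction from final-layer sensitivity is not reproduced here.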
MMGR: Multi-Modal Generative Reasoning
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu
First: 2025-12-16T18:58:04+00:00 · Latest: 2025-12-17T18:42:37+00:00
Comments: work in progress
Abstract
Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Frechet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
中文标题/摘要
标题:MMGR:多模态生成推理
视频基础模型生成视觉上逼真且时间上连贯的内容,但它们作为世界模拟器的可靠性取决于是否捕捉了物理、逻辑和空间约束。现有指标如弗雷切视频距离(FVD)强调感知质量,而忽视了推理失败,包括因果关系、物理规律和全局一致性方面的违反。我们引入了MMGR(多模态生成推理评估与基准),一个基于五种推理能力的原理性评估框架:物理、逻辑、三维空间、二维空间和时间。MMGR在抽象推理(ARC-AGI、数独)、体感导航(现实世界3D导航和定位)和物理常识(体育和组合交互)三个领域评估生成推理。MMGR应用细粒度指标,要求视频和图像生成的整体正确性。我们对领先视频模型(Veo-3、Sora-2、Wan-2.2)和图像模型(Nano-banana、Nano-banana Pro、GPT-4o-image、Qwen-image)进行了基准测试,揭示了不同领域的性能差距。模型在物理常识任务上表现出适度的成功,但在抽象推理(ARC-AGI准确率低于10%)和体感设置中的长期空间规划方面表现不佳。我们的分析突显了当前模型的关键局限性,包括过度依赖感知数据、全局状态一致性较弱以及奖励视觉合理性而非因果正确性的目标。MMGR提供了一个统一的诊断基准,并为推理感知生成世界模型指明了方向。
Summary / 总结
MMGR is a new evaluation framework for video and image generation models built around five reasoning abilities: physical, logical, 3D spatial, 2D spatial, and temporal. It evaluates models across abstract reasoning, embodied navigation, and physical commonsense tasks using fine-grained metrics that require holistic correctness. Key findings show strong performance gaps across domains: leading models achieve moderate success on physical commonsense tasks but perform poorly on abstract reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning. This highlights the need for models to better capture causal correctness and global state consistency.
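MMGR is a new evaluation framework that assesses the reasoning abilities of video and image generation models in the physical, logical, and spatial domains. It uses fine-grained metrics that require holistic correctness to evaluate models on abstract reasoning, embodied navigation, and physical commonsense tasks. The study finds that leading models achieve only moderate success on physical commonsense tasks and perform poorly on abstract reasoning and long-horizon spatial planning, indicating that models need to improve in reasoning and causal correctness rather than visual plausibility alone.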
A Multivariate Statistical Framework for Detection, Classification and Pre-localization of Anomalies in Water Distribution Networks
Authors: Oleg Melnikov, Yurii Dorofieiev, Yurii Shakhnovskiy, Huy Truong, Victoria Degeler
First: 2025-12-17T18:38:37+00:00 · Latest: 2025-12-17T18:38:37+00:00
Comments: 48 pages, 18 figures, 3 tables
Abstract
This paper presents a unified framework for the detection, classification, and preliminary localization of anomalies in water distribution networks using multivariate statistical analysis. The approach, termed SICAMS (Statistical Identification and Classification of Anomalies in Mahalanobis Space), processes heterogeneous pressure and flow sensor data through a whitening transformation to eliminate spatial correlations among measurements. Based on the transformed data, Hotelling's $T^2$ statistic is constructed, enabling the formulation of anomaly detection as a statistical hypothesis test of network conformity to normal operating conditions. It is shown that Hotelling's $T^2$ statistic can serve as an integral indicator of the overall "health" of the system, exhibiting correlation with total leakage volume, and thereby enabling approximate estimation of water losses via a regression model. A heuristic algorithm is developed to analyze the $T^2$ time series and classify detected anomalies into abrupt leaks, incipient leaks, and sensor malfunctions. Furthermore, a coarse leak localization method is proposed, which ranks sensors according to their statistical contribution and employs Laplacian interpolation to approximate the affected region within the network. Application of the proposed framework to the BattLeDIM L-Town benchmark dataset demonstrates high sensitivity and reliability in leak detection, maintaining robust performance even under multiple leaks. These capabilities make the method applicable to real-world operational environments without the need for a calibrated hydraulic model.
中文标题/摘要
标题:水分配网络中异常检测、分类和预定位的多元统计框架
本文提出了一种统一框架,利用多元统计分析来检测、分类和初步定位水分配网络中的异常。该方法称为SICAMS(Mahalanobis空间中异常的统计识别和分类),通过白化变换处理异质的压力和流量传感器数据,以消除测量间的空间相关性。基于变换后的数据,构建了Hotelling's $T^2$统计量,使异常检测可以转化为网络是否符合正常运行条件的统计假设检验。研究表明,Hotelling's $T^2$统计量可以作为系统整体“健康状况”的综合指标,与总泄漏量相关,从而可以通过回归模型近似估计水损失。开发了一种启发式算法来分析$T^2$时间序列,并将检测到的异常分类为突然泄漏、早期泄漏和传感器故障。此外,提出了一种粗略的泄漏定位方法,根据统计贡献对传感器进行排序,并使用拉普拉斯插值来近似网络中受影响的区域。将所提出框架应用于BattLeDIM L-Town基准数据集,显示出在多泄漏情况下对泄漏检测的高度敏感性和可靠性。这些功能使该方法适用于无需校准水力模型的实际运行环境。
Summary / 总结
The paper introduces SICAMS, a framework for detecting, classifying, and pre-localizing anomalies in water distribution networks using multivariate statistical analysis. After whitening pressure and flow sensor data, SICAMS constructs Hotelling's $T^2$ statistic to identify deviations from normal operating conditions; the statistic also correlates with total leakage volume. A heuristic algorithm classifies anomalies into abrupt leaks, incipient leaks, and sensor malfunctions, and a coarse localization technique ranks sensors by statistical contribution and applies Laplacian interpolation. Experiments on the BattLeDIM L-Town dataset show high sensitivity and reliability in leak detection, even under multiple leaks, making the method suitable for real-world use without a calibrated hydraulic model.
论文提出了一种名为SICAMS的多元统计框架,用于检测、分类和初步定位水分布网络中的异常。该方法通过白化变换和Hotelling's $T^2$ 统计量识别偏离正常条件的情况,并将这些情况与水损失相关联。一个启发式算法将异常分类为泄漏和传感器故障,而定位方法则根据统计贡献对传感器进行排名,以近似受影响的区域。该方法在存在多个泄漏的情况下仍能表现出高度的灵敏性和可靠性,适用于无需校准水力模型的实际运营环境。
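The whitening step and Hotelling's $T^2$ statistic described in the SICAMS abstract can be illustrated with a small standard-library sketch. It shows only the generic textbook construction (whiten with the Cholesky factor of the baseline covariance, then take the squared norm); the function names and the toy two-sensor baseline are assumptions, and the paper's thresholding, classification heuristic, and localization steps are not reproduced.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length readings."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance(rows, mu):
    """Sample covariance matrix (divides by n - 1)."""
    n, d = len(rows), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        c = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    return cov

def cholesky(a):
    """Lower-triangular factor low with low @ low^T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def whiten(x, mu, low):
    """Solve low @ z = (x - mu) by forward substitution; z has identity covariance."""
    c = [xi - mi for xi, mi in zip(x, mu)]
    z = []
    for i in range(len(c)):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((c[i] - s) / low[i][i])
    return z

def hotelling_t2(x, mu, low):
    """T^2 = squared norm of the whitened reading = squared Mahalanobis distance."""
    return sum(v * v for v in whiten(x, mu, low))

# Hypothetical baseline from two positively correlated sensors.
baseline = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
mu = mean_vector(baseline)
low = cholesky(covariance(baseline, mu))
t2_consistent = hotelling_t2([4.0, 4.0], mu, low)   # follows the correlation
t2_conflicting = hotelling_t2([4.0, 1.0], mu, low)  # breaks the correlation
```

Both test readings are equally far from the mean in raw units, yet the reading that contradicts the learned correlation gets the larger $T^2$, which is why the statistic is useful for correlated sensor networks. In practice the score would be compared against a threshold chosen for a target false-alarm rate.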
Stylized Synthetic Augmentation further improves Corruption Robustness
Authors: Georg Siedel, Rojan Regmi, Abhirami Anand, Weijia Shao, Silvia Vock, Andrey Morozov
First: 2025-12-17T18:28:04+00:00 · Latest: 2025-12-17T18:28:04+00:00
Comments: Accepted at VISAPP 2026 conference
Abstract
This paper proposes a training data augmentation pipeline that combines synthetic image data with neural style transfer in order to address the vulnerability of deep vision models to common corruptions. We show that although applying style transfer on synthetic images degrades their quality with respect to the common FID metric, these images are surprisingly beneficial for model training. We conduct a systematic empirical analysis of the effects of both augmentations and their key hyperparameters on the performance of image classifiers. Our results demonstrate that stylization and synthetic data complement each other well and can be combined with popular rule-based data augmentation techniques such as TrivialAugment, although they do not work with some others. Our method achieves state-of-the-art corruption robustness on several small-scale image classification benchmarks, reaching 93.54%, 74.9% and 50.86% robust accuracy on CIFAR-10-C, CIFAR-100-C and TinyImageNet-C, respectively.
中文标题/摘要
标题:风格化合成增强进一步提高抗退化能力
本文提出了一种结合合成图像数据和神经风格迁移的训练数据增强管道,以解决深度视觉模型对常见退化易受攻击的问题。我们表明,尽管在合成图像上应用风格迁移会降低其在通用FID指标下的质量,但这些图像对模型训练却出乎意料地有益。我们系统地分析了两种增强方法及其关键超参数对图像分类器性能的影响。我们的结果表明,风格化和合成数据相辅相成,可以与流行的基于规则的数据增强技术(如TrivialAugment)结合使用,而不适用于其他技术。我们的方法在几个小型图像分类基准测试中达到了最先进的抗退化性能,分别在CIFAR-10-C、CIFAR-100-C和TinyImageNet-C上达到了93.54%、74.9%和50.86%的鲁棒准确率
Summary / 总结
This paper introduces a training data augmentation method that combines synthetic images with neural style transfer to enhance the robustness of deep vision models against common corruptions. The authors find that although stylized synthetic images have lower quality according to the FID metric, they significantly improve model performance. Systematic experiments show that stylization and synthetic data work well together and can be effectively combined with TrivialAugment. The proposed method achieves state-of-the-art results on several benchmarks, with robust accuracies of 93.54%, 74.9%, and 50.86% on CIFAR-10-C, CIFAR-100-C, and TinyImageNet-C respectively.
该研究提出了一种结合合成图像和神经风格迁移的数据增强方法,以增强深度视觉模型对常见干扰的鲁棒性。研究发现,尽管经过风格迁移的合成图像在FID指标上质量较低,但它们显著提升了模型性能。研究发现,风格迁移和合成数据可以很好地结合,并且可以与TrivialAugment等规则增强技术有效结合,从而在多个基准测试中达到了最先进的抗干扰性能。
Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu
First: 2025-12-17T18:15:17+00:00 · Latest: 2025-12-17T18:15:17+00:00
Comments: Under Review
Abstract
Human beings solve complex problems through critical thinking, where reasoning and evaluation are intertwined to converge toward correct solutions. However, most existing large language models (LLMs) decouple reasoning from verification: they either generate reasoning without explicit self-checking or rely on external verifiers to detect errors post hoc. The former lacks immediate feedback, while the latter increases system complexity and hinders synchronized learning. Motivated by human critical thinking, we propose Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model. STC is trained with a hybrid reinforcement learning objective combining reasoning rewards and critique-consistency rewards to jointly optimize reasoning quality and self-evaluation. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces, representing a step toward LLMs with built-in critical thinking.
中文标题/摘要
标题:逐步思考-批判:一种统一的鲁棒且可解释的大语言模型推理框架
人类通过批判性思维解决复杂问题,其中推理和评估交织在一起,逐步趋向正确答案。然而,大多数现有的大语言模型(LLMs)将推理与验证脱钩:它们要么生成推理而没有明确的自我检查,要么依赖外部验证者在事后检测错误。前者缺乏即时反馈,而后者增加了系统复杂性并妨碍了同步学习。受人类批判性思维的启发,我们提出了一种统一框架Stepwise Think-Critique (STC),该框架在单一模型中交替进行推理和自我批判。STC 通过结合推理奖励和批判一致性奖励的混合强化学习目标来联合优化推理质量和自我评估。在数学推理基准测试中的实验表明,STC 展示了强大的批判性思维能力,并生成了更可解释的推理轨迹,代表了向具有内置批判性思维的大语言模型迈进的一步。
Summary / 总结
The paper proposes Stepwise Think-Critique (STC), a unified framework that interleaves reasoning and self-critique at each step within a single model to enhance the robustness and interpretability of large language models (LLMs). STC is trained using a hybrid reinforcement learning objective that combines reasoning rewards with critique-consistency rewards. Experiments on mathematical reasoning benchmarks show that STC demonstrates strong critical-thinking capabilities and produces more interpretable reasoning traces.
论文提出了Stepwise Think-Critique (STC)框架,该框架在单一模型中结合了推理和自我批判,以增强大型语言模型(LLMs)的鲁棒性和可解释性。STC使用结合了推理奖励和批判一致性奖励的混合强化学习目标进行训练。实验表明,STC在数学推理基准测试中提高了批判性思考能力,并生成了更可解释的推理过程。
PPSEBM: An Energy-Based Model with Progressive Parameter Selection for Continual Learning
Authors: Xiaodi Li, Dingcheng Li, Rujun Gao, Mahmoud Zamani, Feng Mi, Latifur Khan
First: 2025-12-17T18:11:29+00:00 · Latest: 2025-12-17T18:11:29+00:00
Comments: 10 pages, 3 figures, 2025 IEEE International Conference on Big Data (BigData)
Abstract
Continual learning remains a fundamental challenge in machine learning, requiring models to learn from a stream of tasks without forgetting previously acquired knowledge. A major obstacle in this setting is catastrophic forgetting, where performance on earlier tasks degrades as new tasks are learned. In this paper, we introduce PPSEBM, a novel framework that integrates an Energy-Based Model (EBM) with Progressive Parameter Selection (PPS) to effectively address catastrophic forgetting in continual learning for natural language processing tasks. In PPSEBM, progressive parameter selection allocates distinct, task-specific parameters for each new task, while the EBM generates representative pseudo-samples from prior tasks. These generated samples actively inform and guide the parameter selection process, enhancing the model's ability to retain past knowledge while adapting to new tasks. Experimental results on diverse NLP benchmarks demonstrate that PPSEBM outperforms state-of-the-art continual learning methods, offering a promising and robust solution to mitigate catastrophic forgetting.
中文标题/摘要
标题:PPSEBM:一种基于能量的模型,结合渐进参数选择的持续学习方法
持续学习仍然是机器学习中的一个基本挑战,要求模型能够从任务流中学习而不忘记之前获得的知识。在这种设置中,灾难性遗忘是一个主要障碍,即随着新任务的学习,早期任务的性能会下降。在本文中,我们提出了一种名为PPSEBM的新框架,该框架将基于能量的模型(EBM)与渐进参数选择(PPS)相结合,以有效解决自然语言处理任务中持续学习中的灾难性遗忘问题。在PPSEBM中,渐进参数选择为每个新任务分配了特定的任务参数,而EBM则从先前的任务中生成代表性的伪样本。这些生成的样本主动地指导和影响参数选择过程,增强了模型保留过去知识并适应新任务的能力。在多种NLP基准上的实验结果表明,PPSEBM优于最先进的持续学习方法,提供了一种有希望且稳健的解决方案,以减轻灾难性遗忘。
From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization
Authors: Haoran Xi, Minghao Shao, Brendan Dolan-Gavitt, Muhammad Shafique, Ramesh Karri
First: 2025-09-30T22:27:18+00:00 · Latest: 2025-12-17T18:10:36+00:00
Abstract
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detection, which offers limited actionable guidance to engineers who need precise line-level localization and targeted patches in real-world software development. We present T2L-Agent (Trace-to-Line Agent), a project-level, end-to-end framework that plans its own analysis and progressively narrows scope from modules to exact vulnerable lines. T2L-Agent couples multi-round feedback with an Agentic Trace Analyzer (ATA) that fuses run-time evidence such as crash points, stack traces, and coverage deltas with AST-based code chunking, enabling iterative refinement beyond single-pass predictions and translating symptoms into actionable, line-level diagnoses. To benchmark line-level vulnerability discovery, we introduce T2L-ARVO, a diverse, expert-verified 50-case benchmark spanning five crash families and real-world projects. T2L-ARVO is specifically designed to support both coarse-grained detection and fine-grained localization, enabling rigorous evaluation of systems that aim to move beyond file-level predictions. On T2L-ARVO, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, substantially outperforming baselines. Together, the framework and benchmark push LLM-based vulnerability detection from coarse identification toward deployable, robust, precision diagnostics that reduce noise and accelerate patching in open-source software workflows.
中文标题/摘要
标题:从踪迹到线:面向现实世界开源软件漏洞定位的LLM代理
大型语言模型在漏洞发现方面显示出潜力,但现有方法孤立地检查代码,难以处理长上下文,并且主要集中在粗粒度的功能或文件级别的检测上,这为工程师提供了有限的精确行级定位和针对性修补的指导。我们提出了T2L-Agent(踪迹到线代理),这是一种项目级别的端到端框架,能够自主规划分析,并逐步从模块缩小到具体的漏洞行。T2L-Agent 结合多轮反馈与代理踪迹分析器(ATA),将运行时证据(如崩溃点、堆栈跟踪和覆盖率差异)与基于AST的代码切片相结合,实现迭代细化,超越单次预测,并将症状转化为可操作的行级诊断。为了评估行级漏洞发现,我们引入了T2L-ARVO,这是一个多样化的、专家验证的50个案例基准,涵盖了五个崩溃家族和实际项目。T2L-ARVO特别设计用于支持粗粒度检测和细粒度定位,使系统能够超越文件级别的预测进行严格的评估。在T2L-ARVO上,T2L-Agent 的检测率最高可达58.0%,行级定位率最高可达54.8%,显著优于基线。该框架和基准共同推动了基于LLM的漏洞检测从粗略识别向可部署、稳健、精确诊断的转变,减少了噪声并加速了开源软件工作流中的补丁修复。
Summary / 总结
T2L-Agent is a project-level framework that uses large language models to localize vulnerabilities at the line level, addressing the limitations of previous methods that focus on coarse function or file level detections. It employs an Agentic Trace Analyzer to iteratively refine predictions using multi-round feedback and fuses run-time evidence with AST-based code chunking. On the T2L-ARVO benchmark, T2L-Agent achieves up to 58.0% detection and 54.8% line-level localization, significantly outperforming existing methods.
T2L-Agent 是一个端到端的框架,利用大型语言模型在真实软件中逐步将范围从模块缩小到具体的代码行来定位漏洞。它结合多轮反馈和一个代理追踪分析器来细化预测,并将症状转化为可操作的行级诊断。在 T2L-ARVO 基准测试中,T2L-Agent 的检测率最高可达 58.0%,行级定位率最高可达 54.8%,显著优于现有方法。
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
First: 2025-12-17T17:58:35+00:00 · Latest: 2025-12-17T17:58:35+00:00
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish the VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
中文标题/摘要
标题:VTCBench:视觉语言模型能否通过视觉文本压缩理解长上下文?
LLM的上下文窗口扩展相关的计算和内存开销严重限制了其可扩展性。值得注意的解决方案是视觉文本压缩(VTC),如DeepSeek-OCR和Glyph等框架,将长文本转换为密集的二维视觉表示,从而实现3倍至20倍的标记压缩比。然而,这种高信息密度对视觉语言模型(VLM)的核心长上下文能力的影响尚未得到充分研究。为填补这一空白,我们首次引入了VTC基准,并系统评估了VLM在三种长上下文理解设置中的性能:VTC-Retrieval,评估模型检索和聚合信息的能力;VTC-Reasoning,要求模型通过最小的词汇重叠来推断潜在关联以定位事实;VTC-Memory,衡量模型在长期对话记忆中进行综合问答的能力。此外,我们建立了VTCBench-Wild以模拟多样化的输入场景。我们在基准上全面评估了领先开源和专有模型。结果表明,尽管大多数VLM能够很好地解码文本信息(如OCR),但在使用VTC压缩信息时,它们在长上下文理解方面表现出令人惊讶的差强人意的能力,无法捕捉上下文中的长期关联或依赖关系。本研究为理解VTC提供了深入的见解,并为设计更高效和可扩展的VLM奠定了基础。
Summary / 总结
The paper introduces VTCBench, a benchmark to evaluate the long-context understanding capabilities of vision-language models (VLMs) using vision-text compression (VTC). It assesses models in three settings: VTC-Retrieval, VTC-Reasoning, and VTC-Memory, and finds that most VLMs struggle to understand long contexts compressed by VTC, failing to capture long-term associations and dependencies. The study highlights the need for improving VLMs' ability to handle high-density visual representations of text.
研究引入了VTCBench,以评估使用视觉文本压缩(VTC)的视觉语言模型(VLM)在长上下文理解方面的能力。它在三个场景下评估模型:VTC-Retrieval、VTC-Reasoning 和 VTC-Memory。结果显示,大多数VLM在处理压缩后的高密度视觉信息时难以理解长上下文,表明需要更好地处理这些高密度的视觉表示。
Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift
Authors: Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen
First: 2025-12-17T17:54:20+00:00 · Latest: 2025-12-17T17:54:20+00:00
Comments: Code at: https://github.com/Jiacheng8/HALD
Abstract
Soft labels generated by teacher models have become a dominant paradigm for knowledge transfer and for recent large-scale dataset distillation methods such as SRe2L, RDED, and LPLD, offering richer supervision than conventional hard labels. However, we observe that when only a limited number of crops per image are used, soft labels are prone to local semantic drift: a crop may visually resemble another class, causing its soft embedding to deviate from the ground-truth semantics of the original image. This mismatch between local visual content and global semantic meaning introduces systematic errors and distribution misalignment between training and testing. In this work, we revisit the overlooked role of hard labels and show that, when appropriately integrated, they provide a powerful content-agnostic anchor to calibrate semantic drift. We theoretically characterize the emergence of drift under supervision from only a few soft labels and demonstrate that hybridizing soft and hard labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which leverages hard labels as intermediate corrective signals while retaining the fine-grained advantages of soft labels. Extensive experiments on dataset distillation and large-scale conventional classification benchmarks validate our approach, showing consistent improvements in generalization. On ImageNet-1K, we achieve 42.7% accuracy with only 285M storage for soft labels, outperforming prior state-of-the-art LPLD by 9.0%. Our findings re-establish the importance of hard labels as a complementary tool, and call for a rethinking of their role in soft-label-dominated training.
中文标题/摘要
标题:硬标签入局!重新思考硬标签在减轻局部语义漂移中的作用
由教师模型生成的软标签已成为知识迁移和大规模数据集蒸馏(如SRe2L、RDED、LPLD)中的主导范式,提供了比传统硬标签更丰富的监督。然而,我们观察到,当每张图像仅使用少量裁剪时,软标签容易出现局部语义漂移:一个裁剪可能在视觉上类似于另一个类别,导致其软嵌入偏离原始图像的真实语义。这种局部视觉内容与全局语义意义之间的不匹配引入了训练和测试之间的系统性错误和分布不一致。在本文中,我们重新审视了被忽视的硬标签作用,并展示了当适当集成时,它们提供了一种强大的内容无关锚点来校准语义漂移。我们从理论上描述了在少量软标签监督下漂移的出现,并证明了混合软标签和硬标签可以恢复视觉内容和语义监督之间的对齐。基于这一见解,我们提出了一种新的训练范式——减轻局部语义漂移的硬标签(HALD),利用硬标签作为中间纠正信号,同时保留软标签的细粒度优势。在大规模数据集蒸馏和传统分类基准上的广泛实验验证了我们的方法,展示了一致的泛化改进。在ImageNet-1K上,我们仅使用2.85亿存储的软标签实现了42.7%,超越了先前的SPLD最佳结果9.0%。我们的研究重新确立了硬标签作为补充工具的重要性,并呼吁重新思考它们在软标签主导训练中的作用。
Summary / 总结
This work addresses the issue of local semantic drift in soft labels by integrating hard labels. It shows that hard labels, when properly used, can provide a content-agnostic anchor to mitigate this drift. The proposed Hard Label for Alleviating Local Semantic Drift (HALD) method combines the benefits of both soft and hard labels, leading to improved generalization. Experiments on ImageNet-1K demonstrate that HALD outperforms prior methods, achieving 42.7% accuracy with only 285M storage for soft labels.
该论文解决了由教师模型生成的软标签中出现的局部语义漂移问题,这会导致训练中的系统性错误。它提出了一种新的训练范式HALD,通过整合硬标签提供一个内容无关的锚点,从而减轻语义漂移。实验结果显示,该方法在ImageNet-1K上的性能提高了9.0%,优于之前的最佳方法。
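The idea of hybridizing soft and hard labels in the HALD abstract can be made concrete with a small loss sketch. This is only an illustrative weighted mix of two cross-entropy terms: the fixed `hard_weight` parameter and all function names are assumptions, and the abstract's description of hard labels as "intermediate corrective signals" suggests the actual method may schedule or combine them differently.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def soft_cross_entropy(logits, soft_target):
    """Cross-entropy against a teacher's soft label (a probability vector)."""
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(soft_target, logp))

def hard_cross_entropy(logits, label):
    """Cross-entropy against the image-level ground-truth class index."""
    return -log_softmax(logits)[label]

def hybrid_loss(logits, soft_target, label, hard_weight=0.5):
    """Assumed mix: hard_weight * hard CE + (1 - hard_weight) * soft CE."""
    return (hard_weight * hard_cross_entropy(logits, label)
            + (1.0 - hard_weight) * soft_cross_entropy(logits, soft_target))

# A crop whose teacher soft label has drifted toward class 1,
# although the source image belongs to class 0.
logits = [2.0, 0.0]
drifted_soft = [0.3, 0.7]
loss = hybrid_loss(logits, drifted_soft, label=0)
```

Here the hard term does not depend on what the crop looks like, so it keeps pulling the prediction toward the image's true class even when the crop's soft label points elsewhere; this is the content-agnostic anchoring role the abstract attributes to hard labels.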
IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning
Authors: Yuanhang Li, Yiren Song, Junzhe Bai, Xinran Liang, Hu Yang, Libiao Jin, Qi Mao
First: 2025-12-17T17:47:18+00:00 · Latest: 2025-12-17T17:47:18+00:00
Abstract
We propose IC-Effect, an instruction-guided, DiT-based framework for few-shot video VFX editing that synthesizes complex effects (e.g., flames, particles and cartoon characters) while strictly preserving spatial and temporal consistency. Video VFX editing is highly challenging because injected effects must blend seamlessly with the background, the background must remain entirely unchanged, and effect patterns must be learned efficiently from limited paired data. However, existing video editing models fail to satisfy these requirements. IC-Effect leverages the source video as clean contextual conditions, exploiting the contextual learning capability of DiT models to achieve precise background preservation and natural effect injection. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning via Effect-LoRA, ensures strong instruction following and robust effect modeling. To further improve efficiency, we introduce spatiotemporal sparse tokenization, enabling high fidelity with substantially reduced computation. We also release a paired VFX editing dataset spanning 15 high-quality visual styles. Extensive experiments show that IC-Effect delivers high-quality, controllable, and temporally consistent VFX editing, opening new possibilities for video creation.
中文标题/摘要
标题:IC-Effect:基于上下文学习的精确高效视频特效编辑
我们提出了一种名为IC-Effect的指令引导、基于DiT的框架,用于少样本视频VFX编辑,能够合成复杂的特效(例如火焰、粒子和卡通人物),同时严格保持空间和时间一致性。视频VFX编辑极具挑战性,因为注入的特效必须与背景无缝融合,背景必须完全不变,且必须从有限的配对数据中高效学习特效模式。然而,现有的视频编辑模型无法满足这些要求。IC-Effect利用源视频作为干净的上下文条件,借助DiT模型的上下文学习能力实现精确的背景保留和自然的特效注入。两阶段训练策略(先进行通用编辑适应,再通过Effect-LoRA进行特效特定学习)确保了良好的指令遵循和稳健的特效建模。为了进一步提高效率,我们引入了时空稀疏标记化,在显著减少计算量的同时保持高保真度。我们还发布了涵盖15种高质量视觉风格的配对VFX编辑数据集。大量实验表明,IC-Effect提供高质量、可控且时间一致的VFX编辑,为视频创作开辟了新的可能性。
Summary / 总结
IC-Effect is a framework for video VFX editing that uses instruction guidance and DiT models to synthesize complex effects while preserving the background. It employs a two-stage training strategy and spatiotemporal sparse tokenization to ensure precise background preservation and natural effect injection, achieving high-quality, controllable, and temporally consistent VFX editing. The framework demonstrates strong instruction following and robust effect modeling, opening new possibilities for video creation.
IC-Effect 是一个基于 DiT 模型的指令引导框架,用于少量样本的视频 VFX 编辑,能够合成复杂效果同时保持空间和时间一致性。它采用两阶段训练策略和时空稀疏分词化,确保精确的背景保留和自然的效果注入,实现高质量且时间一致的 VFX 编辑,并减少计算量。该框架展示了强大的指令遵循能力和稳健的效果建模能力,为视频创作开辟了新可能。
Learning without training: The implicit dynamics of in-context learning
Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo
First: 2025-07-21T18:44:35+00:00 · Latest: 2025-12-17T17:34:33+00:00
Abstract
One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP, allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in-context and not only during training. Specifically, we show how a transformer block implicitly transforms a context into a low-rank weight-update of its MLP layer.
中文标题/摘要
标题:无需训练的学习:上下文学习的隐式动态
大型语言模型(LLMs)最引人注目的特征之一是其上下文学习能力。即在推理时,当新模式以示例形式出现在提示中时,LLM无需任何额外的权重更新就能学习这些模式,即使这些模式在训练过程中从未见过。实现这一点的机制在很大程度上仍然未知。在本文中,我们展示了将自注意力层与MLP堆叠起来,使Transformer块能够根据上下文隐式修改MLP层的权重。我们通过理论和实验论证,这种简单的机制可能是LLM不仅能在训练期间学习、也能在上下文中学习的原因。具体而言,我们展示了Transformer块如何隐式地将上下文转换为其MLP层的低秩权重更新。
Summary / 总结
The research explores the mechanism by which Large Language Models (LLMs) learn new patterns at inference time without any weight update. It shows that stacking a self-attention layer with an MLP lets the transformer block implicitly modify the MLP weights according to the context, and specifically that the block transforms a context into a low-rank weight update of its MLP layer. Theory and experiments suggest that this simple mechanism may be the reason LLMs can learn in-context and not only during training.
研究探讨了大型语言模型(LLMs)在推理时无需任何权重更新即可学习新模式的机制。研究表明,将自注意力层与MLP堆叠,可使Transformer块根据上下文隐式修改MLP的权重;具体而言,该块将上下文转换为其MLP层的低秩权重更新。理论和实验表明,这种简单机制可能是LLMs不仅能在训练期间学习、也能在上下文中学习的原因。
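What "a context acting as a low-rank weight update" can mean is easiest to see with a linear-algebra identity. This is a minimal sketch, assuming a single linear layer `W` and two hypothetical attention outputs for the same query token, one computed without context (`a_query`) and one with context (`a_context`); all numbers are made up, and the paper's exact construction may differ. The identity itself is standard: applying `W` to the context-aware vector equals applying `W + dW` to the context-free vector, where `dW` is an outer product and hence has rank one.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_update(W, a_query, a_context):
    """Return dW = (W @ delta) a_query^T / ||a_query||^2, with delta = a_context - a_query."""
    delta = [c - q for c, q in zip(a_context, a_query)]
    w_delta = matvec(W, delta)
    norm_sq = sum(q * q for q in a_query)
    return [[wd * q / norm_sq for q in a_query] for wd in w_delta]

# Hypothetical first-layer MLP weights and attention outputs (illustrative values).
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
a_query = [1.0, 2.0]     # attention output without context
a_context = [1.5, 1.0]   # attention output with context

dW = rank_one_update(W, a_query, a_context)
W_new = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, dW)]

with_context = matvec(W, a_context)        # original weights, context-aware input
with_updated_weights = matvec(W_new, a_query)  # updated weights, context-free input
```

Both vectors come out identical, so the effect of the context can be moved entirely into the weights without changing the layer's output.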
Deterministic Global Optimization of the Acquisition Function in Bayesian Optimization: To Do or Not To Do?
Authors: Anastasia Georgiou, Daniel Jungen, Luise Kaven, Verena Hunstig, Constantine Frangakis, Ioannis Kevrekidis, Alexander Mitsos
First: 2025-03-05T16:05:26+00:00 · Latest: 2025-12-17T17:32:28+00:00
Comments: 39 pages, 8 figures, 11 tables
Abstract
Bayesian Optimization (BO) with Gaussian Processes relies on optimizing an acquisition function to determine sampling. We investigate the advantages and disadvantages of using a deterministic global solver (MAiNGO) compared to conventional local and stochastic global solvers (L-BFGS-B and multi-start, respectively) for the optimization of the acquisition function. For CPU efficiency, we set a time limit for MAiNGO, taking the best point as optimal. We perform repeated numerical experiments, initially using the Muller-Brown potential as a benchmark function, utilizing the lower confidence bound acquisition function; we further validate our findings with three alternative benchmark functions. Statistical analysis reveals that when the acquisition function is more exploitative (as opposed to exploratory), BO with MAiNGO converges in fewer iterations than with the local solvers. However, when the dataset lacks diversity, or when the acquisition function is overly exploitative, BO with MAiNGO, compared to the local solvers, is more likely to converge to a local rather than a globally near-optimal solution of the black-box function. L-BFGS-B and multi-start mitigate this risk in BO by introducing stochasticity in the selection of the next sampling point, which enhances the exploration of uncharted regions in the search space and reduces dependence on acquisition function hyperparameters. Ultimately, suboptimal optimization of poorly chosen acquisition functions may be preferable to their optimal solution. When the acquisition function is more exploratory, BO with MAiNGO, multi-start, and L-BFGS-B achieve comparable probabilities of convergence to a globally near-optimal solution (although BO with MAiNGO may require more iterations to converge under these conditions).
中文标题/摘要
标题:贝叶斯优化中获取函数的确定性全局优化:做还是不做?
基于高斯过程的贝叶斯优化(BO)依赖于优化获取函数来确定采样点。我们研究了与传统的局部求解器和随机全局求解器(分别为L-BFGS-B和多重启)相比,使用确定性全局求解器(MAiNGO)优化获取函数的优缺点。为了提高CPU效率,我们为MAiNGO设置了时间限制,并将所得最佳点视为最优。我们进行了重复的数值实验,最初使用Muller-Brown势能作为基准函数,并采用下置信边界获取函数;我们进一步使用三个替代基准函数验证了我们的发现。统计分析表明,当获取函数更偏向开发(而非探索)时,使用MAiNGO的BO比使用局部求解器的BO以更少的迭代次数收敛。然而,当数据集缺乏多样性,或当获取函数过度偏向开发时,与局部求解器相比,使用MAiNGO的BO更有可能收敛到黑箱函数的局部解而非全局近似最优解。L-BFGS-B和多重启通过在选择下一个采样点时引入随机性来降低这一风险,这增强了对搜索空间中未探索区域的探索,并减少了对获取函数超参数的依赖。最终,对选择不当的获取函数进行次优优化可能比求得其最优解更可取。当获取函数更偏向探索时,使用MAiNGO、多重启和L-BFGS-B的BO收敛到全局近似最优解的概率相当(尽管在这些条件下,使用MAiNGO的BO可能需要更多迭代才能收敛)。
Summary / 总结
This study investigates the use of a deterministic global solver (MAiNGO) versus local and stochastic global solvers (L-BFGS-B and multi-start) for optimizing the acquisition function in Bayesian Optimization (BO). Experiments using the Muller-Brown potential and three other benchmark functions show that MAiNGO can converge faster when the acquisition function is more exploitative. However, in datasets lacking diversity or when the acquisition function is overly exploitative, MAiNGO is more prone to converging to a local rather than a global near-optimal solution. Local solvers, through stochasticity, mitigate this risk and enhance exploration, making them more robust in such scenarios.
研究探讨了使用确定性全局求解器(MAiNGO)在贝叶斯优化(BO)中优化获取函数与局部和随机求解器(L-BFGS-B 和多启动)的比较。使用Muller-Brown势能和其他三个基准函数的实验表明,当获取函数更具探索性时,MAiNGO 和多启动、L-BFGS-B 在达到全局近似最优解的概率上表现相当,但在数据集缺乏多样性时,MAiNGO 更容易陷入局部最优。通过引入随机性,局部求解器增强了探索未知区域的能力,减少了对获取函数超参数的依赖,从而提高了鲁棒性。
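The exploitative-versus-exploratory distinction in the lower confidence bound (LCB) acquisition function can be made concrete with a toy example. This is a minimal sketch, assuming the common form LCB(x) = mean(x) - kappa * std(x) for a minimization problem and a hypothetical finite set of candidate points with made-up posterior means and standard deviations. The exhaustive search over candidates stands in for the acquisition-function solver; it is not MAiNGO, L-BFGS-B, or multi-start.

```python
def lower_confidence_bound(mean, std, kappa):
    """LCB for minimization: small kappa exploits the mean, large kappa explores uncertainty."""
    return [m - kappa * s for m, s in zip(mean, std)]

def next_sample_index(mean, std, kappa):
    """Pick the candidate that minimizes the LCB (exhaustive search, illustration only)."""
    lcb = lower_confidence_bound(mean, std, kappa)
    return min(range(len(lcb)), key=lcb.__getitem__)

# Hypothetical Gaussian-process posterior at four candidate points (made-up values).
posterior_mean = [0.2, -0.5, 0.1, 0.4]
posterior_std = [0.05, 0.10, 0.60, 1.00]

exploit_choice = next_sample_index(posterior_mean, posterior_std, kappa=0.5)
explore_choice = next_sample_index(posterior_mean, posterior_std, kappa=3.0)
```

With the small `kappa` the selected point is the one with the lowest predicted mean, while the large `kappa` selects the most uncertain point. An exact solver on an overly exploitative acquisition function therefore keeps sampling near the current best prediction, which is the failure mode the abstract attributes to deterministic global optimization.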
OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
Authors: Yu Zheng, Jie Hu, Kailun Yang, Jiaming Zhang
First: 2025-12-17T17:29:20+00:00 · Latest: 2025-12-17T17:29:20+00:00
Comments: 16 pages, 5 figures
Abstract
Autonomous driving requires a persistent understanding of 3D scenes that is robust to temporal disturbances and accounts for potential future actions. We introduce a new concept of 4D Occupancy Spatio-Temporal Persistence (OccSTeP), which aims to address two tasks: (1) reactive forecasting: "what will happen next" and (2) proactive forecasting: "what would happen given a specific future action". For the first time, we create a new OccSTeP benchmark with challenging scenarios (e.g., erroneous semantic labels and dropped frames). To address this task, we propose OccSTeP-WM, a tokenizer-free world model that maintains a dense voxel-based scene state and incrementally fuses spatio-temporal context over time. OccSTeP-WM leverages a linear-complexity attention backbone and a recurrent state-space module to capture long-range spatial dependencies while continually updating the scene memory with ego-motion compensation. This design enables online inference and robust performance even when historical sensor input is missing or noisy. Extensive experiments prove the effectiveness of the OccSTeP concept and our OccSTeP-WM, yielding an average semantic mIoU of 23.70% (+6.56% gain) and occupancy IoU of 35.89% (+9.26% gain). The data and code will be open source at https://github.com/FaterYU/OccSTeP.
中文标题/摘要
标题:OccSTeP:4D 占有时空持久性基准测试
自动驾驶需要一种对3D场景的持久理解,这种理解能够抵御时间上的干扰,并考虑到潜在的未来行动。我们引入了一个新的4D占有时空持久性(OccSTeP)的概念,旨在解决两个任务:(1)反应性预测:“接下来会发生什么”;(2)前瞻性预测:“给定特定未来行动会发生什么”。我们首次创建了一个具有挑战性场景(例如错误的语义标签和丢失的帧)的新OccSTeP基准测试。为了解决这一任务,我们提出了OccSTeP-WM,这是一种无需分词的世界模型,它维持了一个密集的体素场景状态,并随着时间的推移逐步融合时空上下文。OccSTeP-WM 利用线性复杂度的注意力骨干和递归状态空间模块来捕捉长距离的空间依赖性,同时通过自我运动补偿不断更新场景记忆。这种设计使得在线推理和即使在历史传感器输入缺失或噪声的情况下也能保持鲁棒性能。广泛的实验证明了OccSTeP概念和我们的OccSTeP-WM的有效性,平均语义mIoU为23.70%(+6.56%的增益)和占用IoU为35.89%(+9.26%的增益)。数据和代码将在https://github.com/FaterYU/OccSTeP开源。
Summary / 总结
The paper introduces OccSTeP, a new concept for persistent 4D understanding in autonomous driving, focusing on reactive and proactive forecasting. It proposes OccSTeP-WM, a world model that maintains a dense voxel-based scene state and fuses spatio-temporal context over time, achieving an average semantic mIoU of 23.70% and occupancy IoU of 35.89%. The model is robust to missing or noisy historical sensor data and performs well in challenging scenarios like erroneous semantic labels and dropped frames.
论文提出了OccSTeP,一种新的4D理解概念,专注于反应性和前瞻性预测。它提出了OccSTeP-WM,一种维护密集体素场景状态并融合时空上下文的模型,平均语义mIoU为23.70%,占用IoU为35.89%。该模型对缺失或噪声历史传感器数据具有鲁棒性,并在错误的语义标签和丢帧等挑战性场景中表现出色。
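The two reported metrics, semantic mIoU and occupancy IoU, can be computed from flattened voxel labels as follows. This is a minimal sketch, assuming class 0 denotes empty space and that classes absent from both prediction and ground truth are skipped; the benchmark's actual evaluation protocol (ignored classes, averaging over time steps) is not given in the abstract, so those choices are assumptions.

```python
def semantic_miou(pred, target, num_classes, empty_class=0):
    """Mean IoU over non-empty semantic classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

def occupancy_iou(pred, target, empty_class=0):
    """Class-agnostic IoU: a voxel counts as occupied if its label is not empty."""
    inter = sum(1 for p, t in zip(pred, target) if p != empty_class and t != empty_class)
    union = sum(1 for p, t in zip(pred, target) if p != empty_class or t != empty_class)
    return inter / union if union else 0.0

# Six voxels with labels 0 = empty, 1 and 2 = two hypothetical semantic classes.
pred_voxels = [0, 1, 1, 2, 0, 2]
true_voxels = [0, 1, 2, 2, 1, 0]

miou = semantic_miou(pred_voxels, true_voxels, num_classes=3)
occ_iou = occupancy_iou(pred_voxels, true_voxels)
```

Occupancy IoU only asks whether a voxel is filled, so it is typically higher than semantic mIoU, which also requires the class to be correct.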
A Teacher-Student Perspective on the Dynamics of Learning Near the Optimal Point
Authors: Carlos Couto, José Mourão, Mário A. T. Figueiredo, Pedro Ribeiro
First: 2025-12-17T17:17:12+00:00 · Latest: 2025-12-17T17:17:12+00:00
Comments: 25 pages, 9 figures
Abstract
Near an optimal learning point of a neural network, the learning performance of gradient descent dynamics is dictated by the Hessian matrix of the loss function with respect to the network parameters. We characterize the Hessian eigenspectrum for some classes of teacher-student problems, when the teacher and student networks have matching weights, showing that the smaller eigenvalues of the Hessian determine long-time learning performance. For linear networks, we analytically establish that for large networks the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. We numerically analyse the Hessian spectrum for polynomial and other non-linear networks. Furthermore, we show that the rank of the Hessian matrix can be seen as an effective number of parameters for networks using polynomial activation functions. For a generic non-linear activation function, such as the error function, we empirically observe that the Hessian matrix is always full rank.
中文标题/摘要
标题:教师与学生视角下的神经网络学习最优点附近动力学分析
在神经网络学习的最优点附近,梯度下降动力学的学习性能由损失函数相对于网络参数的海森矩阵决定。我们对一些教师-学生问题的海森矩阵特征谱进行了表征,当教师网络和学生网络具有匹配权重时,显示较小的海森矩阵特征值决定了长期学习性能。对于线性网络,我们通过分析证明,对于大型网络,谱渐近地遵循一个缩放的卡方分布与缩放的马尔琴科-帕斯图分布的卷积。我们还对多项式和其他非线性网络的海森矩阵谱进行了数值分析。此外,我们表明,对于使用多项式激活函数的网络,海森矩阵的秩可以视为有效参数数量。对于通用的非线性激活函数,如误差函数,我们通过实验观察到海森矩阵总是满秩。
Summary / 总结
This study investigates the dynamics of learning near the optimal point in neural networks, focusing on the role of the Hessian matrix. The research characterizes the Hessian eigenspectrum for teacher-student problems in which the teacher and student networks have matching weights, showing that the smaller eigenvalues determine long-time learning performance. For large linear networks, the spectrum asymptotically follows a convolution of a scaled chi-square distribution with a scaled Marchenko-Pastur distribution. The Hessian spectra of polynomial and other non-linear networks are analysed numerically: for polynomial activation functions, the rank of the Hessian can be seen as an effective number of parameters, while for a generic non-linear activation such as the error function, the Hessian is empirically observed to be always full rank.
研究通过分析损失函数相对于网络参数的海森矩阵,探讨了神经网络在接近最优点时的学习动态。对于大型线性网络,研究表明海森矩阵的谱渐近地遵循一个缩放的卡方分布与缩放的马尔琴科-帕斯图分布的卷积。研究还对多项式和其他非线性网络的海森矩阵谱进行了数值分析,发现对于具有多项式激活函数的网络,海森矩阵的秩可以视为有效参数数量,而对于像误差函数这样的通用非线性激活函数,经验上观察到海森矩阵总是满秩的。
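Why the smaller Hessian eigenvalues govern long-time learning can be seen from gradient descent on a quadratic loss. This is a minimal sketch, assuming the loss is well approximated near the optimum by a quadratic with Hessian eigenvalues `eigenvalues`, and working in the eigenbasis so each error component evolves independently as (1 - lr * lambda)^t. The eigenvalues and learning rate are made-up values, not taken from the paper.

```python
def gd_error_after(eigenvalues, initial_error, lr, steps):
    """Error along each Hessian eigen-direction after `steps` gradient descent updates.

    For L(w) = 0.5 * sum(lambda_i * e_i^2), one update maps e_i to (1 - lr * lambda_i) * e_i.
    """
    return [e0 * (1.0 - lr * lam) ** steps for lam, e0 in zip(eigenvalues, initial_error)]

# Hypothetical spectrum: one stiff, one intermediate, and one very flat direction.
hessian_eigenvalues = [1.0, 0.1, 0.001]
remaining_error = gd_error_after(hessian_eigenvalues, [1.0, 1.0, 1.0], lr=0.5, steps=100)
```

After 100 steps the error along the largest eigenvalue is negligible, while the direction with eigenvalue 0.001 has barely moved, so the tail of the spectrum sets the late-time convergence rate.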
Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction
Authors: Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet
First: 2025-12-17T17:14:26+00:00 · Latest: 2025-12-17T17:14:26+00:00
Abstract
Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
中文标题/摘要
标题:自回归语言模型实际上是能量基模型:关于下一词预测前瞻能力的见解
自回归模型(ARMs)目前构成了大型语言模型(LLMs)的主要范式。能量基模型(EBMs)代表了另一类模型,尽管在LLM开发中历史上较少出现,但自然地描述了后训练对齐中的最优策略。在本文中,我们提供了这两个模型类的统一视角。以概率链规则为起点,我们建立了函数空间中ARMs和EBMs之间的显式双射关系,我们证明这对应于最大熵强化学习中软贝尔曼方程的特殊情形。基于这种双射关系,我们推导了ARMs和EBMs监督学习的等价性。此外,我们通过提供理论误差界分析了EBMs向ARMs的蒸馏过程。我们的结果为理解基于下一词预测范式的ARMs的前瞻能力提供了见解。
Summary / 总结
This paper explores the relationship between autoregressive models (ARMs) and energy-based models (EBMs) in the context of large language models (LLMs). Starting from the chain rule of probability, the authors establish an explicit bijection between ARMs and EBMs in function space and show that it corresponds to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building on this bijection, they derive the equivalence between supervised learning of ARMs and EBMs and provide theoretical error bounds for the distillation of EBMs into ARMs. The results offer insight into how ARMs can plan ahead despite being trained for next-token prediction.
本文探讨了自回归模型(ARMs)和能量基模型(EBMs)在大型语言模型(LLMs)中的关系。以概率链式法则为起点,作者在函数空间中建立了ARMs和EBMs之间的显式双射,并证明它对应于最大熵强化学习中软贝尔曼方程的一种特殊情形。基于这一双射,作者推导了ARMs和EBMs监督学习的等价性,并给出了将EBMs蒸馏为ARMs的理论误差界。这些结果为理解ARMs尽管基于下一词预测范式却仍具有前瞻规划能力提供了见解。
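The link between a sequence-level energy-based model and next-token probabilities can be checked by brute force on a tiny vocabulary. This is a minimal sketch, assuming a hypothetical sequence score `reward` (a negative energy) and defining a soft value of a prefix as the log-sum-exp of the scores of all its completions; the conditionals are then exp(V(prefix + token) - V(prefix)). This is generic marginalization written in soft-Bellman form for illustration, and the paper's exact construction may differ.

```python
import itertools
import math

VOCAB = (0, 1)
LENGTH = 3

def reward(seq):
    """Hypothetical sequence-level score (negative energy); depends on first and last token."""
    return sum(seq) + (1.5 if seq[0] == seq[-1] else 0.0)

def soft_value(prefix):
    """Log-sum-exp of the scores of all full sequences that extend `prefix`."""
    remaining = LENGTH - len(prefix)
    scores = [reward(prefix + tail) for tail in itertools.product(VOCAB, repeat=remaining)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def next_token_prob(prefix, token):
    """Autoregressive conditional implied by the sequence-level model."""
    return math.exp(soft_value(prefix + (token,)) - soft_value(prefix))

def arm_prob(seq):
    """Probability of a full sequence as a product of next-token conditionals."""
    p = 1.0
    for t in range(len(seq)):
        p *= next_token_prob(seq[:t], seq[t])
    return p

def ebm_prob(seq):
    """Probability of a full sequence under the normalized energy-based model."""
    return math.exp(reward(seq) - soft_value(()))

all_sequences = list(itertools.product(VOCAB, repeat=LENGTH))
max_gap = max(abs(arm_prob(s) - ebm_prob(s)) for s in all_sequences)
```

The two factorizations agree on every sequence. Because `reward` couples the first and last tokens, the first-token conditional already accounts for what can happen two steps later, which is a small-scale picture of the lookahead the abstract refers to.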
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
Authors: Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
First: 2025-12-17T17:09:52+00:00 · Latest: 2025-12-17T17:09:52+00:00
Comments: Project website: https://tobias-kirschstein.github.io/flexavatar/ , Video: https://youtu.be/g8wxqYBlRGY
Abstract
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. A core challenge lies in the limited availability of multi-view data and the tendency of monocular training to yield incomplete 3D head reconstructions. We identify the root cause of this issue as the entanglement between driving signal and target viewpoint when learning from monocular videos. To address this, we propose a transformer-based 3D portrait animation model with learnable data source tokens, so-called bias sinks, which enables unified training across monocular and multi-view datasets. This design leverages the strengths of both data sources during inference: strong generalization from monocular data and full 3D completeness from multi-view supervision. Furthermore, our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations. In extensive evaluations on single-view, few-shot, and monocular avatar creation tasks, we verify the efficacy of FlexAvatar. Many existing methods struggle with view extrapolation while FlexAvatar generates complete 3D head avatars with realistic facial animations. Website: https://tobias-kirschstein.github.io/flexavatar/
中文标题/摘要
标题:FlexAvatar:在部分监督下学习完整的3D头部avatar
我们介绍了FlexAvatar,一种从单张图像创建高质量和完整3D头部avatar的方法。核心挑战在于多视角数据的稀缺性以及单目训练倾向于生成不完整的3D头部重建。我们发现这个问题的根本原因在于从单目视频学习时驱动信号与目标视角之间的纠缠。为了解决这个问题,我们提出了一种基于变换器的3D肖像动画模型,该模型具有可学习的数据源标记,称为偏差汇,这使得可以在单目和多视角数据集上统一训练。此设计在推理时利用了两种数据源的优势:单目数据的强大泛化能力和多视角监督的完整3D完整性。此外,我们的训练过程产生了一个平滑的潜在avatar空间,便于身份插值和灵活适应任意数量的输入观测。在广泛的单视角、少量样本和单目avatar创建任务评估中,我们验证了FlexAvatar的有效性。许多现有方法在视图外推方面存在困难,而FlexAvatar能够生成具有逼真面部动画的完整3D头部avatar。网站:https://tobias-kirschstein.github.io/flexavatar/
Summary / 总结
FlexAvatar addresses the challenge of creating complete 3D head avatars from a single image by proposing a transformer-based model with learnable data source tokens, or bias sinks, which unify monocular and multi-view training. This method improves generalization from monocular data and ensures full 3D completeness from multi-view supervision. Extensive evaluations show that FlexAvatar outperforms existing methods in generating complete 3D head avatars with realistic facial animations, especially in view extrapolation tasks.
FlexAvatar提出了一种基于变换器的模型,并使用称为偏差汇的可学习数据源标记,统一了单目和多视角数据集上的训练,从而应对从单张图像生成完整3D头像的挑战。该方法结合了单目数据的强大泛化能力和多视角监督带来的完整3D特性。广泛的评估表明,许多现有方法在视角外推方面存在困难,而FlexAvatar能够生成具有逼真面部动画的完整3D头像。
Corrective Diffusion Language Models
Authors: Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos
First: 2025-12-17T17:04:38+00:00 · Latest: 2025-12-17T17:04:38+00:00
Comments: 18 pages
Abstract
Diffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at https://github.com/zhangshuibai/CDLM.
中文标题/摘要
标题:纠正性扩散语言模型
扩散语言模型在迭代错误修正方面结构上非常适合,因为它们非因果去噪的动力学允许序列中的任意位置被修正。然而,标准的掩码扩散语言模型(MDLM)训练无法可靠地诱导这种行为,因为模型往往无法识别输入中的不可靠标记,使得基于信心的精炼无效。我们研究了扩散语言模型的纠正行为,定义为能够对错误标记分配较低的信心并在迭代中精炼它们以保留正确内容的能力。我们表明,这种能力不是由传统的掩码扩散目标诱导的,并提出了一种纠正导向的后训练原则,该原则明确监督可见的错误标记,从而实现错误感知的信心和目标精炼。为了评估纠正行为,我们引入了代码修订基准(CRB),这是一个可控且可执行的基准,用于评估错误定位和就地修正。在代码修订任务和受控设置上的实验表明,使用我们方法训练的模型在纠正场景中的表现显著优于标准MDLM,同时也在纯完成性能上有所提升。我们的代码已公开发布在https://github.com/zhangshuibai/CDLM。
Summary / 总结
The study addresses the challenge of iterative error correction in diffusion language models by proposing a correction-oriented post-training principle. This method explicitly supervises visible incorrect tokens, enhancing models' ability to assign lower confidence to errors and iteratively refine them. Experiments show that models trained with this approach outperform standard masked diffusion language models in correction tasks and improve pure completion performance.
论文通过提出一种纠错导向的后训练原则来解决扩散语言模型中的迭代纠错问题。该方法明确监督可见的错误标记,以实现错误感知的信心和目标化修正,而这些能力不是由传统的遮蔽扩散目标诱导的。作者引入了代码修订基准(CRB)来评估纠错行为,并证明使用其方法训练的模型在纠错场景中比标准的遮蔽扩散语言模型表现更好,同时也在纯完成性能上有所提升。
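Confidence-guided refinement, as described in the abstract, amounts to re-masking the tokens the model trusts least and letting it fill them in again. This is a minimal sketch of that loop, assuming a hypothetical per-token confidence function and a hypothetical fill function. The toy versions below simply compare against a known target and stand in for a trained diffusion language model; the threshold and round limit are also illustrative assumptions.

```python
MASK = "<mask>"

def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Replace every token whose confidence is below the threshold with a mask."""
    return [tok if conf >= threshold else MASK for tok, conf in zip(tokens, confidences)]

def refine(tokens, score_fn, fill_fn, threshold=0.5, max_rounds=3):
    """Iteratively re-mask low-confidence tokens and re-predict them."""
    for _ in range(max_rounds):
        masked = remask_low_confidence(tokens, score_fn(tokens), threshold)
        if MASK not in masked:
            break  # every visible token is trusted; nothing left to revise
        tokens = fill_fn(masked)
    return tokens

# Toy stand-ins for a model (assumption): they consult a fixed target sequence.
TARGET = ["return", "a", "+", "b"]

def toy_score(tokens):
    """Hypothetical error-aware confidence: low on tokens that differ from TARGET."""
    return [0.9 if tok == ref else 0.1 for tok, ref in zip(tokens, TARGET)]

def toy_fill(masked_tokens):
    """Hypothetical denoiser: fills each masked position with the TARGET token."""
    return [ref if tok == MASK else tok for tok, ref in zip(masked_tokens, TARGET)]

buggy = ["return", "a", "-", "b"]
fixed = refine(buggy, toy_score, toy_fill)
```

The loop only helps if the scorer assigns low confidence to the wrong token in a fully visible input, which is the capability the abstract says standard masked diffusion training fails to induce.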
Geopolitics, Geoeconomics and Risk: A Machine Learning Approach
Authors: Alvaro Ortiz, Tomasa Rodrigo, Pablo Saborido
First: 2025-10-14T11:51:36+00:00 · Latest: 2025-12-17T16:40:57+00:00
Comments: This new version has an important contribution by Pablo Saborido who is now a co author of the paper
Abstract
We introduce a novel high-frequency daily panel dataset of both markets and news-based indicators -- including Geopolitical Risk, Economic Policy Uncertainty, Trade Policy Uncertainty, and Political Sentiment -- for 42 countries across both emerging and developed markets. Using this dataset, we study how sentiment dynamics shape sovereign risk, measured by Credit Default Swap (CDS) spreads, and evaluate their forecasting value relative to traditional drivers such as global monetary policy and market volatility. Our horse-race analysis of forecasting models demonstrates that incorporating news-based indicators significantly enhances predictive accuracy and enriches the analysis, with non-linear machine learning methods -- particularly Random Forests -- delivering the largest gains. Our analysis reveals that while global financial variables remain the dominant drivers of sovereign risk, geopolitical risk and economic policy uncertainty also play a meaningful role. Crucially, their effects are amplified through non-linear interactions with global financial conditions. Finally, we document pronounced regional heterogeneity, as certain asset classes and emerging markets exhibit heightened sensitivity to shocks in policy rates, global financial volatility, and geopolitical risk.
中文标题/摘要
标题:地缘政治、地缘经济与风险:机器学习方法
我们引入了一个新的高频每日面板数据集,包括市场指标和基于新闻的指标(包括地缘政治风险、经济政策不确定性、贸易政策不确定性以及政治情绪),覆盖了新兴市场和发达市场的42个国家。利用该数据集,我们研究了情绪动态如何影响以信用违约互换(CDS)利差衡量的主权风险,并评估其相对于传统驱动因素(如全球货币政策和市场波动性)的预测价值。我们对预测模型的对比分析表明,纳入基于新闻的指标显著提高了预测准确性并丰富了分析,其中非线性机器学习方法(尤其是随机森林)带来的收益最大。我们的分析表明,尽管全球金融变量仍然是主权风险的主要驱动因素,但地缘政治风险和经济政策不确定性也发挥着重要作用。关键的是,它们的影响通过与全球金融条件的非线性相互作用而被放大。最后,我们记录了明显的区域异质性,某些资产类别和新兴市场对政策利率、全球金融波动性和地缘政治风险的冲击表现出更高的敏感性。
Summary / 总结
This study introduces a high-frequency daily panel dataset of markets and news-based indicators for 42 countries, focusing on geopolitical risk, economic policy uncertainty, trade policy uncertainty, and political sentiment. Using this dataset, the research evaluates the impact of sentiment dynamics on sovereign risk, measured by CDS spreads, and finds that incorporating news-based indicators, especially through non-linear machine learning methods like Random Forests, significantly improves predictive accuracy. The analysis shows that while global financial variables are dominant, geopolitical risk and economic policy uncertainty also play a substantial role, particularly through non-linear interactions with global financial conditions, and highlights regional heterogeneity in risk sensitivity.
该研究引入了一个涵盖42个国家的高频每日面板数据集,包括市场指标和基于新闻的指标,重点关注地缘政治风险、经济政策不确定性、贸易政策不确定性以及政治情绪。利用该数据集,研究评估了情绪动态对主权风险(通过CDS利差衡量)的影响,并发现通过非线性机器学习方法(特别是随机森林)纳入新闻指标显著提高了预测准确性。分析表明,尽管全球金融变量是主要驱动因素,但地缘政治风险和经济政策不确定性也起到了重要作用,特别是在与全球金融条件的非线性交互中,研究还强调了不同地区在风险敏感性方面的差异性。
IMKD: Intensity-Aware Multi-Level Knowledge Distillation for Camera-Radar Fusion
Authors: Shashank Mishra, Karan Patil, Didier Stricker, Jason Rambach
Venue: WACV
First: 2025-12-17T16:40:52+00:00 · Latest: 2025-12-17T16:40:52+00:00
Comments: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. 22 pages, 8 figures. Includes supplementary material
Abstract
High-performance Radar-Camera 3D object detection can be achieved by leveraging knowledge distillation without using LiDAR at inference time. However, existing distillation methods typically transfer modality-specific features directly to each sensor, which can distort their unique characteristics and degrade their individual strengths. To address this, we introduce IMKD, a radar-camera fusion framework based on multi-level knowledge distillation that preserves each sensor's intrinsic characteristics while amplifying their complementary strengths. IMKD applies a three-stage, intensity-aware distillation strategy to enrich the fused representation across the architecture: (1) LiDAR-to-Radar intensity-aware feature distillation to enhance radar representations with fine-grained structural cues, (2) LiDAR-to-Fused feature intensity-guided distillation to selectively highlight useful geometry and depth information at the fusion level, fostering complementarity between the modalities rather than forcing them to align, and (3) Camera-Radar intensity-guided fusion mechanism that facilitates effective feature alignment and calibration. Extensive experiments on the nuScenes benchmark show that IMKD reaches 67.0% NDS and 61.0% mAP, outperforming all prior distillation-based radar-camera fusion methods. Our code and models are available at https://github.com/dfki-av/IMKD/.
中文标题/摘要
标题:IMKD:感知强度多级知识蒸馏的相机-雷达融合
通过利用知识蒸馏,可以在推理时无需使用LiDAR的情况下实现高性能的雷达-相机3D目标检测。然而,现有的蒸馏方法通常直接将模态特定特征传输到每个传感器,这可能会扭曲它们的独特特性并削弱它们的个体优势。为了解决这个问题,我们引入了IMKD,这是一种基于多级知识蒸馏的雷达-相机融合框架,该框架保留了每个传感器的固有特性,同时放大了它们的互补优势。IMKD 应用了三个阶段的感知强度感知蒸馏策略,以在整个架构中丰富融合表示:(1) LiDAR到Radar强度感知特征蒸馏,以增强雷达表示的细粒度结构线索;(2) LiDAR到融合特征强度引导蒸馏,以在融合级别选择性地突出有用几何和深度信息,促进模态之间的互补性而不是强制它们对齐;(3) 相机-雷达强度引导融合机制,以促进有效的特征对齐和校准。在nuScenes基准上的广泛实验表明,IMKD 达到了67.0% NDS和61.0% mAP,优于所有先前的基于蒸馏的雷达-相机融合方法。我们的代码和模型可在https://github.com/dfki-av/IMKD/上获得。
Summary / 总结
IMKD is a radar-camera fusion framework that uses multi-level knowledge distillation to preserve the unique characteristics of each sensor while enhancing their complementary strengths. It employs a three-stage, intensity-aware distillation strategy to enrich the fused representation. IMKD outperforms previous methods, achieving 67.0% NDS and 61.0% mAP on the nuScenes benchmark.
IMKD 是一种利用三阶段、基于强度的知识蒸馏策略来保留每个传感器的独特特性并增强其互补优势的雷达-摄像头融合框架。该方法包括 LiDAR 到雷达特征蒸馏、LiDAR 到融合特征蒸馏和摄像头-雷达融合机制。实验在 nuScenes 基准上显示,IMKD 达到了 67.0% 的 NDS 和 61.0% 的 mAP,超过了之前的基于蒸馏的方法。
Human-computer interactions predict mental health
Authors: Veith Weilnhammer, Jefferson Ortega, David Whitney
First: 2025-11-25T11:00:39+00:00 · Latest: 2025-12-17T16:38:21+00:00
Abstract
Scalable assessments of mental illness remain a critical roadblock toward accessible and equitable care. Here, we show that everyday human-computer interactions encode mental health with state-of-the-art biomarker precision. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA on 20,000 cursor and touchscreen recordings labelled with 1.3 million mental-health self-reports collected from 9,000 participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, identifies individuals living with mental illness, and achieves near-ceiling accuracy when predicting group-level mental health. By extracting non-verbal signatures of psychological function that have so far remained untapped, MAILA represents a key step toward scalable digital phenotyping and foundation models for mental health.
中文标题/摘要
标题:人机交互预测心理健康
可扩展的精神疾病评估仍然是实现可访问和公平护理的关键障碍。在此,我们展示了日常人机交互能够以最先进的生物标志物精度编码心理健康状态。我们引入了MAILA,这是一种从数字活动推断潜在心理状态的机器学习框架。我们使用来自9000名参与者、带有130万份心理健康自我报告的20000个鼠标和触屏记录训练了MAILA。数据集包括2000名纵向评估的个体,1500名被诊断为抑郁症的个体,以及500名被诊断为强迫症的个体。MAILA 沿着三个正交维度跟踪动态心理状态,识别患有精神疾病的人,并在预测群体心理健康时达到接近天花板的准确性。通过提取迄今为止未被利用的心理功能的非言语特征,MAILA 代表了实现可扩展的数字表型和心理健康基础模型的关键一步。
Summary / 总结
The research aims to develop scalable assessments of mental illness to improve accessible and equitable care. It introduces MAILA, a machine-learning framework that analyzes digital activity like cursor and touchscreen movements to predict mental health with high accuracy. The framework was trained on 20,000 recordings from 9,000 participants, including 2,000 longitudinally assessed individuals, 1,500 with depression, and 500 with obsessive-compulsive disorder. MAILA can track dynamic mental states and predict group-level mental health with near-perfect accuracy, offering a promising tool for digital phenotyping and mental health research.
研究旨在开发可扩展的评估精神疾病的方法,以提高精神卫生服务的可及性和公平性。研究引入了MAILA,一种机器学习框架,通过分析如鼠标和触摸屏操作等数字活动来预测精神健康状况,准确度极高。该框架基于9000名参与者共20,000次记录进行训练,其中包括2000名纵向评估的个体,1500名患有抑郁症的个体和500名患有强迫症的个体。MAILA能够追踪动态精神状态,并能以接近完美的准确度预测群体的精神健康状况,为数字表型学和精神健康研究提供了有前景的工具。
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Authors: Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
First: 2025-12-17T16:36:16+00:00 · Latest: 2025-12-17T16:36:16+00:00
Abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Code and models will be released.
中文标题/摘要
标题:MoonSeg3R:基于重建基础模型的单目在线零样本3D实例分割
在本文中,我们关注单目在线零样本3D实例分割,这是一个新颖的实际应用场景,现有方法无法在此场景下工作,因为它们依赖于带相机位姿的RGB-D序列。为克服这一限制,我们利用CUT3R,一种最近的重建基础模型(RFM),从单个RGB流中提供可靠的几何先验。我们提出了MoonSeg3R,该方法引入了三个关键组件:(1) 一个带有空间语义蒸馏的自监督查询精炼模块,将2D视觉基础模型(VFMs)的分割掩码转换为具有区分性的3D查询;(2) 一个3D查询索引记忆,通过检索上下文查询提供时间一致性;(3) 一个来自CUT3R的状态分布标记,作为掩码身份描述符以增强跨帧融合。在ScanNet200和SceneNN上的实验表明,MoonSeg3R是首个能够实现单目在线3D分割的方法,并且其性能与基于RGB-D的最新系统相当。代码和模型将被发布。
Summary / 总结
The research aims to address the challenge of online zero-shot monocular 3D instance segmentation, which existing methods cannot handle due to their reliance on posed RGB-D sequences. To overcome this, the authors propose MoonSeg3R, which uses a Reconstructive Foundation Model (RFM) to provide geometric priors from a single RGB stream. Key components include a self-supervised query refinement module, a 3D query index memory, and a state-distribution token. Experiments demonstrate that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance comparable to state-of-the-art RGB-D-based systems.
MoonSeg3R 通过利用 CUT3R 这种重建基础模型来提供来自单个 RGB 流的几何先验,解决了在线零样本单目 3D 实例分割的问题。它引入了一个自监督查询精炼模块、一个 3D 查询索引记忆体和一个状态分布标记,分别增强 3D 查询转换、时间一致性以及跨帧融合。实验表明,MoonSeg3R 是第一个能够实现在线单目 3D 分割的方法,并且其性能与基于 RGB-D 的最新系统相当。
DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
Authors: Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang, Yuwen Tang
First: 2025-11-14T06:09:37+00:00 · Latest: 2025-12-17T16:24:51+00:00
Comments: 36 pages
Abstract
The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
中文标题/摘要
标题:DiscoX:专家领域话语层面翻译任务的基准测试
尽管专家领域的话语层面翻译对知识传播和跨语言学术交流至关重要,其评估仍然不足。这些翻译需要话语层面的连贯性和严格的术语精确性,但当前的评估方法主要集中在句段级别的准确性和流畅性上。为解决这一局限,我们引入了DiscoX,这是一个新的话语层面和专家层面的中英翻译基准。它包含来自7个领域的200篇专业编纂的文本,平均长度超过1700个token。为了评估DiscoX上的性能,我们还开发了Metric-S,这是一种无参考的系统,可以提供细粒度的自动评估,涵盖准确性、流畅性和适宜性。Metric-S与人工判断的一致性很强,显著优于现有指标。我们的实验揭示了一个显著的性能差距:即使最先进的LLM在这些任务上仍然落后于人类专家。这一发现验证了DiscoX的难度,并突显了实现专业级机器翻译所面临的挑战。提出的基准和评估系统为更严格的评估提供了坚实框架,有助于基于LLM的翻译的未来进步。
Summary / 总结
The paper addresses the inadequacy of current evaluation methods for discourse-level translation in expert domains, which is crucial for knowledge dissemination. It introduces DiscoX, a new benchmark for Chinese-English translation in expert domains, consisting of 200 professionally curated texts from 7 domains. The authors also develop Metric-S, a reference-free system that evaluates accuracy, fluency, and appropriateness and shows strong consistency with human judgments. Experiments show a significant performance gap between advanced LLMs and human experts, highlighting the challenges in achieving professional-grade machine translation. This work provides a robust framework for future advancements in LLM-based translation.
论文针对专家领域中当前话语级翻译评估方法的不足,这些方法主要关注句段级别的准确性和流畅性,而非话语连贯性和术语精确性。作者引入了DiscoX,这是一个新的中英翻译基准,包含来自7个领域的200篇专业编纂的文本。作者还开发了Metric-S,这是一种无需参考的系统,用于评估翻译质量,显示了与人工判断的一致性,并优于现有指标。实验表明,最先进的LLM在这些任务上的表现仍然落后于人类专家,突显了实现专业级机器翻译的挑战。
Functional Percolation: Criticality of Form and Function
Authors: Galen J. Wilkerson
First: 2025-12-10T05:05:10+00:00 · Latest: 2025-12-17T16:20:34+00:00
Comments: 8 pages, 6 figures
Abstract
Understanding how network structure constrains and enables information processing is a central problem in the statistical mechanics of interacting systems. Here we study random networks across the structural percolation transition and analyze how connectivity governs realizable input-output transformations under cascade dynamics. Using Erdos-Renyi networks as a minimal ensemble, we examine structural, functional, and information-theoretic observables as functions of mean degree. We find that the emergence of the giant connected component coincides with a sharp transition in realizable information processing: complex input-output response functions become accessible, functional diversity increases rapidly, output entropy rises, and directed information flow, quantified by transfer entropy, extends beyond local neighborhoods. We term this coincidence of structural, functional, and informational transitions functional percolation, referring to a sharp expansion of the space of realizable input-output functions at the percolation threshold. Near criticality, networks exhibit a Pareto-optimal tradeoff between functional complexity and diversity, suggesting that percolation criticality may provide a general organizing principle of information processing capacity in systems with local interactions and propagating influences.
中文标题/摘要
标题:功能渗流:形式与功能的临界性
理解网络结构如何限制和促进信息处理是相互作用系统统计力学中的一个核心问题。在这里,我们研究跨越结构渗流相变的随机网络,并分析连接性如何在级联动力学下决定可实现的输入-输出变换。使用Erdos-Renyi网络作为最小系综,我们考察了结构、功能和信息论观测量随平均度的变化。我们发现,巨型连通分量的出现与可实现信息处理的急剧转变同时发生:复杂的输入-输出响应函数变得可实现,功能多样性迅速增加,输出熵上升,由转移熵量化的有向信息流延伸到局部邻域之外。我们将这种结构、功能和信息转变的同时出现称为功能渗流,指的是在渗流阈值处可实现输入-输出函数空间的急剧扩展。在临界点附近,网络在功能复杂性和多样性之间表现出帕累托最优的权衡,表明渗流临界性可能为具有局部相互作用和传播影响的系统的信息处理能力提供一个普遍的组织原则。
Summary / 总结
This study investigates how network structure influences information processing across the percolation transition. Using Erdos-Renyi networks, the research examines structural, functional, and information-theoretic properties as a function of mean degree. Key findings include a sharp transition in realizable input-output functions, increased functional diversity, higher output entropy, and extended directed information flow beyond local neighborhoods at the percolation threshold, termed functional percolation. Near criticality, networks show a Pareto-optimal tradeoff between functional complexity and diversity, suggesting percolation criticality as a general principle for information processing capacity in systems with local interactions and propagating influences.
研究探讨了网络结构如何影响信息处理,特别是在跨越渗流相变的过程中。使用Erdos-Renyi网络,研究了网络的结构、功能和信息论属性随平均度的变化。关键发现包括在渗流阈值处可实现的输入-输出函数出现急剧转变、功能多样性增加、输出熵提高以及有向信息流扩展到局部邻域之外,这种现象被称为功能渗流。接近临界点时,网络表现出功能复杂性和多样性之间的帕累托最优权衡,表明渗流临界性可能是具有局部相互作用和传播影响的系统中信息处理能力的一般组织原则。
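The structural side of this result, the emergence of a giant connected component as the mean degree of an Erdos-Renyi network grows, can be illustrated with a minimal sketch. It assumes the G(n, p) ensemble with p = mean_degree / (n - 1) and uses a union-find structure to track components; the function name and parameters are illustrative and not taken from the paper, and the cascade dynamics and transfer-entropy measurements are not modeled here.

```python
import random


def largest_component_fraction(n, mean_degree, rng):
    """Fraction of nodes in the largest connected component of one G(n, p) sample.

    Assumption: edges are drawn independently with p = mean_degree / (n - 1),
    so the expected degree of every node equals mean_degree.
    """
    p = mean_degree / (n - 1)
    parent = list(range(n))  # union-find forest
    size = [1] * n           # component size, valid at root nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                ri, rj = find(i), find(j)
                if ri != rj:
                    if size[ri] < size[rj]:
                        ri, rj = rj, ri
                    parent[rj] = ri  # attach the smaller tree to the larger
                    size[ri] += size[rj]
    return max(size[find(v)] for v in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    for c in (0.5, 1.0, 1.5, 3.0):
        print(c, largest_component_fraction(400, c, rng))
```

Sweeping the mean degree through 1 in this sketch shows the largest component growing from a small fraction of the nodes to most of the network, which is the structural transition the abstract pairs with its functional and informational observables.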
Evaluating Large Language Models in Scientific Discovery
Authors: Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan
First: 2025-12-17T16:20:03+00:00 · Latest: 2025-12-17T16:20:03+00:00
Abstract
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing return of scaling up model sizes and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation in research scenarios leads to changing choices of the best performing model on scientific discovery projects evaluated, suggesting all current LLMs are distant to general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.
中文标题/摘要
标题:评估大型语言模型在科学发现中的应用
大型语言模型(LLMs)在科学研究中的应用日益增多,然而现有的科学基准测试主要考察脱离具体情境的知识,忽视了驱动科学发现的迭代推理、假设生成和观察解释。我们提出了一种基于场景的基准测试,该测试在生物学、化学、材料科学和物理学领域评估LLMs,由领域专家定义真正感兴趣的研究项目,并将其分解为模块化研究场景,从中抽取经过验证的问题。该框架在两个层面评估模型:(i) 场景关联问题的准确性,(ii) 项目层面的表现,其中模型必须提出可测试的假设,设计模拟或实验,并解释结果。将这一两阶段科学发现评估(SDE)框架应用于最先进的LLMs,揭示了相对于通用科学基准测试的一致性能差距,模型规模和推理能力的提升带来的回报递减,以及来自不同提供商的顶级模型共有的系统性弱点。研究场景中的巨大性能差异导致在所评估的科学发现项目中表现最佳的模型不断变化,表明当前所有LLMs距离通用科学“超级智能”还很遥远。然而,LLMs已经在各种科学发现项目中展示了潜力,包括那些构成场景得分较低的情况,突显了引导探索和偶然性在发现中的作用。该SDE框架提供了一个可重复的基准测试,用于对LLMs进行与科学发现相关的评估,并为推进其发展以实现科学发现指明了实际路径。
Summary / 总结
The study evaluates large language models (LLMs) in scientific discovery by introducing a scenario-grounded benchmark that assesses models in biology, chemistry, materials, and physics. The benchmark evaluates models on both question-level accuracy and project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Key findings include consistently lower performance than on general science benchmarks, diminishing returns from scaling up model size and reasoning, and systematic weaknesses shared across top-tier models from different providers. LLMs nevertheless show promise in a variety of scientific discovery projects, even where the constituent scenario scores are low, which highlights the role of guided exploration and serendipity in discovery. This SDE framework provides a reproducible benchmark for evaluating LLMs in scientific discovery projects.
研究通过引入一个基于场景的基准来评估大型语言模型(LLMs)在生物学、化学、材料科学和物理学中的科学发现能力。该基准从问题级准确性和项目级性能两个方面评估模型,要求模型提出假设、设计实验并解释结果。主要发现包括与通用科学基准相比的一致性能差距、随模型规模扩大而递减的性能提升以及顶级模型之间的共同弱点。研究指出,尽管当前LLMs存在局限性,但在各种科学发现项目中仍显示出潜力,突显了引导探索和偶然性在发现中的作用。该框架提供了一个可重复的基准,用于评估LLMs在科学发现中的表现,并指导未来的发展方向。
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
Authors: Bozhou Li, Sihan Yang, Yushuo Guan, Ruichuan An, Xinlong Chen, Yang Shi, Pengfei Wan, Wentao Zhang, Yuanxing Zhang
First: 2025-12-17T16:09:43+00:00 · Latest: 2025-12-17T16:09:43+00:00
Abstract
The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
中文标题/摘要
标题:GRAN-TED:生成稳健、对齐和细腻的文本嵌入以用于扩散模型
文本编码器是文本到图像和文本到视频扩散模型的关键组件,从根本上决定了生成内容的语义保真度。然而,其开发受到了两个主要挑战的阻碍:缺乏一个高效且可靠的评估框架来预测下游生成性能,以及难以有效适应预训练语言模型以进行视觉合成。为了解决这些问题,我们提出了GRAN-TED,这是一种生成扩散模型中稳健、对齐和细腻文本嵌入的范式。我们的贡献有两个方面。首先,我们提出了TED-6K,这是一种新的纯文本基准,可以有效地评估编码器的表示质量,而无需进行昂贵的端到端模型训练。我们证明,通过轻量级、统一的适配器标准化后的TED-6K上的性能,强烈地与编码器在下游生成任务中的有效性相关。其次,在验证框架的指导下,我们开发了一种更优秀的文本编码器,采用了一种新颖的两阶段训练范式。该过程包括一个初始的在多模态大型语言模型上的微调阶段,以获得更好的视觉表示,随后是逐层加权方法以提取更细腻和有力的文本特征。我们的实验表明,GRAN-TED编码器不仅在TED-6K上达到了最先进的性能,还在文本到图像和文本到视频生成中取得了可验证的性能提升。我们的代码可在以下链接获取:https://anonymous.4open.science/r/GRAN-TED-4FCC/
Summary / 总结
The paper introduces GRAN-TED, a method to generate robust, aligned, and nuanced text embeddings for diffusion models. It addresses the challenges of evaluating text encoders and adapting pretrained language models for visual synthesis by proposing TED-6K, a novel text-only benchmark. The method involves a two-stage training process: initial fine-tuning on a multimodal large language model and layer-wise weighting to extract more nuanced features. Experiments show that GRAN-TED outperforms existing methods on TED-6K and improves performance in text-to-image and text-to-video generation tasks.
GRAN-TED通过引入TED-6K,一个仅文本基准,解决了评估文本编码器性能的效率和鲁棒性问题。它还提出了一种两阶段训练范式,首先对多模态大型语言模型进行微调以获得更好的视觉表示,然后应用逐层加权方法提取更细腻和有力的文本特征。最终生成的GRAN-TED编码器在TED-6K上表现出色,并在文本到图像和文本到视频生成任务中提高了生成质量。
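The layer-wise weighting step can be pictured with a small sketch. The abstract does not say how the weights are parameterized, so this example assumes one scalar logit per encoder layer, normalized with a softmax and used to form a weighted sum of the per-layer hidden states; all names here are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layerwise_weighted_embedding(layer_states, layer_logits):
    """Fuse per-layer hidden states into one text embedding.

    layer_states[l][t][d] is the hidden state of layer l, token t, dim d.
    layer_logits[l] is a scalar score for layer l (assumed learnable in
    practice; passed in as plain numbers in this sketch).
    Returns a [tokens][dim] matrix equal to sum_l softmax(logits)[l] * states[l].
    """
    if len(layer_states) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for w, states in zip(weights, layer_states):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * states[t][d]
    return fused


if __name__ == "__main__":
    # Two layers, one token, two dimensions; equal logits give a plain average.
    states = [[[1.0, 0.0]], [[3.0, 2.0]]]
    print(layerwise_weighted_embedding(states, [0.0, 0.0]))
```

With equal logits the fused embedding is the plain average of the layers; a large logit on one layer makes the result approach that layer's hidden states, which is how such a scheme can favor intermediate layers over the final one.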
Photonics-Enhanced Graph Convolutional Networks
Authors: Yuan Wang, Oleksandr Kyriienko
First: 2025-12-17T15:55:45+00:00 · Latest: 2025-12-17T15:55:45+00:00
Comments: 12 pages, 6 figures
Abstract
Photonics can offer a hardware-native route for machine learning (ML). However, efficient deployment of photonics-enhanced ML requires hybrid workflows that integrate optical processing with conventional CPU/GPU based neural network architectures. Here, we propose such a workflow that combines photonic positional embeddings (PEs) with advanced graph ML models. We introduce a photonics-based method that augments graph convolutional networks (GCNs) with PEs derived from light propagation on synthetic frequency lattices whose couplings match the input graph. We simulate propagation and readout to obtain internode intensity correlation matrices, which are used as PEs in GCNs to provide global structural information. Evaluated on Long Range Graph Benchmark molecular datasets, the method outperforms baseline GCNs with Laplacian based PEs, achieving $6.3\%$ lower mean absolute error for regression and $2.3\%$ higher average precision for classification tasks using a two-layer GCN as a baseline. When implemented in high repetition rate photonic hardware, correlation measurements can enable fast feature generation by bypassing digital simulation of PEs. Our results show that photonic PEs improve GCN performance and support optical acceleration of graph ML.
中文标题/摘要
标题:光子增强图卷积网络
光子学可以为机器学习(ML)提供硬件原生的途径。然而,高效部署光子增强的ML需要将光学处理与传统的基于CPU/GPU的神经网络架构相结合的混合工作流。在这里,我们提出了一种结合光子位置嵌入(PEs)与先进的图ML模型的工作流。我们介绍了一种基于光子的方法,利用光在合成频率晶格上传播得到的PEs来增强图卷积网络(GCNs),其中晶格的耦合与输入图相匹配。我们模拟传播和读出以获得节点间强度相关矩阵,将其用作GCNs中的PEs,提供全局结构信息。在Long Range Graph Benchmark分子数据集上评估,该方法优于使用基于拉普拉斯的PEs的基线GCNs,以两层GCN作为基线时,回归任务的平均绝对误差降低了6.3%,分类任务的平均精度提高了2.3%。当在高重复率的光子硬件中实现时,相关性测量可以绕过PEs的数字模拟,实现快速特征生成。我们的结果表明,光子PEs可以提高GCN性能,并支持图ML的光学加速。
Summary / 总结
The research aims to enhance graph convolutional networks (GCNs) using photonic positional embeddings (PEs) to improve machine learning performance on graph-structured data. The method integrates photonic processing with conventional neural network architectures by deriving PEs from light propagation on synthetic frequency lattices that match the input graph. Experiments on molecular datasets demonstrate that this approach outperforms traditional GCNs with Laplacian-based PEs, achieving a 6.3% lower mean absolute error for regression and a 2.3% higher average precision for classification tasks.
该论文提出了一种将光子位置嵌入与图卷积网络结合的混合工作流,以增强图结构数据上的机器学习。该方法利用合成频率格上的光波传播生成位置嵌入,然后在GCN中使用这些嵌入提供全局结构信息。在分子数据集上的评估表明,该方法在回归任务中实现了更低的平均绝对误差,在分类任务中实现了更高的平均精度,优于传统的基于拉普拉斯的位置嵌入的GCN。
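How positional embeddings enter a GCN can be sketched as follows. The example assumes the standard graph convolution rule, ReLU(D^{-1/2} (A + I) D^{-1/2} X W), with the PEs simply concatenated to the node features before the layer; the `pos_enc` matrix is a hypothetical stand-in for rows of a simulated intensity correlation matrix, and the paper's actual architecture and PE construction may differ.

```python
import math


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # degrees including self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer_with_pe(adj, features, pos_enc, weight):
    """One graph convolution over node features concatenated with PEs.

    features[i] and pos_enc[i] are per-node vectors; weight has shape
    (len(features[0]) + len(pos_enc[0]), out_dim).
    """
    x = [f + p for f, p in zip(features, pos_enc)]  # per-node concatenation
    h = matmul(matmul(normalized_adjacency(adj), x), weight)
    return [[max(0.0, v) for v in row] for row in h]  # ReLU


if __name__ == "__main__":
    adj = [[0.0, 1.0], [1.0, 0.0]]   # two nodes joined by one edge
    features = [[1.0], [3.0]]
    pos_enc = [[0.0], [2.0]]         # hypothetical PE values
    weight = [[1.0], [1.0]]
    print(gcn_layer_with_pe(adj, features, pos_enc, weight))
```

Because the layer itself only mixes information between neighbors, any global structure has to arrive through the input features, which is why the choice of PE (Laplacian-based or photonic) matters for long-range tasks.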