arXiv Paper Digest

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua

First: 2025-12-24T18:59:58+00:00 · Latest: 2025-12-24T18:59:58+00:00

Comments: Project Page: http://haonanqiu.com/projects/HiStream.html

Abstract

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

Summary

HiStream is an efficient autoregressive framework for reducing the computational cost of high-resolution video generation. It combines three strategies: spatial compression (denoising at low resolution, then refining at high resolution with cached features), temporal compression (chunk-by-chunk generation with a fixed-size anchor cache), and timestep compression (fewer denoising steps for subsequent, cache-conditioned chunks). On 1080p benchmarks, the primary HiStream model (spatial plus temporal compression) achieves state-of-the-art visual quality with up to 76.2x faster denoising than the Wan2.1 baseline. HiStream+ adds timestep compression and reaches a 107.5x acceleration over the same baseline, trading some quality for speed.
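
The temporal and timestep compression ideas can be illustrated with a toy schedule. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the cache layout (a cap on the number of cached chunks), and all numbers are hypothetical. It only shows why a fixed-size cache keeps the per-chunk attention context bounded, and how later chunks can be given fewer denoising steps.

```python
def chunk_schedule(num_chunks, chunk_tokens=100, cache_chunks=3,
                   first_steps=4, later_steps=2):
    """Return, per chunk, the attention context size and the denoising step count.

    All parameter names and default values are hypothetical.
    """
    schedule = []
    for i in range(num_chunks):
        # The cache never holds more than `cache_chunks` earlier chunks,
        # so the context stops growing once the cache is full.
        cached_tokens = min(i, cache_chunks) * chunk_tokens
        # Timestep compression: later chunks are conditioned on the cache
        # and get fewer denoising steps than the first chunk.
        steps = first_steps if i == 0 else later_steps
        schedule.append({
            "chunk": i,
            "context_tokens": cached_tokens + chunk_tokens,
            "steps": steps,
        })
    return schedule


schedule = chunk_schedule(num_chunks=10)
for entry in schedule:
    print(entry)
```

With these made-up settings the context grows for the first few chunks and then stays constant, which is the property the abstract describes as stable inference speed.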

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00

Comments: Project page: https://sytwu.github.io/BeyondMemo/

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/

Summary

The paper exposes a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings than on ordinary ones. To investigate this systematically, the authors introduce the YearGuessr dataset: 55,546 building images from 157 countries with multi-modal attributes, continuous ordinal labels of construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. They frame construction-year prediction as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify the bias. A benchmark of 30+ models, including the authors' YearCLIP, confirms that VLMs excel on popular, memorized items but struggle with unrecognized subjects, exposing a critical flaw in their reasoning capabilities.
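
One plausible reading of a popularity-aware interval accuracy is sketched below. The abstract does not give the metric's exact definition, so this is an assumption: a prediction counts as correct when it falls within a tolerance of the true construction year, and accuracy is reported separately for buildings above and below a page-view threshold. The function names, the threshold, and the sample data are hypothetical.

```python
def interval_accuracy(pred_years, true_years, tolerance):
    """Fraction of predictions within `tolerance` years of the true year."""
    hits = [abs(p - t) <= tolerance for p, t in zip(pred_years, true_years)]
    return sum(hits) / len(hits)


def popularity_gap(pred_years, true_years, page_views, tolerance, view_threshold):
    """Interval accuracy for popular vs. ordinary buildings, and their difference.

    `view_threshold` is a hypothetical cut-off; both groups must be non-empty.
    """
    popular = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
               if v >= view_threshold]
    ordinary = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
                if v < view_threshold]
    acc_popular = interval_accuracy(*zip(*popular), tolerance)
    acc_ordinary = interval_accuracy(*zip(*ordinary), tolerance)
    return acc_popular, acc_ordinary, acc_popular - acc_ordinary


# Made-up predictions for four buildings (two popular, two ordinary).
pred = [1890, 1500, 2001, 1750]
true = [1889, 1600, 2000, 1700]
views = [50000, 120, 90000, 300]
print(popularity_gap(pred, true, views, tolerance=5, view_threshold=10000))
```

A large positive gap on real data would be the kind of evidence the paper uses to argue that models rely on memorization.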

Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin

First: 2025-12-24T18:59:51+00:00 · Latest: 2025-12-24T18:59:51+00:00

Abstract

Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.

Summary

The work addresses the sensitivity of output quality in Masked Diffusion Models (MDMs) to decoding order. The authors formalize the issue, attributing the variability to cumulative predictive uncertainty along a generative path, and introduce Denoising Entropy, a computable metric that quantifies this uncertainty and serves as an internal signal for evaluating the generative process. They propose two algorithms that use the metric to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments show that these entropy-guided methods improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks.
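
The post-hoc selection idea can be sketched as follows. The abstract does not define Denoising Entropy precisely, so this sketch assumes it can be approximated by summing the Shannon entropy of the model's predictive distribution at every position decoded along a path; that choice, the data layout, and the function names are hypothetical stand-ins rather than the paper's definition.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def path_entropy(path):
    """Cumulative entropy along one decoding path.

    `path` is a list of steps; each step is a list of predictive
    distributions, one per token unmasked at that step.
    """
    return sum(token_entropy(dist) for step in path for dist in step)


def select_lowest_entropy(paths):
    """Post-hoc selection: index of the candidate path with least cumulative entropy."""
    return min(range(len(paths)), key=lambda i: path_entropy(paths[i]))


# Two made-up candidate paths over a two-symbol vocabulary.
uncertain_path = [[[0.5, 0.5]], [[0.5, 0.5]]]
confident_path = [[[0.9, 0.1]], [[0.99, 0.01]]]
print(select_lowest_entropy([uncertain_path, confident_path]))
```

A real-time guidance variant would apply the same score step by step, choosing which positions to unmask next, instead of ranking finished paths.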

Streaming Video Instruction Tuning

Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou

First: 2025-12-24T18:59:36+00:00 · Latest: 2025-12-24T18:59:36+00:00

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Summary

Streamo is a real-time streaming video LLM designed as a general-purpose interactive assistant. It handles a broad range of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To achieve this versatility, the authors construct Streamo-Instruct-465K, a large-scale instruction-following dataset for streaming video understanding. After end-to-end training, Streamo shows strong temporal reasoning and broad generalization across a variety of streaming benchmarks, bridging the gap between offline video perception models and real-time multimodal assistants.

Fast SAM2 with Text-Driven Token Pruning

Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen

First: 2025-12-24T18:59:05+00:00 · Latest: 2025-12-24T18:59:05+00:00

Comments: 28 pages, 9 figures

Abstract

Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.

Summary

This work introduces a text-guided token pruning framework that improves the inference efficiency of Segment Anything Model 2 (SAM2) for video object segmentation. Operating after visual encoding and before temporal propagation, it ranks tokens with a lightweight routing mechanism that combines local visual context, semantic relevance from textual descriptions, and uncertainty cues, and keeps only the most informative tokens, leaving the segmentation architecture unchanged. Experiments show up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage than the unpruned SAM2 baseline while preserving competitive J and F performance. The results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
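
The ranking step can be pictured as scoring each token and keeping the top fraction. The sketch below is an illustration under assumptions, not the paper's routing mechanism: it combines three per-token scores with a weighted sum, and the function name, the weights, the keep ratio, and the sample scores are all hypothetical.

```python
import math


def prune_tokens(visual, semantic, uncertainty, keep_ratio, weights=(1.0, 1.0, 1.0)):
    """Return the indices of the tokens to keep, in their original order.

    Each input list holds one score per token: local visual context,
    text-derived semantic relevance, and an uncertainty cue.
    """
    w_vis, w_sem, w_unc = weights
    scores = [w_vis * v + w_sem * s + w_unc * u
              for v, s, u in zip(visual, semantic, uncertainty)]
    # Keep at least one token; round the budget up.
    keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so downstream modules see tokens in sequence.
    return sorted(ranked[:keep])


# Made-up scores for six tokens; token 2 survives mainly through its uncertainty cue.
visual = [0.1, 0.9, 0.2, 0.8, 0.1, 0.3]
semantic = [0.0, 0.9, 0.1, 0.7, 0.0, 0.2]
uncertainty = [0.0, 0.1, 0.9, 0.1, 0.0, 0.1]
print(prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5))
```

Including the uncertainty term in the score is what lets ambiguous or boundary tokens survive even when their semantic relevance is low.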

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras

First: 2025-12-24T18:58:16+00:00 · Latest: 2025-12-24T18:58:16+00:00

Abstract

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

Summary

TICON is a transformer-based tile contextualizer that produces rich, contextualized embeddings for tiles in large whole slide images (WSIs), addressing the inability of standard tile encoder-based pipelines to capture slide-level information. It uses a single, shared encoder pretrained with a masked modeling objective to unify and contextualize representations from diverse tile-level pathology foundation models. Experiments show that TICON-contextualized embeddings significantly improve performance across many tasks, setting new state-of-the-art results on tile-level benchmarks (HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (Patho-Bench). An aggregator pretrained on TICON with only 11K WSIs forms a slide-level foundation model that outperforms state-of-the-art slide-level foundation models pretrained with up to 350K WSIs.

Parallel Token Prediction for Language Models

Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

First: 2025-12-24T18:46:55+00:00 · Latest: 2025-12-24T18:46:55+00:00

Comments: Preprint. Under review

Abstract

We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.

Summary

The paper proposes Parallel Token Prediction (PTP), a framework that jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding and avoids the restrictive independence assumptions common in existing multi-token prediction methods. The authors prove that PTP can represent arbitrary autoregressive sequence distributions; it is trained either by distilling an existing model or through inverse autoregressive training without a teacher. On Spec-Bench, PTP achieves state-of-the-art speculative decoding performance with Vicuna-7B, accepting over four tokens per step, which suggests that parallel generation of long sequences is feasible without loss of modeling power.
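
One way to read "incorporating the sampling procedure into the model" is that the randomness used for sampling becomes an explicit input. The sketch below illustrates only that underlying observation, under assumptions, and is not the PTP architecture: with inverse-CDF sampling, every token is a deterministic function of the prefix and a uniform random number, so the whole sequence is a deterministic function of the random numbers alone. The toy model and all names are hypothetical.

```python
def inverse_cdf_sample(probs, u):
    """Pick the first token whose cumulative probability exceeds u (0 <= u < 1)."""
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against floating-point round-off


def toy_next_token_probs(prefix):
    """Hypothetical two-token model whose distribution depends on the last token."""
    if not prefix:
        return [0.5, 0.5]
    return [0.9, 0.1] if prefix[-1] == 0 else [0.2, 0.8]


def sequential_decode(next_token_probs, us):
    """Ordinary autoregressive decoding driven by pre-drawn uniforms `us`."""
    tokens = []
    for u in us:
        tokens.append(inverse_cdf_sample(next_token_probs(tokens), u))
    return tokens


us = [0.3, 0.95, 0.1]
print(sequential_decode(toy_next_token_probs, us))
```

Because the output depends only on `us`, a network that receives all the uniforms at once could in principle emit several dependent tokens in one call, which is the kind of mapping a parallel predictor would have to learn.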

Variationally correct operator learning: Reduced basis neural operator with a posteriori error estimation

Authors: Yuan Qiu, Wolfgang Dahmen, Peng Chen

First: 2025-12-24T18:37:59+00:00 · Latest: 2025-12-24T18:37:59+00:00

Abstract

Minimizing PDE-residual losses is a common strategy to promote physical consistency in neural operators. However, standard formulations often lack variational correctness, meaning that small residuals do not guarantee small solution errors due to the use of non-compliant norms or ad hoc penalty terms for boundary conditions. This work develops a variationally correct operator learning framework by constructing first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms. We demonstrate this framework on stationary diffusion and linear elasticity, incorporating mixed Dirichlet-Neumann boundary conditions via variational lifts to preserve norm equivalence without inconsistent penalties. To ensure the function space conformity required by the FOSLS loss, we propose a Reduced Basis Neural Operator (RBNO). The RBNO predicts coefficients for a pre-computed, conforming reduced basis, thereby ensuring variational stability by design while enabling efficient training. We provide a rigorous convergence analysis that bounds the total error by the sum of finite element discretization bias, reduced basis truncation error, neural network approximation error, and statistical estimation errors arising from finite sampling and optimization. Numerical benchmarks validate these theoretical bounds and demonstrate that the proposed approach achieves superior accuracy in PDE-compliant norms compared to standard baselines, while the residual loss serves as a reliable, computable a posteriori error estimator.

Summary

This work addresses the lack of variational correctness in residual-based training of neural operators, where small residuals do not guarantee small solution errors. It builds first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms, and handles mixed Dirichlet-Neumann boundary conditions through variational lifts rather than inconsistent penalty terms. To meet the function-space conformity the FOSLS loss requires, the authors propose a Reduced Basis Neural Operator (RBNO) that predicts coefficients for a pre-computed, conforming reduced basis, ensuring variational stability by design while enabling efficient training. The framework is demonstrated on stationary diffusion and linear elasticity; convergence analysis and numerical benchmarks show higher accuracy in PDE-compliant norms than standard baselines, with the residual loss serving as a reliable, computable a posteriori error estimator.
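
For orientation, the textbook FOSLS formulation for stationary diffusion is written out below. It is a sketch under simplifying assumptions (homogeneous boundary data, a Dirichlet boundary of positive measure, and a bounded, uniformly positive coefficient a); the paper's exact weighting and its variational lifts for inhomogeneous data are not given in the abstract and may differ.

```latex
% Model problem (homogeneous boundary data assumed for brevity):
%   -\nabla\cdot(a\nabla u) = f  in \Omega,
%   u = 0 on \Gamma_D,  (a\nabla u)\cdot n = 0 on \Gamma_N.
\begin{aligned}
&\text{First-order system:}
  && \sigma + a\nabla u = 0, \qquad \nabla\cdot\sigma = f \quad \text{in } \Omega, \\
&\text{FOSLS functional:}
  && J(\tau, v; f) = \bigl\| a^{-1/2}(\tau + a\nabla v) \bigr\|_{L^2(\Omega)}^2
     + \bigl\| \nabla\cdot\tau - f \bigr\|_{L^2(\Omega)}^2, \\
&\text{Norm equivalence:}
  && c\, E(\tau, v) \;\le\; J(\tau, v; f) \;\le\; C\, E(\tau, v), \\
& && E(\tau, v) = \| u - v \|_{H^1(\Omega)}^2 + \| \sigma - \tau \|_{H(\mathrm{div};\Omega)}^2 .
\end{aligned}
```

Here (τ, v) ranges over conforming trial functions that satisfy the boundary conditions, and the constants c and C depend only on the coefficient and the domain. The two-sided bound is what allows the computed loss value to act as an a posteriori error estimate.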

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

Authors: Roy Turgeman, Tom Tirer

First: 2025-12-24T18:21:01+00:00 · Latest: 2025-12-24T18:21:01+00:00

Abstract

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

Summary

This paper studies when and why low-level processing before classification can be beneficial, even though the data processing inequality implies it cannot help the optimal Bayes classifier. In a binary classification setup, with a classifier that converges to the optimal Bayes classifier as the number of training samples grows, the authors prove that for any finite number of training samples there exists a pre-classification processing that improves classification accuracy. They analyze how class separation, training set size, and class balance affect the relative gain, and support the theory with empirical studies of denoising and encoding for practical deep classifiers on benchmark datasets, where the observed trends are consistent with the theoretical results.
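
The principle under discussion can be stated compactly. The notation below (label Y, observation X, processed observation Z) is introduced here for illustration and is not taken from the paper.

```latex
% Label Y, observation X, processed observation Z = T(X); T may be randomized.
\begin{aligned}
& Y \to X \to Z \ \text{(Markov chain)}
  \;\Longrightarrow\; I(Y; Z) \le I(Y; X), \\
& \text{and for the optimal Bayes classifier:}\quad P_e^{*}(Z) \ge P_e^{*}(X),
\end{aligned}
% P_e^*(.) is the minimum achievable classification error from the given input.
```

Both statements concern what is achievable with full knowledge of the distributions. The paper's result is about classifiers learned from a finite training set, where neither bound rules out a gain from preprocessing.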

Learning to Solve PDEs on Neural Shape Representations

Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

First: 2025-12-24T18:14:02+00:00 · Latest: 2025-12-24T18:14:02+00:00

Comments: Article webpage link: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/

Abstract

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.

Summary

The paper addresses solving partial differential equations (PDEs) on shapes stored as neural representations, which are increasingly common for 3D assets. It introduces a mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, so that surface PDEs can be solved directly in the neural domain. The operator integrates with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. On analytic benchmarks and real neural assets, the method slightly outperforms CPM while remaining reasonably close to FEM, and, to the authors' knowledge, it is the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations.
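
The sphere is a convenient analytic benchmark because closed-form solutions are available. The abstract does not say which test functions the authors use, so the spherical-harmonic example below is an assumption that only shows what such a reference solution looks like on the unit sphere.

```latex
% Y_\ell^m: spherical harmonics, eigenfunctions of the Laplace-Beltrami operator on S^2.
\begin{aligned}
& \Delta_{S^2} Y_\ell^m = -\ell(\ell+1)\, Y_\ell^m , \\
& \text{Heat equation:}\quad \partial_t u = \Delta_{S^2} u,\ \ u(\cdot, 0) = Y_\ell^m
  \;\Longrightarrow\; u(\cdot, t) = e^{-\ell(\ell+1)\,t}\, Y_\ell^m , \\
& \text{Poisson equation:}\quad -\Delta_{S^2} u = \ell(\ell+1)\, Y_\ell^m \ (\ell \ge 1)
  \;\Longrightarrow\; u = Y_\ell^m + \text{const}.
\end{aligned}
```

A learned solver can then be scored by its error against these expressions at sampled surface points.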

Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning

Authors: Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong

Venue: NeurIPS 2025

First: 2021-10-07T03:14:46+00:00 · Latest: 2025-12-24T17:53:45+00:00

Comments: NeurIPS 2025; Previous Version in ICML Workshop: Exploration in AI Today (EXAIT) 2025

Abstract

The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify the significance of this uncertainty-aware regularization from distributional RL on the empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.

Summary

This paper investigates the theoretical advantages of distributional reinforcement learning (RL) over classical RL by decomposing the categorical distributional loss. It finds that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization, which captures knowledge of the return distribution beyond its expectation and contributes an augmented reward signal in policy optimization. Unlike the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, this regularization implicitly updates the policy to align with (estimated) environmental uncertainty. Experiments verify the significance of this uncertainty-aware regularization for the empirical benefits of distributional RL over classical RL.

AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng

First: 2025-12-24T17:40:42+00:00 · Latest: 2025-12-24T17:40:42+00:00

Comments: 23 pages, 13 figures, 8 tables

Abs · PDF · Code1 · Code2

Abstract

Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.

中文标题/摘要

标题：AndroidLens：嵌套子目标下的长延迟评估方法用于Android GUI代理

图形用户界面（GUI）代理可以通过自动化移动设备上频繁执行的长延迟任务来显著提高生产力。然而，现有的评估基准仍然局限于有限的应用程序、简单的任务和粗粒度的指标。为了解决这一问题，我们引入了AndroidLens，这是一个针对移动GUI代理的具有挑战性的评估框架，包含571个长延迟任务，涵盖中文和英文环境，每个任务平均需要超过26步才能完成。该框架的特点包括：(1) 来自38个领域的真实世界用户场景的任务，涵盖复杂的类型如多约束、多目标和领域特定任务；(2) 静态评估保留了真实世界的异常情况，并允许多条有效路径以减少偏差；(3) 动态评估采用基于里程碑的方案，通过平均任务进度（ATP）进行细粒度的进度测量。我们的评估表明，即使是最优秀的模型也只能达到12.7%的任务成功率和50.47%的ATP。我们还强调了真实环境中的一些关键挑战，包括环境异常、自适应探索和长期记忆保留。

Summary / 总结

AndroidLens is a new evaluation framework for mobile GUI agents, addressing limitations of existing benchmarks by including 571 long-latency tasks in both Chinese and English environments. The framework features tasks from real-world scenarios, static and dynamic evaluation methods, and measures success rate and progress via ATP. Key findings show that even top models achieve only 12.7% task success and 50.47% ATP, highlighting challenges in real-world environments such as environmental anomalies and long-term memory retention.

研究引入了AndroidLens框架，包含571个跨38个领域的长延迟任务，每个任务平均需要超过26步。框架采用静态和动态评估来衡量任务成功率和进度。主要发现是，即使最佳模型也只能达到12.7%的任务成功率和50.47%的平均任务进度，突显了环境异常和长期记忆保留等挑战。
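
The abstract does not spell out how Average Task Progress (ATP) is computed. A minimal sketch, assuming that each task's progress is the fraction of its milestones reached and that ATP is the plain mean over tasks, is shown below; the function names and the episode data are hypothetical.

```python
def task_progress(milestones_reached, milestones_total):
    """Fraction of a task's milestones that the agent reached (assumed definition)."""
    if milestones_total <= 0:
        raise ValueError("a task needs at least one milestone")
    return milestones_reached / milestones_total


def average_task_progress(episodes):
    """Mean per-task progress over (reached, total) pairs."""
    return sum(task_progress(r, t) for r, t in episodes) / len(episodes)


def success_rate(episodes):
    """Share of tasks in which every milestone was reached."""
    return sum(1 for r, t in episodes if r == t) / len(episodes)


# Hypothetical results for four tasks: (milestones reached, milestones total).
episodes = [(4, 4), (2, 4), (1, 5), (0, 3)]
atp = average_task_progress(episodes)
sr = success_rate(episodes)
```

Under this reading, ATP gives partial credit to runs that fail late, which is why it can sit far above the task success rate, as in the reported 50.47% versus 12.7%.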

Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering

Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H. M. Abou-El-Enien

First: 2025-12-24T17:39:37+00:00 · Latest: 2025-12-24T17:39:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study used Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. The study then developed a novel, reaction-first evolutionary metaheuristic algorithm, together with multi-objective optimization programming, that assembles novel ligands from fragment libraries, guided by spatial alignment to the identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.

中文标题/摘要

标题：基于转录组的个性化从头药物生成用于AML：使用元启发式组装和靶向筛选

急性髓系白血病（AML）由于其极端的分子异质性和高复发率，仍然是临床挑战。尽管精准医疗引入了针对突变的治疗方法，但许多患者仍然缺乏有效的个性化选择。本文提出了一种全新的端到端计算框架，将患者特异性转录组学与从头药物发现联系起来。通过分析TCGA-LAML队列的大规模RNA测序数据，研究利用加权基因共表达网络分析（WGCNA）优先筛选出20个高价值生物标志物，包括代谢转运蛋白如HK3和免疫调节受体如SIGLEC9。这些靶点的物理结构使用AlphaFold3建模，并通过DOGSiteScorer引擎定量映射可成药热点。开发了一种新颖的反应优先进化元启发式算法以及多目标优化编程，从片段库中组装新型配体，由这些识别的热点的空间对齐引导。生成模型产生了结构上独特的化学实体，药效团空间得分峰值在0.5到0.7之间。通过ADMET表型分析和SwissDock分子对接验证，识别出高置信度候选物，如配体L1，其与A08A96生物标志物的结合自由能为-6.571 kcal/mol。这些结果表明，将系统生物学与元启发式分子组装相结合可以产生药理学上可行的、患者特异性的先导化合物，为AML和其他癌症的精准肿瘤学提供可扩展的蓝图

Summary / 总结

This study addresses the challenge of personalized drug discovery for Acute Myeloid Leukemia (AML) by integrating patient-specific transcriptomics with de novo drug generation. Using WGCNA to identify key biomarkers, AlphaFold3 for structural modeling, and DOGSiteScorer to map druggable hotspots, the framework employs a metaheuristic algorithm to assemble novel ligands from fragment libraries. Key findings include the generation of drug-like chemical entities with QED scores peaking between 0.5 and 0.7, and, after ADMET profiling and SwissDock docking, a binding free energy of -6.571 kcal/mol for Ligand L1 against the A08A96 biomarker.

该研究通过将患者特异性转录组学与从头药物生成相结合，解决了急性髓系白血病（AML）的个性化药物发现挑战。利用WGCNA识别关键生物标志物和AlphaFold3进行结构建模，框架通过元启发式算法组装新型配体。关键发现包括生成具有高ADMET评分的药物样分子，配体Ligand L1与A08A96生物标志物的结合自由能为-6.571 kcal/mol。

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Authors: Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

Venue: NeurIPS 2025

First: 2025-06-06T19:29:13+00:00 · Latest: 2025-12-24T17:26:35+00:00

Comments: 40 pages, 8 figures, NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.

中文标题/摘要

标题：交替梯度流：两层神经网络特征学习的理论

神经网络学习哪些特征以及如何学习仍然是一个开放的问题。本文引入了交替梯度流（AGF）算法框架，描述了从小型初始化训练的两层网络中特征学习的动力学。先前的研究表明，在这种情况下，梯度流表现出阶梯状的损失曲线，交替在神经元缓慢对齐到有用方向的平台期和神经元迅速增长的急剧下降期。AGF 将这种行为近似为交替的两步过程：在休眠神经元上最大化一个效用函数，在活跃神经元上最小化一个成本函数。AGF 从所有神经元都休眠开始。在每次迭代中，一个休眠的神经元激活，触发特征的获取和损失的下降。AGF 定量描述了这些下降的顺序、时间和幅度，与多个常用架构的实验结果相符。我们证明，AGF 统一并扩展了全连接线性网络和仅注意力线性变换器中已有的鞍点到鞍点分析，其中学习的特征分别是奇异模式和主成分。在对角线线性网络中，我们证明 AGF 在初始化消失的极限下收敛到梯度流。将 AGF 应用于训练以执行模块加法的二次网络，我们首次完整地描述了训练动力学，揭示了网络按系数大小递减的顺序学习傅里叶特征。总体而言，AGF 为理解神经网络中的特征学习提供了一个有希望的步骤。

Summary / 总结

The paper introduces Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer neural networks trained from small initialization. AGF approximates the staircase-like behavior of gradient flow as two alternating steps: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF quantifies the order, timing, and magnitude of loss drops, matching experiments across several architectures, and it unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers. AGF also characterizes the training dynamics of quadratic networks performing modular addition, revealing that networks learn Fourier features in decreasing order of coefficient magnitude.

本文提出了交替梯度流（AGF）算法框架，用于描述从小初始化训练的两层神经网络中的特征学习动态。AGF 将梯度流的交替行为近似为两个步骤：最大化潜伏神经元的效用函数和最小化活跃神经元的成本函数。关键实验发现表明，AGF 能够匹配各种架构中特征学习下降的顺序、时间和幅度，统一并扩展了现有的鞍点到鞍点分析，并证明在对角线线性网络中，AGF 在初始化趋于零时收敛到梯度流。AGF 还对用于执行模加的二次网络的训练动力学进行了完整表征，揭示了网络按系数大小递减顺序学习傅里叶特征。
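
The staircase-like loss curve mentioned in the abstract can be reproduced in a toy setting. The sketch below is not AGF itself: it integrates plain gradient flow (by Euler steps) for independent scalar two-layer linear units `a_i * b_i` fitted to target strengths `s_i`, all started at the same small value. The target strengths, step size, and function name are illustrative assumptions.

```python
def simulate(strengths, eps=1e-3, dt=0.01, steps=1500):
    """Euler-integrate gradient flow on 0.5 * sum_i (s_i - a_i * b_i) ** 2.

    Returns the learned products a_i * b_i and, for each unit, the first
    time at which the product reaches half of its target strength.
    """
    a = [eps] * len(strengths)
    b = [eps] * len(strengths)
    half_times = [None] * len(strengths)
    for step in range(steps):
        for i, s in enumerate(strengths):
            residual = s - a[i] * b[i]
            # Simultaneous update: both factors use the old values.
            a[i], b[i] = a[i] + dt * b[i] * residual, b[i] + dt * a[i] * residual
            if half_times[i] is None and a[i] * b[i] >= 0.5 * s:
                half_times[i] = (step + 1) * dt
    return [x * y for x, y in zip(a, b)], half_times


# Two features of different strength; the stronger one should switch on first.
products, half_times = simulate([3.0, 1.0])
```

Each unit stays near zero for a long plateau and then grows rapidly, and the unit with the larger target activates earlier, giving one loss drop per feature. This ordered, one-at-a-time activation is the behavior that AGF is designed to describe.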

Model Merging via Multi-Teacher Knowledge Distillation

Authors: Seyed Arshan Dalili, Mehrdad Mahdavi

First: 2025-12-24T17:10:44+00:00 · Latest: 2025-12-24T17:10:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at https://github.com/arshandalili/SAMerging.

中文标题/摘要

标题：多教师知识蒸馏下的模型合并

模型合并已成为联合多任务学习（MTL）的轻量级替代方案，但合并模型的泛化特性仍鲜有研究。建立此类理论保证并不容易，因为合并过程通常禁止访问原始训练数据，并涉及结合在根本上异质数据分布下训练的微调模型。在缺乏这些动态的原理性理解时，当前方法往往依赖于启发式方法来近似参数的最佳组合。这种方法在系数缩放中最为关键，即调节每个微调模型对共享参数贡献大小的权重因子。然而，由于缺乏指导其选择的原理性目标，这些方法会导致脆弱的性能，并且高度依赖于缩放初始化。我们通过（i）建立一种新的基于平坦度的PAC-Bayes泛化界，专门适用于模型合并场景。此分析引入了一个“跨任务异质性”项，正式捕捉了多种微调模型先验与目标多任务分布之间的不匹配。受此理论洞察的指导，（ii）我们将模型合并框架化为在稀缺未标记数据上的多教师知识蒸馏。我们正式证明，最小化学生-教师Kullback-Leibler散度直接收紧了合并模型超额风险的上界。受基于平坦度的界推导的指导，（iii）我们通过SAMerging方法实现这一目标，该方法使用尖锐度感知最小化（SAM）来寻找平坦的极小值。实验中，SAMerging在视觉和NLP基准测试中建立了新的最佳状态，实现了卓越的性能。代码可在https://github.com/arshandalili/SAMerging/获得。

Summary / 总结

This paper addresses the challenge of model merging by establishing a theoretical framework and proposing a novel method. The motivation is to provide theoretical guarantees for the generalization of merged models, which combine fine-tuned models trained on heterogeneous data distributions without access to the original training data. The authors introduce a flatness-aware PAC-Bayes generalization bound and frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. They operationalize this with SAMerging, which uses Sharpness-Aware Minimization to find flat minima. Experiments show that SAMerging outperforms existing methods across vision and NLP benchmarks.

该论文通过建立理论泛化界并将其问题框架化为多教师知识蒸馏来解决模型合并的挑战。作者引入了一个基于平坦度的PAC-Bayes界，以捕捉细调模型与多任务分布之间的不匹配，并提出SAMerging方法，该方法使用尖锐度感知最小化来寻找平坦的极小值。实验表明，SAMerging在视觉和自然语言处理基准测试中超越了现有方法，展示了稳健且有效的模型合并。代码可在https://github.com/arshandalili/SAMerging获得。
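
The distillation objective described in the abstract, minimizing the student-teacher Kullback-Leibler divergence across tasks, can be sketched as follows. This is an assumption-laden illustration rather than the SAMerging implementation: the KL direction (teacher to student), the uniform average over tasks, and the use of one logit vector per task are choices made here, and the Sharpness-Aware Minimization step is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def multi_teacher_kd_loss(student_logits_by_task, teacher_logits_by_task):
    """Average KL(teacher_t || student) over tasks on the same unlabeled input."""
    losses = [
        kl(softmax(t), softmax(s))
        for s, t in zip(student_logits_by_task, teacher_logits_by_task)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits: one fine-tuned teacher per task, one merged student.
teachers = [[2.0, 0.0, -1.0], [0.0, 1.5, 0.5]]
student_matching = [[2.0, 0.0, -1.0], [0.0, 1.5, 0.5]]
student_uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_matching = multi_teacher_kd_loss(student_matching, teachers)
loss_uniform = multi_teacher_kd_loss(student_uniform, teachers)
```

Because only teacher outputs are needed, the loss can be evaluated on unlabeled inputs, which fits the setting where the original training data is unavailable.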

Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Authors: Suren Bandara

First: 2025-12-24T17:10:37+00:00 · Latest: 2025-12-24T17:10:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.

中文标题/摘要

标题：基于掩膜后处理的表格分割结构坐标提取

从表格中提取结构化数据在扫描文档和数字档案的文档图像分析中起着关键作用。尽管已经提出了许多方法来检测表格结构并提取单元格内容，但在低分辨率或噪声图像中准确识别表格段边界（行和列）仍然具有挑战性。在许多实际场景中，表格数据不完整或退化，限制了基于变换器的方法对噪声输入的适应性。基于掩膜的边缘检测技术在这些条件下表现出更大的鲁棒性，因为它们的灵敏度可以通过阈值调整进行调整；然而，现有方法通常直接将掩膜应用于图像，导致噪声敏感性、分辨率损失或高计算成本。本文提出了一种新的多尺度信号处理方法，用于从表格掩膜中检测表格边缘。行和列转换被建模为一维信号，并使用逐渐增加方差的高斯卷积进行处理，然后通过统计阈值处理来抑制噪声并保留稳定的结构边缘。检测到的信号峰值被映射回图像坐标以获得准确的段边界。实验结果表明，将所提出的方法应用于列边缘检测，可以将基于布局的度量标准PubLayNet-1M基准上的Cell-Aware Segmentation Accuracy (CASA)从67%提高到76%，该度量标准评估文本正确性和正确的单元格放置。该方法通过零填充和缩放策略对分辨率变化具有鲁棒性，并生成优化的结构化表格输出，适合下游分析。

Summary / 总结

This paper addresses the challenge of accurately detecting table segment boundaries in low-resolution or noisy images, proposing a multi-scale signal-processing method for detecting table edges from table masks. By modeling row and column transitions as one-dimensional signals and applying Gaussian convolution with progressively increasing variances, the method suppresses noise while preserving stable structural edges. The approach improves Cell-Aware Segmentation Accuracy (CASA) from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR, demonstrating robustness to resolution variations and suitability for downstream analysis.

本文解决了低分辨率或噪声图像中准确检测表格边界的问题，这对于从表格中提取结构化数据至关重要。提出了一种多尺度信号处理方法，将行和列的过渡视为一维信号，并使用具有递增方差的高斯卷积和统计阈值处理来检测稳定的结构边缘。该方法在使用TableNet和PyTesseract OCR时，将Cell-Aware Segmentation Accuracy (CASA) 从67%提高到76%，在PubLayNet-1M基准上展示了其对分辨率变化的鲁棒性和适用于下游分析的特性。
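
The pipeline described above (a 1D transition signal, Gaussian smoothing at increasing scales, statistical thresholding, peak picking) can be sketched in a few lines. This is one minimal reading, not the paper's algorithm: the `mean + k * std` threshold, the rule that a peak must reappear at every scale within a small tolerance, and all names and parameter values are assumptions.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian, zero-padding both ends of the signal."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(signal) + [0.0] * radius
    return [sum(w * padded[i + j] for j, w in enumerate(kernel)) for i in range(len(signal))]


def peaks_above_threshold(signal, k):
    """Indices of local maxima that exceed mean + k * std of the signal."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / len(signal))
    threshold = mean + k * std
    found = []
    for i, v in enumerate(signal):
        left = signal[i - 1] if i > 0 else float("-inf")
        right = signal[i + 1] if i < len(signal) - 1 else float("-inf")
        if v > threshold and v > left and v >= right:
            found.append(i)
    return found


def column_transition_signal(mask):
    """Absolute change of the per-column mask sum between adjacent columns."""
    profile = [sum(column) for column in zip(*mask)]
    return [abs(profile[i + 1] - profile[i]) for i in range(len(profile) - 1)]


def detect_edges(mask, sigmas=(1.0, 2.0), k=1.0, tolerance=1):
    """Keep finest-scale peaks that also appear (within tolerance) at every coarser scale."""
    signal = column_transition_signal(mask)
    per_scale = [peaks_above_threshold(smooth(signal, s), k) for s in sigmas]
    return [
        p for p in per_scale[0]
        if all(any(abs(p - q) <= tolerance for q in peaks) for peaks in per_scale[1:])
    ]


# Toy 10 x 40 column mask: two table columns plus one stray noise pixel.
mask = [[1 if 5 <= x <= 14 or 25 <= x <= 34 else 0 for x in range(40)] for _ in range(10)]
mask[0][19] = 1
edges = detect_edges(mask)
```

An index `i` in `edges` means the boundary lies between image columns `i` and `i + 1`. The stray pixel produces a weak transition that falls below the threshold at both scales and is therefore discarded.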

Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan

First: 2025-12-24T17:05:09+00:00 · Latest: 2025-12-24T17:05:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

中文标题/摘要

标题：使用尖峰驱动视频变换器的手术场景分割及其实时潜力

现代手术系统越来越多地依赖智能场景理解以提供及时的情境感知，从而增强术中安全性。在此管道中，手术场景分割在准确感知手术事件方面发挥着核心作用。尽管最近的深度学习模型，尤其是大规模基础模型，实现了显著的分割准确性，但它们巨大的计算需求和高能耗阻碍了在资源受限的手术环境中进行实时部署。为解决这一限制，我们探索了新兴的SNN作为高效手术智能的有前途范式。然而，其性能仍受到手术标注数据稀缺和手术视频表示固有的稀疏性限制。为此，我们提出了SpikeSurgSeg，这是首个针对手术场景分割的尖峰驱动视频变换器框架，具有在非GPU平台上实现实时潜力的潜力。为解决手术标注数据有限的问题，我们引入了一种针对SNN的手术场景掩蔽自编码预训练策略，通过逐层管状掩蔽实现稳健的空间-时间表示学习。在此预训练骨干的基础上，我们进一步采用一种轻量级的尖峰驱动分割头，该头能够产生时间一致的预测，同时保持SNN的低延迟特性。在EndoVis18和我们内部的SurgBleed数据集上的广泛实验表明，SpikeSurgSeg在推断延迟方面至少减少了8倍，同时其mIoU与最先进的基于ANN的模型相当。值得注意的是，它相对于大多数基础模型基线的加速比超过20倍，突显了其在时间关键型手术场景分割中的潜力。

Summary / 总结

The research aims to develop a real-time surgical scene segmentation model for enhanced intra-operative safety, addressing the limitations of existing deep learning models in terms of computational demands and power consumption. The proposed SpikeSurgSeg framework uses a spike-driven video Transformer with a surgical-scene masked autoencoding pretraining strategy to learn robust spatiotemporal representations. It also includes a lightweight segmentation head to ensure low latency. Experiments show that SpikeSurgSeg achieves mean intersection over union (mIoU) comparable to state-of-the-art ANN-based models while reducing inference latency by at least 8 times and offering over 20 times acceleration relative to most foundation-model baselines.

研究旨在开发一种实时的手术场景分割模型，以提高术中安全性，解决当前深度学习模型在计算需求和能耗方面的限制。提出的SpikeSurgSeg框架使用尖峰驱动的视频Transformer和手术场景掩码自编码预训练策略来学习稳健的时空表示。实验结果表明，SpikeSurgSeg在平均交并比(mIoU)上与最先进的基于ANN的模型相当，同时将推理延迟减少至少8倍，并且相对于大多数基础模型基线提供了超过20倍的加速。
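
Tube masking, which the abstract names as part of the pretraining strategy, hides the same spatial patches in every frame so that a masked region cannot be recovered by copying it from a neighboring frame. The sketch below shows only this basic idea as it is commonly used in video masked autoencoders; the layer-wise variant in the paper is not described in the abstract, and the function name, grid size, and mask ratio are assumptions.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=0):
    """Return a [frames][rows][cols] boolean mask that is identical in every frame."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [
        [(r * grid_w + c) in masked for c in range(grid_w)]
        for r in range(grid_h)
    ]
    # Repeating one spatial mask along time forms the "tubes".
    return [[row[:] for row in frame_mask] for _ in range(num_frames)]


# Hypothetical clip: 4 frames, each split into a 4 x 4 patch grid, 75% masked.
mask = tube_mask(num_frames=4, grid_h=4, grid_w=4, mask_ratio=0.75)
masked_per_frame = sum(cell for row in mask[0] for cell in row)
```

Only the visible patches (those marked `False`) would be passed to the encoder, and the decoder would be trained to reconstruct the masked ones.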

SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Authors: Divij Dudeja, Mayukha Pal

First: 2025-12-24T16:59:04+00:00 · Latest: 2025-12-24T16:59:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware Fact Extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these subject-relation-object facts as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the lowest processing requirements. SMART employs two inference modes: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results, with memory limited to 64 slots). In real-world deployment, this framework yields better-supported results with fewer hallucinations than comparable small transformer models.

中文标题/摘要

标题：SMART SLM：结构化记忆与推理变换器，一种用于准确文档辅助的小型语言模型

工程手册（EM）的用户发现阅读EMs很困难，因为它们很长，格式密集，包含书面文档、逐步程序和工程设备的标准参数列表。现成的变换器，尤其是紧凑型的，将这些材料视为一个扁平的令牌流。这种方法导致了自信但错误的数字答案，并迫使模型以低效的方式记忆单独的事实。SMART（结构化记忆与推理变换器）为上述问题提供了一种不同的且实用的解决方案。SMART通过使用分层方法来结构化其处理过程，并基于三个主要工作类别：（1）语法意识事实提取（语法学家）树LSTM，从EM句子中提取作为主语关系宾语关系的事实；（2）紧凑索引记忆MANN（记忆增强神经网络），将这些理性主语关系宾语对象索引为384维向量，与信息来源相关联；（3）6层变换器，学习将之前检索到的事实融合到其生成的响应中。整个SMART模型使用45.51M参数，比GPT-2（124M）少64%，比BERT（133M）少69%，并且其准确率比GPT-2高21.3%，表明SMART以最少的处理要求更好地拟合数据。SMART采用双模式推理，已知文档的索引快速路径（亚秒级答案时间）和新上传文件的索引动态路径（借助RAGs的FAISS前20结果，记忆限制在64个槽位）。在实际部署中，该框架比可比的小型变换器模型产生更支持的结果，减少了幻觉。

Summary / 总结

The paper addresses the challenge of accurately processing Engineering Manuals (EMs) using transformers, which often treat EMs as flat token streams, leading to incorrect answers. SMART (Structured Memory and Reasoning Transformer) proposes a hierarchical approach, including a syntax-aware fact extractor, a compact indexed memory, and a transformer to integrate retrieved facts. SMART uses 45.51M parameters, fewer than GPT-2 (124M) and BERT (133M), and achieves 21.3% higher accuracy than GPT-2. It supports dual inference modes for known and new documents, giving better-supported answers with fewer hallucinations than comparable small transformer models.

论文旨在解决使用语言模型准确处理工程手册（EM）的挑战。它提出了SMART（结构化记忆和推理变换器），采用分层方法，包括语法感知的事实提取器、紧凑的索引记忆和变换器，以提高准确性。SMART使用45.51M参数，比GPT-2高出21.3%的准确性，同时需要较少的处理。它提供了两种推理模式，适用于已知和新上传的文档，从而提供更可靠的结果，减少幻觉现象，优于其他小型变换器模型。
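
The indexed memory described in the abstract stores each extracted fact as a vector tied to its source and returns the closest facts for a query. The sketch below illustrates that idea with a brute-force cosine search in place of FAISS and 3-dimensional vectors in place of 384-dimensional ones; the class name, the oldest-first eviction rule, and the example facts and sources are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class FactMemory:
    """Fixed-capacity store of (vector, fact triple, source) entries."""

    def __init__(self, max_slots=64):
        self.max_slots = max_slots
        self.slots = []

    def write(self, vector, fact, source):
        # Assumption: when full, the oldest entry is evicted.
        if len(self.slots) >= self.max_slots:
            self.slots.pop(0)
        self.slots.append((vector, fact, source))

    def retrieve(self, query, top_k=20):
        """Return up to top_k (fact, source) pairs ranked by cosine similarity."""
        ranked = sorted(self.slots, key=lambda slot: cosine(query, slot[0]), reverse=True)
        return [(fact, source) for _, fact, source in ranked[:top_k]]


# Made-up facts as (subject, relation, object) triples with made-up sources.
memory = FactMemory(max_slots=64)
memory.write([1.0, 0.0, 0.0], ("pump P-101", "max_pressure", "16 bar"), "manual.pdf#p12")
memory.write([0.0, 1.0, 0.0], ("valve V-7", "torque", "45 Nm"), "manual.pdf#p30")
top = memory.retrieve([0.9, 0.1, 0.0], top_k=1)
```

Returning the source alongside each fact is what lets the generator cite where a numeric value came from instead of recalling it from model weights.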

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller

First: 2025-12-24T16:46:04+00:00 · Latest: 2025-12-24T16:46:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.

中文标题/摘要

标题：GriDiT: 基于因子化网格的扩散方法用于高效生成长图像序列

现代深度学习方法通常将图像序列视为按顺序堆叠帧的大张量。然而，鉴于当前的最先进水平（SoTA），这种简单的表示是否理想？在本文中，我们从生成模型的角度回答了这个问题，并旨在提出一种更有效的图像序列数据建模方法。观察当前SoTA图像序列生成方法中的低效性和瓶颈，我们展示了与其处理大张量，通过先在低分辨率下生成粗略的序列，再在高分辨率下细化各个帧，可以改进生成过程。我们仅使用包含下采样帧的网格图像训练生成模型。然而，我们学习使用扩散变换器（DiT）的强自我注意机制来捕捉帧之间的相关性，从而生成图像序列。实际上，我们的建模方式将二维图像生成器扩展为低分辨率的三维图像序列生成器，而无需进行任何架构修改。随后，我们逐帧超分辨率以添加与序列无关的高分辨率细节。这种方法具有多种优势，并可以克服该领域SoTA方法的关键限制。与现有的图像序列生成模型相比，我们的方法在合成质量上表现出色，并且在序列间具有更好的连贯性。它还能够生成任意长度的高保真图像序列，并在推理时间和训练数据使用方面提高效率。此外，我们简洁的建模方式使我们的方法能够在多种数据领域中有效泛化，这通常需要额外的先验知识和监督才能在生成上下文中建模。我们的方法在数据集上始终在质量和推理速度（至少快两倍）方面优于SoTA。

Summary / 总结

The paper proposes GriDiT, a method that factorizes the generation of long image sequences into two steps: first generating a low-resolution coarse sequence and then refining individual frames at high resolution. This approach uses a Diffusion Transformer to capture frame correlations and super-resolve frames to achieve high-fidelity sequences. The method outperforms existing models in synthesis quality, coherence, and efficiency, and generalizes well across different data domains.

GriDiT通过将生成过程分解为低分辨率序列生成和高分辨率帧细化，解决了将图像序列视为大型张量的低效问题。它使用扩散变换器生成网格图像，然后逐帧超分辨率。该方法提高了合成质量、连贯性，并在质量和推理速度上超越了现有模型。
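
The grid representation in the abstract, a single image made of subsampled frames, can be illustrated with plain nested lists. The sketch below only shows the data layout (subsample, tile into a grid, split back into frames); the row-major ordering, the stride, and the function names are assumptions, and no generative model is involved.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized frames (2D lists of pixels) row-major into one grid image."""
    if len(frames) % cols != 0:
        raise ValueError("number of frames must be a multiple of cols")
    grid = []
    for start in range(0, len(frames), cols):
        row_frames = frames[start:start + cols]
        for y in range(len(row_frames[0])):
            grid.append([px for frame in row_frames for px in frame[y]])
    return grid


def grid_to_frames(grid, frame_h, frame_w):
    """Inverse of frames_to_grid: cut the grid image back into a frame sequence."""
    rows = len(grid) // frame_h
    cols = len(grid[0]) // frame_w
    return [
        [grid[r * frame_h + y][c * frame_w:(c + 1) * frame_w] for y in range(frame_h)]
        for r in range(rows)
        for c in range(cols)
    ]


# Hypothetical sequence of eight 2 x 2 frames; frame i is filled with the value i.
sequence = [[[i, i], [i, i]] for i in range(8)]
subsampled = sequence[::2]  # temporal subsampling with stride 2
grid = frames_to_grid(subsampled, cols=2)
recovered = grid_to_frames(grid, frame_h=2, frame_w=2)
```

Because the grid is an ordinary 2D image, a standard image generator can attend across all tiled frames at once, after which each recovered frame would be super-resolved separately.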

Learning to Refocus with Video Diffusion Models

Authors: SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Venue: SIGGRAPH Asia 2025

First: 2025-12-22T19:29:57+00:00 · Latest: 2025-12-24T16:32:32+00:00

Comments: Code and data are available at https://learn2refocus.github.io . SIGGRAPH Asia 2025, Dec. 2025

Abs · PDF · Code1 · Code2 · Project1

Abstract

Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at https://learn2refocus.github.io

中文标题/摘要

标题：学习使用视频扩散模型重新聚焦

对焦是摄影的基础，但自动对焦系统往往无法捕捉到预期的主体，用户经常希望在拍摄后调整对焦。我们提出了一种使用视频扩散模型进行现实后对焦的新方法。从单张失焦图像出发，我们的方法生成了一组感知上准确的焦距堆栈，表示为视频序列，支持交互式重新对焦并解锁一系列下游应用。我们发布了一个大规模的焦距堆栈数据集，以支持这项工作和未来的研究。我们的方法在感知质量和在具有挑战性的场景中的鲁棒性方面均优于现有方法，为日常摄影中的更高级对焦编辑能力铺平了道路。代码和数据可在https://learn2refocus.github.io 获取

Summary / 总结

The paper addresses the issue of inaccurate autofocus in photography and the desire to adjust focus after capturing an image. It proposes a method using video diffusion models to generate a perceptually accurate focal stack from a single defocused image, allowing for interactive refocusing. The method outperforms existing approaches in both perceptual quality and robustness, and a large-scale dataset is provided to support this work and future research.

该研究提出了一种使用视频扩散模型的新型后捕获对焦方法，可以从单张失焦图像生成感知上准确的焦距堆栈。该方法允许交互式对焦，并在摄影中有多种应用。该方法在感知质量和鲁棒性方面均优于现有技术，特别是在复杂场景中。还提供了一个大规模的焦距堆栈数据集以支持这项工作和未来的研究。

ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang

First: 2025-12-24T16:24:18+00:00 · Latest: 2025-12-24T16:24:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

中文标题/摘要

标题：ACD：通过注意力监督实现视频扩散模型的直接条件控制

在视频合成中，可控性是一个基本要求，准确对齐条件信号至关重要。现有无分类器自由引导方法通常通过建模数据和条件的联合分布间接实现条件化，这往往导致对指定条件的有限可控性。基于分类器的引导通过外部分类器强制执行条件，但模型可能会利用这种机制提高分类器得分而不真正满足预期条件，从而产生对抗性伪影并限制有效的可控性。在本文中，我们提出了一种新的框架——注意力条件扩散（ACD），通过注意力监督实现视频扩散模型的直接条件控制。通过使模型的注意力图与外部控制信号对齐，ACD 达到了更好的可控性。为此，我们引入了一种稀疏的3D感知对象布局作为高效的条件信号，以及一个专用的布局控制网和自动注释流水线，以实现可扩展的布局集成。在基准视频生成数据集上的大量实验表明，ACD 在保持时间连贯性和视觉保真度的同时，实现了与条件输入的更优对齐，建立了条件视频合成的有效范式。

Summary / 总结

The paper addresses the need for better controllability in video synthesis by proposing Attention-Conditional Diffusion (ACD), which directly aligns the model's attention maps with external control signals. ACD uses a sparse 3D-aware object layout as an efficient conditioning signal and includes a Layout ControlNet and automated annotation pipeline. Experiments show that ACD provides better alignment with conditioning inputs while maintaining temporal coherence and visual fidelity.

研究旨在通过直接使模型的注意力与外部控制信号对齐来增强视频合成中的可控性。提出的注意力条件扩散（ACD）框架使用稀疏的3D感知对象布局和布局ControlNet进行高效的条件控制。实验表明，ACD在保持时间连贯性和视觉保真度的同时，能够更好地与条件输入对齐，优于现有方法在可控性方面的表现。

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Authors: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

First: 2025-12-24T16:00:15+00:00 · Latest: 2025-12-24T16:00:15+00:00

Comments: Project Page: https://dreamontage.github.io/DreaMontage/

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

中文标题/摘要

标题：DreaMontage：任意帧引导的一次性视频生成

“一次性”技术代表了电影制作中独特而复杂的美学。然而，其实现往往受到高昂成本和复杂现实约束的阻碍。尽管新兴的视频生成模型提供了虚拟替代方案，但现有方法通常依赖于简单的片段拼接，这往往无法保持视觉连贯性和时间一致性。在本文中，我们介绍了DreaMontage，这是一种全面的框架，用于任意帧引导的生成，能够从多种用户提供的输入中合成无缝、富有表现力且长时间的一次性视频。为了实现这一目标，我们从三个主要维度应对挑战。(i) 我们将一个轻量级的中间条件机制集成到DiT架构中。通过采用一种有效的基训练数据调优策略，我们解锁了强大的任意帧控制能力。(ii) 为了提高视觉保真度和电影表现力，我们精心制作了一个高质量的数据集，并实施了一个视觉表达SFT阶段。通过解决诸如主体运动合理性和平滑过渡等关键问题，我们应用了一种定制的DPO方案，显著提高了生成内容的成功率和可用性。(iii) 为了促进长序列的生成，我们设计了一种分段自回归(SAR)推理策略，该策略在内存高效的情况下运行。广泛的实验表明，我们的方法实现了视觉上引人注目且无缝连贯的一次性效果，同时保持了计算效率，使用户能够将零散的视觉材料转化为生动、连贯的一次性电影体验。

Summary / 总结

DreaMontage is a framework for generating seamless one-shot videos guided by arbitrary user-provided frames. It integrates a lightweight intermediate-conditioning mechanism into the DiT architecture and adds a Visual Expression SFT stage to enhance visual fidelity and expressiveness. A Tailored DPO scheme targets subject-motion rationality and transition smoothness, while a Segment-wise Auto-Regressive (SAR) inference strategy produces long sequences in a memory-efficient manner. Experiments show that DreaMontage produces visually striking and temporally coherent one-shot videos while maintaining computational efficiency.

DreaMontage 是一个从任意帧生成无缝一镜头视频的框架，它整合了轻量级的中间条件机制和视觉表达 SFT 阶段以提升视觉保真度和表现力。该方法使用定制的 DPO 方案确保主体运动的合理性和平滑过渡，并采用分段自回归 (SAR) 推断策略以高效合成长序列视频。实验表明，DreaMontage 能够生成视觉上引人注目且时间上连贯的一镜头效果，同时保持计算效率，使用户能够从碎片化的视觉材料中创建生动、连贯的电影体验。

LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov

First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00

Abstract

Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between graph construction and task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io .

中文标题/摘要

标题：LookPlanGraph：基于VLM图增强的具身指令跟随方法

使用大型语言模型（LLM）作为规划器的方法在具身指令跟随任务中变得普遍。为了成功完成任务，LLM 必须与机器人所处的环境建立对应关系。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图，并假设在规划开始时所有任务相关信息都已可用。然而，这些方法没有考虑到在图构建和任务执行之间环境可能发生的变化。我们提出了 LookPlanGraph 方法，该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中，LookPlanGraph 不断通过验证现有先验或发现新实体，用相关对象更新图。这通过使用视觉语言模型处理智能体的第一人称摄像机视图来实现。我们在对象位置发生变化的 VirtualHome 和 OmniGibson 模拟环境中进行了实验，证明 LookPlanGraph 优于基于预定义静态场景图的方法。为了展示我们方法的实际适用性，我们还在现实世界中进行了实验。此外，我们引入了 GraSIF（用于指令跟随的图场景）数据集及其自动验证框架，包含来自 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。

Summary / 总结

The research aims to improve embodied instruction following by addressing the limitations of static scene graphs that do not account for environmental changes. LookPlanGraph uses a scene graph with static assets and object priors, updating it during plan execution by processing the agent's egocentric view with a Vision Language Model. Experiments in simulated and real-world environments show that LookPlanGraph outperforms methods relying on predefined static scene graphs, and the GraSIF dataset with an automated validation framework is introduced to support this approach.

该论文提出了LookPlanGraph方法，通过结合视觉语言模型来增强基于场景图的指令跟随能力。它通过在任务执行过程中不断更新场景图来解决静态场景图的局限性，从而应对环境变化。实验结果表明，LookPlanGraph在模拟和真实环境中均优于依赖预定义静态场景图的方法，特别是在物体位置发生变化的情况下表现更佳。
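
The graph-update idea described above can be illustrated with a small sketch. It is not the authors' implementation: the `SceneGraph` class, its fields, and the object names are hypothetical, and the list passed to `update` stands in for whatever a Vision Language Model would report from the agent's camera view.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Static assets (furniture, appliances) are fixed; object locations start as unverified priors.
    assets: set
    locations: dict = field(default_factory=dict)  # object -> asset (hypothetical layout)
    verified: set = field(default_factory=set)

    def update(self, asset, seen_objects):
        """Merge one observation: the objects reported as visible at `asset`."""
        if asset not in self.assets:
            raise ValueError(f"unknown asset: {asset}")
        seen = set(seen_objects)
        # Drop priors that placed an object here but were not confirmed.
        for obj, loc in list(self.locations.items()):
            if loc == asset and obj not in seen:
                del self.locations[obj]
                self.verified.discard(obj)
        # Confirm existing priors or record newly discovered objects.
        for obj in seen:
            self.locations[obj] = asset
            self.verified.add(obj)

graph = SceneGraph(assets={"kitchen_table", "fridge"},
                   locations={"apple": "kitchen_table", "milk": "fridge"})
graph.update("kitchen_table", ["mug"])     # the apple was moved; a mug is discovered
graph.update("fridge", ["milk", "apple"])  # milk prior verified; apple found here
print(graph.locations)
```

A planner built on a static graph would keep sending the robot to the kitchen table for the apple; here the stale prior is removed on the first look and the new location is recorded on the second.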

GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer

Authors: Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry

First: 2025-12-23T14:40:08+00:00 · Latest: 2025-12-24T15:28:58+00:00

Abstract

We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against Domino, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.

中文标题/摘要

标题：GeoTransolver：使用多尺度几何感知物理注意力变换器在不规则域中学习物理

我们提出了GeoTransolver，这是一种多尺度几何感知物理注意力变换器，用于CAE，它用GALE替代了标准注意力，将物理感知的自我注意力耦合到从多尺度球查询中计算出的共享几何/全局/边界条件上下文上（灵感来自DoMINO），并在每个块中重用。在NVIDIA PhysicsNeMo中实现并发布，GeoTransolver持续将几何、全局和边界条件参数投影到物理状态空间中，将潜在计算锚定到域结构和操作模式上。我们在DrivAerML、Luminary SHIFT-SUV和Luminary SHIFT-Wing上对GeoTransolver进行了基准测试，与Domino、Transolver（在PhysicsNeMo中发布）和文献中报告的AB-UPT进行比较，并评估了场变量的拖曳/升力R2和相对L1误差。GeoTransolver提供了更好的准确性、对几何/模式变化的改进鲁棒性以及有利的数据效率；我们包括了DrivAerML上的消融分析和诸如等值线图和最佳GeoTransolver模型的设计趋势等定性结果。通过在可扩展的变换器中统一多尺度几何感知上下文与基于物理的注意力，GeoTransolver促进了复杂、不规则域和非线性物理模式下的操作学习，以实现高保真代理建模。

Summary / 总结

GeoTransolver is a multiscale geometry-aware physics attention transformer designed to improve physics-based modeling on irregular domains. It uses GALE for self-attention on learned state slices and cross-attention to a shared geometry/global/boundary-condition context, which is computed from multi-scale ball queries. GeoTransolver was benchmarked on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, showing better accuracy, improved robustness to geometry and regime shifts, and favorable data efficiency compared to other methods like Domino and Transolver.

GeoTransolver 是一种多尺度几何感知物理注意变换器，旨在提高不规则域上计算流体动力学（CFD）模型的准确性和鲁棒性。它使用了 GALE，这是一种物理感知的自注意力机制，并结合了对来自多尺度球查询的共享几何/全局/边界条件上下文的交叉注意力。GeoTransolver 在 DrivAerML、Luminary SHIFT-SUV 和 Luminary SHIFT-Wing 上进行了基准测试，显示了比 Domino 和 Transolver 等其他方法更好的准确性和对几何和运行模式变化的鲁棒性，以及更好的数据效率。

SegMo: Segment-aligned Text to 3D Human Motion Generation

Authors: Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen

First: 2025-12-24T15:26:11+00:00 · Latest: 2025-12-24T15:26:11+00:00

Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abstract

Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.

中文标题/摘要

标题：SegMo: 与片段对齐的文本到3D人体动作生成

从文本描述生成3D人体动作是一个重要的研究问题，在视频游戏、虚拟现实和增强现实等领域有着广泛的应用。最近的方法在序列级别对齐文本描述和人体动作，忽略了模态的内部语义结构。然而，动作描述和动作序列可以自然地分解为更小且语义上更连贯的片段，这些片段可以作为原子对齐单元以实现更精细的对应。受此启发，我们提出了一种新颖的SegMo框架，以实现细粒度的文本-动作对齐。我们的框架由三个模块组成：(1) 文本片段提取，将复杂的文本描述分解为按时间顺序排列的短语，每个短语代表一个简单的原子动作；(2) 动作片段提取，将完整的动作序列分割为相应的动作片段；(3) 细粒度文本-动作对齐，通过对比学习对齐文本和动作片段。广泛的实验表明，SegMo在两个广泛使用的数据集上改进了强大的基线，HumanML3D测试集上的TOP 1得分为0.553。此外，由于学习到的文本和动作片段共享嵌入空间，SegMo还可以应用于检索任务，如动作定位和动作到文本检索。

Summary / 总结

SegMo is a framework for generating 3D human motions from textual descriptions. Where previous methods align text and motion only at the sequence level, SegMo aligns them at the segment level. It consists of three modules: Text Segment Extraction, Motion Segment Extraction, and Fine-grained Text-Motion Alignment. SegMo improves on a strong baseline on two widely used datasets, reaching a TOP 1 score of 0.553 on the HumanML3D test set, and its shared text-motion embedding space also supports retrieval-style tasks such as motion grounding and motion-to-text retrieval.

SegMo 是一种新颖的框架，用于从文本生成 3D 人体动作，通过在片段级别对齐文本和动作来解决先前方法的局限性。它包含三个模块：文本片段提取、动作片段提取和细粒度文本-动作对齐。SegMo 在 HumanML3D 测试集上将 TOP 1 分数提高到 0.553，并且在动作定位和动作到文本检索等检索任务中也表现出有效性。
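
The abstract states only that text and motion segments are aligned with contrastive learning. As a rough illustration, the sketch below uses a one-directional InfoNCE loss, which is one common contrastive objective and is an assumption here, not necessarily the loss SegMo uses. The toy embeddings and the temperature value are also made up.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(text_emb, motion_emb, temperature=0.07):
    """One-directional InfoNCE (text -> motion); pair i is (text_emb[i], motion_emb[i])."""
    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in motion_emb]
    loss = 0.0
    for i, ti in enumerate(t):
        # Cosine similarity of text segment i with every motion segment.
        logits = [sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(t)

# Toy embeddings: three text phrases and three motion segments.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
motion_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
motion_shuffled = motion_aligned[1:] + motion_aligned[:1]

print(info_nce(text, motion_aligned), info_nce(text, motion_shuffled))
```

The loss is small when each phrase sits closest to its own motion segment and large when the pairing is shuffled, which is the signal a segment-level alignment module would train on.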

MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller

First: 2025-12-24T15:15:18+00:00 · Latest: 2025-12-24T15:15:18+00:00

Abstract

Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.

中文标题/摘要

标题：MiST：理解中期科学训练在开发化学推理模型中的作用

大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而，最近的研究揭示了一个关键限制：强化学习仅在基础模型已为正确答案分配不可忽略的概率时才能成功，我们称这一特性为“潜在可解性”。本研究探讨了化学推理能力的出现，以及这些先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件：1）符号能力，2）潜在的化学知识。我们提出了中期科学训练（MiST）：一系列用于满足这些条件的中期训练技术，包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上继续预训练以及在10亿个标记上进行监督微调。这些步骤将3B和7B模型的潜在可解性得分最多提高至1.8倍，并使强化学习在有机反应命名中的top-1准确率从10.9%提升到63.9%，在无机材料生成中的top-1准确率从40.6%提升到67.4%。对于其他具有挑战性的化学任务，也观察到了类似的结果，同时生成了可解释的推理痕迹。我们的研究结果定义了化学推理训练的明确先决条件，并突显了中期训练在解锁推理能力中的更广泛作用。

Summary / 总结

This study investigates the role of mid-stage scientific training (MiST) in developing chemical reasoning capabilities in large language models. It identifies two prerequisites: symbolic competence and latent chemical knowledge. The proposed MiST techniques, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training, and supervised fine-tuning, significantly enhance the models' latent solvability, leading to improved performance in organic reaction naming and inorganic material generation tasks, with top-1 accuracy increasing from 10.9% to 63.9% and from 40.6% to 67.4%, respectively.

本研究探讨了中期科学训练（MiST）在开发化学推理能力中的作用。它确定了两个先决条件：符号能力与潜在的化学知识。所提出的MiST技术，包括数据混合、SMILES/CIF意识预处理、持续预训练和监督微调，显著提高了模型的潜在可解性，从而在有机反应命名和无机材料生成任务中的top-1准确率分别从10.9%提高到63.9%和从40.6%提高到67.4%。
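
The digest does not say how the paper computes its latent-solvability score. The sketch below is therefore only a proxy under stated assumptions: it samples several answers per prompt from the base model and counts the prompts whose success rate is above a threshold. The function name `solvable_fraction` and the example outcomes are hypothetical.

```python
def solvable_fraction(samples_correct, min_rate=0.0):
    """Fraction of prompts whose sampled success rate exceeds `min_rate`.

    `samples_correct` maps a prompt id to a list of booleans, one per
    sampled answer from the base model (True = matches the reference).
    """
    if not samples_correct:
        raise ValueError("need at least one prompt")
    solvable = 0
    for outcomes in samples_correct.values():
        rate = sum(outcomes) / len(outcomes)
        if rate > min_rate:
            solvable += 1
    return solvable / len(samples_correct)

# Hypothetical sampling results for three prompts, four samples each.
results = {
    "reaction_1": [False, False, True, False],   # occasionally right: RL gets some reward
    "reaction_2": [False, False, False, False],  # never right: no reward signal to learn from
    "reaction_3": [True, True, False, True],
}
print(solvable_fraction(results))
```

Under this proxy, a prompt like `reaction_2` gives rule-based RL nothing to reinforce, which is the failure mode the abstract attributes to missing symbolic competence or chemical knowledge.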

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Authors: Xiao-Qi Han, Ze-Feng Gao, Peng-Jie Guo, Zhong-Yi Lu

First: 2025-12-24T15:07:36+00:00 · Latest: 2025-12-24T15:07:36+00:00

Comments: 19 pages, 6 figures

Abstract

In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen), the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench

中文标题/摘要

标题：PhononBench：一种基于声子的大规模基准测试，用于晶体生成中的动力学稳定性

在本工作中，我们介绍了PhononBench，这是首个用于AI生成晶体动力学稳定性的大规模基准测试。利用最近开发的MatterSim原子间势，该势能在超过10,000种材料中实现了从头算水平的声子预测精度，PhononBench能够高效地进行大规模声子计算和动力学稳定性分析，针对六种领先的晶体生成模型生成的108,843种晶体结构。PhononBench揭示了当前生成模型在确保动力学稳定性方面的普遍局限性：所有生成结构的动力学稳定性平均率为25.83%，最佳模型MatterGen也仅达到41.0%。进一步的案例研究显示，在目标性质生成中——以MatterGen的带隙调节为例——即使在最佳带隙条件0.5 eV下，动力学稳定性率仍低至23.5%。在空间群控制生成中，高对称晶体表现出更好的稳定性（例如，立方系统达到49.2%的稳定性率），但所有控制生成的平均稳定性仍仅为34.4%。这项研究的重要附加成果是识别了28,119种在整个布里渊区都稳定的晶体结构，为未来的材料探索提供了大量可靠的候选者。通过建立首个大规模动力学稳定性基准测试，本工作系统地突显了当前晶体生成模型的局限性，并提供了未来开发设计和发现物理上可行材料所需的重要评估标准和指导。所有模型生成的晶体结构、声子计算结果以及PhononBench开发的高通量评估工作流程将在https://github.com/xqh19970407/PhononBench公开发布

Summary / 总结

PhononBench is the first large-scale benchmark for dynamical stability in AI-generated crystals, utilizing the MatterSim interatomic potential to perform efficient phonon calculations on 108,843 crystal structures from six leading crystal generation models. The study reveals that the average dynamical-stability rate is only 25.83%, with the best model, MatterGen, achieving 41.0%. Even in targeted property generation, the stability rate remains low, at 23.5% for optimal band-gap conditions and 34.4% on average for space-group-controlled generations. The benchmark also identifies 28,119 phonon-stable crystal structures, providing a reliable pool for materials exploration. This work highlights the current limitations of crystal generation models and offers essential evaluation criteria for their development.

PhononBench 是首个用于评估 AI 生成晶体动力学稳定性的大规模基准，利用 MatterSim 交互原子势能对六种领先晶体生成模型产生的 108,843 种晶体结构进行高效声子计算。研究显示，平均动力学稳定性仅为 25.83%，最佳模型 MatterGen 达到 41.0%。即使在目标属性生成中，稳定性率也较低，最优带隙条件下为 23.5%，空间群控制生成的平均稳定性率为 34.4%。基准还识别出 28,119 种在整个布里渊区稳定的晶体结构，为材料探索提供了可靠的数据池。这项工作揭示了当前晶体生成模型的局限性，并为它们的发展提供了必要的评估标准和指导。
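
A dynamical-stability rate like the ones quoted above can be tallied as in the sketch below, assuming a structure counts as stable when it has no imaginary phonon modes. The tolerance value, the function names, and the frequency lists are illustrative assumptions and do not come from the PhononBench workflow.

```python
def is_dynamically_stable(frequencies_thz, tolerance_thz=0.05):
    """True if no phonon mode is more negative than the tolerance.

    Imaginary modes are assumed to be reported as negative frequencies,
    a common convention in phonon codes. The tolerance is an assumed value
    that absorbs numerical noise near zero.
    """
    return min(frequencies_thz) >= -tolerance_thz

def stability_rate(structures):
    """Percentage of structures that pass the stability check."""
    stable = sum(is_dynamically_stable(freqs) for freqs in structures.values())
    return 100.0 * stable / len(structures)

# Hypothetical frequencies sampled over the Brillouin zone (THz).
generated = {
    "crystal_a": [0.0, 1.2, 3.4, 5.1],
    "crystal_b": [-2.3, 0.4, 2.8, 4.0],   # imaginary mode: unstable
    "crystal_c": [-0.01, 0.9, 2.2, 6.3],  # within numerical noise of zero
    "crystal_d": [-0.8, -0.2, 1.5, 3.3],  # imaginary modes: unstable
}
print(stability_rate(generated))
```

The choice of tolerance matters in practice: a stricter cutoff would reject `crystal_c`, so any reported rate should be read together with the threshold used.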

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

Venue: MM

First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00

Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables

Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval

中文标题/摘要

标题：利用轻量级实体提取实现可扩展的基于事件的图像检索

从自然语言描述中检索图像是一项核心任务，位于计算机视觉和自然语言处理的交叉点，广泛应用于搜索引擎、媒体归档和数字内容管理中。然而，由于模糊或依赖上下文的查询、语言的多样性以及需要可扩展的解决方案，现实世界中的图像-文本检索仍然具有挑战性。在本文中，我们提出了一种轻量级的两阶段检索管道，利用事件为中心的实体提取来结合现实世界标题中的时间与上下文信号。第一阶段使用基于显著实体的BM25高效候选过滤，第二阶段应用BEiT-3模型捕捉深层多模态语义并重新排序结果。在OpenEvents v1基准上评估，我们的方法达到了0.559的平均精度，显著优于先前的基线。这些结果突显了结合事件引导过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。

Summary / 总结

This study addresses the challenge of retrieving images from natural language descriptions by proposing a lightweight two-stage retrieval pipeline. The first stage uses BM25 based on salient entities for efficient candidate filtering, while the second stage employs BEiT-3 models to capture deep multimodal semantics and rerank the results. The method achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, outperforming previous approaches, demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex scenarios.

本文提出了一种轻量级的两阶段检索管道，以解决从自然语言描述中检索图像的挑战。第一阶段使用基于显著实体的BM25进行高效的候选过滤，第二阶段则使用BEiT-3模型捕获深度多模态语义并重新排序结果。该方法在OpenEvents v1基准测试上实现了0.559的平均精度，优于先前的方法，展示了结合事件导向过滤与长文本视觉语言建模在复杂现实场景中实现准确高效图像检索的有效性。
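
The filter-then-rerank structure described above can be sketched in a few lines. This is an illustration only, not the authors' released code. The BM25 function is a standard Okapi variant, the captions are invented, and `placeholder` is a fixed score table standing in for a vision-language reranker such as BEiT-3.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document against the query tokens."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def retrieve(query_entities, captions, rerank_score, top_k=2):
    """Stage 1: keep the top_k captions by BM25. Stage 2: reorder them with rerank_score."""
    docs = [c.lower().split() for c in captions]
    scores = bm25_scores([e.lower() for e in query_entities], docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=rerank_score, reverse=True)

captions = [
    "flood in jakarta displaces thousands",
    "jakarta marathon draws record crowd",
    "wildfire spreads across california hills",
]
# Placeholder reranker: a fixed score per image index instead of a real model.
placeholder = {0: 0.2, 1: 0.9, 2: 0.5}
ranked = retrieve(["Jakarta", "flood"], captions, placeholder.get)
print(ranked)
```

The cheap lexical stage bounds how many items the expensive model must score: caption 2 never reaches the reranker even though its placeholder score is higher than caption 0's.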

RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00

Comments: 11 pages, 6 figures

Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

中文标题/摘要

标题：RoboSafe：通过可执行安全逻辑保护具身代理

由视觉-语言模型（VLMs）驱动的具身代理越来越能够执行复杂的现实世界任务，但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为，提供了一种有前景的解决方案，因为它们具有灵活性。然而，现有的防御措施往往依赖于静态规则过滤或提示级控制，难以应对动态、时间依赖性和上下文丰富的环境中出现的隐含风险。为了解决这个问题，我们提出了一种名为RoboSafe的混合推理运行时保护，通过可执行谓词基础的安全逻辑为具身代理提供保护。RoboSafe结合了两种互补的推理过程，即混合长短期安全记忆。我们首先提出了一种后向反思推理模块，该模块不断回顾短期记忆中的最近轨迹，以推断时间安全谓词，并在检测到违规行为时主动触发重新规划。然后，我们提出了一种前瞻预测推理模块，该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预测即将出现的风险。这些组件共同形成了一个适应性强、可验证的安全逻辑，既可解释又可作为代码执行。在多个代理的广泛实验中，RoboSafe与领先基准相比显著减少了危险行为（风险发生率降低36.8%），同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。

Summary / 总结

RoboSafe is a hybrid reasoning runtime safeguard for embodied agents using executable predicate-based safety logic. It integrates Backward Reflective Reasoning and Forward Predictive Reasoning to continuously monitor and predict potential safety risks. Experiments show that RoboSafe significantly reduces hazardous actions by 36.8% compared to leading baselines while maintaining near-original task performance. Real-world evaluations on robotic arms validate its practicality.

RoboSafe 通过使用可执行的安全逻辑来保护实体代理，采用混合推理方法，包括回顾性推理模块，用于回顾近期轨迹以检测安全违规，以及预测性推理模块，基于长期记忆和多模态观察来预测风险。实验表明，RoboSafe 可以将危险行为减少 36.8%，同时保持任务性能，并且在实际机器人手臂上的评估进一步证实了其实用性。
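
To make "safety logic that is executable as code" concrete, the sketch below writes two safety predicates as plain functions and checks them before an action runs. Everything in it is hypothetical: the predicate names, the action strings, and the `guard` function are illustrations, and in RoboSafe the predicates are generated by the reasoning modules, not hand-written.

```python
def stove_left_on(history, action, observation):
    """Temporal predicate: leaving while a stove turned on earlier is still on."""
    on = False
    for past in history:
        if past == "turn_on stove":
            on = True
        elif past == "turn_off stove":
            on = False
    return on and action.startswith("leave")

def knife_near_person(history, action, observation):
    """Contextual predicate: handling a knife while a person is in view."""
    return "knife" in action and "person" in observation

PREDICATES = [stove_left_on, knife_near_person]

def guard(history, action, observation):
    """Return the names of violated predicates; an empty list means the action may run."""
    return [p.__name__ for p in PREDICATES if p(history, action, observation)]

history = ["enter kitchen", "turn_on stove", "pick_up pan"]
blocked = guard(history, "leave kitchen", {"stove", "pan"})
allowed = guard(history + ["turn_off stove"], "leave kitchen", {"stove", "pan"})
print(blocked, allowed)
```

The first predicate depends on the trajectory so far, loosely mirroring the backward-looking module, while the second depends on the current observation, loosely mirroring the forward-looking one. A non-empty result would be the trigger for replanning.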

Latent Implicit Visual Reasoning

Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

First: 2025-12-24T14:59:49+00:00 · Latest: 2025-12-24T14:59:49+00:00

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

中文标题/摘要

标题：潜在隐式视觉推理

虽然大型多模态模型（LMMs）取得了显著进展，但它们仍然主要以文本为中心，依赖语言作为核心推理模态。因此，它们在处理以视觉为主的推理任务方面能力有限。最近的方法通过使用辅助图像、深度图或图像裁剪来监督中间的视觉步骤，试图解决这一问题。然而，这些策略对“有用的”视觉抽象施加了限制性的先验，增加了注释成本，并且难以在不同任务之间泛化。为了解决这一关键限制，我们提出了一种任务无关的机制，该机制训练LMMs发现和使用视觉推理标记，而无需显式的监督。这些标记全局注意并以任务自适应的方式重新编码图像，使模型能够提取相关视觉信息，而无需手工设计的监督。我们的方法在各种视觉中心任务上优于直接微调，并且在包括那些难以指定中间抽象的任务中也达到了最先进的结果，同时还能泛化到多任务指令调优。

Summary / 总结

The research aims to enhance large multimodal models' ability to handle predominantly visual reasoning tasks by proposing a task-agnostic mechanism. This method trains models to discover and use visual reasoning tokens without explicit supervision, allowing them to re-encode images in a task-adaptive way. The approach outperforms direct fine-tuning and achieves state-of-the-art results on various vision-centric tasks, including those with hard-to-specify intermediate abstractions, and generalizes well to multi-task instruction tuning.

研究旨在通过提出一种任务无关的机制来增强大型多模态模型处理以视觉为主的推理任务的能力。该方法训练模型在没有显式监督的情况下发现和使用视觉推理令牌，使其能够以任务适应性方式重新编码图像。该方法在各种视觉中心任务上优于直接微调，实现了最先进的结果，包括那些难以指定中间抽象的任务，并且在多任务指令调优方面具有良好的泛化能力。

A study of EHVI vs fixed scalarization for molecule design

Authors: Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige

Venue: NeurIPS

First: 2025-07-18T07:12:19+00:00 · Latest: 2025-12-24T14:56:07+00:00

Comments: Accepted to NeurIPS AI4Science Workshop 2025

Abstract

Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy - Expected Hypervolume Improvement (EHVI) - against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants - including random or adaptive schemes - our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.

中文标题/摘要

标题：分子设计中EHVI与固定加权标量化比较研究

多目标贝叶斯优化（MOBO）为分子设计中的权衡提供了一个原则性的框架。然而，它与标量化替代方法的实证优势尚未得到充分探索。我们使用期望改进（EI）作为基准，对比了简单基于帕累托的MOBO策略——期望hypervolume改进（EHVI）——和一个简单的固定权重标量化基线。在严格控制的设置下，使用相同的高斯过程代理和分子表示，EHVI在三个分子优化任务中，在帕累托前沿覆盖、收敛速度和化学多样性方面始终优于标量化EI。虽然标量化包括灵活的变体——包括随机或自适应方案——我们的结果表明，即使在数据量有限的情况下，强确定性实例也可能表现不佳。这些发现为在有限评估预算和非平凡权衡时新分子优化中帕累托意识获取的实际优势提供了实证证据。

Summary / 总结

This study compares Expected Hypervolume Improvement (EHVI), a Pareto-based multi-objective Bayesian optimization (MOBO) strategy, with a fixed-weight scalarized baseline that uses Expected Improvement (EI) for molecular design. Across three molecular optimization tasks, EHVI outperformed scalarized EI in Pareto front coverage, convergence speed, and chemical diversity. The results suggest that Pareto-aware acquisition is advantageous, particularly when evaluation budgets are limited and trade-offs are nontrivial.

研究比较了在分子设计中使用多目标贝叶斯优化时，Expected Hypervolume Improvement (EHVI) 和固定加权标量化（如 Expected Improvement, EI）的有效性。结果显示，EHVI 在帕累托前沿覆盖、收敛速度和化学多样性方面优于标量化方法，尤其是在数据有限且需要处理复杂权衡时。研究还表明，即使是强大的确定性标量化方案，在低数据环境下也可能不如 EHVI 表现。这些发现支持了在分子设计中使用帕累托感知的获取策略的实际优势。
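
The difference between the two acquisition styles can be seen in a toy two-objective example. The sketch computes the deterministic hypervolume improvement of a candidate and its fixed-weight scalarized score. EHVI proper averages the former over the surrogate model's predictive distribution, which is omitted here. The points, weights, and reference point are made-up values.

```python
def pareto_front(points):
    """Non-dominated points when both objectives are maximised."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

def hypervolume(points, ref=(0.0, 0.0)):
    """Area dominated by the points and bounded below by the reference point."""
    area, best_y = 0.0, ref[1]
    # Sweep from the largest first objective; each front point adds a new strip.
    for x, y in sorted(pareto_front(points), reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def scalarize(point, weights=(0.5, 0.5)):
    """Fixed-weight sum of the two objectives."""
    return weights[0] * point[0] + weights[1] * point[1]

observed = [(0.9, 0.2), (0.2, 0.9)]
balanced, extreme = (0.5, 0.5), (1.0, 0.1)

gain_balanced = hypervolume(observed + [balanced]) - hypervolume(observed)
gain_extreme = hypervolume(observed + [extreme]) - hypervolume(observed)
print(gain_balanced, gain_extreme, scalarize(balanced), scalarize(extreme))
```

With equal weights the scalarized score prefers the extreme candidate, while the hypervolume gain prefers the balanced one that fills the gap in the front. This is the kind of coverage difference the abstract reports, shown here without any surrogate model.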

ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

Authors: Paritosh Parmar, Eric Peh, Basura Fernando

First: 2025-08-28T17:10:53+00:00 · Latest: 2025-12-24T14:52:45+00:00

Comments: Project page: https://paritoshparmar.github.io/chainreaction/

Abstract

Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets. We construct human verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

中文标题/摘要

标题：ChainReaction：因果链引导推理在模块化和可解释的因果为什么视频问答中的应用

现有的因果为什么视频问答（VideoQA）模型往往难以进行高层次推理，依赖于不透明的单一管道，将视频理解、因果推理和答案生成紧密结合在一起。这些黑盒方法缺乏可解释性，往往依赖于浅层启发式方法。我们提出了一种新的模块化范式，明确地将因果推理与答案生成分离，引入自然语言因果链作为可解释的中间表示。受人类认知模型的启发，这些结构化的因果序列将低级视频内容与高级因果推理联系起来，使推理变得透明且逻辑连贯。我们的两阶段架构包括因果链提取器（CCE），从视频-问题对中生成因果链，以及因果链驱动的答案生成器（CCDA），基于这些链生成答案。为了解决缺乏标注推理痕迹的问题，我们提出了一种生成现有数据集中准确因果链的可扩展方法。我们为46000个样本构建了经过人工验证的因果链。我们还提出了CauCo，一种新的因果导向字幕评估指标。在三个大规模基准上的实验表明，我们的方法不仅优于最先进的模型，还在可解释性、用户信任和泛化方面取得了显著提升——将CCE定位为跨不同领域的可重用因果推理引擎。项目页面：https://paritoshparmar.github.io/chainreaction/

Summary / 总结

The paper proposes ChainReaction, a modular approach to Causal-Why video question answering that decouples causal reasoning from answer generation, using natural language causal chains as interpretable intermediate representations. The two-stage architecture comprises a Causal Chain Extractor (CCE) and a Causal Chain-Driven Answerer (CCDA). The authors construct human-verified causal chains for 46K samples and introduce CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks show that the approach outperforms state-of-the-art models and improves explainability, user trust, and generalization.

研究旨在通过将因果推理与答案生成解耦来提高视频问答模型的可解释性和推理能力。提出了一种两阶段架构：因果链提取器（CCE）从视频-问题对中生成因果链，因果链驱动的答案生成器（CCDA）基于这些链生成答案。该方法在三个大规模基准测试中优于现有模型，并在可解释性、用户信任和泛化方面有所提升。研究还提出了一种从现有数据集中生成准确因果链的可扩展方法，为46K个样本构建了经过人工验证的因果链，并提出了用于因果导向字幕的新评估指标CauCo。

Causal-driven attribution (CDA): Estimating channel influence without user-level data

Authors: Georgios Filippou, Boi Mai Quach, Diana Lenghel, Arthur White, Ashish Kumar Jha

First: 2025-12-24T14:51:12+00:00 · Latest: 2025-12-24T14:51:12+00:00

Comments: 42 pages, 8 figures, submitted initially to the Journal of the Academy of Marketing Science on 24th Dec 2025

Abs · PDF · Code1 · Code2

Abstract

Attribution modelling lies at the heart of marketing effectiveness, yet most existing approaches depend on user-level path data, which are increasingly inaccessible due to privacy regulations and platform restrictions. This paper introduces a Causal-Driven Attribution (CDA) framework that infers channel influence using only aggregated impression-level data, avoiding any reliance on user identifiers or click-path tracking. CDA integrates temporal causal discovery (using PCMCI) with causal effect estimation via a Structural Causal Model to recover directional channel relationships and quantify their contributions to conversions. Using large-scale synthetic data designed to replicate real marketing dynamics, we show that CDA achieves an average relative RMSE of 9.50% when given the true causal graph, and 24.23% when using the predicted graph, demonstrating strong accuracy under correct structure and meaningful signal recovery even under structural uncertainty. CDA captures cross-channel interdependencies while providing interpretable, privacy-preserving attribution insights, offering a scalable and future-proof alternative to traditional path-based models.

中文标题/摘要

标题：因果驱动归因（CDA）：无需用户级数据估计渠道影响

归因建模是衡量营销效果的核心，但大多数现有方法依赖于用户级路径数据，这些数据因隐私法规和平台限制而变得越来越难以获取。本文介绍了一种因果驱动归因（CDA）框架，该框架仅使用聚合的曝光级数据推断渠道影响，避免依赖用户标识符或点击路径跟踪。CDA 将基于 PCMCI 的时间因果发现与通过结构因果模型进行的因果效应估计相结合，以恢复有方向的渠道关系并量化其对转化的贡献。使用设计用于复制真实营销动态的大规模合成数据，我们展示了当给定真实因果图时，CDA 的平均相对 RMSE 为 9.50%，使用预测图时为 24.23%，证明在正确结构下具有很强的准确性，并且即使在结构不确定性下也能恢复有意义的信号。CDA 捕捉跨渠道的相互依赖性，同时提供可解释的、保护隐私的归因洞察，为传统的基于路径的模型提供了一种可扩展且面向未来的替代方案。

Summary / 总结

CDA is a framework that infers channel influence using aggregated impression-level data, avoiding user-level data. It integrates temporal causal discovery and causal effect estimation to recover channel relationships and quantify their contributions. Experiments on synthetic data show CDA achieves 9.50% average relative RMSE with the true causal graph and 24.23% with the predicted graph, indicating strong accuracy and meaningful signal recovery even under structural uncertainty.

本文提出了一种因果驱动归因（CDA）框架，该框架利用聚合的曝光级数据来推断渠道影响，而不依赖于用户级信息。CDA 结合了时间因果发现和因果效应估计，以恢复渠道关系并量化其对转化的贡献。实验结果显示，CDA 在提供真实因果图时的平均相对 RMSE 为 9.50%，使用预测图时为 24.23%，即使在结构不确定性下也能实现有意义的信号恢复。
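
The abstract reports accuracy as an average relative RMSE but does not spell out the normalization. The sketch below shows one common convention, stated here as an assumption: RMSE divided by the mean absolute true value, expressed as a percentage. The function name and the sample numbers are illustrative only, and the paper's exact definition may differ.

```python
import math


def relative_rmse(true_values, estimates):
    """RMSE normalized by the mean absolute true value, in percent (assumed convention)."""
    if not true_values or len(true_values) != len(estimates):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between true and estimated channel contributions.
    mse = sum((t - e) ** 2 for t, e in zip(true_values, estimates)) / len(true_values)
    # Normalizing scale: average magnitude of the true contributions.
    scale = sum(abs(t) for t in true_values) / len(true_values)
    if scale == 0:
        raise ValueError("true values are all zero; relative error is undefined")
    return 100.0 * math.sqrt(mse) / scale


# Hypothetical true vs. estimated contributions for three channels.
print(round(relative_rmse([10.0, 20.0, 30.0], [11.0, 19.0, 30.0]), 2))
```

Averaging this value over many simulated scenarios would give a single headline figure comparable in spirit to the reported percentages, under the stated assumption about normalization.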

Human Motion Estimation with Everyday Wearables

Authors: Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang

First: 2025-12-24T14:44:51+00:00 · Latest: 2025-12-24T14:44:51+00:00

Abs · PDF · Code1 · Code2

Abstract

While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

中文标题/摘要

标题：基于日常穿戴设备的人体运动估计

基于穿戴设备的人体运动估计对于XR交互等应用至关重要，但现有方法往往存在穿戴不便、硬件昂贵和校准繁琐的问题，这阻碍了它们在日常生活中的应用。为解决这些挑战，我们提出了EveryWear，一种完全基于日常穿戴设备的轻量级、实用的人体运动捕捉方法：智能手机、智能手表、耳机以及配备一个前向摄像头和两个向下摄像头的智能眼镜，无需在使用前进行显式校准。我们引入了Ego-Elec，一个包含56种日常活动的9小时真实世界数据集，覆盖17种不同的室内和室外环境，并由运动捕捉（MoCap）提供真值三维标注，以促进该方向的稳健研究和基准测试。我们的方法采用多模态教师-学生框架，结合了第一人称摄像头的视觉线索和消费级设备的惯性信号。通过直接在真实世界数据而非合成数据上训练，我们的模型有效地消除了制约先前工作的模拟到现实的差距。实验表明，我们的方法优于基线模型，验证了其在实际全身运动估计中的有效性。

Summary / 总结

The research aims to improve human motion estimation for applications like XR interaction by addressing issues such as poor wearability and expensive hardware. EveryWear, a lightweight approach using everyday wearables like a smartphone, smartwatch, earbuds, and smart glasses, is introduced. The method employs a multimodal teacher-student framework integrating visual cues from egocentric cameras with inertial signals from consumer devices, trained on real-world data to reduce the sim-to-real gap. Experiments show that this approach outperforms baseline models for practical full-body motion estimation.

研究旨在通过解决佩戴不便和硬件昂贵等问题，改进用于XR交互等应用的人体运动估计。提出了一个轻量级的方法EveryWear，使用智能手机、智能手表、耳机和智能眼镜等日常穿戴设备。该方法采用了一种多模态教师-学生框架，结合了第一人称摄像头的视觉线索和消费级设备的惯性信号，并直接在真实世界数据上进行训练，以缩小模拟到现实的差距。实验表明，该方法优于基线模型，证明了其在实际全身运动估计中的有效性。

Analytic and Variational Stability of Deep Learning Systems

Authors: Ronald Katende

First: 2025-12-24T14:43:59+00:00 · Latest: 2025-12-24T14:43:59+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a unified analytic and variational framework for studying stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. We prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. Classical spectral stability results for feedforward networks, a discrete CFL-type condition for residual architectures, and parametric and temporal stability laws for stochastic gradient methods arise as direct consequences. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting framework provides a unified dynamical description of stability across architectures and optimization methods, clarifying how architectural and algorithmic choices jointly govern robustness and sensitivity to perturbations. It also provides a foundation for further extensions to continuous-time limits and geometric formulations of learning dynamics.

中文标题/摘要

标题：深度学习系统的解析与变分稳定性

我们提出了一种统一的解析和变分框架，用于研究被视为耦合的表示-参数动力学的深度学习系统的稳定性。核心对象是学习稳定性概貌，它跟踪表示、参数和更新机制对学习轨迹上扰动的无穷小响应。我们证明了一个基本解析稳定性定理，表明这些稳定性特征的一致有界性在范数等价的意义下，等价于存在一种沿学习流耗散的李雅普诺夫型能量。在光滑情形下，该框架给出了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数。前馈网络的经典谱稳定性结果、残差架构的离散CFL型条件以及随机梯度方法的参数和时间稳定性法则均作为直接推论出现。该理论通过用Clarke广义导数替换经典导数，并用变分李雅普诺夫泛函替换光滑能量，扩展到非光滑学习系统，包括ReLU网络、近端和投影更新以及随机次梯度流。由此产生的框架为各种架构和优化方法的稳定性提供了统一的动力学描述，阐明了架构和算法选择如何共同决定鲁棒性和对扰动的敏感性。它还为进一步扩展到连续时间极限和学习动力学的几何表述奠定了基础。

Summary / 总结

This paper introduces a unified framework for analyzing the stability of deep learning systems by examining the dynamics of representations and parameters. The central concept is the Learning Stability Profile, which measures the infinitesimal response to perturbations. The authors prove a Fundamental Analytic Stability Theorem showing that bounded stability signatures are equivalent to the existence of a Lyapunov energy that dissipates along the learning trajectory. The framework provides explicit stability exponents for smooth regimes and extends to non-smooth systems, offering a unified description of stability across different architectures and optimization methods.

本文提出了一种统一框架，通过研究表示和参数的动力学来分析深度学习系统的稳定性。核心概念是学习稳定性概貌，用于衡量系统对扰动的无穷小响应。作者证明了一个基本解析稳定性定理，表明稳定性特征的有界性等价于存在一个沿学习轨迹耗散的李雅普诺夫型能量。该框架在光滑情形下给出显式的稳定性指数，并可扩展到非光滑系统，从而统一描述了不同架构和优化方法下的稳定性。
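
To make the link between step size and contractivity concrete, here is a minimal textbook sketch, not the paper's Learning Stability Profile. It assumes a separable quadratic loss f(x) = ½ Σ λ_i x_i² with positive curvatures λ_i, for which gradient descent multiplies each coordinate by (1 − ηλ_i). The update is a contraction exactly when max_i |1 − ηλ_i| < 1, that is, when 0 < η < 2/λ_max. The function names are hypothetical.

```python
import math


def contraction_factor(curvatures, step_size):
    """Worst-case per-step shrink factor of gradient descent on the quadratic loss."""
    return max(abs(1.0 - step_size * lam) for lam in curvatures)


def gap_after_descent(curvatures, step_size, steps, x0, y0):
    """Euclidean distance between two gradient-descent runs after `steps` updates."""
    x, y = list(x0), list(y0)
    for _ in range(steps):
        # Gradient of 0.5 * lam * x_i^2 is lam * x_i, so x_i <- (1 - step * lam) * x_i.
        x = [(1.0 - step_size * lam) * xi for lam, xi in zip(curvatures, x)]
        y = [(1.0 - step_size * lam) * yi for lam, yi in zip(curvatures, y)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


curvatures = [1.0, 4.0]  # illustrative values; the stability threshold is 2 / 4.0 = 0.5
for step in (0.4, 0.6):
    print(step, contraction_factor(curvatures, step),
          gap_after_descent(curvatures, step, 10, [1.0, 1.0], [0.0, 0.0]))
```

In this toy setting the squared distance between the two runs plays the role of a dissipating energy: it decreases at every step when the factor is below one and grows once the step size crosses the threshold.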

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

First: 2025-12-18T10:21:14+00:00 · Latest: 2025-12-24T14:39:27+00:00

Comments: Project available at https://github.com/sarapapi/hearing2translate

Abs · PDF · Code1 · Code2 · Code3

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

中文标题/摘要

标题：听译：将语音模态整合到LLM中的有效性

随着大型语言模型（LLMs）超越文本，将语音整合为原生模态产生了语音LLMs，旨在直接翻译口语，从而绕过传统的基于转录的管道。然而，这种整合是否能比现有的级联架构提高语音到文本的翻译质量仍是一个开放的问题。我们提出了听译，这是第一个全面的测试套件，将5个最先进的语音LLMs与16个强大的直接系统和级联系统进行了严格的基准测试，这些系统将领先的语音基础模型（SFM）与多语言LLMs相结合。我们的分析涵盖了16个基准、13种语言对和9种具有挑战性的条件，包括不流利、嘈杂和长篇语音。在这一广泛的评估中，我们发现级联系统总体上仍然是最可靠的，当前的语音LLMs仅在某些设置中与级联系统相当，而SFM则落后于两者，这表明在模型内部或管道中整合一个LLM对于高质量的语音翻译是必不可少的。

Summary / 总结

The study investigates the effectiveness of integrating speech as a native modality into Large Language Models (LLMs), known as SpeechLLMs, for direct speech-to-text translation. It benchmarks 5 state-of-the-art SpeechLLMs against 16 direct and cascade systems across 16 benchmarks, 13 language pairs, and 9 challenging conditions. The results show that cascaded systems remain more reliable overall, while current SpeechLLMs only match cascades in certain settings, indicating that integrating an LLM is crucial for high-quality speech translation.

研究探讨了将语音作为原生模态集成到大型语言模型（LLMs）中，以实现直接的语音到文本翻译的有效性。它在16个基准、13种语言对和9种挑战条件下，对5个最先进的SpeechLLMs与16个直接和级联系统进行了基准测试。结果显示，级联系统在整体上更为可靠，而当前的SpeechLLMs仅在某些设置中与级联系统相当，表明集成一个LLM对于高质量的语音翻译至关重要。

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Authors: Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu

First: 2025-12-24T14:28:17+00:00 · Latest: 2025-12-24T14:28:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose Schrödinger's Navigator, a navigation framework inspired by Schrödinger's thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger's Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.

中文标题/摘要

标题：薛定谔的导航器：零样本物体导航的未来图景想象

零样本物体导航（ZSON）要求机器人在未见过的环境中定位目标物体，无需依赖预先构建的地图或特定任务的训练。然而，现有的ZSON方法在现实且杂乱的环境中往往难以应对，特别是在场景包含严重遮挡、未知风险或动态移动目标的情况下。为了解决这些挑战，我们提出了薛定谔的导航器，这是一种受薛定谔不确定性思想实验启发的导航框架。该框架将未观察到的空间视为一组可能的未来世界，并在行动前对其进行推理。基于第一人称视觉输入和三条候选轨迹，一个以轨迹为条件的3D世界模型沿着每条路径想象未来的观察结果。这使智能体能够看到遮挡物之后的情况，预见未见区域的风险，而无需额外绕路或密集的全局建图。想象出的3D观察结果被融合到导航图中，并用于更新价值图。这些更新引导策略选择能够避开遮挡物、减少对不确定空间的暴露并更好地追踪移动目标的轨迹。在包含严重静态遮挡、未知风险和动态移动目标的三个具有挑战性的场景中，使用四足机器人Go2进行的实验表明，在遮挡严重的环境中，薛定谔的导航器在自我定位、物体定位和整体成功率方面始终优于强大的ZSON基线。这些结果证明了以轨迹为条件的3D想象在实现稳健的零样本物体导航方面的有效性。

Summary / 总结

The paper addresses the challenge of zero-shot object navigation (ZSON) in complex and cluttered environments where the robot must locate a target object without pre-built maps or specific training. It introduces Schrödinger's Navigator, which uses a trajectory-conditioned 3D world model to imagine future observations and navigate through occluded areas and unknown risks. Experiments show that Schrödinger's Navigator outperforms existing methods in self-localization, object localization, and success rate in environments with heavy occlusions and moving targets.

论文提出了薛定谔的导航器框架，将未观察到的空间视为一组可能的未来世界，使用以轨迹为条件的3D世界模型来想象未来的观察，并将这些观察融合到导航图中，引导机器人避开遮挡和不确定的空间。实验表明，薛定谔的导航器在自我定位、目标定位和整体成功率方面优于现有方法，特别是在遮挡严重的环境中。

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

中文标题/摘要

标题：VisRes 基准：关于评估 VLM 视觉推理能力的研究

视觉-语言模型（VLMs）在视觉问答和图像描述等任务上取得了显著进展。然而，这些模型在多大程度上真正进行视觉推理，而不是依赖语言先验，目前仍不明确。为了解决这一问题，我们引入了 VisRes 基准，该基准旨在在没有上下文语言监督的自然场景中研究视觉推理。通过对三个复杂度级别的模型行为进行分析，我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其各个级别上分离出不同的推理能力。第一级考察在模糊、纹理变化、遮挡和旋转等扰动下的感知补全和全局图像匹配；第二级测试针对单一属性（如颜色、数量、方向）的基于规则的推理；第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中，我们发现最先进的 VLMs 在细微的感知扰动下表现接近随机，表明其在模式识别之外的抽象能力有限。最后，我们讨论了 VisRes 如何为多模态研究中推进抽象视觉推理提供统一框架。

Summary / 总结

The paper introduces VisRes Bench, a benchmark to evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) without contextual language supervision. By analyzing model behavior across three levels of complexity, the study reveals that state-of-the-art VLMs struggle with subtle perceptual perturbations and show limited abstraction beyond pattern recognition. VisRes Bench isolates distinct reasoning abilities and provides a unified framework for advancing abstract visual reasoning in multimodal research.

论文介绍了VisRes Bench，这是一个用于在没有上下文语言监督的情况下评估视觉-语言模型（VLM）视觉推理能力的基准。该基准分为三个复杂度级别：感知补全和全局图像匹配（Level 1）、单一属性的规则推理（Level 2）和多视觉属性的组合推理（Level 3）。研究发现，最先进的VLM在细微的感知扰动下表现接近随机，表明它们在模式识别之外的抽象能力有限。这项工作提供了一个统一的框架，以促进多模态研究中的抽象视觉推理发展。

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan

First: 2025-12-24T14:08:38+00:00 · Latest: 2025-12-24T14:08:38+00:00

Comments: 14 pages, 10 figures, Technical Report

Abs · PDF · Code1 · Code2

Abstract

In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.

中文标题/摘要

标题：UltraShape 1.0：通过可扩展的几何细化生成高保真3D形状

在本报告中，我们介绍了UltraShape 1.0，这是一种可扩展的3D扩散框架，用于高保真3D几何生成。所提出的方法采用两阶段生成管道：首先合成粗略的整体结构，然后细化以生成详细的高质量几何结构。为了支持可靠的3D生成，我们开发了一个全面的数据处理管道，包括一种新颖的水密化处理方法和高质量数据过滤。该管道通过去除低质量样本、填补孔洞和增厚薄结构，提高了公开可用3D数据集的几何质量，同时保留了精细的几何细节。为了实现精细的几何细化，我们在扩散过程中将空间定位与几何细节合成解耦。我们通过在固定的空间位置进行基于体素的细化来实现这一点，其中从粗略几何结构派生的体素查询提供了通过RoPE编码的显式位置锚点，使扩散模型能够专注于在缩小的、结构化的解空间内合成局部几何细节。我们的模型仅在公开可用的3D数据集上进行训练，尽管训练资源有限，但仍能实现很强的几何质量。广泛的评估表明，UltraShape 1.0在数据处理质量和几何生成方面与现有的开源方法相比具有竞争力。所有代码和训练模型将被发布以支持未来的研究。

Summary / 总结

UltraShape 1.0 is a scalable 3D diffusion framework for generating high-fidelity 3D geometry through a two-stage process: coarse global structure synthesis followed by detailed refinement. It includes a comprehensive data processing pipeline that improves geometric quality by removing low-quality samples, filling holes, and thickening thin structures while preserving fine details. The method decouples spatial localization from geometric detail synthesis, using voxel-based refinement with RoPE-encoded positional anchors so that the diffusion model can focus on local details. Despite limited training resources, UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation; all code and trained models are to be released.

UltraShape 1.0 是一个可扩展的 3D 扩散框架，用于生成高保真 3D 形状。它使用两阶段管道先合成粗略的全局结构，然后对其进行细化以生成详细的几何形状。该框架包括一个数据处理管道，通过移除低质量样本、填充孔洞和增厚薄结构来提高 3D 数据集的质量。扩散过程中空间定位和几何细节合成被解耦，使模型能够专注于局部细节的细化。尽管训练资源有限，UltraShape 1.0 在数据处理质量和几何生成方面与现有开源方法相比仍具有竞争力。
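
The abstract states only that voxel queries are "encoded via RoPE" and does not describe how 3D voxel coordinates are mapped to rotations. As background, the sketch below implements the standard one-dimensional rotary position embedding and checks its defining property: the dot product of two rotated vectors depends only on their relative position. The function name, the `base` value, and the sample vectors are assumptions chosen for illustration, not details from the report.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard 1D RoPE)."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair i // 2 uses frequency base ** (-i / d), so later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q, k = [1.0, 0.5, -0.3, 2.0], [0.2, -1.0, 0.7, 0.4]
# Both pairs of positions have the same offset of 2, so the two scores agree.
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

Because each pair is rotated rather than shifted, the encoding preserves vector norms, so positional information enters attention scores only through relative offsets.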

Towards Arbitrary Motion Completing via Hierarchical Continuous Representation

Authors: Chenghao Xu, Guangtao Lyu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng

First: 2025-12-24T14:07:04+00:00 · Latest: 2025-12-24T14:07:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.

中文标题/摘要

标题：通过分层连续表示实现任意运动补全

物理运动本质上是连续的，更高的相机帧率通常有助于提高平滑度和时间连贯性。我们首次探索了人类运动序列的连续表示，能够以任意帧率对任意输入运动序列进行插值、补间甚至外推。为此，我们提出了一种基于隐式神经表示（INRs）的名为NAME的新型参数激活诱导分层隐式表示框架。我们的方法引入了分层时间编码机制，从运动序列的多个时间尺度中提取特征，有效捕捉复杂的时序模式。此外，我们还将基于傅里叶变换的自定义参数激活函数集成到基于MLP的解码器中，以增强连续表示的表达能力。这种参数化形式显著增强了模型高精度表示复杂运动行为的能力。在多个基准数据集上的广泛评估表明，我们提出的方法具有有效性和鲁棒性。

Summary / 总结

The research aims to develop a method for arbitrary motion completion by exploring continuous representations of human motion sequences. The proposed method, named NAME, uses a hierarchical implicit representation framework based on Implicit Neural Representations (INRs) to capture intricate temporal patterns at multiple scales. It also integrates a parametric activation function based on Fourier transformations into the MLP-based decoder to enhance the model's expressiveness. The approach can interpolate, inbetween, and extrapolate motion sequences at arbitrary frame rates, and evaluations on several benchmark datasets demonstrate its effectiveness and robustness.

本文旨在通过探索人类运动序列的连续表示来开发任意运动补全的方法。作者提出了一种名为NAME的新型参数激活诱导分层隐式表示框架，该框架基于隐式神经表示（INRs）和分层时间编码机制，能够在多个时间尺度上捕捉复杂的时序模式。该方法还结合了一个基于傅里叶变换的自定义参数激活函数，以增强连续表示的表达能力。该方法能够以任意帧率对运动序列进行插值、补间和外推，在多个基准数据集上的评估表明其具有有效性和鲁棒性。
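
To show what querying a motion sequence at an arbitrary frame rate means in practice, the sketch below resamples a sequence of pose vectors by plain linear interpolation. This is a naive baseline and explicitly not the NAME method, which learns an implicit neural representation; the function name and the one-dimensional toy poses are assumptions made for the example.

```python
def resample_motion(frames, src_fps, dst_fps):
    """Linearly resample a list of pose vectors from `src_fps` to `dst_fps`."""
    if len(frames) < 2 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("need at least two frames and positive frame rates")
    # Number of output frames that fit inside the original duration.
    n_out = int((len(frames) - 1) * dst_fps / src_fps + 1e-9) + 1
    out = []
    for k in range(n_out):
        pos = k * src_fps / dst_fps          # query time in source-frame units
        i = min(int(pos), len(frames) - 2)   # left neighbour, clamped at the end
        w = pos - i                          # blend weight toward the right neighbour
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[i], frames[i + 1])])
    return out


# Three 1-D poses at 10 fps, resampled to 20 fps.
print(resample_motion([[0.0], [1.0], [2.0]], 10, 20))
```

Linear blending cannot extrapolate beyond the last frame and smooths away high-frequency detail, which is the kind of limitation a learned continuous representation is meant to address.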