arXiv Paper Digest

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua

First: 2025-12-24T18:59:58+00:00 · Latest: 2025-12-24T18:59:58+00:00

Comments: Project Page: http://haonanqiu.com/projects/HiStream.html

Abstract

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

中文标题/摘要

标题：HiStream：通过消除冗余的流式传输高效生成高分辨率视频

高分辨率视频生成对于数字媒体和电影至关重要，但由于扩散模型的二次复杂性，计算瓶颈使得实际推理不可行。为了解决这一问题，我们引入了HiStream，这是一种高效的自回归框架，系统地在三个维度上减少冗余：i) 空间压缩：在低分辨率去噪后再在高分辨率处使用缓存特征进行细化；ii) 时间压缩：分块处理策略，带有固定大小的锚点缓存，确保稳定的推理速度；iii) 时间步压缩：对后续的、基于缓存条件的块应用较少的去噪步骤。在1080p基准测试中，我们的主要HiStream模型（i+ii）实现了最先进的视觉质量，同时与Wan2.1基线相比，去噪速度提高了76.2倍，且几乎无质量损失。我们的更快变体HiStream+应用了所有三种优化（i+ii+iii），相对于基线实现了107.5倍的加速，提供了速度和质量之间的权衡，从而使得高分辨率视频生成既实用又可扩展。

Summary / 总结

HiStream is an efficient autoregressive framework that cuts the cost of high-resolution video generation by removing redundancy along three axes: denoising at low resolution and then refining at high resolution with cached features (spatial), generating chunk by chunk with a fixed-size anchor cache (temporal), and applying fewer denoising steps to later, cache-conditioned chunks (timestep). On 1080p benchmarks, the primary HiStream model, which uses the spatial and temporal components, denoises up to 76.2x faster than the Wan2.1 baseline with negligible quality loss. HiStream+ adds timestep compression and reaches a 107.5x acceleration, trading a small amount of quality for speed.

HiStream 是一种高效的自回归框架，旨在减少高分辨率视频生成的计算复杂性。它通过在低分辨率下压缩去噪、使用固定大小的锚点缓存进行时间压缩以及减少去噪步骤来实现这一目标。在 1080p 基准测试中，HiStream 的去噪速度比 Wan2.1 基线快 76.2 倍且几乎无质量损失，而 HiStream+ 进一步加速到 107.5 倍，尽管在质量上略有妥协，但使得高分辨率视频生成既实用又可扩展。
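
The effect of a fixed-size cache on per-chunk cost can be illustrated with a toy calculation. The sketch below is not the HiStream implementation: the function names, the token counts, and the simple "cap the cache at `cache_len` tokens" rule are all assumptions made for illustration. It only shows why a bounded cache keeps the attention context, and hence the per-chunk cost, from growing with video length.

```python
def bounded_context(num_chunks, chunk_len, cache_len):
    """Tokens attended to per chunk when the cache holds at most cache_len tokens.

    Hypothetical model: each new chunk attends to itself plus the cache.
    """
    sizes = []
    seen = 0  # tokens generated before the current chunk
    for _ in range(num_chunks):
        sizes.append(min(seen, cache_len) + chunk_len)
        seen += chunk_len
    return sizes


def full_history_context(num_chunks, chunk_len):
    """Tokens attended to per chunk when every earlier chunk is kept."""
    return [(i + 1) * chunk_len for i in range(num_chunks)]


# With a bounded cache the context plateaus; with full history it keeps growing.
print(bounded_context(5, 100, 150))   # [100, 200, 250, 250, 250]
print(full_history_context(5, 100))   # [100, 200, 300, 400, 500]
```

Because self-attention cost grows roughly quadratically with context length, a plateau in context size translates into a stable per-chunk inference time, which is the property the abstract attributes to the anchor cache.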

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00

Comments: Project page: https://sytwu.github.io/BeyondMemo/

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/

中文标题/摘要

标题：超越记忆：多模态序数回归基准以揭示视觉语言模型中的流行度偏差

我们揭示了最先进的视觉语言模型（VLMs）中存在显著的流行度偏差，这些模型在著名建筑上的准确率比普通建筑高出34%，表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题，我们引入了该任务上最大的开放基准数据集：YearGuessr数据集，包含来自157个国家的55,546张建筑图像，具有多模态属性，并附有其建设年份的连续序数标签（1001-2024）、GPS数据和页面浏览量作为流行度的代理。使用该数据集，我们将建筑年份预测任务框架化为序数回归，并引入了流行度感知的区间准确度指标来量化这种偏差。我们构建的包含30多个模型的基准，包括我们的YearCLIP模型，证实了VLMs在流行、记忆化的项目上表现出色，但在未识别的主题上却面临重大挑战，揭示了它们推理能力中的关键缺陷。项目页面：https://sytwu.github.io/BeyondMemo/

Summary / 总结

The paper addresses the significant popularity bias in state-of-the-art vision-language models (VLMs), which perform better on famous buildings than ordinary ones. To systematically investigate this, the authors introduce the YearGuessr dataset, comprising 55,546 building images with multi-modal attributes and continuous ordinal labels of construction years. Using this dataset, they frame the task as ordinal regression and introduce new metrics to quantify the bias. The benchmark of 30+ models, including YearCLIP, confirms that VLMs excel on popular items but struggle with unrecognized subjects, highlighting a critical flaw in their reasoning capabilities.

研究揭示了最先进的视觉-语言模型存在显著的流行度偏差，对著名建筑的性能优于普通建筑。为解决这一问题，研究人员创建了包含55,546张建筑图片的YearGuessr数据集，这些图片具有多模态属性和连续的按年份排序标签。使用该数据集评估了30多种模型，包括YearCLIP，发现模型在识别流行建筑方面表现出色，但在不知名的建筑上却表现不佳，这揭示了其推理能力的一个关键缺陷。
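
A popularity-aware interval accuracy can be sketched in a few lines. The abstract does not give the exact metric definition, so the version below is an assumption: a prediction counts as correct when it falls within `tolerance` years of the true construction year, and the bias is reported as the accuracy gap between buildings above and below a hypothetical page-view threshold.

```python
def interval_accuracy(pred_years, true_years, tolerance):
    """Fraction of predictions within +/- tolerance years of the label."""
    hits = [abs(p - t) <= tolerance for p, t in zip(pred_years, true_years)]
    return sum(hits) / len(hits)


def popularity_gap(pred_years, true_years, page_views, tolerance, view_threshold):
    """Accuracy on popular minus accuracy on ordinary buildings.

    Assumes both groups are non-empty; view_threshold is a hypothetical cutoff.
    """
    popular = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
               if v >= view_threshold]
    ordinary = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
                if v < view_threshold]
    acc_popular = interval_accuracy([p for p, _ in popular],
                                    [t for _, t in popular], tolerance)
    acc_ordinary = interval_accuracy([p for p, _ in ordinary],
                                     [t for _, t in ordinary], tolerance)
    return acc_popular - acc_ordinary


# Made-up example values, not taken from YearGuessr.
preds = [1890, 1700, 1950, 2001]
trues = [1889, 1650, 1955, 1960]
views = [10000, 50, 8000, 20]
print(popularity_gap(preds, trues, views, tolerance=5, view_threshold=1000))  # 1.0
```

A gap near zero would indicate that accuracy does not depend on how well known a building is; a large positive gap is the memorization signature the paper describes.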

Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin

First: 2025-12-24T18:59:51+00:00 · Latest: 2025-12-24T18:59:51+00:00

Abstract

Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.

中文标题/摘要

标题：通过量化不确定性优化掩码扩散模型的解码路径

掩码扩散模型（MDMs）提供了灵活的非自回归生成，但这种自由度引入了一个挑战：最终输出质量高度依赖于解码顺序。我们首次正式化了这一问题，将输出质量的差异归因于生成路径上累积的预测不确定性。为了量化这种不确定性，我们引入了去噪熵，这是一种可计算的度量标准，可以作为评估生成过程的内部信号。利用这一度量标准，我们提出了两种旨在优化解码路径的算法：一种事后选择方法和一种实时指导策略。实验表明，我们的熵指导方法显著提高了生成质量，在具有挑战性的推理、规划和代码基准测试中持续提升了准确性。我们的工作确立了去噪熵作为理解并控制生成过程的原理性工具，有效地将MDMs中的不确定性从一种负担转变为发现高质量解决方案的关键优势。

Summary / 总结

This paper addresses the challenge of output quality variability in Masked Diffusion Models (MDMs) due to the sensitivity to decoding order. It introduces Denoising Entropy as a metric to quantify predictive uncertainty along generative paths and proposes two algorithms: a post-hoc selection method and a real-time guidance strategy. Experiments show that these entropy-guided methods enhance generation quality, particularly on complex benchmarks involving reasoning, planning, and code generation.

研究解决了Masked Diffusion Models (MDMs)因解码顺序不同而导致输出质量变化的问题。通过引入Denoising Entropy来量化预测不确定性，作者提出了两种算法来优化解码路径。实验表明，这些基于熵的方法可以提高生成质量，特别是在复杂基准上的表现。这项工作将不确定性转化为发现高质量解决方案的优势。
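
The post-hoc selection idea can be sketched as follows. The paper's exact definition of Denoising Entropy is not given in the abstract, so this example assumes a simple stand-in: the Shannon entropy of the model's predictive distribution at each decoding step, summed along the path. The candidate names and probabilities are invented for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def path_entropy(step_distributions):
    """Cumulative uncertainty along one decoding path (assumed: a plain sum)."""
    return sum(entropy(d) for d in step_distributions)


def select_path(candidates):
    """Post-hoc selection: return the name of the lowest-entropy path.

    candidates maps a path name to its list of per-step distributions.
    """
    return min(candidates, key=lambda name: path_entropy(candidates[name]))


candidates = {
    "confident": [[0.9, 0.1], [0.8, 0.2]],
    "uncertain": [[0.5, 0.5], [0.6, 0.4]],
}
print(select_path(candidates))  # confident
```

A real-time guidance strategy would instead use the same per-step signal while decoding, for example to decide which masked position to reveal next, rather than ranking completed paths afterwards.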

Streaming Video Instruction Tuning

Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou

First: 2025-12-24T18:59:36+00:00 · Latest: 2025-12-24T18:59:36+00:00

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

中文标题/摘要

标题：流式视频指令调优

我们提出了Streamo，一个实时流式视频LLM，作为通用交互式助手。与现有的专注于问答或字幕的在线视频模型不同，Streamo执行广泛的流式视频任务，包括实时解说、动作理解、事件字幕、时间事件定位和时间敏感的问答。为了开发这种多功能性，我们构建了Streamo-Instruct-465K，一个针对流式视频理解的大规模指令遵循数据集。该数据集涵盖了多种时间上下文和多任务监督，使Streamo能够在异构流式任务中统一训练。通过简化的工作流在指令遵循数据集上端到端训练后，Streamo展示了强大的时间推理、响应式交互和在各种流式基准测试中的广泛泛化。广泛的实验表明，Streamo填补了离线视频感知模型与实时多模态助手之间的差距，朝着统一、智能的视频理解在连续视频流中迈出了一步。

Summary / 总结

Streamo is a real-time streaming video language model designed as a general-purpose interactive assistant. It excels in a wide range of streaming video tasks, including real-time narration, action understanding, and event captioning. To achieve this versatility, the researchers created Streamo-Instruct-465K, a large instruction-following dataset for streaming video understanding. After training, Streamo demonstrates strong temporal reasoning and broad generalization across various streaming benchmarks, bridging the gap between offline video models and real-time multimodal assistants.

Streamo 是一种实时流媒体视频语言模型，作为通用交互式助手。它在实时叙述、动作理解等多种流媒体任务上表现出色。为了实现这一多功能性，研究人员创建了 Streamo-Instruct-465K 数据集，专门用于流媒体视频理解。经过训练后，Streamo 展现了强大的时间推理能力和在各种流媒体基准测试中的广泛泛化能力，填补了离线视频模型与实时多模态助手之间的差距。

Fast SAM2 with Text-Driven Token Pruning

Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen

First: 2025-12-24T18:59:05+00:00 · Latest: 2025-12-24T18:59:05+00:00

Comments: 28 pages, 9 figures

Abstract

Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.

中文标题/摘要

标题：快速SAM2：基于文本驱动的标记剪枝

Segment Anything Model 2 (SAM2) 是一种视觉基础模型，在基于提示的视频对象分割方面取得了显著进展，但其实际部署受限于处理时间密集视觉标记的高计算和内存成本。SAM2 管道通常会将图像编码器生成的所有视觉标记通过下游的时间推理模块进行传递，而不考虑这些标记与目标对象的相关性，导致由于基于内存的注意力开销呈二次增长而降低了可扩展性。本文提出了一种基于文本的标记剪枝框架，通过在时间传播之前选择性地减少标记密度来提高推理效率，而不修改底层分割架构。该方法在视觉编码之后、基于内存的传播之前运行，使用一种轻量级的路由机制对标记进行排名，该机制结合了局部视觉上下文、从以对象为中心的文本描述（用户提供的或自动生成的）中推导出的语义相关性以及有助于保留模糊或边界关键区域的不确定性提示。通过仅保留对下游处理最有用的标记，所提出的方法减少了冗余计算，同时保持了分割精度。在多个具有挑战性的视频分割基准测试中的广泛实验表明，编码器后的标记剪枝提供了一条实用且有效的途径，以实现基于提示的视频分割的高效性，与未剪枝的基线SAM2相比，其推理速度提高了42.50%，GPU内存使用量降低了37.41%，同时保持了竞争力的J和F性能。这些结果突显了早期标记选择对提高基于变压器的视频分割系统实时性和资源受限应用可扩展性的潜力。

Summary / 总结

This work introduces a text-guided token pruning framework for Segment Anything Model 2 (SAM2) to improve inference efficiency in video object segmentation. The method ranks tokens by local visual context, semantic relevance to a textual object description, and uncertainty cues, then keeps only the most informative tokens before temporal propagation, leaving the segmentation architecture unchanged. Experiments show up to 42.50% faster inference and 37.41% lower GPU memory usage than the unpruned SAM2 baseline, while maintaining competitive J and F performance.

本文提出了一种文本引导的token剪枝框架，用于改进Segment Anything Model 2 (SAM2)在视频对象分割中的推理效率。通过在时间传播前选择性地减少token密度，并使用一个轻量级的路由机制来考虑局部视觉上下文、语义相关性和不确定性提示来对token进行排序。实验结果显示，与未剪枝的基线相比，该方法可实现42.50％的更快推理速度和37.41％的更低GPU内存使用，同时保持了竞争力的分割性能。这种方法增强了基于Transformer的视频分割系统的可扩展性，适用于实时应用。
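
The token-ranking step can be sketched as a weighted score followed by a top-k selection. This is a minimal illustration, not the paper's routing mechanism: the linear combination, the `weights` and `keep_ratio` parameters, and the per-token scores are all assumptions.

```python
def prune_tokens(visual, semantic, uncertainty, keep_ratio, weights=(1.0, 1.0, 1.0)):
    """Return the indices of the tokens to keep, in their original order.

    Each argument is a list with one score per token. Higher uncertainty
    raises the score so that ambiguous or boundary tokens are preserved.
    """
    w_v, w_s, w_u = weights
    scores = [w_v * v + w_s * s + w_u * u
              for v, s, u in zip(visual, semantic, uncertainty)]
    k = max(1, round(len(scores) * keep_ratio))  # always keep at least one token
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Made-up scores for four tokens.
visual = [0.1, 0.9, 0.2, 0.4]
semantic = [0.0, 0.8, 0.1, 0.7]
uncertainty = [0.1, 0.1, 0.9, 0.2]
print(prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5))  # [1, 3]
```

Returning indices in their original order keeps the spatial layout of the surviving tokens intact for whatever module consumes them next.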

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras

First: 2025-12-24T18:58:16+00:00 · Latest: 2025-12-24T18:58:16+00:00

Abstract

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

中文标题/摘要

标题：TICON：一种用于组织病理学表示学习的幻灯片级切片上下文化器

在大型全切片图像（WSI）中，对小切片的解释通常需要更大的图像上下文。我们引入了TICON，一种基于变换器的切片表示上下文化器，能够为任何计算病理学应用生成丰富的上下文化嵌入。标准基于切片编码器的管道从切片中剥离其上下文来提取嵌入，无法建模对于局部和全局任务都至关重要的丰富幻灯片级信息。此外，不同的切片编码器在不同的下游任务中表现出色。因此，需要一个统一的模型来上下文化来自任何切片级基础模型的嵌入。TICON 通过一个共享的编码器来满足这一需求，该编码器使用掩码建模目标进行预训练，以同时统一和上下文化来自多种切片级病理基础模型的表示。我们的实验表明，TICON 上下文化的嵌入在许多不同任务中显著提高了性能，建立了切片级基准（例如，HEST-Bench、THUNDER、CATCH）和幻灯片级基准（例如，Patho-Bench）的新最先进的结果。最后，我们使用仅11K张WSI对TICON 进行预训练以形成幻灯片级基础模型，超越了使用多达350K张WSI预训练的最先进的幻灯片级基础模型。

Summary / 总结

TICON is a transformer-based model that provides rich, contextualized embeddings for tiles in whole slide images, addressing the limitations of standard tile encoder-based pipelines. It unifies and contextualizes representations from various tile-level pathology foundation models using a single, shared encoder pretrained with a masked modeling objective. Experiments show that TICON significantly improves performance on both tile-level and slide-level benchmarks, setting new state-of-the-art results. An aggregator pretrained on TICON with only 11K WSIs outperforms slide-level foundation models pretrained with up to 350K WSIs.

TICON 是一种基于变换器的模型，旨在为整个切片图像中的小块提供丰富的上下文嵌入，解决了缺乏上下文的切片编码器管道的局限性。它使用一个共享的预训练编码器，通过掩码建模目标同时统一和上下文化来自各种切片级病理基础模型的嵌入。实验表明，TICON 在多个任务上显著提高了性能，建立了在切片级和切片级基准上的新最佳结果。此外，TICON 能够仅使用 11K 个切片图像构建切片级基础模型，优于使用多达 350K 个切片图像预训练的最先进的模型。

Parallel Token Prediction for Language Models

Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

First: 2025-12-24T18:46:55+00:00 · Latest: 2025-12-24T18:46:55+00:00

Comments: Preprint. Under review

Abstract

We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.

中文标题/摘要

标题：语言模型中的并行令牌预测

我们提出了并行令牌预测（PTP），这是一种用于语言模型并行序列生成的通用框架。PTP 在单个变压器调用中通过将采样过程纳入模型中，同时预测多个依赖令牌，从而减少了自回归解码的延迟瓶颈，并避免了现有多种令牌预测方法中常见的独立性假设限制。我们证明PTP 可以表示任意自回归序列分布。PTP 可以通过蒸馏现有模型或通过逆自回归训练进行训练，无需教师。实验上，我们在 Spec-Bench 上通过每步接受超过四个令牌，实现了 Vicuna-7B 的最佳推测解码性能。我们框架的通用性表明，在不损失建模能力的情况下，长序列的并行生成是可行的。

Summary / 总结

The research proposes Parallel Token Prediction (PTP), a framework that jointly predicts multiple dependent tokens in a single transformer call, reducing the latency of autoregressive decoding and avoiding restrictive independence assumptions. PTP is trained either by distilling an existing model or through inverse autoregressive training. Experiments show that PTP achieves state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench, indicating its potential for parallel generation of long sequences without loss of modeling power.

研究提出了并行令牌预测（PTP），这是一种用于语言模型并行序列生成的框架，可以在单个变压器调用中联合预测多个依赖令牌。这种方法减少了自回归解码的延迟，并避免了现有的多令牌预测方法中的限制性独立假设。实验表明，PTP 在 Vicuna-7B 上实现了最先进的推测性解码性能，每步接受超过四个令牌，在 Spec-Bench 上，证明了其在并行生成长序列方面的有效性和通用性，同时不牺牲建模能力。
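
One way to read "incorporating the sampling procedure into the model" is through a standard fact about sampling: once the random numbers are fixed, an autoregressive sample is a deterministic function of the context and those numbers. The toy sketch below illustrates only that fact with inverse-CDF sampling. It is not the PTP architecture, and the two-token model and its probabilities are invented.

```python
def next_token_probs(prefix):
    """Toy autoregressive model over tokens {0, 1}; purely illustrative."""
    if prefix and prefix[-1] == 1:
        return [0.8, 0.2]
    return [0.3, 0.7]


def inverse_cdf(probs, u):
    """Map a uniform draw u in [0, 1) to a token via the cumulative distribution."""
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against floating-point round-off


def decode(prefix, uniforms):
    """Sequential sampling written as a deterministic function of the uniforms."""
    out = list(prefix)
    for u in uniforms:
        out.append(inverse_cdf(next_token_probs(out), u))
    return out[len(prefix):]


# The same uniforms always yield the same dependent tokens.
print(decode([], [0.5, 0.5, 0.9]))  # [1, 0, 1]
```

Here `decode` still loops over positions. The point of the illustration is that the mapping from (context, uniforms) to several dependent tokens is a well-defined function, which is the kind of function a single network call could in principle be trained to approximate.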

Variationally correct operator learning: Reduced basis neural operator with a posteriori error estimation

Authors: Yuan Qiu, Wolfgang Dahmen, Peng Chen

First: 2025-12-24T18:37:59+00:00 · Latest: 2025-12-24T18:37:59+00:00

Abstract

Minimizing PDE-residual losses is a common strategy to promote physical consistency in neural operators. However, standard formulations often lack variational correctness, meaning that small residuals do not guarantee small solution errors due to the use of non-compliant norms or ad hoc penalty terms for boundary conditions. This work develops a variationally correct operator learning framework by constructing first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms. We demonstrate this framework on stationary diffusion and linear elasticity, incorporating mixed Dirichlet-Neumann boundary conditions via variational lifts to preserve norm equivalence without inconsistent penalties. To ensure the function space conformity required by the FOSLS loss, we propose a Reduced Basis Neural Operator (RBNO). The RBNO predicts coefficients for a pre-computed, conforming reduced basis, thereby ensuring variational stability by design while enabling efficient training. We provide a rigorous convergence analysis that bounds the total error by the sum of finite element discretization bias, reduced basis truncation error, neural network approximation error, and statistical estimation errors arising from finite sampling and optimization. Numerical benchmarks validate these theoretical bounds and demonstrate that the proposed approach achieves superior accuracy in PDE-compliant norms compared to standard baselines, while the residual loss serves as a reliable, computable a posteriori error estimator.

中文标题/摘要

标题：变分正确的算子学习：基于后验误差估计的降维神经算子

最小化PDE残差损失是促进神经算子物理一致性的常见策略。然而，标准形式通常缺乏变分正确性，这意味着小的残差并不保证小的解误差，因为使用了不合规的范数或人为的边界条件惩罚项。本文通过构建值可证明等同于PDE诱导范数下解误差的一阶系统最小二乘（FOSLS）目标，发展了一种变分正确的算子学习框架。我们通过变分提升混合Dirichlet-Neumann边界条件，确保范数等价性而不引入不一致的惩罚项。为了满足FOSLS损失所需的函数空间一致性，我们提出了一种降维神经算子（RBNO）。RBNO预测预计算的、一致的降维基的系数，从而通过设计确保变分稳定性，同时实现高效的训练。我们提供了一种严格的收敛性分析，将总误差界定了有限元离散偏差、降维基截断误差、神经网络逼近误差以及有限采样和优化引起的统计估计误差之和。数值基准验证了这些理论界，并表明所提出的方法在PDE一致范数下实现了优于标准基线的更高精度，而残差损失作为可靠的、可计算的后验误差估计器。

Summary / 总结

This work addresses the issue of variational correctness in neural operators by developing a variationally correct framework using first-order system least-squares (FOSLS) objectives. The framework ensures that small residuals correspond to small solution errors. It incorporates mixed boundary conditions and uses a Reduced Basis Neural Operator (RBNO) to predict coefficients for a pre-computed reduced basis, ensuring variational stability and efficient training. Theoretical analysis and numerical benchmarks show that the approach achieves higher accuracy in PDE-compliant norms and provides a reliable a posteriori error estimator.

该研究通过使用一阶系统最小二乘（FOSLS）目标来解决神经算子的变分正确性问题。提出的Reduced Basis Neural Operator（RBNO）确保了函数空间的一致性和变分稳定性。该方法提供了严格的收敛性分析，并展示了在PDE一致范数下的精度优于标准基线，同时残差损失作为可靠的后验误差估计器。
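
For stationary diffusion, a first-order system least-squares objective of the kind the abstract describes can be written out as below. This is the textbook form under simplifying assumptions (homogeneous Dirichlet data, a coefficient a bounded above and below by positive constants, f in L^2), not necessarily the paper's exact formulation, which also treats mixed Dirichlet-Neumann conditions via variational lifts.

```latex
% Model problem (homogeneous Dirichlet data assumed here for brevity)
-\nabla\cdot(a\nabla u) = f \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega .

% First-order reformulation with the flux \sigma as a second unknown
\sigma + a\nabla u = 0, \qquad \nabla\cdot\sigma = f .

% Least-squares objective over (v,\tau) \in H^1_0(\Omega)\times H(\mathrm{div};\Omega)
\mathcal{L}(v,\tau) = \|\tau + a\nabla v\|_{L^2(\Omega)}^2
                    + \|\nabla\cdot\tau - f\|_{L^2(\Omega)}^2 .

% Norm equivalence: the loss value brackets the true error, with 0 < c \le C
c\,\big(\|v-u\|_{H^1}^2 + \|\tau-\sigma\|_{H(\mathrm{div})}^2\big)
  \;\le\; \mathcal{L}(v,\tau) \;\le\;
C\,\big(\|v-u\|_{H^1}^2 + \|\tau-\sigma\|_{H(\mathrm{div})}^2\big) .
```

The two-sided bound is what makes the residual usable as an a posteriori error estimator: evaluating the loss at a prediction bounds its error from both sides, provided the prediction lies in the conforming spaces, which is the role the abstract assigns to the reduced basis.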

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

Authors: Roy Turgeman, Tom Tirer

First: 2025-12-24T18:21:01+00:00 · Latest: 2025-12-24T18:21:01+00:00

Abstract

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

中文标题/摘要

标题：数据处理不等式反映实践吗？低级任务的有效性探究

数据处理不等式是信息论原理，表明通过处理观测值无法增加信号的信息量。特别是，它表明在解决分类问题之前增强信号或对其进行编码是没有益处的。这一断言可以证明在最优贝叶斯分类器的情况下是正确的。然而，在实践中，尽管现代深度神经网络具有强大的能力，但在“高级”的下游任务之前通常会执行“低级”任务。在本文中，我们旨在理解何时以及为什么低级处理可以对分类有益。我们对二分类设置进行了全面的理论研究，考虑了一个与最优贝叶斯分类器紧密相连的分类器，并随着训练样本数量的增加而收敛于最优贝叶斯分类器。我们证明了对于任何有限数量的训练样本，都存在一种预分类处理可以提高分类准确性。我们还探讨了类别分离、训练集大小和类别平衡对这种处理相对增益的影响。我们通过理论设置的经验研究支持了我们的理论。最后，我们进行了一项经验研究，调查了去噪和编码对基准数据集上实用深度分类器性能的影响。具体来说，我们改变了训练集的大小和类别分布以及噪声水平，并展示了与理论结果一致的趋势。

Summary / 总结

This paper asks whether the data processing inequality, which implies that pre-processing cannot help the optimal Bayes classifier, reflects practice, where low-level tasks routinely precede classification. In a binary classification setup with a classifier that converges to the Bayes classifier as the training set grows, the authors prove that for any finite number of training samples there exists a pre-classification processing step that improves accuracy. They also analyze how class separation, training set size, and class balance affect the relative gain. Experiments on the theoretical setup, and on practical deep classifiers with denoising and encoding on benchmark datasets, show trends consistent with the theory.

本文探讨了低级任务在分类中的实用性，挑战了数据处理不等式。研究证明，对于任何有限数量的训练样本，预分类处理可以提高准确性。研究还探讨了类别分离、训练集大小和类别平衡如何影响这种处理的好处。基准数据集上的实验证据支持这些发现，显示的趋势与理论分析一致。
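
For reference, the data processing inequality and its consequence for the optimal classifier can be stated as follows. The notation (label Y, observation X, processing map g) is chosen here for illustration and is not taken from the paper.

```latex
% Any processing T = g(X) of the observation forms a Markov chain Y -> X -> T
Y \to X \to T = g(X)
\quad\Longrightarrow\quad
I(Y;T) \;\le\; I(Y;X) .

% Consequence for the Bayes classifier: processing cannot lower the optimal error,
% because every classifier h on g(X) is also a classifier h \circ g on X
\min_{h}\, \Pr\big[h(g(X)) \ne Y\big]
\;\ge\;
\min_{h'}\, \Pr\big[h'(X) \ne Y\big] .
```

Both statements concern the optimal classifier under the true distribution. They say nothing about a classifier fitted to finitely many samples, which is the regime where the paper shows pre-processing can help.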

Learning to Solve PDEs on Neural Shape Representations

Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

First: 2025-12-24T18:14:02+00:00 · Latest: 2025-12-24T18:14:02+00:00

Comments: Article webpage link: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/

Abstract

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.

中文标题/摘要

标题：在神经形状表示中学习求解偏微分方程

在形状上求解偏微分方程（PDEs）是许多形状分析和工程任务的基础；然而，现有的PDE求解器通常基于多边形/三角形网格，而现代3D资产越来越多地以神经表示形式存在。这种不匹配使得没有合适的方法可以直接在神经域内求解曲面PDEs，迫使进行显式的网格提取或逐实例残差训练，阻碍了端到端的工作流程。我们提出了一种全新的无网格公式，该公式学习一个基于神经（局部）形状属性的局部更新算子，使得可以在数据所在的曲面上直接求解PDEs。该算子自然地与常见的神经曲面表示相结合，只需在一个代表性形状上进行一次训练，即可在形状和拓扑变化下泛化，从而在无需显式网格化或逐实例优化的情况下实现准确且快速的推理，同时保持可微性。在分析基准（球体上的热方程和泊松求解）和不同表示的真实神经资产中，我们的方法在某些方面略优于CPM，同时保持与FEM相当的性能，并且据我们所知，首次提供了在神经和经典曲面表示上求解曲面PDEs的端到端管道。代码将在接受后发布。

Summary / 总结

The research addresses the challenge of solving partial differential equations (PDEs) on shapes represented by neural networks, which is crucial for shape analysis and engineering tasks. The method introduces a mesh-free formulation that learns a local update operator conditioned on local neural shape attributes, allowing surface PDEs to be solved directly on neural representations. The operator is trained once on a single representative shape and generalizes across shape and topology variations. Experiments on analytic benchmarks and real neural assets show that the method slightly outperforms CPM while remaining reasonably close to FEM, enabling end-to-end workflows without explicit meshing or per-instance optimization.

研究解决了在神经形状表示上求解偏微分方程（PDEs）的问题，这些表示在3D资产中越来越普遍但缺乏直接求解的方法。方法提出了一种无网格公式，通过条件化神经形状属性来学习局部更新操作符，使PDEs可以直接在神经领域内求解。实验表明，该方法在性能上略优于一致PDE方法（CPM），且接近有限元方法（FEM），同时提供了一个从神经和经典表面表示求解表面PDEs的端到端管道。
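
Analytic benchmarks on the sphere are convenient because closed-form solutions exist. The abstract does not specify the benchmark configuration, so the following is the standard reference solution on the unit sphere, given under that assumption.

```latex
% Heat equation on the unit sphere S^2
\partial_t u = \Delta_{S^2} u, \qquad u(x,0) = u_0(x) .

% Spherical harmonics are eigenfunctions of the Laplace-Beltrami operator
\Delta_{S^2} Y_\ell^m = -\ell(\ell+1)\, Y_\ell^m .

% Hence each mode decays exponentially at a rate set by its degree \ell
u(x,t) = \sum_{\ell \ge 0} \sum_{m=-\ell}^{\ell} c_{\ell m}\, e^{-\ell(\ell+1)t}\, Y_\ell^m(x),
\qquad c_{\ell m} = \langle u_0, Y_\ell^m \rangle_{L^2(S^2)} .

% Poisson problem -\Delta_{S^2} u = f with zero-mean f, solved mode by mode (up to a constant)
u = \sum_{\ell \ge 1} \sum_{m=-\ell}^{\ell} \frac{f_{\ell m}}{\ell(\ell+1)}\, Y_\ell^m ,
\qquad f_{\ell m} = \langle f, Y_\ell^m \rangle_{L^2(S^2)} .
```

Comparing a solver's output against these expressions gives an exact error measurement, independent of any mesh-based reference.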

Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning

Authors: Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong

Venue: NeurIPS 2025

First: 2021-10-07T03:14:46+00:00 · Latest: 2025-12-24T17:53:45+00:00

Comments: NeurIPS 2025; Previous Version in ICML Workshop: Exploration in AI Today (EXAIT) 2025

Abstract

The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify the significance of this uncertainty-aware regularization from distributional RL on the empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.

中文标题/摘要

标题：分类分布损失的内在优势：分布感知正则化探索在强化学习中的应用

分布式强化学习（RL）的卓越实证性能引起了对其与经典RL理论优势的越来越多关注。通过分解在分布式RL中常用的分类分布损失，我们发现分布式RL潜在优势可归因于一种衍生的分布匹配熵正则化。这种较少研究的熵正则化旨在捕捉回报分布的额外知识，而不仅仅是其期望值，从而为策略优化提供增强的奖励信号。与MaxEnt RL中的基本熵正则化相比，后者通过促进多样化的动作显式地鼓励探索，而从分类分布损失中推导出的新型熵正则化则隐式地更新策略，使其与（估计的）环境不确定性相一致。最后，广泛的实验验证了这种分布感知正则化在实证上对经典RL的优越性。我们的研究为解释分布式学习在RL中的内在优势提供了创新的探索视角。

Summary / 总结

This paper explores the theoretical advantages of distributional reinforcement learning (RL) by decomposing the categorical distributional loss. It identifies an entropy regularization that captures the return distribution beyond its expectation, contributing to better reward signals. Unlike traditional entropy regularization in MaxEnt RL, this new regularization implicitly aligns policies with environmental uncertainty. Experiments confirm the benefits of this uncertainty-aware regularization in distributional RL over classical RL methods.

该论文通过分解分类分布损失，探索了分布式强化学习（RL）的理论优势。它识别出一种分布匹配熵正则化，能够捕捉回报分布超出其期望的额外知识，从而增强奖励信号。不同于传统MaxEnt RL中的熵正则化明确鼓励探索，这种新正则化隐式地使策略与环境不确定性对齐。实验确认了这种不确定性意识正则化在分布式RL中的重要性，提供了其相对于经典RL内在优势的新视角。

AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng

First: 2025-12-24T17:40:42+00:00 · Latest: 2025-12-24T17:40:42+00:00

Comments: 23 pages, 13 figures, 8 tables

Abstract

Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.

中文标题/摘要

标题：AndroidLens：嵌套子目标下的长延迟评估方法用于Android GUI代理

图形用户界面（GUI）代理可以通过自动化移动设备上频繁执行的长延迟任务来显著提高生产力。然而，现有的评估基准仍然局限于有限的应用程序、简单的任务和粗粒度的指标。为了解决这个问题，我们引入了AndroidLens，这是一个针对移动GUI代理的具有挑战性的评估框架，包含571个长延迟任务，涵盖中文和英文环境，每个任务平均需要超过26步才能完成。该框架的特点是：(1) 来自38个领域的真实世界用户场景的任务，涵盖多种复杂类型，如多约束、多目标和领域特定任务；(2) 静态评估保留了真实世界的异常情况，并允许多条有效路径以减少偏差；(3) 动态评估采用基于里程碑的方案，通过平均任务进度（ATP）进行细粒度的进度测量。我们的评估表明，即使是最优秀的模型也只能达到12.7%的任务成功率和50.47%的ATP。我们还强调了真实世界环境中的关键挑战，包括环境异常、自适应探索和长期记忆保留。

Summary / 总结

The research introduces AndroidLens, a challenging evaluation framework for mobile GUI agents, which includes 571 long-latency tasks in both Chinese and English environments, each requiring an average of over 26 steps. The framework features real-world user scenarios, static and dynamic evaluations, and a milestone-based scheme for progress measurement. The evaluation shows that even the best models achieve only 12.7% task success rate and 50.47% Average Task Progress (ATP).

AndroidLens 是一个针对移动 GUI 代理的挑战性评估框架，包含571个跨中英文环境的长延迟任务，每个任务平均需要超过26步。这些任务覆盖了38个领域的实际场景，并包括多约束、多目标等复杂类型的任务。评估框架使用静态和动态方法来衡量任务成功率和平均任务进度（ATP），结果显示即使最好的模型也只能达到12.7%的任务成功率和50.47%的ATP。研究还指出了实际环境中的挑战，如环境异常、自适应探索和长期记忆保持等。
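
As a rough illustration of milestone-based progress measurement, the sketch below scores each task by the fraction of its milestones reached and averages that across tasks. The abstract does not give the exact ATP formula or say how nested sub-targets are weighted, so the equal weighting and all function names here are assumptions.

```python
from typing import List


def task_progress(milestones_hit: List[bool]) -> float:
    """Fraction of milestones reached in one task (equal weights are an assumption)."""
    if not milestones_hit:
        return 0.0
    return sum(milestones_hit) / len(milestones_hit)


def average_task_progress(tasks: List[List[bool]]) -> float:
    """Mean per-task progress over all tasks."""
    if not tasks:
        return 0.0
    return sum(task_progress(t) for t in tasks) / len(tasks)


def success_rate(tasks: List[List[bool]]) -> float:
    """A task counts as a success only if every milestone was reached."""
    if not tasks:
        return 0.0
    return sum(1 for t in tasks if t and all(t)) / len(tasks)


# Hypothetical outcomes for three tasks with four milestones each.
tasks = [
    [True, True, False, False],
    [True, True, True, True],
    [True, False, False, False],
]
print(average_task_progress(tasks), success_rate(tasks))
```

In this toy input only one of three tasks succeeds, yet the average progress is 7/12, which shows how a progress metric can sit well above the success rate, as in the reported 12.7% versus 50.47%.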

Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering

Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H. M. Abou-El-Enien

First: 2025-12-24T17:39:37+00:00 · Latest: 2025-12-24T17:39:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study utilized Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. The study then developed a novel, reaction-first evolutionary metaheuristic algorithm, together with multi-objective optimization programming, that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.

中文标题/摘要

标题：基于转录组的个性化从头药物生成用于AML：使用元启发式组装和靶向筛选

急性髓系白血病（AML）由于其极端的分子异质性和高复发率，仍然是临床挑战。尽管精准医疗引入了针对突变的治疗方法，但许多患者仍然缺乏有效的个性化选择。本文提出了一种新颖的端到端计算框架，将患者特异性转录组学与从头药物发现联系起来。通过分析TCGA-LAML队列的批量RNA测序数据，研究利用加权基因共表达网络分析（WGCNA）优先筛选出20个高价值生物标志物，包括代谢转运蛋白如HK3和免疫调节受体如SIGLEC9。这些靶点的物理结构使用AlphaFold3建模，并通过DOGSiteScorer引擎定量映射可成药热点。开发了一种新颖的反应优先进化元启发式算法以及多目标优化编程，从片段库中组装新型配体，由这些识别的热点的空间对齐引导。生成模型产生了结构上独特的化学实体，具有强烈的药物样空间偏向，QED评分峰值在0.5到0.7之间。通过ADMET表型分析和SwissDock分子对接验证，识别出高置信度候选物，如配体L1，其与A08A96生物标志物的结合自由能为-6.571 kcal/mol。这些结果表明，将系统生物学与元启发式分子组装相结合可以产生药理学上可行的、患者特制的先导化合物，为AML和其他癌症的精准肿瘤学提供可扩展的蓝图

Summary / 总结

This study addresses the challenge of acute myeloid leukemia (AML) by developing an end-to-end computational framework that integrates patient-specific transcriptomics with de novo drug discovery. Using WGCNA to prioritize biomarkers and AlphaFold3 for structural modeling, the framework employs a novel metaheuristic algorithm to generate structurally unique chemical entities. Key findings include the production of drug-like molecules with ADMET validation and a high-confidence candidate achieving a binding free energy of -6.571 kcal/mol.

该研究通过开发一个将患者特异性转录组学与新药发现相结合的端到端计算框架，来应对急性髓系白血病（AML）的临床挑战。使用WGCNA识别20个高价值生物标志物，然后使用AlphaFold3建模其结构，并使用DOGSiteScorer绘制可成药热点。一种新颖的元启发式算法和多目标优化程序从片段库中组装新型配体，通过空间对齐指导这些热点。生成的化学实体显示出强烈的药物样特性，QED分数在0.5到0.7之间。通过ADMET表型分析和分子对接验证，识别出高置信度候选物，如配体L1，与A08A96生物标志物的结合自由能为-6.571 kcal/mol，展示了个性化药物生成在AML中的潜力。

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Authors: Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

Venue: NeurIPS 2025

First: 2025-06-06T19:29:13+00:00 · Latest: 2025-12-24T17:26:35+00:00

Comments: 40 pages, 8 figures, NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.

中文标题/摘要

标题：交替梯度流：两层神经网络特征学习的一种理论

神经网络学习哪些特征以及如何学习仍然是一个开放的问题。本文引入了交替梯度流（AGF），这是一种算法框架，描述了从小初始化训练的两层网络中特征学习的动力学。先前的研究表明，在这种情况下，梯度流表现出阶梯状的损失曲线，交替出现神经元缓慢对齐到有用方向的平台期和神经元迅速增长的急剧下降期。AGF 将这种行为近似为交替的两步过程：在休眠神经元上最大化一个效用函数，在活跃神经元上最小化一个成本函数。AGF 从所有神经元都处于休眠状态开始。在每次迭代中，一个休眠的神经元激活，触发特征的获取和损失的下降。AGF 量化了这些下降的顺序、时间和幅度，与多个常用架构的实验结果相符。我们证明了 AGF 统一并扩展了全连接线性网络和仅注意力线性变压器中已有的鞍点到鞍点分析，其中学习的特征分别是奇异模式和主成分。在对角线线性网络中，我们证明 AGF 在初始化趋于零的极限下收敛到梯度流。将 AGF 应用于训练以执行模块加法的二次网络，我们首次完整地描述了训练动力学，揭示了网络按系数大小递减的顺序学习傅里叶特征。总体而言，AGF 为理解神经网络中的特征学习提供了一个有希望的步骤。

Summary / 总结

The paper introduces Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer neural networks trained from small initialization. AGF approximates the alternating behavior of gradient flow as two steps: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. The key findings include matching the order, timing, and magnitude of loss drops with experiments across various architectures, unifying and extending existing saddle-to-saddle analyses, and providing a complete characterization of training dynamics in quadratic networks, revealing that networks learn Fourier features in decreasing order of coefficient magnitude.

本文提出了交替梯度流（AGF）算法框架，用于描述从小初始化训练的两层神经网络中的特征学习动态。AGF 将梯度流的行为近似为交替的两步过程：在静止神经元上最大化一个效用函数，在活跃神经元上最小化一个成本函数。关键实验发现表明，AGF 能够匹配各种架构中观察到的损失下降的顺序、时间和幅度，统一并扩展了线性网络和变压器中的鞍点到鞍点分析，并为执行模加操作的二次网络提供了完整的训练动力学特征，揭示了网络按系数大小递减顺序学习傅里叶特征。

Model Merging via Multi-Teacher Knowledge Distillation

Authors: Seyed Arshan Dalili, Mehrdad Mahdavi

First: 2025-12-24T17:10:44+00:00 · Latest: 2025-12-24T17:10:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at https://github.com/arshandalili/SAMerging.

中文标题/摘要

标题：多教师知识蒸馏下的模型合并

模型合并已成为联合多任务学习（MTL）的轻量级替代方案，但合并模型的泛化特性尚未得到充分探索。建立此类理论保证并不容易，因为合并过程通常禁止访问原始训练数据，并涉及结合在根本上异质数据分布下训练的微调模型。在缺乏这些动态的原理性理解时，当前方法往往依赖于启发式方法来近似参数的最佳组合。这种方法在系数缩放中最为关键，即调节每个微调模型对共享参数贡献大小的权重因子。然而，由于缺乏指导其选择的原理性目标，这些方法会导致脆弱的性能，并且高度依赖于缩放初始化。我们通过（i）建立一种新的适用于模型合并设置的平滑度感知PAC-Bayes泛化界来解决这一缺口。该分析引入了一个“跨任务异质性”项，正式捕捉了多种微调模型先验与目标多任务分布之间的不匹配。受这一理论洞察的指导，（ii）我们将模型合并框架化为在稀缺未标记数据上的多教师知识蒸馏。我们正式证明，最小化学生-教师Kullback-Leibler散度直接收紧了合并模型超额风险的上界。受所推导的平滑度感知界指导，（iii）我们通过SAMerging方法实现这一目标，该方法使用尖锐度感知最小化（SAM）来寻找平滑的极小值。实验中，SAMerging在视觉和自然语言处理基准测试中建立了新的最佳状态，实现了卓越的性能。代码可在https://github.com/arshandalili/SAMerging/ 获取。

Summary / 总结

This paper addresses the theoretical and practical challenges in model merging, a lightweight alternative to joint multi-task learning. It introduces a novel flatness-aware PAC-Bayes generalization bound for model merging, which captures the mismatch between fine-tuned models and multi-task distributions. Guided by this insight, the authors frame model merging as multi-teacher knowledge distillation and propose SAMerging, which uses Sharpness-Aware Minimization to find flat minima. Experiments show that SAMerging outperforms existing methods on vision and NLP benchmarks.

论文通过建立理论框架来解决模型合并问题，这是一种轻量级的多任务学习替代方案。它引入了一种针对模型合并的平滑度感知PAC-Bayes泛化界，并将任务框架化为多教师的知识蒸馏。提出的SAMerging方法使用尖锐度感知最小化来寻找平滑的最小值，从而在视觉和自然语言处理基准测试中取得了显著的性能提升。
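
The two ingredients named in the abstract, a student-teacher KL objective and Sharpness-Aware Minimization, can be sketched in a few lines. This is not the SAMerging implementation: the KL direction (teacher to student), the per-task averaging, the finite-difference gradient, and every function name are assumptions chosen to keep the example self-contained.

```python
import math
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def multi_teacher_kd_loss(student_logits: List[List[float]],
                          teacher_logits: List[List[float]]) -> float:
    """Mean KL(teacher || student), one (student, teacher) logit pair per task."""
    losses = [kl_divergence(softmax(t), softmax(s))
              for s, t in zip(student_logits, teacher_logits)]
    return sum(losses) / len(losses)


def num_grad(loss: Callable[[List[float]], float], w: List[float], h: float = 1e-5) -> List[float]:
    """Central-difference gradient; stands in for autograd in this sketch."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((loss(up) - loss(down)) / (2 * h))
    return grad


def sam_step(loss: Callable[[List[float]], float], w: List[float],
             lr: float = 0.1, rho: float = 0.05) -> List[float]:
    """One SAM update: climb to the worst-case neighbor, then descend with its gradient."""
    g = num_grad(loss, w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # perturbed weights
    g_adv = num_grad(loss, w_adv)                            # gradient at the perturbed point
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]
```

In a merging setting, `w` would presumably hold the merging coefficients and `loss` would evaluate `multi_teacher_kd_loss` on unlabeled data; that wiring is left out because the abstract does not specify it.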

Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Authors: Suren Bandara

First: 2025-12-24T17:10:37+00:00 · Latest: 2025-12-24T17:10:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.

中文标题/摘要

标题：基于掩膜后处理的表格分割结构坐标提取

从表格中提取结构化数据在扫描文档和数字档案的文档图像分析中起着关键作用。尽管已经提出了许多方法来检测表格结构并提取单元格内容，但在低分辨率或噪声图像中准确识别表格段边界（行和列）仍然具有挑战性。在许多实际场景中，表格数据不完整或退化，限制了基于变换器的方法对噪声输入的适应性。基于掩膜的边缘检测技术在这些条件下表现出更大的鲁棒性，因为它们的灵敏度可以通过阈值调整进行调整；然而，现有方法通常直接将掩膜应用于图像，导致噪声敏感性、分辨率损失或高计算成本。本文提出了一种新的多尺度信号处理方法，用于从表格掩膜中检测表格边缘。行和列转换被建模为一维信号，并使用具有逐渐增加方差的高斯卷积进行处理，然后通过统计阈值抑制噪声同时保留稳定的结构边缘。检测到的信号峰值被映射回图像坐标以获得准确的段边界。实验结果表明，将所提出的方法应用于列边缘检测，可以将基于布局的度量Cell-Aware Segmentation Accuracy (CASA)从PubLayNet-1M基准上的67%提高到76%，该度量评估文本正确性和正确的单元格放置。该方法通过零填充和缩放策略对分辨率变化具有鲁棒性，并生成优化的结构化表格输出，适合下游分析。

Summary / 总结

This paper addresses the challenge of accurately detecting table segment boundaries in low-resolution or noisy images, which is crucial for structured data extraction from tables in document image analysis. The authors propose a multi-scale signal-processing method that models row and column transitions as one-dimensional signals and processes them using Gaussian convolution with increasing variances, followed by statistical thresholding to detect stable structural edges. The method improves the Cell-Aware Segmentation Accuracy (CASA) from 67% to 76% on the PubLayNet-1M benchmark when used with TableNet and PyTesseract OCR, demonstrating robustness to resolution variations and high adaptability to noisy inputs.

本文针对低分辨率或噪声图像中准确识别表格边界的问题，提出了一种多尺度信号处理方法，将行和列的过渡视为一维信号，并使用高斯卷积和统计阈值处理。该方法在使用TableNet和PyTesseract OCR时，将Cell-Aware Segmentation Accuracy (CASA) 从67%提高到76%，在PubLayNet-1M基准上展示了其对分辨率变化的鲁棒性和适用于下游分析的特性。
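
The pipeline described in the abstract (1-D transition signal, Gaussian smoothing at increasing scales, statistical thresholding, peak-to-coordinate mapping) can be sketched as follows. This is a minimal stand-in, not the paper's code: the cascade of smoothing passes, the mean-plus-z-standard-deviations threshold, the default parameter values, and all names are assumptions.

```python
import math
import statistics
from typing import List, Sequence


def gaussian_kernel(sigma: float) -> List[float]:
    """Normalized 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal: Sequence[float], sigma: float) -> List[float]:
    """Convolve with a Gaussian, zero-padding both ends so the length is kept."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = [0.0] * radius + list(signal) + [0.0] * radius
    return [sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(len(signal))]


def column_edges(mask: List[List[int]], sigmas=(1.0, 2.0), z: float = 0.5) -> List[int]:
    """Return column indices where the binary table mask switches on or off."""
    col_sums = [sum(row[c] for row in mask) for c in range(len(mask[0]))]
    # Transition signal: absolute change between neighboring column sums.
    signal = [abs(col_sums[c + 1] - col_sums[c]) for c in range(len(col_sums) - 1)]
    for sigma in sigmas:  # progressively wider smoothing
        signal = smooth(signal, sigma)
    threshold = statistics.mean(signal) + z * statistics.pstdev(signal)
    peaks = [i for i in range(1, len(signal) - 1)
             if signal[i] > threshold
             and signal[i] > signal[i - 1]
             and signal[i] >= signal[i + 1]]
    # Transition i lies between columns i and i + 1; report the right-hand column.
    return [i + 1 for i in peaks]


# Hypothetical 4 x 40 mask with two table columns at x = 5..14 and x = 25..34.
mask = [[1 if 5 <= c < 15 or 25 <= c < 35 else 0 for c in range(40)] for _ in range(4)]
print(column_edges(mask))
```

Row edges would follow from the same function applied to the transposed mask. Closely spaced edges merge under wider kernels, which is the noise-versus-resolution trade-off the threshold and scale choices control.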

Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan

First: 2025-12-24T17:05:09+00:00 · Latest: 2025-12-24T17:05:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

中文标题/摘要

标题：使用尖峰驱动视频变换器的手术场景分割及其实时潜力

现代手术系统越来越多地依赖智能场景理解以提供及时的态势感知，从而增强术中安全性。在此流程中，手术场景分割在准确感知手术事件方面发挥着核心作用。尽管最近的深度学习模型，尤其是大规模基础模型，实现了显著的分割准确性，但它们巨大的计算需求和高能耗阻碍了在资源受限的手术环境中进行实时部署。为解决这一限制，我们探索了新兴的SNN作为高效手术智能的有前途范式。然而，其性能仍受到手术标注数据稀缺和手术视频表示固有的稀疏性限制。为此，我们提出了SpikeSurgSeg，这是首个针对手术场景分割的尖峰驱动视频变换器框架，具有在非GPU平台上实现实时潜力的潜力。为解决手术标注数据有限的问题，我们引入了一种针对SNN的手术场景掩蔽自编码预训练策略，通过逐层管状掩蔽实现稳健的空间-时间表示学习。基于此预训练骨干，我们进一步采用一种轻量级的尖峰驱动分割头，产生时间一致的预测，同时保持SNN的低延迟特性。在EndoVis18和我们内部的SurgBleed数据集上的广泛实验表明，SpikeSurgSeg在推断延迟方面至少减少了8倍，同时其mIoU与最先进的基于ANN的模型相当。值得注意的是，它相对于大多数基础模型基线的加速比超过20倍，突显了其在时间关键手术场景分割中的潜力。

Summary / 总结

The research aims to develop a real-time surgical scene segmentation model for enhanced intra-operative safety. To address the computational demands of deep learning models, the authors propose SpikeSurgSeg, a spike-driven video Transformer framework. This model uses a surgical-scene masked autoencoding pretraining strategy and a lightweight segmentation head to achieve high segmentation accuracy while significantly reducing inference latency. Experiments show that SpikeSurgSeg outperforms most foundation-model baselines by over 20 times in terms of speed, with comparable accuracy to state-of-the-art models.

研究旨在通过利用尖峰驱动的视频Transformer（SpikeSurgSeg）来改善实时手术场景分割，以增强术中安全。该方法包括针对尖峰神经网络（SNNs）的手术场景掩蔽自编码预训练策略和一个轻量级的尖峰驱动分割头。实验结果表明，SpikeSurgSeg在平均交并比（mIoU）上与最先进的（SOTA）基于ANN的模型相当，同时将推理延迟减少了至少8倍，并且相对于大多数基础模型基线提供了超过20倍的加速。

SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Authors: Divij Dudeja, Mayukha Pal

First: 2025-12-24T16:59:04+00:00 · Latest: 2025-12-24T16:59:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documentation, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware fact extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these Rational Subject-Relation-Object facts as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the lowest processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by retrieval-augmented generation (RAG) for new uploads (FAISS top-20 results, with memory truncated at 64 slots). In real-world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.

中文标题/摘要

标题：SMART SLM：结构化记忆与推理变换器，一种用于准确文档辅助的小型语言模型

工程手册（EM）的用户发现阅读EM文档很困难，因为它们很长，格式密集，包含书面文档、逐步操作程序和工程设备的标准参数列表。现成的变换器，尤其是紧凑型的，将这些材料视为一个扁平的令牌流。这种方法导致模型自信但错误的数字答案，并迫使模型以低效的方式记忆单独的事实。SMART（结构化记忆与推理变换器）为上述问题提供了一种不同的且实用的解决方案。SMART通过使用分层方法来结构化其处理过程，并基于三个主要工作类别：（1）语法意识事实提取器（语法学家）树LSTM，从EM句子中提取主语关系对象关系的事实；（2）紧凑索引记忆MANN（记忆增强神经网络），将这些理性主语关系对象作为384维向量索引，与信息来源相关联；（3）6层变换器，学习将之前检索到的事实融合到其生成的响应中。整个SMART模型使用45.51M参数，比GPT-2（124M）少64%，比BERT（133M）少69%，并且准确率比GPT-2高21.3%，表明SMART以最少的处理要求更好地拟合数据。SMART采用双模式推理，已知文档的索引快速路径（次秒级答案时间）和新上传文件的索引动态路径（借助RAGs的FAISS Top 20结果，记忆容量为64槽）。在实际部署中，该框架比可比的小型变换器模型产生更支持的结果，减少了幻觉。

Summary / 总结

The paper addresses the challenge of accurately processing Engineering Manuals (EM) by proposing SMART (Structured Memory and Reasoning Transformer), which uses a hierarchical approach to extract facts and store them in a compact indexed memory. SMART consists of a syntax-aware fact extractor, a compact indexed memory, and a transformer that fuses retrieved facts. The model uses 45.51M parameters, fewer than GPT-2 and BERT, and achieves 21.3% higher accuracy than GPT-2. SMART offers dual inference modes for known and new documents, leading to more accurate and better-supported results with fewer hallucinations than other small transformers.

论文提出了一种紧凑型语言模型SMART（Structured Memory and Reasoning Transformer），以解决用户阅读工程手册（EM）的难题。SMART采用分层方法，包括语法感知的事实提取器、紧凑的索引记忆和变压器，以更有效地处理EM。该模型的准确率比GPT-2高21.3%，参数量更少，显示出更好的拟合度和更低的计算需求。SMART支持已知和新上传文档的双重推理模式，通过减少幻觉和快速提供支持性结果，增强了实际部署中的表现。
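
To make the indexed-memory idea concrete, the sketch below stores subject-relation-object facts next to their source and retrieves the closest ones by cosine similarity. It is a plain-Python stand-in, not the SMART code: the paper uses FAISS and 384-dimensional embeddings, whereas the class name, the oldest-first eviction policy, the 2-dimensional toy vectors, and the example facts are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class FactMemory:
    """Slot-limited store of (embedding, fact, source) entries."""

    def __init__(self, max_slots: int = 64) -> None:
        self.max_slots = max_slots
        self.slots: List[Tuple[List[float], str, str]] = []

    def write(self, embedding: List[float], fact: str, source: str) -> None:
        self.slots.append((embedding, fact, source))
        if len(self.slots) > self.max_slots:
            self.slots.pop(0)  # drop the oldest entry (eviction policy is an assumption)

    def top_k(self, query: Sequence[float], k: int = 20) -> List[Tuple[str, str]]:
        """Return the k facts most similar to the query, each with its source."""
        ranked = sorted(self.slots, key=lambda s: cosine(query, s[0]), reverse=True)
        return [(fact, source) for _, fact, source in ranked[:k]]


# Hypothetical facts with toy 2-D embeddings.
memory = FactMemory(max_slots=64)
memory.write([1.0, 0.0], "pump | max_pressure | 16 bar", "manual, section 3")
memory.write([0.0, 1.0], "valve | torque | 40 Nm", "manual, section 7")
print(memory.top_k([0.9, 0.1], k=1))
```

Keeping the source string with every fact is what lets a generator cite where a numeric value came from instead of producing it from parametric memory.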

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller

First: 2025-12-24T16:46:04+00:00 · Latest: 2025-12-24T16:46:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice as fast) across datasets.

中文标题/摘要

标题：GriDiT：基于网格的因子化扩散方法用于高效生成长图像序列

现代深度学习方法通常将图像序列视为按顺序堆叠帧的大张量。然而，鉴于当前的最先进水平（SoTA），这种简单的表示是否理想？在本文中，我们从生成模型的角度回答了这个问题，并旨在设计一种更有效的图像序列数据建模方法。观察当前SoTA图像序列生成方法的低效性和瓶颈，我们展示了与其处理大张量，通过首先在低分辨率下生成粗略的序列，然后在高分辨率下细化各个帧，可以改进生成过程。我们仅使用包含下采样帧的网格图像训练生成模型。然而，我们学习使用扩散变换器（DiT）的强自注意力机制来捕捉帧之间的相关性，从而生成图像序列。实际上，我们的建模方法将二维图像生成器扩展为低分辨率的三维图像序列生成器，而无需进行任何架构修改。随后，我们逐帧超分辨率以添加序列无关的高分辨率细节。这种方法具有多种优势，并可以克服该领域SoTA的关键限制。与现有的图像序列生成模型相比，我们的方法在合成质量和序列间的一致性方面表现更优。它还能够生成任意长度的高保真图像序列，并在推理时间和训练数据使用方面提高效率。此外，我们简洁的建模方法使我们的方法能够在各种数据领域中有效泛化，通常需要额外的先验知识和监督才能在生成上下文中建模。我们的方法在数据集上始终在质量和推理速度（至少快两倍）方面优于SoTA。

Summary / 总结

This paper proposes GriDiT, a factorized grid-based diffusion method for efficient long image sequence generation. It addresses the inefficiencies of treating image sequences as large tensors by first generating a low-resolution coarse sequence and then refining individual frames at high resolution. The method uses a Diffusion Transformer to capture frame correlations and super-resolves each frame to add high-resolution details. Experiments show that GriDiT achieves superior synthesis quality, improved sequence coherence, and higher efficiency in inference time and training data usage compared to existing models.

该研究提出了GriDiT方法，将长图像序列的生成过程分为两步：首先生成低分辨率序列，然后逐帧提升分辨率。该方法利用扩散变换器捕捉帧间关联，并逐帧超分辨率处理，从而获得更高的合成质量、更好的序列连贯性以及更高的推理和训练数据使用效率。与现有模型相比，GriDiT在质量上表现更优，推理速度至少快一倍。
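
The grid representation at the core of GriDiT can be illustrated without any model: subsampled frames are tiled into one 2-D image, and a generated grid is cut back into frames before per-frame super-resolution. The sketch below shows only that packing and unpacking, using nested lists as grayscale frames; the row-major tile order, the stride-based subsampling, and all names are assumptions.

```python
from typing import List

Frame = List[List[float]]  # H x W grayscale image


def subsample(frame: Frame, factor: int) -> Frame:
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in frame[::factor]]


def frames_to_grid(frames: List[Frame], cols: int) -> Frame:
    """Tile equally sized frames row-major into one grid image."""
    if not frames or len(frames) % cols != 0:
        raise ValueError("number of frames must be a positive multiple of cols")
    height = len(frames[0])
    grid: Frame = []
    for start in range(0, len(frames), cols):
        tile_row = frames[start:start + cols]
        for y in range(height):
            line: List[float] = []
            for frame in tile_row:
                line.extend(frame[y])
            grid.append(line)
    return grid


def grid_to_frames(grid: Frame, cols: int, height: int, width: int) -> List[Frame]:
    """Inverse of frames_to_grid: cut the grid back into an ordered frame list."""
    tile_rows = len(grid) // height
    frames: List[Frame] = []
    for r in range(tile_rows):
        for c in range(cols):
            frames.append([grid[r * height + y][c * width:(c + 1) * width]
                           for y in range(height)])
    return frames


# Four constant 2 x 2 frames, valued 0..3, packed into a 4 x 4 grid.
frames = [[[float(i)] * 2 for _ in range(2)] for i in range(4)]
grid = frames_to_grid(frames, cols=2)
print(grid)
```

Because the grid is an ordinary 2-D image, an unmodified image generator can be trained on it, and self-attention across the whole grid is what relates one tile (frame) to another.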

Learning to Refocus with Video Diffusion Models

Authors: SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Venue: SIGGRAPH Asia 2025

First: 2025-12-22T19:29:57+00:00 · Latest: 2025-12-24T16:32:32+00:00

Comments: Code and data are available at https://learn2refocus.github.io . SIGGRAPH Asia 2025, Dec. 2025

Abs · PDF · Code1 · Code2 · Project1

Abstract

Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at https://learn2refocus.github.io

中文标题/摘要

标题：学习使用视频扩散模型重新聚焦

对焦是摄影的基础，但自动对焦系统经常无法捕捉到预期的主体，用户在拍摄后也常希望调整对焦。我们提出了一种使用视频扩散模型进行现实后对焦的新方法。从单张失焦图像出发，我们的方法生成了一组感知上准确的焦深序列，表示为视频序列，支持交互式重新对焦并解锁一系列下游应用。我们发布了一个大规模的焦深数据集，以支持这项工作和未来的研究。我们的方法在感知质量和鲁棒性方面均优于现有方法，在各种挑战性场景中表现出色，为日常摄影中的更高级对焦编辑能力铺平了道路。代码和数据可在https://learn2refocus.github.io 获取

Summary / 总结

This paper addresses the issue of inaccurate autofocus in photography and the desire for post-capture refocusing. It introduces a method using video diffusion models to generate a perceptually accurate focal stack from a single defocused image, allowing for interactive refocusing and various downstream applications. The method outperforms existing approaches in both perceptual quality and robustness, and a large-scale dataset is provided to support this work and future research.

该论文通过引入使用视频扩散模型的新方法，解决了摄影中后期对焦的问题。从单张失焦图像出发，该方法生成了感知上准确的焦距堆栈，支持交互式对焦。该方法在各种场景下在感知质量和鲁棒性方面均优于现有技术，展示了在日常摄影中高级对焦编辑的潜力。提供了一个大规模的数据集以支持该研究和未来的工作。

ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang

First: 2025-12-24T16:24:18+00:00 · Latest: 2025-12-24T16:24:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

中文标题/摘要

标题：ACD：通过注意力监督实现视频扩散模型的直接条件控制

在视频合成中，可控性是一个基本要求，准确对齐条件信号至关重要。现有无分类器自由引导方法通常通过建模数据和条件的联合分布间接实现条件化，这往往导致对指定条件的有限可控性。基于分类器的引导通过外部分类器强制执行条件，但模型可能会利用这种机制提高分类器分数而不真正满足预期条件，从而产生对抗性伪影并限制有效的可控性。在本文中，我们提出了注意力条件扩散（ACD），这是一种通过注意力监督直接在视频扩散模型中实现条件控制的新框架。通过使模型的注意力图与外部控制信号对齐，ACD 达到了更好的可控性。为此，我们引入了一种稀疏的3D感知对象布局作为高效的条件信号，以及一个专用的布局控制网和自动注释流水线以实现可扩展的布局集成。在基准视频生成数据集上的大量实验表明，ACD 在保持时间连贯性和视觉保真度的同时，提供了与条件输入更好的对齐，建立了条件视频合成的有效范式。

Summary / 总结

The paper addresses the need for better controllability in video synthesis by proposing Attention-Conditional Diffusion (ACD), which uses attention supervision to directly align the model's attention maps with external control signals. ACD introduces a sparse 3D-aware object layout as an efficient conditioning signal and includes a Layout ControlNet and an automated annotation pipeline. Experiments show that ACD improves alignment with conditioning inputs while maintaining temporal coherence and visual fidelity, outperforming existing methods in conditional video synthesis.

论文旨在通过直接使模型的注意力图与外部控制信号对齐来增强视频合成中的可控性。它提出了注意力条件扩散（ACD）框架，使用稀疏的3D感知对象布局作为条件信号，并引入了布局控制网和自动注释流水线以实现高效的控制。实验表明，ACD在保持时间连贯性和视觉保真度的同时，能够更好地与条件输入对齐，优于现有方法在可控性和有效性方面的表现。

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Authors: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

First: 2025-12-24T16:00:15+00:00 · Latest: 2025-12-24T16:00:15+00:00

Comments: Project Page: https://dreamontage.github.io/DreaMontage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

中文标题/摘要

标题：DreaMontage：任意帧引导的一次性视频生成

“一次性”技术代表了电影制作中独特而复杂的美学。然而，其实现往往受到高昂成本和复杂现实约束的阻碍。尽管新兴的视频生成模型提供了虚拟替代方案，但现有方法通常依赖于简单的片段拼接，这通常无法保持视觉连贯性和时间一致性。在本文中，我们介绍了DreaMontage，这是一种全面的框架，用于任意帧引导的生成，能够从多种用户提供的输入中合成无缝、富有表现力且长时间的一次性视频。为了实现这一目标，我们从三个主要维度应对挑战。(i) 我们将一个轻量级的中间条件机制整合到DiT架构中。通过采用一种有效的基训练数据调优策略，我们解锁了强大的任意帧控制能力。(ii) 为了提高视觉保真度和电影表现力，我们精心制作了一个高质量的数据集，并实施了视觉表达SFT阶段。通过应用定制的DPO方案，我们解决了诸如主体运动合理性及过渡平滑性等关键问题，显著提高了生成内容的成功率和可用性。(iii) 为了促进长序列的生成，我们设计了一种分段自回归(SAR)推理策略，该策略在内存高效的情况下运行。广泛的实验表明，我们的方法实现了视觉上引人注目且无缝连贯的一次性效果，同时保持了计算效率，使用户能够将零散的视觉材料转化为生动、连贯的一次性电影体验。

Summary / 总结

DreaMontage is a framework for generating seamless one-shot videos from arbitrary frames. It integrates a lightweight intermediate-conditioning mechanism into the DiT architecture and uses an Adaptive Tuning strategy to achieve robust control over arbitrary frames. The framework also includes a Visual Expression SFT stage and a Tailored DPO scheme to enhance visual fidelity and smooth transitions. Additionally, it employs a Segment-wise Auto-Regressive (SAR) inference strategy for efficient production of extended sequences. Experimental results show that DreaMontage can produce visually striking and temporally coherent one-shot videos while maintaining computational efficiency.

DreaMontage 是一个从任意帧生成无缝一镜头视频的框架。它将轻量级的中间条件机制集成到 DiT 架构中，采用自适应调谐策略，并包含视觉表达 SFT 阶段以增强视觉保真度。此外，它还使用了定制的 DPO 方案和分段自回归（SAR）推理策略来提高运动合理性和平滑过渡，同时保持计算效率。实验表明，DreaMontage 可以生成视觉上引人注目且时间上连贯的一镜头视频。

LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov

First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io.

中文标题/摘要

标题：LookPlanGraph：基于VLM图增强的体感指令跟随方法

使用大型语言模型（LLM）作为规划器的方法在体感指令跟随任务中变得普遍。为了成功完成任务，LLM 必须在机器人操作的环境中进行接地。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图，并假设在规划开始时所有任务相关信息都已可用。然而，这些方法没有考虑到在图构建和任务执行之间环境可能发生的改变。我们提出了 LookPlanGraph 方法，该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中，LookPlanGraph 不断通过验证现有先验或发现新实体来更新图。这通过使用视觉语言模型处理代理的主观摄像机视图来实现。我们在具有改变对象位置的 VirtualHome 和 OmniGibson 模拟环境中进行了实验，证明了 LookPlanGraph 在基于预定义静态场景图的方法中表现出色。为了展示我们方法的实际适用性，我们还在现实世界中进行了实验。此外，我们引入了 GraSIF（用于指令跟随的图场景）数据集，其中包括自动验证框架，包含从 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 中抽取的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。

Summary / 总结

The paper addresses the challenge of updating scene graphs during task execution to account for environmental changes. It introduces LookPlanGraph, which uses a Vision Language Model to continuously update a scene graph with relevant objects. Experiments in simulated and real-world environments show that LookPlanGraph outperforms methods relying on predefined static scene graphs. The study also presents the GraSIF dataset for instruction following tasks with an automated validation framework.

论文解决了在任务执行过程中更新场景图以应对环境变化的挑战。提出了LookPlanGraph方法，利用视觉语言模型在执行过程中不断更新场景图中的相关物体。实验在模拟和真实环境中表明，LookPlanGraph在任务执行中优于依赖预定义静态场景图的方法。研究还介绍了包含514个任务的GraSIF数据集及其自动验证框架。
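
The graph-update step described in the abstract (verify existing priors, discover new entities) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the dictionary layout, the `update_scene_graph` name, and the idea that the VLM output arrives as a plain list of object names are all assumptions made here.

```python
def update_scene_graph(graph, location, observed_objects):
    """Reconcile object priors with what was just seen at `location`.

    graph: dict mapping object name -> {"location": str | None, "verified": bool}
    observed_objects: object names reported for the current egocentric view
                      (hypothetical stand-in for the VLM output).
    """
    # Priors that place an object here but were not confirmed are refuted.
    for name, node in graph.items():
        if node["location"] == location and name not in observed_objects:
            node["location"] = None
            node["verified"] = False
    # Confirmed priors are marked verified; unseen-before objects are added.
    for name in observed_objects:
        graph[name] = {"location": location, "verified": True}
    return graph


graph = {
    "mug": {"location": "table", "verified": False},
    "apple": {"location": "table", "verified": False},
}
update_scene_graph(graph, "table", ["mug", "book"])
```

After the call, `mug` is a verified prior, `apple` no longer has a known location, and `book` is a newly discovered entity, so a planner reading the graph would not rely on the stale `apple` prior.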

GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer

Authors: Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry

First: 2025-12-23T14:40:08+00:00 · Latest: 2025-12-24T15:28:58+00:00

Abs · PDF · Code1 · Code2

Abstract

We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against DoMINO, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R² and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.

中文标题/摘要

标题：GeoTransolver：使用多尺度几何感知物理注意力变换器在不规则域上学习物理

我们提出了GeoTransolver，这是一种用于CAE的多尺度几何感知物理注意力变换器，用GALE替代了标准注意力，将物理感知的自我注意力耦合到从多尺度球查询中计算出的共享几何/全局/边界条件上下文上（灵感来自DoMINO），并在每个块中重用。在NVIDIA PhysicsNeMo中实现并发布，GeoTransolver持续将几何、全局和边界条件参数投影到物理状态空间中，将潜在计算锚定到域结构和操作模式上。我们在DrivAerML、Luminary SHIFT-SUV和Luminary SHIFT-Wing上对GeoTransolver进行了基准测试，与Domino、Transolver（在PhysicsNeMo中发布）和文献报告的AB-UPT进行比较，并评估了场变量的拖曳/升力R2和相对L1误差。GeoTransolver提供了更好的准确性，对几何/模式变化的鲁棒性更强，并且具有有利的数据效率；我们包括了DrivAerML上的消融分析和诸如等值线图和最佳GeoTransolver模型的设计趋势等定性结果。通过在可扩展的变换器中统一多尺度几何感知上下文和基于物理的注意力，GeoTransolver促进了复杂、不规则域和非线性物理模式下的操作学习，以实现高保真代理建模。

Summary / 总结

GeoTransolver is a multiscale geometry-aware physics attention transformer designed to improve the accuracy and robustness of surrogate models for computer-aided engineering (CAE) on irregular domains. It replaces standard attention with GALE, which couples physics-aware self-attention with cross-attention to a shared geometry/global/boundary-condition context. GeoTransolver was benchmarked on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing against DoMINO, Transolver, and AB-UPT, showing better accuracy, improved robustness to geometry and regime shifts, and favorable data efficiency.

GeoTransolver 是一种多尺度几何感知物理注意力变换器，旨在提高复杂不规则域中计算流体动力学 (CFD) 模型的准确性和鲁棒性。它使用 GALE（几何感知物理注意力）将物理感知的自我注意力与共享几何/全局/边界条件上下文的交叉注意力相结合，增强模型处理不同几何形状和运行条件的能力。GeoTransolver 在 DrivAerML、Luminary SHIFT-SUV 和 Luminary SHIFT-Wing 上与其他方法进行了基准测试，显示出更高的准确性和数据效率，并且在几何形状和运行条件变化时具有更好的鲁棒性。
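
For readers unfamiliar with the two reported metrics, the sketch below shows the definitions most commonly used for them. The abstract does not spell out its exact formulas, so treating R² as the coefficient of determination and relative L1 as the summed absolute error divided by the summed absolute reference value is an assumption; the function names are illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination, e.g. for predicted vs. reference drag."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_l1(y_true, y_pred):
    """Sum of absolute errors normalized by the sum of absolute references."""
    num = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    den = sum(abs(t) for t in y_true)
    return num / den


reference = [1.0, 2.0, 3.0]
print(r2_score(reference, [1.0, 2.0, 3.0]))
print(relative_l1(reference, [1.0, 2.0, 4.0]))
```

A perfect prediction gives an R² of 1, while predicting the mean everywhere gives 0, which is why R² is a natural choice for scalar quantities such as drag and lift across a set of designs.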

SegMo: Segment-aligned Text to 3D Human Motion Generation

Authors: Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen

First: 2025-12-24T15:26:11+00:00 · Latest: 2025-12-24T15:26:11+00:00

Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abs · PDF · Code1 · Code2

Abstract

Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.

中文标题/摘要

标题：SegMo: 与片段对齐的文本到3D人体动作生成

从文本描述生成3D人体动作是一个重要的研究问题，在视频游戏、虚拟现实和增强现实等领域有着广泛的应用。最近的方法在序列级别对齐文本描述和人体动作，忽略了模态的内部语义结构。然而，动作描述和动作序列可以自然地分解为更小且语义上更连贯的片段，这些片段可以作为原子对齐单元以实现更精细的对应。受此启发，我们提出了一种新颖的SegMo框架，以实现细粒度的文本-动作对齐。我们的框架由三个模块组成：(1) 文本片段提取，将复杂的文本描述分解为按时间顺序排列的短语，每个短语代表一个简单的原子动作；(2) 动作片段提取，将完整的动作序列分割为相应的动作片段；(3) 细粒度文本-动作对齐，通过对比学习对齐文本和动作片段。广泛的实验表明，SegMo在两个广泛使用的数据集上改进了强大的基线，实现了在HumanML3D测试集上的TOP 1得分为0.553。此外，由于学习到的文本和动作片段共享嵌入空间，SegMo还可以应用于检索任务，如动作定位和动作到文本检索。

Summary / 总结

SegMo is a novel framework for generating 3D human motions from textual descriptions by aligning text and motion segments. It decomposes textual descriptions and motion sequences into smaller, semantically coherent segments and uses contrastive learning for fine-grained alignment. Experiments show that SegMo outperforms strong baselines, achieving a TOP 1 score of 0.553 on the HumanML3D test set and demonstrating effectiveness in retrieval tasks.

SegMo 是一种新颖的框架，用于从文本生成 3D 人体动作，通过在段落级别对齐文本和动作来解决先前方法的局限性。它包括三个模块：文本段落提取、动作段落提取和细粒度文本-动作对齐。SegMo 在 HumanML3D 测试集上将 TOP 1 分数提高到 0.553，并且在动作定位和动作到文本检索等检索任务中也表现出有效性。
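
The third module aligns text and motion segments with contrastive learning. One common form of such an objective is a symmetric InfoNCE loss over paired segment embeddings, sketched below in plain Python. The abstract does not state which contrastive loss SegMo uses, so the loss form, the `temperature` value, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(text_embs, motion_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th text segment is the positive for the i-th motion segment."""
    logits = [[cosine(t, m) / temperature for m in motion_embs] for t in text_embs]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the text-to-motion and motion-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


text = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(text, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs: loss near zero
print(info_nce(text, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched pairs: large loss
```

Because both directions share one embedding space, the same similarity scores can plausibly serve the retrieval-style uses mentioned in the abstract, such as motion-to-text retrieval.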

MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller

First: 2025-12-24T15:15:18+00:00 · Latest: 2025-12-24T15:15:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.

中文标题/摘要

标题：MiST：理解中期科学训练在开发化学推理模型中的作用

大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而，最近的研究揭示了一个关键限制：强化学习仅在基础模型对正确答案已分配非可忽略概率时才能成功——我们称这一特性为“潜在可解性”。本研究探讨了化学推理能力的出现及其先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件：1）符号能力，2）潜在的化学知识。我们提出了中期科学训练（MiST）：一系列中期训练技术以满足这些条件，包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上继续预训练以及在10亿个标记上进行监督微调。这些步骤将3B和7B模型的潜在可解性得分最多提高至1.8倍，并使强化学习在有机反应命名中的top-1准确率从10.9%提升到63.9%，在无机材料生成中的top-1准确率从40.6%提升到67.4%。对于其他具有挑战性的化学任务，也观察到了类似的结果，同时生成了可解释的推理痕迹。我们的研究结果定义了化学推理训练的明确先决条件，并突显了中期训练在解锁推理能力中的更广泛作用。

Summary / 总结

This study explores the role of mid-stage scientific training (MiST) in developing chemical reasoning capabilities in large language models. The research identifies two prerequisites: symbolic competence and latent chemical knowledge. MiST includes techniques like data-mixing with SMILES/CIF-aware pre-processing, continued pre-training, and supervised fine-tuning. These methods significantly improve the models' latent solvability, enabling reinforcement learning to achieve top-1 accuracy of 63.9% in organic reaction naming and 67.4% in inorganic material generation, compared to 10.9% and 40.6% respectively without MiST.

研究探讨了中期科学训练（MiST）在发展化学推理模型中的作用，解决了“潜在可解性”约束问题，即基础模型必须初始赋予正确答案非忽略的概率。研究引入了数据混合、SMILES/CIF感知预处理、持续预训练和监督微调等MiST技术，显著提高了潜在可解性分数，并使强化学习在有机反应命名和无机材料生成任务中的准确性大幅提升。这些结果强调了中期训练在解锁化学推理能力中的重要性。

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Authors: Xiao-Qi Han, Ze-Feng Gao, Peng-Jie Guo, Zhong-Yi Lu

First: 2025-12-24T15:07:36+00:00 · Latest: 2025-12-24T15:07:36+00:00

Comments: 19 pages, 6 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen), the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench.

中文标题/摘要

标题：PhononBench：一种基于声子的大规模基准测试，用于晶体生成中的动力学稳定性

在本工作中，我们介绍了PhononBench，这是首个用于AI生成晶体动力学稳定性的大规模基准测试。利用最近开发的MatterSim原子间势，该势能在超过10,000种材料中实现了从头算水平的声子预测精度，PhononBench能够高效地进行大规模声子计算和动力学稳定性分析，针对六种领先的晶体生成模型生成的108,843种晶体结构。PhononBench揭示了当前生成模型在确保动力学稳定性方面的普遍局限性：所有生成结构的动力学稳定性平均率为25.83%，最佳模型MatterGen也仅达到41.0%。进一步的案例研究表明，在目标性质生成中——以MatterGen的带隙调节为例——即使在最佳带隙条件0.5 eV下，动力学稳定性率仍低至23.5%。在空间群控制生成中，高对称晶体表现出更好的稳定性（例如，立方系统达到49.2%的稳定性率），但所有控制生成的平均稳定性仍仅为34.4%。这项研究的重要附加成果是识别了28,119种在整个布里渊区都稳定的晶体结构，为未来的材料探索提供了可靠的候选池。通过建立首个大规模动力学稳定性基准测试，本工作系统地突显了当前晶体生成模型的局限性，并为未来设计和发现物理上可行的材料提供了必要的评估标准和指导。所有模型生成的晶体结构、声子计算结果以及PhononBench开发的高通量评估工作流程将在https://github.com/xqh19970407/PhononBench公开发布。
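
A dynamical-stability rate of this kind can be computed along the following lines. The sketch assumes the common convention that imaginary phonon modes are reported as negative frequencies, and it uses a small negative tolerance to absorb numerical noise. The tolerance value, the data layout, and the function names are assumptions for illustration, not PhononBench's actual workflow.

```python
def is_dynamically_stable(frequencies_thz, tol=-0.05):
    """True if no phonon mode at any sampled q-point falls below `tol` (THz).

    frequencies_thz: list of q-points, each a list of mode frequencies,
    with imaginary modes encoded as negative numbers.
    """
    return all(f >= tol for qpoint in frequencies_thz for f in qpoint)


def stability_rate(structures):
    """Percentage of structures that pass the stability check."""
    stable = sum(1 for bands in structures.values() if is_dynamically_stable(bands))
    return 100.0 * stable / len(structures)


# Toy phonon data for four hypothetical structures.
structures = {
    "a": [[0.0, 1.2, 3.4]],
    "b": [[-2.1, 0.5, 1.0]],                # strongly imaginary mode: unstable
    "c": [[-0.01, 2.0], [0.3, 2.2]],        # within tolerance: stable
    "d": [[-0.5, 1.0]],
}
print(stability_rate(structures))
```

As a consistency check on the reported figures, 28,119 stable structures out of 108,843 is 25.83% when rounded to two decimals, which matches the average rate quoted in the abstract.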

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

Venue: MM

First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00

Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval

中文标题/摘要

标题：利用轻量级实体提取实现可扩展的基于事件的图像检索

从自然语言描述中检索图像是一项核心任务，位于计算机视觉和自然语言处理的交叉点，广泛应用于搜索引擎、媒体归档和数字内容管理中。然而，由于模糊或依赖上下文的查询、语言的多样性以及需要可扩展的解决方案，现实世界中的图像-文本检索仍然具有挑战性。在本文中，我们提出了一种轻量级的两阶段检索管道，利用事件中心的实体提取来结合真实场景描述中的时间与上下文信号。第一阶段使用BM25基于显著实体进行高效的候选过滤，而第二阶段则应用BEiT-3模型来捕捉深层的多模态语义并重新排序结果。在OpenEvents v1基准上评估，我们的方法达到了0.559的平均精度，显著优于先前的基线。这些结果突显了结合事件引导的过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。
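
The two-stage design (cheap entity-based filtering, then an expensive reranker on the survivors) can be sketched as below. The BM25 scoring follows the standard Okapi formula with a smoothed IDF; the `rerank_fn` callback is a hypothetical stand-in for the BEiT-3 scoring stage, and the tokenized toy captions are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage_retrieve(query_entities, captions, rerank_fn, top_k=2):
    """Stage 1: keep the top_k captions by BM25. Stage 2: reorder them with rerank_fn."""
    scores = bm25_scores(query_entities, captions)
    candidates = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query_entities, captions[i]), reverse=True)


captions = [
    ["flood", "jakarta", "2020"],
    ["football", "final", "madrid"],
    ["flood", "relief", "jakarta", "volunteers"],
]
# Placeholder reranker that simply prefers longer captions.
ranked = two_stage_retrieve(["flood", "jakarta"], captions, lambda q, c: len(c))
print(ranked)
```

The point of the split is cost: BM25 only needs term statistics, so the multimodal model has to score `top_k` candidates per query rather than the whole collection.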

RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00

Comments: 11 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

中文标题/摘要

标题：RoboSafe：通过可执行的安全逻辑保护具身代理

由视觉-语言模型（VLMs）驱动的具身代理越来越能够执行复杂的现实世界任务，但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为，提供了一种有前景的解决方案，因为它们具有灵活性。然而，现有的防御措施往往依赖于静态规则过滤或提示级控制，难以应对动态、时间依赖性和上下文丰富的环境中隐含的风险。为了解决这个问题，我们提出了RoboSafe，这是一种通过可执行谓词基础安全逻辑为具身代理提供混合推理运行时保护的混合方法。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种后向反思推理模块，该模块不断回顾短期记忆中的最近轨迹，以推断时间安全谓词，并在检测到违规行为时主动触发重新规划。然后，我们提出了一种前瞻预测推理模块，该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预见即将出现的风险。这些组件共同形成了一个既可解释又可执行的适应性、验证性安全逻辑。在多个代理的广泛实验中，RoboSafe与领先基准相比显著减少了危险行为（风险发生率降低36.8%），同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。

Summary / 总结

RoboSafe is designed to protect embodied agents from hazardous instructions by using executable safety logic. It employs a hybrid reasoning approach with two modules: Backward Reflective Reasoning, which continuously monitors recent actions for safety violations, and Forward Predictive Reasoning, which predicts potential risks based on long-term memory and current observations. Experiments show that RoboSafe significantly reduces hazardous actions by 36.8% compared to existing methods, while preserving task performance. Real-world evaluations on robotic arms validate its practicality.

RoboSafe 通过使用可执行的安全逻辑来保护由视觉语言模型驱动的实体代理，结合后向反思推理和前瞻预测推理，实时监控和预测潜在风险。实验表明，RoboSafe 相比现有方法显著减少了 36.8% 的危险行为，同时保持了相近的任务性能。实际机器人手臂的评估进一步验证了其实用性。
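
To make the idea of safety logic that is "executable as code" concrete, the sketch below represents each safety predicate as a named Python callable evaluated over the recent action history and the proposed next action. The predicate, the action names, and the `check_action` helper are hypothetical illustrations; the paper's actual predicate language and memory structure are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SafetyPredicate:
    name: str
    # Returns True when executing `action` after `history` would be unsafe.
    violated: Callable[[List[str], str], bool]


def stove_left_on(history, action):
    """Temporal predicate: leaving the room while the stove is still on."""
    stove_on = False
    for past in history:
        if past == "turn_on_stove":
            stove_on = True
        elif past == "turn_off_stove":
            stove_on = False
    return stove_on and action == "leave_room"


def check_action(history, action, predicates):
    """Names of all predicates the proposed action would violate."""
    return [p.name for p in predicates if p.violated(history, action)]


predicates = [SafetyPredicate("stove_left_on", stove_left_on)]
print(check_action(["turn_on_stove", "pick_up_pan"], "leave_room", predicates))
print(check_action(["turn_on_stove", "turn_off_stove"], "leave_room", predicates))
```

A non-empty result would be the signal to block the action and trigger replanning; because each predicate is ordinary code with a name, a violation can be reported and inspected directly.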

Latent Implicit Visual Reasoning

Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

First: 2025-12-24T14:59:49+00:00 · Latest: 2025-12-24T14:59:49+00:00

Abs · PDF · Code1 · Code2

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

中文标题/摘要

标题：潜在隐式视觉推理

虽然大型多模态模型（LMMs）取得了显著进展，但它们仍然主要以文本为中心，依赖语言作为核心推理模态。因此，它们在处理以视觉为主的推理任务方面能力有限。最近的方法通过使用辅助图像、深度图或图像裁剪来监督中间的视觉步骤，试图解决这一问题。然而，这些策略对“有用的”视觉抽象施加了限制性的先验，增加了注释成本，并且难以在不同任务之间泛化。为了解决这一关键限制，我们提出了一种任务无关的机制，该机制训练LMMs发现和使用视觉推理标记，而无需显式的监督。这些标记全局注意并以任务自适应的方式重新编码图像，使模型能够提取相关视觉信息，而无需手工制作的监督。我们的方法在各种视觉中心任务上优于直接微调，并且在包括那些难以指定中间抽象的任务中也达到了最先进的结果，同时还能泛化到多任务指令调优。

Summary / 总结

The research aims to enhance Large Multimodal Models (LMMs) by addressing their reliance on text for reasoning, which limits their ability to handle visual tasks. The method involves training LMMs to discover and use visual reasoning tokens without explicit supervision, allowing the model to adapt to different tasks by re-encoding images globally. Key findings show that this approach outperforms direct fine-tuning and achieves state-of-the-art results on various vision-centric tasks, even when intermediate abstractions are difficult to specify, and it generalizes well to multi-task instruction tuning.

研究旨在通过开发一种任务无关的机制，使大型多模态模型（LMMs）能够发现并使用视觉推理令牌，而无需显式的监督。该方法使模型能够以任务自适应的方式重新编码图像，提取相关视觉信息。该方法在各种视觉中心任务上优于直接微调，实现了最先进的结果，包括那些难以指定中间抽象的任务，并且在多任务指令调优中表现出良好的泛化能力。

A study of EHVI vs fixed scalarization for molecule design

Authors: Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige

Venue: NeurIPS

First: 2025-07-18T07:12:19+00:00 · Latest: 2025-12-24T14:56:07+00:00

Comments: Accepted to NeurIPS AI4Science Workshop 2025

Abs · PDF · Code1 · Code2

Abstract

Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy, Expected Hypervolume Improvement (EHVI), against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants (including random or adaptive schemes), our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.

中文标题/摘要

标题：关于分子设计中EHVI与固定加权化方法比较的研究

多目标贝叶斯优化（MOBO）为分子设计中的权衡提供了原则性的框架。然而，它与标量化替代方法的实证优势尚未得到充分探索。我们使用期望改进（EI）作为固定权重标量化基线，与基于帕累托的MOBO策略——期望hypervolume改进（EHVI）进行基准测试，使用了严格控制的设置，具有相同的高斯过程代理和分子表示。在三个分子优化任务中，EHVI在帕累托前沿覆盖、收敛速度和化学多样性方面始终优于标量化EI。虽然标量化包括灵活的变体——包括随机或自适应方案，但我们的结果表明，在数据量有限的情况下，即使是强确定性实例也可能表现不佳。这些发现为在有限评估预算和非平凡权衡时新分子优化中帕累托意识获取的实际优势提供了实证证据。

Summary / 总结

The study evaluates the performance of Expected Hypervolume Improvement (EHVI) against a fixed-weight scalarized baseline (Expected Improvement, EI) in multi-objective Bayesian optimization for molecular design. Across three tasks, EHVI outperforms EI in terms of Pareto front coverage, convergence speed, and chemical diversity. The results suggest that Pareto-aware acquisition methods like EHVI are advantageous, especially in low-data regimes where trade-offs are complex.

研究比较了在分子设计的多目标贝叶斯优化中，Expected Hypervolume Improvement (EHVI) 和固定权重标量化基线（Expected Improvement, EI）的表现。在三个任务中，EHVI 在帕累托前沿覆盖率、收敛速度和化学多样性方面均优于 EI，即使标量化包括自适应方案。这表明，在数据有限且权衡复杂的情况下，帕累托意识的获取方法具有优势。
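
The quantity behind EHVI is the hypervolume dominated by the current Pareto front, and the improvement a new point would add to it. The sketch below computes both for two maximized objectives. It is deterministic: actual EHVI takes the expectation of this improvement under the Gaussian Process posterior, which is not shown here. The reference point and the toy objective values are assumptions.

```python
def pareto_front(points):
    """Non-dominated points when both objectives are maximized."""
    front = set()
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.add(p)
    return sorted(front)


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the Pareto front, measured from the reference point."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:  # x decreases, so y increases along the front
        area += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return area


def hypervolume_improvement(points, candidate, ref=(0.0, 0.0)):
    """Gain in dominated area from adding one candidate point."""
    return hypervolume_2d(points + [candidate], ref) - hypervolume_2d(points, ref)


observed = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(observed))
print(hypervolume_improvement(observed, (3.0, 3.0)))
print(hypervolume_improvement(observed, (2.0, 1.5)))  # dominated candidate adds nothing
```

A fixed-weight scalarization collapses the two objectives into one number before acquisition, so it rewards progress along a single direction, whereas the hypervolume gain rewards any candidate that extends the front.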

ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

Authors: Paritosh Parmar, Eric Peh, Basura Fernando

First: 2025-08-28T17:10:53+00:00 · Latest: 2025-12-24T14:52:45+00:00

Comments: Project page: https://paritoshparmar.github.io/chainreaction/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets. We construct human-verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization, positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

中文标题/摘要

标题：ChainReaction：因果链引导的推理方法实现模块化和可解释的因果为什么视频问答

现有的因果为什么视频问答（VideoQA）模型往往难以进行高阶推理，依赖于不透明的单一管道，将视频理解、因果推理和答案生成紧密结合在一起。这些黑盒方法缺乏可解释性，通常依赖于浅层启发式方法。我们提出了一种新的模块化范式，明确地将因果推理与答案生成分离，引入自然语言因果链作为可解释的中间表示。受人类认知模型的启发，这些结构化的因果序列将低级视频内容与高级因果推理联系起来，使推理变得透明且逻辑连贯。我们的两阶段架构包括一个因果链提取器（CCE），从视频-问题对中生成因果链，以及一个因果链驱动的答案生成器（CCDA），基于这些链生成答案。为了解决缺乏标注推理痕迹的问题，我们提出了一种生成现有数据集中准确因果链的可扩展方法。我们为46000个样本构建了经过人类验证的因果链。我们还提出了CauCo，一种新的因果导向字幕评估指标。在三个大规模基准上的实验表明，我们的方法不仅优于最先进的模型，还在可解释性、用户信任和泛化方面取得了显著提升——将CCE定位为跨不同领域的可重用因果推理引擎。项目页面：https://paritoshparmar.github.io/chainreaction/

Summary / 总结

The paper addresses the limitations of existing VideoQA models in handling higher-order reasoning by proposing a modular approach that separates causal reasoning from answer generation. It introduces causal chains as interpretable intermediate representations, enabling transparent and logically coherent inference. The two-stage architecture, comprising a Causal Chain Extractor and a Causal Chain-Driven Answerer, demonstrates superior performance and explainability on three large-scale benchmarks compared to state-of-the-art models.

论文针对现有视频问答模型在处理高级推理时的局限性，提出了一种模块化的方法，将因果推理与答案生成分离。它引入了因果链作为可解释的中间表示，使推理过程透明且逻辑连贯。两阶段架构，包括因果链提取器和因果链驱动的答案生成器，在三个大规模基准测试中表现出色，优于最先进的模型，并且在可解释性、用户信任和泛化方面取得了显著提升。

Causal-driven attribution (CDA): Estimating channel influence without user-level data

Authors: Georgios Filippou, Boi Mai Quach, Diana Lenghel, Arthur White, Ashish Kumar Jha

First: 2025-12-24T14:51:12+00:00 · Latest: 2025-12-24T14:51:12+00:00

Comments: 42 pages, 8 figures, initially submitted to the Journal of the Academy of Marketing Science on 24 Dec 2025

Abs · PDF · Code1 · Code2

Abstract

Attribution modelling lies at the heart of marketing effectiveness, yet most existing approaches depend on user-level path data, which are increasingly inaccessible due to privacy regulations and platform restrictions. This paper introduces a Causal-Driven Attribution (CDA) framework that infers channel influence using only aggregated impression-level data, avoiding any reliance on user identifiers or click-path tracking. CDA integrates temporal causal discovery (using PCMCI) with causal effect estimation via a Structural Causal Model to recover directional channel relationships and quantify their contributions to conversions. Using large-scale synthetic data designed to replicate real marketing dynamics, we show that CDA achieves an average relative RMSE of 9.50% when given the true causal graph, and 24.23% when using the predicted graph, demonstrating strong accuracy under correct structure and meaningful signal recovery even under structural uncertainty. CDA captures cross-channel interdependencies while providing interpretable, privacy-preserving attribution insights, offering a scalable and future-proof alternative to traditional path-based models.

中文标题/摘要

标题：因果驱动归因（CDA）：无需用户级数据估计渠道影响

归因建模是衡量营销效果的核心，但大多数现有方法依赖于用户级路径数据，这些数据因隐私法规和平台限制而变得越来越难以获取。本文介绍了一种因果驱动归因（CDA）框架，该框架仅使用聚合的印象级数据来推断渠道影响，避免依赖用户标识符或点击路径跟踪。CDA 结合使用 PCMCI 的时间因果发现与结构因果模型中的因果效应估计，以恢复渠道关系的方向并量化其对转化的贡献。使用设计用于复制真实营销动态的大规模合成数据，我们展示了当给定真实因果图时，CDA 的平均相对 RMSE 为 9.50%，使用预测图时为 24.23%，证明在正确结构下具有很强的准确性，并且即使在结构不确定性下也能恢复有意义的信号。CDA 捕捉跨渠道的相互依赖性，同时提供可解释的、保护隐私的归因洞察，提供了一种可扩展且面向未来的替代传统路径模型的选择。

Summary / 总结

The paper introduces Causal-Driven Attribution (CDA), a framework that infers channel influence using aggregated impression-level data, avoiding user-level data. CDA combines temporal causal discovery with causal effect estimation to recover channel relationships and quantify their contributions. Experiments on synthetic data show CDA achieves an average relative RMSE of 9.50% with the true causal graph and 24.23% with a predicted graph, indicating strong accuracy and meaningful signal recovery even under structural uncertainty.

论文提出了因果驱动归因（CDA）框架，该框架利用聚合的曝光级数据来推断渠道影响，而不依赖于用户标识符或点击路径跟踪，解决了现有基于用户级数据方法的限制。CDA 结合了时间因果发现和因果效应估计，以恢复渠道关系并量化其贡献。实验表明，CDA 在使用真实因果图时的平均相对 RMSE 为 9.50%，在使用预测图时为 24.23%，即使在结构不确定性下也能实现较强的准确性和有意义的信号恢复。
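
The relative RMSE figures quoted above (9.50% and 24.23%) can be read against a concrete definition. The abstract does not say how the error is normalized, so the sketch below assumes one common convention, RMSE divided by the mean absolute true value; the function name `relative_rmse` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def relative_rmse(estimated, actual):
    """RMSE of the estimates divided by the mean absolute true value.

    Assumption: this normalization is one common convention; the paper
    may define relative RMSE differently.
    """
    if len(estimated) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual)
    scale = sum(abs(a) for a in actual) / len(actual)
    if scale == 0:
        raise ValueError("mean absolute true value is zero")
    return math.sqrt(mse) / scale


# Hypothetical per-channel contributions (illustrative numbers only).
true_effects = [1.0, 2.0]
estimated_effects = [1.1, 2.2]
print(f"{relative_rmse(estimated_effects, true_effects):.2%}")
```

Under this convention, a value of 9.50% would mean the typical estimation error is roughly a tenth of the typical true channel contribution.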

Human Motion Estimation with Everyday Wearables

Authors: Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang

First: 2025-12-24T14:44:51+00:00 · Latest: 2025-12-24T14:44:51+00:00

Abs · PDF · Code1 · Code2

Abstract

While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by a motion capture (MoCap) system, to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

中文标题/摘要

标题：基于日常穿戴设备的人体运动估计

基于穿戴设备的人体运动估计对于XR交互等应用至关重要，但现有方法往往存在穿戴不便、硬件昂贵和繁琐校准的问题，这阻碍了它们在日常生活中的应用。为解决这些挑战，我们提出了EveryWear，一种完全基于日常穿戴设备的轻量级和实用的人体动作捕捉方法：一部智能手机、智能手表、耳塞和配备一个前置摄像头和两个向下摄像头的智能眼镜，无需在使用前进行显式校准。我们引入了Ego-Elec，一个包含56种日常活动的9小时真实世界数据集，覆盖17种不同的室内和室外环境，并由动作捕捉（MoCap）提供地面真实三维注释，以促进该领域的稳健研究和基准测试。我们的方法采用多模态教师-学生框架，将第一人称摄像头的视觉线索与消费设备的惯性信号相结合。通过直接在真实世界数据上训练而不是合成数据，我们的模型有效地消除了限制先前工作的模拟到现实的差距。实验表明，我们的方法优于基线模型，验证了其在实际全身运动估计中的有效性。

Summary / 总结

The research aims to make human motion estimation practical for applications such as XR interaction by addressing poor wearability, expensive hardware, and cumbersome calibration. It introduces EveryWear, a lightweight approach built on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses. The method employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices and is trained on real-world data. Experiments show that the approach outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

研究旨在通过解决穿戴不便和硬件昂贵等问题，改进面向XR交互等应用的人体运动估计。提出了一种轻量级方法EveryWear，利用智能手机、智能手表、耳塞和智能眼镜等日常穿戴设备。该方法采用多模态教师-学生框架，结合来自第一人称摄像头的视觉线索和消费级设备的惯性信号，并直接在真实世界数据上进行训练。实验表明，该方法优于基线模型，证明了其在实际全身运动估计中的有效性。

Analytic and Variational Stability of Deep Learning Systems

Authors: Ronald Katende

First: 2025-12-24T14:43:59+00:00 · Latest: 2025-12-24T14:43:59+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a unified analytic and variational framework for studying stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. We prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. Classical spectral stability results for feedforward networks, a discrete CFL-type condition for residual architectures, and parametric and temporal stability laws for stochastic gradient methods arise as direct consequences. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting framework provides a unified dynamical description of stability across architectures and optimization methods, clarifying how architectural and algorithmic choices jointly govern robustness and sensitivity to perturbations. It also provides a foundation for further extensions to continuous-time limits and geometric formulations of learning dynamics.

中文标题/摘要

标题：深度学习系统的解析与变分稳定性

我们提出了一种统一的解析和变分框架，用于研究深度学习系统作为耦合的表示-参数动力学的稳定性。中心对象是学习稳定性轮廓，它跟踪表示、参数和更新机制在学习轨迹上受到扰动时的微小响应。我们证明了一个基本的解析稳定性定理，表明这些稳定性特征的一致有界性在范数等价的意义下等价于存在一种类似李雅普诺夫的能量，该能量沿学习流耗散。在光滑区域，该框架给出了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数。前馈网络的经典谱稳定性结果、残差架构的离散CFL型条件以及随机梯度方法的参数和时间稳定性法则均作为直接推论出现。该理论扩展到非光滑学习系统，包括ReLU网络、近端和投影更新以及随机次梯度流，通过用Clarke广义导数替换经典导数，并用变分李雅普诺夫泛函替换光滑能量。由此产生的框架提供了一种统一的动力学描述，涵盖了各种架构和优化方法的稳定性，阐明了架构和算法选择如何共同影响鲁棒性和对扰动的敏感性。它还为连续时间极限和学习动力学的几何形式提供了进一步扩展的基础。

Summary / 总结

This paper introduces a unified framework combining analytic and variational methods to study stability in deep learning systems. The central concept is the Learning Stability Profile, which measures the infinitesimal response of representations, parameters, and update mechanisms to perturbations during training. The authors prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent to the existence of a Lyapunov-type energy that dissipates along the learning flow. The framework provides explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to the contractivity of the learning dynamics, and extends to non-smooth learning systems like ReLU networks and stochastic subgradient flows.

论文提出了一种统一框架，通过分析表示、参数和更新机制对扰动的响应来研究深度学习系统的稳定性。证明了一个基本的解析稳定性定理，表明稳定性特征的一致有界性等价于存在一种沿学习流耗散的Lyapunov型能量。该框架提供了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数，并扩展到如ReLU网络和随机次梯度流等非光滑系统。
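
As a point of reference for the "classical spectral stability results for feedforward networks" mentioned in the abstract, the standard Lipschitz bound can be written out explicitly. This is the textbook statement, not the paper's theorem; the symbols $W_\ell$, $\phi_\ell$, and $L_\phi$ are notation assumed here for illustration.

```latex
% Feedforward network with weight matrices W_1, ..., W_L and
% activations \phi_\ell that are each L_\phi-Lipschitz (assumed notation).
\begin{align}
  f(x) &= \phi_L\bigl(W_L\,\phi_{L-1}(\cdots\,\phi_1(W_1 x))\bigr), \\
  \|f(x) - f(x')\|_2
    &\le \Bigl(\prod_{\ell=1}^{L} L_\phi\,\|W_\ell\|_2\Bigr)\,\|x - x'\|_2 .
\end{align}
% For ReLU, L_\phi = 1, so the bound reduces to the product of the
% spectral norms \|W_\ell\|_2. If that product is below 1, the map f
% is a contraction with respect to input perturbations.
```

The bound follows by applying the Lipschitz property of each activation and the operator-norm inequality $\|W_\ell v\|_2 \le \|W_\ell\|_2 \|v\|_2$ layer by layer.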

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

First: 2025-12-18T10:21:14+00:00 · Latest: 2025-12-24T14:39:27+00:00

Comments: Project available at https://github.com/sarapapi/hearing2translate

Abs · PDF · Code1 · Code2 · Code3

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

中文标题/摘要

标题：听译：将语音模态整合到LLM中的有效性

随着大型语言模型（LLMs）超越文本，将语音作为原生模态进行整合产生了语音LLMs，旨在直接翻译口语，从而绕过传统的基于转录的管道。然而，这种整合是否能比现有的级联架构提高语音到文本的翻译质量仍是一个开放的问题。我们提出了听译，这是第一个全面的测试套件，严格基准测试了5个最先进的语音LLMs与16个强大的直接系统和级联系统，这些系统结合了领先的语音基础模型（SFM）和多语言LLMs。我们的分析涵盖了16个基准测试、13种语言对和9种具有挑战性的条件，包括不流利、嘈杂和长篇语音。在广泛的评估中，我们发现级联系统在整体上是最可靠的，而当前的语音LLMs仅在某些设置中与级联系统相当，SFM则落后于两者，这表明在模型内部或管道中整合一个LLM对于高质量的语音翻译是必不可少的。

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Authors: Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu

First: 2025-12-24T14:28:17+00:00 · Latest: 2025-12-24T14:28:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose Schrödinger's Navigator, a navigation framework inspired by Schrödinger's thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger's Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.

中文标题/摘要

标题：薛定谔的导航器：零样本物体导航的未来图景想象

零样本物体导航（ZSON）要求机器人在未见过的环境中定位目标物体，无需依赖预先构建的地图或特定任务的训练。然而，现有的ZSON方法在现实且杂乱的环境中往往难以应对，尤其是在场景包含严重遮挡、未知风险或动态移动目标的情况下。为了解决这些挑战，我们提出了薛定谔的导航器，这是一种受薛定谔不确定性思想实验启发的导航框架。该框架将未观察到的空间视为一组可能的未来世界，并在行动前对其进行推理。基于第一人称视觉输入和三条候选轨迹，一个轨迹条件下的3D世界模型沿着每条路径想象未来的观察结果。这使智能体能够看到遮挡物之后的情况，预见未见过区域的风险，而无需额外绕路或密集的全局建图。想象出的3D观察结果被融合到导航地图中，并用于更新价值地图。这些更新引导策略避开遮挡物，减少对不确定空间的暴露，并更好地追踪移动目标。在具有严重静态遮挡、未知风险和动态移动目标的三个具有挑战性的场景中，使用四足机器人Go2进行的实验表明，薛定谔的导航器在遮挡严重的环境中，在自我定位、物体定位和整体成功率方面始终优于强大的ZSON基线。这些结果证明了轨迹条件下的3D想象在实现稳健的零样本物体导航方面的有效性。

Summary / 总结

Schrödinger's Navigator is a navigation framework designed for zero-shot object navigation in unseen environments with heavy occlusions and dynamic targets. It treats unobserved space as a set of plausible future worlds and reasons over them to avoid occlusions and reduce exposure to uncertain space. Experiments show that this approach outperforms existing methods in self-localization, object localization, and success rate in occlusion-heavy environments.

薛定谔的导航器是一种零样本物体导航框架，通过将未观察到的空间视为一组可能的未来世界来应对杂乱环境中的挑战。它使用轨迹条件下的3D世界模型来想象未来的观察，并将它们融合到导航图中，引导机器人避开遮挡和不确定的空间。实验表明，在遮挡严重的场景中，这种方法在自定位、物体定位和整体成功率方面优于现有方法。

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

中文标题/摘要

标题：VisRes 基准：关于评估 VLM 视觉推理能力的研究

视觉-语言模型（VLMs）在视觉问答和图像描述等任务上取得了显著进展。然而，这些模型在视觉推理方面的能力与依赖于语言先验的程度仍然不清楚。为了解决这个问题，我们引入了 VisRes 基准，该基准旨在在无需上下文语言监督的自然环境中研究视觉推理。通过对三种复杂性级别的模型行为进行分析，我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其级别上隔离了不同的推理能力。第一级测试在模糊、纹理变化、遮挡和旋转等干扰下的感知完成和全局图像匹配；第二级测试基于规则的单属性推理（例如，颜色、数量、方向）；第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中，我们发现最先进的 VLMs 在微妙的感知干扰下表现接近随机，揭示了其有限的抽象能力，仅限于模式识别。最后，我们讨论了 VisRes 如何为多模态研究中的抽象视觉推理提供统一框架。

Summary / 总结

The paper introduces VisRes Bench, a benchmark to evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) without relying on contextual language supervision. The benchmark assesses models across three levels of complexity: perceptual completion, rule-based inference, and compositional reasoning. Key findings show that state-of-the-art VLMs struggle with subtle perceptual perturbations, indicating limited abstraction beyond pattern recognition. This work highlights the need for improving VLMs' visual reasoning abilities.

论文介绍了VisRes Bench，这是一个用于评估Vision-Language模型（VLM）在没有上下文语言监督情况下的视觉推理能力的基准。该基准包括三个复杂度级别：感知完成和全局图像匹配（Level 1）、单一属性的规则推理（Level 2）和多属性综合推理（Level 3）。在超过19,000张图像上，最先进的VLMs在细微的感知扰动下表现不佳，表明它们的抽象能力有限，主要依赖于模式识别而非真正的视觉推理能力。

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan

First: 2025-12-24T14:08:38+00:00 · Latest: 2025-12-24T14:08:38+00:00

Comments: 14 pages, 10 figures, Technical Report

Abs · PDF · Code1 · Code2

Abstract

In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.

中文标题/摘要

标题：UltraShape 1.0：通过可扩展的几何细化生成高保真3D形状

在本报告中，我们介绍了UltraShape 1.0，这是一种可扩展的3D扩散框架，用于高保真3D几何生成。所提出的方法采用两阶段生成管道：首先合成粗略的整体结构，然后细化以生成详细的高质量几何形状。为了支持可靠的3D生成，我们开发了一个全面的数据处理管道，包括一种新颖的封闭式处理方法和高质量数据过滤。该管道通过去除低质量样本、填补孔洞和加厚细长结构，提高了公共3D数据集的几何质量，同时保留了精细的几何细节。为了实现精细的几何细化，我们在扩散过程中将空间定位与几何细节合成解耦。我们通过在固定的空间位置进行体素细化来实现这一点，其中从粗略几何形状推导出的体素查询提供了通过RoPE编码的显式位置锚点，使扩散模型能够专注于在减少的结构解决方案空间内合成局部几何细节。我们的模型仅在公共3D数据集上进行训练，尽管训练资源有限，但仍能实现强大的几何质量。广泛的评估表明，UltraShape 1.0在数据处理质量和几何生成方面与现有的开源方法竞争。所有代码和训练模型将被发布以支持未来的研究。

Summary / 总结

UltraShape 1.0 is a scalable 3D diffusion framework for generating high-fidelity 3D shapes. It uses a two-stage pipeline that first synthesizes a coarse global structure and then refines it into detailed geometry. Its data pipeline combines a novel watertight processing method with high-quality data filtering to improve geometric quality. By decoupling spatial localization from geometric detail synthesis, the diffusion model can focus on local geometric details, achieving strong geometric quality with limited training resources and performance competitive with existing open-source methods.

UltraShape 1.0 是一个可扩展的 3D 扩散框架，通过两阶段过程生成高保真 3D 几何：粗略的全局结构合成后进行详细的细化。它包括一个全面的数据处理管道，可以提高几何质量，并且在扩散过程中将空间定位与几何细节合成解耦，以实现精细的细化。评估显示，UltraShape 1.0 在数据处理和几何生成方面与现有方法相比表现良好，尽管训练资源有限。
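
The abstract says that voxel queries carry "explicit positional anchors encoded via RoPE". The sketch below illustrates rotary position embedding in general, extended to 3D voxel indices by splitting the feature vector into three per-axis chunks. That per-axis split, the function names `rope_rotate` and `rope_voxel`, and the `base` value are assumptions made for illustration; the report may encode positions differently.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    if len(vec) % 2 != 0:
        raise ValueError("feature length must be even")
    half = len(vec) // 2
    out = []
    for i in range(half):
        # Pair i uses frequency base^(-i/half), as in standard 1D RoPE.
        angle = position * base ** (-i / half)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_voxel(vec, voxel_index, base=10000.0):
    """Hypothetical 3D variant: one feature chunk per axis (x, y, z)."""
    if len(vec) % 6 != 0:
        raise ValueError("feature length must be divisible by 6")
    chunk = len(vec) // 3
    out = []
    for axis, coord in enumerate(voxel_index):
        out.extend(rope_rotate(vec[axis * chunk:(axis + 1) * chunk], coord, base))
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# The attention score depends only on the relative voxel offset.
q = [1.0, 0.0, 0.5, 2.0, -1.0, 3.0]
k = [0.3, -0.7, 1.5, 0.2, 0.9, -0.4]
score_a = dot(rope_voxel(q, (3, 5, 7)), rope_voxel(k, (1, 4, 7)))
score_b = dot(rope_voxel(q, (2, 1, 0)), rope_voxel(k, (0, 0, 0)))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is rotated rather than shifted, the encoding preserves vector norms and makes query-key dot products a function of the relative offset between voxels, which is the property that makes RoPE a natural fit for fixed spatial anchors.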

Towards Arbitrary Motion Completing via Hierarchical Continuous Representation

Authors: Chenghao Xu, Guangtao Lyu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng

First: 2025-12-24T14:07:04+00:00 · Latest: 2025-12-24T14:07:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.

中文标题/摘要

标题：通过分层连续表示实现任意运动补全

物理运动本质上是连续的，更高的相机帧率通常有助于提高平滑度和时间连贯性。首次探索了人类运动序列的连续表示，能够对任意输入运动序列在任意帧率下进行插值、过渡甚至外推。为此，我们提出了一种基于隐式神经表示（INRs）的名为NAME的新型参数激活诱导分层隐式表示框架。我们的方法引入了分层时间编码机制，从运动序列的多个时间尺度中提取特征，有效捕捉复杂的时序模式。此外，我们还将基于傅里叶变换的自定义参数激活函数集成到基于MLP的解码器中，以增强连续表示的表达能力。这种参数化表示显著增强了模型对复杂运动行为的高精度表示能力。在多个基准数据集上的广泛评估表明，我们提出的方法具有有效性和鲁棒性。

Summary / 总结

This paper addresses motion completion at arbitrary frame rates by proposing NAME, a novel hierarchical continuous representation framework built on Implicit Neural Representations (INRs). It introduces a hierarchical temporal encoding mechanism to capture intricate temporal patterns at multiple scales and incorporates a parametric activation function based on Fourier transformations to enhance the expressiveness of the continuous representation. Experimental results show that the approach can interpolate, inbetween, and extrapolate motion sequences at arbitrary frame rates, demonstrating its effectiveness and robustness across several benchmark datasets.

研究旨在通过探索人类运动序列的连续表示来实现任意运动完成。提出的名为NAME的方法使用基于隐式神经表示（INRs）的分层隐式表示框架，以在多个时间尺度上捕捉复杂的时空模式。该方法结合了一个基于傅里叶变换的自定义参数激活函数，以增强连续表示的表达能力。实验结果表明，该方法能够有效地在任意帧率下插值、过渡和外推运动序列，展示了其在多个基准数据集上的有效性和鲁棒性。
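
To make the idea of a continuous motion representation concrete, the sketch below shows the general INR pattern: encode a continuous timestamp with sinusoidal features, decode it with a small MLP, and query the result at any frame rate. It is not the NAME architecture. The class `TinyMotionINR`, the sine activation, the layer sizes, and the random untrained weights are all placeholders assumed for illustration.

```python
import math
import random


def fourier_features(t, num_bands=4):
    """Encode a continuous time t (in seconds) with sin/cos at doubling frequencies."""
    feats = []
    for band in range(num_bands):
        freq = (2.0 ** band) * math.pi
        feats.extend([math.sin(freq * t), math.cos(freq * t)])
    return feats


class TinyMotionINR:
    """Toy implicit representation: time -> pose vector (untrained, random weights)."""

    def __init__(self, pose_dim=3, hidden=8, num_bands=4, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * num_bands
        self.num_bands = num_bands
        self.w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(pose_dim)]

    def __call__(self, t):
        x = fourier_features(t, self.num_bands)
        # A sine activation stands in for the paper's parametric activation.
        h = [math.sin(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return [sum(w * v for w, v in zip(row, h)) for row in self.w2]


def sample_motion(inr, duration_s, fps):
    """Query the continuous representation on a regular grid at any frame rate."""
    num_frames = int(round(duration_s * fps))
    return [inr(i / fps) for i in range(num_frames)]


inr = TinyMotionINR()
low_rate = sample_motion(inr, duration_s=1.0, fps=24)
high_rate = sample_motion(inr, duration_s=1.0, fps=120)
print(len(low_rate), len(high_rate))
```

Since the representation is a function of time rather than a frame array, the 24 fps and 120 fps samples agree wherever their timestamps coincide, and timestamps beyond the observed range can be queried in the same way for extrapolation.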