arXiv Paper Digest

HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua

First: 2025-12-24T18:59:58+00:00 · Latest: 2025-12-24T18:59:58+00:00

Comments: Project Page: http://haonanqiu.com/projects/HiStream.html

Abstract

High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

中文标题/摘要

标题：HiStream：通过消除冗余的流式传输高效生成高分辨率视频

高分辨率视频生成对于数字媒体和电影至关重要，但由于扩散模型的二次复杂性导致计算瓶颈，使得实际推理变得不可行。为了解决这一问题，我们引入了HiStream，这是一种高效的自回归框架，系统地在三个维度上减少了冗余：i) 空间压缩：在低分辨率处去噪，然后使用缓存特征在高分辨率处细化；ii) 时间压缩：采用分块策略，固定大小的锚点缓存确保稳定的推理速度；iii) 时间步压缩：对后续的、基于缓存条件的块应用较少的去噪步骤。在1080p基准测试中，我们的主要HiStream模型（i+ii）实现了最先进的视觉质量，同时与Wan2.1基线相比，去噪速度提高了76.2倍，且几乎无质量损失。我们的更快变体HiStream+应用了所有三种优化（i+ii+iii），相对于基线实现了107.5倍的加速，提供了速度和质量之间的权衡，从而使得高分辨率视频生成既实用又可扩展。

Summary / 总结

HiStream is an efficient autoregressive framework for high-resolution video generation that reduces redundancy along the spatial, temporal, and timestep axes. It denoises at low resolution and then refines at high resolution with cached features, generates video chunk by chunk with a fixed-size anchor cache, and applies fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, HiStream (spatial plus temporal compression) achieves state-of-the-art visual quality with up to 76.2x faster denoising than the Wan2.1 baseline, while HiStream+ (all three optimizations) reaches a 107.5x acceleration over the same baseline with a trade-off in quality.

HiStream 是一种高效的自回归框架，通过在空间、时间和时间步三个维度上减少冗余来生成高分辨率视频。它采用低分辨率去噪、固定大小的锚点缓存进行时间压缩，并对后续块应用更少的去噪步骤。在1080p基准测试中，HiStream 达到最先进的视觉质量，比 Wan2.1 快 76.2 倍，而 HiStream+ 进一步加速到 107.5 倍，尽管有轻微的质量折衷。
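
The temporal and timestep compression described above can be illustrated with a toy scheduling sketch. It is not the authors' implementation: the cache layout (first chunk kept as a permanent anchor plus a bounded window of recent chunks), the function name `stream_chunks`, and the step counts are all assumptions chosen only to show why a fixed-size cache keeps the per-chunk context, and hence the per-chunk cost, constant.

```python
from collections import deque


def stream_chunks(num_chunks, first_steps=4, later_steps=2, window=2):
    """Plan chunk-by-chunk generation with a bounded context.

    Returns a list of (chunk_index, denoising_steps, context_indices).
    Hypothetical layout: chunk 0 is a permanent anchor, and only the
    `window` most recent later chunks are kept alongside it.
    """
    anchor = None                  # index of the anchor chunk, never evicted
    recent = deque(maxlen=window)  # oldest entries are dropped automatically
    plan = []
    for i in range(num_chunks):
        context = ([anchor] if anchor is not None else []) + list(recent)
        # Later chunks are conditioned on the cache, so they get fewer steps.
        steps = first_steps if i == 0 else later_steps
        plan.append((i, steps, context))
        if anchor is None:
            anchor = i
        else:
            recent.append(i)
    return plan


plan = stream_chunks(6)
for chunk, steps, context in plan:
    print(chunk, steps, context)
```

However many chunks are generated, the context never exceeds `1 + window` entries in this sketch, which is the property that makes inference speed stable over long videos.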

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00

Comments: Project page: https://sytwu.github.io/BeyondMemo/

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/

中文标题/摘要

标题：超越记忆：多模态序数回归基准以揭示视觉语言模型中的流行度偏差

我们揭示了最先进的视觉语言模型（VLMs）中存在显著的流行度偏差，这些模型在著名建筑上的准确率比普通建筑高出34%，表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题，我们引入了该任务上最大的开放基准数据集：YearGuessr数据集，包含来自157个国家的55,546张建筑物图像，具有多模态属性，并附有其建设年份的连续序数标签（1001-2024）、GPS数据和页面浏览量作为流行度的代理。使用该数据集，我们将建设年份预测任务框架化为序数回归，并引入了流行度感知的区间准确度指标来量化这种偏差。我们基准测试的30多种模型，包括我们的YearCLIP模型，证实了VLMs在流行、记忆化的项目上表现出色，但在未识别的主题上却面临重大挑战，揭示了它们推理能力中的关键缺陷。项目页面：https://sytwu.github.io/BeyondMemo/

Summary / 总结

The paper exposes a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings than on ordinary ones. To investigate this systematically, the authors introduce the YearGuessr dataset, comprising 55,546 building images from 157 countries with multi-modal attributes, annotated with construction year (1001-2024), GPS data, and page-view counts as a popularity proxy. They frame construction-year prediction as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify the bias. A benchmark of 30+ models, including the authors' YearCLIP, confirms that VLMs excel on popular items but struggle with unrecognized subjects, highlighting a critical flaw in their reasoning capabilities.

论文揭示了最先进的视觉-语言模型（VLMs）存在显著的流行度偏差，它们在著名建筑上的表现比普通建筑高出34%。为了系统地研究这一问题，作者引入了包含55,546张建筑图像的YearGuessr数据集，这些图像具有多模态属性，并提出使用序数回归进行建筑年份预测。他们引入了新的指标来量化这种偏差，并证实VLMs在流行项目上表现出色，但在未识别的主题上却面临重大挑战，这揭示了它们推理能力的一个关键缺陷。
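
A minimal sketch of what a popularity-aware interval accuracy could look like follows. The paper's exact metric definitions are not given in the abstract, so this assumes one plausible form: a prediction counts as correct if it falls within a tolerance of the true construction year, and the bias is the accuracy gap between high-page-view and low-page-view buildings. The record fields, threshold, and sample values are hypothetical.

```python
def interval_accuracy(records, tolerance):
    """Fraction of predictions within `tolerance` years of the true year."""
    hits = sum(1 for r in records if abs(r["pred"] - r["true"]) <= tolerance)
    return hits / len(records)


def popularity_gap(records, tolerance, view_threshold):
    """Accuracy on popular buildings minus accuracy on ordinary ones."""
    popular = [r for r in records if r["views"] >= view_threshold]
    ordinary = [r for r in records if r["views"] < view_threshold]
    return interval_accuracy(popular, tolerance) - interval_accuracy(ordinary, tolerance)


# Made-up records: true year, predicted year, page-view count.
records = [
    {"true": 1889, "pred": 1890, "views": 90000},
    {"true": 1173, "pred": 1180, "views": 50000},
    {"true": 1925, "pred": 1960, "views": 120},
    {"true": 1850, "pred": 1853, "views": 40},
    {"true": 1700, "pred": 1800, "views": 10},
]

print(interval_accuracy(records, tolerance=10))
print(popularity_gap(records, tolerance=10, view_threshold=1000))
```

A positive gap means the model is more accurate on frequently viewed buildings, which is the pattern the abstract attributes to memorization.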

Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin

First: 2025-12-24T18:59:51+00:00 · Latest: 2025-12-24T18:59:51+00:00

Abstract

Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.

中文标题/摘要

标题：通过量化不确定性优化掩码扩散模型的解码路径

掩码扩散模型（MDMs）提供了灵活的非自回归生成，但这种自由度引入了一个挑战：最终输出质量高度依赖于解码顺序。我们首次正式化了这一问题，将输出质量的差异归因于生成路径上累积的预测不确定性。为了量化这种不确定性，我们引入了去噪熵，这是一种可计算的度量标准，作为评估生成过程的内部信号。利用这一度量标准，我们提出了两种优化解码路径的算法：一种事后选择方法和一种实时指导策略。实验表明，我们的熵导向方法显著提高了生成质量，在具有挑战性的推理、规划和代码基准测试中持续提升了准确性。我们的工作确立了去噪熵作为理解并控制生成过程的原理性工具，有效地将MDMs中的不确定性从一种负担转变为发现高质量解决方案的关键优势。

Summary / 总结

This paper addresses the challenge of output quality variability in Masked Diffusion Models (MDMs) due to different decoding orders. It introduces Denoising Entropy as a metric to quantify predictive uncertainty along generative paths and proposes two algorithms to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments show that these entropy-guided methods significantly enhance generation quality on various benchmarks, turning uncertainty into a tool for discovering high-quality solutions.

该论文解决了Masked Diffusion Models (MDMs)由于解码顺序灵活而导致输出质量变化的问题。通过引入Denoising Entropy作为衡量预测不确定性的一种指标，作者提出了两种算法：一种是事后选择方法，另一种是实时指导策略。实验表明，这些基于熵的方法能够提高生成质量，特别是在涉及推理、规划和代码生成的复杂基准测试中表现更佳。
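
The post-hoc selection idea can be sketched as follows. The abstract does not define Denoising Entropy precisely, so this example substitutes the Shannon entropy of the model's predictive distribution at each decoding step, summed along a path, and then picks the candidate path with the lowest total. The function names and the toy distributions are illustrative assumptions, not the paper's algorithm.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def path_entropy(step_distributions):
    """Cumulative entropy over the decoding steps of one generative path."""
    return sum(entropy(p) for p in step_distributions)


def select_path(candidates):
    """Index of the candidate path with the lowest cumulative entropy."""
    scores = [path_entropy(c) for c in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)


# Two toy decoding paths, each a list of per-step distributions.
uncertain = [[0.5, 0.5], [0.6, 0.4]]
confident = [[0.9, 0.1], [0.8, 0.2]]

print(select_path([uncertain, confident]))
```

A real-time guidance variant would use the same per-step signal during decoding, for example to choose which masked position to reveal next, rather than ranking completed paths.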

Streaming Video Instruction Tuning

Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou

First: 2025-12-24T18:59:36+00:00 · Latest: 2025-12-24T18:59:36+00:00

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

中文标题/摘要

标题：流式视频指令调优

我们提出了Streamo，一种实时流式视频LLM，作为通用交互式助手。与现有的专注于问答或字幕的在线视频模型不同，Streamo执行广泛的流式视频任务，包括实时解说、动作理解、事件字幕、时间事件定位和时间敏感问答。为了开发这种多功能性，我们构建了Streamo-Instruct-465K，一个针对流式视频理解的大规模指令遵循数据集。该数据集涵盖了多种时间上下文和多任务监督，使Streamo能够在异构流式任务中统一训练。通过简化的工作流程在指令遵循数据集上端到端训练后，Streamo展示了强大的时间推理、响应式交互和在各种流式基准测试中的广泛泛化能力。广泛的实验表明，Streamo填补了离线视频感知模型与实时多模态助手之间的差距，朝着统一、智能的视频理解在连续视频流中迈出了一步。

Summary / 总结

The paper presents Streamo, a real-time streaming video LLM that acts as a general-purpose interactive assistant and handles tasks such as real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To achieve this, the authors construct Streamo-Instruct-465K, a large-scale instruction-following dataset that covers diverse temporal contexts and supports multi-task supervision. After end-to-end training, Streamo demonstrates strong temporal reasoning, responsive interaction, and generalization across streaming benchmarks, bridging the gap between offline video perception models and real-time multimodal assistants.

Streamo 是一个实时流媒体视频 LLM，旨在作为通用的交互式助手。它在实时叙述、动作理解等多种流媒体任务上表现出色。为了实现这一多功能性，研究人员创建了 Streamo-Instruct-465K 数据集，用于流媒体视频理解。经过训练后，Streamo 展示出强大的时间推理能力和在各种流媒体基准测试中的广泛泛化能力，填补了离线视频感知模型与实时多模态助手之间的差距。

Fast SAM2 with Text-Driven Token Pruning

Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen

First: 2025-12-24T18:59:05+00:00 · Latest: 2025-12-24T18:59:05+00:00

Comments: 28 pages, 9 figures

Abstract

Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, which reduces scalability because of quadratic memory-attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.

中文标题/摘要

标题：快速SAM2：基于文本驱动的标记剪枝

Segment Anything Model 2 (SAM2) 是一种视觉基础模型，在基于提示的视频对象分割方面取得了显著进展，但其实际部署受限于处理时间密集视觉标记的高计算和内存成本。SAM2 管道通常会将图像编码器生成的所有视觉标记通过下游的时间推理模块进行传递，而不考虑这些标记与目标对象的相关性，导致由于基于内存的注意力开销呈二次增长而降低了可扩展性。本文提出了一种基于文本的标记剪枝框架，通过在时间传播之前选择性地减少标记密度来提高推理效率，而不修改底层分割架构。该方法在视觉编码之后、基于内存的传播之前运行，使用一种轻量级的路由机制对标记进行排名，该机制结合了局部视觉上下文、从以对象为中心的文本描述（用户提供的或自动生成的）中推导出的语义相关性以及有助于保留模糊或边界关键区域的不确定性提示。通过仅保留对下游处理最有用的标记，所提出的方法减少了冗余计算，同时保持了分割精度。在多个具有挑战性的视频分割基准测试中的广泛实验表明，编码器后的标记剪枝提供了一条实用且有效的途径，以实现基于提示的视频分割的高效性，与未剪枝的基线SAM2相比，其推理速度提高了42.50%，GPU内存使用量降低了37.41%，同时保持了竞争力的J和F性能。这些结果突显了早期标记选择对提高基于变压器的视频分割系统实时性和资源受限应用可扩展性的潜力。

Summary / 总结

This work introduces a text-guided token pruning framework for Segment Anything Model 2 (SAM2) to improve inference efficiency in video object segmentation. Operating after visual encoding and before temporal propagation, the method ranks tokens with a lightweight routing mechanism that combines local visual context, semantic relevance derived from textual descriptions, and uncertainty cues, and keeps only the most informative tokens. Compared to the unpruned SAM2 baseline, this yields up to 42.50 percent faster inference and up to 37.41 percent lower GPU memory usage while maintaining competitive segmentation performance.

该研究提出了一种文本引导的token剪枝框架，用于Segment Anything Model 2 (SAM2)，通过在时间传播前选择性地减少token密度来提升推理效率。方法使用一个轻量级的路由机制来考虑局部视觉上下文、语义相关性和不确定性提示来对token进行排序。实验结果显示，这种方法可以将推理时间减少最多42.50%，GPU内存使用减少37.41%，同时保持竞争力的分割性能。这表明早期token选择对提高基于Transformer的视频分割系统的可扩展性具有潜在价值。
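
The ranking-and-retention step can be sketched with a simple top-k selection. This is an assumption-laden illustration, not the paper's routing mechanism: it assumes each token already has three scalar scores (visual context, semantic relevance to the text, and uncertainty) and combines them with a weighted sum; the function name `prune_tokens`, the weights, and the keep ratio are hypothetical.

```python
def prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5,
                 weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of the tokens to keep.

    Each argument is a list with one score per token. Tokens are ranked
    by a weighted sum of the three cues and the top fraction is kept.
    """
    n = len(visual)
    w_v, w_s, w_u = weights
    scores = [w_v * visual[i] + w_s * semantic[i] + w_u * uncertainty[i]
              for i in range(n)]
    k = max(1, int(n * keep_ratio))  # always keep at least one token
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)               # restore original token order


visual = [0.1, 0.9, 0.2, 0.4]
semantic = [0.0, 0.8, 0.1, 0.3]
uncertainty = [0.1, 0.1, 0.9, 0.1]

print(prune_tokens(visual, semantic, uncertainty))
```

In this toy input, token 2 has low visual and semantic scores but survives because of its high uncertainty score, mirroring the stated goal of preserving ambiguous or boundary-critical regions.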

TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras

First: 2025-12-24T18:58:16+00:00 · Latest: 2025-12-24T18:58:16+00:00

Abstract

The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

中文标题/摘要

标题：TICON：一种用于组织病理学表示学习的幻灯片级切片上下文化器

在大型全切片图像（WSI）中，对小切片的解释通常需要更大的图像上下文。我们引入了TICON，一种基于变换器的切片表示上下文化器，能够为“任何”计算病理学应用生成丰富的上下文化嵌入。标准基于切片编码器的管道从切片中剥离其上下文提取嵌入，无法建模对于局部和全局任务都至关重要的丰富幻灯片级信息。此外，不同的切片编码器在不同的下游任务中表现出色。因此，需要一个统一的模型来上下文化来自“任何”切片级基础模型的嵌入。TICON 通过一个共享的编码器来满足这一需求，该编码器使用掩蔽建模目标进行预训练，以同时统一和上下文化来自多种切片级病理基础模型的表示。我们的实验表明，TICON 上下文化的嵌入在许多不同任务中显著提高了性能，建立了切片级基准（如HEST-Bench、THUNDER、CATCH）和幻灯片级基准（如Patho-Bench）的新最先进的结果。最后，我们使用仅11K张WSI对TICON 进行预训练形成一个幻灯片级基础模型，超越了使用多达350K张WSI预训练的最先进的幻灯片级基础模型。

Summary / 总结

TICON is a transformer-based model that provides rich, contextualized embeddings for tiles in whole slide images, addressing the limitations of tile encoder-based pipelines in capturing slide-level information. It uses a single, shared encoder to contextualize embeddings from various tile-level pathology foundation models, improving performance across multiple tasks and setting new state-of-the-art results on both tile-level and slide-level benchmarks. Additionally, TICON enables the creation of a slide-level foundation model with fewer training images, outperforming existing models pretrained on larger datasets.

TICON 是一种基于变换器的模型，旨在为全切片图像（WSI）中的小块提供丰富的上下文嵌入，解决了标准小块编码器管道的局限性。通过使用共享编码器进行预训练，TICON 统一并上下文化了来自多种小块病理基础模型的表示。实验表明，TICON 在多个任务上显著提高了性能，建立了在小块和切片级别基准上的新最先进的结果。此外，TICON 仅使用少量切片图像即可构建切片级别基础模型，优于使用更大数据集训练的最先进的模型。

Parallel Token Prediction for Language Models

Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

First: 2025-12-24T18:46:55+00:00 · Latest: 2025-12-24T18:46:55+00:00

Comments: Preprint. Under review

Abstract

We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.

中文标题/摘要

标题：语言模型中的并行令牌预测

我们提出了并行令牌预测（PTP），这是一种用于语言模型并行序列生成的通用框架。PTP 在单个变压器调用中通过将采样过程纳入模型中，同时预测多个依赖令牌，从而减少了自回归解码的延迟瓶颈，并避免了现有多种令牌预测方法中常见的独立性假设限制。我们证明PTP 可以表示任意自回归序列分布。PTP 可以通过蒸馏现有模型或通过逆自回归训练进行训练，无需教师。实验上，我们在 Spec-Bench 上通过每步接受超过四个令牌，实现了 Vicuna-7B 的最佳推测解码性能。我们框架的通用性表明，在不损失建模能力的情况下，长序列的并行生成是可行的。

Summary / 总结

The research proposes Parallel Token Prediction (PTP), a framework that jointly predicts multiple dependent tokens in a single transformer call, reducing the latency of autoregressive decoding and avoiding restrictive independence assumptions. PTP is trained either by distilling an existing model or through inverse autoregressive training. Experiments show that PTP achieves state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench, indicating its potential for parallel generation of long sequences without loss of modeling power.

论文提出了并行令牌预测（PTP）框架，该框架在一个变压器调用中联合预测多个依赖令牌，减少了自回归解码的延迟，并避免了现有方法中的限制性独立假设。PTP 可以通过蒸馏现有模型或逆自回归训练来训练。实验结果显示，PTP 在 Vicuna-7B 上实现了最先进的推测解码性能，每步接受超过四个令牌，在 Spec-Bench 上表明其在不损失建模能力的情况下可以实现长序列的并行生成。

Variationally correct operator learning: Reduced basis neural operator with a posteriori error estimation

Authors: Yuan Qiu, Wolfgang Dahmen, Peng Chen

First: 2025-12-24T18:37:59+00:00 · Latest: 2025-12-24T18:37:59+00:00

Abstract

Minimizing PDE-residual losses is a common strategy to promote physical consistency in neural operators. However, standard formulations often lack variational correctness, meaning that small residuals do not guarantee small solution errors due to the use of non-compliant norms or ad hoc penalty terms for boundary conditions. This work develops a variationally correct operator learning framework by constructing first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms. We demonstrate this framework on stationary diffusion and linear elasticity, incorporating mixed Dirichlet-Neumann boundary conditions via variational lifts to preserve norm equivalence without inconsistent penalties. To ensure the function space conformity required by the FOSLS loss, we propose a Reduced Basis Neural Operator (RBNO). The RBNO predicts coefficients for a pre-computed, conforming reduced basis, thereby ensuring variational stability by design while enabling efficient training. We provide a rigorous convergence analysis that bounds the total error by the sum of finite element discretization bias, reduced basis truncation error, neural network approximation error, and statistical estimation errors arising from finite sampling and optimization. Numerical benchmarks validate these theoretical bounds and demonstrate that the proposed approach achieves superior accuracy in PDE-compliant norms compared to standard baselines, while the residual loss serves as a reliable, computable a posteriori error estimator.

中文标题/摘要

标题：变分正确的算子学习：基于后验误差估计的降维神经算子

最小化PDE残差损失是促进神经算子物理一致性的常用策略。然而，标准形式通常缺乏变分正确性，这意味着小的残差不一定能保证小的解误差，因为使用了不合规的范数或针对边界条件的任意罚项。本文通过构建首阶系统最小二乘（FOSLS）目标来发展一个变分正确的算子学习框架，这些目标的值在PDE诱导的范数中可证明等同于解误差。我们通过变分提升将混合Dirichlet-Neumann边界条件纳入该框架中，以保持范数等价性而不引入不一致的罚项。为了确保FOSLS损失所需的函数空间一致性，我们提出了一种降维神经算子（RBNO）。RBNO预测预计算的、一致的降维基的系数，从而通过设计确保变分稳定性，同时实现高效的训练。我们提供了一种严格的收敛性分析，将总误差限制为有限元离散偏差、降维基截断误差、神经网络逼近误差以及由于有限采样和优化产生的统计估计误差之和。数值基准验证了这些理论界线，并表明所提出的方法在PDE一致范数中实现了优于标准基线的更高精度，而残差损失则作为可靠的、可计算的后验误差估计器。

Summary / 总结

This work addresses the issue of variational correctness in neural operators by developing a variationally correct framework using first-order system least-squares (FOSLS) objectives. The method incorporates mixed boundary conditions and ensures norm equivalence without inconsistent penalties. A Reduced Basis Neural Operator (RBNO) is proposed to predict coefficients for a pre-computed reduced basis, ensuring variational stability and efficient training. Theoretical analysis shows that the total error is bounded by several components, and numerical benchmarks confirm the method's superior accuracy and the reliability of the residual loss as an a posteriori error estimator.

该研究通过使用一阶系统最小二乘（FOSLS）目标来解决神经算子的变分正确性问题。方法通过变分提升引入混合Dirichlet-Neumann边界条件，并提出了一种基于预计算的收敛基的Reduced Basis Neural Operator（RBNO），以确保函数空间的符合性。RBNO通过预测预计算收敛基的系数，确保变分稳定性并实现高效的训练。理论分析和数值基准表明，所提出的方法在PDE一致范数下具有更高的准确性，并提供了一个可靠的后验误差估计器。
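
For the stationary diffusion case, the FOSLS construction can be written out as a worked sketch. The notation below (flux variable $\sigma$, coefficient $a$, exact solution $(u^\ast,\sigma^\ast)$, constants $c$ and $C$) is assumed for illustration and follows the standard textbook form; the paper's exact weighting and its variational lift for mixed Dirichlet-Neumann boundary conditions are not given in the abstract and are omitted here.

```latex
% Second-order problem rewritten as a first-order system via the flux \sigma
\[
  -\nabla\cdot(a\nabla u) = f \ \text{in } \Omega
  \quad\Longleftrightarrow\quad
  \sigma + a\nabla u = 0, \qquad \nabla\cdot\sigma = f .
\]

% FOSLS objective: sum of squared L^2 residuals of the first-order system
\[
  \mathcal{L}(u,\sigma)
  = \|\sigma + a\nabla u\|_{L^2(\Omega)}^2
  + \|\nabla\cdot\sigma - f\|_{L^2(\Omega)}^2 .
\]

% Variational correctness: the loss is equivalent to the squared error
\[
  c\,E(u,\sigma) \;\le\; \mathcal{L}(u,\sigma) \;\le\; C\,E(u,\sigma),
  \qquad
  E(u,\sigma) = \|u-u^\ast\|_{H^1(\Omega)}^2
  + \|\sigma-\sigma^\ast\|_{H(\mathrm{div};\Omega)}^2 .
\]
```

The two-sided bound is what lets the residual loss double as an a posteriori error estimator: a small computed loss certifies a small error, provided the predicted pair $(u,\sigma)$ lies in the conforming spaces, which is the role the abstract assigns to the reduced basis.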

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

Authors: Roy Turgeman, Tom Tirer

First: 2025-12-24T18:21:01+00:00 · Latest: 2025-12-24T18:21:01+00:00

Abstract

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

中文标题/摘要

标题：数据处理不等式反映实践吗？低级任务的有效性探究

数据处理不等式是信息论原理，表明通过处理观测值无法增加信号的信息量。特别是，它表明在解决分类问题之前增强信号或对其进行编码是没有益处的。这一断言可以证明在最优贝叶斯分类器的情况下是正确的。然而，在实践中，尽管现代深度神经网络具有强大的能力，但在“高级”的下游任务之前通常会执行“低级”任务。在本文中，我们旨在理解何时以及为什么低级处理对分类有益。我们对二分类设置进行了全面的理论研究，考虑了一个与最优贝叶斯分类器紧密相连的分类器，并随着训练样本数量的增加而收敛于最优贝叶斯分类器。我们证明了对于任何有限数量的训练样本，都存在一种预分类处理可以提高分类准确性。我们还探讨了类别分离、训练集大小和类别平衡对这种处理相对增益的影响。我们通过理论设置的实证研究支持了我们的理论。最后，我们进行了一项实证研究，探讨了去噪和编码对基准数据集上实用深度分类器性能的影响。具体来说，我们改变了训练集的大小和类别分布以及噪声水平，并展示了与理论结果一致的趋势。

Summary / 总结

This paper investigates the utility of low-level tasks in classification, examining whether the practical implication of the data processing inequality (that enhancing or encoding a signal before classification brings no benefit) holds outside the optimal Bayes setting. For a binary classification setup with a classifier that converges to the optimal Bayes classifier as the training set grows, the authors prove that for any finite number of training samples there exists a pre-classification processing that improves accuracy. They also explore how class separation, training set size, and class balance affect the relative gain. Empirical studies of denoising and encoding with practical deep classifiers on benchmark datasets show trends consistent with the theory.

本文探讨了低级任务在分类中的实用性，挑战了数据处理不等式，该理论认为在分类前处理数据没有益处。通过理论研究和实证调查，作者证明了在任何有限数量的训练样本情况下，预分类处理可以提高分类准确性。他们还研究了类别分离、训练集大小和类别平衡如何影响预分类处理的相对收益。基准数据集上的实证研究进一步支持了这些发现，显示了与理论结果一致的趋势。
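
For reference, the inequality under discussion can be stated compactly. The symbols are assumptions introduced here: $Y$ is the class label, $X$ the observation, and $Z = g(X)$ any processed version of it (denoised, encoded, and so on).

```latex
% Data processing inequality for the Markov chain Y -> X -> Z
\[
  Y \to X \to Z, \quad Z = g(X)
  \;\Longrightarrow\;
  I(Y;Z) \;\le\; I(Y;X).
\]

% Consequence for the optimal Bayes classifier: processing cannot lower its error
\[
  P_e^\ast(Z) = 1 - \mathbb{E}\Big[\max_y P(Y=y \mid Z)\Big]
  \;\ge\;
  1 - \mathbb{E}\Big[\max_y P(Y=y \mid X)\Big] = P_e^\ast(X).
\]
```

Both statements concern the optimal classifier with full knowledge of the distribution. The paper's result is about a classifier learned from finitely many samples, for which neither bound rules out a gain from processing.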

Learning to Solve PDEs on Neural Shape Representations

Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

First: 2025-12-24T18:14:02+00:00 · Latest: 2025-12-24T18:14:02+00:00

Comments: Article webpage link: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/

Abstract

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.

中文标题/摘要

标题：在神经形状表示中学习求解偏微分方程

在形状上求解偏微分方程（PDEs）是许多形状分析和工程任务的基础；然而，现有的PDE求解器通常基于多边形/三角形网格，而现代3D资产越来越多地以神经表示形式存在。这种不匹配使得没有合适的方法可以直接在神经域内求解曲面PDEs，迫使进行显式的网格提取或逐实例残差训练，阻碍了端到端的工作流程。我们提出了一种全新的无网格公式，该公式学习一个基于神经（局部）形状属性的局部更新算子，使得可以在数据所在的曲面上直接求解PDEs。该算子自然地与常见的神经曲面表示相结合，只需在一个代表性形状上进行一次训练，即可在形状和拓扑变化下泛化，从而在无需显式网格化或逐实例优化的情况下实现准确、快速的推理，同时保持可微性。在分析基准（球体上的热方程和泊松求解）和不同表示的真实神经资产中，我们的方法在某些方面略优于CPM，同时保持与FEM相当的性能，并且，据我们所知，首次提供了在神经和经典曲面表示上求解曲面PDEs的端到端管道。代码将在接受后发布。

Summary / 总结

This paper addresses the challenge of solving partial differential equations (PDEs) on shapes represented by neural networks, which is crucial for shape analysis and engineering tasks. The authors propose a mesh-free method that learns a local update operator conditioned on neural shape attributes, allowing PDEs to be solved directly on neural data. The method integrates with existing neural surface representations, requires training only once, and generalizes well across different shapes and topologies. Experiments show that the method performs slightly better than the closest competitor (CPM) and is comparable to finite element methods (FEM), while enabling end-to-end workflows without explicit meshing or per-instance optimization.

本文解决了在神经形状表示上求解偏微分方程（PDEs）的问题，这些表示在3D资产中越来越常用。作者提出了一种无网格方法，该方法根据神经形状属性学习局部更新算子，允许直接在神经数据上求解PDEs。该方法与神经表面表示集成，需要单个形状训练，并且能够跨形状和拓扑结构泛化，从而实现快速准确的推理，无需显式网格化或逐实例优化。实验表明，该方法在性能上略优于CPM，并且接近FEM，标志着首个用于神经和经典表面表示的求解表面PDEs的端到端管道。

Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning

Authors: Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong

Venue: NeurIPS 2025

First: 2021-10-07T03:14:46+00:00 · Latest: 2025-12-24T17:53:45+00:00

Comments: NeurIPS 2025; Previous Version in ICML Workshop: Exploration in AI Today (EXAIT) 2025

Abstract

The remarkable empirical performance of distributional reinforcement learning (RL) has drawn increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of the return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify that this uncertainty-aware regularization from distributional RL contributes significantly to its empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.

中文标题/摘要

标题：分类分布损失的内在优势：分布感知正则化探索在强化学习中的应用

分布式强化学习（RL）的卓越实证性能引起了对其与经典RL理论优势的越来越多关注。通过分解在分布式RL中常用的分类分布损失，我们发现分布式RL潜在优势可归因于一种衍生的分布匹配熵正则化。这种较少研究的熵正则化旨在捕捉回报分布的额外知识，而不仅仅是其期望值，从而为策略优化提供增强的奖励信号。与MaxEnt RL中的基本熵正则化相比，后者通过促进多样化的动作显式地鼓励探索，而从分类分布损失中推导出的新熵正则化则隐式地更新策略，使其与（估计的）环境不确定性相一致。最后，广泛的实验验证了这种分布感知正则化在实证上对经典RL的优越性。我们的研究为解释分布式学习在RL中的内在优势提供了创新的探索视角。

Summary / 总结

This paper investigates the theoretical advantages of distributional reinforcement learning (RL) over classical RL by decomposing the categorical distributional loss. It finds that the distribution-matching entropy regularization derived from this loss can capture additional knowledge of return distribution beyond its expectation, enhancing the reward signal. Unlike the explicit exploration encouragement in MaxEnt RL, this regularization implicitly aligns the learned policy with environmental uncertainty. Experiments confirm the significance of this uncertainty-aware regularization in improving RL performance.

该论文通过分解分类分布损失，研究了分布式强化学习（RL）的理论优势。发现从这种损失中推导出的分布匹配熵正则化能够捕捉关于回报分布的额外知识，增强策略优化中的奖励信号。与MaxEnt RL中的显式探索鼓励不同，这种正则化隐式地使策略与环境不确定性对齐。实验验证了这种不确定性意识正则化在与经典RL方法对比中的实际优势。

AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng

First: 2025-12-24T17:40:42+00:00 · Latest: 2025-12-24T17:40:42+00:00

Comments: 23 pages, 13 figures, 8 tables

Abstract

Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.

中文标题/摘要

标题：AndroidLens：针对Android GUI代理的嵌套子目标长延迟评估

图形用户界面（GUI）代理可以通过自动化移动设备上频繁执行的长延迟任务来显著提高生产力。然而，现有的评估基准仍然局限于有限的应用程序、简单的任务和粗粒度的指标。为了解决这一问题，我们引入了AndroidLens，这是一个针对移动GUI代理的具有挑战性的评估框架，包含571个长延迟任务，涵盖中文和英文环境，每个任务平均需要超过26步才能完成。该框架的特点包括：(1) 来自38个领域的真实世界用户场景的任务，涵盖多种复杂类型，如多约束、多目标和领域特定任务；(2) 静态评估保留了真实世界的异常情况，并允许多条有效路径以减少偏差；(3) 动态评估采用基于里程碑的方案，通过平均任务进度（ATP）进行细粒度的进度测量。我们的评估表明，即使是最优秀的模型也只能达到12.7%的任务成功率和50.47%的ATP。我们还强调了真实世界环境中的一些关键挑战，包括环境异常、自适应探索和长期记忆保留。

Summary / 总结

The motivation for this work is to improve the evaluation of graphical user interface (GUI) agents on mobile devices, particularly for long-latency tasks. The main method involves creating AndroidLens, a framework with 571 complex tasks in both Chinese and English environments, each requiring more than 26 steps on average. Key experimental findings show that even the best models achieve only a 12.7% task success rate and 50.47% Average Task Progress (ATP), highlighting significant challenges in real-world scenarios such as environmental anomalies, adaptive exploration, and long-term memory retention.

研究引入了包含571个长延迟任务的AndroidLens框架，这些任务涵盖了38个真实世界领域，每个任务平均需要超过26步。框架包括静态和动态评估来衡量任务成功率和进度。关键发现表明，即使是最优模型也只能达到12.7%的任务成功率和50.47%的平均任务进度，突出了环境异常、自适应探索和长期记忆保持等挑战。
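The milestone-based progress measure described for AndroidLens can be illustrated with a small sketch. The abstract does not give the exact ATP formula, so the definition used here (per-task fraction of milestones reached, averaged over tasks) is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def task_progress(milestones: Sequence[bool]) -> float:
    """Fraction of milestones reached in one task (assumed definition)."""
    if not milestones:
        return 0.0
    return sum(milestones) / len(milestones)


def average_task_progress(tasks: Sequence[Sequence[bool]]) -> float:
    """Mean per-task progress over all tasks (assumed ATP definition)."""
    if not tasks:
        return 0.0
    return sum(task_progress(t) for t in tasks) / len(tasks)


def success_rate(tasks: Sequence[Sequence[bool]]) -> float:
    """Share of tasks in which every milestone was reached."""
    if not tasks:
        return 0.0
    return sum(1 for t in tasks if t and all(t)) / len(tasks)


# Three toy tasks: half done, fully done, not started.
tasks = [[True, True, False, False], [True, True], [False]]
atp = average_task_progress(tasks)
sr = success_rate(tasks)
```

Under this reading, an agent can score a moderate ATP while its success rate stays low, which is the kind of gap the reported 50.47% ATP versus 12.7% success rate suggests.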

Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering

Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H. M. Abou-El-Enien

First: 2025-12-24T17:39:37+00:00 · Latest: 2025-12-24T17:39:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study used Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. The study then developed a novel, reaction-first evolutionary metaheuristic algorithm, together with multi-objective optimization programming, that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.

Summary / 总结

This study addresses the challenge of personalized drug discovery for Acute Myeloid Leukemia (AML) by integrating patient-specific transcriptomics with de novo drug generation. Using WGCNA to identify key biomarkers and AlphaFold3 for structural modeling, the framework then employs a metaheuristic algorithm to assemble novel ligands. Key findings include the generation of drug-like chemical entities and the identification of high-confidence candidates, such as Ligand L1, with a binding free energy of -6.571 kcal/mol against the A08A96 biomarker.

该研究通过将患者特异性转录组学与新药发现相结合，开发了一个端到端的计算框架来应对急性髓系白血病（AML）的临床挑战。方法包括使用WGCNA优先选择20个高价值生物标志物，使用AlphaFold3建模这些生物标志物的结构，并使用一种新型的元启发式算法根据这些生物标志物组装新型配体。关键发现包括生成了具有药物样特性的结构独特化学实体，并识别出高信心药物候选物，如配体L1，其与A08A96生物标志物的结合亲和力很强。

Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Authors: Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

Venue: NeurIPS 2025

First: 2025-06-06T19:29:13+00:00 · Latest: 2025-12-24T17:26:35+00:00

Comments: 40 pages, 8 figures, NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.

中文标题/摘要

标题：交替梯度流：两层神经网络中特征学习的理论

神经网络学习哪些特征以及如何学习仍然是一个开放的问题。本文引入了交替梯度流（AGF）算法框架，描述了从小型初始化训练的两层网络中特征学习的动力学。先前的研究表明，在这种情况下，梯度流表现出阶梯状的损失曲线，交替在神经元缓慢对齐到有用方向的平台期和神经元迅速增长的尖锐下降期。AGF 将这种行为近似为交替的两步过程：在休眠神经元上最大化一个效用函数，在活跃神经元上最小化一个成本函数。AGF 从所有神经元都休眠开始。在每次迭代中，一个休眠的神经元激活，触发特征的获取和损失的下降。AGF 定量描述了这些下降的顺序、时间和幅度，与多个常用架构的实验结果相符。我们证明了 AGF 统一并扩展了全连接线性网络和仅注意力线性变压器中已有的鞍点到鞍点分析，其中学习的特征分别是奇异模式和主成分。在对角线线性网络中，我们证明 AGF 在初始化趋于零的极限下收敛到梯度流。将 AGF 应用于训练以执行模块加法的二次网络，我们首次完整地描述了训练动力学，揭示了网络按系数大小递减顺序学习傅里叶特征。总体而言，AGF 为理解神经网络中的特征学习提供了一个有希望的步骤。

Summary / 总结

The paper introduces Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer neural networks trained from small initialization. AGF approximates the alternating behavior of gradient flow as a two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. The key findings include matching the order, timing, and magnitude of feature learning drops with experiments across various architectures, unifying and extending existing saddle-to-saddle analyses, and providing a complete characterization of training dynamics in quadratic networks, revealing the learning of Fourier features in decreasing order of coefficient magnitude.

本文提出了交替梯度流（AGF）算法框架，用于描述从小初始化训练的两层神经网络中的特征学习动态。AGF将梯度流的交替行为近似为激活休眠神经元和对活跃神经元最小化成本函数的迭代过程。关键实验发现表明，AGF与各种架构中观察到的损失下降顺序、时间和幅度相匹配，统一并扩展了线性网络和变压器中的现有分析。它还为二次网络的训练动力学提供了完整的描述，揭示了网络按系数大小递减顺序学习傅里叶特征。

Model Merging via Multi-Teacher Knowledge Distillation

Authors: Seyed Arshan Dalili, Mehrdad Mahdavi

First: 2025-12-24T17:10:44+00:00 · Latest: 2025-12-24T17:10:44+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at https://github.com/arshandalili/SAMerging.

中文标题/摘要

标题：多教师知识蒸馏下的模型合并

模型合并已成为联合多任务学习（MTL）的轻量级替代方案，但合并模型的泛化特性尚未得到充分探索。建立此类理论保证并不容易，因为合并过程通常禁止访问原始训练数据，并涉及结合在根本上异质数据分布下训练的微调模型。在缺乏这些动态的原理性理解时，当前方法往往依赖于启发式方法来近似参数的最佳组合。这种方法在系数缩放中最为关键，即调节每个微调模型对共享参数贡献大小的加权因子。然而，由于缺乏指导其选择的原理性目标，这些方法会导致脆弱的性能，并且高度依赖于缩放初始化。我们通过(i) 建立一种新的基于平滑度感知的PAC-Bayes泛化界，专门针对模型合并设置。此分析引入了一个“跨任务异质性”项，正式捕捉了多种微调模型先验与目标多任务分布之间的不匹配。受此理论洞察的指导，(ii) 我们将模型合并视为在稀缺未标记数据上的多教师知识蒸馏。我们正式证明，最小化学生-教师Kullback-Leibler散度直接收紧了合并模型超额风险的上界。受所推导的基于平滑度的界指导，(iii) 我们通过SAMerging方法实现这一目标，该方法使用尖锐度感知最小化（SAM）来找到平坦的极小值。实验中，SAMerging在视觉和自然语言处理基准测试中建立了新的最佳状态，实现了卓越的性能。代码可在https://github.com/arshandalili/SAMerging/ 获取。

Summary / 总结

The paper addresses the challenge of model merging, which is a lightweight alternative to joint multi-task learning. It establishes a novel PAC-Bayes generalization bound for model merging, introducing a term to capture cross-task heterogeneity. The authors frame model merging as multi-teacher knowledge distillation and propose SAMerging, which uses Sharpness-Aware Minimization to find flat minima. Experiments show that SAMerging achieves state-of-the-art performance on vision and NLP benchmarks.

论文通过建立新的理论框架并提出方法来提高模型合并的泛化能力。作者为模型合并推导出一个考虑异质性的平滑度感知PAC-Bayes泛化界，该界能够捕捉细调模型与目标任务之间的差异。然后，他们将模型合并视为多教师的知识蒸馏，并引入了SAMerging方法，该方法使用尖锐度感知最小化来寻找平滑的极小值，从而在各种基准测试中取得了更好的性能。实验结果显示，SAMerging在视觉和自然语言处理任务上超过了现有方法。
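The distillation view of merging described for SAMerging can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the merge rule (base parameters plus coefficient-scaled task vectors), the KL direction (teacher relative to student), and all function names are hypothetical, and the Sharpness-Aware Minimization step is not shown.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Batch mean of KL(p || q) for row-wise probability vectors."""
    per_row = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(per_row))


def merge(base: np.ndarray, task_vectors, coeffs) -> np.ndarray:
    """Assumed merge rule: base + sum_k coeff_k * task_vector_k."""
    merged = base.copy()
    for c, tv in zip(coeffs, task_vectors):
        merged = merged + c * tv
    return merged


def distillation_loss(student_logits, teacher_logits) -> float:
    """Sum over tasks of KL(teacher || student) on unlabeled inputs."""
    return sum(
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    )


# Toy check: the loss is zero when the student matches every teacher.
teacher = [np.array([[2.0, 0.0, -1.0]]), np.array([[0.5, 0.5, 0.0]])]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss([np.zeros((1, 3)), np.zeros((1, 3))], teacher)
merged = merge(np.zeros(3), [np.ones(3), 2 * np.ones(3)], [0.5, 0.25])
```

In this framing the merging coefficients would be the quantities tuned to reduce the loss, so no labels from the original training sets are needed, only unlabeled inputs and the teachers' outputs.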

Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Authors: Suren Bandara

First: 2025-12-24T17:10:37+00:00 · Latest: 2025-12-24T17:10:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.

中文标题/摘要

标题：基于掩膜后处理的表格分割结构坐标提取

从表格中提取结构化数据在扫描文档和数字档案的文档图像分析中起着关键作用。尽管已经提出了许多方法来检测表格结构并提取单元格内容，但在低分辨率或噪声图像中准确识别表格段边界（行和列）仍然具有挑战性。在许多实际场景中，表格数据不完整或退化，限制了基于变换器的方法对噪声输入的适应性。基于掩膜的边缘检测技术在这些条件下表现出更大的鲁棒性，因为它们的灵敏度可以通过阈值调整进行调整；然而，现有方法通常直接将掩膜应用于图像，导致噪声敏感性、分辨率损失或高计算成本。本文提出了一种新的多尺度信号处理方法，用于从表格掩膜检测表格边缘。行和列转换被建模为一维信号，并使用逐渐增加方差的高斯卷积进行处理，然后通过统计阈值抑制噪声同时保留稳定的结构边缘。检测到的信号峰值被映射回图像坐标以获得准确的段边界。实验结果表明，将所提出的方法应用于列边缘检测，可以将基于布局感知的度量（PubLayNet-1M基准上的Cell-Aware Segmentation Accuracy，CASA）从67%提高到76%，该度量评估文本正确性和正确的单元格放置。该方法通过零填充和缩放策略对分辨率变化具有鲁棒性，并生成优化的结构化表格输出，适合下游分析。

Summary / 总结

This paper addresses the challenge of accurately identifying table segment boundaries in low-resolution or noisy images, which is crucial for structured data extraction from tables. It proposes a multi-scale signal-processing method that models row and column transitions as one-dimensional signals and uses Gaussian convolution with progressively increasing variances followed by statistical thresholding to detect table edges. The method improves the Cell-Aware Segmentation Accuracy (CASA) from 67% to 76% on the PubLayNet-1M benchmark when used with TableNet and PyTesseract OCR, demonstrating robustness to resolution variations and high adaptability to noisy inputs.

本文针对低分辨率或噪声图像中表格边界检测的挑战，提出了一种多尺度信号处理方法，将行和列的过渡视为一维信号，并使用高斯卷积和统计阈值处理。该方法在使用TableNet和PyTesseract OCR时，将Cell-Aware Segmentation Accuracy (CASA) 从67%提高到76%，在PubLayNet-1M基准上展示了对分辨率变化的鲁棒性和高适应性噪声输入。
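The multi-scale idea in this entry (a one-dimensional transition signal, Gaussian smoothing at increasing variances, then statistical thresholding) can be sketched as follows. This is an illustrative reading rather than the paper's algorithm: the transition signal (absolute difference of column-wise mask coverage), the threshold rule (mean plus `k` standard deviations), the intersection across scales, and the parameter values are all assumptions.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def column_edges(mask: np.ndarray, sigmas=(1.0, 2.0, 4.0), k: float = 1.0):
    """Return x-coordinates of column boundaries found in a binary mask."""
    # 1D signal: how strongly mask coverage changes between adjacent columns.
    profile = mask.astype(float).mean(axis=0)
    signal = np.abs(np.diff(profile))
    stable = np.ones(signal.shape, dtype=bool)
    for sigma in sigmas:
        kernel = gaussian_kernel(sigma)
        if kernel.size > signal.size:
            raise ValueError("mask is too narrow for the largest sigma")
        smoothed = np.convolve(signal, kernel, mode="same")
        # Assumed rule: keep positions well above the smoothed average.
        stable &= smoothed > smoothed.mean() + k * smoothed.std()
    idx = np.flatnonzero(stable)
    if idx.size == 0:
        return []
    # Collapse each run of surviving positions to its strongest transition.
    runs = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)
    # +1 converts a diff index into the first column after the transition.
    return [int(run[np.argmax(signal[run])]) + 1 for run in runs]


# Toy mask: covered, uncovered, covered bands of 20 columns each.
mask = np.zeros((20, 60))
mask[:, :20] = 1
mask[:, 40:] = 1
edges = column_edges(mask)
```

Requiring a position to survive at every scale is one simple way to keep edges that are stable under smoothing while discarding isolated noise spikes; the same routine applied to `mask.T` would give row boundaries.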

Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan

First: 2025-12-24T17:05:09+00:00 · Latest: 2025-12-24T17:05:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, the performance of SNNs is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to state-of-the-art (SOTA) ANN-based models while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

中文标题/摘要

标题：使用尖峰驱动视频变换器的手术场景分割及其实时潜力

现代手术系统越来越多地依赖智能场景理解以提供及时的情境感知，从而增强术中安全性。在此流程中，手术场景分割在准确感知手术事件方面发挥着核心作用。尽管最近的深度学习模型，尤其是大规模基础模型，实现了显著的分割准确性，但它们巨大的计算需求和高能耗阻碍了在资源受限的手术环境中进行实时部署。为解决这一限制，我们探索了新兴的SNN作为高效手术智能的有前途范式。然而，其性能仍受到手术标注数据稀缺和手术视频表示固有的稀疏性限制。为此，我们提出了SpikeSurgSeg，这是首个针对手术场景分割的尖峰驱动视频变换器框架，具有在非GPU平台上实现实时潜力的潜力。为解决手术标注数据有限的问题，我们引入了一种针对SNN的手术场景掩码自编码预训练策略，通过逐层管状掩码实现稳健的空间-时间表示学习。在此预训练骨干的基础上，我们进一步采用一种轻量级的尖峰驱动分割头，该头能够产生时间一致的预测，同时保持SNN的低延迟特性。在EndoVis18和我们内部的SurgBleed数据集上的广泛实验表明，SpikeSurgSeg在推断延迟方面至少减少了8倍，同时其mIoU与最先进的基于ANN的模型相当。值得注意的是，它相对于大多数基础模型基线的加速比超过20倍，突显了其在时间关键型手术场景分割中的潜力。

Summary / 总结

The research aims to develop a real-time surgical scene segmentation model for enhanced intra-operative safety. It proposes SpikeSurgSeg, a spike-driven video Transformer framework that addresses the computational demands of existing models. By using a surgical-scene masked autoencoding pretraining strategy and a lightweight spike-driven segmentation head, SpikeSurgSeg achieves mean intersection over union (mIoU) comparable to state-of-the-art ANN-based models while reducing inference latency by at least 8 times and offering over 20 times acceleration compared to most foundation-model baselines.

研究旨在开发实时手术场景分割模型以提高术中安全性。提出了SpikeSurgSeg，这是一种基于尖峰的视频Transformer框架，通过掩码自编码预训练学习稳健的时空表示，并采用轻量级分割头以实现低延迟预测。实验表明，SpikeSurgSeg在平均交并比(mIoU)上与最先进的模型相当，同时将推理延迟减少了至少8倍，并且相对于大多数基础模型基线加速了20倍以上，突显了其在手术场景分割中的潜力。

SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

Authors: Divij Dudeja, Mayukha Pal

First: 2025-12-24T16:59:04+00:00 · Latest: 2025-12-24T16:59:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documentation, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically and is built from three main components: (1) a syntax-aware fact extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these subject-relation-object facts as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with lower processing requirements. SMART employs two inference modes: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by retrieval-augmented generation (RAG) for new uploads (FAISS top-20 results, with memory capped at 64 slots). In real-world deployment, this framework yields better-supported results with fewer hallucinations than comparable small transformer models.

中文标题/摘要

标题：SMART SLM：结构化记忆与推理变换器，一种用于准确文档辅助的小型语言模型

工程手册（EM）的用户发现阅读EM文档困难，因为它们很长，格式密集，包含书面文档、逐步程序和工程设备的标准参数列表。现成的变换器，尤其是紧凑型的，将这些材料视为一个扁平的令牌流。这种方法导致自信但错误的数字答案，并迫使模型以低效的方式记忆单独的事实。SMART（结构化记忆与推理变换器）为上述问题提供了一种不同的且实用的解决方案。SMART通过使用分层方法来结构化其处理过程，并基于三个主要工作类别：（1）语法意识事实提取（语法学家）树LSTM，从EM句子中提取作为主语关系宾语关系的事实；（2）紧凑索引记忆MANN（记忆增强神经网络），将这些理性主语关系宾语对象索引为384维向量，与信息来源相关联；（3）6层变换器，学习将之前检索到的事实融合到其生成的响应中。整个SMART模型使用45.51M参数，比GPT-2（124M）少64%，比BERT（133M）少69%，并且准确率比GPT-2高21.3%，表明SMART以最少的处理要求更好地拟合数据。SMART采用双模式推理，已知文档的索引快速路径（亚秒级答案时间）和新上传文件的索引动态路径（借助RAGs的FAISS Top 20结果，记忆限制在64个槽位）。在实际部署中，该框架比可比的小型变换器模型产生更支持的结果，减少了幻觉。

Summary / 总结

The paper addresses the challenge of accurately processing engineering manuals (EM) using small language models. It introduces SMART (Structured Memory and Reasoning Transformer), which uses a hierarchical approach to extract facts from EM sentences, store them in a compact indexed memory, and generate responses by fusing these facts. SMART achieves 21.3% higher accuracy than GPT-2 with fewer parameters, demonstrating its effectiveness in handling EMs efficiently.

论文针对工程手册（EM）内容长且结构密集的问题，提出了SMART（结构化记忆和推理变换器）模型，该模型采用分层方法从EM句子中提取事实，存储在记忆网络中，并生成准确的响应。SMART的准确率比GPT-2高出21.3%，参数量更少，展示了其在处理结构化数据方面的有效性。

GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller

First: 2025-12-24T16:46:04+00:00 · Latest: 2025-12-24T16:46:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.

中文标题/摘要

标题：GriDiT: 基于因子化网格的扩散方法用于高效生成长图像序列

现代深度学习方法通常将图像序列视为大型张量，张量由按顺序堆叠的帧组成。然而，给定当前的最先进水平（SoTA），这种简单的表示是否理想？在本文中，我们从生成模型的角度回答了这个问题，并旨在设计一种更有效的图像序列数据建模方法。观察当前SoTA图像序列生成方法的低效性和瓶颈，我们展示了与其处理大型张量，通过首先在低分辨率下生成粗略的序列，然后在高分辨率下细化各个帧，可以改进生成过程。我们仅使用包含下采样帧的网格图像训练生成模型。然而，我们学习使用扩散变换器（DiT）的强自我注意机制来捕捉帧之间的相关性，从而生成图像序列。实际上，我们的建模方式将二维图像生成器扩展为低分辨率的三维图像序列生成器，而无需进行任何架构修改。随后，我们单独超分辨率每个帧以添加与序列无关的高分辨率细节。这种方法具有多种优势，并且可以克服该领域SoTA方法的关键限制。与现有的图像序列生成模型相比，我们的方法在合成质量上表现出色，并且在序列间具有更好的连贯性。它还能够生成任意长度的高保真图像序列，并且在推理时间和训练数据使用方面更加高效。此外，我们简洁的建模方式使我们的方法能够在多种数据领域中有效泛化，这通常需要额外的先验知识和监督才能在生成上下文中建模。我们的方法在数据集上始终在质量和推理速度（至少快两倍）方面优于SoTA。

Summary / 总结

GriDiT addresses the inefficiencies of current state-of-the-art (SoTA) methods in generating long image sequences by factorizing the process into low-resolution sequence generation followed by high-resolution frame refinement. It uses a generative model trained on grid images with subsampled frames and leverages the Diffusion Transformer's self-attention mechanism to capture frame correlations. This approach improves synthesis quality, coherence, and efficiency, achieving superior results compared to existing models and reducing inference time by at least half.

该论文提出了GriDiT方法，将长图像序列的生成过程分为两步：首先生成低分辨率序列，然后逐帧进行高分辨率细化。通过使用具有自注意力机制的扩散变换器（DiT），模型能够在网格图像上学习帧之间的关联。这种方法提高了合成质量、连贯性，并且更加高效，相比现有模型取得了更好的结果，并将推理时间至少缩短了一半。
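The grid-image representation that GriDiT trains on can be pictured with a small NumPy sketch: subsampled low-resolution frames are tiled into one 2D image, so that a 2D generator's self-attention sees all frames at once, and the grid is split back into frames before per-frame super-resolution. The row-major tiling order and the function names are assumptions made for illustration; the generator and the super-resolution model are not shown.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into a (rows*H, cols*W, C) image."""
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> grid image.
    tiled = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return tiled.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid: split a grid image back into frames."""
    gh, gw, c = grid.shape
    h, w = gh // rows, gw // cols
    # (rows, H, cols, W, C) -> (rows, cols, H, W, C) -> frame stack.
    tiles = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, h, w, c)


# Four tiny 2x3 single-channel frames arranged as a 2x2 grid.
frames = np.arange(4 * 2 * 3).reshape(4, 2, 3, 1)
grid = frames_to_grid(frames, rows=2, cols=2)
restored = grid_to_frames(grid, rows=2, cols=2)
```

Because the tiling is a pure reshape, it adds no parameters, which is consistent with the abstract's claim that no architectural modification is needed to reuse a 2D image generator.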

Learning to Refocus with Video Diffusion Models

Authors: SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Venue: SIGGRAPH Asia 2025

First: 2025-12-22T19:29:57+00:00 · Latest: 2025-12-24T16:32:32+00:00

Comments: Code and data are available at https://learn2refocus.github.io . SIGGRAPH Asia 2025, Dec. 2025

Abs · PDF · Code1 · Code2 · Project1

Abstract

Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at https://learn2refocus.github.io

中文标题/摘要

标题：学习使用视频扩散模型重新聚焦

对焦是摄影的基础，但自动对焦系统往往无法捕捉到预期的主体，用户经常希望在拍摄后调整对焦。我们提出了一种使用视频扩散模型进行现实后对焦的新方法。从单张失焦图像出发，我们的方法生成了一组感知上准确的焦深序列，表示为视频序列，支持交互式重新对焦并解锁一系列下游应用。我们发布了一个大规模的焦深数据集，以支持这项工作和未来的研究，该数据集在多种实际智能手机条件下采集。我们的方法在感知质量和在具有挑战性的场景中的鲁棒性方面均优于现有方法，为日常摄影中的更高级对焦编辑能力铺平了道路。代码和数据可在https://learn2refocus.github.io 获取

Summary / 总结

The paper introduces a novel method for post-capture refocusing using video diffusion models to address the limitations of autofocus systems in photography. Starting from a single defocused image, the approach generates a perceptually accurate focal stack, allowing for interactive refocusing and supporting various downstream applications. The method outperforms existing techniques in both perceptual quality and robustness across challenging scenarios, demonstrating its potential for advanced focus-editing capabilities in everyday photography. A large-scale dataset and code are provided to facilitate further research and development.

该研究提出了一种使用视频扩散模型进行后捕获对焦的方法。从单张失焦图像出发，该方法生成了感知上准确的焦距堆栈，支持交互式对焦和多种下游应用。该方法在感知质量和鲁棒性方面均优于现有方法，特别是在复杂场景下表现出色，并提供了一个大规模的焦距堆栈数据集以促进进一步研究。代码和数据可在https://learn2refocus.github.io 获取。

ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang

First: 2025-12-24T16:24:18+00:00 · Latest: 2025-12-24T16:24:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.

中文标题/摘要

标题：ACD：通过注意力监督实现视频扩散模型的直接条件控制

在视频合成中，可控性是基本要求，准确对齐条件信号至关重要。现有无分类器自由引导方法通常通过建模数据和条件的联合分布间接实现条件化，这往往导致对指定条件的有限可控性。基于分类器的引导通过外部分类器强制执行条件，但模型可能会利用这种机制提高分类器分数而不真正满足预期条件，从而产生对抗性伪影并限制有效的可控性。在本文中，我们提出了注意力条件扩散（ACD），一种通过注意力监督实现视频扩散模型直接条件控制的新框架。通过使模型的注意力图与外部控制信号对齐，ACD 达到了更好的可控性。为此，我们引入了一种稀疏的3D感知对象布局作为高效的条件信号，以及一个专用的布局控制网和自动注释流水线以实现可扩展的布局集成。在基准视频生成数据集上的大量实验表明，ACD 在保持时间连贯性和视觉保真度的同时，提供了与条件输入更好的对齐，建立了条件视频合成的有效范式。

Summary / 总结

The paper proposes Attention-Conditional Diffusion (ACD), a method for direct conditional control in video diffusion models using attention supervision. ACD aligns the model's attention maps with external control signals to enhance controllability. The method uses a sparse 3D-aware object layout as an efficient conditioning signal and includes a Layout ControlNet and an automated annotation pipeline. Experiments show that ACD provides better alignment with conditioning inputs while maintaining temporal coherence and visual fidelity, outperforming existing methods in conditional video synthesis.

论文提出了一种名为注意力条件扩散（ACD）的新框架，通过注意力监督直接使模型与外部控制信号对齐，以提高视频合成中的可控性。ACD引入了一种稀疏的3D感知对象布局作为高效的条件信号，并包含了一个布局控制网和自动注释流水线。实验表明，ACD在保持时间连贯性和视觉保真度的同时，能够更好地与条件输入对齐，优于现有方法在条件视频合成中的表现。

DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Authors: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

First: 2025-12-24T16:00:15+00:00 · Latest: 2025-12-24T16:00:15+00:00

Comments: Project Page: https://dreamontage.github.io/DreaMontage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.

中文标题/摘要

标题：DreaMontage：任意帧引导的一次性视频生成

“一次性”技术在电影制作中代表了一种独特的且高超的美学风格。然而，其实现往往受到高昂成本和复杂现实约束的阻碍。尽管新兴的视频生成模型提供了虚拟替代方案，但现有方法通常依赖于简单的片段拼接，这往往无法保持视觉连贯性和时间一致性。在本文中，我们介绍了DreaMontage，这是一种全面的框架，用于任意帧引导的生成，能够从多种用户提供的输入中合成无缝、富有表现力且长时间的一次性视频。为了实现这一目标，我们从三个主要维度应对挑战。(i) 我们将一个轻量级的中间条件机制整合到DiT架构中。通过采用一种有效的基训练数据调优策略，我们解锁了强大的任意帧控制能力。(ii) 为了提高视觉保真度和电影表现力，我们精心制作了一个高质量的数据集，并实施了一个视觉表达SFT阶段。通过应用定制的DPO方案，我们解决了诸如主体运动合理性及过渡平滑性等关键问题，显著提高了生成内容的成功率和可用性。(iii) 为了促进长序列的生成，我们设计了一种分段自回归(SAR)推理策略，该策略在内存高效的情况下运行。广泛的实验表明，我们的方法能够实现视觉上引人注目且无缝连贯的一次性效果，同时保持计算效率，使用户能够将零散的视觉材料转化为生动、连贯的一次性电影体验。

Summary / 总结

DreaMontage is a framework for generating seamless one-shot videos from arbitrary user-provided frames. It integrates a lightweight intermediate-conditioning mechanism into the DiT architecture, trained with an Adaptive Tuning strategy, and adds a Visual Expression SFT stage to enhance visual fidelity and cinematic expressiveness. A Tailored DPO scheme targets subject motion rationality and transition smoothness, and a Segment-wise Auto-Regressive (SAR) inference strategy produces extended sequences in a memory-efficient manner. Experiments show that DreaMontage produces visually striking and seamlessly coherent one-shot videos while maintaining computational efficiency.

DreaMontage 是一个从任意帧生成无缝一镜头视频的框架。它将轻量级的中间条件机制集成到 DiT 架构中，使用自适应调谐策略，并包含视觉表达 SFT 阶段以提高视觉保真度和表现力。该方法还采用定制的 DPO 方案和分段自回归 (SAR) 推断策略来解决关键问题并提高计算效率。实验表明，DreaMontage 生成了视觉上引人注目且时间上连贯的一镜头视频。

LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov

First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00

Abstract

Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io

中文标题/摘要

标题：LookPlanGraph：基于VLM图增强的体感指令跟随方法

使用大型语言模型（LLM）作为体感指令跟随任务规划器的方法已经变得普遍。为了成功完成任务，LLM 必须在机器人操作的环境中进行接地。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图，并假设在规划开始时所有任务相关信息都已可用。然而，这些方法没有考虑到在图构建和任务执行之间环境可能发生的变化。我们提出了 LookPlanGraph 方法，该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中，LookPlanGraph 不断通过验证现有先验或发现新实体来更新图。这通过使用视觉语言模型处理代理的主观摄像机视图来实现。我们在具有更改对象位置的 VirtualHome 和 OmniGibson 模拟环境中进行了实验，证明了 LookPlanGraph 在基于预定义静态场景图的方法中表现出色。为了展示我们方法的实际适用性，我们还在现实世界中进行了实验。此外，我们引入了 GraSIF（用于指令跟随的图场景）数据集及其自动验证框架，包含来自 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。

Summary / 总结

The paper proposes LookPlanGraph, a method that enhances embodied instruction following by using a scene graph augmented with object priors and real-time updates. During execution, the method continuously updates the graph based on the robot's egocentric view using a Vision Language Model. Experiments in simulated and real-world environments show that LookPlanGraph outperforms methods relying on static scene graphs, especially when the environment changes between graph construction and task execution. The study also introduces the GraSIF dataset for instruction following with an automated validation framework.

该论文提出了一种名为LookPlanGraph的方法，通过结合对象先验和基于机器人视觉输入的持续更新来增强物体的场景图，从而提高指令跟随能力。该方法在模拟和真实环境中均优于基于预构建静态场景图的方法，特别是在物体位置发生变化时表现出更好的性能。实验表明，与依赖预构建场景图的方法相比，该方法在处理动态环境方面具有优势。
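
The graph update described in the abstract (verifying object priors or discovering new entities from the egocentric view) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `SceneGraph` class, the `"prior"` and `"observed"` labels, the rule that unconfirmed priors are dropped, and the `detections` list standing in for Vision Language Model output are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Static assets (rooms, furniture) are assumed to be known in advance.
    assets: set
    # object name -> (asset it is believed to be at, status), where status is
    # "prior" (unverified guess) or "observed" (seen by the camera).
    objects: dict = field(default_factory=dict)

    def update(self, asset, detections):
        """Merge what was seen at `asset` into the graph.

        `detections` is a hypothetical list of object names that a VLM
        reported in the current egocentric view.
        """
        seen = set(detections)
        # Assumed rule: drop priors that placed an object here but were not confirmed.
        for name, (place, status) in list(self.objects.items()):
            if place == asset and status == "prior" and name not in seen:
                del self.objects[name]
        # Verify confirmed priors and add newly discovered objects.
        for name in seen:
            self.objects[name] = (asset, "observed")


graph = SceneGraph(
    assets={"kitchen_table", "fridge"},
    objects={"apple": ("kitchen_table", "prior"), "milk": ("fridge", "prior")},
)
# The robot looks at the table: the apple is gone and a mug is there instead.
graph.update("kitchen_table", ["mug"])
```

After the call, the refuted `apple` prior is gone, the `mug` is recorded as observed, and the `milk` prior is untouched because the fridge has not been inspected yet.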

GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer

Authors: Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry

First: 2025-12-23T14:40:08+00:00 · Latest: 2025-12-24T15:28:58+00:00

Abstract

We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against DoMINO, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.

中文标题/摘要

标题：GeoTransolver：使用多尺度几何感知物理注意力变换器在不规则域中学习物理

我们提出了GeoTransolver，这是一种用于CAE的多尺度几何感知物理注意力变换器，它用GALE替代了标准注意力，将物理感知的自我注意力应用于学习的状态切片，并与从多尺度球查询（受DoMINO启发）计算的共享几何/全局/边界条件上下文进行交叉注意力连接，并在每个块中重用。GeoTransolver在NVIDIA PhysicsNeMo中实现并发布，持续将几何、全局和边界条件参数投影到物理状态空间，将潜在计算锚定在域结构和操作范围内。我们在DrivAerML、Luminary SHIFT-SUV和Luminary SHIFT-Wing上对GeoTransolver进行了基准测试，与Domino、Transolver（在PhysicsNeMo中发布）和文献报告的AB-UPT进行比较，并评估了场变量的拖曳/升力R2和相对L1误差。GeoTransolver提供了更好的准确性、对几何/范围变化的改进鲁棒性以及有利的数据效率；我们包括了DrivAerML上的消融分析和诸如等值线图和最佳GeoTransolver模型的设计趋势等定性结果。通过在可扩展的变换器中统一多尺度几何感知上下文和基于物理的注意力，GeoTransolver促进了复杂、不规则域和非线性物理范围中的操作学习，以实现高保真代理建模。

Summary / 总结

GeoTransolver is a multiscale geometry-aware physics attention transformer for computer-aided engineering (CAE) on irregular domains. It replaces standard attention with GALE, which couples physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries. On DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, GeoTransolver is compared against DoMINO, Transolver, and literature-reported AB-UPT using drag/lift R2 and Relative L1 errors for field variables, and it delivers better accuracy, improved robustness to geometry and regime shifts, and favorable data efficiency.

GeoTransolver 是一种面向计算机辅助工程（CAE）的多尺度几何感知物理注意力 Transformer。它使用 GALE 将物理感知的自注意力应用于学习的状态切片，并结合对共享几何/全局/边界条件上下文的交叉注意力，该上下文由多尺度球查询计算得到。GeoTransolver 在 DrivAerML、Luminary SHIFT-SUV 和 Luminary SHIFT-Wing 上进行了基准测试，显示了比 DoMINO、Transolver 和文献报告方法更好的准确性、对几何和工况变化的更强鲁棒性以及有利的数据效率。还提供了 DrivAerML 上的消融研究和定性结果。

SegMo: Segment-aligned Text to 3D Human Motion Generation

Authors: Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen

First: 2025-12-24T15:26:11+00:00 · Latest: 2025-12-24T15:26:11+00:00

Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abstract

Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.

中文标题/摘要

标题：SegMo: 与片段对齐的文本到3D人体动作生成

从文本描述生成3D人体动作是一个重要的研究问题，在视频游戏、虚拟现实和增强现实等领域有着广泛的应用。最近的方法在序列级别对齐文本描述和人体动作，忽略了模态的内部语义结构。然而，动作描述和动作序列可以自然地分解为更小且语义上更连贯的片段，这些片段可以作为原子对齐单元，以实现更精细的对应。受此启发，我们提出了一种新的SegMo框架，以实现细粒度的文本-动作对齐。我们的框架由三个模块组成：(1) 文本片段提取，将复杂的文本描述分解为按时间顺序排列的短语，每个短语代表一个简单的原子动作；(2) 动作片段提取，将完整的动作序列分割为相应的动作片段；(3) 细粒度文本-动作对齐，通过对比学习对齐文本和动作片段。广泛的实验表明，SegMo在两个广泛使用的数据集上提高了强基线，HumanML3D测试集上的TOP 1得分为0.553。此外，由于学习到的文本和动作片段共享嵌入空间，SegMo还可以应用于检索任务，如动作定位和动作到文本检索。

Summary / 总结

SegMo is a novel framework for generating 3D human motions from text, addressing the limitation of previous methods by aligning text and motion at the segment level. It consists of three modules: Text Segment Extraction, Motion Segment Extraction, and Fine-grained Text-Motion Alignment. SegMo outperforms strong baselines, achieving a TOP 1 score of 0.553 on the HumanML3D test set and showing effectiveness in retrieval tasks such as motion grounding and motion-to-text retrieval.

SegMo 是一种新颖的框架，用于从文本生成 3D 人体动作，通过在段落级别对齐文本和动作来解决先前方法的局限性。它包含三个模块：文本段落提取、动作段落提取和细粒度文本-动作对齐。SegMo 在 HumanML3D 测试集上优于强基线，达到 TOP 1 分数 0.553，并且在动作定位和动作到文本检索等检索任务中也表现出有效性。
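
The abstract states only that text and motion segments are aligned "with contrastive learning" and does not give the loss. As a hedged illustration, the sketch below uses a symmetric InfoNCE objective, a common choice for this kind of alignment; the function names, the temperature value, and the toy 2-D embeddings are assumptions, not details taken from SegMo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(text_embs, motion_embs, temperature=0.1):
    """Symmetric InfoNCE: the i-th text segment should match the i-th motion segment."""
    logits = [[cosine(t, m) / temperature for m in motion_embs] for t in text_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the diagonal entry as the target.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)  # subtract the max for numerical stability
            log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_norm - row[i]
        return total / len(rows)

    text_to_motion = cross_entropy(logits)
    motion_to_text = cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (text_to_motion + motion_to_text)


# Toy embeddings for three segments, e.g. "walk", "turn", "sit".
text = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
motion_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
motion_shuffled = [motion_aligned[1], motion_aligned[2], motion_aligned[0]]

loss_aligned = info_nce(text, motion_aligned)
loss_shuffled = info_nce(text, motion_shuffled)
```

The loss is small when each phrase sits closest to its own motion segment and grows when the pairing is shuffled, which is the signal that pulls matching segments together in a shared embedding space.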

MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller

First: 2025-12-24T15:15:18+00:00 · Latest: 2025-12-24T15:15:18+00:00

Abstract

Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.

中文标题/摘要

标题：MiST：理解中期科学训练在开发化学推理模型中的作用

大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而，最近的研究揭示了一个关键限制：强化学习仅在基础模型已赋予正确答案不可忽略的概率时才能成功，我们称这一特性为“潜在可解性”。本研究探讨了化学推理能力的出现及其先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件：1) 符号能力，2) 潜在化学知识。我们提出了中期科学训练（MiST）：一系列中期训练技术以满足这些条件，包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上继续预训练以及在10亿个标记上进行监督微调。这些步骤将3B和7B模型的潜在可解性得分最多提高至1.8倍，并使强化学习在有机反应命名中的top-1准确率从10.9%提升至63.9%，在无机材料生成中的top-1准确率从40.6%提升至67.4%。对于其他具有挑战性的化学任务，也观察到了类似的结果，同时生成了可解释的推理痕迹。我们的结果定义了化学推理训练的明确先决条件，并突显了中期训练在解锁推理能力中的更广泛作用。

Summary / 总结

This study explores the development of chemical reasoning capabilities in large language models through mid-stage scientific training (MiST), which combines data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. The research identifies two prerequisites for RL-based chemical reasoning: symbolic competence and latent chemical knowledge. MiST raises the latent-solvability score of 3B and 7B models by up to 1.8x, enabling reinforcement learning to lift top-1 accuracy from 10.9% to 63.9% on organic reaction naming and from 40.6% to 67.4% on inorganic material generation, while producing interpretable reasoning traces.

研究探讨了通过中期科学训练（MiST）提升大型语言模型的化学推理能力，该方法包括数据混合、持续预训练和监督微调。研究确定了两个先决条件：符号能力与潜在的化学知识。该方法显著提高了潜在可解性分数，使有机反应命名和无机材料生成任务的准确率分别从初始的10.9%和40.6%提升到63.9%和67.4%。这项工作明确了化学推理训练的明确要求，并强调了中期训练在解锁推理能力中的重要作用。

PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

Authors: Xiao-Qi Han, Ze-Feng Gao, Peng-Jie Guo, Zhong-Yi Lu

First: 2025-12-24T15:07:36+00:00 · Latest: 2025-12-24T15:07:36+00:00

Comments: 19 pages, 6 figures

Abstract

In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen), the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench

中文标题/摘要

标题：PhononBench：一种基于声子的大规模基准测试，用于晶体生成中的动力学稳定性

在本工作中，我们介绍了PhononBench，这是首个用于AI生成晶体动力学稳定性的大规模基准测试。利用最近开发的MatterSim原子间势，该势能在超过10,000种材料中实现了从头算水平的声子预测精度，PhononBench能够高效地进行大规模声子计算和动力学稳定性分析，针对六种领先的晶体生成模型生成的108,843种晶体结构。PhononBench揭示了当前生成模型在确保动力学稳定性方面的普遍局限性：所有生成结构的动力学稳定性平均率为25.83%，最佳模型MatterGen也仅达到41.0%。进一步的案例研究显示，在目标性质生成中——以MatterGen的带隙调节为例——即使在最佳带隙条件0.5 eV下，动力学稳定性率仍低至23.5%。在空间群控制生成中，高对称晶体表现出更好的稳定性（例如，立方系统达到49.2%的稳定性率），但所有控制生成的平均稳定性仍仅为34.4%。这项研究的重要附加成果是识别了28,119种在整个布里渊区都稳定的晶体结构，为未来的材料探索提供了大量可靠的候选者。通过建立首个大规模动力学稳定性基准测试，本工作系统地突显了当前晶体生成模型的局限性，并提供了未来开发设计和发现物理上可行材料所需的重要评估标准和指导。所有模型生成的晶体结构、声子计算结果以及PhononBench开发的高通量评估工作流程将在https://github.com/xqh19970407/PhononBench公开发布
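
The dynamical-stability rate reported above is the fraction of structures with no imaginary phonon modes anywhere in the Brillouin zone. The sketch below shows one way such a rate could be computed from precomputed phonon frequencies. It is an illustration under stated assumptions, not the PhononBench workflow: imaginary modes are encoded as negative frequencies (a common convention in phonon codes), the tolerance of 0.05 THz is an arbitrary example value, and the spectra are made up.

```python
def is_dynamically_stable(frequencies_thz, tolerance_thz=0.05):
    """Return True if no phonon mode is imaginary beyond a small tolerance.

    `frequencies_thz` holds every mode frequency sampled over the Brillouin
    zone for one structure. Imaginary modes are assumed to be encoded as
    negative numbers; the tolerance absorbs small numerical noise.
    """
    return min(frequencies_thz) >= -tolerance_thz


def stability_rate(structures):
    """Fraction of structures whose phonon spectrum passes the stability check."""
    stable = sum(is_dynamically_stable(freqs) for freqs in structures.values())
    return stable / len(structures)


# Made-up spectra for three hypothetical generated crystals.
spectra = {
    "crystal_a": [0.0, 0.0, 0.0, 3.1, 4.7, 9.2],    # all real: stable
    "crystal_b": [-2.4, 0.0, 0.0, 1.8, 5.5, 7.9],   # imaginary mode: unstable
    "crystal_c": [-0.01, 0.0, 0.0, 2.2, 6.0, 8.3],  # numerical noise only: stable
}
rate = stability_rate(spectra)
```

Here two of the three toy structures pass, so `rate` is 2/3. The choice of tolerance matters in practice, since a looser threshold reclassifies borderline structures as stable.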

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

Venue: MM

First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00

Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables

Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval

中文标题/摘要

标题：利用轻量级实体提取实现可扩展的基于事件的图像检索

从自然语言描述中检索图像是一项核心任务，位于计算机视觉和自然语言处理的交叉领域，广泛应用于搜索引擎、媒体归档和数字内容管理中。然而，由于模糊或依赖上下文的查询、语言的多样性以及需要可扩展的解决方案，现实世界中的图像-文本检索仍然具有挑战性。在本研究中，我们提出了一种轻量级的两阶段检索管道，利用事件中心的实体提取来结合现实世界标题中的时间与上下文信号。第一阶段使用BM25基于显著实体进行高效的候选过滤，而第二阶段则应用BEiT-3模型来捕捉深层次的跨模态语义并重新排序结果。在OpenEvents v1基准测试上，我们的方法达到了0.559的平均精度，显著优于先前的基线。这些结果突显了结合事件导向的过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。

Summary / 总结

This paper addresses the challenge of retrieving images from natural language descriptions by proposing a lightweight two-stage retrieval pipeline. The first stage filters candidates using BM25 based on salient entities, while the second stage employs BEiT-3 models to capture deep multimodal semantics and rerank the results. The method achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, outperforming previous approaches, demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in real-world scenarios.

该研究提出了一种轻量级的两阶段检索管道，以解决从自然语言描述中检索图像的挑战。第一阶段使用基于显著实体的BM25进行高效的候选过滤，第二阶段则使用BEiT-3模型捕获深度多模态语义并重新排序结果。该方法在OpenEvents v1基准测试中实现了0.559的平均精度，显著优于之前的基线方法，在具有复杂查询和语言变异性的真实世界场景中表现出色。
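
The two-stage design (BM25 filtering over salient entities, then reranking with a multimodal model) can be sketched as below. This is a minimal illustration rather than the authors' pipeline: the BM25 scorer is a plain Okapi BM25 written from scratch with typical default parameters, the entity extraction step is skipped by starting from pre-tokenized captions, and `fake_match` is a hypothetical lookup table standing in for BEiT-3 match scores.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def retrieve(query_entities, docs, rerank_score, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: reorder them."""
    scores = bm25_scores(query_entities, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=rerank_score, reverse=True)


# Captions reduced to entity tokens (the extraction step itself is not shown).
captions = [
    ["paris", "olympics", "2024", "opening", "ceremony"],
    ["tokyo", "olympics", "2021", "stadium"],
    ["paris", "fashion", "week", "runway", "models", "2023"],
]
# Hypothetical stand-in for a multimodal reranker: fixed match scores per caption index.
fake_match = {0: 0.91, 1: 0.40, 2: 0.15}
ranking = retrieve(["paris", "olympics", "2024"], captions, fake_match.get)
```

The cheap lexical stage narrows the pool so that the expensive model only scores `top_k` candidates, which is where the scalability of a two-stage design comes from.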

RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00

Comments: 11 pages, 6 figures

Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

中文标题/摘要

标题：RoboSafe：通过可执行的安全逻辑保护具身代理

由视觉-语言模型（VLMs）驱动的具身代理越来越能够执行复杂的现实世界任务，但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为，提供了一种有前景的解决方案，因为它们具有灵活性。然而，现有的防御措施往往依赖于静态规则过滤或提示级控制，难以应对动态、时序依赖和上下文丰富的环境中出现的隐含风险。为了解决这个问题，我们提出了RoboSafe，这是一种通过可执行谓词基础的安全逻辑为具身代理提供混合推理运行时保护的混合方法。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种后向反思推理模块，该模块不断回顾短期记忆中的最近轨迹，以推断时间安全谓词，并在检测到违规行为时主动触发重新规划。然后，我们提出了一种前瞻预测推理模块，该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预见即将出现的风险。这些组件共同形成了一个既可解释又可执行的适应性、验证性安全逻辑。在多个代理的广泛实验中，RoboSafe与领先基准相比显著减少了危险行为（风险发生率降低36.8%），同时保持了接近原始的任务性能。实际世界对物理机器人手臂的评估进一步证实了其实用性。代码将在接受后发布。

Summary / 总结

RoboSafe safeguards embodied agents at runtime with executable, predicate-based safety logic. It addresses the limitations of static rule filters and prompt-level control by integrating Backward Reflective Reasoning and Forward Predictive Reasoning over a Hybrid Long-Short Safety Memory. The system reduces risk occurrence by 36.8% compared with leading baselines while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality.

RoboSafe 通过使用可执行的安全逻辑来保护实体代理，结合了后向反思推理以持续监控最近的动作和前向预测推理以预见潜在风险。实验表明，RoboSafe 相比现有方法将危险动作减少了 36.8%，同时保持了类似的任务性能。实际机器人手臂的评估进一步证实了其实用性。
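
To make the idea of safety logic that is "executable as code" concrete, the sketch below checks a temporal safety predicate over a short trajectory and asks for replanning when it fires. It is only an illustration of the general pattern: in RoboSafe the predicates are inferred by reasoning modules, whereas here `stove_left_on` is hand-written, and the trajectory format, the `guard` function, and the step threshold are all assumptions.

```python
def stove_left_on(trajectory, max_steps_unattended=3):
    """Temporal predicate: the stove has stayed on for too many consecutive steps.

    `trajectory` is a list of (action, target) tuples, oldest first.
    """
    steps_on = None  # None means the stove is currently off
    for action, target in trajectory:
        if (action, target) == ("turn_on", "stove"):
            steps_on = 0
        elif (action, target) == ("turn_off", "stove"):
            steps_on = None
        elif steps_on is not None:
            steps_on += 1
    return steps_on is not None and steps_on > max_steps_unattended


def guard(trajectory, next_action, predicates):
    """Check every predicate on the trajectory extended by the proposed action."""
    candidate = trajectory + [next_action]
    violated = [p.__name__ for p in predicates if p(candidate)]
    return ("replan", violated) if violated else ("execute", [])


history = [
    ("turn_on", "stove"),
    ("pick_up", "pan"),
    ("walk_to", "living_room"),
    ("pick_up", "remote"),
]
decision = guard(history, ("turn_on", "tv"), [stove_left_on])
```

Because each predicate is an ordinary function, a violation can be traced to a named rule, which is one reading of the interpretability claim in the abstract.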

Latent Implicit Visual Reasoning

Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

First: 2025-12-24T14:59:49+00:00 · Latest: 2025-12-24T14:59:49+00:00

Abstract

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.

中文标题/摘要

标题：潜在隐式视觉推理

虽然大型多模态模型（LMMs）取得了显著进展，但它们仍然主要以文本为中心，依赖语言作为核心推理模态。因此，它们在处理以视觉为主的推理任务方面能力有限。最近的方法通过使用辅助图像、深度图或图像裁剪来监督中间的视觉步骤，试图解决这一问题。然而，这些策略对“有用的”视觉抽象施加了限制性的先验，增加了注释成本，并且难以在不同任务之间泛化。为了解决这一关键限制，我们提出了一种任务无关的机制，该机制训练LMMs发现和使用视觉推理标记，而无需显式的监督。这些标记全局注意并以任务自适应的方式重新编码图像，使模型能够提取相关视觉信息，而无需手工制作的监督。我们的方法在多种视觉中心任务上优于直接微调，并达到了最先进的结果——包括那些中间抽象难以指定的任务——同时在多任务指令调优方面也表现出泛化能力。

Summary / 总结

The research aims to enhance Large Multimodal Models (LMMs) by addressing their limited ability to handle predominantly visual reasoning tasks. The method involves training LMMs to discover and use visual reasoning tokens without explicit supervision, allowing the model to attend globally and re-encode images in a task-adaptive way. Key experimental findings show that this approach outperforms direct fine-tuning and achieves state-of-the-art results on various vision-centric tasks, including those where intermediate abstractions are difficult to specify, while also generalizing to multi-task instruction tuning.

研究旨在通过提出一种任务无关的机制，使大型多模态模型（LMMs）能够发现并使用视觉推理令牌，而无需显式的监督。该方法使模型能够全局关注并以任务自适应的方式重新编码图像，提取相关的视觉信息。该方法在各种视觉中心任务上优于直接微调，实现了最先进的结果，包括那些难以指定中间抽象的任务，并且在多任务指令调优中表现出良好的泛化能力。

A study of EHVI vs fixed scalarization for molecule design

Authors: Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige

Venue: NeurIPS

First: 2025-07-18T07:12:19+00:00 · Latest: 2025-12-24T14:56:07+00:00

Comments: Accepted to NeurIPS AI4Science Workshop 2025

Abstract

Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy - Expected Hypervolume Improvement (EHVI) - against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants - including random or adaptive schemes - our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.

中文标题/摘要

标题：分子设计中EHVI与固定加权化方法的比较研究

多目标贝叶斯优化（MOBO）为分子设计中的权衡导航提供了一个原则性的框架。然而，它与标量化替代方法的实证优势尚未得到充分探索。我们使用期望改进（EI）作为固定权重标量化基线，与基于帕累托的MOBO策略——期望超体积改进（EHVI）进行基准测试，采用严格控制的设置，具有相同的高斯过程代理和分子表示。在三个分子优化任务中，EHVI在帕累托前沿覆盖、收敛速度和化学多样性方面始终优于标量化EI。虽然标量化包括灵活的变体——包括随机或自适应方案，但我们的结果表明，在数据量有限的情况下，即使是强大的确定性实例也可能表现不佳。这些发现为在有限评估预算和非平凡权衡时新分子优化中帕累托意识获取的实际优势提供了实证证据。

Summary / 总结

The study investigates the performance of Expected Hypervolume Improvement (EHVI) compared to a fixed scalarization method (Expected Improvement, EI) in molecular design using multi-objective Bayesian optimization. Across three molecular optimization tasks, EHVI outperformed EI in terms of Pareto front coverage, convergence speed, and chemical diversity. The results suggest that Pareto-aware acquisition methods are advantageous, particularly in low-data regimes where trade-offs are complex.

研究比较了在分子设计的多目标贝叶斯优化中，Expected Hypervolume Improvement (EHVI) 与固定权重标量化方法（Expected Improvement, EI）的表现。在三个任务中，EHVI 在帕累托前沿覆盖、收敛速度和化学多样性方面均优于固定权重标量化的 EI。这表明，在数据有限且权衡复杂的情况下，帕累托感知的采集方法具有优势。
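
The difference between a Pareto-based acquisition and a fixed-weight scalarization can be seen in a two-objective toy example. The sketch below computes the hypervolume of a Pareto front, a Monte Carlo estimate of Expected Hypervolume Improvement, and a weighted-sum score. It is an illustration under assumptions: the study uses Gaussian Process surrogates, whereas here the posterior draws are hard-coded, and the equal weights, reference point, and candidate values are invented.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (both objectives maximized)."""
    area, best_y = 0.0, ref[1]
    for x, y in sorted(points, reverse=True):  # sweep from the largest first objective
        if x > ref[0] and y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def expected_hvi(front, posterior_samples, ref):
    """Monte Carlo EHVI: mean hypervolume gain over sampled candidate outcomes."""
    base = hypervolume_2d(front, ref)
    gains = [hypervolume_2d(front + [s], ref) - base for s in posterior_samples]
    return sum(gains) / len(gains)


def scalarized(point, weights=(0.5, 0.5)):
    """Fixed-weight scalarization: collapse both objectives into one number."""
    return weights[0] * point[0] + weights[1] * point[1]


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
ref = (0.0, 0.0)

base_hv = hypervolume_2d(front, ref)
# Two hypothetical posterior draws for one candidate molecule.
ehvi = expected_hvi(front, [(2.0, 3.0), (0.5, 0.5)], ref)
# A candidate that only extends one objective still scores well when scalarized.
balanced, lopsided = (2.0, 3.0), (5.0, 0.2)
balanced_gain = hypervolume_2d(front + [balanced], ref) - base_hv
lopsided_gain = hypervolume_2d(front + [lopsided], ref) - base_hv
```

In this toy case the weighted sum prefers the lopsided candidate, while the hypervolume gain prefers the balanced one that fills a gap in the front, which illustrates why a Pareto-aware criterion can cover the front better than a single fixed weighting.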

ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

Authors: Paritosh Parmar, Eric Peh, Basura Fernando

First: 2025-08-28T17:10:53+00:00 · Latest: 2025-12-24T14:52:45+00:00

Comments: Project page: https://paritoshparmar.github.io/chainreaction/

Abstract

Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets. We construct human-verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/

中文标题/摘要

标题：ChainReaction：因果链引导推理在模块化和可解释的因果为什么视频问答中的应用

现有的因果为什么视频问答（VideoQA）模型往往难以进行高层次推理，依赖于不透明的单一管道，将视频理解、因果推理和答案生成紧密结合在一起。这些黑盒方法缺乏可解释性，往往依赖于浅层启发式方法。我们提出了一种新的模块化范式，明确地将因果推理与答案生成分离，引入自然语言因果链作为可解释的中间表示。受人类认知模型的启发，这些结构化的因果序列将低级视频内容与高级因果推理联系起来，使推理变得透明且逻辑连贯。我们的两阶段架构包括因果链提取器（CCE），从视频-问题对生成因果链，以及因果链驱动的答案生成器（CCDA），基于这些链生成答案。为了解决缺乏标注推理痕迹的问题，我们提出了一种生成现有数据集中准确因果链的可扩展方法。我们为46000个样本构建了经过人工验证的因果链。我们还提出了CauCo，一种新的因果导向字幕评估指标。在三个大规模基准上的实验表明，我们的方法不仅优于最先进的模型，还在可解释性、用户信任和泛化方面取得了显著提升——将CCE定位为跨不同领域的可重用因果推理引擎。项目页面：https://paritoshparmar.github.io/chainreaction/

Summary / 总结

The paper proposes ChainReaction, a modular approach for causal-why video question answering that separates causal reasoning from answer generation using natural language causal chains. This method improves interpretability and logical coherence. Experiments show that ChainReaction outperforms existing models on three large-scale benchmarks, enhancing explainability, user trust, and generalization. The CCE generates causal chains, while the CCDA uses them to derive answers. A new evaluation metric, CauCo, is introduced for causality-oriented captioning.

该论文提出了ChainReaction，一种将因果推理与答案生成分离的模块化方法，使用自然语言因果链。这种方法提高了可解释性和逻辑连贯性。实验表明，ChainReaction在三个大规模基准上优于现有模型，增强了可解释性、用户信任和泛化能力。CCE生成因果链，而CCDA使用这些链来生成答案。还提出了一个新的评估指标CauCo，用于因果导向的字幕。

Causal-driven attribution (CDA): Estimating channel influence without user-level data

Authors: Georgios Filippou, Boi Mai Quach, Diana Lenghel, Arthur White, Ashish Kumar Jha

First: 2025-12-24T14:51:12+00:00 · Latest: 2025-12-24T14:51:12+00:00

Comments: 42 pages, 8 figures, initially submitted to the Journal of the Academy of Marketing Science on 24 Dec 2025

Abstract

Attribution modelling lies at the heart of marketing effectiveness, yet most existing approaches depend on user-level path data, which are increasingly inaccessible due to privacy regulations and platform restrictions. This paper introduces a Causal-Driven Attribution (CDA) framework that infers channel influence using only aggregated impression-level data, avoiding any reliance on user identifiers or click-path tracking. CDA integrates temporal causal discovery (using PCMCI) with causal effect estimation via a Structural Causal Model to recover directional channel relationships and quantify their contributions to conversions. Using large-scale synthetic data designed to replicate real marketing dynamics, we show that CDA achieves an average relative RMSE of 9.50% when given the true causal graph, and 24.23% when using the predicted graph, demonstrating strong accuracy under correct structure and meaningful signal recovery even under structural uncertainty. CDA captures cross-channel interdependencies while providing interpretable, privacy-preserving attribution insights, offering a scalable and future-proof alternative to traditional path-based models.

中文标题/摘要

标题：因果驱动归因(CDA): 在无用户级数据情况下估计渠道影响

归因建模是衡量营销效果的核心，但大多数现有方法依赖于用户级路径数据，这些数据因隐私法规和平台限制而变得越来越难以获取。本文介绍了一种因果驱动归因(CDA)框架，该框架仅使用聚合的印象级数据来推断渠道影响，避免依赖用户标识符或点击路径跟踪。CDA将时间因果发现（使用PCMCI）与结构因果模型中的因果效应估计相结合，以恢复渠道关系的方向并量化其对转化的贡献。使用设计用于复制真实营销动态的大规模合成数据，我们展示了当给定真实因果图时，CDA的平均相对RMSE为9.50%，使用预测图时为24.23%，证明在正确结构下具有很强的准确性，并且即使在结构不确定性下也能恢复有意义的信号。CDA捕捉跨渠道的相互依赖性，同时提供可解释的、保护隐私的归因洞察，为传统的路径模型提供了一个可扩展且未来导向的替代方案。

Summary / 总结

CDA is a framework that infers channel influence using aggregated impression-level data, avoiding user-level data. It combines temporal causal discovery and causal effect estimation to recover channel relationships and quantify their contributions. Experiments on synthetic data show CDA achieves an average relative RMSE of 9.50% with the true causal graph and 24.23% with the predicted graph, indicating strong accuracy and meaningful signal recovery even under structural uncertainty.

该研究提出了一种因果驱动归因（CDA）框架，利用聚合的展示级别数据来推断渠道影响，而不依赖于用户标识符或点击路径跟踪。CDA 结合了时间因果发现和因果效应估计，以恢复渠道关系并量化其对转化的贡献。实验结果显示，CDA 在使用真实因果图时的平均相对 RMSE 为 9.50%，使用预测图时为 24.23%，表明即使在结构不确定性下也能实现较强的准确性和有意义的信号恢复。
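
The "average relative RMSE" figures above are reported without a definition in this digest. As a rough illustration only, the sketch below assumes one common convention: the RMSE between estimated and true channel contributions, divided by the mean absolute true contribution. The function name `relative_rmse` and the sample numbers are hypothetical and are not taken from the paper.

```python
import math

def relative_rmse(estimated, true):
    """RMSE of the estimates divided by the mean absolute true value.

    This normalisation is an assumed convention, not the paper's stated one.
    """
    if len(estimated) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true)
    scale = sum(abs(t) for t in true) / len(true)
    if scale == 0:
        raise ValueError("true values must not all be zero")
    return math.sqrt(mse) / scale

# Hypothetical per-channel contributions (illustrative numbers only).
true_contrib = [100.0, 200.0, 300.0]
est_contrib = [110.0, 190.0, 310.0]

score = relative_rmse(est_contrib, true_contrib)
print(f"{score:.2%}")  # prints 5.00%
```

Under this convention, a value of 9.50% would mean the typical estimation error is just under a tenth of the average channel contribution.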

Human Motion Estimation with Everyday Wearables

Authors: Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang

First: 2025-12-24T14:44:51+00:00 · Latest: 2025-12-24T14:44:51+00:00

Abs · PDF · Code1 · Code2

Abstract

While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.

中文标题/摘要

标题：基于日常穿戴设备的人体运动估计

基于穿戴设备的人体运动估计对于XR交互等应用至关重要，但现有方法往往存在穿戴不便、硬件昂贵和校准繁琐的问题，这阻碍了它们在日常生活中的应用。为解决这些问题，我们提出了EveryWear，这是一种基于日常穿戴设备的轻量级和实用的人体运动捕捉方法：一部智能手机、智能手表、耳塞和智能眼镜，配备一个前置摄像头和两个向下摄像头，使用前无需进行显式校准。我们引入了Ego-Elec，这是一个包含56种日常活动的9小时真实世界数据集，覆盖17种不同的室内和室外环境，并提供了由运动捕捉（MoCap）提供的真实三维标注，以促进该领域的稳健研究和基准测试。我们的方法采用多模态教师-学生框架，将第一人称摄像头的视觉线索与消费设备的惯性信号相结合。通过直接在真实世界数据上进行训练，而不是在合成数据上进行训练，我们的模型有效地消除了限制先前工作的模拟到现实的差距。实验表明，我们的方法优于基线模型，验证了其在实际全身运动估计中的有效性。

Summary / 总结

The research aims to improve human motion estimation for applications like XR interaction by addressing issues such as poor wearability and expensive hardware. EveryWear, a lightweight approach using everyday wearables like a smartphone, smartwatch, earbuds, and smart glasses, is introduced. The method employs a multimodal teacher-student framework integrating visual cues from egocentric cameras with inertial signals from consumer devices, trained on real-world data. Experiments show that this approach outperforms baseline models, demonstrating its effectiveness for practical full-body motion estimation.

研究旨在通过解决穿戴不便和硬件昂贵等问题，改进用于XR交互等人机交互应用的人体动作估计。EveryWear方法利用智能手机、智能手表、耳塞和智能眼镜等日常穿戴设备进行动作捕捉，无需校准。该方法采用结合视觉和惯性信号的多模态教师-学生框架，并在实际场景中的实验表明其优于基线模型。

Analytic and Variational Stability of Deep Learning Systems

Authors: Ronald Katende

First: 2025-12-24T14:43:59+00:00 · Latest: 2025-12-24T14:43:59+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a unified analytic and variational framework for studying stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. We prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. Classical spectral stability results for feedforward networks, a discrete CFL-type condition for residual architectures, and parametric and temporal stability laws for stochastic gradient methods arise as direct consequences. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting framework provides a unified dynamical description of stability across architectures and optimization methods, clarifying how architectural and algorithmic choices jointly govern robustness and sensitivity to perturbations. It also provides a foundation for further extensions to continuous-time limits and geometric formulations of learning dynamics.

中文标题/摘要

标题：深度学习系统的分析与变分稳定性

我们提出了一种统一的分析和变分框架，用于研究将深度学习系统视为耦合的表示-参数动力学的稳定性。中心对象是学习稳定性概貌，它跟踪表示、参数和更新机制在学习轨迹上受到扰动时的微小响应。我们证明了一个基本的分析稳定性定理，表明这些稳定性特征的一致有界性，在范数等价的意义下，等价于存在一种类似李雅普诺夫的能量，该能量沿学习流耗散。在光滑情形下，该框架给出了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数。对于前馈网络的经典谱稳定性结果、残差架构的离散CFL型条件以及随机梯度方法的参数和时间稳定性法则均作为直接推论出现。该理论扩展到非光滑学习系统，包括ReLU网络、近端和投影更新以及随机次梯度流，通过用Clarke广义导数替换经典导数，并用变分李雅普诺夫泛函替换光滑能量。由此产生的框架提供了一种统一的动力学描述，涵盖了各种架构和优化方法的稳定性，阐明了架构和算法选择如何共同影响鲁棒性和对扰动的敏感性。它还为连续时间极限和学习动力学的几何形式提供了进一步扩展的基础。

Summary / 总结

This paper introduces a unified framework for analyzing the stability of deep learning systems by examining the dynamics of representations and parameters. The central concept is the Learning Stability Profile, which measures the response to perturbations. The authors prove a Fundamental Analytic Stability Theorem showing that bounded stability signatures are equivalent to the existence of a Lyapunov energy that dissipates during learning. The framework provides explicit stability exponents for smooth regimes and extends to non-smooth systems like ReLU networks and stochastic subgradient flows, offering a comprehensive description of stability across different architectures and optimization methods.

论文提出了一种结合解析和变分方法的统一框架，用于研究深度学习系统的稳定性。它引入了学习稳定性概貌来跟踪表示、参数和更新机制对扰动的响应。基本的解析稳定性定理表明，这些稳定性特征的一致有界性等价于沿学习流存在一种会耗散的Lyapunov型能量。该框架提供了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数，并扩展到了如ReLU网络和随机次梯度流等非光滑学习系统。
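
The abstract mentions classical spectral stability results for feedforward networks. The best-known result of this kind (stated here as background, not as the paper's theorem) says that a network with 1-Lipschitz activations such as ReLU has a Lipschitz constant no larger than the product of its layers' spectral norms. The sketch below checks that bound numerically on a small random network; the layer sizes, weight scaling, and helper names are illustrative assumptions.

```python
import numpy as np

def lipschitz_upper_bound(weights):
    """Product of spectral norms (largest singular values) of the layers."""
    return float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

def forward(weights, x):
    """Bias-free ReLU network; ReLU is 1-Lipschitz, so the bound applies."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)
    return x

# Hypothetical 3-layer network with small random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 8)) / 8 for _ in range(3)]
bound = lipschitz_upper_bound(weights)

# Measure how much a small input perturbation is amplified.
x = rng.normal(size=8)
delta = 1e-3 * rng.normal(size=8)
output_change = np.linalg.norm(forward(weights, x + delta) - forward(weights, x))
ratio = output_change / np.linalg.norm(delta)

print(f"empirical amplification {ratio:.4f} <= bound {bound:.4f}")
```

When the bound is below 1, the network is contractive and perturbations shrink as they pass through the layers. The bound is usually loose, so the empirical amplification is often much smaller.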

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

First: 2025-12-18T10:21:14+00:00 · Latest: 2025-12-24T14:39:27+00:00

Comments: Project available at https://github.com/sarapapi/hearing2translate

Abs · PDF · Code1 · Code2 · Code3

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

中文标题/摘要

标题：听译：将语音模态整合到LLM中的有效性

随着大型语言模型（LLMs）超越文本，将语音作为原生模态进行整合产生了语音LLMs，旨在直接翻译口语，从而绕过传统的转录管道。然而，这种整合是否能比现有的级联架构提高语音到文本翻译的质量仍然是一个开放的问题。我们提出了听译，这是第一个全面的测试套件，严格地将5个最先进的语音LLMs与16个强大的直接和级联系统进行了基准测试，这些系统结合了领先的语音基础模型（SFM）和多语言LLMs。我们的分析涵盖了16个基准、13种语言对和9种具有挑战性的条件，包括不连贯、嘈杂和长篇语音。在广泛的评估中，我们发现级联系统仍然是最可靠的，当前的语音LLMs仅在某些设置中与级联系统相当，而SFM则落后于两者，这表明在模型内部或管道中整合一个LLM对于高质量的语音翻译是必不可少的。

Summary / 总结

The research evaluates whether integrating speech as a native modality into Large Language Models (LLMs) improves direct speech-to-text translation. The study benchmarks 5 state-of-the-art SpeechLLMs against 16 direct and cascade systems across 16 benchmarks, 13 language pairs, and 9 challenging conditions. The findings indicate that cascaded systems remain the most reliable overall, that current SpeechLLMs match cascades only in selected settings, and that speech foundation models (SFMs) on their own lag behind both, suggesting that integrating an LLM, either within the model or in a pipeline, is crucial for high-quality speech translation.

研究探讨了将语音作为原生模态集成到大型语言模型（LLMs）中，以实现直接的语音到文本翻译的有效性。研究对比了5个最先进的SpeechLLMs与16个直接和级联系统在16个基准、13个语言对和9个具有挑战性的条件下的表现。结果显示，级联系统在整体上更为可靠，而当前的SpeechLLMs仅在某些设置中与级联系统相当，强调了集成LLM对于高质量语音翻译的重要性。

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Authors: Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu

First: 2025-12-24T14:28:17+00:00 · Latest: 2025-12-24T14:28:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose Schrödinger's Navigator, a navigation framework inspired by Schrödinger's thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger's Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.

中文标题/摘要

标题：薛定谔的导航器：零样本物体导航的未来图景想象

零样本物体导航（ZSON）要求机器人在未见过的环境中定位目标物体，无需依赖预先构建的地图或特定任务的训练。然而，现有的ZSON方法在现实且杂乱的环境中往往难以应对，尤其是在场景包含严重遮挡、未知风险或动态移动目标的情况下。为了解决这些挑战，我们提出了薛定谔的导航器，这是一种受薛定谔不确定性思想实验启发的导航框架。该框架将未观察到的空间视为一组可能的未来世界，并在行动前对其进行推理。基于第一人称视觉输入和三条候选轨迹，一个以轨迹为条件的3D世界模型沿着每条路径想象未来的观察结果。这使智能体能够超越遮挡物，预见未见区域的风险，而无需额外绕路或密集的全局映射。想象出的3D观察结果被融合到导航图中，并用于更新价值图。这些更新引导策略避开遮挡物，减少对不确定空间的暴露，并更好地追踪移动目标。在具有严重静态遮挡、未知风险和动态移动目标的三个挑战性场景中，使用四足机器人Go2进行的实验表明，薛定谔的导航器在自我定位、物体定位和整体成功率方面始终优于强大的ZSON基线。这些结果证明了轨迹条件下的3D想象在实现稳健的零样本物体导航方面的有效性。

Summary / 总结

Schrödinger's Navigator is a navigation framework designed for zero-shot object navigation in unseen environments with heavy occlusions and dynamic targets. It treats unobserved space as a set of plausible future worlds and reasons over them to guide the robot's actions. The framework uses a trajectory-conditioned 3D world model to imagine future observations along each path, enabling the robot to see beyond occlusions and anticipate risks. Experiments show that Schrödinger's Navigator outperforms existing methods in self-localization, object localization, and overall success rate in environments with severe occlusions and dynamic targets.

论文提出了薛定谔的导航器（Schrödinger's Navigator），这是一种用于零样本物体导航（ZSON）的导航框架，旨在解决在杂乱和遮挡环境中的挑战。该框架使用轨迹条件下的3D世界模型来想象未来的观察结果并更新导航地图，引导机器人避开遮挡和不确定的空间。实验结果显示，薛定谔的导航器在自定位、物体定位和遮挡密集环境中的成功率方面优于现有ZSON方法。

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

中文标题/摘要

标题：VisRes 基准：关于评估 VLM 视觉推理能力的研究

视觉-语言模型（VLMs）在视觉问答和图像描述等任务上取得了显著进展。然而，这些模型在视觉推理方面的表现与其依赖语言先验的程度仍然不清楚。为了解决这个问题，我们引入了 VisRes 基准，该基准旨在在无需上下文语言监督的自然环境中研究视觉推理。通过对三种复杂性级别的模型行为进行分析，我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其级别上隔离了不同的推理能力。第一级测试在模糊、纹理变化、遮挡和旋转等干扰下的感知完成和全局图像匹配；第二级测试单一属性（如颜色、数量、方向）的基于规则的推理；第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中，我们发现最先进的 VLMs 在微妙的感知干扰下表现接近随机，揭示了其有限的抽象能力，仅限于模式识别。最后，我们讨论了 VisRes 如何为多模态研究中的抽象视觉推理提供统一框架。

Summary / 总结

The paper introduces VisRes Bench, a benchmark to evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) without contextual language supervision. The benchmark consists of three levels of complexity: perceptual completion and global image matching (Level 1), rule-based inference over a single attribute (Level 2), and compositional reasoning over multiple visual attributes (Level 3). Across more than 19,000 controlled task images, state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition.

研究旨在通过引入VisRes Bench这一基准来评估Vision-Language模型（VLM）的视觉推理能力，该基准在无需上下文语言监督的自然环境中测试模型。研究分析了模型在三个复杂度级别的表现：感知补全、基于规则的推理和组合推理。关键发现表明，在超过19,000张受控任务图像上，最先进的VLM在细微的感知干扰下表现接近随机，表明其抽象能力有限，难以超越模式识别。

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan

First: 2025-12-24T14:08:38+00:00 · Latest: 2025-12-24T14:08:38+00:00

Comments: 14 pages, 10 figures, Technical Report

Abs · PDF · Code1 · Code2

Abstract

In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.

中文标题/摘要

标题：UltraShape 1.0：通过可扩展的几何细化生成高保真3D形状

在本报告中，我们介绍了UltraShape 1.0，这是一种可扩展的3D扩散框架，用于高保真3D几何生成。所提出的方法采用两阶段生成管道：首先合成粗略的整体结构，然后细化以生成详细的高质量几何形状。为了支持可靠的3D生成，我们开发了一个全面的数据处理管道，包括一种新颖的水密化处理方法和高质量数据过滤。该管道通过移除低质量样本、填补空洞和增厚细薄结构，提高了公开可用3D数据集的几何质量，同时保留了精细的几何细节。为了实现精细的几何细化，我们在扩散过程中将空间定位与几何细节合成解耦。我们通过在固定的空间位置进行体素细化来实现这一点，其中从粗略几何结构派生的体素查询提供了通过RoPE编码的显式位置锚点，使扩散模型能够专注于在缩减的结构化解空间内合成局部几何细节。我们的模型仅在公开可用的3D数据集上进行训练，尽管训练资源有限，但仍能实现强大的几何质量。广泛的评估表明，UltraShape 1.0在数据处理质量和几何生成方面与现有的开源方法相比具有竞争力。所有代码和训练模型将被发布以支持未来的研究。

Summary / 总结

UltraShape 1.0 is a scalable 3D diffusion framework for generating high-fidelity 3D geometry. It uses a two-stage pipeline that first synthesizes a coarse global structure and then refines it into detailed geometry. A data processing pipeline with a novel watertight processing method and high-quality data filtering improves the geometric quality of public 3D datasets. By decoupling spatial localization from geometric detail synthesis, with voxel queries from the coarse geometry serving as RoPE-encoded positional anchors, the diffusion model can focus on local geometric details and achieves strong geometric quality with limited training resources. Evaluations show it performs competitively with existing open-source methods in both data processing quality and geometry generation.

UltraShape 1.0 是一个通过两阶段过程生成高保真 3D 几何的可扩展 3D 扩散框架：首先是粗略的整体结构合成，然后进行详细的细化。它包括一个全面的数据处理管道，可以提高几何质量，并且在扩散过程中将空间定位与几何细节合成解耦，专注于在结构化的解空间内合成局部几何细节。评估显示，UltraShape 1.0 在数据处理和几何生成方面与现有开源方法相比具有竞争力。
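
The abstract says only that voxel positions are encoded via RoPE (rotary position embedding); it does not say how RoPE is applied to 3D coordinates. The sketch below shows standard 1D RoPE and, purely as an assumption, extends it to voxel indices by splitting the feature vector into three chunks, one per axis. The function `rope_3d` and all sizes are hypothetical and should not be read as UltraShape's implementation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard RoPE: rotate consecutive feature pairs by pos-dependent angles."""
    d = x.shape[-1]
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per feature pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, voxel_index):
    """Assumed 3D variant: one third of the features per axis (i, j, k)."""
    if x.shape[-1] % 6 != 0:
        raise ValueError("feature dimension must be divisible by 6")
    chunks = np.split(x, 3, axis=-1)
    rotated = [rope_1d(c, p) for c, p in zip(chunks, voxel_index)]
    return np.concatenate(rotated, axis=-1)

# Attention scores depend only on the relative voxel offset, here (3, 4, 5).
rng = np.random.default_rng(0)
q, k = rng.normal(size=12), rng.normal(size=12)
score_a = np.dot(rope_3d(q, (1, 2, 3)), rope_3d(k, (4, 6, 8)))
score_b = np.dot(rope_3d(q, (11, 12, 13)), rope_3d(k, (14, 16, 18)))
print(np.isclose(score_a, score_b))  # prints True
```

The final check illustrates why RoPE suits fixed spatial anchors: the query-key score is unchanged when both voxels are shifted by the same amount, so attention depends on relative rather than absolute position.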

Towards Arbitrary Motion Completing via Hierarchical Continuous Representation

Authors: Chenghao Xu, Guangtao Lyu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng

First: 2025-12-24T14:07:04+00:00 · Latest: 2025-12-24T14:07:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.

中文标题/摘要

标题：通过分层连续表示实现任意运动补全

物理运动本质上是连续的，更高的相机帧率通常有助于提高平滑度和时间连贯性。首次探索了人类运动序列的连续表示，能够对任意输入运动序列进行任意帧率的插值、过渡甚至外推。为此，我们提出了一种基于隐式神经表示（INRs）的名为NAME的新型参数激活诱导分层隐式表示框架。我们的方法引入了分层时间编码机制，从运动序列的多个时间尺度中提取特征，有效捕捉复杂的时序模式。此外，我们还将基于傅里叶变换的自定义参数激活函数集成到基于MLP的解码器中，以增强连续表示的表达能力。这种参数化表示显著增强了模型对复杂运动行为的高精度表示能力。在多个基准数据集上的广泛评估表明，我们提出的方法具有有效性和鲁棒性。

Summary / 总结

The research aims to develop a method for arbitrary motion completion by exploring continuous representations of human motion sequences. The proposed method, named NAME, uses a hierarchical implicit representation framework based on Implicit Neural Representations (INRs) and introduces a hierarchical temporal encoding mechanism to capture intricate temporal patterns. It also incorporates a custom parametric activation function, based on Fourier transformations, to enhance the model's expressiveness. Experimental results show that the method effectively interpolates, inbetweens, and extrapolates motion sequences at arbitrary frame rates, demonstrating its effectiveness and robustness across multiple benchmark datasets.

研究旨在通过探索人类运动序列的连续表示来实现任意运动补全。提出的名为NAME的方法使用基于隐式神经表示（INRs）的分层隐式表示框架，结合分层时间编码机制和基于傅里叶变换的自定义参数激活函数。实验表明，该方法可以有效地在任意帧率下插值、过渡和外推运动序列，展示了其在多个基准数据集上的有效性和鲁棒性。