arXiv 论文速递

2025-12-28 03:29
Snapshot: 20251228_0329
HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua
First: 2025-12-24T18:59:58+00:00 · Latest: 2025-12-24T18:59:58+00:00
Comments: Project Page: http://haonanqiu.com/projects/HiStream.html
Abstract
High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.
中文标题/摘要
标题:HiStream:通过消除冗余的流式传输高效生成高分辨率视频
高分辨率视频生成对于数字媒体和电影至关重要,但由于扩散模型的二次复杂性导致计算瓶颈,使得实际推理变得不可行。为了解决这一问题,我们引入了HiStream,这是一种高效的自回归框架,系统地在三个维度上减少冗余:i) 空间压缩:在低分辨率去噪后再在高分辨率处使用缓存特征进行细化;ii) 时间压缩:采用分块策略并使用固定大小的锚点缓存,确保推理速度稳定;iii) 时间步压缩:对后续的缓存条件块应用较少的去噪步骤。在1080p基准测试中,我们的主要HiStream模型(i+ii)实现了最先进的视觉质量,同时与Wan2.1基线相比去噪速度提高了76.2倍,且几乎无质量损失。我们的更快变体HiStream+应用了所有三种优化(i+ii+iii),相对于基线实现了107.5倍的加速,提供了速度和质量之间的权衡,从而使得高分辨率视频生成既实用又可扩展。
Summary / 总结
HiStream is an efficient autoregressive framework designed to reduce the computational complexity of high-resolution video generation. It achieves this by implementing three strategies: spatial compression, temporal compression, and timestep compression. The primary HiStream model, which includes spatial and temporal compression, achieves state-of-the-art visual quality with up to 76.2x faster denoising compared to the Wan2.1 baseline. The faster variant, HiStream+, which incorporates all three optimizations, offers a 107.5x acceleration, maintaining a good balance between speed and quality.
HiStream 是一种高效的自回归框架,通过空间、时间和时间步长压缩来减少冗余,解决高分辨率视频生成的计算瓶颈问题。主要模型在保持与 Wan2.1 基线相当的视觉质量的同时,实现了高达 76.2 倍的去噪加速,而更快的变体 HiStream+ 进一步实现了 107.5 倍的加速,尽管在质量上略有妥协。
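As a quick check on the reported figures: both speedups are stated relative to the same Wan2.1 baseline, so their ratio gives a rough idea of the extra gain from adding timestep compression (iii) on top of (i+ii). This is a minimal sketch; the variable names are illustrative, and treating the two "up to" figures as measured in the same setting is an assumption.

```python
# Speedups reported in the abstract, both relative to the Wan2.1 baseline.
histream_speedup = 76.2        # HiStream (i + ii)
histream_plus_speedup = 107.5  # HiStream+ (i + ii + iii)

# Assumption: both figures come from the same setting, so their ratio
# approximates the additional gain from timestep compression alone.
extra_gain = histream_plus_speedup / histream_speedup
print(f"{extra_gain:.2f}x")  # prints 1.41x
```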
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00
Comments: Project page: https://sytwu.github.io/BeyondMemo/
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
中文标题/摘要
标题:超越记忆:多模态序数回归基准以揭示视觉语言模型中的流行度偏差
我们揭示了最先进的视觉语言模型(VLMs)中存在显著的流行度偏差:这些模型在著名建筑上的准确率比普通建筑最多高出34%,表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题,我们引入了该任务最大的开放基准:YearGuessr数据集,包含来自157个国家的55,546张建筑图像及其多模态属性,并标注了建造年份(1001-2024)的连续序数标签、GPS数据以及作为流行度代理的页面浏览量。使用该数据集,我们将建造年份预测任务表述为序数回归,并引入了流行度感知的区间准确率指标来量化这种偏差。我们构建的基准涵盖30多个模型(包括我们的YearCLIP模型),证实了VLMs在流行的、被记住的对象上表现出色,但在无法识别的对象上表现明显不佳,揭示了它们推理能力中的关键缺陷。项目页面:https://sytwu.github.io/BeyondMemo/
Summary / 总结
The paper addresses the significant popularity bias in state-of-the-art vision-language models (VLMs), which perform much better on famous buildings than ordinary ones. To systematically investigate this, the authors introduce the YearGuessr dataset, a large multi-modal benchmark with 55,546 building images from 157 countries, annotated with construction years, GPS data, and page-view counts. Using this dataset, they frame the task as ordinal regression and introduce new metrics to quantify the bias. The benchmark shows that VLMs excel on popular items but struggle with unrecognized subjects, highlighting a critical flaw in their reasoning capabilities.
研究揭示了最先进的视觉-语言模型(VLMs)存在显著的流行度偏差:它们在著名建筑上的准确率比普通建筑最多高出34%。为了系统地研究这一问题,研究人员创建了YearGuessr数据集,包含来自157个国家的55,546张建筑图像,并附有多模态属性、建造年份的连续序数标签、GPS数据以及作为流行度代理的页面浏览量。通过将任务表述为序数回归并引入新的度量标准,他们发现VLMs在流行对象上表现出色,但在无法识别的对象上表现明显不佳,这揭示了它们推理能力的一个关键缺陷。
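The idea of a popularity-aware interval accuracy can be sketched as follows. The abstract does not define the metric, so this is one plausible reading: the fraction of predictions within a tolerance of the true construction year, computed separately for popular and ordinary buildings. The function names, the page-view threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict

def interval_accuracy(preds, targets, tolerance):
    """Fraction of predictions within +/- tolerance years of the true year."""
    hits = sum(abs(p - t) <= tolerance for p, t in zip(preds, targets))
    return hits / len(targets)

def accuracy_by_popularity(records, tolerance, threshold):
    """Score 'popular' and 'ordinary' buildings separately.

    records: iterable of (predicted_year, true_year, page_views) tuples.
    threshold: hypothetical page-view cutoff separating the two groups.
    """
    groups = defaultdict(lambda: ([], []))
    for pred, true, views in records:
        key = "popular" if views >= threshold else "ordinary"
        groups[key][0].append(pred)
        groups[key][1].append(true)
    return {k: interval_accuracy(p, t, tolerance) for k, (p, t) in groups.items()}

# Toy data, invented purely for illustration.
records = [
    (1890, 1889, 50_000),  # popular, off by 1 year
    (1700, 1702, 80_000),  # popular, off by 2 years
    (1950, 1920, 120),     # ordinary, off by 30 years
    (1985, 1983, 300),     # ordinary, off by 2 years
]
scores = accuracy_by_popularity(records, tolerance=5, threshold=1_000)
gap = scores["popular"] - scores["ordinary"]  # popularity gap at +/- 5 years
```

A positive `gap` on real data would indicate the kind of bias the benchmark is built to expose.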
Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin
First: 2025-12-24T18:59:51+00:00 · Latest: 2025-12-24T18:59:51+00:00
Abstract
Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
中文标题/摘要
标题:通过量化不确定性优化掩码扩散模型的解码路径
掩码扩散模型(MDMs)提供了灵活的非自回归生成,但这种自由引入了一个挑战:最终输出质量高度依赖于解码顺序。我们首次正式化了这一问题,将输出质量的差异归因于生成路径中累积的预测不确定性。为了量化这种不确定性,我们引入了去噪熵,这是一种可计算的度量标准,作为评估生成过程的内部信号。利用这一度量标准,我们提出了两种旨在优化解码路径的算法:一种事后选择方法和一种实时指导策略。实验表明,我们的熵导向方法显著提高了生成质量,在具有挑战性的推理、规划和代码基准测试中持续提升了准确性。我们的工作确立了去噪熵作为理解并控制生成过程的原理性工具,有效地将MDMs中的不确定性从一种负担转变为发现高质量解决方案的关键优势。
Summary / 总结
The research aims to address the challenge of output quality variability in Masked Diffusion Models (MDMs) due to their flexible decoding order. The authors introduce Denoising Entropy as a metric to quantify predictive uncertainty along generative paths and propose two algorithms to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments show that these entropy-guided methods significantly enhance generation quality, particularly on complex benchmarks involving reasoning, planning, and code generation.
研究旨在解决Masked Diffusion Models (MDMs)由于解码顺序敏感而导致输出质量变化的问题。作者引入了去噪熵作为量化生成路径上预测不确定性的指标,并提出了两种算法:事后选择方法和实时指导策略。实验表明,这些熵导向的方法显著提高了生成质量,特别是在涉及推理、规划和代码生成的复杂基准测试中。
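The post-hoc selection idea can be illustrated with a small sketch. The abstract does not give the definition of Denoising Entropy, so the code below substitutes a simple stand-in: the sum of Shannon entropies of the per-step predictive distributions along a decoding path. All function names and the toy distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def path_entropy(step_distributions):
    """Cumulative predictive uncertainty along one decoding path (stand-in metric)."""
    return sum(entropy(d) for d in step_distributions)

def select_path(paths):
    """Post-hoc selection: index of the candidate path with the lowest cumulative entropy."""
    return min(range(len(paths)), key=lambda i: path_entropy(paths[i]))

# Two toy candidate paths, each a list of per-step token distributions.
uncertain = [[0.5, 0.5], [0.6, 0.4]]
confident = [[0.9, 0.1], [0.8, 0.2]]
best = select_path([uncertain, confident])  # picks the confident path (index 1)
```

A real-time guidance strategy would use the same signal during decoding, for example to choose which position to unmask next, rather than after the fact.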
Streaming Video Instruction Tuning
Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
First: 2025-12-24T18:59:36+00:00 · Latest: 2025-12-24T18:59:36+00:00
Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
中文标题/摘要
标题:流式视频指令调优
我们提出了Streamo,一种实时流式视频LLM,作为通用交互式助手。与现有的专注于问答或字幕的在线视频模型不同,Streamo执行广泛的流式视频任务,包括实时解说、动作理解、事件字幕、时间事件定位和时间敏感的问答。为了开发这种多功能性,我们构建了Streamo-Instruct-465K,一个针对流式视频理解的大规模指令遵循数据集。该数据集涵盖了多种时间上下文和多任务监督,使Streamo能够在异构流式任务中统一训练。通过简化的工作流在指令遵循数据集上端到端训练后,Streamo展示了强大的时间推理、响应式交互和在各种流式基准测试中的广泛泛化能力。广泛的实验表明,Streamo填补了离线视频感知模型与实时多模态助手之间的差距,朝着统一、智能的视频理解在连续视频流中的目标迈出了一步。
Summary / 总结
Streamo is a real-time streaming video LLM designed as a general-purpose interactive assistant. It excels in a wide range of streaming video tasks, including real-time narration, action understanding, and event captioning. To achieve this versatility, the researchers created Streamo-Instruct-465K, a large instruction-following dataset for streaming video understanding. After training, Streamo demonstrates strong temporal reasoning and broad generalization across various streaming benchmarks, bridging the gap between offline video models and real-time multimodal assistants.
Streamo 是一个实时流媒体视频 LLM,旨在作为通用交互式助手。它在实时叙述、动作理解等多种流媒体任务上表现出色。为了实现这种多功能性,研究人员创建了 Streamo-Instruct-465K 数据集,专门用于流媒体视频理解。经过训练后,Streamo 展示出强大的时间推理能力和在各种流媒体基准测试中的广泛泛化能力,填补了离线视频模型与实时多模态助手之间的差距。
Fast SAM2 with Text-Driven Token Pruning
Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen
First: 2025-12-24T18:59:05+00:00 · Latest: 2025-12-24T18:59:05+00:00
Comments: 28 pages, 9 figures
Abstract
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
中文标题/摘要
标题:快速SAM2:基于文本驱动的标记剪枝
Segment Anything Model 2 (SAM2),一种视觉基础模型,在基于提示的视频对象分割方面取得了显著进展,但其实际部署受限于跨时间处理密集视觉标记所带来的高计算和内存成本。SAM2流水线通常会将图像编码器生成的所有视觉标记传入下游的时间推理模块,而不考虑这些标记与目标对象的相关性,记忆注意力的二次开销因此降低了可扩展性。本文介绍了一种文本引导的标记剪枝框架,通过在时间传播之前选择性地减少标记密度来提高推理效率,而不修改底层分割架构。该方法在视觉编码之后、基于记忆的传播之前运行,使用一种轻量级的路由机制对标记进行排序,该机制结合了局部视觉上下文、从以对象为中心的文本描述(用户提供的或自动生成的)中推导出的语义相关性,以及有助于保留模糊或边界关键区域的不确定性线索。通过仅保留对下游处理信息量最大的标记,所提出的方法减少了冗余计算,同时保持了分割精度。在多个具有挑战性的视频分割基准测试中的广泛实验表明,编码器后标记剪枝为实现高效的、提示感知的视频分割提供了一条实用且有效的途径:与未剪枝的基线SAM2相比,推理速度最高提升42.50%,GPU内存占用最高降低37.41%,同时保持有竞争力的J和F性能。这些结果突显了早期标记选择在提升基于Transformer的视频分割系统面向实时和资源受限应用的可扩展性方面的潜力。
Summary / 总结
This work introduces a text-guided token pruning framework to enhance the inference efficiency of Segment Anything Model 2 (SAM2) for prompt-driven video object segmentation. By selectively reducing token density before temporal propagation, the method improves scalability without altering the segmentation architecture. Experiments show that this approach reduces inference time by up to 42.50 percent and GPU memory usage by 37.41 percent while maintaining competitive segmentation performance.
该研究引入了一种文本引导的标记剪枝框架,用于提升Segment Anything Model 2 (SAM2)在提示驱动视频对象分割中的推理效率。该方法在时间传播前选择性地减少标记密度,并使用一个轻量级路由机制,综合局部视觉上下文、语义相关性和不确定性线索对标记进行排序。实验表明,与未剪枝的基线SAM2相比,该方法可实现最高42.50%的推理加速和37.41%的GPU内存占用降低,同时保持有竞争力的J和F性能。
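The ranking-and-pruning step described above can be sketched as a weighted top-k selection. This is only an illustration under stated assumptions: the paper's routing mechanism is a learned module whose details the abstract does not give, so the linear score, the weights, the `keep_ratio` parameter, and the toy scores below are all hypothetical.

```python
import math

def rank_and_prune(tokens, keep_ratio, weights=(0.4, 0.4, 0.2)):
    """Keep the highest-scoring fraction of tokens.

    tokens: list of dicts with 'visual', 'semantic', and 'uncertainty' scores in [0, 1].
    Returns the indices of the kept tokens in their original order, so the
    downstream module still sees them in spatial order.
    """
    w_vis, w_sem, w_unc = weights
    scores = [
        w_vis * t["visual"] + w_sem * t["semantic"] + w_unc * t["uncertainty"]
        for t in tokens
    ]
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Toy scores, invented for illustration.
tokens = [
    {"visual": 0.9, "semantic": 0.9, "uncertainty": 0.1},  # clearly on the target
    {"visual": 0.1, "semantic": 0.1, "uncertainty": 0.1},  # background
    {"visual": 0.2, "semantic": 0.3, "uncertainty": 0.9},  # ambiguous boundary region
    {"visual": 0.1, "semantic": 0.2, "uncertainty": 0.0},  # background
]
kept = rank_and_prune(tokens, keep_ratio=0.5)
```

In this toy case the uncertainty term is what keeps the boundary token (index 2) ahead of the two background tokens.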
TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning
Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras
First: 2025-12-24T18:58:16+00:00 · Latest: 2025-12-24T18:58:16+00:00
Abstract
The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.
中文标题/摘要
标题:TICON:用于组织病理学表示学习的全切片级图块上下文化器
解读大型全切片图像(WSI)中的小图块通常需要更大范围的图像上下文。我们引入了TICON,一种基于Transformer的图块表示上下文化器,能够为计算病理学中的任何应用生成丰富的上下文化嵌入。标准的基于图块编码器的流程提取的是脱离上下文的图块嵌入,无法建模对局部和全局任务都至关重要的丰富全切片级信息。此外,不同的图块编码器擅长不同的下游任务。因此,需要一个统一的模型来对来自任何图块级基础模型的嵌入进行上下文化。TICON通过一个单一的共享编码器满足这一需求,该编码器使用掩码建模目标进行预训练,以同时统一和上下文化来自多种图块级病理基础模型的表示。我们的实验表明,TICON上下文化的嵌入在许多不同任务上显著提升了性能,在图块级基准(即HEST-Bench、THUNDER、CATCH)和全切片级基准(即Patho-Bench)上取得了新的最佳结果。最后,我们在TICON之上预训练一个聚合器,仅使用11K张WSI便构成了一个全切片级基础模型,其表现超越了使用多达350K张WSI预训练的当前最佳全切片级基础模型。
Summary / 总结
TICON is a transformer-based model designed to provide rich, contextualized embeddings for tiles in whole slide images (WSIs), addressing the limitations of tile encoder-based pipelines that strip tiles of their context. TICON unifies and contextualizes embeddings from various tile-level pathology foundation models using a single, shared encoder. Experiments show that TICON improves performance across multiple tasks, setting new state-of-the-art results on both tile-level and slide-level benchmarks. Additionally, an aggregator pretrained on TICON yields a slide-level foundation model that uses only 11K WSIs yet outperforms state-of-the-art slide-level models pretrained with up to 350K WSIs.
TICON 是一种基于Transformer的模型,旨在为全切片图像(WSI)中的图块提供上下文化的嵌入,解决了基于图块编码器的流程缺乏全切片级上下文的局限。它使用单一的共享编码器,对来自各种图块级病理基础模型的嵌入进行统一和上下文化,提升了多个任务的表现,并在图块级和全切片级基准测试中取得了新的最佳结果。此外,在TICON之上预训练的聚合器仅使用11K张WSI便构成了全切片级基础模型,优于使用多达350K张WSI预训练的现有模型。
Parallel Token Prediction for Language Models
Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt
First: 2025-12-24T18:46:55+00:00 · Latest: 2025-12-24T18:46:55+00:00
Comments: Preprint. Under review
Abstract
We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
中文标题/摘要
标题:语言模型中的并行令牌预测
我们提出了并行令牌预测(PTP),这是一种用于语言模型并行序列生成的通用框架。PTP 通过将采样过程纳入模型,在单次Transformer调用中联合预测多个相互依赖的令牌,从而缓解了自回归解码的延迟瓶颈,并避免了现有多令牌预测方法中常见的限制性独立性假设。我们证明PTP 可以表示任意自回归序列分布。PTP 可以通过蒸馏现有模型进行训练,也可以在没有教师模型的情况下通过逆自回归训练进行训练。实验上,我们在 Spec-Bench 上每步接受超过四个令牌,在 Vicuna-7B 上实现了最先进的推测解码性能。我们框架的通用性表明,在不损失建模能力的情况下,长序列的并行生成是可行的。
Summary / 总结
The research proposes Parallel Token Prediction (PTP), a framework that jointly predicts multiple dependent tokens in a single transformer call to reduce the latency of autoregressive decoding. PTP incorporates the sampling procedure into the model, avoiding restrictive independence assumptions. Experiments show that PTP achieves state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench, indicating its potential for parallel generation of long sequences without loss of modeling power.
研究提出了并行令牌预测(PTP),这是一种用于语言模型并行序列生成的框架,能够在单次Transformer调用中联合预测多个相互依赖的令牌。这种方法降低了自回归解码的延迟,并避免了现有多令牌预测方法中的限制性独立性假设。实验表明,PTP 在 Vicuna-7B 上实现了最先进的推测解码性能,在 Spec-Bench 上每步接受超过四个令牌,表明并行生成长序列是可行的且不会损失建模能力。
Variationally correct operator learning: Reduced basis neural operator with a posteriori error estimation
Authors: Yuan Qiu, Wolfgang Dahmen, Peng Chen
First: 2025-12-24T18:37:59+00:00 · Latest: 2025-12-24T18:37:59+00:00
Abstract
Minimizing PDE-residual losses is a common strategy to promote physical consistency in neural operators. However, standard formulations often lack variational correctness, meaning that small residuals do not guarantee small solution errors due to the use of non-compliant norms or ad hoc penalty terms for boundary conditions. This work develops a variationally correct operator learning framework by constructing first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms. We demonstrate this framework on stationary diffusion and linear elasticity, incorporating mixed Dirichlet-Neumann boundary conditions via variational lifts to preserve norm equivalence without inconsistent penalties. To ensure the function space conformity required by the FOSLS loss, we propose a Reduced Basis Neural Operator (RBNO). The RBNO predicts coefficients for a pre-computed, conforming reduced basis, thereby ensuring variational stability by design while enabling efficient training. We provide a rigorous convergence analysis that bounds the total error by the sum of finite element discretization bias, reduced basis truncation error, neural network approximation error, and statistical estimation errors arising from finite sampling and optimization. Numerical benchmarks validate these theoretical bounds and demonstrate that the proposed approach achieves superior accuracy in PDE-compliant norms compared to standard baselines, while the residual loss serves as a reliable, computable a posteriori error estimator.
中文标题/摘要
标题:变分正确的算子学习:带后验误差估计的缩减基神经算子
最小化PDE残差损失是促进神经算子物理一致性的常用策略。然而,标准形式通常缺乏变分正确性:由于使用了不相容的范数或针对边界条件的临时罚项,小的残差并不保证小的解误差。本文通过构建一阶系统最小二乘(FOSLS)目标来建立一个变分正确的算子学习框架,这些目标的值可证明等价于PDE诱导范数下的解误差。我们在稳态扩散和线性弹性问题上展示了该框架,并通过变分提升纳入混合Dirichlet-Neumann边界条件,从而在不引入不一致罚项的情况下保持范数等价性。为了确保FOSLS损失所需的函数空间协调性,我们提出了缩减基神经算子(RBNO)。RBNO预测一组预先计算的协调缩减基的系数,从而在设计上保证变分稳定性,同时实现高效训练。我们给出了严格的收敛性分析,以有限元离散偏差、缩减基截断误差、神经网络逼近误差以及由有限采样和优化引起的统计估计误差之和来界定总误差。数值基准验证了这些理论界,并表明与标准基线相比,所提出的方法在与PDE相容的范数下取得了更高的精度,同时残差损失可作为可靠的、可计算的后验误差估计器。
Summary / 总结
This work addresses the issue of variational correctness in neural operators by developing a variationally correct framework using first-order system least-squares (FOSLS) objectives. The framework ensures that small residuals correspond to small solution errors by preserving norm equivalence. A Reduced Basis Neural Operator (RBNO) is proposed to predict coefficients for a pre-computed reduced basis, ensuring variational stability and efficient training. Theoretical analysis and numerical benchmarks show that the approach achieves higher accuracy in PDE-compliant norms and that the residual loss can serve as a reliable a posteriori error estimator.
本文通过构建一阶系统最小二乘(FOSLS)目标,建立了一个变分正确的算子学习框架,使目标值等价于PDE诱导范数下的解误差。该方法提出缩减基神经算子(RBNO),用于预测一组预先计算的协调缩减基的系数,从而保证变分稳定性并实现高效训练。文中给出了严格的收敛性分析,数值基准测试表明,与标准基线相比,该方法在与PDE相容的范数下精度更高,同时残差损失可作为可靠的后验误差估计器。
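For orientation, a textbook first-order system least-squares formulation of the stationary diffusion problem can be written as follows. This is a sketch of the standard FOSLS construction with homogeneous Dirichlet data, not necessarily the exact objective used in the paper; the symbols $a$, $f$, $u$, $\sigma$ and the constants $c$, $C$ are notation chosen here.

```latex
% Second-order problem: -\nabla\cdot(a\nabla u) = f in \Omega, u = 0 on \partial\Omega.
% Introducing the flux \sigma = -a\nabla u gives a first-order system:
\sigma + a\nabla u = 0, \qquad \nabla\cdot\sigma = f \quad \text{in } \Omega .

% Least-squares objective over (u,\sigma) \in H^1_0(\Omega)\times H(\mathrm{div};\Omega):
J(u,\sigma) = \|\sigma + a\nabla u\|_{L^2(\Omega)}^2
            + \|\nabla\cdot\sigma - f\|_{L^2(\Omega)}^2 .

% Norm equivalence with the exact solution (u^*,\sigma^*), for uniformly elliptic a:
c\,\bigl(\|u-u^*\|_{H^1}^2 + \|\sigma-\sigma^*\|_{H(\mathrm{div})}^2\bigr)
\;\le\; J(u,\sigma) \;\le\;
C\,\bigl(\|u-u^*\|_{H^1}^2 + \|\sigma-\sigma^*\|_{H(\mathrm{div})}^2\bigr).
```

A two-sided bound of this kind is what allows the value of the loss itself to be read as a computable a posteriori error estimate.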
Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks
Authors: Roy Turgeman, Tom Tirer
First: 2025-12-24T18:21:01+00:00 · Latest: 2025-12-24T18:21:01+00:00
Abstract
The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
中文标题/摘要
标题:数据处理不等式反映实践吗?低级任务的有效性探究
数据处理不等式是信息论原理,表明通过处理观测值无法增加信号的信息量。特别是,它表明在解决分类问题之前增强信号或对其进行编码是没有益处的。这一断言可以证明在最优贝叶斯分类器的情况下是正确的。然而,在实践中,尽管现代深度神经网络具有强大的能力,但在“高级”下游任务之前通常会执行“低级”任务。在本文中,我们旨在理解何时以及为什么低级处理对分类有益。我们对二分类设置进行了全面的理论研究,考虑了一个与最优贝叶斯分类器紧密相连的分类器,并随着训练样本数量的增加而收敛于该分类器。我们证明,在任何有限数量的训练样本下,都存在一种预分类处理可以提高分类准确性。我们还探讨了类别分离、训练集大小和类别平衡对这种程序相对收益的影响。我们通过理论设置的经验研究支持了我们的理论。最后,我们进行了一项经验研究,调查了去噪和编码对基准数据集上实用深度分类器性能的影响。具体来说,我们改变了训练集的大小和类别分布以及噪声水平,并展示了与理论结果一致的趋势。
Summary / 总结
This paper asks when and why low-level processing, such as denoising or encoding, helps classification, even though the data processing inequality implies no benefit for the optimal Bayes classifier. In a binary classification setup, with a classifier that converges to the Bayes classifier as the training set grows, the authors prove that for any finite number of training samples there exists a pre-classification processing step that improves accuracy. They also analyze how class separation, training set size, and class balance affect the relative gain. Experiments on the theoretical setup and on practical deep classifiers trained on benchmark datasets show trends consistent with the theory.
本文研究低级处理(如去噪或编码)何时以及为何有益于分类,尽管数据处理不等式表明这类处理对最优贝叶斯分类器没有益处。在二分类设置中,作者证明对于任何有限数量的训练样本,都存在一种能够提高分类准确率的预分类处理,并分析了类别分离、训练集大小和类别平衡对相对收益的影响。在理论设置以及基准数据集上实用深度分类器的实验均显示出与理论一致的趋势。
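For reference, the standard statement of the data processing inequality in the classification setting described above. The symbols $Y$ (class label), $X$ (observation), and $T$ (any processing applied to $X$, such as denoising or encoding) are notation chosen here, not taken from the paper.

```latex
% If the label, the observation, and the processed observation form a Markov chain,
% processing cannot increase the mutual information with the label:
Y \;\to\; X \;\to\; T(X)
\quad\Longrightarrow\quad
I\bigl(Y;\, T(X)\bigr) \;\le\; I(Y;\, X).
```

The inequality bounds information content; it says nothing directly about a classifier trained on a finite sample, which is the regime the paper analyzes.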
Learning to Solve PDEs on Neural Shape Representations
Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra
First: 2025-12-24T18:14:02+00:00 · Latest: 2025-12-24T18:14:02+00:00
Comments: Article webpage link: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/
Abstract
Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.
中文标题/摘要
标题:在神经形状表示上学习求解偏微分方程
在形状上求解偏微分方程(PDEs)是许多形状分析和工程任务的基础;然而,主流的PDE求解器基于多边形/三角形网格,而现代3D资产越来越多地以神经表示形式存在。这种不匹配使得没有合适的方法可以直接在神经域内求解曲面PDEs,只能进行显式的网格提取或逐实例残差训练,阻碍了端到端的工作流程。我们提出了一种新颖的无网格公式,该公式学习一个以神经(局部)形状属性为条件的局部更新算子,使曲面PDEs可以直接在(神经)数据所在之处求解。该算子可自然地与常见的神经曲面表示相结合,只需在单个代表性形状上训练一次,即可泛化到不同的形状和拓扑,从而在无需显式网格化或逐实例优化的情况下实现准确、快速的推理,同时保持可微性。在解析基准(球面上的热方程和泊松求解)和采用不同表示的真实神经资产上,我们的方法略优于CPM,同时与FEM保持较为接近的水平,并且据我们所知,这是第一个能够同时在神经和经典曲面表示上求解曲面PDEs的端到端管道。代码将在论文被接收后发布。
Summary / 总结
The research addresses the challenge of solving partial differential equations (PDEs) on neural shape representations, which are increasingly used for 3D assets. It introduces a mesh-free formulation that learns a local update operator conditioned on neural shape attributes, allowing surface PDEs to be solved directly within the neural domain. The operator is trained once on a single representative shape and generalizes across shape and topology variations, providing accurate and fast inference without explicit meshing or per-instance optimization. Experiments show that the method slightly outperforms CPM while remaining reasonably close to FEM, and, to the authors' knowledge, it is the first end-to-end pipeline for solving surface PDEs on both neural and classical surface representations.
研究解决了在神经形状表示上求解偏微分方程(PDEs)的问题,这些表示在3D资产中越来越常用。提出了一种无网格公式,该公式学习以神经形状属性为条件的局部更新算子,使曲面PDEs可以直接在神经域内求解。该方法可与神经曲面表示相结合,只需在单个代表性形状上训练一次,即可泛化到不同的形状和拓扑,实现准确且快速的推理,无需显式网格化或逐实例优化。实验表明,该方法略优于CPM,并与FEM保持较为接近的水平;据作者所知,这是首个同时适用于神经和经典曲面表示的曲面PDEs端到端求解管道。
Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning
Authors: Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong
Venue: NeurIPS 2025
First: 2021-10-07T03:14:46+00:00 · Latest: 2025-12-24T17:53:45+00:00
Comments: NeurIPS 2025; Previous Version in ICML Workshop: Exploration in AI Today (EXAIT) 2025
Abstract
The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify the significance of this uncertainty-aware regularization from distributional RL on the empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.
中文标题/摘要
标题:分类分布损失的内在优势:强化学习中不确定性感知的正则化探索
分布强化学习(RL)的卓越实证性能使人们越来越关注其相对于经典RL的理论优势。通过分解分布RL中常用的分类分布损失,我们发现分布RL的潜在优势可归因于一种推导出的分布匹配熵正则化。这种较少被研究的熵正则化旨在捕捉回报分布中除期望值之外的额外知识,从而在策略优化中提供增强的奖励信号。MaxEnt RL中的普通熵正则化通过促进多样化的动作显式地鼓励探索;与之不同,从分类分布损失中推导出的新型熵正则化隐式地更新策略,使学习到的策略与(估计的)环境不确定性相一致。最后,广泛的实验验证了这种来自分布RL的不确定性感知正则化对于其相对经典RL的实证优势的重要性。我们的研究为解释分布学习在RL中的内在优势提供了创新的探索视角。
Summary / 总结
This paper investigates the theoretical advantages of distributional reinforcement learning (RL) over classical RL by decomposing the categorical distributional loss. It identifies an entropy regularization term that captures the return distribution beyond its expectation, leading to an augmented reward signal. Unlike vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration, this new regularization implicitly aligns the policy with environmental uncertainty. Experiments confirm the importance of this uncertainty-aware regularization in enhancing the empirical performance of distributional RL.
该研究通过分解分类分布损失,探讨了分布强化学习(RL)相对于经典RL的理论优势。它发现了一个熵正则化项,能够捕捉回报分布中除期望值之外的信息,从而增强奖励信号。不同于MaxEnt RL中显式鼓励探索的普通熵正则化,这种新正则化项隐式地使策略与环境不确定性对齐。实验验证了这种不确定性感知正则化在提升分布RL实证性能方面的重要性。
AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng
First: 2025-12-24T17:40:42+00:00 · Latest: 2025-12-24T17:40:42+00:00
Comments: 23 pages, 13 figures, 8 tables
Abstract
Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
中文标题/摘要
标题:AndroidLens: 针对Android GUI代理的嵌套子目标长延迟评估
图形用户界面(GUI)代理可以通过自动化移动设备上频繁执行的长延迟任务来显著提高生产力。然而,现有的评估基准仍然局限于有限的应用程序、简单的任务和粗粒度的指标。为了解决这一问题,我们引入了AndroidLens,这是一个针对移动GUI代理的具有挑战性的评估框架,包含571个长延迟任务,涵盖中文和英文环境,每个任务平均需要超过26步才能完成。该框架的特点包括:(1) 来自38个领域的真实世界用户场景的任务,涵盖多种复杂类型,如多约束、多目标和领域特定任务;(2) 静态评估保留了真实世界的异常情况,并允许多条有效路径以减少偏差;(3) 动态评估采用基于里程碑的方案,通过平均任务进度(ATP)进行细粒度的进度测量。我们的评估表明,即使是最优秀的模型也只能达到12.7%的任务成功率和50.47%的ATP。我们还强调了真实世界环境中的关键挑战,包括环境异常、自适应探索和长期记忆保留。
Summary / 总结
AndroidLens is a challenging evaluation framework for mobile GUI agents, featuring 571 long-latency tasks across 38 domains in Chinese and English environments, each requiring more than 26 steps on average. It combines static evaluation, which allows multiple valid paths, with dynamic milestone-based evaluation that measures fine-grained progress. The evaluation shows that even the best models achieve only 12.7% task success and 50.47% Average Task Progress (ATP). Key challenges include environmental anomalies, adaptive exploration, and long-term memory retention.
研究引入了AndroidLens,一个全面的移动GUI代理评估框架,解决了现有基准的局限性。它包含571个跨中英文的长延迟任务,每个任务平均需要超过26步。框架包括真实世界用户场景、静态和动态评估以及基于里程碑的进度测量方案。关键发现表明,即使最好的模型也只能实现12.7%的任务成功率和50.47%的平均任务进度(ATP)。面临的挑战包括环境异常、自适应探索和长期记忆保持。
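As a rough illustration of milestone-based progress measurement, the sketch below averages per-task milestone completion. The abstract does not state the exact ATP formula, so the definition used here (fraction of milestones reached per task, averaged over tasks) is an assumption, and the function names and sample results are hypothetical.

```python
def task_progress(milestones_reached, milestones_total):
    """Fraction of a task's milestones that the agent completed (assumed definition)."""
    if milestones_total <= 0:
        raise ValueError("a task needs at least one milestone")
    return milestones_reached / milestones_total


def average_task_progress(results):
    """Mean per-task progress over (reached, total) pairs."""
    return sum(task_progress(r, t) for r, t in results) / len(results)


def success_rate(results):
    """Share of tasks in which every milestone was reached."""
    return sum(1 for r, t in results if r == t) / len(results)


# Hypothetical results for four tasks: (milestones reached, milestones total).
results = [(4, 4), (2, 4), (1, 5), (0, 3)]
print(average_task_progress(results))
print(success_rate(results))
```

Under this definition a partially completed task still contributes to the progress score, which is why progress can be far higher than the all-or-nothing success rate.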
Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering
Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H. M. Abou-El-Enien
First: 2025-12-24T17:39:37+00:00 · Latest: 2025-12-24T17:39:37+00:00
Abstract
Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study utilized Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. The authors then developed a novel, reaction-first evolutionary metaheuristic algorithm, together with multi-objective optimization programming, that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.
中文标题/摘要
标题:基于转录组的个性化从头药物生成用于AML:使用元启发式组装和靶向筛选
急性髓系白血病(AML)由于其极端的分子异质性和高复发率,仍然是临床挑战。尽管精准医疗引入了针对突变的治疗方法,但许多患者仍然缺乏有效的个性化选择。本文提出了一种全新的端到端计算框架,将患者特异性转录组学与从头药物发现联系起来。通过分析TCGA-LAML队列的大规模RNA测序数据,研究利用加权基因共表达网络分析(WGCNA)优先筛选出20个高价值生物标志物,包括代谢转运蛋白如HK3和免疫调节受体如SIGLEC9。这些靶点的物理结构使用AlphaFold3建模,并通过DOGSiteScorer引擎定量映射可成药热点。开发了一种新的反应优先进化元启发式算法以及多目标优化编程,从片段库中组装新型配体,由这些识别的热点的空间对齐引导。生成模型产生了结构上独特的化学实体,药效团空间偏向性明显,QED评分峰值在0.5到0.7之间。通过ADMET表型分析和SwissDock分子对接验证,识别出高置信度候选物,如配体L1,其与A08A96生物标志物的结合自由能为-6.571 kcal/mol。这些结果表明,将系统生物学与元启发式分子组装相结合可以产生具有药理可行性的个性化先导化合物,为AML和其他癌症的精准肿瘤学提供可扩展的蓝图
Summary / 总结
This study addresses the challenge of personalized drug discovery for Acute Myeloid Leukemia (AML) by integrating patient-specific transcriptomics with de novo drug generation. The research used WGCNA to identify 20 key biomarkers, including HK3 and SIGLEC9, AlphaFold3 to model their structures, and DOGSiteScorer to map druggable hotspots. A novel metaheuristic algorithm and multi-objective optimization were then applied to assemble novel ligands from fragment libraries, guided by spatial alignment to these hotspots. The generated chemical entities showed drug-like properties, with QED scores peaking between 0.5 and 0.7, and ADMET profiling and SwissDock docking identified high-confidence candidates such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker.
该研究通过将患者特异性转录组学与从头药物生成相结合,解决了急性髓系白血病(AML)的个性化药物发现挑战。研究使用加权基因共表达网络分析识别了20个关键生物标志物,包括HK3和SIGLEC9,并使用AlphaFold3和DOGSiteScorer来建模其结构并映射可成药热点。开发了一种新的元启发式算法,从片段库中组装新型配体,由这些热点引导。生成的化学实体具有药物样特性,QED分数在0.5到0.7之间,通过ADMET表型分析和分子对接验证,识别出具有强结合亲和力的高信心候选物。
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks
Authors: Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane
Venue: NeurIPS 2025
First: 2025-06-06T19:29:13+00:00 · Latest: 2025-12-24T17:26:35+00:00
Comments: 40 pages, 8 figures, NeurIPS 2025
Abstract
What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
中文标题/摘要
标题:交替梯度流:两层神经网络特征学习的理论
神经网络学习哪些特征以及如何学习仍然是一个开放的问题。本文引入了交替梯度流(AGF)算法框架,描述了从小型初始化训练的两层网络中特征学习的动力学。先前的研究表明,在这种情况下,梯度流表现出阶梯状的损失曲线,交替在神经元缓慢对齐到有用方向的平台期和神经元迅速增长的急剧下降期。AGF 将这种行为近似为交替的两步过程:在休眠神经元上最大化一个效用函数,在活跃神经元上最小化一个成本函数。AGF 从所有神经元都处于休眠状态开始。在每次迭代中,一个休眠的神经元激活,触发特征的获取和损失的下降。AGF 定量描述了这些下降的顺序、时间和幅度,与多个常用架构的实验结果相符。我们证明了 AGF 统一并扩展了全连接线性网络和仅注意力线性变换器中已有的鞍点到鞍点分析,其中学习的特征分别是奇异模式和主成分。在对角线线性网络中,我们证明 AGF 在初始化趋于零的极限下收敛到梯度流。将 AGF 应用于训练以执行模块加法的二次网络,我们首次完整地描述了训练动力学,揭示了网络按系数大小递减的顺序学习傅里叶特征。总体而言,AGF 为理解神经网络中的特征学习提供了一个有希望的步骤。
Summary / 总结
This paper introduces Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer neural networks trained from small initialization. AGF approximates the behavior of gradient flow as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. The paper demonstrates that AGF matches experimental results across various architectures and unifies existing saddle-to-saddle analyses in linear networks and transformers. Key findings include the convergence of AGF to gradient flow in diagonal linear networks and the complete characterization of training dynamics in quadratic networks performing modular addition, revealing the learning of Fourier features in decreasing order of coefficient magnitude.
论文提出了交替梯度流(AGF)算法框架,描述了从小初始化训练的两层神经网络中的特征学习动态。AGF将梯度流的行为近似为交替的两步过程:在不活跃神经元上最大化一个效用函数,在活跃神经元上最小化一个成本函数。该框架量化了特征获取的顺序、时间和幅度,与各种架构的实验结果相符。AGF统一并扩展了线性网络和变压器中的鞍点到鞍点分析,并证明在对角线线性网络中,AGF在初始化趋于零时收敛到梯度流。在二次网络中,AGF刻画了训练动态,揭示了网络按系数大小递减顺序学习傅里叶特征。
Model Merging via Multi-Teacher Knowledge Distillation
Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
First: 2025-12-24T17:10:44+00:00 · Latest: 2025-12-24T17:10:44+00:00
Abstract
Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at https://github.com/arshandalili/SAMerging.
中文标题/摘要
标题:多教师知识蒸馏下的模型合并
模型合并已作为一种轻量级替代方案出现,以应对联合多任务学习(MTL),但合并模型的泛化特性仍鲜有研究。建立此类理论保证并不简单,因为合并过程通常禁止访问原始训练数据,并涉及结合在根本上异质数据分布下训练的微调模型。在缺乏这些动态的原理性理解时,当前方法往往依赖于启发式方法来近似参数的最佳组合。这种方法在系数缩放中最为关键,即调节每个微调模型对共享参数贡献大小的权重因子。然而,由于缺乏指导其选择的原理性目标,这些方法会导致脆弱的性能,并且高度依赖于缩放初始化。我们通过(i) 建立一种新的基于平滑度的PAC-Bayes泛化界,专门针对模型合并设置。此分析引入了一个“跨任务异质性”项,正式捕捉了多种微调模型先验与目标多任务分布之间的不匹配。受此理论洞察的指导,(ii) 我们将模型合并视为在稀缺未标记数据上的多教师知识蒸馏。我们正式证明,最小化学生-教师Kullback-Leibler散度直接收紧了合并模型超额风险的上界。受基于平滑度的界推导的指导,(iii) 我们通过SAMerging方法实现这一目标,该方法使用尖锐度感知最小化(SAM)来寻找平滑的极小值。实验中,SAMerging在视觉和自然语言处理基准测试中建立了新的最佳状态,实现了卓越的性能。代码可在https://github.com/arshandalili/SAMerging/ 获取。
Summary / 总结
The paper addresses the challenge of model merging, a lightweight alternative to joint multi-task learning, by establishing a theoretical generalization bound and framing model merging as multi-teacher knowledge distillation. The authors introduce a flatness-aware PAC-Bayes bound that captures the mismatch between fine-tuned model priors and target distributions, and propose SAMerging, which uses Sharpness-Aware Minimization to find flat minima, leading to improved performance across vision and NLP benchmarks.
该论文通过建立理论泛化界并将模型合并视为多教师知识蒸馏来解决模型合并的挑战。作者提出了一种名为SAMerging的方法,该方法使用尖锐度感知最小化来寻找平坦的最小值,并在视觉和NLP基准测试中展示了其有效性,达到了最先进的性能。
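The combination of a student-teacher KL objective with Sharpness-Aware Minimization can be illustrated on a toy problem. The sketch below is not the authors' SAMerging implementation: the student is just a coefficient-weighted sum of two hypothetical teacher logit vectors, gradients are taken by finite differences, and `rho`, `lr`, and all names are illustrative assumptions. Only the two-stage SAM update (perturb toward higher loss, then descend using the gradient at the perturbed point) follows the standard SAM recipe.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy setup (hypothetical): two "teachers", each giving logits on one unlabeled input.
TEACHER_LOGITS = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]


def distill_loss(alpha):
    """Mean KL from each teacher to a student whose logits are an alpha-weighted sum."""
    student = softmax([sum(a * t[j] for a, t in zip(alpha, TEACHER_LOGITS))
                       for j in range(3)])
    return sum(kl(softmax(t), student) for t in TEACHER_LOGITS) / len(TEACHER_LOGITS)


def num_grad(f, w, h=1e-6):
    """Central finite-difference gradient, adequate for a toy problem."""
    grad = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def sam_step(f, w, lr=0.1, rho=0.05):
    """One SAM update: move to a nearby higher-loss point, then descend from w."""
    g = num_grad(f, w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = num_grad(f, w_adv)  # gradient evaluated at the perturbed weights
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


alpha = [1.0, 1.0]  # initial merging coefficients (illustrative)
start = distill_loss(alpha)
for _ in range(200):
    alpha = sam_step(distill_loss, alpha)
print(start, distill_loss(alpha))
```

In a real merging setup the student would be a network built from the fine-tuned checkpoints and the loss would be averaged over a batch of unlabeled inputs; the toy version only shows how the two pieces fit together.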
Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction
Authors: Suren Bandara
First: 2025-12-24T17:10:37+00:00 · Latest: 2025-12-24T17:10:37+00:00
Abstract
Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
中文标题/摘要
标题:基于掩膜后处理的表格分割结构坐标提取
从表格中提取结构化数据在扫描文档和数字档案的文档图像分析中起着关键作用。尽管已经提出了许多方法来检测表格结构并提取单元格内容,但在低分辨率或噪声图像中准确识别表格段边界(行和列)仍然具有挑战性。在许多实际场景中,表格数据不完整或退化,限制了基于变换器的方法对噪声输入的适应性。基于掩膜的边缘检测技术在这些条件下表现出更大的鲁棒性,因为它们的灵敏度可以通过阈值调整进行调整;然而,现有方法通常直接将掩膜应用于图像,导致噪声敏感性、分辨率损失或高计算成本。本文提出了一种新的多尺度信号处理方法,用于从表格掩膜检测表格边缘。行和列转换被建模为一维信号,并使用逐渐增加方差的高斯卷积进行处理,随后使用统计阈值抑制噪声同时保留稳定的结构边缘。检测到的信号峰值被映射回图像坐标以获得准确的段边界。实验结果表明,将所提出的方法应用于列边缘检测,可以将基于布局感知的度量Cell-Aware Segmentation Accuracy (CASA)从PubLayNet-1M基准上的67%提高到76%,该度量评估文本正确性和正确单元格放置。该方法通过零填充和缩放策略对分辨率变化具有鲁棒性,并生成优化的结构化表格输出,适合下游分析。
Summary / 总结
This paper addresses the challenge of accurately identifying table segment boundaries in low-resolution or noisy images, which is crucial for structured data extraction from tables. It proposes a multi-scale signal-processing method that models row and column transitions as one-dimensional signals, processed using Gaussian convolution and statistical thresholding to detect stable structural edges. The method improves the Cell-Aware Segmentation Accuracy (CASA) from 67% to 76% on the PubLayNet-1M benchmark when used with TableNet and PyTesseract OCR, demonstrating its robustness to resolution variations and suitability for downstream analysis.
本文针对低分辨率或噪声图像中准确识别表格边界的问题,提出了一种多尺度信号处理方法,将行和列的过渡视为一维信号,并使用具有递增方差的高斯卷积进行处理,随后通过统计阈值抑制噪声并保留稳定的结构边缘。该方法在使用TableNet和PyTesseract OCR的PubLayNet-1M基准上将Cell-Aware Segmentation Accuracy (CASA) 从67%提高到76%,展示了其在处理噪声输入时的鲁棒性和有效性。
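The pipeline described above (1D transition signal, multi-scale Gaussian smoothing, statistical thresholding, peak mapping) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the transition signal is taken as the absolute difference of the column-wise mask sum, the threshold is assumed to be mean plus `k` standard deviations, and an edge is kept only if a peak appears near it at every scale. All function names and parameter values are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, sigma):
    """Convolve with a Gaussian, treating samples beyond the ends as zero."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def peaks_above_threshold(signal, k=1.0):
    """Local maxima exceeding mean + k * std (the assumed statistical threshold)."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((s - mean) ** 2 for s in signal) / len(signal))
    cut = mean + k * std
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > cut and signal[i] >= signal[i - 1] and signal[i] > signal[i + 1]]


def column_edges(mask, sigmas=(1.0, 2.0), tol=2):
    """x-coordinates of column boundaries that persist across all smoothing scales."""
    width = len(mask[0])
    profile = [sum(row[x] for row in mask) for x in range(width)]
    transitions = [0.0] + [abs(profile[x] - profile[x - 1]) for x in range(1, width)]
    per_scale = [peaks_above_threshold(smooth(transitions, s)) for s in sigmas]
    return [p for p in per_scale[0]
            if all(any(abs(p - q) <= tol for q in other) for other in per_scale[1:])]


# Toy binary column mask: 10 rows, 40 columns, two table columns at x in [5, 15) and [25, 35).
mask = [[1 if 5 <= x < 15 or 25 <= x < 35 else 0 for x in range(40)] for _ in range(10)]
edges = column_edges(mask)
print(edges)
```

Because the peak indices are positions in the 1D signal, they map directly back to image x-coordinates; a rescaled mask would need the indices multiplied by the scale factor.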
Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential
Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan
First: 2025-12-24T17:05:09+00:00 · Latest: 2025-12-24T17:05:09+00:00
Abstract
Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8x. Notably, it delivers over 20x acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
中文标题/摘要
标题:使用尖峰驱动视频变换器的手术场景分割及其实时潜力
现代手术系统越来越多地依赖智能场景理解以提供及时的情境感知,从而增强术中安全性。在此流程中,手术场景分割在准确感知手术事件方面发挥着核心作用。尽管最近的深度学习模型,尤其是大规模基础模型,实现了显著的分割准确性,但它们巨大的计算需求和高能耗阻碍了在资源受限的手术环境中进行实时部署。为解决这一限制,我们探索了新兴的SNN作为高效手术智能的有前途范式。然而,其性能仍受到手术标注数据稀缺和手术视频表示固有的稀疏性限制。为此,我们提出了SpikeSurgSeg,这是首个针对手术场景分割的尖峰驱动视频变换器框架,具有在非GPU平台上实现实时潜力的潜力。为解决手术标注数据有限的问题,我们引入了一种针对SNN的手术场景掩码自编码预训练策略,通过逐层管状掩码实现稳健的空间-时间表示学习。基于此预训练骨干,我们进一步采用一种轻量级的尖峰驱动分割头,产生时间一致的预测,同时保持SNN的低延迟特性。在EndoVis18和我们内部的SurgBleed数据集上的广泛实验表明,SpikeSurgSeg在推断延迟上至少减少了8倍,同时其mIoU与最先进的基于ANN的模型相当。值得注意的是,它相对于大多数基础模型基线的加速比超过20倍,突显了其在时间关键型手术场景分割中的潜力。
Summary / 总结
The research aims to develop a real-time surgical scene segmentation model for enhanced intra-operative safety. It introduces SpikeSurgSeg, a spike-driven video Transformer framework, which uses a masked autoencoding pretraining strategy to learn robust spatiotemporal representations. The model achieves mean intersection over union (mIoU) comparable to state-of-the-art ANN-based models while reducing inference latency by at least 8 times and offering over 20 times acceleration compared to most foundation-model baselines.
研究旨在通过利用尖峰驱动的视频Transformer框架来提高实时手术场景分割的性能,解决传统深度学习模型在计算需求和功耗方面的限制。提出的SpikeSurgSeg框架采用手术场景掩码自编码预训练策略和轻量级尖峰驱动分割头,实现了与当前最佳模型相当的准确性,同时显著减少了推理延迟,并将推理加速了20多倍,相比大多数基础模型基线。
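Tube masking, as commonly used in masked video autoencoding, hides the same spatial patches in every frame so that a masked patch cannot be recovered by copying it from a neighboring frame. The sketch below shows only that basic idea; the "layer-wise" aspect mentioned in the abstract is not specified there and is not modeled. The function name, mask ratio, and seed are illustrative assumptions.

```python
import random


def tube_mask(num_frames, num_patches, mask_ratio, seed=0):
    """Boolean mask of shape [num_frames][num_patches]; True marks a masked patch.

    The same spatial patches are masked in every frame, forming "tubes" through time.
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * num_patches))
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    return [list(frame_mask) for _ in range(num_frames)]


# Hypothetical clip: 4 frames, 16 patches per frame, 75% of patches masked.
mask = tube_mask(num_frames=4, num_patches=16, mask_ratio=0.75)
print(sum(mask[0]), "of", len(mask[0]), "patches masked per frame")
```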
SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance
Authors: Divij Dudeja, Mayukha Pal
First: 2025-12-24T16:59:04+00:00 · Latest: 2025-12-24T16:59:04+00:00
Abstract
Users of engineering manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documentation, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware fact extractor (Grammarian), a Tree-LSTM that extracts facts as subject-relation-object triples from EM sentences; (2) a compact indexed memory, a MANN (Memory-Augmented Neural Network), that indexes these subject-relation-object triples as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the lowest processing requirements. SMART employs two inference modes: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAG for new uploads (FAISS top-20 results, with memory capped at 64 slots). In real-world deployment, this framework yields better-supported results with fewer hallucinations than comparable small transformer models.
中文标题/摘要
标题:SMART SLM:结构化记忆与推理变换器,一种用于准确文档辅助的小型语言模型
工程手册(EM)的用户发现阅读EMs很困难,因为它们很长,格式密集,包含书面文档、逐步程序和工程设备的标准参数列表。现成的变换器,尤其是紧凑型的,将这些材料视为一个扁平的标记流。这种方法导致模型自信但错误的答案,并迫使模型以低效的方式记忆单独的事实。SMART(结构化记忆与推理变换器)为上述问题提供了一种不同的且实用的解决方案。SMART通过分层方法组织其处理过程,并基于三个主要工作类别:(1)语法意识事实提取器(语法学家)树LSTM,从EM句子中提取主语关系对象关系的事实;(2)紧凑索引记忆MANN(记忆增强神经网络),将这些理性主语关系对象作为384维向量索引,与信息来源相关联;(3)6层变换器,学习将之前检索到的事实融合到其生成的响应中。整个SMART模型使用45.51M参数,比GPT-2(124M)少64%,比BERT(133M)少69%,并且准确率比GPT-2高21.3%,表明SMART以最少的处理要求更好地适应数据。SMART采用双模式推理,已知文档的索引快速路径(亚秒级答案时间)和新上传文件的索引动态路径(借助RAGs的FAISS前20结果,记忆限制在64个槽位)。在实际部署中,该框架比可比的小型变换器模型产生更支持的结果,减少了幻觉。
Summary / 总结
The paper addresses the challenge of accurately processing Engineering Manuals (EM) by proposing SMART (Structured Memory and Reasoning Transformer), which uses a hierarchical approach to extract facts from EM sentences, store them in a compact indexed memory, and generate accurate responses by fusing these facts. SMART achieves 21.3% higher accuracy than GPT-2 with fewer parameters, demonstrating its effectiveness in handling complex document assistance tasks with reduced processing requirements.
论文旨在解决使用小型语言模型准确处理工程手册(EM)的挑战。提出了SMART(结构化记忆和推理变换器),采用分层方法,包括语法感知的事实提取器、紧凑的索引记忆和变换器,以提高准确性。SMART使用45.51M参数,少于GPT-2和BERT,且准确率比GPT-2高出21.3%,同时需要较少的处理。它支持已知和新上传文档的双重推理模式,增强实际部署效果。
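The retrieval side of such a system (facts stored as vectors with their source, top-k lookup, a cap on memory slots) can be sketched as follows. This is a brute-force stand-in written for illustration, not SMART's code and not the FAISS API: the fact records, the 3-dimensional vectors (instead of 384), and the function names are all hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, facts, top_k=20, max_slots=64):
    """Return up to min(top_k, max_slots) facts ranked by cosine similarity to the query."""
    ranked = sorted(facts, key=lambda f: cosine(query, f["vector"]), reverse=True)
    return ranked[:min(top_k, max_slots)]


# Hypothetical subject-relation-object facts, each tagged with where it came from.
facts = [
    {"triple": ("pump P-101", "max_pressure", "16 bar"),
     "source": "manual.pdf#p12", "vector": [1.0, 0.0, 0.0]},
    {"triple": ("pump P-101", "oil_type", "ISO VG 46"),
     "source": "manual.pdf#p30", "vector": [0.0, 1.0, 0.0]},
    {"triple": ("valve V-7", "torque", "25 Nm"),
     "source": "manual.pdf#p41", "vector": [0.0, 0.0, 1.0]},
]

hits = retrieve([0.9, 0.1, 0.0], facts, top_k=2)
for hit in hits:
    print(hit["triple"], "from", hit["source"])
```

Keeping the source alongside each triple is what lets a generated answer point back to the page it was drawn from.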
GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller
First: 2025-12-24T16:46:04+00:00 · Latest: 2025-12-24T16:46:04+00:00
Abstract
Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice as fast) across datasets.
中文标题/摘要
标题:GriDiT:基于网格的因子化扩散方法用于高效生成长图像序列
现代深度学习方法通常将图像序列视为按顺序堆叠帧的大张量。然而,鉴于当前的最先进水平(SoTA),这种简单的表示是否理想?在本文中,我们从生成模型的角度回答了这个问题,并旨在设计一种更有效的图像序列数据建模方法。观察当前SoTA图像序列生成方法的低效性和瓶颈,我们展示了与其处理大张量,通过首先在低分辨率下生成粗略的序列,然后在高分辨率下细化各个帧,可以改进生成过程。我们仅使用包含下采样帧的网格图像训练生成模型。然而,我们学习使用扩散变换器(DiT)的强自我注意机制来捕捉帧之间的相关性,从而生成图像序列。实际上,我们的建模方式将二维图像生成器扩展为低分辨率的三维图像序列生成器,而无需进行任何架构修改。随后,我们逐帧超分辨率以添加与序列无关的高分辨率细节。这种方法具有多种优势,并可以克服该领域SoTA方法的关键局限性。与现有的图像序列生成模型相比,我们的方法在合成质量上表现出色,并且在序列间具有更好的连贯性。它还能够生成任意长度的高保真图像序列,并在推理时间和训练数据使用方面提高效率。此外,我们简洁的建模方式使我们的方法能够在多种数据领域中有效泛化,这通常需要额外的先验知识和监督才能在生成上下文中建模。我们的方法在数据集上始终在质量和推理速度(至少快两倍)方面优于SoTA。
Summary / 总结
This paper addresses the inefficiencies of current deep learning methods in generating long image sequences by proposing a factorized grid-based diffusion approach. The method first generates a low-resolution sequence and then refines individual frames at high resolution. Experiments show that this approach improves synthesis quality and coherence, supports high-fidelity generation of arbitrary-length sequences, and enhances efficiency in inference time and training data usage compared to existing models.
论文提出了一种名为GriDiT的方法,该方法将长图像序列的生成过程分为两步:首先生成低分辨率的序列,然后逐帧进行高分辨率的细化。这种方法利用扩散变换器捕捉帧间的关联性,并对帧进行超分辨率处理,以提高合成质量和连贯性。该方法在质量和推理速度方面优于现有模型,并且在不同数据领域中表现出良好的泛化能力,无需额外的先验知识或监督。
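The grid representation can be made concrete with a small sketch: subsample the frames, tile them row-major into one image for a 2D generator, and cut a generated grid back into frames. The layout (row-major order, fixed column count) and the function names are assumptions for illustration; the abstract does not specify how the grid is arranged.

```python
def subsample(frames, stride):
    """Keep every stride-th frame of the sequence."""
    return frames[::stride]


def frames_to_grid(frames, cols):
    """Tile equally sized frames (lists of pixel rows) row-major into one grid image."""
    if len(frames) % cols != 0:
        raise ValueError("frame count must be a multiple of cols")
    grid = []
    for start in range(0, len(frames), cols):
        band = frames[start:start + cols]
        for y in range(len(band[0])):
            grid.append([px for frame in band for px in frame[y]])
    return grid


def grid_to_frames(grid, frame_h, frame_w):
    """Inverse of frames_to_grid: cut the grid back into frames in temporal order."""
    frames = []
    for top in range(0, len(grid), frame_h):
        for left in range(0, len(grid[0]), frame_w):
            frames.append([row[left:left + frame_w] for row in grid[top:top + frame_h]])
    return frames


# Hypothetical clip: 8 frames of 2x3 pixels, where frame t is filled with the value t.
clip = [[[t] * 3 for _ in range(2)] for t in range(8)]
coarse = subsample(clip, stride=2)      # frames 0, 2, 4, 6
grid = frames_to_grid(coarse, cols=2)   # one 4x6 image holding a 2x2 grid of frames
print(grid)
```

After generation, each frame recovered with `grid_to_frames` would be passed separately to a super-resolution model, matching the two-stage factorization described in the abstract.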
Learning to Refocus with Video Diffusion Models
Authors: SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin
Venue: SIGGRAPH Asia 2025
First: 2025-12-22T19:29:57+00:00 · Latest: 2025-12-24T16:32:32+00:00
Comments: Code and data are available at https://learn2refocus.github.io . SIGGRAPH Asia 2025, Dec. 2025
Abstract
Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at https://learn2refocus.github.io
中文标题/摘要
标题:学习使用视频扩散模型重新聚焦
对焦是摄影的基础,但自动对焦系统往往无法捕捉到预期的主体,用户经常希望在拍摄后调整对焦。我们提出了一种使用视频扩散模型进行现实后对焦的新方法。从单张失焦图像出发,我们的方法生成了一组感知上准确的焦深序列,表示为视频序列,支持交互式重新对焦并解锁一系列下游应用。我们发布了一个大规模的焦深数据集,以支持这项工作和未来的研究,该数据集在多种实际智能手机条件下采集。我们的方法在感知质量和在具有挑战性的场景中的鲁棒性方面均优于现有方法,为日常摄影中的更高级对焦编辑能力铺平了道路。代码和数据可在https://learn2refocus.github.io 获取
Summary / 总结
This paper addresses the issue of post-capture refocusing in photography by introducing a novel method using video diffusion models. Given a defocused image, the approach generates a perceptually accurate focal stack, allowing for interactive refocusing. The method outperforms existing techniques in both perceptual quality and robustness across various challenging scenarios, and a large-scale dataset is provided to support this work and future research. Code and data are available at https://learn2refocus.github.io.
论文针对摄影中对焦捕捉的挑战,即自动对焦系统往往无法捕捉到预期的主体。它提出了一种使用视频扩散模型的方法,可以从单张失焦图像生成感知上准确的焦距堆栈,从而实现交互式对焦。该方法在感知质量和鲁棒性方面均优于现有方法,并提供了一个大规模的数据集以支持这项工作和未来的研究。
ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
First: 2025-12-24T16:24:18+00:00 · Latest: 2025-12-24T16:24:18+00:00
Abstract
Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
中文标题/摘要
标题:ACD:通过注意力监督实现视频扩散模型的直接条件控制
可控性是视频合成中的基本要求,准确对齐条件信号至关重要。现有无分类器自由引导方法通常通过建模数据和条件的联合分布间接实现条件化,这往往导致对指定条件的有限可控性。基于分类器的引导通过外部分类器强制执行条件,但模型可能会利用这种机制提高分类器分数而不真正满足预期条件,从而产生对抗性伪影并限制有效的可控性。在本文中,我们提出了注意力条件扩散(ACD),这是一种通过注意力监督实现视频扩散模型直接条件控制的新框架。通过使模型的注意力图与外部控制信号对齐,ACD 达到了更好的可控性。为此,我们引入了一种稀疏的3D感知对象布局作为高效的条件信号,以及一个专用的布局控制网和自动注释流水线,以实现可扩展的布局集成。在基准视频生成数据集上的大量实验表明,ACD 在保持时间连贯性和视觉保真度的同时,实现了与条件输入的更优对齐,建立了条件视频合成的有效范式。
Summary / 总结
The paper addresses the need for better controllability in video synthesis by proposing Attention-Conditional Diffusion (ACD), which directly aligns the model's attention maps with external control signals. ACD uses a sparse 3D-aware object layout as a conditioning signal and includes a Layout ControlNet and an automated annotation pipeline. Experiments show that ACD provides better alignment with conditioning inputs while maintaining temporal coherence and visual fidelity, outperforming existing methods in conditional video synthesis.
论文提出了一种名为注意力条件扩散(ACD)的新框架,通过使模型的注意力图与外部控制信号对齐来提高视频合成中的可控性。ACD 使用稀疏的3D感知对象布局作为高效的条件信号,并包含一个布局控制网络和自动注释流水线。实验表明,ACD 在保持时间连贯性和视觉保真度的同时,能够更好地与条件输入对齐。
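One simple way to express "aligning attention maps with a layout signal" is a penalty on attention mass that falls outside the region where an object is supposed to be. The abstract does not give ACD's actual loss, so the form below is an assumption chosen for clarity, with hypothetical names and toy 2x2 maps.

```python
def attention_alignment_loss(attn, mask):
    """Fraction of attention mass that falls outside the layout mask (0 is best).

    attn: 2D list of non-negative attention weights over spatial positions.
    mask: 2D list of 0/1 flags marking where the object layout says attention belongs.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass")
    inside = sum(a for arow, mrow in zip(attn, mask) for a, m in zip(arow, mrow) if m)
    return 1.0 - inside / total


layout = [[1, 0], [0, 0]]                 # object occupies the top-left cell
diffuse = [[0.5, 0.25], [0.25, 0.0]]      # half of the attention leaks outside
focused = [[1.0, 0.0], [0.0, 0.0]]        # all attention inside the layout
print(attention_alignment_loss(diffuse, layout))
print(attention_alignment_loss(focused, layout))
```

In training, a term like this would be added to the usual diffusion objective for the attention maps tied to each conditioned object; how the maps are selected and weighted is not stated in the abstract.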
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Authors: Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu
First: 2025-12-24T16:00:15+00:00 · Latest: 2025-12-24T16:00:15+00:00
Comments: Project Page: https://dreamontage.github.io/DreaMontage/
Abstract
The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
中文标题/摘要
标题:DreaMontage:任意帧引导的一次性视频生成
“一次性”技术在电影制作中代表了一种独特的且复杂的美学风格。然而,其实现往往受到高昂成本和复杂现实约束的阻碍。尽管新兴的视频生成模型提供了虚拟替代方案,但现有方法通常依赖于简单的片段拼接,这往往无法保持视觉连贯性和时间一致性。在本文中,我们介绍了DreaMontage,这是一种全面的框架,用于任意帧引导的生成,能够从多样化的用户输入中合成无缝、富有表现力且长时间的一次性视频。为了实现这一目标,我们从三个主要维度应对挑战。(i) 我们将一个轻量级的中间条件机制整合到DiT架构中。通过采用一种有效的基训练数据调优策略,我们解锁了强大的任意帧控制能力。(ii) 为了提高视觉保真度和电影表现力,我们精心制作了一个高质量的数据集,并实施了视觉表达SFT阶段。通过应用定制的DPO方案,我们解决了诸如主体运动合理性及过渡平滑性等关键问题,显著提高了生成内容的成功率和可用性。(iii) 为了促进长序列的生成,我们设计了一种段级自回归(SAR)推理策略,该策略在内存高效的情况下运行。广泛的实验表明,我们的方法能够实现视觉上引人注目且无缝连贯的一次性效果,同时保持计算效率,使用户能够将零散的视觉材料转化为生动、连贯的一次性电影体验。
Summary / 总结
DreaMontage is a framework for generating seamless one-shot videos guided by arbitrary user-provided frames. It integrates a lightweight intermediate-conditioning mechanism into the DiT architecture and adds a Visual Expression SFT stage to enhance visual fidelity and cinematic expressiveness. A Tailored DPO scheme targets subject motion rationality and transition smoothness, while a Segment-wise Auto-Regressive (SAR) inference strategy produces extended sequences in a memory-efficient manner. The method lets users turn fragmented visual materials into vivid, cohesive one-shot videos while maintaining computational efficiency.
DreaMontage 是一个从任意帧生成无缝一镜头视频的框架。它将轻量级的中间条件机制集成到 DiT 架构中,使用自适应调谐策略,并包含视觉表达 SFT 阶段以增强视觉保真度和表现力。该方法还采用定制的 DPO 方案和分段自回归 (SAR) 推断策略来解决运动合理性和平滑过渡问题,从而实现视觉上引人注目且时间上连贯的一镜头视频,同时保持计算效率。
LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00
Abstract
Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io .
中文标题/摘要
标题:LookPlanGraph:基于VLM图增强的体感指令跟随方法
使用大型语言模型(LLM)作为规划器的方法在体感指令跟随任务中变得普遍。为了成功完成任务,LLM 必须在机器人操作的环境中得到接地。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图,并假设在规划开始时所有任务相关信息都已可用。然而,这些方法没有考虑到在图构建和任务执行之间环境可能发生的变化。我们提出了 LookPlanGraph 方法,该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中,LookPlanGraph 不断通过验证现有先验或发现新实体来更新图。这通过使用视觉语言模型处理代理的主观摄像机视图来实现。我们在具有更改对象位置的 VirtualHome 和 OmniGibson 模拟环境中进行了实验,证明 LookPlanGraph 在基于预定义静态场景图的方法中表现出色。为了展示我们方法的实际适用性,我们还在现实世界中进行了实验。此外,我们引入了 GraSIF(用于指令跟随的图场景)数据集,其中包括自动验证框架,包含从 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 中抽取的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。
Summary / 总结
The research aims to improve embodied instruction following by addressing the limitations of static scene graphs that do not account for environmental changes. LookPlanGraph uses a scene graph with static assets and object priors, which is updated during task execution by processing the agent's egocentric view with a Vision Language Model. Experiments in simulated and real-world environments show that LookPlanGraph outperforms methods relying on predefined static scene graphs, particularly in scenarios with changed object positions.
论文解决了在任务执行过程中更新场景图以应对环境变化的挑战。提出了LookPlanGraph方法,该方法利用视觉语言模型在执行过程中持续更新场景图中的相关物体。实验在模拟和真实环境中表明,LookPlanGraph在任务执行期间更新场景图的能力优于依赖静态场景图的方法。研究还介绍了包含自动验证框架的GraSIF数据集,该数据集包含来自SayPlan Office、BEHAVIOR-1K和VirtualHome RobotHow的514个任务。
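A minimal sketch of the graph update described in the LookPlanGraph abstract: object priors are verified, dropped, or newly discovered as the agent looks at a location. The `SceneGraph` class, its method names, and the `seen` set (standing in for the output of a Vision Language Model) are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Static assets (rooms, furniture) are assumed fixed for the whole task.
    assets: set
    # Movable objects: name -> {"location": asset name, "verified": bool}.
    objects: dict = field(default_factory=dict)

    def add_prior(self, name, location):
        """Record an unverified belief about where an object is."""
        self.objects[name] = {"location": location, "verified": False}

    def update_from_view(self, location, seen):
        """Reconcile the graph with the objects seen at `location`.

        `seen` is assumed to come from a VLM run on the egocentric view.
        """
        # Verify existing priors or discover new entities at this location.
        for name in seen:
            self.objects[name] = {"location": location, "verified": True}
        # Remove objects that were expected here but were not observed.
        for name, node in list(self.objects.items()):
            if node["location"] == location and name not in seen:
                del self.objects[name]


graph = SceneGraph(assets={"kitchen_table", "sofa"})
graph.add_prior("mug", "kitchen_table")
graph.add_prior("apple", "kitchen_table")
graph.update_from_view("kitchen_table", {"mug", "book"})
```

After the update, `mug` is a verified prior, `apple` has been removed because it was not where the prior placed it, and `book` is a newly discovered entity. A planner would be re-prompted with this refreshed graph.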
GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer
Authors: Corey Adams, Rishikesh Ranade, Ram Cherukuri, Sanjay Choudhry
First: 2025-12-23T14:40:08+00:00 · Latest: 2025-12-24T15:28:58+00:00
Abstract
We present GeoTransolver, a Multiscale Geometry-Aware Physics Attention Transformer for CAE that replaces standard attention with GALE, coupling physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context computed from multi-scale ball queries (inspired by DoMINO) and reused in every block. Implemented and released in NVIDIA PhysicsNeMo, GeoTransolver persistently projects geometry, global and boundary condition parameters into physical state spaces to anchor latent computations to domain structure and operating regimes. We benchmark GeoTransolver on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing, comparing against Domino, Transolver (as released in PhysicsNeMo), and literature-reported AB-UPT, and evaluate drag/lift R2 and Relative L1 errors for field variables. GeoTransolver delivers better accuracy, improved robustness to geometry/regime shifts, and favorable data efficiency; we include ablations on DrivAerML and qualitative results such as contour plots and design trends for the best GeoTransolver models. By unifying multiscale geometry-aware context with physics-based attention in a scalable transformer, GeoTransolver advances operator learning for high-fidelity surrogate modeling across complex, irregular domains and non-linear physical regimes.
中文标题/摘要
标题:GeoTransolver:使用多尺度几何感知物理注意力变换器在不规则域上学习物理
我们提出了GeoTransolver,这是一种用于CAE的多尺度几何感知物理注意力变换器,用GALE替代了标准注意力,将物理感知的自我注意力应用于学习的状态切片,并与从多尺度球查询(受DoMINO启发)计算出的共享几何/全局/边界条件上下文进行交叉注意力连接,并在每个块中重用。在NVIDIA PhysicsNeMo中实现并发布,GeoTransolver持续将几何、全局和边界条件参数投影到物理状态空间中,将潜在计算锚定在域结构和操作范围内。我们在DrivAerML、Luminary SHIFT-SUV和Luminary SHIFT-Wing上对GeoTransolver进行了基准测试,与Domino、Transolver(在PhysicsNeMo中发布)和文献报告的AB-UPT进行比较,并评估了场变量的拖曳/升力R2和相对L1误差。GeoTransolver提供了更好的准确性、对几何/操作范围变化的改进鲁棒性以及有利的数据效率;我们包括了DrivAerML上的消融分析和GeoTransolver最佳模型的定性结果,如等值线图和设计趋势。通过在可扩展的变换器中统一多尺度几何感知上下文和基于物理的注意力,GeoTransolver促进了复杂、不规则域和非线性物理范围中的操作学习,以实现高保真代理建模。
Summary / 总结
GeoTransolver is a multiscale geometry-aware physics attention transformer for CAE surrogate modeling on irregular domains. It replaces standard attention with GALE, which couples physics-aware self-attention on learned state slices with cross-attention to a shared geometry/global/boundary-condition context. Benchmarked on DrivAerML, Luminary SHIFT-SUV, and Luminary SHIFT-Wing against Domino, Transolver, and literature-reported AB-UPT, GeoTransolver shows better accuracy, improved robustness to geometry and regime shifts, and favorable data efficiency.
GeoTransolver 是一种多尺度几何感知的物理注意力变压器,旨在提高不规则域上物理模型的准确性和鲁棒性。它使用 GALE(几何感知物理注意力层)进行自注意力处理学习的状态切片,并进行与共享几何/全局/边界条件上下文的交叉注意力。GeoTransolver 在 DrivAerML、Luminary SHIFT-SUV 和 Luminary SHIFT-Wing 上进行了基准测试,显示了比 Domino 和 Transolver 等其他方法更好的准确性和对几何和运行模式变化的鲁棒性,以及更优的数据效率。
SegMo: Segment-aligned Text to 3D Human Motion Generation
Authors: Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen
First: 2025-12-24T15:26:11+00:00 · Latest: 2025-12-24T15:26:11+00:00
Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract
Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
中文标题/摘要
标题:SegMo: 与片段对齐的文本到3D人体动作生成
从文本描述生成3D人体动作是一个重要的研究问题,在视频游戏、虚拟现实和增强现实等领域有着广泛的应用。最近的方法在序列级别上将文本描述与人体动作对齐,忽略了模态的内部语义结构。然而,动作描述和动作序列可以自然地分解为更小且语义上更连贯的片段,这些片段可以作为原子对齐单元以实现更精细的对应。受此启发,我们提出了一种新颖的SegMo框架,以实现细粒度的文本-动作对齐。我们的框架由三个模块组成:(1) 文本片段提取,将复杂的文本描述分解为按时间顺序排列的短语,每个短语代表一个简单的原子动作;(2) 动作片段提取,将完整的动作序列分割为相应的动作片段;(3) 细粒度文本-动作对齐,通过对比学习对齐文本和动作片段。广泛的实验表明,SegMo在两个广泛使用的数据集上改进了强基线,在HumanML3D测试集上实现了0.553的TOP 1得分。此外,由于学习到的文本和动作片段共享嵌入空间,SegMo还可以应用于检索式任务,如动作定位和动作到文本检索。
Summary / 总结
SegMo is a framework for generating 3D human motions from text that aligns text and motion at the segment level rather than the sequence level. It consists of three modules: Text Segment Extraction, Motion Segment Extraction, and Fine-grained Text-Motion Alignment via contrastive learning. SegMo improves on a strong baseline on two widely used datasets, reaching a TOP 1 score of 0.553 on the HumanML3D test set, and its shared embedding space also supports retrieval-style tasks such as motion grounding and motion-to-text retrieval.
SegMo 是一种新颖的框架,用于从文本生成 3D 人体动作,通过在段落级别对齐文本和动作来解决先前方法的局限性。它包括三个模块:文本段落提取、动作段落提取和细粒度文本-动作对齐。SegMo 在 HumanML3D 测试集上优于强基线,达到 TOP 1 得分 0.553,并且在动作定位和动作到文本检索等检索任务中表现出色。
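The segment-level contrastive alignment in SegMo can be illustrated with a generic InfoNCE loss, in which each text segment should score highest against its own motion segment. This is a sketch under assumptions: the abstract only says "contrastive learning", so the InfoNCE form, the `temperature` value, and the function name `info_nce` are illustrative choices, not the authors' formulation.

```python
import math


def info_nce(text_emb, motion_emb, temperature=0.07):
    """Text-to-motion InfoNCE loss; row i of each list is a matched pair."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in motion_emb]
    loss = 0.0
    for i in range(len(t)):
        # Cosine similarity of text segment i to every motion segment.
        logits = [sum(a * b for a, b in zip(t[i], mj)) / temperature for mj in m]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of the matched motion segment.
        loss += log_denom - logits[i]
    return loss / len(t)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each phrase embedding matches its own motion segment and large when the pairs are swapped, which is the signal that pulls matched segments together in the shared embedding space.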
MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models
Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller
First: 2025-12-24T15:15:18+00:00 · Latest: 2025-12-24T15:15:18+00:00
Abstract
Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.
中文标题/摘要
标题:MiST:理解中期科学训练在发展化学推理模型中的作用
大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而,最近的研究揭示了一个关键限制:强化学习仅在基础模型已赋予正确答案不可忽略的概率时才能成功——我们称其为“潜在可解性”。本研究探讨了化学推理能力的出现及其先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件:1)符号能力,2)潜在的化学知识。我们提出了中期科学训练(MiST):一系列中期训练技术以满足这些条件,包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上继续预训练,以及在10亿个标记上进行监督微调。这些步骤将3B和7B模型的潜在可解性得分最高提升至1.8倍,并使强化学习在有机反应命名中的top-1准确率从10.9%提升到63.9%,在无机材料生成中的top-1准确率从40.6%提升到67.4%。对于其他具有挑战性的化学任务,也观察到了类似的结果,同时产生了可解释的推理痕迹。我们的研究结果定义了化学推理训练的明确先决条件,并突显了中期训练在解锁推理能力中的更广泛作用。
Summary / 总结
This study explores how mid-stage scientific training (MiST) enables chemical reasoning in large language models. It identifies two prerequisites for RL-based chemical reasoning: symbolic competence and latent chemical knowledge. MiST combines data-mixing with SMILES/CIF-aware pre-processing, continued pre-training, and supervised fine-tuning, raising the latent-solvability score of 3B and 7B models by up to 1.8x. With these steps, reinforcement learning lifts top-1 accuracy from 10.9% to 63.9% on organic reaction naming and from 40.6% to 67.4% on inorganic material generation, while producing interpretable reasoning traces.
该研究探讨了中期科学训练(MiST)在开发化学推理能力中的作用。它确定了两个先决条件:符号能力与潜在的化学知识。所提出的MiST技术,包括数据混合、SMILES/CIF意识预处理、持续预训练和监督微调,显著提高了潜在可解性得分,并使强化学习在有机反应命名中达到63.9%的顶级准确率,在无机材料生成中达到67.4%的顶级准确率,明确界定了化学推理训练的先决条件,并强调了中期训练在解锁推理能力中的重要性。
PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation
Authors: Xiao-Qi Han, Ze-Feng Gao, Peng-Jie Guo, Zhong-Yi Lu
First: 2025-12-24T15:07:36+00:00 · Latest: 2025-12-24T15:07:36+00:00
Comments: 19 pages, 6 figures
Abstract
In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation -- illustrated here by band-gap conditioning with MatterGen -- the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench
中文标题/摘要
标题:PhononBench:一种基于声子的大规模基准测试,用于晶体生成中的动力学稳定性
在本工作中,我们介绍了PhononBench,这是首个用于AI生成晶体动力学稳定性的大规模基准测试。利用最近开发的MatterSim原子间势,该势能在超过10,000种材料中实现了从头算水平的声子预测精度,PhononBench能够高效地进行大规模声子计算和动力学稳定性分析,针对六种领先的晶体生成模型生成的108,843种晶体结构。PhononBench揭示了当前生成模型在确保动力学稳定性方面的普遍局限性:所有生成结构的动力学稳定性平均率为25.83%,最佳模型MatterGen的这一比率也仅为41.0%。进一步的案例研究表明,在目标性质生成中——以MatterGen的带隙调节为例——即使在最佳带隙条件0.5 eV下,动力学稳定性率仍低至23.5%。在空间群控制生成中,高对称晶体表现出更好的稳定性(例如,立方系统达到49.2%的比率),但所有控制生成的平均稳定性仍仅为34.4%。这项研究的重要附加成果是识别了28,119种在整个布里渊区都稳定的晶体结构,为未来的材料探索提供了大量可靠的候选者。通过建立首个大规模动力学稳定性基准测试,本工作系统地突显了当前晶体生成模型的局限性,并提供了未来开发旨在设计和发现物理上可行材料的评估标准和指导。所有模型生成的晶体结构、声子计算结果以及PhononBench开发的高通量评估工作流程将在https://github.com/xqh19970407/PhononBench公开发布
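The stability rate reported by PhononBench can be sketched as a simple screen over phonon frequencies. The usual convention is assumed here: imaginary modes are reported as negative frequencies, and a structure counts as dynamically stable when no mode falls below a small negative tolerance. The tolerance value and the function names are hypothetical, since the abstract does not state the exact threshold used.

```python
def is_dynamically_stable(frequencies_thz, tol=-0.1):
    """True if no phonon mode is imaginary (negative) beyond the tolerance.

    `tol` absorbs small numerical noise; its value is an assumption.
    """
    return min(frequencies_thz) >= tol


def stability_rate(all_frequencies):
    """Percentage of structures that pass the stability screen."""
    stable = sum(is_dynamically_stable(f) for f in all_frequencies)
    return 100.0 * stable / len(all_frequencies)


# Toy input: the lowest sampled frequencies for three structures.
toy_rate = stability_rate([[0.0, 1.2, 3.4], [-2.0, 1.0, 2.5], [-0.05, 0.8, 4.1]])

# The headline numbers are mutually consistent:
# 28,119 stable structures out of 108,843 is about 25.83%.
reported_rate = round(100.0 * 28119 / 108843, 2)
```

The second computation only checks the arithmetic of the figures quoted in the abstract; it does not reproduce any phonon calculation.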
Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen
Venue: MM
First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00
Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables
Abstract
Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
中文标题/摘要
标题:利用轻量级实体提取实现可扩展的基于事件的图像检索
从自然语言描述中检索图像是一项核心任务,位于计算机视觉和自然语言处理的交叉领域,广泛应用于搜索引擎、媒体归档和数字内容管理中。然而,由于模糊或依赖上下文的查询、语言的多样性以及需要可扩展的解决方案,现实世界中的图像-文本检索仍然具有挑战性。在本文中,我们提出了一种轻量级的两阶段检索管道,利用事件中心的实体提取来结合真实场景描述中的时间与上下文信号。第一阶段使用基于显著实体的BM25高效过滤候选图像,第二阶段应用BEiT-3模型捕捉深层多模态语义并重新排序结果。在OpenEvents v1基准上评估,我们的方法达到了0.559的平均精度,显著优于先前的基线。这些结果突显了结合事件引导过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。
Summary / 总结
This paper addresses the challenge of retrieving images from natural language descriptions by proposing a lightweight two-stage retrieval pipeline. The first stage filters candidates based on salient entities using BM25, while the second stage uses BEiT-3 models to capture deep multimodal semantics and rerank the results. The method achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, outperforming previous approaches and demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in real-world scenarios.
本文提出了一种轻量级的两阶段检索管道,利用事件中心的实体提取来结合时间上下文信号,以解决从自然语言描述中检索图像的挑战。第一阶段使用基于显著实体的BM25进行候选过滤,第二阶段使用BEiT-3模型捕获深度多模态语义并重新排序结果。该方法在OpenEvents v1基准上实现了0.559的平均精度,优于先前的方法,展示了结合事件引导过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。
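The two-stage pipeline above (cheap lexical filtering, then an expensive reranker) can be sketched with a small Okapi BM25 implementation. Several simplifications are assumptions: whitespace tokens stand in for the extracted entities, `rerank_score` is a hypothetical callable standing in for a BEiT-3 similarity model, and the `k1`/`b` values are common defaults rather than the authors' settings.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        term_freq = Counter(d)
        score = 0.0
        for q in query.lower().split():
            if q not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1)
            tf = term_freq[q]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


def retrieve(query, docs, rerank_score, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: rerank them."""
    scores = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)


captions = [
    "flood in jakarta 2020",
    "election rally in paris",
    "jakarta flood rescue boats",
]
# Placeholder reranker: prefers longer captions. A real system would score
# image-text similarity with a vision-language model instead.
ranking = retrieve("jakarta flood", captions, lambda q, d: len(d))
```

The design point is that the reranker only ever sees `top_k` candidates, so its cost is independent of the size of the collection.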
RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu
First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00
Comments: 11 pages, 6 figures
Abstract
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
中文标题/摘要
标题:RoboSafe:通过可执行的安全逻辑保护具身代理
由视觉语言模型(VLMs)驱动的具身代理越来越能够执行复杂的现实世界任务,但它们仍然容易受到可能导致不安全行为的危险指令的攻击。运行时安全护栏在任务执行过程中拦截危险行为,提供了灵活的解决方案。然而,现有的防御措施往往依赖于静态规则过滤或提示级控制,难以应对动态、时间依赖性和上下文丰富的环境中隐含的风险。为了解决这个问题,我们提出了RoboSafe,这是一种通过可执行谓词安全逻辑为具身代理提供混合推理运行时保护的混合方法。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种后向反思推理模块,该模块不断回顾短期记忆中的最近轨迹,以推断时间安全谓词,并在检测到违规行为时主动触发重新规划。然后,我们提出了一种前瞻预测推理模块,该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预测即将出现的风险。这些组件共同形成了一个既可解释又可执行的适应性、验证性安全逻辑。在多个代理的广泛实验中,RoboSafe将危险行为的风险发生率降低了36.8%,同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。
Summary / 总结
RoboSafe is designed to safeguard embodied agents by using executable safety logic. It addresses the limitations of existing static rule filters and prompt-level controls by integrating backward reflective and forward predictive reasoning processes. The system reduces hazardous actions by 36.8% compared to leading baselines while maintaining near-original task performance. RoboSafe has been validated through extensive experiments and real-world evaluations on physical robotic arms.
RoboSafe 通过使用可执行的安全逻辑来保护由视觉-语言模型驱动的实体代理免受危险指令的影响,结合了持续监控近期行为以确保安全的后向反思推理,以及基于长期记忆和当前观察预测未来风险的前向预测推理。实验表明,RoboSafe 相比现有方法能显著减少 36.8% 的危险行为,同时保持类似的任务性能。实际世界中的机器人手臂评估进一步证实了其实用性。
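To make "safety logic that is executable as code" concrete, here is a minimal sketch of a guard built from two predicates: one checked backward over the recent trajectory and one checked forward against the next action and current observation. The action strings, predicate names, and the `guard` function are invented for illustration. In RoboSafe the predicates are generated by the reasoning modules, whereas here they are hand-written.

```python
def left_stove_unattended(trajectory):
    """Temporal predicate: the agent left the kitchen while the stove was on."""
    stove_on = False
    for action in trajectory:
        if action == "turn_on(stove)":
            stove_on = True
        elif action == "turn_off(stove)":
            stove_on = False
        elif action == "leave(kitchen)" and stove_on:
            return True
    return False


def risky_given_context(action, observation):
    """Context-aware predicate: pouring water next to electronics is unsafe."""
    return action == "pour(water)" and "laptop" in observation.get("nearby", [])


def guard(trajectory, next_action, observation):
    """Return 'replan', 'block', or 'allow' for the proposed next action."""
    # Backward check over what has already happened.
    if left_stove_unattended(trajectory):
        return "replan"
    # Forward check on what is about to happen.
    if risky_given_context(next_action, observation):
        return "block"
    return "allow"


decision = guard(["turn_on(stove)", "leave(kitchen)"], "pick(cup)", {})
```

Because each predicate is an ordinary function, a violation can be traced to the exact rule and trajectory step that triggered it.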
Latent Implicit Visual Reasoning
Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
First: 2025-12-24T14:59:49+00:00 · Latest: 2025-12-24T14:59:49+00:00
Abstract
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
中文标题/摘要
标题:潜在隐式视觉推理
虽然大型多模态模型(LMMs)取得了显著进展,但它们仍然主要以文本为中心,依赖语言作为核心推理模态。因此,它们在处理以视觉为主的推理任务方面能力有限。最近的方法通过使用辅助图像、深度图或图像裁剪来监督中间的视觉步骤,试图解决这一问题。然而,这些策略对“有用的”视觉抽象施加了限制性的先验,增加了注释成本,并且难以在不同任务之间泛化。为了解决这一关键限制,我们提出了一种任务无关的机制,该机制训练LMMs发现和使用视觉推理标记,而无需显式的监督。这些标记全局注意并以任务自适应的方式重新编码图像,使模型能够提取相关视觉信息,而无需手工设计的监督。我们的方法在多种视觉中心任务上优于直接微调,并达到了最先进的性能——包括那些中间抽象难以指定的任务——同时也能泛化到多任务指令调优。
Summary / 总结
The research aims to enhance the visual reasoning capabilities of Large Multimodal Models (LMMs) by developing a task-agnostic mechanism that trains the models to discover and use visual reasoning tokens without explicit supervision. This method allows the models to attend globally and re-encode images in a task-adaptive manner, enabling them to extract relevant visual information. The approach outperforms direct fine-tuning and achieves state-of-the-art results on various vision-centric tasks, including those where intermediate abstractions are challenging to specify, and it also generalizes to multi-task instruction tuning.
研究旨在通过开发一种任务无关的机制来增强大型多模态模型(LMMs)的视觉推理能力,该机制训练模型发现和使用视觉推理标记,而无需显式的监督。这种方法允许模型全局关注并以任务适应的方式重新编码图像,使其能够提取相关视觉信息。该方法在各种视觉中心任务上超过了直接微调,达到了最先进的结果,包括那些中间抽象难以指定的任务,并且也适用于多任务指令调优。
A study of EHVI vs fixed scalarization for molecule design
Authors: Anabel Yong, Austin Tripp, Layla Hosseini-Gerami, Brooks Paige
Venue: NeurIPS
First: 2025-07-18T07:12:19+00:00 · Latest: 2025-12-24T14:56:07+00:00
Comments: Accepted to NeurIPS AI4Science Workshop 2025
Abstract
Multi-objective Bayesian optimization (MOBO) provides a principled framework for navigating trade-offs in molecular design. However, its empirical advantages over scalarized alternatives remain underexplored. We benchmark a simple Pareto-based MOBO strategy - Expected Hypervolume Improvement (EHVI) - against a simple fixed-weight scalarized baseline using Expected Improvement (EI), under a tightly controlled setup with identical Gaussian Process surrogates and molecular representations. Across three molecular optimization tasks, EHVI consistently outperforms scalarized EI in terms of Pareto front coverage, convergence speed, and chemical diversity. While scalarization encompasses flexible variants - including random or adaptive schemes - our results show that even strong deterministic instantiations can underperform in low-data regimes. These findings offer concrete evidence for the practical advantages of Pareto-aware acquisition in de novo molecular optimization, especially when evaluation budgets are limited and trade-offs are nontrivial.
中文标题/摘要
标题:分子设计中EHVI与固定加权化方法的比较研究
多目标贝叶斯优化(MOBO)为分子设计中的权衡提供了一个原则性的框架。然而,其与标量化替代方法的实证优势尚未得到充分探索。我们使用期望改进(EI)作为固定权重标量化基线,与基于帕累托的期望超体积改进(EHVI)策略进行了基准测试,使用了相同高斯过程代理和分子表示的严格控制设置。在三个分子优化任务中,EHVI在帕累托前沿覆盖、收敛速度和化学多样性方面始终优于标量化EI。虽然标量化包括灵活的变体——包括随机或自适应方案,但我们的结果表明,在数据量有限的情况下,即使是强大的确定性实例也可能表现不佳。这些发现为在新分子优化中使用帕累托意识的获取函数提供了实际优势的实证证据,尤其是在评估预算有限且权衡复杂的情况下。
Summary / 总结
The study investigates the performance of Expected Hypervolume Improvement (EHVI) compared to fixed scalarization methods in molecular design using multi-objective Bayesian optimization. Across three molecular optimization tasks, EHVI outperformed the scalarized Expected Improvement (EI) in terms of Pareto front coverage, convergence speed, and chemical diversity. The results suggest that EHVI is more effective, particularly in low-data regimes, supporting the practical advantages of Pareto-aware acquisition strategies.
该研究对比了Expected Hypervolume Improvement (EHVI) 和固定权重标量化基线(Expected Improvement, EI)在分子设计中的多目标贝叶斯优化表现。在三个任务中,EHVI 在帕累托前沿覆盖、收敛速度和化学多样性方面均优于 EI。研究结果表明,帕累托感知的获取方法在数据有限且权衡复杂的情况下具有优势。
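The difference between Pareto-aware and fixed-weight acquisition can be seen in a two-objective toy example. The sketch below computes the deterministic hypervolume improvement of a candidate, whereas EHVI takes the expectation of this quantity under the Gaussian Process posterior, which is omitted here. Both objectives are assumed to be maximized, all points are assumed to dominate the reference point, and the weights are arbitrary illustrative values.

```python
def pareto_front(points):
    """Points not dominated by any other point (maximization)."""
    return [
        p for p in points
        if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
    ]


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the front and bounded below by `ref`."""
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        if y > prev_y:
            # Add the horizontal strip between the previous and current height.
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area


def hv_improvement(candidate, points):
    return hypervolume_2d(points + [candidate]) - hypervolume_2d(points)


def scalarized(candidate, weights=(0.9, 0.1)):
    return weights[0] * candidate[0] + weights[1] * candidate[1]


observed = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
balanced, extreme = (2.5, 2.5), (4.0, 0.5)
hvi_balanced = hv_improvement(balanced, observed)
hvi_extreme = hv_improvement(extreme, observed)
```

With these numbers, hypervolume improvement prefers the balanced candidate, while the fixed weights (0.9, 0.1) prefer the extreme one. That is the kind of mismatch a fixed scalarization can introduce when trade-offs matter.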
ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering
Authors: Paritosh Parmar, Eric Peh, Basura Fernando
First: 2025-08-28T17:10:53+00:00 · Latest: 2025-12-24T14:52:45+00:00
Comments: Project page: https://paritoshparmar.github.io/chainreaction/
Abstract
Existing Causal-Why Video Question Answering (VideoQA) models often struggle with higher-order reasoning, relying on opaque, monolithic pipelines that entangle video understanding, causal inference, and answer generation. These black-box approaches offer limited interpretability and tend to depend on shallow heuristics. We propose a novel, modular paradigm that explicitly decouples causal reasoning from answer generation, introducing natural language causal chains as interpretable intermediate representations. Inspired by human cognitive models, these structured cause-effect sequences bridge low-level video content with high-level causal reasoning, enabling transparent and logically coherent inference. Our two-stage architecture comprises a Causal Chain Extractor (CCE) that generates causal chains from video-question pairs, and a Causal Chain-Driven Answerer (CCDA) that derives answers grounded in these chains. To address the lack of annotated reasoning traces, we introduce a scalable method for generating accurate causal chains from existing datasets. We construct human verified causal chains for 46K samples. We also propose CauCo, a new evaluation metric for causality-oriented captioning. Experiments on three large-scale benchmarks demonstrate that our approach not only outperforms state-of-the-art models, but also yields substantial gains in explainability, user trust, and generalization -- positioning the CCE as a reusable causal reasoning engine across diverse domains. Project page: https://paritoshparmar.github.io/chainreaction/
中文标题/摘要
标题:ChainReaction:因果链引导推理的模块化可解释因果-为什么视频问答
现有的因果-为什么视频问答(VideoQA)模型往往难以进行高层次推理,依赖于不透明的单一管道,将视频理解、因果推理和答案生成紧密结合。这些黑盒方法缺乏可解释性,通常依赖于浅层启发式方法。我们提出了一种新颖的模块化范式,明确地将因果推理与答案生成分离,引入自然语言因果链作为可解释的中间表示。受人类认知模型的启发,这些结构化的因果序列将低级视频内容与高级因果推理联系起来,使推理变得透明且逻辑连贯。我们的两阶段架构包括因果链提取器(CCE),从视频-问题对生成因果链,以及因果链驱动的答案生成器(CCDA),基于这些链生成答案。为了解决缺乏标注推理轨迹的问题,我们提出了一种生成现有数据集中准确因果链的可扩展方法。我们为46000个样本构建了经人工验证的因果链。我们还提出了CauCo,一种新的因果导向字幕评估指标。在三个大规模基准上的实验表明,我们的方法不仅优于最先进的模型,还在可解释性、用户信任和泛化方面取得了显著提升——将CCE定位为跨不同领域的可重用因果推理引擎。项目页面:https://paritoshparmar.github.io/chainreaction/
Summary / 总结
The paper addresses the limitations of existing VideoQA models that struggle with higher-order reasoning and lack interpretability. It introduces ChainReaction, a modular approach that separates causal reasoning from answer generation using natural language causal chains as intermediate representations. The two-stage architecture includes a Causal Chain Extractor and a Causal Chain-Driven Answerer. Experiments show that ChainReaction outperforms state-of-the-art models and improves explainability, user trust, and generalization. The approach also includes a new evaluation metric, CauCo, for causality-oriented captioning.
论文针对现有视频问答模型在高层次推理方面存在的局限性和缺乏可解释性的问题,提出了ChainReaction,这是一种模块化的方法,通过使用自然语言因果链作为中间表示来分离因果推理和答案生成。该两阶段架构包括因果链提取器和因果链驱动的答案生成器。实验表明,ChainReaction不仅超越了最先进的模型,还在解释性、用户信任和泛化方面取得了显著的提升。该方法还提出了一种新的评估指标CauCo,用于因果导向的字幕生成。
Causal-driven attribution (CDA): Estimating channel influence without user-level data
Authors: Georgios Filippou, Boi Mai Quach, Diana Lenghel, Arthur White, Ashish Kumar Jha
First: 2025-12-24T14:51:12+00:00 · Latest: 2025-12-24T14:51:12+00:00
Comments: 42 pages, 8 figures, initially submitted to the Journal of the Academy of Marketing Science on 24 Dec 2025
Abstract
Attribution modelling lies at the heart of marketing effectiveness, yet most existing approaches depend on user-level path data, which are increasingly inaccessible due to privacy regulations and platform restrictions. This paper introduces a Causal-Driven Attribution (CDA) framework that infers channel influence using only aggregated impression-level data, avoiding any reliance on user identifiers or click-path tracking. CDA integrates temporal causal discovery (using PCMCI) with causal effect estimation via a Structural Causal Model to recover directional channel relationships and quantify their contributions to conversions. Using large-scale synthetic data designed to replicate real marketing dynamics, we show that CDA achieves an average relative RMSE of 9.50% when given the true causal graph, and 24.23% when using the predicted graph, demonstrating strong accuracy under correct structure and meaningful signal recovery even under structural uncertainty. CDA captures cross-channel interdependencies while providing interpretable, privacy-preserving attribution insights, offering a scalable and future-proof alternative to traditional path-based models.
中文标题/摘要
标题:因果驱动归因(CDA):在无用户级数据情况下估计渠道影响
归因建模是衡量营销效果的核心,但大多数现有方法依赖于用户级路径数据,由于隐私法规和平台限制,这些数据越来越难以获取。本文介绍了一种因果驱动归因(CDA)框架,该框架仅使用聚合的印象级数据推断渠道影响,不依赖于用户标识符或点击路径跟踪。CDA 结合使用 PCMCI 的时间因果发现与结构因果模型中的因果效应估计,以恢复渠道关系的方向并量化其对转化的贡献。使用设计用于复制真实营销动态的大规模合成数据,我们展示了当给定真实因果图时,CDA 的平均相对 RMSE 为 9.50%,使用预测图时为 24.23%,在正确结构下表现出强大的准确性,并且即使在结构不确定性下也能恢复有意义的信号。CDA 捕捉跨渠道的相互依赖性,同时提供可解释的、隐私保护的归因洞察,提供了一种可扩展且面向未来的替代传统路径模型的选择。
Summary / 总结
The paper introduces Causal-Driven Attribution (CDA), a framework that estimates channel influence from aggregated impression-level data without relying on user identifiers or click-path tracking, which addresses the loss of user-level data under privacy regulations and platform restrictions. CDA combines temporal causal discovery (PCMCI) with causal effect estimation via a Structural Causal Model to infer directional channel relationships and their contributions to conversions. On large-scale synthetic data, CDA achieves an average relative RMSE of 9.50% when given the true causal graph and 24.23% when using a predicted graph, indicating strong accuracy under the correct structure and meaningful signal recovery under structural uncertainty.
论文提出了因果驱动归因(CDA)框架,该框架利用聚合的曝光级数据来推断渠道影响,而不依赖于用户标识符或点击路径跟踪,解决了隐私法规带来的挑战。CDA 结合了时间因果发现与因果效应估计,以恢复渠道关系并量化其对转化的贡献。在合成数据上的实验表明,CDA 在给定真实因果图时的平均相对 RMSE 为 9.50%,在使用预测图时为 24.23%,显示出在结构正确时的高准确性,以及在结构不确定时仍能恢复有意义的信号。
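As a reading aid for the relative RMSE figures quoted in the CDA entry above: the abstract does not define the metric, so the sketch below assumes one common convention (RMSE divided by the mean absolute true value) and uses made-up channel contributions. The function name, the normalisation, and the numbers are all assumptions for illustration, not the paper's evaluation code.

```python
import math

def relative_rmse(true_values, estimates):
    """RMSE normalised by the mean absolute true value.

    This normalisation is an assumption; the paper may define
    relative RMSE differently.
    """
    if len(true_values) != len(estimates) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - e) ** 2 for t, e in zip(true_values, estimates)) / len(true_values)
    scale = sum(abs(t) for t in true_values) / len(true_values)
    if scale == 0:
        raise ValueError("mean absolute true value is zero")
    return math.sqrt(mse) / scale

# Made-up per-channel contributions to conversions (illustration only).
true_contrib = [100.0, 200.0, 300.0]
est_contrib = [110.0, 190.0, 300.0]

score = relative_rmse(true_contrib, est_contrib)
print(f"relative RMSE = {score:.2%}")
```

Under this convention, a value such as 9.50% would mean that the typical estimation error is roughly a tenth of the typical contribution size.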
Human Motion Estimation with Everyday Wearables
Authors: Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang
First: 2025-12-24T14:44:51+00:00 · Latest: 2025-12-24T14:44:51+00:00
Abstract
While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by a motion capture (MoCap) system, to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.
中文标题/摘要
标题:基于日常穿戴设备的人体运动估计
基于穿戴设备的人体运动估计对于XR交互等应用至关重要,但现有方法往往存在穿戴不便、硬件昂贵和繁琐校准的问题,这阻碍了它们在日常生活中的应用。为了解决这些问题,我们提出了EveryWear,这是一种完全基于日常穿戴设备的轻量级和实用的人体运动捕捉方法:一部智能手机、智能手表、耳塞和配备一个前置摄像头和两个向下摄像头的智能眼镜,无需在使用前进行显式校准。我们引入了Ego-Elec,这是一个包含56种日常活动的9小时真实世界数据集,覆盖17种不同的室内和室外环境,并提供了由运动捕捉(MoCap)提供的真实3D注释,以促进该领域的稳健研究和基准测试。我们的方法采用了一种多模态教师-学生框架,将第一人称摄像头的视觉线索与消费设备的惯性信号结合起来。通过直接在真实世界数据上进行训练而不是合成数据,我们的模型有效地消除了限制先前工作的模拟到现实的差距。实验表明,我们的方法优于基线模型,验证了其在实际全身运动估计中的有效性。
Summary / 总结
The research aims to improve human motion estimation for applications like XR interaction by addressing issues such as poor wearability and expensive hardware. EveryWear, a lightweight approach using everyday wearables like a smartphone, smartwatch, earbuds, and smart glasses, is introduced. The method uses a multimodal teacher-student framework that combines visual cues from egocentric cameras with inertial signals from consumer devices, trained on real-world data. Experiments show that this approach outperforms baseline models, making it effective for practical full-body motion estimation.
研究旨在通过解决穿戴不便和硬件昂贵等问题,改进用于XR交互的人体动作估计。提出了一个轻量级的方法EveryWear,使用日常穿戴设备如智能手机、智能手表、耳塞和智能眼镜。该方法采用多模态教师-学生框架,结合来自第一人称摄像头的视觉线索和消费级设备的惯性信号,并直接在真实世界数据上进行训练。实验表明,该方法优于基线模型,证明了其在实际全身动作估计中的有效性。
Analytic and Variational Stability of Deep Learning Systems
Authors: Ronald Katende
First: 2025-12-24T14:43:59+00:00 · Latest: 2025-12-24T14:43:59+00:00
Abstract
We propose a unified analytic and variational framework for studying stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. We prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. Classical spectral stability results for feedforward networks, a discrete CFL-type condition for residual architectures, and parametric and temporal stability laws for stochastic gradient methods arise as direct consequences. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting framework provides a unified dynamical description of stability across architectures and optimization methods, clarifying how architectural and algorithmic choices jointly govern robustness and sensitivity to perturbations. It also provides a foundation for further extensions to continuous-time limits and geometric formulations of learning dynamics.
中文标题/摘要
标题:深度学习系统的解析与变分稳定性
我们提出了一种统一的解析和变分框架,用于研究被视为耦合的表示-参数动力学的深度学习系统的稳定性。中心对象是学习稳定性概貌(Learning Stability Profile),它跟踪表示、参数和更新机制在学习轨迹上对扰动的无穷小响应。我们证明了一个基本解析稳定性定理,表明这些稳定性特征的一致有界性(在范数等价的意义下)等价于存在一种沿学习流耗散的李雅普诺夫型能量。在光滑情形下,该框架给出了将谱范数、激活正则性、步长和学习率与学习动力学的收缩性联系起来的显式稳定性指数。前馈网络的经典谱稳定性结果、残差架构的离散CFL型条件以及随机梯度方法的参数和时间稳定性法则均作为直接推论出现。通过用Clarke广义导数替换经典导数,并用变分李雅普诺夫泛函替换光滑能量,该理论可扩展到非光滑学习系统,包括ReLU网络、近端和投影更新以及随机次梯度流。由此形成的框架为各种架构和优化方法的稳定性提供了统一的动力学描述,阐明了架构和算法选择如何共同影响鲁棒性和对扰动的敏感性。它还为进一步扩展到连续时间极限和学习动力学的几何形式奠定了基础。
Summary / 总结
The paper introduces a unified analytic and variational framework for analyzing stability in deep learning systems, viewed as coupled representation-parameter dynamics. The central concept is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. The author proves a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. The framework yields explicit stability exponents in smooth regimes and extends to non-smooth systems such as ReLU networks and stochastic subgradient flows, offering a unified description of stability across architectures and optimization methods.
本文提出了一种统一框架来分析深度学习系统的稳定性,其核心是衡量扰动响应的学习稳定性概貌(Learning Stability Profile)。作者证明了一个基本解析稳定性定理,表明稳定性特征的一致有界性(在范数等价的意义下)等价于存在一种沿学习流耗散的李雅普诺夫型能量。该框架为光滑情形给出了显式的稳定性指数,并扩展到ReLU网络和随机次梯度流等非光滑系统,为不同架构和优化方法的稳定性提供了统一的描述。
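For orientation, the "classical spectral stability results for feedforward networks" and the Lyapunov-type dissipation mentioned in the abstract above are usually met in the following textbook forms. This is standard background written in assumed notation (the symbols $W_k$, $b_k$, $\sigma_k$, $L_{\sigma_k}$, $\eta$, and $\beta$ are not taken from the paper), not a restatement of the paper's theorem.

```latex
% Feedforward network with layers x_k = \sigma_k(W_k x_{k-1} + b_k), k = 1, ..., K,
% where each activation \sigma_k is L_{\sigma_k}-Lipschitz.
\[
  \| f(x) - f(y) \|_2
  \;\le\;
  \Bigl( \prod_{k=1}^{K} L_{\sigma_k} \, \| W_k \|_2 \Bigr) \, \| x - y \|_2 ,
\]
% so the map f is contractive whenever the product is strictly below 1.

% Gradient descent \theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)
% on a \beta-smooth loss with step size 0 < \eta \le 1/\beta:
\[
  \mathcal{L}(\theta_{t+1})
  \;\le\;
  \mathcal{L}(\theta_t) - \frac{\eta}{2} \, \| \nabla \mathcal{L}(\theta_t) \|_2^2 ,
\]
% i.e. the loss itself acts as an energy that is non-increasing along the iterates.
```

The first bound ties spectral norms and activation regularity to contractivity. The second ties the step size to the dissipation of an energy, which is the kind of link the abstract describes in more general terms.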
Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Authors: Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
First: 2025-12-18T10:21:14+00:00 · Latest: 2025-12-24T14:39:27+00:00
Comments: Project available at https://github.com/sarapapi/hearing2translate
Abstract
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
中文标题/摘要
标题:听译:将语音模态整合到LLM中的有效性
随着大型语言模型(LLMs)超越文本,将语音作为原生模态进行整合产生了语音LLMs,旨在直接翻译口语,从而绕过传统的基于转录的管道。然而,这种整合是否能提高语音到文本的翻译质量,与现有的级联架构相比,仍然是一个开放的问题。我们提出了听译,这是第一个全面的测试套件,严格基准测试了5个最先进的语音LLMs与16个强大的直接和级联系统,这些系统结合了领先的语音基础模型(SFM)和多语言LLMs。我们的分析涵盖了16个基准测试、13种语言对和9种具有挑战性的条件,包括不流畅、嘈杂和长篇语音。在广泛的评估中,我们发现级联系统仍然是最可靠的,当前的语音LLMs仅在某些设置中与级联系统相当,而SFM则落后于两者,这表明在模型内部或管道中整合一个LLM对于高质量的语音翻译是必不可少的。
Summary / 总结
The research evaluates whether integrating speech as a native modality into Large Language Models (LLMs), yielding so-called SpeechLLMs, improves direct speech-to-text translation. The study uses a comprehensive test suite to benchmark 5 state-of-the-art SpeechLLMs against 16 direct and cascade systems across 16 benchmarks, 13 language pairs, and 9 challenging conditions. The findings indicate that cascaded systems remain the most reliable overall, that current SpeechLLMs match cascades only in selected settings, and that speech foundation models (SFMs) on their own lag behind both, suggesting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
研究旨在评估将语音作为原生模态集成到大型语言模型(LLMs)中,以实现直接的语音到文本翻译的效果。研究使用了一个全面的测试套件来比较5个最先进的SpeechLLMs与16个直接和级联系统在16个基准、13个语言对和9个挑战性条件下的表现。研究发现,级联系统在整体上仍是最可靠的,当前的SpeechLLMs仅在某些设置中与级联系统相当,而单独的语音基础模型(SFM)落后于两者,表明无论是在模型内部还是在管道中,集成一个LLM对于高质量的语音翻译都至关重要。
Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Authors: Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu
First: 2025-12-24T14:28:17+00:00 · Latest: 2025-12-24T14:28:17+00:00
Abstract
Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose Schrödinger's Navigator, a navigation framework inspired by Schrödinger's thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger's Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.
中文标题/摘要
标题:薛定谔的导航器:零样本对象导航的未来图景想象
零样本对象导航(ZSON)要求机器人在未见过的环境中定位目标对象,无需依赖预先构建的地图或特定任务的训练。然而,现有的ZSON方法在现实且杂乱的环境中往往难以应对,尤其是在场景包含严重遮挡、未知风险或动态移动目标的情况下。为了解决这些挑战,我们提出了薛定谔的导航器,这是一种受薛定谔不确定性思想实验启发的导航框架。该框架将未观察到的空间视为一组可能的未来世界,并在行动前对其进行推理。基于第一人称视觉输入和三条候选轨迹,一个以轨迹为条件的3D世界模型沿着每条路径想象未来的观察结果。这使智能体能够看到遮挡物之后的情况,预见未见区域的风险,而无需额外绕路或密集的全局建图。想象出的3D观察结果被融合到导航地图中,并用于更新价值地图。这些更新引导策略选择能够避开遮挡物、减少对不确定空间的暴露并更好地追踪移动目标的轨迹。在具有严重静态遮挡、未知风险和动态移动目标的三个具有挑战性的场景中,使用四足机器人Go2进行的实验表明,在遮挡严重的环境中,薛定谔的导航器在自我定位、对象定位和整体成功率方面始终优于强大的ZSON基线。这些结果证明了以轨迹为条件的3D想象在实现稳健的零样本对象导航方面的有效性。
Summary / 总结
The research aims to address the challenges of zero-shot object navigation in cluttered and dynamic environments by proposing Schrödinger's Navigator. This framework uses a trajectory-conditioned 3D world model to imagine future observations and navigate through unobserved spaces, avoiding occlusions and risks. Experiments show that Schrödinger's Navigator outperforms existing methods in self-localization, object localization, and success rate in environments with heavy occlusions and moving targets.
论文提出了一种名为薛定谔导航器的方法,以应对复杂和拥挤环境中零样本物体导航的挑战。该方法利用轨迹条件下的3D世界模型来想象未来观察,并导航通过未观察到的空间,使机器人能够避开遮挡和不确定区域。实验表明,薛定谔导航器在自定位、物体定位和总体成功率方面优于现有方法,特别是在具有大量遮挡和移动目标的环境中。
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan
First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
中文标题/摘要
标题:VisRes 基准:关于评估 VLM 视觉推理能力的研究
视觉-语言模型(VLMs)在视觉问答和图像描述等任务上取得了显著进展。然而,这些模型在多大程度上是在进行视觉推理,而非依赖语言先验,目前尚不明确。为了解决这一问题,我们引入了 VisRes 基准,该基准旨在在无需上下文语言监督的自然场景中研究视觉推理。通过对三种复杂性级别的模型行为进行分析,我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其各级别中分离出不同的推理能力。第一级测试在模糊、纹理变化、遮挡和旋转等干扰下的感知完成和全局图像匹配;第二级测试单一属性(如颜色、数量、方向)的基于规则的推理;第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中,我们发现最先进的 VLM 在细微的感知扰动下表现接近随机,揭示了其在模式识别之外的抽象能力有限。最后,我们讨论了 VisRes 如何为多模态研究中推进抽象视觉推理提供统一框架。
Summary / 总结
The research evaluates the visual reasoning capabilities of Vision-Language Models (VLMs) by introducing VisRes Bench, a benchmark that tests models in naturalistic settings without contextual language supervision. The benchmark consists of three levels: perceptual completion and global image matching under perturbations (Level 1), rule-based inference over a single attribute (Level 2), and compositional reasoning integrating multiple visual attributes (Level 3). Across more than 19,000 controlled task images, state-of-the-art VLMs perform near random under subtle perceptual perturbations, indicating limited abstraction beyond pattern recognition. This is consistent with the concern that VLMs may lean on linguistic priors rather than on visual reasoning.
论文提出了VisRes Bench,这是一个用于评估视觉-语言模型(VLM)在没有上下文语言监督的情况下视觉推理能力的基准。该基准包括三个复杂度级别:感知完成和全局图像匹配(Level 1)、单一属性的规则推理(Level 2)和多视觉属性的组合推理(Level 3)。在超过19,000张受控任务图像中,最先进的VLM在细微的感知扰动下表现接近随机,表明其在模式识别之外的抽象能力有限,这与其可能依赖语言先验而非视觉推理的担忧相符。
UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan
First: 2025-12-24T14:08:38+00:00 · Latest: 2025-12-24T14:08:38+00:00
Comments: 14 pages, 10 figures, Technical Report
Abstract
In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
中文标题/摘要
标题:UltraShape 1.0:通过可扩展的几何细化生成高保真3D形状
在本报告中,我们介绍了UltraShape 1.0,这是一种可扩展的3D扩散框架,用于高保真3D几何生成。所提出的方法采用两阶段生成管道:首先合成粗略的整体结构,然后细化以生成详细的高质量几何形状。为了支持可靠的3D生成,我们开发了一个全面的数据处理管道,包括一种新颖的水密处理方法和高质量数据过滤。该管道通过去除低质量样本、填补孔洞和增厚细薄结构,提高了公开可用的3D数据集的几何质量,同时保留了精细的几何细节。为了实现精细的几何细化,我们在扩散过程中将空间定位与几何细节合成解耦。我们通过在固定的空间位置进行体素细化来实现这一点,其中从粗略几何形状派生的体素查询提供了通过RoPE编码的显式位置锚点,使扩散模型能够专注于在缩减的、结构化的解空间内合成局部几何细节。我们的模型仅在公开可用的3D数据集上进行训练,尽管训练资源有限,但仍能实现出色的几何质量。广泛的评估表明,UltraShape 1.0在数据处理质量和几何生成方面与现有的开源方法相比具有竞争力。所有代码和训练模型将被发布以支持未来的研究。
Summary / 总结
UltraShape 1.0 is a scalable 3D diffusion framework for generating high-fidelity 3D geometry through a two-stage process: coarse global structure synthesis followed by detailed refinement. It includes a data processing pipeline that improves dataset quality by removing low-quality samples, filling holes, and thickening thin structures while preserving fine details. The method decouples spatial localization from geometric detail synthesis, using voxel-based refinement at fixed spatial locations with RoPE-encoded positional anchors so that the diffusion model can focus on local details. Trained only on publicly available 3D datasets with limited resources, UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation.
UltraShape 1.0 是一种可扩展的 3D 扩散框架,用于生成高保真 3D 形状。它采用两阶段管道,先合成粗略的全局结构,再对其进行细化以生成详细的几何形状。该框架包括一个数据处理管道,通过去除低质量样本、填补孔洞和增厚细薄结构来提高几何质量。扩散过程中空间定位和几何细节合成被解耦,以实现精细的细化。尽管训练资源有限,UltraShape 1.0 在数据处理质量和几何生成方面与现有开源方法相比仍具有竞争力。
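The UltraShape abstract above says positional anchors are "encoded via RoPE" but gives no detail on how voxel coordinates are mapped. As a hedged illustration of the general mechanism only, the sketch below applies standard rotary position embedding along a single axis and checks its defining property: the dot product of a rotated query and key depends only on their position offset. The function names, the vectors, and the restriction to one axis are assumptions; the paper's 3D scheme may differ.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply 1D rotary position embedding to an even-length vector.

    Consecutive pairs (vec[2i], vec[2i+1]) are rotated by the angle
    position * base ** (-2i / d), following the usual RoPE recipe.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Arbitrary illustrative query/key vectors (not from the paper).
query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.5, -0.75, 1.0]

# The same offset (2) at different absolute positions gives the same score.
score_near_origin = dot(rope_rotate(query, 2), rope_rotate(key, 0))
score_shifted = dot(rope_rotate(query, 7), rope_rotate(key, 5))
print(abs(score_near_origin - score_shifted) < 1e-9)
```

One plausible way to extend this to voxel grids is to split the feature dimension into three groups and rotate each group by one coordinate, but that is a guess and not something the abstract states.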
Towards Arbitrary Motion Completing via Hierarchical Continuous Representation
Authors: Chenghao Xu, Guangtao Lyu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng
First: 2025-12-24T14:07:04+00:00 · Latest: 2025-12-24T14:07:04+00:00
Abstract
Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.
中文标题/摘要
标题:通过分层连续表示实现任意运动补全
物理运动本质上是连续的,更高的相机帧率通常有助于提高平滑度和时间一致性。我们首次探索了人类运动序列的连续表示,能够对任意输入运动序列在任意帧率下进行插值、中间帧补全甚至外推。为此,我们提出了一种基于隐式神经表示(INRs)的新型参数激活诱导分层隐式表示框架,称为NAME。我们的方法引入了分层时间编码机制,从运动序列的多个时间尺度中提取特征,有效捕捉复杂的时序模式。此外,我们还将基于傅里叶变换的自定义参数激活函数集成到基于MLP的解码器中,以增强连续表示的表达能力。这种参数化形式显著增强了模型高精度表示复杂运动行为的能力。在多个基准数据集上的广泛评估表明,我们提出的方法具有有效性和鲁棒性。
Summary / 总结
The research develops a method for arbitrary motion completion based on continuous representations of human motion sequences. The proposed method, NAME, is a hierarchical implicit representation framework built on Implicit Neural Representations (INRs); it introduces a hierarchical temporal encoding mechanism that captures intricate temporal patterns at multiple scales. It also incorporates a parametric activation function powered by Fourier transformations in the MLP-based decoder to enhance the expressiveness of the continuous representation. The representation supports interpolating, inbetweening, and extrapolating motion sequences at arbitrary frame rates, and evaluations on several benchmark datasets demonstrate its effectiveness and robustness.
本文旨在通过探索人类运动序列的连续表示来解决任意运动补全的挑战。作者提出了一种名为NAME的新型参数激活诱导分层隐式表示框架,该框架使用隐式神经表示(INRs)和分层时间编码机制来在多个时间尺度上捕捉复杂的时序模式。该方法还结合了一个基于傅里叶变换的自定义参数激活函数,以增强连续表示的表达能力。在多个基准数据集上的实验结果表明,所提出的方法在任意运动补全任务中是有效且鲁棒的。
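To make the idea of a continuous motion representation in the entry above more concrete, here is a toy sketch of a coordinate-based (implicit) function of time with a sinusoidal parametric activation, queried at two different frame rates. It is not the NAME architecture: the single hidden layer, the hand-picked weights, and every name below are hypothetical, and nothing is trained. It only illustrates that one continuous function can be sampled on any time grid, including beyond the original time range.

```python
import math

def parametric_sine(x, amplitude, frequency, phase):
    """Sinusoidal activation with tunable amplitude, frequency and phase."""
    return amplitude * math.sin(frequency * x + phase)

# Hypothetical, hand-picked parameters for a decoder with one input,
# three hidden units and one output.
# Each tuple: (input weight, bias, amplitude, frequency, phase).
HIDDEN = [
    (1.0, 0.0, 1.0, 2.0 * math.pi, 0.0),
    (0.5, 0.1, 0.5, 4.0 * math.pi, 0.3),
    (2.0, -0.2, 0.25, 1.0, 1.0),
]
OUTPUT_WEIGHTS = [0.8, -0.4, 1.2]

def joint_angle(t):
    """Map a continuous time t (in seconds) to a single joint value."""
    hidden = [parametric_sine(w * t + b, a, f, p) for (w, b, a, f, p) in HIDDEN]
    return sum(wo * h for wo, h in zip(OUTPUT_WEIGHTS, hidden))

def sample(fps, duration):
    """Query the same continuous function on a grid at any frame rate."""
    n = int(round(fps * duration)) + 1
    return [joint_angle(i / fps) for i in range(n)]

frames_30 = sample(30, 1.0)    # 30 fps over one second
frames_120 = sample(120, 1.0)  # 120 fps over the same second
beyond_range = joint_angle(1.5)  # extrapolation past the sampled interval

print(len(frames_30), len(frames_120))
```

Every fourth 120 fps frame coincides with a 30 fps frame because both grids sample the same underlying function, which is the property that makes frame-rate-agnostic interpolation and inbetweening possible.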