WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Authors: Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
First: 2026-01-15T18:59:58+00:00 · Latest: 2026-01-15T18:59:58+00:00
Comments: Project Page: https://wild-rayzer.cs.virginia.edu/
Abstract
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
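The residual test and loss gating described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the threshold value, the per-pixel absolute residual, and the function names `pseudo_motion_mask` and `masked_l2_loss` are assumptions made for illustration, and images are plain nested lists of grayscale values.

```python
def pseudo_motion_mask(rendered, target, threshold=0.2):
    """Mark a pixel as transient (1) when the static render fails to explain it.

    `threshold` is a hypothetical value chosen for this toy example.
    """
    return [[1 if abs(r - t) > threshold else 0 for r, t in zip(r_row, t_row)]
            for r_row, t_row in zip(rendered, target)]


def masked_l2_loss(pred, target, mask):
    """Mean squared error over static pixels only (mask == 0).

    Transient pixels contribute nothing, so they receive no gradient.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 0:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 2x2 example: one pixel (top right) is occupied by a moving object.
rendered = [[0.5, 0.5], [0.5, 0.5]]   # output of a camera-only static renderer
target = [[0.5, 0.9], [0.45, 0.5]]    # observed frame
mask = pseudo_motion_mask(rendered, target)
loss = masked_l2_loss(rendered, target, mask)
```

In this toy case only the top-right pixel exceeds the threshold, so the loss is averaged over the remaining three static pixels.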
中文标题/摘要
标题:WildRayZer:动态环境中的自监督大规模视图合成
我们提出了WildRayZer,一种在动态环境中进行新颖视图合成(NVS)的自监督框架,其中相机和物体都在移动。动态内容破坏了静态NVS模型依赖的多视图一致性,导致鬼影、虚假几何结构和不稳定的姿态估计。WildRayZer 通过执行分析-合成测试来解决这一问题:仅相机的静态渲染器解释刚性结构,其残差揭示了瞬态区域。从这些残差中,我们构建伪运动掩码,提炼出一个运动估计器,并使用它来屏蔽输入标记和门控损失梯度,使监督集中在跨视图背景完成上。为了实现大规模训练和评估,我们整理了包含15000个随意捕捉的动态序列的真实世界数据集Dynamic RealEstate10K(D-RE10K),以及D-RE10K-iPhone,这是一个瞬态和干净的基准数据集,用于稀疏视图瞬态感知NVS。实验表明,WildRayZer 在瞬态区域去除和全帧NVS质量方面,单次前向传递时均优于基于优化和前馈的基线。
Summary / 总结
WildRayZer is a self-supervised framework for novel view synthesis in dynamic environments where both the camera and objects move. It addresses the issue of multi-view inconsistency by using a camera-only static renderer to explain rigid structure and constructing pseudo motion masks to focus supervision on cross-view background completion. Experiments show that WildRayZer outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
WildRayZer 是一种用于动态环境中的新颖视图合成的自监督框架,解决了鬼影和姿态估计不稳定等问题。它通过静态渲染器解释刚性结构,并利用残差构建伪运动掩码,从而将监督重点放在背景完成上。实验结果显示,WildRayZer 在移除过渡区域和全帧 NVS 质量方面优于基于优化和前馈的基本模型,且仅需一次前馈传递。
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
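To make the matching idea concrete, the sketch below assigns turn-level rewards by finding a one-to-one matching between predicted and ground-truth tool calls. It is a toy illustration under stated assumptions: the `similarity` score is hypothetical, the brute-force search stands in for a proper assignment solver, and only a hard one-to-one strategy is shown (the abstract mentions two strategies without detailing them).

```python
from itertools import permutations


def similarity(pred, gold):
    """Hypothetical score: 0 if tool names differ, otherwise
    0.5 plus 0.5 times the fraction of gold arguments reproduced."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    gold_args = gold["args"]
    if not gold_args:
        return 1.0
    hits = sum(1 for key, value in gold_args.items()
               if pred["args"].get(key) == value)
    return 0.5 + 0.5 * hits / len(gold_args)


def turn_level_rewards(preds, golds):
    """Maximize total similarity over one-to-one matchings.

    Slot indices >= len(golds) mean "unmatched", which yields reward 0.
    Brute force is only viable for a handful of turns.
    """
    n, m = len(preds), len(golds)
    best_total, best_rewards = -1.0, [0.0] * n
    for slots in permutations(range(m + n), n):
        rewards = [similarity(preds[i], golds[j]) if j < m else 0.0
                   for i, j in enumerate(slots)]
        total = sum(rewards)
        if total > best_total:
            best_total, best_rewards = total, rewards
    return best_rewards


preds = [
    {"tool": "search", "args": {"q": "a"}},   # correct call
    {"tool": "search", "args": {"q": "b"}},   # redundant call
    {"tool": "calc", "args": {"x": 1}},       # right tool, wrong argument
]
golds = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"x": 2}},
]
rewards = turn_level_rewards(preds, golds)
```

For realistic trajectory lengths, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop.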
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过交替进行推理步骤和外部工具交互,赋予大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优点,这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时间多轮场景中。为了解决这个问题,我们提出了MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导出密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明,MatchTIR具有优越性。值得注意的是,我们的4B模型在大多数8B竞争对手中表现更优,特别是在长时间多轮任务中。我们的代码可在https://github.com/quchangle1/MatchTIR获取。
Summary / 总结
The paper introduces MatchTIR, a framework that enhances Tool-Integrated Reasoning (TIR) for large language models by using bipartite matching to assign fine-grained turn-level rewards, combined with dual-level advantage estimation. This credit assignment distinguishes effective tool calls from redundant or erroneous ones, which matters most in long-horizon multi-turn tasks. Experiments on three benchmarks show that the 4B MatchTIR model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks.
MatchTIR 通过基于二分匹配的回合级奖励分配和双层优势估计提供细粒度监督,以增强大型语言模型的工具集成推理 (TIR)。该方法在长时程多轮场景中区分有效的工具调用和冗余或错误的调用。实验结果表明,MatchTIR 在三个基准测试上优于大多数 8B 竞争对手,特别是在长时程和多轮任务中。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以显著提高性能,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitations of current vision-language models (VLMs) by proposing Cross-Layer Injection (CLI), a dynamic many-to-many framework that enhances the interaction between visual and language modalities. CLI includes an Adaptive Multi-Projection (AMP) module for feature harmonization and an Adaptive Gating Fusion (AGF) mechanism for selective visual information injection. Experiments show CLI improves performance on 18 diverse benchmarks, demonstrating its effectiveness and scalability in achieving deeper multimodal understanding.
论文通过引入动态框架Cross-Layer Injection (CLI),解决了静态视觉-语言模型(VLMs)的局限性,CLI使视觉和语言模态之间能够形成多对多的连接。CLI包括一个适应性多投影(AMP)模块来协调来自不同视觉层的特征,以及一个适应性门控融合(AGF)机制,允许语言模型根据实时解码上下文选择性地注入相关视觉信息。在18个不同基准上的实验表明,CLI提高了性能并增强了模型将视觉和语言信息整合以进行连贯推理的能力。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to out-of-distribution (OOD) scenarios. We hypothesize that, due to the self-attention mechanism, each patch feature implicitly embeds information from all other patches, represented in a different way and with a different intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of the variance is captured by 17/64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the same scene: every random subset of patches acts like a different, yet still sensible and coherent, projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
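The per-frame masking step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the flat row-major patch list, and the choice to return surviving indices as the way of "preserving the spatial layout" are all assumptions.

```python
import random


def stochastic_patch_selection(patches, mask_rate, rng):
    """Randomly drop a fraction of patch descriptors for one frame.

    `patches` is a flat, row-major list of descriptors. The surviving
    descriptors are returned together with their original grid indices,
    so a downstream policy can still tell where each patch came from.
    """
    n = len(patches)
    n_keep = max(1, round(n * (1.0 - mask_rate)))
    keep = sorted(rng.sample(range(n), n_keep))
    return [patches[i] for i in keep], keep


rng = random.Random(0)
frame = [[float(i)] for i in range(64)]       # 64 toy one-dimensional descriptors
kept, positions = stochastic_patch_selection(frame, mask_rate=0.5, rng=rng)
```

Calling the function again on the same frame yields a different subset, which is what exposes the policy to many partial views of one scene during training.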
中文标题/摘要
标题:少看,开好车:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,训练于基础模型提取的补丁对齐特征上的策略在分布外(OOD)场景下的泛化能力更强。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性很强。在这样的重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了一种简单而有效的随机补丁选择(SPS)方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将其提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每随机子集的补丁像不同的,但仍合理的、连贯的世界投影。策略基于哪些特定标记存活的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均改进6.2%,在闭环模拟中最高可达20.4%,同时速度提高2.4倍。我们在遮蔽率和补丁特征重组、训练和评估9个系统中进行了消融研究,其中8个系统均超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
Summary / 总结
This paper addresses the issue of overfitting to spurious correlations in end-to-end autonomous driving policies trained on patch-aligned features from foundation models. The authors propose Stochastic-Patch-Selection (SPS), which randomly masks a fraction of patch descriptors in each frame, leading to more robust and generalizable policies. Experiments show that SPS outperforms the state of the art by up to 20.4% in closed-loop simulations, with a 6.2% average improvement and a 2.4x speedup compared to baseline methods.
本文针对端到端自动驾驶政策在从基础模型提取的补丁对齐特征上过拟合到虚假相关性的问题,提出了随机补丁选择(SPS)方法。该方法在每帧中随机遮掩部分补丁描述符,从而提高政策的鲁棒性和通用性。实验结果显示,SPS在闭环模拟中表现优于现有最佳方法,最高可达20.4%的提升,平均提升6.2%,且比基线方法快2.4倍。
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history.
For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
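A small sketch may help show how intent-indexed retrieval differs from pure semantic similarity. Everything here is hypothetical: the `Intent` fields mirror the three components named in the abstract, but the scoring weights, the `min_score` cutoff, and the function names are assumptions rather than the STITCH implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types


@dataclass
class Snippet:
    text: str
    intent: Intent


def compatibility(query, stored):
    """Hypothetical score: 2 for a goal match, 1 for an action match,
    plus 1 per shared entity type."""
    score = 0
    if query.goal == stored.goal:
        score += 2
    if query.action == stored.action:
        score += 1
    score += len(query.entity_types & stored.entity_types)
    return score


def retrieve(memory, query, k=2, min_score=3):
    """Filter snippets below `min_score`, then return the top-k texts."""
    scored = [(compatibility(query, s.intent), i, s) for i, s in enumerate(memory)]
    kept = [item for item in scored if item[0] >= min_score]
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [s.text for _, _, s in kept[:k]]


memory = [
    Snippet("Paris hotel: 200 per night",
            Intent("plan Paris trip", "lookup", frozenset({"hotel", "price"}))),
    Snippet("Rome hotel: 150 per night",
            Intent("plan Rome trip", "lookup", frozenset({"hotel", "price"}))),
    Snippet("Booked flight to Paris",
            Intent("plan Paris trip", "book", frozenset({"flight"}))),
]
query = Intent("plan Paris trip", "lookup", frozenset({"hotel"}))
results = retrieve(memory, query)
```

The Rome snippet is textually close to the query but belongs to a different goal, so it falls below the cutoff and is suppressed.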
中文标题/摘要
标题:将代理记忆扎根于上下文意图
在长期目标导向的交互中部署大型语言模型仍然具有挑战性,因为相似的实体和事实会在不同的潜在目标和约束下反复出现,导致记忆系统检索到上下文不匹配的证据。我们提出了STITCH(基于上下文历史的结构化意图跟踪),这是一种代理记忆系统,通过结构化的检索提示、上下文意图对每个轨迹步骤进行索引,并通过匹配当前步骤的意图来检索历史。上下文意图提供了紧凑的信号,消除了重复提及的歧义并减少了干扰:(1) 当前定义主题段落的潜在目标,(2) 动作类型,以及(3) 重要的实体类型,锚定哪些属性是相关的。在推理过程中,STITCH通过意图兼容性筛选和优先级排序记忆片段,抑制语义相似但上下文不匹配的历史记录。
为了评估,我们引入了CAME-Bench,这是一个基准测试,用于在现实、动态的目标导向轨迹中进行上下文感知检索。在CAME-Bench和LongMemEval上,STITCH达到了最先进的性能,比最强基线高出35.6%,随着轨迹长度的增加,性能提升最大。我们的分析表明,意图索引显著减少了检索噪声,支持意图感知的记忆,以实现稳健的长期推理。
Summary / 总结
The research addresses the challenge of deploying large language models in long-horizon, goal-oriented interactions by proposing STITCH, an agentic memory system that indexes each trajectory step with its contextual intent and retrieves history by matching the current step's intent. STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. Across CAME-Bench and LongMemEval, STITCH outperforms the strongest baseline by 35.6%, with the largest gains as trajectory length increases.
论文提出STITCH,一种使用上下文意图来定位代理记忆的系统,以解决大型语言模型在长期目标导向交互中的部署问题。STITCH 为每个步骤提供结构化的检索线索,并根据意图兼容性检索历史,这有助于消除重复提及的歧义并减少干扰。在CAME-Bench和LongMemEval上,STITCH 的表现优于最强基线35.6%,显示出在处理长轨迹方面的显著改进。
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的干预解释基准测试,带有参考目标)。LIBERTy基于明确定义的结构因果模型(SCM),文本生成中的干预传播通过SCM,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们在五个模型上评估了各种方法的范围,并确定了改进概念驱动解释的大量空间。LIBERTy还使模型对干预的敏感性系统分析成为可能:我们发现专有LLM在人口统计概念上的敏感性明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Summary / 总结
The research aims to improve the faithfulness of concept-based explanations for large language models (LLMs) by using structural counterfactuals. The main method involves creating LIBERTy, a framework that generates datasets with structural counterfactual pairs based on explicitly defined Structured Causal Models (SCMs). Key findings include the identification of substantial room for improvement in concept-based explanations and the discovery that proprietary LLMs show reduced sensitivity to demographic concepts, possibly due to post-training mitigation. The evaluation metric, order-faithfulness, is introduced to assess these explanations more accurately.
研究旨在通过开发一个新的基准LIBERTy来提高大型语言模型(LLM)的概念基础解释的忠实性。该框架使用结构化因果模型(SCMs)生成结构性反事实,然后用于评估解释。主要发现包括在概念基础解释方面存在显著改进空间,并且发现专有LLM对人口统计学概念的敏感度较低,这可能是由于后训练缓解策略所致。
Data-driven stochastic reduced-order modeling of parametrized dynamical systems
Authors: Andrew F. Ilersich, Kevin Course, Prasanth B. Nair
First: 2026-01-15T18:50:18+00:00 · Latest: 2026-01-15T18:50:18+00:00
Abstract
Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.
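The reason a reparametrization trick can remove the forward solver is that, when the approximate posterior has a known Gaussian marginal at every time, a time-integrated expectation can be estimated by sampling times and states directly. The sketch below illustrates only that general idea; the marginal mean and standard deviation functions, the integrand, and the function name are hypothetical, and the actual objective used by the authors is not reproduced here.

```python
import random


def mc_time_integral(mean_fn, std_fn, integrand, t_end, n_samples, rng):
    """Monte Carlo estimate of the integral over [0, t_end] of
    E[integrand(z(t), t)], where z(t) ~ N(mean_fn(t), std_fn(t)**2).

    No time stepping is needed: each sample draws a random time and a
    reparametrized state z = mean + std * eps with eps ~ N(0, 1).
    """
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.0, t_end)
        eps = rng.gauss(0.0, 1.0)
        z = mean_fn(t) + std_fn(t) * eps
        total += integrand(z, t)
    return t_end * total / n_samples


rng = random.Random(0)
estimate = mc_time_integral(
    mean_fn=lambda t: t,          # toy marginal mean
    std_fn=lambda t: 0.5,         # toy marginal standard deviation
    integrand=lambda z, t: z * z,
    t_end=1.0,
    n_samples=20000,
    rng=rng,
)
# Analytic value for this toy case: 1/3 + 0.25
```

Because the cost depends on the number of sampled times rather than on a solver's step count, an estimator of this kind is unaffected by stiffness, which is consistent with the cost claim in the abstract.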
中文标题/摘要
标题:基于数据驱动的参数化动力系统随机降阶建模
在不同条件下建模复杂的动力系统计算强度大,通常使高保真模拟难以实现。尽管降阶模型(ROMs)提供了有希望的解决方案,但当前方法往往难以处理随机动力学并无法量化预测不确定性,限制了其在稳健决策环境中的应用。为解决这些挑战,我们提出了一种基于数据驱动的框架,用于学习在参数空间和激励条件下泛化的连续时间随机ROMs。我们的方法基于可约化随机变分推断,利用马尔可夫高斯过程的重参数化技巧,在训练过程中消除昂贵的前向求解器需求。这使我们能够同时学习概率自编码器和支配潜在动力学的随机微分方程,计算成本与数据集大小和系统刚度无关。此外,如果可用,我们的方法还提供了物理信息先验的灵活性。我们通过三个具有挑战性的测试问题进行了数值研究,展示了在未见过的参数组合和激励条件下的出色泛化能力,并与现有方法相比实现了显著的效率提升。
Summary / 总结
The paper addresses the computational challenges of modeling complex dynamical systems under varying conditions by introducing a data-driven framework for learning continuous-time stochastic reduced-order models (ROMs). This framework uses amortized stochastic variational inference and a reparametrization trick for Markov Gaussian processes to avoid expensive forward solvers during training. Key findings include excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing methods.
论文提出了一种数据驱动的方法,用于学习连续时间的随机降阶模型(ROMs),以应对复杂动态系统在不同条件下的建模计算挑战。该方法使用了带有马尔可夫高斯过程重参数技巧的近似随机变分推断,能够在训练和泛化方面高效处理参数空间和激励条件的变化。数值研究显示,该方法在未见过的参数组合和激励条件下表现出色,并且相比现有方法具有显著的效率提升。
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
The hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study of its reasoning patterns and find three surprising facts: (a) Failure on extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct. (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All three facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
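The "multiple fixed points" picture and the effect of input perturbation can be illustrated with a toy iterative map. This is an analogy only: the scalar update rule, the perturbation values, and the validity check are assumptions, and none of it reproduces HRM itself. The map below has two attracting fixed points, -1 and +1, and the "correct" answer is taken to be +1.

```python
def iterate_to_fixed_point(x, steps=60):
    """Iterate a toy update with attracting fixed points at -1 and +1."""
    for _ in range(steps):
        x = x + 0.5 * (x - x ** 3)
    return round(x, 6)


def first_valid_guess(x0, perturbations, is_valid):
    """Run the iteration from several perturbed inputs and return the
    first result accepted by a checker, along with all guesses."""
    guesses = [iterate_to_fixed_point(x0 + d) for d in perturbations]
    for guess in guesses:
        if is_valid(guess):
            return guess, guesses
    return None, guesses


# A single run from x0 = -0.1 is trapped at the wrong fixed point.
single = iterate_to_fixed_point(-0.1)

# Perturbing the input yields several guesses, some of which are correct.
# Fixed offsets are used here for reproducibility; in practice they would be random.
answer, guesses = first_valid_guess(
    -0.1, [-0.3, 0.0, 0.3, 0.6], is_valid=lambda g: g == 1.0
)
```

The checker plays the role of verifying a candidate solution, which is cheap for puzzles such as Sudoku; how the paper actually aggregates guesses is not stated in the abstract.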
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势和潜在失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案变得正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且可能会暂时或永远被困在那里。所有这些事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理的随机性来扩展猜测的数量)和模型自举(通过利用训练的随机性来扩展猜测的数量)。在实际应用方面,通过结合所有方法,我们开发了增强的HRM,将数独极端难度的准确率从54.5%提升到96.9%。在科学方面,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
The study investigates the reasoning patterns of the hierarchical reasoning model (HRM) and finds that it can fail on extremely simple puzzles due to a violation of the fixed point property, exhibits "grokking" dynamics where a single critical step suddenly makes the answer correct, and can get trapped in incorrect fixed points. These findings suggest that HRM is "guessing" more than "reasoning". To scale these guesses, the study proposes data augmentation, input perturbation, and model bootstrapping; combining them boosts accuracy on Sudoku-Extreme from 54.5% to 96.9%. The analysis provides new insights into the reasoning mechanisms of such models.
研究探讨了层次推理模型(HRM)的推理模式,发现它们在简单谜题上经常失败,因为违反了固定点性质,表现出“顿悟”动态,即答案突然变得正确,还可能陷入错误的固定点。这些发现表明,HRM 更像是“猜测”而不是“推理”。为了改进 HRM,研究提出了数据增强、输入扰动和模型自举等策略,将数独极端问题的准确率从54.5% 提升到96.9%。这一分析为理解这些模型的推理机制提供了新的见解。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:促进长期目标感知一致进化的框架
大型语言模型(LLMs)已成为进化搜索的强大操作员,然而高效的搜索支架设计仍缺乏系统方法。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,实验历史偏差未来候选生成;模式崩溃,由于探索与利用平衡不佳,代理在局部最小值中停滞;弱协作,僵化的杂交策略无法有效利用并行搜索轨迹。我们引入了目标感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以解决上下文污染;基于动量的回溯(MBB)以逃离局部最小值;以及一种自适应采样策略,统一了回溯和杂交以进行动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一条系统的方法,以实现一致的长期自改进,其在LLM-SR和KernelBench上达到了最先进的结果,同时发现超越Modded NanoGPT记录的解决方案。
Summary / 总结
The paper addresses the challenges in evolutionary search using Large Language Models (LLMs), such as context pollution, mode collapse, and weak collaboration. It introduces PACEvolve, a framework that uses hierarchical context management, momentum-based backtracking, and a self-adaptive sampling policy to mitigate these issues. PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench and discovers solutions that surpass previous records on Modded NanoGPT.
论文通过引入PACEvolve框架来管理大型语言模型(LLMs)的进化过程,该框架解决了上下文污染、局部最小值陷阱和协作不足的问题。PACEvolve利用分层上下文管理、动量回溯和自适应采样策略来改进搜索动态。实验结果表明,PACEvolve在LLM-SR和KernelBench上达到了最先进的性能,并发现了超越Modded NanoGPT记录的解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
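Of the three validation measures, temporal concordance is the easiest to show in code. The sketch below computes a standard pairwise concordance index between predicted and reference event times; the handling of ties (reference ties skipped, predicted ties counted as 0.5) is a common convention and is an assumption here, since the abstract does not specify it.

```python
from itertools import combinations


def concordance_index(pred_times, true_times):
    """Fraction of comparable event pairs whose predicted order agrees
    with the reference order. Pairs tied in the reference are skipped;
    pairs tied in the prediction count as half-concordant."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(true_times)), 2):
        if true_times[i] == true_times[j]:
            continue
        comparable += 1
        d_pred = pred_times[i] - pred_times[j]
        d_true = true_times[i] - true_times[j]
        if d_pred == 0:
            concordant += 0.5
        elif (d_pred > 0) == (d_true > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Toy timeline in hours: the second and third events are swapped in the prediction.
true_hours = [0, 24, 48, 72]
pred_hours = [0, 30, 20, 72]
c_index = concordance_index(pred_hours, true_hours)
```

In this example five of the six event pairs are ordered correctly, so the index is 5/6.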
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床病历记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699份单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为结构化的文本时间线(事件,时间)对。该语料库包含超过560万条带时间戳的事件,以及提取的 demographics 和诊断信息。技术验证使用了临床专家审核的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和通过对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持再现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和代码在公共存储库中公开可用。
Summary / 总结
The research introduces PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines using a scalable large-language-model pipeline. The corpus includes over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with AULTC. The corpus enables research on timeline extraction, temporal reasoning, survival modeling, and event forecasting from narrative text, with broad diagnostic and demographic coverage. Data and code are openly available.
研究引入了PMOA-TTS语料库,包含124,699篇PubMed开放获取病例报告,通过可扩展的大语言模型管道转换为结构化时间线。语料库包括超过5.6百万个时间戳事件和人口统计信息。技术验证显示在语义事件匹配、时间一致性以及对齐误差方面表现良好。该语料库可用于从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究,具有广泛的诊断和人口统计学覆盖范围。数据和代码已公开可用。
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:文化与多语言长视频推理基准
近期视频模型的发展取得了显著进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,引入了评估中的显著偏差。为解决这一问题,我们引入了CURVE(视频评价中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图表,并提出了一种使用这些图表的新型迭代策略,以识别推理中的细微错误。我们的评估表明,最先进的视频大模型面临巨大挑战,其性能远低于人类水平,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 公开可用。
Summary / 总结
The research aims to address the bias in current video understanding benchmarks by introducing CURVE, a new benchmark for multicultural and multilingual video reasoning. It uses high-quality, human-generated annotations from diverse cultural videos across 18 global locales and provides complex questions and answers in native languages. The study finds that state-of-the-art Video-LLMs perform significantly below human-level accuracy, mainly due to difficulties in visual perception of cultural elements. The benchmark will be publicly available at https://github.com/google-deepmind/neptune#minerva-cultural.
研究旨在通过引入CURVE这一新的多文化多语言视频推理基准,解决当前视频理解基准中的偏见问题。该基准使用来自18个全球区域的多样化文化视频的高质量、人工生成注释,并以本地语言提供复杂的问题和答案。研究发现,最先进的Video-LLMs的性能明显低于人类水平,主要原因是难以识别文化元素的视觉特征。该基准将公开发布在https://github.com/google-deepmind/neptune#minerva-cultural。
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3-4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
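The architectural change can be sketched for a single token. In a gated FFN the hidden activation is usually the elementwise product of an activated gate projection and an up-projection of the input; the sketch below replaces the up-projection with a row looked up by token id, as the abstract describes. The SiLU activation, the tiny dimensions, and all names are assumptions for illustration, not the authors' code.

```python
import math


def silu(v):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def stem_ffn(x, token_id, w_gate, w_down, layer_embedding):
    """Gated FFN where the up-projection of x is replaced by a
    token-indexed embedding lookup local to this layer."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = layer_embedding[token_id]            # lookup instead of w_up @ x
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy sizes: d_model = 2, d_ff = 3, vocabulary of 2 tokens.
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
layer_embedding = [[1.0, 1.0, 1.0], [2.0, 5.0, -1.0]]
out = stem_ffn([1.0, 0.0], token_id=1, w_gate=w_gate, w_down=w_down,
               layer_embedding=layer_embedding)
```

Since a gated FFN has three weight matrices of equal size, dropping the dense up-projection from the per-token computation is consistent with the abstract's figure of roughly one-third of FFN parameters.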
中文标题/摘要
标题:STEM:使用嵌入模块扩展变换器
细粒度稀疏性在不按比例增加每词计算量的情况下承诺了更高的参数容量,但通常会遭受训练不稳定性、负载均衡和通信开销的问题。我们引入了STEM(使用嵌入模块扩展变换器),这是一种静态、按词索引的方法,用局部嵌入查找替换FFN的上投影,同时保持门和下投影密集。这消除了运行时路由,允许CPU卸载和异步预取,并将容量与每词FLOPs和跨设备通信脱钩。实验证明,尽管稀疏性极端,STEM仍能稳定训练。它在减少每词FLOPs和参数访问次数的同时,提高了下游性能(消除大约三分之一FFN参数)。STEM学习具有大角度扩展的嵌入空间,增强了其知识存储容量。更有趣的是,这种增强的知识容量伴随着更好的可解释性。STEM嵌入的按词索引性质允许以简单的方式在不干预输入文本或增加额外计算的情况下进行知识编辑和知识注入。此外,STEM增强了长上下文性能:随着序列长度的增长,更多的不同参数被激活,从而实现实际的测试时容量扩展。在350M和1B模型规模下,STEM总体上提供了高达约3-4%的准确率改进,特别是在知识和推理密集型基准(ARC-Challenge、OpenBookQA、GSM8K、MMLU)上取得了显著进步。总体而言,STEM是一种有效的方法,可以在提供更好的可解释性、更好的训练稳定性和改进的效率的同时扩展参数内存。
Summary / 总结
STEM is a static, token-indexed approach to scaling transformers that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and reduces per-token FLOPs and parameter accesses. Experiments at the 350M and 1B scales show stable training despite extreme sparsity and accuracy improvements of up to ~3-4% overall over dense baselines, with notable gains on knowledge and reasoning-heavy benchmarks, alongside better interpretability and long-context performance.
STEM 是一种静态的、按 token 索引的 transformer 扩展方法:用层局部嵌入查找替换 FFN 上投影,同时保持门控和下投影为稠密,从而消除运行时路由,并支持 CPU 卸载与异步预取。实验表明,STEM 在极端稀疏性下仍能稳定训练,并在减少每 token FLOPs 和参数访问的同时提高了下游性能。此外,它还增强了可解释性和长上下文性能,总体准确率提升最高约 3-4%,在知识和推理密集型基准上提升尤为明显。
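To make the STEM idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a SwiGLU-style gated FFN of the form `W_down (act(W_gate x) * up)`, where the usual dense `up = W_up x` is swapped for a per-layer table row `E[token_id]`. The SiLU activation, the tiny dimensions, and all names (`stem_ffn`, `W_gate`, `W_down`, `E`) are illustrative assumptions.

```python
import math
import random


def silu(v):
    """SiLU activation (assumed here; the abstract does not name the activation)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def stem_ffn(x, token_id, W_gate, W_down, E):
    """Gated FFN where the up-projection is a token-indexed lookup.

    x        : hidden state for one token, length d_model
    token_id : vocabulary index of that token
    E        : layer-local table, shape [vocab][d_ff] (hypothetical layout)
    """
    gate = [silu(g) for g in matvec(W_gate, x)]  # dense gate, depends on x
    up = E[token_id]                             # static lookup, no matmul, no router
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)                # dense down-projection


random.seed(0)
d_model, d_ff, vocab = 4, 8, 10
W_gate = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
W_down = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
E = [[random.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(vocab)]

x = [0.5, -1.0, 0.25, 2.0]
y = stem_ffn(x, 3, W_gate, W_down, E)
```

Because the row `E[token_id]` depends only on the token, overwriting one row changes the layer's output for that token and leaves every other token untouched. This is one way to read the abstract's claim about simple knowledge editing.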
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Authors: Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu
First: 2026-01-15T17:52:29+00:00 · Latest: 2026-01-15T17:52:29+00:00
Comments: Project Page: https://igl-hkust.github.io/CoMoVi/
Abstract
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
中文标题/摘要
标题:CoMoVi: 同步生成3D人体动作和逼真视频
在本文中,我们发现3D人体动作生成和2D人体视频生成是内在耦合的。3D动作提供了视频中合理性和一致性的结构先验,而预训练的视频模型为动作提供了强大的泛化能力,这需要耦合它们的生成过程。基于此,我们提出了CoMoVi,一种将两个视频扩散模型(VDMs)耦合在一起的同步生成框架,可以在单一扩散去噪循环中同步生成3D人体动作和视频。为此,我们首先提出了一种有效的2D人体动作表示,可以继承预训练VDMs的强大先验。然后,我们设计了一种双分支扩散模型,通过相互特征交互和3D-2D跨注意力来耦合人体动作和视频生成过程。此外,我们构建了CoMoVi数据集,这是一个包含文本和动作注释的大规模真实世界人体视频数据集,涵盖了多样且具有挑战性的动作。广泛的实验表明,我们的方法在3D人体动作和视频生成任务中均具有有效性。
Summary / 总结
The paper addresses the intrinsic coupling between 3D human motion generation and 2D human video generation, proposing CoMoVi, a co-generative framework that uses two video diffusion models to generate 3D human motions and videos synchronously. It introduces an effective 2D human motion representation and a dual-branch diffusion model with mutual feature interaction and 3D-2D cross attentions to couple motion and video generation. The method is validated through extensive experiments on 3D human motion and video generation tasks, demonstrating its effectiveness.
研究旨在将3D人体动作和2D人体视频的生成过程耦合起来,以提高合理性和一致性。CoMoVi是一种共生生成框架,将两个视频扩散模型结合起来,同步生成3D动作和视频。实验结果表明,该方法能够有效生成3D人体动作和逼真的视频,双分支扩散模型和3D-2D交叉注意力机制提高了动作和视频生成过程的耦合。CoMoVi数据集是一个大规模的真实世界人体视频数据集,支持这些发现。
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:双不确定性引导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂的推理还是模糊的感知,从而阻碍了探索或学习信号的针对性分配。为解决这一问题,我们引入了DUPL,这是一种多模态RLVR中的双不确定性引导策略学习方法,通过对称KL散度量化和利用感知不确定性(以及通过策略熵利用输出不确定性)来引导策略更新。通过建立不确定性驱动的反馈循环并采用动态分支优先机制,DUPL重新校准策略优势,使其专注于具有高感知或决策模糊性的状态,从而实现有效的目标探索,超越被动的数据增强。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提升高达11.2%,通用领域推理任务上的准确率提升高达7.1%,并且始终优于GRPO。这些结果表明,双不确定性引导的策略学习是一种有效且通用的方法,适用于多模态RLVR。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models by addressing the limitations of existing reinforcement learning methods that treat visual inputs deterministically. DUPL, a dual-uncertainty guided policy learning approach, quantifies perceptual and output uncertainties to guide policy updates, focusing on states with high ambiguity. This method improves Qwen2.5-VL 3B and 7B models, achieving up to 11.2% accuracy gains on visual math tasks and up to 7.1% on general-domain reasoning tasks, outperforming GRPO across six benchmarks.
该论文提出了DUPL,一种用于多模态可验证奖励强化学习(RLVR)的双重不确定性引导策略学习方法,该方法量化并利用感知不确定性和输出不确定性来引导策略更新。通过解决现有方法将视觉输入视为确定性的局限性,DUPL重新校准策略优势,聚焦于具有高感知或决策不确定性状态,从而在视觉数学和通用领域推理任务上分别实现了高达11.2%和7.1%的准确率提升,优于基线GRPO。
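The two uncertainty signals named in the DUPL abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's method. It assumes perceptual uncertainty is a symmetric KL divergence between the policy's output distributions for a clean and a perturbed view of the same image, and output uncertainty is the Shannon entropy of the policy distribution. The `reweighted_advantage` rule and its `alpha`/`beta` coefficients are hypothetical; the abstract does not give the actual recalibration formula.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetric KL as KL(p||q) + KL(q||p); some conventions add a 1/2 factor."""
    return kl(p, q) + kl(q, p)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def reweighted_advantage(adv, perceptual_u, output_u, alpha=1.0, beta=1.0):
    """Hypothetical rule: scale the advantage up where either uncertainty is high."""
    return adv * (1.0 + alpha * perceptual_u + beta * output_u)


# Toy answer distributions for the same question under two views of the image.
p_clean = [0.7, 0.2, 0.1]
p_perturbed = [0.4, 0.4, 0.2]

u_perc = symmetric_kl(p_clean, p_perturbed)  # grows as the two views disagree
u_out = entropy(p_clean)                     # grows as the policy is less decided
scaled = reweighted_advantage(1.0, u_perc, u_out)
```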
Data-Driven Dynamic Factor Modeling via Manifold Learning
Authors: Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
First: 2025-06-24T18:40:40+00:00 · Latest: 2026-01-15T17:50:32+00:00
Abstract
We introduce a data-driven dynamic factor framework for modeling the joint evolution of high-dimensional covariates and responses without parametric assumptions. Standard factor models applied to covariates alone often lose explanatory power for responses. Our approach uses anisotropic diffusion maps, a manifold learning technique, to learn low-dimensional embeddings that preserve both the intrinsic geometry of the covariates and the predictive relationship with responses. For time series arising from Langevin diffusions in Euclidean space, we show that the associated graph Laplacian converges to the generator of the underlying diffusion. We further establish a bound on the approximation error between the diffusion map coordinates and linear diffusion processes, and we show that ergodic averages in the embedding space converge under standard spectral assumptions. These results justify using Kalman filtering in diffusion-map coordinates for predicting joint covariate-response evolution. We apply this methodology to equity-portfolio stress testing using macroeconomic and financial variables from Federal Reserve supervisory scenarios, achieving mean absolute error improvements of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
中文标题/摘要
标题:基于流形学习的数据驱动动态因子建模
我们提出了一种数据驱动的动态因子框架,用于在无需参数假设的情况下建模高维协变量和响应的联合演变。单独应用于协变量的标准因子模型往往在解释响应方面失去效力。我们的方法使用各向异性扩散映射,这是一种流形学习技术,来学习低维嵌入,同时保留协变量的内在几何结构和与响应的预测关系。对于来自欧几里得空间朗文扩散的时间序列,我们证明了相关的图拉普拉斯算子收敛于基础扩散的生成器。我们进一步建立了扩散映射坐标与线性扩散过程之间近似误差的上界,并证明在嵌入空间中的遍历平均值在标准谱假设下收敛。这些结果证明了在扩散映射坐标中使用卡尔曼滤波预测协变量-响应联合演变的有效性。我们使用联邦储备监管情景中的宏观经济和金融变量将该方法应用于股票组合压力测试,实现了与经典情景分析相比高达55%的平均绝对误差改进,以及与主成分分析基准相比39%的改进。
Summary / 总结
This paper presents a data-driven dynamic factor model using anisotropic diffusion maps to model the joint evolution of high-dimensional covariates and responses without parametric assumptions. The approach preserves the intrinsic geometry of covariates and their predictive relationship with responses. The methodology is applied to equity-portfolio stress testing, showing improvements in mean absolute error of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
该研究提出了一种数据驱动的动态因子框架,用于在无需参数假设的情况下建模高维协变量和响应的联合演化。它使用各向异性扩散映射来学习同时保留协变量内在几何结构及其与响应预测关系的低维嵌入。该方法应用于股票组合压力测试,平均绝对误差的改善幅度最高达到相对于经典情景分析的55%和相对于主成分分析基准的39%。
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们得出两项发现。首先,置信度阈值化在分布内提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率f
Summary / 总结
The research examines whether confidence-based abstention gives vision-language models reliable, predictable control over error rates in video question answering, so that systems can abstain when uncertain in high-stakes applications. Using the NExT-QA benchmark with Gemini 2.0 Flash as the model, it finds that confidence thresholding provides mechanistic control in-distribution: sweeping the threshold epsilon produces smooth risk-coverage tradeoffs. The study also investigates whether this control remains robust under distribution shift.
该研究探讨了在视频问答中使用基于置信度的弃权策略来控制错误率,以确保高风险应用中的可靠性。研究以 Gemini 2.0 Flash 为模型、在 NExT-QA 上进行实验,发现置信度阈值在分布内提供了对错误率的机制性控制:扫掠阈值可在风险和覆盖率之间实现平滑的权衡。研究还考察了这种控制在分布偏移下是否保持稳健。
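The threshold sweep described in this entry can be illustrated with a small risk-coverage computation. The sketch assumes the model answers only when its confidence is at least the threshold epsilon, that coverage is the fraction of questions answered, and that risk is the error rate among answered questions. The records below are synthetic toy data, not results from the paper.

```python
def risk_coverage(records, threshold):
    """Return (coverage, risk) when answering only if confidence >= threshold.

    records: list of (confidence, is_correct) pairs.
    """
    answered = [ok for conf, ok in records if conf >= threshold]
    coverage = len(answered) / len(records)
    # Risk is undefined with no answers; report 0.0 in that case by convention.
    risk = sum(1 for ok in answered if not ok) / len(answered) if answered else 0.0
    return coverage, risk


# Synthetic (confidence, correct) pairs for illustration only.
records = [
    (0.95, True), (0.90, True), (0.80, False),
    (0.70, True), (0.55, False), (0.40, False),
]

# Raising epsilon trades coverage for lower risk on this toy set.
curve = [(eps, *risk_coverage(records, eps)) for eps in (0.0, 0.5, 0.75, 0.9)]
for eps, cov, risk in curve:
    print(f"epsilon={eps:.2f}  coverage={cov:.2f}  risk={risk:.2f}")
```

On real data the curve need not be monotone; how smooth it is, and whether it holds under distribution shift, are the questions the paper studies.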
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs生成的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、用于微调的自由形式视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕任务中优于其他开放权重和数据模型,并在长视频任务中具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并超越了如Gemini 3 Pro等私有模型的某些任务(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
Molmo2 is a new family of open-weight vision-language models that is state-of-the-art among open-source models in video understanding and adds point-driven grounding in single-image, multi-image, and video tasks. The research addresses the lack of open foundations for improving video-language models by releasing 9 new datasets (7 video and 2 multi-image), collected without closed VLMs, together with a training recipe. The 8B model outperforms other open weight-and-data models on short videos, counting, and captioning and is competitive on long videos. On video grounding, it beats Qwen3-VL on video counting (35.5 vs 29.6 accuracy) and surpasses Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking).
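Molmo2 是一个新的开放权重视觉-语言模型家族,在开源模型中达到最先进水平,并在单图像、多图像和视频任务中具备基于指点的定位能力。研究通过提供 9 个新数据集(7 个视频数据集和 2 个多图像数据集,均未使用封闭 VLM 收集)和训练方案,解决了改进视频语言模型缺乏开源基础的问题。其 8B 模型在短视频、计数和字幕任务上优于其他开放权重和数据的模型,并在长视频任务上具有竞争力。在视频定位方面,Molmo2 在视频计数上优于 Qwen3-VL(准确率 35.5 对 29.6),并在部分任务上超越 Gemini 3 Pro(视频指点 F1 为 38.4 对 20.0,视频跟踪 J&F 为 56.2 对 41.1)。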
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造性问题解决的LLMs独特性意识强化学习
强化学习(RL)已成为后训练大型语言模型(LLMs)的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在一小套主导的推理模式上,提高了pass@1,但限制了rollout级别的多样性以及pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解集多样性的正则化。为了解决这个问题,我们提出了独特性意识强化学习,这是一种rollout级别的目标,明确奖励那些表现出罕见高级策略的正确解。该方法使用基于LLM的裁判将相同问题的rollout根据其高级解策略聚类,忽略表面差异,并将策略优势重新加权为与聚类大小成反比。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@$k$,并增加了pass@$k$曲线下的面积(AUC@$K$),同时不牺牲pass@1,同时保持探索并揭示更多多样化的解策略。
Summary / 总结
The paper addresses exploration collapse in reinforcement learning for large language models, where policies prematurely concentrate on a few dominant reasoning patterns. It introduces Uniqueness-Aware Reinforcement Learning, which rewards correct solutions that use rare high-level strategies to promote diversity. The method uses an LLM-based judge to cluster rollouts for the same problem by high-level solution strategy and reweights policy advantages inversely with cluster size. Experiments on mathematics, physics, and medical reasoning benchmarks show consistent improvements in pass@k and a higher AUC@K without sacrificing pass@1, while uncovering more diverse solution strategies.
论文针对强化学习在大型语言模型(LLMs)中出现的探索崩溃问题,即策略过早地集中于少数主导推理模式。提出了一种独特性感知(Uniqueness-Aware)强化学习方法,通过基于LLM的裁判按高层解题策略对同一问题的rollout进行聚类,并按聚类大小成反比地重新加权策略优势,从而奖励表现出罕见高层策略的正确解。该方法在数学、物理和医学推理基准上提高了pass@k,增加了AUC@K,同时不牺牲pass@1并保持探索多样性。
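The reweighting step in this entry can be sketched in a few lines. This is an assumption-laden illustration rather than the authors' objective. It takes the strategy cluster of each rollout as already given (in the paper an LLM-based judge produces the clusters), and it uses the simplest inverse weighting, `advantage / cluster_size`. The cluster labels, the function name, and the handling of incorrect rollouts are hypothetical.

```python
from collections import Counter


def uniqueness_weighted_advantages(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    advantages  : per-rollout advantages for one problem
    cluster_ids : strategy label per rollout (assumed to come from a judge)
    """
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


# Five rollouts for one problem: three share a strategy, one is a rare
# correct strategy, one is incorrect.
advantages = [1.0, 1.0, 1.0, 1.0, -1.0]
clusters = ["algebra", "algebra", "algebra", "geometry", "guess"]

weighted = uniqueness_weighted_advantages(advantages, clusters)
# The lone "geometry" rollout keeps its full advantage, while each of the
# three redundant "algebra" rollouts receives one third of it.
```

A practical variant would probably renormalize the weights within each group of rollouts so the overall gradient scale stays comparable; that detail is left out here.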
Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming
Authors: Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe
First: 2026-01-15T17:23:39+00:00 · Latest: 2026-01-15T17:23:39+00:00
Comments: 19 pages
Abstract
Adaptive video streaming has improved video delivery over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders for Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. The JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29% to maintain the same XPSNR, compared to a widely-used fixed ladder. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38% bitrate savings and 6.17% decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.
中文标题/摘要
标题:多目标帕累托前沿优化以实现高效的自适应VVC流媒体
自适应视频流媒体在过去几年中促进了视频流媒体的改进。为了实现高效、内容和编解码器依赖的自适应视频流媒体,需要在比特率、视频质量和解码复杂性等编码性能目标之间取得平衡。本文提出了一种多目标帕累托前沿(PF)优化框架,以构建质量单调、内容自适应的Versatile Video Coding (VVC)流媒体比特率梯度,该框架联合优化视频质量、比特率和解码时间,后者用作解码能耗的实用代理。介绍了两种策略:联合速率-质量和时间帕累托前沿(JRQT-PF)和联合质量和时间帕累托前沿(JQT-PF),每种策略探索不同的权衡形式和目标优先级。在自适应流媒体过程中,根据质量单调性约束构建梯度,以确保一致的用户体验(QoE)。在大规模UHD数据集(Inter-4K)上进行了实验,使用PSNR、VMAF和XPSNR评估质量,使用解码时间和能耗测量复杂性。JQT-PF方法平均节省了11.76%的比特率,同时将平均解码时间减少了0.29%,以保持相同的XPSNR,与广泛使用的固定梯度相比。更激进的配置在复杂性增加的情况下可节省高达27.88%的比特率。另一方面,JRQT-PF策略提供了更可控的权衡,实现了6.38%的比特率节省和6.17%的解码时间减少。该框架优于现有方法,包括固定梯度、基于VMAF和XPSNR的动态分辨率选择以及复杂性感知基准。结果表明,带解码时间约束的PF优化能够实现针对网络和设备能力的可持续、高质量流媒体。
Summary / 总结
This paper proposes a multi-objective Pareto-front optimization framework for adaptive Versatile Video Coding (VVC) streaming that builds quality-monotonic, content-adaptive bitrate ladders balancing video quality, bitrate, and decoding time. Two strategies, JRQT-PF and JQT-PF, are introduced to explore different tradeoffs. Experiments on a large-scale UHD dataset (Inter-4K) show that, compared to a widely used fixed ladder at the same XPSNR, the JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29%, and the JRQT-PF strategy offers more controlled tradeoffs with 6.38% bitrate savings and a 6.17% decoding time reduction. The proposed framework outperforms existing methods, including fixed ladders, dynamic resolution selection, and complexity-aware benchmarks.
本文提出了一种多目标帕累托前沿优化框架,用于自适应Versatile Video Coding (VVC) 流媒体,旨在平衡码率、视频质量和解码时间。引入了两种策略JRQT-PF和JQT-PF,以探索不同的权衡。实验结果表明,JQT-PF方法在保持相同XPSNR的情况下,平均码率节省了11.76%,解码时间减少了0.29%;而JRQT-PF策略提供了更可控的权衡,实现了6.38%的码率节省和6.17%的解码时间减少。该框架在效率和质量方面优于现有方法。
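The two ingredients named in this entry, a Pareto front over (bitrate, quality, decoding time) and a quality-monotonic ladder, can be sketched generically. This is not the JRQT-PF or JQT-PF procedure from the paper. It assumes each candidate encoding is a tuple `(bitrate_kbps, quality, decode_time_s)`, where bitrate and time are minimized and quality is maximized. The candidate values are made up for illustration.

```python
def dominates(a, b):
    """True if encoding a is no worse than b on every objective and better on one."""
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(points):
    """Keep only encodings that no other encoding dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def monotonic_ladder(points):
    """Walk up in bitrate and keep a rung only if quality strictly increases."""
    ladder = []
    for p in sorted(points, key=lambda p: p[0]):
        if not ladder or p[1] > ladder[-1][1]:
            ladder.append(p)
    return ladder


# Hypothetical candidates: (bitrate_kbps, quality, decode_time_s)
candidates = [
    (1000, 34.0, 1.0),
    (1200, 33.5, 1.2),  # dominated by the first candidate
    (2000, 37.0, 1.5),
    (2500, 36.5, 1.1),  # on the front (fast to decode) but breaks monotonicity
    (4000, 40.0, 2.0),
]

front = pareto_front(candidates)
ladder = monotonic_ladder(front)
```

The fourth candidate shows why the monotonicity constraint is a separate step: with three objectives, a point can sit on the Pareto front because it decodes quickly and still lower the quality when a player switches up to it.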
RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
Authors: Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian
First: 2026-01-15T17:23:19+00:00 · Latest: 2026-01-15T17:23:19+00:00
Abstract
Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
中文标题/摘要
标题:RSATalker:面向多轮对话的逼真社交感知说话头像生成
头像生成在虚拟现实(VR)中越来越重要,尤其是在涉及多轮对话的社会场景中。现有方法面临显著的局限性:基于网格的3D方法可以建模双人对话,但缺乏现实的纹理,而基于大型模型的2D方法产生自然外观但计算成本高昂。最近,基于3D高斯散点图(3DGS)的方法实现了高效的现实渲染,但仍仅支持单人说话且忽略了社会关系。我们引入了RSATalker,这是第一个利用3DGS进行现实社会意识的多轮对话生成头像生成框架。我们的方法首先从语音驱动网格基3D面部运动,然后将3D高斯点绑定到网格面以渲染高保真2D化身视频。为了捕捉人际动态,我们提出了一种社会意识模块,通过可学习的查询机制将社会关系,包括血缘和非血缘以及平等和不平等,编码为高级嵌入。我们设计了三阶段训练范式,并构建了包含社会关系标注的RSATalker数据集,带有语音-网格-图像三元组。大量实验表明,RSATalker在现实性和社会意识方面均达到最先进的性能。代码和数据集将被发布。
Summary / 总结
RSATalker is a novel framework for generating realistic and socially-aware talking heads for multi-turn conversations. It uses 3D Gaussian Splatting to render high-fidelity 2D avatar videos, overcoming the limitations of previous methods. The socially-aware module encodes social relationships into high-level embeddings, and a three-stage training paradigm is employed. Experiments show that RSATalker outperforms existing methods in both realism and social awareness.
RSATalker 是一个框架,利用 3D 高斯散点技术生成虚拟现实中的多轮对话中具有现实感和社会意识的说话头部。它从语音驱动 3D 面部运动,并将 3D 高斯点绑定到网格面片以生成高质量的 2D 头像视频。一个社会意识模块将社会关系编码为高级嵌入,并采用三阶段训练范式。实验表明,RSATalker 在现实感和社会意识方面均优于现有方法。
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:协作多智能体推理时强化学习
多智能体系统已成为许多应用中的实用LLM驱动合作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应引入了非平稳性,奖励通常稀疏且高方差。因此,我们引入了多智能体推理时强化学习(MATTRL)框架,在推理时向多智能体协商注入结构化文本经验。MATTRL 形成一个由专家组成的多专家团队,进行多轮讨论,检索和整合推理时的经验,并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池,然后将其重新注入对话。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 在多智能体基线上的准确率平均提高了3.67%,在可比的单智能体基线上的准确率提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需调整即可实现分布转移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy without tuning. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. Experiments across medicine, math, and education show that MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline and by 8.67% over comparable single-agent baselines.
论文提出了Multi-Agent Test-Time Reinforcement Learning (MATTRL),该方法在推理时注入结构化的文本经验,以提高鲁棒性和准确性。MATTRL 形成一个多专家团队进行多轮讨论,检索并整合测试时的经验,并达成共识进行最终决策。实验结果显示,在医学、数学和教育等领域的基准上,MATTRL 的准确率比多智能体基线平均提高了 3.67%,比可比的单智能体基线提高了 8.67%。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然不甚了解。在本工作中,我们系统地研究了在上游视觉感知受控退化下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性感知模块。我们引入了感知现实的退化,这些退化仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误判,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了需要评估框架来明确考虑感知不确定性在关键安全应用中的重要性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) under perceptual degradation, focusing on their performance in autonomous driving and embodied AI systems. By applying realistic corruptions to semantic segmentation on the Cityscapes dataset, the research identifies severe failures in VLMs, such as hallucinations and omissions of critical entities. The authors introduce language-level misalignment metrics to quantify these issues and find a clear disconnect between pixel-level robustness and multimodal semantic reliability, emphasizing the need for better evaluation frameworks.
该研究探讨了视觉-语言模型(VLMs)在自主驾驶和具身AI系统中的鲁棒性,通过在Cityscapes数据集上应用可控的感知退化,发现即使轻微的分割精度下降也会导致VLMs出现严重的语义错位,如幻觉和安全误判。研究引入了新的语言层面的度量标准来量化这些问题,并发现像素级鲁棒性和多模态语义可靠性之间存在明显的差距,强调了在安全关键应用中需要更好的评估方法。
STEP3-VL-10B Technical Report
Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
First: 2026-01-14T17:58:24+00:00 · Latest: 2026-01-15T17:06:04+00:00
Comments: 50 pages
Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10x-20x larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
中文标题/摘要
标题:STEP3-VL-10B 技术报告
我们提出了STEP3-VL-10B,这是一种轻量级开源基础模型,旨在重新定义紧凑效率与前沿多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现:首先,一种统一的、完全解冻的预训练策略,基于1.2万亿多模态令牌,将语言对齐的感知编码器与Qwen3-8B解码器结合,以建立内在的视觉-语言协同作用;其次,一个扩展后的后训练流水线,包含超过1000次强化学习迭代。关键的是,我们实现了并行协调推理(PaCoRe)以扩展测试时的计算能力,将资源分配给可扩展的感知推理,探索和综合多种视觉假设。因此,尽管仅有紧凑的10B参数量,STEP3-VL-10B 的性能可与比其大10-20倍的模型(如GLM-4.6V-106B、Qwen3-VL-235B)以及顶级专有旗舰模型(如Gemini 2.5 Pro和Seed-1.5-VL)相媲美甚至超越。它在MMBench上记录了92.2%的得分,在MMMU上记录了80.11%的得分,同时在复杂推理方面分别达到了94.43%的AIME2025得分和75.95%的MathVision得分。我们发布了完整的模型套件,为社区提供了一个强大、高效且可复现的基线。
Summary / 总结
STEP3-VL-10B is a lightweight open-source foundation model that targets a better trade-off between compact size and multimodal capability through a unified, fully unfrozen pre-training strategy and a scaled post-training pipeline. It integrates a language-aligned Perception Encoder with a Qwen3-8B decoder and applies over 1,000 iterations of reinforcement learning. Despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10 to 20 times larger as well as proprietary flagships, scoring 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.
STEP3-VL-10B 是一种轻量级的基础模型,通过统一的预训练策略和扩展后的后训练管道重新定义了效率与多模态智能之间的平衡。它通过将语言对齐的感知编码器与 Qwen3-8B 解码器集成,并使用超过 1,000 次强化学习迭代来实现这一点。尽管其参数量仅为 10B,但 STEP3-VL-10B 在 MMBench、MMMU、AIME2025 和 MathVision 等多个基准测试中表现出色,得分分别为 92.2%、80.11%、94.43% 和 75.95%。
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一目标需要涵盖广泛领域的大型、开放词汇量视频动作数据集。我们介绍了Action100M,这是一个从1.2M互联网教学视频(14.6年时长)构建的大规模数据集,包含O(100百万)个时间局部化片段,具有开放词汇量动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述组织成描述树,(iii)使用多轮Self-Refine程序下的推理模型(GPT-OSS-120B)聚合证据以输出结构化注释(简要/详细动作、演员、简要/详细描述)。在Action100M上训练VL-JEPA展示了在各种动作识别基准测试中一致的数据规模改进和强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
The research aims to develop a large-scale video action dataset to enhance machine intelligence in recognizing physical actions from visual observations. The main method involves creating Action100M from 1.2 million instructional videos, using a fully automated pipeline for hierarchical temporal segmentation, multi-level captioning, and structured annotation. Key findings show consistent data-scaling improvements and strong zero-shot performance across various action recognition benchmarks, positioning Action100M as a foundational dataset for video understanding and world modeling research.
研究旨在通过开发大规模视频动作数据集来提升机器智能在识别物理动作方面的表现。Action100M 从 120 万条教学视频中构建,提供了超过 1 亿个时间局部化片段,带有开放词汇的动作监督。该数据集通过一个全自动流水线生成,包括层次时间分割、多级字幕生成和结构化注释精炼。实验表明,Action100M 在各种动作识别基准测试中表现出一致的数据扩展改进和强大的零样本性能,将其确立为视频理解与世界建模研究的新基础。
SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning
Authors: Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince Calhoun, Sergey Plis
Venue: Transactions on Machine Learning Research, 2026
First: 2024-05-15T02:13:51+00:00 · Latest: 2026-01-15T17:01:07+00:00
Comments: Published in Transactions on Machine Learning Research (TMLR), 2026
Abstract
In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores that are computed separately on local client data in non-IID scenarios and then aggregated to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy-sparsity trade-off, achieving more than 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by $2 \times$ relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over $2.3 \times$ faster communication time, underscoring its practical efficiency.
中文标题/摘要
标题:SSFL:初始化时发现稀疏统一子网络以实现高效的联邦学习
在本文中,我们提出了显著稀疏联邦学习(SSFL),这是一种用于稀疏联邦学习的高效通信简化方法。SSFL 在训练前识别一个稀疏子网络,利用在非IID场景下分别在本地客户端数据上计算的参数显著性分数进行聚合,以确定一个全局掩码。仅在每次客户端与服务器之间通信时训练和传输稀疏模型权重。在包括CIFAR-10、CIFAR-100和Tiny-ImageNet的标准基准测试中,SSFL 一致地改善了准确性和稀疏性之间的权衡,在CIFAR-10上相对于最强的稀疏基线实现了超过20%的相对误差减少,同时将通信成本减少了2倍。最后,在实际的联邦学习部署中,SSFL 实现了超过2.3倍的更快通信时间,突显了其实用效率。
Summary / 总结
The research proposes SSFL, a method for sparse federated learning that identifies a sparse subnetwork at initialization by aggregating parameter saliency scores computed on local client data into a global mask. Only the sparse weights are trained and communicated, which improves the accuracy-sparsity trade-off while lowering communication costs. On benchmarks, SSFL achieves over 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline and halves communication costs relative to dense federated learning. In a real-world deployment, SSFL delivers over 2.3 times faster communication.
研究旨在通过提出SSFL,即在初始化时利用参数显著性分数识别稀疏子网络,来提高联邦学习的效率。该方法减少了通信成本并优于现有稀疏基线,提高了准确性。在CIFAR-10上,SSFL实现了超过20%的相对误差减少,并将通信成本减少了2倍。在实际部署中,SSFL还将通信时间加快了超过2.3倍。
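The mask-construction step described in the SSFL abstract can be illustrated with a minimal sketch. The abstract only says that saliency scores are computed per client and then aggregated into a global mask; the specific score (|weight × gradient|), the data-size-weighted average, and the function names `saliency` and `global_mask` are assumptions made for illustration, not the authors' implementation.

```python
from typing import List


def saliency(weights: List[float], grads: List[float]) -> List[float]:
    """Per-parameter saliency on one client.

    Assumption: a |w * g| score, chosen only for illustration.
    """
    return [abs(w * g) for w, g in zip(weights, grads)]


def global_mask(client_scores: List[List[float]],
                client_sizes: List[int],
                density: float) -> List[int]:
    """Aggregate client saliency scores and keep the top `density` fraction.

    Assumption: scores are averaged with weights proportional to each
    client's local dataset size.
    """
    total = sum(client_sizes)
    n_params = len(client_scores[0])
    aggregated = [
        sum(scores[i] * size / total
            for scores, size in zip(client_scores, client_sizes))
        for i in range(n_params)
    ]
    k = max(1, int(round(density * n_params)))
    keep = sorted(range(n_params), key=lambda i: aggregated[i], reverse=True)[:k]
    mask = [0] * n_params
    for i in keep:
        mask[i] = 1
    return mask


if __name__ == "__main__":
    shared_init = [1.0, -2.0, 0.5, 3.0]          # same initialization on every client
    scores_a = saliency(shared_init, [0.1, 0.2, 0.0, 0.05])
    scores_b = saliency(shared_init, [0.3, 0.0, 0.1, 0.1])
    print(global_mask([scores_a, scores_b], [100, 300], density=0.5))
```

Once the mask is fixed, only the parameters with mask value 1 would be trained and exchanged each round, which is where the communication saving comes from.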
Searching for Quantum Effects in the Brain: A Bell-Type Test for Nonclassical Latent Representations in Autoencoders
Authors: I. K. Kominis, C. Xie, S. Li, M. Skotiniotis, G. P. Tsironis
First: 2026-01-15T16:59:40+00:00 · Latest: 2026-01-15T16:59:40+00:00
Comments: 6 pages, 2 figures
Abstract
Whether neural information processing is entirely classical or involves quantum-mechanical elements remains an open question. Here we propose a model-agnostic, information-theoretic test of nonclassicality that bypasses microscopic assumptions and instead probes the structure of neural representations themselves. Using autoencoders as a transparent model system, we introduce a Bell-type consistency test in latent space, and ask whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. By shifting the search for quantum-like signatures in neural systems from microscopic dynamics to experimentally testable constraints on information processing, this work opens a new route for probing the fundamental physics of neural computation.
中文标题/摘要
标题:在大脑中寻找量子效应:一种非经典潜在表示的贝尔型测试
神经信息处理是否完全经典或涉及量子力学元素仍是一个开放问题。在这里,我们提出了一种模型无关的信息论非经典性检验,该检验绕过了微观假设,而是直接探测神经表示本身的结构。利用自编码器作为透明的模型系统,我们引入了潜在空间中的贝尔型一致性检验,并询问在多种读出上下文中获得的解码统计是否可以由单一正潜在变量分布联合解释。通过将对神经系统中似量子特征的搜索从微观动力学转移到可实验测试的信息处理约束上,这项工作为探索神经计算的基本物理原理开辟了一条新途径。
Summary / 总结
This study asks whether neural information processing involves quantum-mechanical elements and proposes a model-agnostic, information-theoretic test of nonclassicality. Using autoencoders as a model system, it introduces a Bell-type consistency test in latent space that checks whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. The approach shifts the search for quantum-like signatures from microscopic dynamics to experimentally testable constraints on information processing, opening a new route for probing the fundamental physics of neural computation.
该研究旨在通过提出基于信息理论的模型无关测试来探索神经信息处理是否涉及量子力学元素。方法是使用自编码器在潜在空间中进行贝尔一致性测试,检查不同上下文下的解码统计是否可以由单一的正潜在变量分布来解释。主要发现表明,这种方法可以在不依赖于微观动力学的情况下,探索神经计算的基本物理学,为研究神经系统中的量子特征开辟了新的途径。
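For readers unfamiliar with Bell-type tests, the sketch below uses the textbook CHSH inequality to show what "jointly explained by a single positive latent-variable distribution" means. The abstract does not say which inequality the authors use, so CHSH, the function names, and the singlet-state correlation used for contrast are all assumptions for illustration only.

```python
import itertools
import math


def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """CHSH combination of four correlations between +/-1 valued readouts."""
    return e_ab - e_ab2 + e_a2b + e_a2b2


def classical_max() -> int:
    """Largest |S| over all deterministic assignments of the four readouts.

    Any positive distribution over a latent variable is a mixture of these
    assignments, so its |S| cannot exceed this value.
    """
    best = 0
    for a, a2, b, b2 in itertools.product((-1, 1), repeat=4):
        best = max(best, abs(chsh(a * b, a * b2, a2 * b, a2 * b2)))
    return best


def singlet_correlation(theta_a: float, theta_b: float) -> float:
    """Textbook quantum correlation for a spin singlet (not from the paper)."""
    return -math.cos(theta_a - theta_b)


if __name__ == "__main__":
    a, a2, b, b2 = 0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4
    s_quantum = chsh(singlet_correlation(a, b), singlet_correlation(a, b2),
                     singlet_correlation(a2, b), singlet_correlation(a2, b2))
    print("classical bound:", classical_max())
    print("|S| for singlet:", abs(s_quantum))
```

Statistics whose |S| exceeds the classical bound cannot come from one positive joint distribution over the readouts, which is the kind of violation a Bell-type consistency test looks for.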
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(图像复杂性感知检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而为复杂的图像通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练来解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在相同的语义空间中保持了图像表示和文本嵌入之间的兼容性,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据进行评估,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并保持了 95% 的实例级性能,从而实现了视觉语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach for vision transformers in vision-language models, which uses less compute for simple images and full network depth for complex ones. It addresses the challenge of maintaining cross-modal alignment through dual-path training. ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance on standard benchmarks. ConvNeXt-IC, a novel complexity assessment model, is developed to determine compute usage, achieving state-of-the-art performance with 4.4x faster complexity prediction compared to existing methods.
论文提出了ICAR(图像复杂性感知检索),一种针对视觉语言模型中视觉变换器的自适应计算方法,对于简单的图像使用较少的计算量,而对于复杂的图像则完全处理。这通过双路径训练来保持跨模态对齐。ICAR实现了20%更快的图像编码,同时保持性能,使视觉语言系统的可持续扩展成为可能。还开发了ConvNeXt-IC,一种用于图像复杂性评估的分类器骨干网络,其性能达到最新水平,预测速度比现有方法快4.4倍。
Combinatorial Optimization Augmented Machine Learning
Authors: Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
First: 2026-01-15T16:55:19+00:00 · Latest: 2026-01-15T16:55:19+00:00
Abstract
Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
中文标题/摘要
标题:组合优化增强机器学习
组合优化增强机器学习(COAML)最近作为一种强大的范式,将预测模型与组合决策制定相结合而崭露头角。通过将组合优化预言机(oracle)嵌入到学习管道中,COAML能够构建既数据驱动又符合可行性的策略,连接机器学习、运筹学和随机优化的传统。本文提供了COAML最新状态的全面概述。我们介绍了一个统一的COAML管道框架,描述了其方法论构建块,并形式化了其与经验成本最小化之间的联系。然后,我们根据不确定性形式和决策结构发展了一个分类体系。使用这个分类体系,我们回顾了静态和动态问题的算法方法,概述了跨调度、车辆路线、随机规划和强化学习等领域的应用,并从经验成本最小化、模仿学习和强化学习的角度综合了方法论贡献。最后,我们确定了关键的研究前沿。本文综述旨在既作为该领域的教程性介绍,又作为未来研究的路线图,连接组合优化和机器学习的界面。
Summary / 总结
This paper explores the COAML paradigm, which integrates machine learning with combinatorial optimization. It introduces a unifying framework for COAML pipelines, detailing their components and their link to empirical cost minimization. The study reviews algorithmic approaches for both static and dynamic problems, covering applications in scheduling, vehicle routing, stochastic programming, and reinforcement learning. Key research frontiers are also identified, aiming to guide future research at the intersection of combinatorial optimization and machine learning.
论文探讨了组合优化增强机器学习(COAML),将预测模型与组合决策相结合。它介绍了一个统一的COAML管道框架,详细描述了其构建模块及其与经验成本最小化的关系。研究回顾了静态和动态问题的算法方法,涵盖了调度、车辆路线、随机规划和强化学习等领域中的应用。还指出了关键的研究前沿,旨在作为该领域的教程和未来研究的路线图。
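The idea of embedding a combinatorial optimization oracle in a learning pipeline can be sketched as a two-stage policy: a statistical model maps features to cost parameters, and an oracle returns a feasible solution that minimizes those predicted costs. The linear model, the enumeration oracle, the tiny routing instance, and all names below are hypothetical and chosen only for illustration; the survey covers far more general settings.

```python
from typing import Dict, Sequence, Tuple

Edge = str
Solution = Tuple[Edge, ...]


def predict_costs(features: Dict[Edge, Sequence[float]],
                  w: Sequence[float]) -> Dict[Edge, float]:
    """Learned part: a linear model turns edge features into edge costs."""
    return {e: sum(wi * xi for wi, xi in zip(w, x)) for e, x in features.items()}


def oracle(costs: Dict[Edge, float], feasible: Sequence[Solution]) -> Solution:
    """Combinatorial part: pick the cheapest solution from the feasible set.

    Enumeration stands in for a real solver (shortest path, matching, ...).
    """
    return min(feasible, key=lambda sol: sum(costs[e] for e in sol))


def policy(features: Dict[Edge, Sequence[float]],
           w: Sequence[float],
           feasible: Sequence[Solution]) -> Solution:
    """The composed policy always outputs a feasible solution."""
    return oracle(predict_costs(features, w), feasible)


if __name__ == "__main__":
    features = {"s-a": (1.0, 0.0), "a-t": (1.0, 0.0), "s-t": (3.0, 1.0)}
    paths = [("s-a", "a-t"), ("s-t",)]
    print(policy(features, (1.0, 0.0), paths))   # detour via a is cheaper
    print(policy(features, (1.0, -2.0), paths))  # second feature makes s-t cheaper
```

Because the oracle only ever returns members of the feasible set, changing the weights `w` changes which solution is chosen but never breaks feasibility, which is the property the abstract calls feasibility-preserving.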
From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA
Authors: Kimia Abedini, Farzad Shami, Gianmaria Silvello
First: 2026-01-15T16:54:11+00:00 · Latest: 2026-01-15T16:54:11+00:00
Comments: Accepted paper by the 48th European Conference on Information Retrieval (ECIR'26)
Abstract
Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.
中文标题/摘要
标题:从单体到多智能体推理:推进GeneGPT在基因组学问答中的应用
理解基因组信息对于生物医学研究至关重要,但从复杂分布的数据库中提取数据仍然具有挑战性。大型语言模型(LLMs)为基因组学问答(QA)提供了潜在解决方案,但受限于对特定领域数据库的访问限制。GeneGPT是当前最先进的系统,通过使用专门的API调用增强LLMs,但它受到固定API依赖性和有限适应性的限制。我们复制了GeneGPT并提出了GenomAgent,这是一种多智能体框架,能够高效地协调专门的智能体处理复杂的基因组学查询。在GeneTuring基准测试的九项任务上,GenomAgent的平均性能比GeneGPT高出12%,其灵活的架构还适用于需要专家知识提取的各种科学领域。
Summary / 总结
The research aims to improve genomic question answering by addressing the limitations of existing large language models and specialized API dependencies. The study introduces GenomAgent, a multi-agent system that coordinates specialized agents to handle complex genomics queries. Experimental results show that GenomAgent outperforms GeneGPT by 12% on average across nine tasks from the GeneTuring benchmark and demonstrates flexibility in various scientific domains requiring expert knowledge extraction.
研究旨在通过解决现有大型语言模型(LLMs)在访问领域特定数据库方面的限制,来提高基因组问答(QA)的性能。该研究提出了GenomAgent,这是一种多代理框架,通过协调专门的代理来处理复杂的基因组查询,以增强GeneGPT。GenomAgent在来自GeneTuring基准的九项任务中平均比GeneGPT高出12%的性能,并且其灵活的架构适用于需要专业知识提取的各种科学领域。
How Quantization Shapes Bias in Large Language Models
Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
First: 2025-08-25T14:48:26+00:00 · Latest: 2026-01-15T16:30:08+00:00
Abstract
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
中文标题/摘要
标题:量化如何塑造大型语言模型中的偏差
本研究全面评估了量化对模型偏差的影响,特别关注其对个体人口子群体的影响。我们专注于权重和激活量化策略,并在广泛的偏差类型(包括刻板印象、公平性、毒性及情感)中考察其影响。我们使用概率和生成文本的指标,在13个基准上评估了不同架构家族和推理能力的模型。我们的研究结果表明,量化对偏差的影响是复杂的:虽然它可以减少模型的毒性,并且对情感影响不大,但它倾向于在生成任务中略微增加刻板印象和不公平性,尤其是在激进压缩下。这些趋势在不同的人口类别和子群体以及不同模型类型中通常是一致的,尽管其程度取决于具体环境。总体而言,我们的结果强调了在实践中应用量化时在效率和伦理考虑之间仔细平衡的重要性。
Summary / 总结
This study evaluates how quantization influences model bias, with particular attention to its effects on demographic subgroups. The research examines weight and activation quantization across several bias types (stereotypes, fairness, toxicity, and sentiment), using probability- and generated text-based metrics on 13 benchmarks. The findings indicate that while quantization can reduce model toxicity and does not significantly affect sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories, subgroups, and model types, though their magnitude depends on the specific setting.
这项研究评估了量化如何影响模型偏见,特别是对不同人口亚组的影响,通过在各种偏见类型上检查权重和激活量化来实现。研究使用了13个基准的基于概率和生成文本的指标,发现虽然量化可以减少模型的毒性且对情感影响不大,但它会增加刻板印象和不公平性,尤其是在剧烈压缩下。这些趋势在不同的人口亚组和模型类型中是一致的,但具体影响的大小取决于特定的环境设置。
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Authors: Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque
First: 2026-01-15T16:28:14+00:00 · Latest: 2026-01-15T16:28:14+00:00
Comments: 16 pages, 4 figures
Abstract
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ $\approx$ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
中文标题/摘要
标题:基于激活签名的表示感知去学习:从抑制到知识签名消除
从大型语言模型(LLM)中选择性地删除知识对于满足GDPR合规性和模型安全性至关重要,但当前的去学习方法将行为抑制与真正的知识删除混淆,允许潜在能力在表面拒绝之下持续存在。在本文中,我们通过引入知识免疫框架(KIF),一种基于表示的架构,通过针对内部激活签名而不是表面输出来区分真正的删除与混淆,来解决这一挑战。我们的方法结合了针对特定主题的表示的动态抑制与参数高效的适应,能够在无需完全重新训练模型的情况下实现持久的去学习。KIF 实现了接近完美的删除(FQ 约为 0.99,而理想值为 1.00),同时保持了与理想水平相当的实用性(MU = 0.62),从而打破了所有先前工作都受到限制的稳定性和删除之间的权衡。我们在标准基础模型(Llama 和 Mistral)和推理优先模型(Qwen 和 DeepSeek)上进行了评估,参数范围从 3B 到 14B。我们的观察结果表明,标准模型表现出与规模无关的真实删除(<3% 的实用性漂移),而推理优先模型揭示了基本架构的差异。我们综合了表面泄漏与潜在痕迹持续性的双重评估协议,操作化了混淆与删除之间的区分,并首次系统地诊断了不同模型家族和规模下的机制级遗忘行为。
Summary / 总结
This paper addresses the challenge of selective knowledge erasure from large language models (LLMs) by introducing the Knowledge Immunization Framework (KIF), which distinguishes genuine erasure from obfuscation by targeting internal activation signatures. KIF combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, achieving near-oracle erasure while preserving utility. The study evaluates KIF on various models, demonstrating scale-independent true erasure for standard models and revealing architectural divergence for reasoning-prior models. The evaluation protocol combines surface-level leakage with latent trace persistence to diagnose mechanism-level forgetting behavior.
本文通过引入知识免疫框架(KIF),针对大型语言模型(LLMs)中的选择性知识擦除问题,通过靶向内部激活签名来区分真正的擦除和混淆。KIF 结合动态抑制特定主题的表示与参数高效适应,实现接近理想的擦除效果同时保持功能。研究在多种模型上评估了 KIF,展示了标准模型在规模上的独立真正擦除效果,并揭示了推理优先模型的架构差异。评估协议结合表面泄漏与潜在痕迹持久性来诊断机制级遗忘行为。
Kolmogorov Arnold Networks and Multi-Layer Perceptrons: A Paradigm Shift in Neural Modelling
Authors: Aradhya Gaonkar, Nihal Jain, Vignesh Chougule, Nikhil Deshpande, Sneha Varur, Channabasappa Muttal
First: 2026-01-15T16:26:49+00:00 · Latest: 2026-01-15T16:26:49+00:00
Comments: 13 pages, 8 figures, 2 tables
Abstract
The research undertakes a comprehensive comparative analysis of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP), highlighting their effectiveness in solving essential computational challenges like nonlinear function approximation, time-series prediction, and multivariate classification. Rooted in Kolmogorov's representation theorem, KANs utilize adaptive spline-based activation functions and grid-based structures, providing a transformative approach compared to traditional neural network frameworks. Utilizing a variety of datasets spanning mathematical function estimation (quadratic and cubic) to practical uses like predicting daily temperatures and categorizing wines, the proposed research thoroughly assesses model performance via accuracy measures like Mean Squared Error (MSE) and computational expense assessed through Floating Point Operations (FLOPs). The results indicate that KANs reliably exceed MLPs in every benchmark, attaining higher predictive accuracy with significantly reduced computational costs. Such an outcome highlights their ability to maintain a balance between computational efficiency and accuracy, rendering them especially beneficial in resource-limited and real-time operational environments. By elucidating the architectural and functional distinctions between KANs and MLPs, the paper provides a systematic framework for selecting the most suitable neural architectures for specific tasks. Furthermore, the proposed study highlights the transformative capabilities of KANs in progressing intelligent systems, influencing their use in situations that require both interpretability and computational efficiency.
中文标题/摘要
标题:科莫戈罗夫-阿诺德网络与多层感知机:神经建模范式的转变
研究对科莫戈罗夫-阿诺德网络(KAN)和多层感知机(MLP)进行了全面的比较分析,突显了它们在解决非线性函数逼近、时间序列预测和多元分类等关键计算挑战方面的有效性。基于科莫戈罗夫表示定理,KANs 使用自适应样条激活函数和基于网格的结构,提供了与传统神经网络框架相比具有变革性的方法。通过涵盖从数学函数估计(二次和三次)到实际应用如预测每日温度和分类葡萄酒的各种数据集,研究通过准确度指标如均方误差(MSE)和通过浮点运算(FLOPs)评估的计算成本,全面评估了模型性能。结果表明,KANs 在所有基准测试中都可靠地超过了 MLPs,在保持更高预测准确性的同时显著降低了计算成本。这一结果突显了它们在计算效率和准确性之间保持平衡的能力,特别是在资源有限和实时操作环境中尤其有益。通过阐明 KANs 和 MLPs 在架构和功能上的区别,论文提供了一种系统框架,用于为特定任务选择最合适的神经网络架构。此外,提出的研究所揭示的 KANs 在推进智能系统方面的变革能力,影响了其在需要可解释性和计算效率的情况下使用。
Summary / 总结
The research compares Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP) in solving nonlinear function approximation, time-series prediction, and multivariate classification. KANs, based on Kolmogorov's representation theorem, use adaptive spline-based activation functions and grid-based structures, showing superior performance in accuracy and computational efficiency across various datasets. The study demonstrates that KANs achieve higher predictive accuracy with significantly fewer computational resources compared to MLPs, making them ideal for resource-constrained environments.
研究对比了Kolmogorov-Arnold网络(KAN)和多层感知机(MLP)在非线性函数逼近、时间序列预测和多变量分类中的应用。KANs基于Kolmogorov表示定理,使用自适应样条激活函数和网格结构,展示了在各种数据集上的优越性能。研究发现,KANs在准确性和计算效率方面均优于MLP,能够以更少的计算资源实现更高的预测精度,特别适用于资源受限的环境。
Process-Guided Concept Bottleneck Model
Authors: Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan
First: 2026-01-15T16:25:55+00:00 · Latest: 2026-01-15T16:25:55+00:00
Comments: 13 pages with 7 figures and 1 table, Supplementary Materials 10 pages with 3 figures
Abstract
Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.
中文标题/摘要
标题:过程导向的概念瓶颈模型
概念瓶颈模型(CBMs)通过引入中间语义概念来提高黑盒深度学习(DL)的可解释性。然而,标准的CBMs往往忽视了特定领域的关系和因果机制,并且它们对完整概念标签的依赖限制了在监督稀疏但过程定义良好的科学领域中的应用。为了解决这个问题,我们提出了过程导向的概念瓶颈模型(PG-CBM),这是一种扩展的CBM,通过生物物理意义明确的中间概念来约束学习,使其遵循领域定义的因果机制。以地球观测数据估计地上生物量密度为例,我们展示了PG-CBM相比多个基准模型减少了误差和偏差,同时利用了多源异质训练数据并产生了可解释的中间输出。除了提高准确性,PG-CBM还增强了透明度,能够检测虚假学习,并提供科学见解,代表了在科学应用中更可信赖的AI系统的一个步骤。
Summary / 总结
The research aims to improve the explainability and applicability of Concept Bottleneck Models (CBMs) in scientific domains by addressing their limitations in handling sparse supervision and domain-specific causal mechanisms. The Process-Guided Concept Bottleneck Model (PG-CBM) is proposed, which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. The study demonstrates that PG-CBM reduces error and bias in estimating above ground biomass density from Earth Observation data, while also enhancing transparency and providing scientific insights. Beyond accuracy, PG-CBM enables the detection of spurious learning and supports more trustworthy AI systems in scientific applications.
研究旨在通过解决概念瓶颈模型(CBMs)在处理特定领域因果机制和稀疏监督方面的局限性,提高其解释性和适用性。提出了过程导向的概念瓶颈模型(PG-CBM),使其学习遵循生物物理意义下的因果机制。在基于地球观测数据估计地上生物量密度的案例研究中,PG-CBM在减少误差和偏差方面优于多个基准模型,同时提供可解释的中间输出,并增强透明度和科学洞察力。
Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems
Authors: Xi Shi, Mengxin Zheng, Qian Lou
First: 2026-01-15T16:23:53+00:00 · Latest: 2026-01-15T16:23:53+00:00
Comments: Preprint
Abstract
Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS
中文标题/摘要
标题:学习针对并行多智能体系统的延迟感知编排
多智能体系统(MAS)通过协调多个智能体实现复杂的推理,但由于多步执行和重复模型调用,往往导致较高的推理延迟,严重限制了其在时间敏感场景中的可扩展性和实用性。大多数现有方法主要优化任务性能和推理成本,并且显式或隐式地假设顺序执行,使得它们在并行执行下控制延迟方面不太理想。在本工作中,我们研究了在并行执行下具有显式延迟监督的多智能体系统的基于学习的编排。我们提出了延迟感知多智能体系统(LAMaS),这是一种延迟感知的多智能体编排框架,能够实现并行执行,并显式优化关键执行路径,使控制器能够在并行执行下构建具有较低延迟的执行拓扑图。我们的实验表明,与多个基准上的最新基线方法相比,我们的方法在多智能体架构搜索中将关键路径长度减少了38-46%,同时保持或甚至提高了任务性能。这些结果突显了在设计高效的多智能体系统时,显式优化并行执行下的延迟的重要性。代码可在https://github.com/xishi404/LAMaS获取
Summary / 总结
This work addresses the high inference latency of multi-agent systems (MAS) by proposing LAMaS, a latency-aware orchestration framework that enables parallel execution and explicitly optimizes the critical execution path. Across multiple benchmarks, it reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search while maintaining or improving task performance. The results highlight the importance of explicitly optimizing latency when designing parallel MAS.
该研究针对多智能体系统(MAS)中的高延迟问题,提出了LAMaS,一种支持并行执行的延迟感知协调框架。LAMaS 优化关键执行路径以减少延迟,相比现有方法在多个基准测试中实现了38-46%的临界路径长度减少,同时保持或提升了任务性能。
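To make the notion of a critical execution path concrete, the sketch below computes the end-to-end latency of an execution graph when independent agents run in parallel, and compares it with the sequential total. It is not the LAMaS controller: the agent names, latencies, and the function `critical_path_latency` are hypothetical, and only the standard-library `graphlib.TopologicalSorter` is used.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable


def critical_path_latency(latency: Dict[str, float],
                          deps: Dict[str, Iterable[str]]) -> float:
    """Latency of the longest dependency chain in an execution DAG.

    `deps[node]` lists the agents whose outputs `node` must wait for.
    Under parallel execution, a node starts as soon as its slowest
    predecessor finishes.
    """
    graph = {node: tuple(deps.get(node, ())) for node in latency}
    finish: Dict[str, float] = {}
    for node in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[node]), default=0.0)
        finish[node] = start + latency[node]
    return max(finish.values())


if __name__ == "__main__":
    latency = {"planner": 1.0, "coder": 3.0, "searcher": 2.0, "judge": 0.5}
    deps = {"coder": ["planner"], "searcher": ["planner"],
            "judge": ["coder", "searcher"]}
    print("parallel  :", critical_path_latency(latency, deps))
    print("sequential:", sum(latency.values()))
```

In this toy graph the searcher runs alongside the coder, so only the planner, coder, and judge lie on the critical path; shortening that path, rather than the total work, is what lowers wall-clock latency.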
DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery
Authors: Constantin Selzer, Fabian B. Flohr
Venue: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 2024, pp. 221-227
First: 2026-01-15T16:18:42+00:00 · Latest: 2026-01-15T16:18:42+00:00
Abstract
The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements of up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
中文标题/摘要
标题:DeepUrban:基于航空影像的交互感知轨迹预测与规划在自动化驾驶中的应用
自动驾驶系统的有效性在很大程度上依赖于其预测和规划能力的稳健性。然而,当前的基准测试受到一个明显的问题困扰,即缺乏密集交通场景,这对于理解和建模道路使用者之间的复杂交互至关重要。为了解决这一问题,我们与工业合作伙伴DeepScenario合作,开发了DeepUrban——一个新的无人机数据集,旨在增强针对密集城市环境的轨迹预测和规划基准。DeepUrban提供了从大约100米高空拍摄的高分辨率城市交叉口图像中提取的丰富3D交通对象集合,并进一步丰富了全面的地图和场景信息,以支持高级建模和仿真任务。我们评估了最先进的(SOTA)预测和规划方法,并进行了泛化能力实验。我们的研究结果表明,将DeepUrban添加到nuScenes中可以提高车辆预测和规划的准确性,ADE / FDE指标上的改进幅度最高可达44.1% / 44.3%。网站:https://iv.ee.hm.edu/deepurban
Summary / 总结
The research aims to improve autonomous driving systems by addressing the scarcity of dense traffic scenarios in current benchmarks. The study introduces DeepUrban, a new drone dataset that captures 3D traffic objects from urban intersections, enhancing trajectory prediction and planning capabilities. Experiments show that integrating DeepUrban into nuScenes improves prediction accuracy by up to 44.1% and 44.3% on ADE and FDE metrics, respectively.
研究旨在通过解决当前基准中密集交通场景稀缺的问题来提升自动驾驶系统。研究引入了DeepUrban,这是一个新的无人机数据集,从城市交叉口捕获3D交通对象,以增强轨迹预测和规划。实验表明,将DeepUrban集成到nuScenes中,可使ADE和FDE指标上的预测准确性分别最高提升44.1%和44.3%。
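For readers unfamiliar with the two metrics, the sketch below computes ADE (average displacement error) and FDE (final displacement error) for a single predicted trajectory. The function names and the sample trajectories are illustrative, and `relative_improvement` assumes the reported gains are relative reductions in error, which the abstract does not state explicitly.

```python
import math


def ade_fde(pred, truth):
    """pred, truth: equal-length lists of (x, y) positions, one per future timestep."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between prediction and ground truth at each timestep.
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(errors) / len(errors)  # mean error over the whole horizon
    fde = errors[-1]                 # error at the final timestep only
    return ade, fde


def relative_improvement(baseline, improved):
    """Percentage reduction of an error metric (assumed definition of 'improvement')."""
    return 100.0 * (baseline - improved) / baseline


# Made-up trajectories for illustration.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)
```

Lower is better for both metrics, so an improvement corresponds to a smaller ADE or FDE.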
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
First: 2026-01-15T16:18:00+00:00 · Latest: 2026-01-15T16:18:00+00:00
Comments: 22 pages, 10 figures
Abstract
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
中文标题/摘要
标题:视频生成模型与潜在世界模型的推理时物理对齐
最先进的视频生成模型能够生成令人印象深刻的视觉内容,但往往违反基本的物理原理,限制了它们的应用。虽然有人认为这种缺陷源于预训练时对物理理解不足,但我们发现,物理合理性不足也源于不理想的推理策略。因此,我们引入了WMReward,并将提高视频生成的物理合理性视为一种推理时的对齐问题。具体来说,我们利用潜在世界模型(这里为VJEPA-2)的强物理先验作为奖励,搜索和引导多个候选去噪轨迹,从而实现测试时计算量的扩展,以提高生成性能。实验证明,我们的方法在图像条件、多帧条件和文本条件生成设置中显著提高了物理合理性,得到了人类偏好研究的验证。值得注意的是,在ICCV 2025 Perception Test PhysicsIQ挑战中,我们获得了62.64%的最终得分,获得第一名,并且超越了之前的最先进的技术水平7.42%。我们的工作证明了使用潜在世界模型提高视频生成的物理合理性的可行性,超越了这一特定实例或参数化。
Summary / 总结
The research addresses the issue of video generative models violating basic physics principles by proposing WMReward, which treats physics plausibility as an inference-time alignment problem. Using the strong physics prior of a latent world model (VJEPA-2) as a reward, the method searches and steers multiple candidate denoising trajectories, trading additional test-time compute for better generation. The approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings and wins first place in the ICCV 2025 Perception Test PhysicsIQ Challenge with a score of 62.64%, surpassing the previous state of the art by 7.42%.
研究旨在通过解决推理策略不足的问题,提高视频生成模型的物理合理性。方法是利用潜在世界模型作为奖励,在推理时对视频生成进行物理原则对齐。关键发现显示,在不同生成设置中显著提高了物理合理性,并在ICCV 2025 Perception Test PhysicsIQ挑战赛中取得了62.64%的分数,比之前的方法高出7.42%。
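The simplest form of reward-guided search at inference time is best-of-N selection, sketched below. This is an illustration of the general idea only, not the WMReward procedure: `toy_generate` and `toy_reward` are hypothetical stand-ins for a video generator and a world-model score, and the sketch does not steer denoising trajectories.

```python
import random


def best_of_n(generate, reward, n, seed=0):
    """Draw n candidates and keep the one with the highest reward.

    generate(rng) -> candidate and reward(candidate) -> float are
    caller-supplied placeholders, not real model APIs.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


# Toy stand-ins: a "sample" is a number, and the reward prefers values near 0.5.
def toy_generate(rng):
    return rng.random()


def toy_reward(x):
    return -(x - 0.5) ** 2


sample, score = best_of_n(toy_generate, toy_reward, n=16)
```

With a fixed seed, increasing `n` can never lower the selected reward, which is the sense in which extra test-time compute buys better samples under the chosen reward.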
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知方面的潜力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面却常常力不从心,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体来说,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一个双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在一项新的城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve the automated perception of urban roadside infrastructure for smart city management. It proposes a domain-adapted framework that fine-tunes large vision-language models with a data-efficient strategy and a knowledge-grounded reasoning mechanism. The framework includes open-vocabulary fine-tuning on Grounding DINO for robust localization and LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To ensure compliance, a dual-modality RAG module retrieves industry standards and visual exemplars. The framework achieves 58.9 mAP in detection and 95.5% accuracy in attribute recognition, showing promise for intelligent infrastructure monitoring.
研究旨在通过改进通用模型的局限性,提升城市路边基础设施的自动化感知能力,以支持智能城市管理。提出了一种领域适应框架,通过开放词汇量微调和知识导向推理技术来细化大型视觉语言模型,并结合双模态检索增强生成模块以确保专业合规。该框架在新数据集上实现了58.9 mAP的检测性能和95.5%的属性识别准确性,展示了智能基础设施监控的稳健解决方案。
Spatial As Deep: Spatial CNN for Traffic Scene Understanding
Authors: Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang
Venue: AAAI 2018
First: 2017-12-17T09:37:52+00:00 · Latest: 2026-01-15T16:01:43+00:00
Comments: Accepted to AAAI 2018
Abstract
Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important to learn semantic objects with strong shape priors but weak appearance coherences, such as traffic lanes, which are often occluded or not even painted on the road surface as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but fewer appearance clues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and the Cityscapes dataset. The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
中文标题/摘要
标题:空间卷积神经网络:用于交通场景理解的空间CNN
卷积神经网络(CNN)通常通过逐层堆叠卷积操作构建。尽管CNN在从原始像素中提取语义方面表现出强大的能力,但其捕捉图像行和列中像素的空间关系的能力尚未得到充分探索。这些关系对于学习具有强烈形状先验但外观一致性较弱的语义对象(如交通车道)非常重要,如图1(a)所示,交通车道经常被遮挡或根本未在路面上绘制。在本文中,我们提出了一种空间CNN(SCNN),它将传统的逐层卷积推广到特征图内的切片间卷积,从而在层内使行和列中的像素之间能够进行消息传递。这种SCNN特别适用于具有强烈空间关系但外观线索较少的长连续形状结构或大型对象,如交通车道、杆和墙。我们在一个新发布的非常具有挑战性的交通车道检测数据集和Cityscapes数据集上应用了SCNN。结果表明,SCNN能够学习结构输出的空间关系,并显著提高了性能。我们展示了SCNN在车道检测数据集上分别比基于RNN的ReNet和MRF+CNN(MRFNet)高出8.7%和4.6%。此外,我们的SCNN在TuSimple基准车道检测挑战中获得了第一名,准确率为96.53%。
Summary / 总结
This paper introduces Spatial CNN (SCNN), which extends traditional CNNs by enabling message passing between pixels across rows and columns within feature maps, enhancing the ability to capture spatial relationships. Applied to traffic lane detection and Cityscapes datasets, SCNN significantly improved performance, outperforming RNN-based ReNet and MRFNet by 8.7% and 4.6% respectively, and achieving 96.53% accuracy in the TuSimple Benchmark Lane Detection Challenge.
本文提出了Spatial CNN (SCNN),该方法将传统的逐层卷积扩展为在特征图内的切片间卷积,使像素在行和列之间能够进行消息传递。该方法特别适用于如交通车道等长连续形状结构。实验结果显示,SCNN在交通车道检测数据集和Cityscapes上的表现显著提升,分别比基于RNN的ReNet和MRF+CNN(MRFNet)高出8.7%和4.6%,并在TuSimple基准车道检测挑战中取得了96.53%的准确率。
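The slice-by-slice idea can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single channel, a fixed (not learned) 1D kernel, a ReLU nonlinearity, and only the downward direction, whereas the abstract describes message passing across both rows and columns.

```python
def relu(x):
    return x if x > 0 else 0.0


def conv1d_same(row, kernel):
    """1D convolution along a row with zero padding; kernel length is assumed odd."""
    half = len(kernel) // 2
    out = []
    for j in range(len(row)):
        acc = 0.0
        for n, w in enumerate(kernel):
            idx = j + n - half
            if 0 <= idx < len(row):
                acc += w * row[idx]
        out.append(acc)
    return out


def scnn_downward(feature_map, kernel):
    """Pass messages from each row (slice) to the row below it, top to bottom.

    Each row receives relu(conv(previous updated row)) added to its own
    values, so information from the top row can reach every lower row
    within one layer.
    """
    out = [list(feature_map[0])]
    for row in feature_map[1:]:
        message = [relu(v) for v in conv1d_same(out[-1], kernel)]
        out.append([x + m for x, m in zip(row, message)])
    return out


# A single activation in the top row spreads downward and sideways.
fmap = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
propagated = scnn_downward(fmap, kernel=[0.5, 1.0, 0.5])
```

Because each row is computed from the already-updated row above it, the pass is sequential along the height, which is what distinguishes it from an ordinary convolution applied to all rows at once.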
Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning
Authors: Oscar H. Ramírez-Agudelo, Akshay N. Shewatkar, Edoardo Milana, Roland C. Aydin, Kai Franke
Venue: SPIE Vol. 12675 126750A-12, 2023
First: 2026-01-15T15:59:12+00:00 · Latest: 2026-01-15T15:59:12+00:00
Comments: 17 pages, 10 figures, 6 tables, SPIE Applications of Machine Learning 2023, San Diego, US
Abstract
Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructure and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted by light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset containing over 14,000 images was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for each of the haze and smoke datasets. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models achieve interesting results. In general, images captured in the presence of smoke are more difficult to enhance given the inhomogeneity and high density of smoke. Moreover, FFA-Net and AECR-Net were designed to dehaze, not to desmoke, images. This work shows that the use of deep learning architectures can greatly improve the quality of analog gauge images captured in smoke and haze scenes. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
中文标题/摘要
标题:通过深度学习提高烟雾和雾霾场景中测度图像的质量
在雾霾和烟雾环境中拍摄的图像由于能见度降低,给基础设施监测和紧急服务在关键时刻造成了挑战。本研究探讨了使用深度学习模型提高烟雾环境中测度图像的自动可读性,准确的测度数据解释对于一线救援人员来说是一个有价值的工具。研究使用了两种深度学习架构,FFA-Net和AECR-Net,以提高被烟雾和雾霾污染的测度图像的可见度。由于缺乏模拟测度图像的基准数据集,使用Unreal Engine生成了一个新的合成数据集,包含超过14,000张图像。模型分别以80%的训练集、10%的验证集和10%的测试集进行了训练。对于合成雾霾数据集,SSIM和PSNR指标分别为0.98和43 dB,与最先进的结果相当。此外,AECR-Net比FFA-Net获得了更稳健的结果。虽然合成烟雾数据集的结果较差,但训练模型仍取得了有趣的结果。总体而言,在烟雾中成像更难提高,因为不均匀性和高密度。其次,FFA-Net和AECR-Net被实现用于去雾霾,而不是去烟雾。本研究表明,使用深度学习架构可以极大地提高烟雾和雾霾场景中模拟测度图像的质量。最后,增强后的输出图像可以成功后处理以实现自动自主读取测度
Summary / 总结
This study addresses the challenge of reduced visibility in gauge images captured in hazy and smoky environments by employing deep learning models. Two dehazing architectures, FFA-Net and AECR-Net, were used to enhance the readability of gauge images, trained on a new synthetic dataset of over 14,000 images generated with the Unreal Engine. On the synthetic haze dataset, SSIM and PSNR reach about 0.98 and 43 dB, and AECR-Net gives more robust results than FFA-Net; results on the synthetic smoke dataset are poorer. The enhanced images can be post-processed for automatic reading of gauges, which may support first responders.
该研究旨在通过深度学习模型提高在烟雾和雾霾环境中拍摄的图像中指针表的可读性。使用FFA-Net和AECR-Net两种架构来改善这些条件下的指针表图像。通过Unreal Engine创建了一个新的合成数据集进行模型训练,实现了高SSIM和PSNR指标。AECR-Net在雾霾数据集中的表现优于FFA-Net,但在烟雾数据集中的结果较差,因为烟雾的复杂性较大。这些模型显著提高了指针表图像的质量,使其更容易被自动读取系统识别。
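As a reference for the PSNR figure quoted above, the following sketch computes PSNR between a clean image and an enhanced one. It assumes 8-bit pixel values (peak of 255) stored as flat lists, which the abstract does not specify; the sample pixel values are made up, and SSIM is omitted because it needs windowed local statistics.

```python
import math


def psnr(reference, enhanced, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(enhanced) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Illustrative two-pixel "images".
clean = [100, 100]
restored = [110, 90]
value = psnr(clean, restored)
```

Higher PSNR means the enhanced image is closer to the clean reference.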
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
Authors: Stefano Cerri, Asbjørn Munk, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen
First: 2025-06-17T11:48:05+00:00 · Latest: 2026-01-15T15:58:31+00:00
Abstract
We present FOMO300K, a large-scale, heterogeneous dataset of 318,877 brain Magnetic Resonance Imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO300K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
中文标题/摘要
标题:用于自我监督学习的大规模异构3D磁共振脑成像数据集
我们提出了FOMO300K,这是一个包含318,877个脑磁共振成像(MRI)扫描的数据集,来自82,678个MRI会话和59,969个受试者,汇总自920个公开可用的来源。该数据集包括临床级和研究级图像、多种MRI序列以及广泛的解剖和病理变异性,包括具有大脑异常的扫描。对原始图像特征进行了最小预处理,以降低新用户的入门门槛。提供了用于自我监督预训练和微调的配套代码,以及预训练模型。FOMO300K旨在支持大规模医学影像中自我监督学习方法的开发和基准测试。
Summary / 总结
The work presents FOMO300K, a large-scale heterogeneous dataset of 318,877 brain MRI scans from 82,678 sessions and 59,969 subjects, aggregated from 920 publicly available sources. It covers clinical- and research-grade images, multiple MRI sequences, and scans with large brain anomalies, with minimal preprocessing applied to preserve the original image characteristics. Companion code for self-supervised pretraining and finetuning and pretrained models are provided to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
研究动机是开发一个大规模异质的3D磁共振脑成像数据集以支持自我监督学习。主要方法是从82,678个MRI会话和59,969个受试者中收集了318,877个脑MRI扫描,进行了最少的预处理以保留原始图像特征。关键实验发现包括成功创建了FOMO300K,该数据集支持大规模医学影像中自我监督学习方法的发展和基准测试。