WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Authors: Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
First: 2026-01-15T18:59:58+00:00 · Latest: 2026-01-15T18:59:58+00:00
Comments: Project Page: https://wild-rayzer.cs.virginia.edu/
Abstract
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
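The residual-to-mask idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-pixel absolute residual, the fixed `threshold`, and the masked mean-squared error are all assumptions chosen for illustration, and images are plain nested lists of intensities.

```python
def pseudo_motion_mask(rendered, observed, threshold=0.2):
    """Flag pixels the static render fails to explain (1 = likely transient)."""
    return [[1 if abs(r - o) > threshold else 0 for r, o in zip(r_row, o_row)]
            for r_row, o_row in zip(rendered, observed)]


def masked_l2_loss(pred, target, mask):
    """Mean squared error over pixels that are NOT flagged as transient."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 0:  # transient pixels contribute no loss (and no gradient)
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 2x2 "images": one pixel differs strongly from the static render.
rendered = [[0.5, 0.5], [0.5, 0.5]]
observed = [[0.5, 0.9], [0.4, 0.5]]
mask = pseudo_motion_mask(rendered, observed)
loss = masked_l2_loss(rendered, observed, mask)
```

Here the large residual at the top-right pixel is masked out, so only the three remaining pixels contribute to the loss.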
中文标题/摘要
标题:WildRayZer:动态环境中的自监督大规模视图合成
我们提出了WildRayZer,一种在动态环境中进行新颖视图合成(NVS)的自监督框架,其中相机和物体都在移动。动态内容破坏了静态NVS模型依赖的多视图一致性,导致鬼影、虚假几何结构和不稳定的姿态估计。WildRayZer 通过执行分析-合成测试来解决这一问题:仅相机的静态渲染器解释刚性结构,其残差揭示了瞬态区域。从这些残差中,我们构建伪运动掩码,提炼一个运动估计器,并使用它来屏蔽输入令牌和门控损失梯度,使监督集中在跨视图背景完成上。为了实现大规模的训练和评估,我们整理了包含15000个随意捕捉的动态序列的真实世界数据集Dynamic RealEstate10K(D-RE10K),以及D-RE10K-iPhone,一个包含瞬态和干净基准的稀疏视图瞬态感知NVS配对数据集。实验表明,WildRayZer 在瞬态区域去除和全帧NVS质量方面,单次前向传递时均优于基于优化和前馈的基线。
Summary / 总结
WildRayZer is a self-supervised framework for novel view synthesis in dynamic environments, where moving content causes ghosting, hallucinated geometry, and unstable pose estimation. A camera-only static renderer explains rigid structure, and its residuals are turned into pseudo motion masks that are distilled into a motion estimator; the estimator masks input tokens and gates loss gradients so that supervision focuses on cross-view background completion. Experiments show that WildRayZer outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
WildRayZer 是一种用于动态环境中的新颖视图合成的自监督框架,解决了鬼影和姿态估计不稳定等问题。它通过静态渲染器解释刚性结构,并利用残差构建伪运动掩码,从而提炼出运动估计器,并将监督重点放在背景完成上。实验表明,WildRayZer 在移除过渡区域和全帧 NVS 质量方面均优于基于优化和前馈的基线方法,并且只需一次前馈传递。
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
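A minimal sketch of turn-level reward assignment via one-to-one matching follows. The abstract does not specify the similarity function or the two assignment strategies, so the `similarity` score and the brute-force maximum-weight matching below are assumptions for illustration only; a real system would likely use a polynomial-time assignment solver.

```python
from itertools import permutations


def similarity(pred, gold):
    """Hypothetical score: 0 if tool names differ, else 0.5 plus 0.5 * fraction of matching gold args."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    gold_args = gold["args"]
    if not gold_args:
        return 1.0
    hits = sum(1 for k, v in gold_args.items() if pred["args"].get(k) == v)
    return 0.5 + 0.5 * hits / len(gold_args)


def turn_rewards(preds, golds):
    """Brute-force max-weight one-to-one matching; returns one reward per predicted turn."""
    # None slots let a predicted turn stay unmatched (reward 0).
    slots = list(range(len(golds))) + [None] * len(preds)
    best_total, best_assign = -1.0, None
    for assign in permutations(slots, len(preds)):
        total = sum(similarity(preds[i], golds[j])
                    for i, j in enumerate(assign) if j is not None)
        if total > best_total:  # ties keep the earliest assignment found
            best_total, best_assign = total, assign
    return [similarity(preds[i], golds[j]) if j is not None else 0.0
            for i, j in enumerate(best_assign)]


preds = [
    {"tool": "search", "args": {"q": "capital of France"}},
    {"tool": "search", "args": {"q": "capital of France"}},  # redundant repeat
    {"tool": "calc", "args": {"expr": "2+3"}},               # right tool, wrong argument
]
golds = [
    {"tool": "search", "args": {"q": "capital of France"}},
    {"tool": "calc", "args": {"expr": "2+2"}},
]
rewards = turn_rewards(preds, golds)
```

Because each ground-truth call can be matched at most once, the redundant second search receives no reward, which is the kind of distinction a uniform trajectory-level reward cannot make.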
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过交替进行推理步骤和外部工具交互,增强大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优点,这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时间多轮场景中。为了解决这个问题,我们提出了一种MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配问题形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导出密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明,MatchTIR具有优越性。值得注意的是,我们的4B模型在大多数8B竞争对手中表现更优,特别是在长时间多轮任务中。我们的代码可在https://github.com/quchangle1/MatchTIR获取。
Summary / 总结
MatchTIR addresses coarse-grained credit assignment in Tool-Integrated Reasoning with a fine-grained supervision framework. It formulates credit assignment as bipartite matching between predicted and ground-truth traces to derive dense turn-level rewards, and uses dual-level advantage estimation to balance local step precision with global task success. On three benchmarks, the 4B MatchTIR model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks.
论文提出了MatchTIR框架,通过二分匹配分配细粒度的回合级奖励和双层优势估计,区分有效的工具调用和冗余调用,提升长时序多回合任务的表现。实验表明,4B模型在复杂场景中优于大多数8B竞争对手。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以显著提高性能,将其确立为一种可扩展的范式,通过赋予LLM按需访问完整视觉层次结构的能力,解锁更深层次的多模态理解。
Summary / 总结
The paper addresses the visual feature bottleneck in vision-language models (VLMs), which link only the output of the vision encoder to the input of the LLM, by proposing Cross-Layer Injection (CLI), a lightweight framework that builds a dynamic many-to-many bridge between the two modalities. CLI includes an Adaptive Multi-Projection (AMP) module that harmonizes features from different vision layers and an Adaptive Gating Fusion (AGF) mechanism that lets the LLM selectively inject visual information based on its decoding context. Integrated into LLaVA-OneVision and LLaVA-1.5, CLI yields significant improvements across 18 benchmarks.
论文提出了一种名为Cross-Layer Injection (CLI)的动态框架,旨在解决静态视觉-语言模型(VLMs)的局限性,通过创建视觉和语言模态之间的多对多连接来增强模型能力。CLI包含一个自适应多投影(AMP)模块用于特征协调,以及一个自适应门控融合(AGF)机制用于根据实时解码上下文选择性地注入相关视觉信息。实验结果显示CLI在18个基准测试中显著提高了性能,增强了LLMs对视觉和语言信息的综合理解能力。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD). We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17/64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
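The per-frame masking step can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name, the `mask_rate` parameter, and the choice to carry each surviving patch's original grid index as its "spatial layout" are assumptions.

```python
import random


def stochastic_patch_selection(patches, mask_rate, rng):
    """Drop a random fraction of patch descriptors; return survivors with their original positions."""
    n = len(patches)
    n_keep = n - int(round(mask_rate * n))
    # Sorting keeps survivors in their original raster order.
    keep = sorted(rng.sample(range(n), n_keep))
    return [patches[i] for i in keep], keep


rng = random.Random(0)
patches = [f"patch_{i}" for i in range(64)]  # stand-ins for 64 patch descriptors
kept, positions = stochastic_patch_selection(patches, mask_rate=0.5, rng=rng)
```

Calling the function again on the same frame yields a different subset, so the policy sees a new random view of the same scene at every step while processing fewer tokens.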
中文标题/摘要
标题:少看,开得好:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,训练于基础模型提取的补丁对齐特征上的策略在处理未见过的数据分布(OOD)时表现更好。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性很强。在这些重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了随机补丁选择(SPS),这是一种简单而有效的方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将它们提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每随机子集的补丁像不同的,但仍然合理的、连贯的世界投影。策略基于不变于哪些特定标记幸存的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均改进6.2%,在闭环模拟中最高可达20.4%,同时速度提高2.4倍。我们进行了遮蔽率和补丁特征重组的消融研究,训练和评估了9个系统,其中8个系统超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history.
For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
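Intent-indexed retrieval can be sketched with a small example. The data classes, the rule of filtering on the latent goal, and the ranking score below are hypothetical simplifications; the abstract only states that memory snippets are filtered and prioritized by intent compatibility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str                # current latent goal (thematic segment)
    action: str              # action type
    entity_types: frozenset  # salient entity types


@dataclass(frozen=True)
class MemoryItem:
    text: str
    intent: Intent


def retrieve(memory, query, top_k=2):
    """Keep items sharing the query's goal, then rank by action and entity-type overlap."""
    candidates = [m for m in memory if m.intent.goal == query.goal]

    def score(m):
        return int(m.intent.action == query.action) + len(m.intent.entity_types & query.entity_types)

    return [m.text for m in sorted(candidates, key=score, reverse=True)[:top_k]]


memory = [
    MemoryItem("Booked the Paris hotel for the conference",
               Intent("plan conference trip", "book", frozenset({"hotel"}))),
    MemoryItem("The Paris hotel was noisy on the honeymoon",
               Intent("review honeymoon", "recall", frozenset({"hotel"}))),
    MemoryItem("Conference flight lands at 9am",
               Intent("plan conference trip", "lookup", frozenset({"flight"}))),
]
query = Intent("plan conference trip", "book", frozenset({"hotel"}))
hits = retrieve(memory, query)
```

The honeymoon entry mentions the same hotel but belongs to a different goal, so it is suppressed even though it is semantically similar to the query.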
中文标题/摘要
标题:将代理记忆扎根于上下文意图
在长期目标导向的交互中部署大型语言模型仍然具有挑战性,因为相似的实体和事实会在不同的潜在目标和约束下反复出现,导致记忆系统检索到上下文不匹配的证据。我们提出了STITCH(基于上下文历史的结构化意图跟踪),这是一种代理记忆系统,通过结构化的检索提示、上下文意图对每个轨迹步骤进行索引,并通过匹配当前步骤的意图来检索历史。上下文意图提供了紧凑的信号,消除了重复提及的歧义并减少了干扰:(1) 当前定义主题段落的潜在目标,(2) 动作类型,以及(3) 重要的实体类型,锚定哪些属性是相关的。在推理过程中,STITCH通过意图兼容性筛选和优先级排序记忆片段,抑制语义相似但上下文不匹配的历史记录。
为了评估,我们引入了CAME-Bench,这是一个基准测试,用于在现实、动态的目标导向轨迹中进行上下文感知检索。在CAME-Bench和LongMemEval上,STITCH达到了最先进的性能,比最强基线高出35.6%,随着轨迹长度的增加,性能提升最大。我们的分析表明,意图索引显著减少了检索噪声,支持意图感知的记忆,以实现稳健的长期推理。
Summary / 总结
The research addresses context-mismatched memory retrieval in long-horizon, goal-oriented interactions with large language models. It proposes STITCH, an agentic memory system that indexes each trajectory step with a contextual intent consisting of the current latent goal, the action type, and the salient entity types. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, which disambiguates repeated mentions and reduces retrieval noise. Across CAME-Bench and LongMemEval, STITCH outperforms the strongest baseline by 35.6%, with the largest gains on longer trajectories.
研究旨在解决大型语言模型在长期目标导向交互中的记忆检索难题。提出了STITCH结构化意图跟踪系统,每个步骤都用上下文意图进行索引,包括当前的潜在目标、行动类型和重要的实体类型。在推理过程中,STITCH根据意图兼容性过滤记忆片段,减少检索噪声。实验表明,STITCH在CAME-Bench和LongMemEval上的表现优于现有方法,尤其是在更长的轨迹中,通过有效区分重复提及和减少干扰,提高了35.6%的性能。
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较,来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的解释性基准测试,具有参考目标)。LIBERTy基于明确定义的结构因果模型(SCM),文本生成中的干预传播通过SCM,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们评估了五种模型的广泛方法,并确定了改进概念驱动解释的大量空间。LIBERTy还使系统分析模型对干预的敏感性成为可能:我们发现专有LLM在人口统计概念上的敏感性明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Summary / 总结
The research aims to develop a benchmark for evaluating concept-based explanations of LLMs by using structural counterfactuals. The method involves creating datasets with explicit Structured Causal Models (SCMs) where interventions on concepts generate counterfactuals. The study introduces three datasets and a new evaluation metric, order-faithfulness, to assess a variety of methods across five models. Key findings include substantial room for improving concept-based explanations and a reduced sensitivity of proprietary LLMs to demographic concepts, possibly due to post-training mitigation. This framework, LIBERTy, offers a valuable benchmark for faithful explainability methods.
研究旨在通过结构化反事实数据集来评估LLM的概念解释。方法是使用明确的结构化因果模型(SCMs),其中对概念的干预生成反事实。研究引入了三个数据集和一个新的评估指标,即顺序忠实度,以评估五种模型下的多种方法。主要发现包括概念解释有很大的改进空间,以及由于后训练缓解措施,专有LLM对人口统计概念的敏感度明显降低。LIBERTy框架为忠实的解释方法提供了急需的基准。
Data-driven stochastic reduced-order modeling of parametrized dynamical systems
Authors: Andrew F. Ilersich, Kevin Course, Prasanth B. Nair
First: 2026-01-15T18:50:18+00:00 · Latest: 2026-01-15T18:50:18+00:00
Abstract
Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.
中文标题/摘要
标题:基于数据驱动的参数化动力系统随机降阶建模
在不同条件下建模复杂的动力系统计算强度大,常常使高保真模拟变得不可行。虽然降阶模型(ROMs)提供了有希望的解决方案,但当前的方法往往难以处理随机动力学,并且无法量化预测不确定性,限制了其在稳健决策环境中的应用。为了解决这些挑战,我们提出了一种基于数据驱动的学习连续时间随机ROMs的方法,该方法可以在参数空间和激励条件下泛化。我们的方法基于可约化随机变分推断,利用马尔可夫高斯过程的重参数化技巧,在训练过程中消除昂贵的前向求解器的需要。这使我们能够联合学习概率自编码器和控制潜在动力学的随机微分方程,计算成本与数据集大小和系统刚度无关。此外,如果可用,我们的方法还提供了物理信息先验的灵活性。我们通过三个具有挑战性的测试问题进行了数值研究,展示了在未见过的参数组合和激励条件下的出色泛化能力,并与现有方法相比取得了显著的效率提升。
Summary / 总结
The paper addresses the computational challenges of modeling complex dynamical systems under varying conditions by introducing a data-driven framework for learning continuous-time stochastic reduced-order models (ROMs). This framework uses amortized stochastic variational inference and a reparametrization trick for Markov Gaussian processes to avoid expensive forward solvers during training. The method jointly learns a probabilistic autoencoder and stochastic differential equations for latent dynamics, achieving computational efficiency and flexibility to incorporate physics-informed priors. The approach demonstrates excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing methods in numerical studies.
该研究通过引入数据驱动的方法来学习随机降阶模型(ROMs),以应对复杂动态系统在不同条件下的建模计算挑战。该方法使用了近似随机变分推断和马尔可夫高斯过程,训练过程中无需昂贵的前向求解器,从而能够高效地学习概率自编码器和随机微分方程。研究在三个具有挑战性的测试问题上展示了该方法在未见参数组合和激励下的出色泛化能力,并且相比现有方法具有显著的效率提升。
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
The hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure on extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势及其潜在的失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案变得正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且可能会暂时或永远被困在那里。所有这些事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理的随机性来扩展猜测的数量)和模型自举(通过利用训练的随机性来扩展猜测的数量)。在实际应用方面,通过结合所有方法,我们开发了增强的HRM,将数独极端难度的准确率从54.5%提升到96.9%。在科学方面,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
This study investigates the reasoning patterns of the hierarchical reasoning model (HRM) and finds that it can fail even on extremely simple puzzles (attributed to a violation of the fixed point property), exhibits "grokking" dynamics in which a single critical reasoning step suddenly makes the answer correct, and can get trapped at incorrect fixed points. These findings suggest that HRM appears to be "guessing" rather than "reasoning". The authors propose three strategies to scale HRM's guesses: data augmentation, input perturbation, and model bootstrapping. Combined, they raise Sudoku-Extreme accuracy from 54.5% to 96.9%.
研究分析了层次推理模型(HRM)的推理模式,发现HRM在简单谜题上会失败,因为违反了固定点性质;推理过程中存在‘顿悟’现象,答案会突然变得正确;并且HRM可能会陷入错误的固定点。这些发现表明,HRM更像是在‘猜测’而不是‘推理’。为了改进HRM,研究提出了数据增强、输入扰动和模型自举等策略,使得数独极端难度的准确率从54.5%提升到了96.9%。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:促进长期目标感知一致进化的框架
大型语言模型(LLMs)已成为进化搜索的强大操作员,但高效的搜索支架设计仍缺乏系统方法。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,实验历史偏差未来候选生成;模式崩溃,由于探索与利用平衡不佳,代理在局部最小值中停滞;以及弱协作,僵化的杂交策略无法有效利用并行搜索轨迹。我们引入了目标感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以应对上下文污染;基于动量的回溯(MBB)以逃离局部最小值;以及一种自适应采样策略,统一了回溯和杂交以进行动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一种系统的方法,以实现一致的长期自我改进,其在LLM-SR和KernelBench上取得了最先进的结果,同时发现超越Modded NanoGPT记录的解决方案。
Summary / 总结
The paper addresses the challenges in evolutionary search using Large Language Models (LLMs) by introducing PACEvolve, a framework that tackles context pollution, mode collapse, and weak collaboration. PACEvolve uses hierarchical context management, momentum-based backtracking, and a self-adaptive sampling policy to improve search dynamics. Experimental results show that PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench, and discovers solutions surpassing the record on Modded NanoGPT.
研究旨在通过引入PACEvolve框架来解决使用大型语言模型(LLMs)进行进化搜索时的效率问题,该框架能够缓解上下文污染、局部最小值陷阱和协作不足的问题。PACEvolve采用分层上下文管理、动量回溯和自适应采样策略来提升搜索动态。关键实验结果表明,PACEvolve在LLM-SR和KernelBench上达到了最先进的性能,并发现了超越Modded NanoGPT记录的解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
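Temporal concordance can be made concrete with a short sketch. The function below computes a standard pairwise concordance index between gold and extracted event times; how the corpus authors align events and handle ties is not stated in the abstract, so the tie handling (0.5 credit) and the example times are assumptions.

```python
from itertools import combinations


def concordance_index(gold_times, pred_times):
    """Fraction of event pairs whose predicted order agrees with the gold order."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(gold_times)), 2):
        gold_diff = gold_times[i] - gold_times[j]
        if gold_diff == 0:
            continue  # pairs tied in the gold timeline are not comparable
        comparable += 1
        pred_diff = pred_times[i] - pred_times[j]
        if pred_diff == 0:
            concordant += 0.5  # predicted tie earns half credit
        elif (gold_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Hours since admission for four events; the extractor swaps events 2 and 3.
gold = [0, 24, 48, 72]
pred = [0, 30, 20, 72]
c_index = concordance_index(gold, pred)
```

With one of six comparable pairs ordered incorrectly, the index is 5/6; a value of 1.0 would mean the extracted timeline preserves every pairwise ordering.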
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床病历记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699份单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为结构化的文本时间线(事件,时间)对。该语料库包含超过560万条带时间戳的事件,以及提取的人口统计信息和诊断信息。技术验证使用了临床专家审核的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和通过对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持再现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和代码在公共存储库中公开可用。
Summary / 总结
PMOA-TTS is a corpus of 124,699 PubMed Open Access case reports converted into structured textual timelines, containing over 5.6 million timestamped events. The corpus was created using a scalable large-language-model pipeline and includes extracted demographics and diagnoses. Technical validation was performed using a clinician-curated gold set and three measures: semantic event matching, temporal concordance, and alignment error. The corpus enables research on timeline extraction, temporal reasoning, survival modeling, and event forecasting from narrative text, and is openly available in public repositories.
PMOA-TTS 通过将 124,699 份 PubMed 开放获取病例报告转化为结构化的文本时间线来解决大规模时间标注临床叙事的稀缺问题,使用的是可扩展的大语言模型管道。该语料库包含超过 560 万条时间戳事件和人口统计信息。技术验证使用了临床专家标注的金标准集和三种指标:语义事件匹配、时间一致性(c-指数)和对齐误差。该语料库可用于时间线提取、时间推理、生存建模和事件预测研究,并公开提供用于复制和进一步研究。
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:文化与多语言长视频推理基准
近期视频模型的发展取得了显著进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,引入了评估中的显著偏差。为解决这一问题,我们引入了CURVE(视频评估中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图表,并提出了一种使用这些图表的新型迭代策略,以识别推理中的细微错误。我们的评估表明,最先进的视频大模型面临巨大挑战,其性能远低于人类水平,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file\#minerva-cultural 公开可用。
Summary / 总结
The research introduces CURVE, a benchmark for multicultural and multilingual long video reasoning that addresses the western-centric, English-dominant bias of existing benchmarks. It comprises entirely human-generated annotations for region-specific cultural videos across 18 global locales, with questions, answers, and multi-step reasoning written in native languages. Evaluations show that state-of-the-art Video-LLMs perform substantially below human-level accuracy, with errors stemming primarily from the visual perception of cultural elements.
研究引入了CURVE基准,用于评估多文化多语言视频推理能力,解决了现有基准的偏差问题。它包含来自18个全球地区的高质量、人工生成的注释,并提供了复杂的本地语言问题和答案。评估结果显示,最先进的视频语言模型表现不佳,通常由于难以识别文化元素的视觉特征,远低于人类水平的准确度。
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3--4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
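The architectural change can be sketched in a few lines. This assumes a standard gated FFN of the form down(act(gate(x)) * up(x)) and a SiLU activation, neither of which is specified in the abstract; the tiny matrices and the `stem_ffn` name are purely illustrative.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def stem_ffn(x, token_id, W_gate, W_down, embed_table):
    """Gated FFN where the up-projection W_up @ x is replaced by a per-token table lookup."""
    gate = [silu(g) for g in matvec(W_gate, x)]  # dense gate, as in a regular FFN
    up = embed_table[token_id]                   # static, token-indexed lookup (no matmul, no routing)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)                # dense down-projection


# d_model = 2, d_ff = 3, vocabulary of 2 tokens (toy sizes).
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
embed_table = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
x = [1.0, -1.0]
out_token0 = stem_ffn(x, 0, W_gate, W_down, embed_table)
out_token1 = stem_ffn(x, 1, W_gate, W_down, embed_table)
```

Because the lookup depends only on the token id, the row to fetch is known before the layer runs, which is what makes prefetching from CPU memory possible; overwriting a single row of `embed_table` changes the layer's behavior for that token alone.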
中文标题/摘要
标题:STEM:使用嵌入模块扩展变换器
细粒度稀疏性在不按比例增加每词计算的情况下提供了更高的参数容量,但通常会遭受训练不稳定性、负载均衡和通信开销的问题。我们提出了STEM(使用嵌入模块扩展变换器),这是一种静态、基于token的方案,用层局部的嵌入查找替换FFN的上投影,同时保持门和下投影密集。这消除了运行时路由,允许CPU卸载和异步预取,并将容量与每词FLOPs和跨设备通信脱钩。实验证明,尽管稀疏性极端,STEM仍能稳定训练。它在减少每词FLOPs和参数访问的同时,提高了下游性能(相比密集基线减少了约三分之一的FFN参数)。STEM学习具有大角度扩展的嵌入空间,增强了其知识存储容量。更有趣的是,这种增强的知识容量带来了更好的可解释性。STEM嵌入的token索引性质允许以简单的方式在不干预输入文本或增加计算的情况下进行知识编辑和知识注入。此外,STEM增强了长上下文性能:随着序列长度的增长,更多的不同参数被激活,从而实现实际的测试时容量扩展。在350M和1B模型规模下,STEM总体上提供了高达约3-4%的准确率改进,特别是在知识和推理密集型基准(ARC-Challenge、OpenBookQA、GSM8K、MMLU)上取得了显著进步。总体而言,STEM是一种有效的方法,可以在提供更好的可解释性、更好的训练稳定性和改进的效率的同时扩展参数内存。
Summary / 总结
STEM introduces a static, token-indexed approach to scaling transformers: it replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense, which removes runtime routing and enables CPU offload with asynchronous prefetch. Empirically, STEM trains stably even with extreme sparsity and improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses. It also improves interpretability and long-context performance, delivering up to roughly 3-4% accuracy improvements at the 350M and 1B model scales.
STEM 通过将 FFN 上投影替换为层局部嵌入查找,引入了一种静态的、基于 token 索引的方法来扩展 Transformer,这消除了运行时路由并允许 CPU 卸载。实验表明,STEM 在极高的稀疏性下仍能稳定训练,减少了每 token FLOPs 和参数访问,并在各种基准测试上带来了高达约 3-4% 的准确率提升。该方法增强了知识存储容量和可解释性,并且随着序列长度的增长,扩展得更好。
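As a rough illustration of the STEM idea described in this entry, the sketch below shows a gated FFN whose up-projection is replaced by a per-token embedding lookup, with the gate and down-projection kept dense. The class name `StemLikeFFN`, the SiLU gate activation, the tiny dimensions, and the random initialization are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
import random


def silu(v):
    # SiLU gate activation (an assumption; the abstract does not name one).
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    # Dense matrix-vector product on plain Python lists.
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class StemLikeFFN:
    """Hypothetical gated FFN with a token-indexed table instead of W_up."""

    def __init__(self, vocab_size, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w_gate = make_matrix(d_hidden, d_model, rng)        # dense gate
        self.w_down = make_matrix(d_model, d_hidden, rng)        # dense down-projection
        self.up_table = make_matrix(vocab_size, d_hidden, rng)   # one row per token id

    def forward(self, hidden, token_id):
        gate = [silu(g) for g in matvec(self.w_gate, hidden)]
        up = self.up_table[token_id]  # lookup stands in for W_up @ hidden
        return matvec(self.w_down, [g * u for g, u in zip(gate, up)])


if __name__ == "__main__":
    ffn = StemLikeFFN(vocab_size=5, d_model=4, d_hidden=8)
    print(ffn.forward([0.5, -1.0, 0.25, 2.0], token_id=1))
```

Because the lookup depends only on the token id, the table rows needed for a batch are known before the layer runs, which is the property the abstract connects to CPU offload with asynchronous prefetch.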
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Authors: Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu
First: 2026-01-15T17:52:29+00:00 · Latest: 2026-01-15T17:52:29+00:00
Comments: Project Page: https://igl-hkust.github.io/CoMoVi/
Abstract
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
中文标题/摘要
标题:CoMoVi: 共生成3D人体动作和逼真视频
在本文中,我们发现3D人体动作生成和2D人体视频生成是内在耦合的。3D动作提供了视频中合理性和一致性的结构先验,而预训练的视频模型为动作提供了强大的泛化能力,这需要耦合它们的生成过程。基于此,我们提出了CoMoVi,一种将两个视频扩散模型(VDMs)耦合在一起,在单一扩散去噪循环中同步生成3D人体动作和视频的共生成框架。为此,我们首先提出了一种有效的2D人体动作表示,可以继承预训练VDMs的强大先验。然后,我们设计了一种双分支扩散模型,通过相互特征交互和3D-2D交叉注意力来耦合人体动作和视频生成过程。此外,我们构建了CoMoVi数据集,这是一个包含文本和动作注释的大规模真实世界人体视频数据集,涵盖了多样且具有挑战性的动作。广泛的实验表明,我们的方法在3D人体动作和视频生成任务中均具有有效性。
Summary / 总结
The paper addresses the coupling of 3D human motion generation and 2D video generation by proposing CoMoVi, a co-generative framework that synchronously generates 3D human motions and videos using two video diffusion models. The method introduces an effective 2D human motion representation and a dual-branch diffusion model with mutual feature interaction and 3D-2D cross attentions. The authors also created the CoMoVi Dataset, a large-scale dataset with text and motion annotations. Experiments show the method's effectiveness in both 3D human motion and video generation.
论文探讨了3D人体动作生成与2D人体视频生成之间的耦合关系,提出了一种名为CoMoVi的共生成框架,该框架使用两个视频扩散模型同步生成3D人体动作和视频。方法引入了有效的2D人体动作表示,并设计了具有相互特征交互和3D-2D交叉注意力的双分支扩散模型。通过广泛的实验验证了该方法在3D人体动作和视频生成任务中的有效性。
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:基于双不确定性指导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂的推理还是模糊的感知,从而无法有针对性地分配探索或学习信号。为解决这一问题,我们引入了DUPL,这是一种用于多模态RLVR的基于双不确定性指导的策略学习方法,通过对称KL散度量化和利用感知不确定性以及通过策略熵量化输出不确定性来指导策略更新。通过建立一个基于不确定性的反馈循环并采用动态分支优先机制,DUPL重新校准了策略优势,使其能够聚焦于具有高感知或决策模糊性的状态,从而实现有效的目标探索,超越了被动的数据增强。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行了评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提高了11.2%,通用领域推理任务上的准确率提高了7.1%,并且始终优于GRPO。这些结果表明,基于双不确定性的策略学习是一种有效且可泛化的多模态RLVR方法。
Summary / 总结
The paper introduces DUPL, a dual-uncertainty guided policy learning approach for multimodal reinforcement learning with verifiable rewards (RLVR), which addresses the issue of treating visual inputs as deterministic. DUPL quantifies perceptual and output uncertainties to guide policy updates, focusing on states with high ambiguity. Experiments on six benchmarks show DUPL improves Qwen2.5-VL models by up to 11.2% in visual math tasks and 7.1% in general-domain reasoning tasks, outperforming GRPO.
研究解决了现有强化学习方法将视觉输入视为确定性的问题,导致难以区分复杂的推理和感知的不确定性。DUPL通过量化感知和输出不确定性来引导策略更新。该方法提高了Qwen2.5-VL 3B和7B模型,在视觉数学任务上实现了高达11.2%的准确率提升,在一般领域推理任务上实现了高达7.1%的提升,并在六个跨模态推理基准测试中优于GRPO。
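The two uncertainty signals named in the DUPL entry, symmetric KL divergence and policy entropy, can be sketched on discrete distributions as below. Which distributions are compared (for example, model outputs under an original versus a perturbed image) and how the two signals enter the advantage are not stated in the abstract, so the inputs here are placeholders and the function names are hypothetical. Some definitions of symmetric KL include a factor of 0.5; this sketch uses the plain sum.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions given as equal-length lists.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    # Symmetric KL: KL(p || q) + KL(q || p).
    return kl_divergence(p, q) + kl_divergence(q, p)


def policy_entropy(p):
    # Shannon entropy (in nats) of a next-token or action distribution.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


if __name__ == "__main__":
    clean_view = [0.7, 0.2, 0.1]      # placeholder output distribution
    perturbed_view = [0.4, 0.4, 0.2]  # placeholder distribution under a perturbed input
    print(symmetric_kl(clean_view, perturbed_view))
    print(policy_entropy(clean_view))
```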
Data-Driven Dynamic Factor Modeling via Manifold Learning
Authors: Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
First: 2025-06-24T18:40:40+00:00 · Latest: 2026-01-15T17:50:32+00:00
Abstract
We introduce a data-driven dynamic factor framework for modeling the joint evolution of high-dimensional covariates and responses without parametric assumptions. Standard factor models applied to covariates alone often lose explanatory power for responses. Our approach uses anisotropic diffusion maps, a manifold learning technique, to learn low-dimensional embeddings that preserve both the intrinsic geometry of the covariates and the predictive relationship with responses. For time series arising from Langevin diffusions in Euclidean space, we show that the associated graph Laplacian converges to the generator of the underlying diffusion. We further establish a bound on the approximation error between the diffusion map coordinates and linear diffusion processes, and we show that ergodic averages in the embedding space converge under standard spectral assumptions. These results justify using Kalman filtering in diffusion-map coordinates for predicting joint covariate-response evolution. We apply this methodology to equity-portfolio stress testing using macroeconomic and financial variables from Federal Reserve supervisory scenarios, achieving mean absolute error improvements of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
中文标题/摘要
标题:基于流形学习的数据驱动动态因子建模
我们提出了一种数据驱动的动态因子框架,用于在无需参数假设的情况下建模高维协变量和响应的联合演变。单独应用于协变量的标准因子模型往往无法解释响应。我们的方法使用各向异性扩散映射,这是一种流形学习技术,来学习低维嵌入,同时保留协变量的内在几何结构和与响应的预测关系。对于来自欧几里得空间朗文扩散的时间序列,我们证明了相关的图拉普拉斯算子收敛于基础扩散的生成器。我们进一步建立了扩散映射坐标与线性扩散过程之间近似误差的上界,并证明在嵌入空间中的遍历平均值在标准谱假设下收敛。这些结果证明了在扩散映射坐标中使用卡尔曼滤波预测协变量-响应联合演变的有效性。我们使用联邦储备监管情景中的宏观经济和金融变量将该方法应用于股票组合压力测试,实现了相对于经典情景分析高达55%的平均绝对误差改进,以及相对于主成分分析基准的39%改进。
Summary / 总结
The research aims to model the joint evolution of high-dimensional covariates and responses without parametric assumptions, using anisotropic diffusion maps to learn low-dimensional embeddings that preserve both the intrinsic geometry of the covariates and their predictive relationship with responses. The study shows that the graph Laplacian converges to the generator of the underlying diffusion for time series from Langevin diffusions, and establishes a bound on the approximation error between diffusion map coordinates and linear diffusion processes. The methodology improves mean absolute error by up to 55% in equity-portfolio stress testing compared to classical scenario analysis and 39% compared to principal component analysis benchmarks.
该论文提出了一种数据驱动的动态因子框架,用于在无参数假设的情况下建模高维协变量和响应的联合演变。它使用各向异性扩散图来学习同时保留协变量内在几何结构及其与响应预测关系的低维嵌入。该方法应用于股票组合压力测试,显示出比经典情景分析高出55%的平均绝对误差改进,以及比主成分分析基准高出39%的改进。
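The filtering step mentioned in this entry can be pictured with a scalar Kalman filter applied to a single embedding coordinate. This is only a minimal sketch: it assumes the diffusion-map coordinate has already been computed elsewhere, that it follows a linear-Gaussian model with a direct noisy observation, and that the parameter values `a`, `q`, and `r` are hypothetical. The paper's actual multivariate setup is not reproduced here.

```python
def kalman_step(x, p, z, a=1.0, q=0.01, r=0.1):
    """One predict/update step of a scalar Kalman filter.

    x, p: previous state estimate and its variance
    z:    new noisy observation of the coordinate
    a:    linear transition coefficient (hypothetical value)
    q, r: process and observation noise variances (hypothetical values)
    """
    # Predict the next state and its variance.
    x_pred = a * x
    p_pred = a * a * p + q
    # Update with the observation.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


def filter_series(observations, x0=0.0, p0=1.0, **params):
    # Run the filter over a sequence and return the filtered estimates.
    x, p = x0, p0
    estimates = []
    for z in observations:
        x, p = kalman_step(x, p, z, **params)
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    print(filter_series([0.9, 1.1, 1.0, 1.2]))
```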
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们得出两项发现。首先,置信度阈值化在分布内提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率f
Summary / 总结
The research examines whether confidence-based abstention gives vision-language models reliable control over error rates in high-stakes video question answering, where a system should abstain when uncertain rather than risk a costly error. Using the NExT-QA benchmark and the Gemini 2.0 Flash model, the study finds that confidence thresholding provides mechanistic control in-distribution: sweeping the threshold epsilon produces smooth risk-coverage tradeoffs. It also investigates whether that control remains robust under distribution shift.
研究旨在通过使视觉-语言模型在不确定时能够选择不进行预测,提高其在高风险应用中的可靠性。研究使用基于信心的避免预测,并发现通过调整信心阈值可以平滑地在风险和覆盖率之间进行权衡,从而有效降低同分布情况下的错误率。但该方法在分布变化下的鲁棒性仍需进一步研究。
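The risk-coverage tradeoff described in this entry can be sketched as a threshold sweep over per-question confidence scores. The function name, the toy confidences, and the rule "answer when confidence is at least the threshold" are assumptions for illustration; the abstract does not say how confidence is extracted from the model.

```python
def risk_coverage(confidences, correct, threshold):
    """Return (coverage, risk) when answering only if confidence >= threshold.

    coverage: fraction of questions answered
    risk:     error rate among answered questions (0.0 if none are answered)
    """
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


if __name__ == "__main__":
    confidences = [0.9, 0.8, 0.6, 0.4]     # toy values
    correct = [True, True, False, False]   # toy labels
    for threshold in (0.0, 0.5, 0.7, 0.95):
        print(threshold, risk_coverage(confidences, correct, threshold))
```

Raising the threshold lowers coverage and, when confidence is informative, lowers the error rate on the questions that are still answered.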
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs生成的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
Molmo2 is a new family of vision-language models with open weights and data that is state-of-the-art among open-source models and adds point-driven grounding for single-image, multi-image, and video tasks. The research addresses the lack of open foundations for improving video-language models by providing 9 new datasets (7 video and 2 multi-image), collected without closed VLMs, together with a training recipe. On video grounding, Molmo2 outperforms open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
Molmo2 是一种新的开源视觉-语言模型,其在视频理解和定位任务中优于现有开源模型和专有模型。研究通过提供 9 个新数据集和训练方法来解决缺乏开源基础的问题。主要发现包括在点驱动的定位任务中表现更优,Molmo2 在视频计数和视频定位任务上的准确率高于 Qwen3-VL 和 Gemini 3 Pro。
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造性问题解决的LLM独特性意识强化学习
强化学习(RL)已成为后训练大型语言模型(LLMs)的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在少数主导推理模式上,提高了pass@1,但限制了滚动力量多样性以及pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了面向独特性的强化学习,这是一种滚动力量目标,明确奖励表现出罕见高层策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的滚动力量根据其高层解决方案策略聚类,忽略表面差异,并根据聚类大小反向重权重策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@$k$,并增加了pass@$k$曲线下的面积(AUC@$K$),同时不牺牲pass@1,同时保持探索并大规模揭示更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant strategies. It introduces Uniqueness-Aware Reinforcement Learning, which rewards rare high-level strategies to enhance rollout diversity. The method uses an LLM-based judge to cluster similar solutions and reweights policy advantages inversely with cluster size, leading to improved pass@$k$ across various reasoning benchmarks without compromising pass@1.
论文针对强化学习中大型语言模型探索坍塌的问题,即政策倾向于聚焦于少数主导推理模式。提出了一种新颖的Uniqueness-Aware强化学习方法,通过奖励罕见的高阶策略来促进多样性。该方法使用基于LLM的评判器根据高阶策略对策略进行聚类,并根据聚类大小反向重权重策略优势。实验结果显示,该方法在数学、物理和医学推理基准上提高了pass@$k$,增加了AUC@$K$,同时不牺牲pass@1,并增强了探索,揭示了更多样化的解决方案策略。
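The reweighting idea in this entry, scaling each rollout's advantage inversely with the size of its strategy cluster, can be sketched as follows. The cluster labels would come from the LLM-based judge, which is not modeled here. The exact weighting formula (a plain `1 / cluster_size` factor applied to every rollout) is an assumption, since the abstract only says the weights are inverse to cluster size.

```python
from collections import Counter


def uniqueness_weighted_advantages(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    advantages:  per-rollout advantages for one problem
    cluster_ids: per-rollout strategy labels (assumed to come from a judge)
    """
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


if __name__ == "__main__":
    # Three correct rollouts share one strategy; one correct rollout is unique.
    advantages = [1.0, 1.0, 1.0, 1.0]
    cluster_ids = ["substitution", "substitution", "substitution", "geometric"]
    print(uniqueness_weighted_advantages(advantages, cluster_ids))
```

In this toy case the lone "geometric" rollout keeps its full advantage while each of the three redundant rollouts receives one third, so the rarer correct strategy gets the larger learning signal.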
Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming
Authors: Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe
First: 2026-01-15T17:23:39+00:00 · Latest: 2026-01-15T17:23:39+00:00
Comments: 19 pages
Abstract
Adaptive video streaming has facilitated improved video streaming over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders for Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. The JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29% to maintain the same XPSNR, compared to a widely used fixed ladder. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38% bitrate savings and 6.17% decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.
中文标题/摘要
标题:多目标帕累托前沿优化以实现高效的自适应VVC流媒体
自适应视频流媒体在过去几年中促进了视频流媒体的改进。为了实现高效、内容和编解码器依赖的自适应视频流媒体,需要在比特率、视频质量和解码复杂性等编码性能目标之间取得平衡。本文提出了一种多目标帕累托前沿(PF)优化框架,以构建质量单调、内容自适应的Versatile Video Coding (VVC)流媒体比特率梯度,该框架联合优化视频质量、比特率和解码时间,解码时间被用作解码能耗的实用代理。介绍了两种策略:联合速率-质量和时间帕累托前沿(JRQT-PF)和联合质量和时间帕累托前沿(JQT-PF),每种策略探索不同的权衡形式和目标优先级。在自适应流媒体过程中,比特率梯度在质量单调性约束下构建,以确保一致的用户体验(QoE)。在大规模UHD数据集(Inter-4K)上进行了实验,使用PSNR、VMAF和XPSNR评估质量,使用解码时间和能耗测量复杂性。JQT-PF方法平均节省了11.76%的比特率,同时将平均解码时间减少了0.29%,以保持相同的XPSNR,与广泛使用的固定梯度相比。更激进的配置在成本增加复杂性的情况下可节省高达27.88%的比特率。另一方面,JRQT-PF策略提供了更可控的权衡,实现了6.38%的比特率节省和6.17%的解码时间减少。该框架优于现有方法,包括固定梯度、基于VMAF和XPSNR的动态分辨率选择以及考虑复杂性的基准。结果表明,带解码时间约束的PF优化能够实现针对网络和设备能力的可持续、高质量流媒体。
Summary / 总结
This paper proposes a multi-objective Pareto-front optimization framework for adaptive VVC streaming that balances video quality, bitrate, and decoding time. Two strategies, JRQT-PF and JQT-PF, explore different tradeoffs. On a large-scale UHD dataset, JQT-PF achieves 11.76% average bitrate savings and a 0.29% reduction in average decoding time at the same XPSNR relative to a widely used fixed ladder, while JRQT-PF offers more controlled tradeoffs with 6.38% bitrate savings and a 6.17% decoding time reduction. The framework outperforms fixed ladders and other benchmarks.
该论文提出了一种多目标帕累托前沿优化框架,用于自适应VVC流媒体,旨在平衡码率、视频质量和解码时间。引入了两种策略JRQT-PF和JQT-PF来探索不同的权衡。在大规模UHD数据集上的实验表明,JQT-PF方法实现了11.76%的平均码率节省,并且解码时间略有减少,而JRQT-PF策略提供了更可控的权衡,实现了6.38%的码率节省和6.17%的解码时间减少。所提出的框架在解码时间约束下优于现有方法,在效率和质量方面表现出色。
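A Pareto front over (bitrate, quality, decoding time) and a quality-monotonic ladder built from it can be sketched as below. The encoding points are made-up numbers, and the simple "keep a rung only if quality increases with bitrate" rule is an assumed stand-in for the paper's monotonicity constraint, not the JRQT-PF or JQT-PF procedure itself.

```python
def dominates(a, b):
    """True if encoding a is at least as good as b everywhere and better somewhere.

    Each point is (bitrate, quality, decode_time): lower bitrate, higher
    quality, and lower decode time are better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(points):
    # Keep every point that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]


def monotonic_ladder(front):
    # Walk the front by increasing bitrate; keep rungs whose quality increases.
    ladder = []
    for point in sorted(front, key=lambda p: p[0]):
        if not ladder or point[1] > ladder[-1][1]:
            ladder.append(point)
    return ladder


if __name__ == "__main__":
    # (bitrate in kbps, quality score, decode time in s): illustrative values only.
    encodings = [
        (1000, 35.0, 10.0),
        (1200, 34.0, 12.0),
        (2000, 38.0, 15.0),
        (2500, 37.5, 9.0),
    ]
    front = pareto_front(encodings)
    print(front)
    print(monotonic_ladder(front))
```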
RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
Authors: Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian
First: 2026-01-15T17:23:19+00:00 · Latest: 2026-01-15T17:23:19+00:00
Abstract
Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
中文标题/摘要
标题:RSATalker:现实社会意识的多轮对话模拟头部生成
头部生成在虚拟现实(VR)中越来越重要,尤其是在涉及多轮对话的社会场景中。现有方法面临显著的局限性:基于网格的3D方法可以建模双人对话,但缺乏逼真的纹理,而基于大型模型的2D方法产生自然外观但计算成本高昂。最近,基于3D高斯散点图(3DGS)的方法实现了高效的逼真渲染,但仍然仅支持单人说话且忽略了社会关系。我们引入了RSATalker,这是第一个利用3DGS进行现实社会意识的多轮对话模拟头部生成的框架。我们的方法首先从语音驱动网格基3D面部运动,然后将3D高斯点绑定到网格面以渲染高保真2D头像视频。为了捕捉人际动态,我们提出了一种社会意识模块,通过可学习的查询机制将社会关系,包括血缘和非血缘以及平等和不平等,编码为高级嵌入。我们设计了三阶段训练范式,并构建了包含社会关系注释的RSATalker数据集,带有语音-网格-图像三元组。大量实验表明,RSATalker在逼真度和社会意识方面均达到最先进的性能。代码和数据集将被发布。
Summary / 总结
RSATalker is a framework for generating realistic and socially-aware talking heads for multi-turn conversations in virtual reality. It uses 3D Gaussian Splatting to render high-fidelity 2D avatar videos and includes a socially-aware module that encodes social relationships. Experiments show that RSATalker outperforms existing methods in both realism and social awareness.
RSATalker 是一个框架,利用 3D 高斯散射生成多轮对话中具有现实感和社会意识的说话头部。它从语音驱动 3D 面部运动,并将 3D 高斯分布绑定到网格面片以生成高质量的 2D 头像视频。社会意识模块将社会关系编码为高级嵌入以捕捉人际动态。实验表明,RSATalker 在现实感和社会意识方面均优于现有方法。
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:协作多智能体推理的测试时强化学习
多智能体系统已成为许多应用中的实用LLM驱动合作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应引入了非平稳性,奖励通常稀疏且高方差。因此,我们引入了多智能体测试时强化学习(MATTRL)框架,在推理时向多智能体协商注入结构化文本经验。MATTRL 形成一个由专家组成的多专家团队,进行多轮讨论,检索和整合测试时经验,并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池,然后将其重新注入对话。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 的准确率相对于多智能体基线平均提高了3.67%,相对于可比的单智能体基线提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需调整即可实现分布转移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. Experiments across medicine, math, and education show that MATTRL improves accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines.
研究通过引入Multi-Agent Test-Time Reinforcement Learning (MATTRL),在推理时注入结构化的文本经验,以解决多智能体强化学习(MARL)中的挑战。MATTRL 形成一个多专家团队进行多轮讨论,检索并整合测试时的经验,并达成共识进行最终决策。研究显示,MATTRL 在医学、数学和教育等领域的各种基准测试中,相对于多智能体基线提高了平均 3.67% 的准确性,相对于可比的单智能体基线提高了 8.67%。消融研究进一步分析了不同信用分配方案对训练结果的影响。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然不甚了解。在本工作中,我们系统地研究了在上游视觉感知受控退化下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性感知模块。我们引入了感知现实的退化,这些退化仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误判,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了需要评估框架来明确考虑感知不确定性在关键安全应用中的重要性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) under perceptual degradation, focusing on their performance in autonomous driving and embodied AI. By introducing controlled corruptions to the semantic segmentation of the Cityscapes dataset, the research reveals that even moderate degradation can lead to severe misalignments in VLMs, such as hallucinations and omissions of critical entities. The authors propose new metrics to quantify these misalignments and find a clear disconnect between pixel-level robustness and multimodal semantic reliability, emphasizing the need for better evaluation methods in safety-critical applications.
该研究探讨了视觉语言模型(VLMs)在自主驾驶和具身AI系统中对感知退化的鲁棒性。通过在Cityscapes数据集上系统地引入控制性的损坏,研究揭示了VLMs在严重失败,包括幻觉和关键遗漏。作者提出了语言层面的对齐度量标准来量化这些问题,并发现像素级鲁棒性和多模态语义可靠性之间存在明显的断层,强调了需要更好的评估框架来考虑感知不确定性在关键安全应用中的影响。
STEP3-VL-10B Technical Report
Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
First: 2026-01-14T17:58:24+00:00 · Latest: 2026-01-15T17:06:04+00:00
Comments: 50 pages
Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
中文标题/摘要
标题:STEP3-VL-10B 技术报告
我们提出了STEP3-VL-10B,这是一种轻量级开源基础模型,旨在重新定义紧凑效率与前沿多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现:首先,采用统一的、完全解冻的预训练策略,基于1.2万亿多模态令牌,结合语言对齐的感知编码器与Qwen3-8B解码器,建立内在的视觉-语言协同作用;其次,采用扩展的后训练流水线,包含超过1000次强化学习迭代。关键地,我们实现了并行协调推理(PaCoRe)以扩展测试时计算,将资源分配给可扩展的感知推理,探索和综合多种视觉假设。因此,尽管其紧凑的10B参数量,STEP3-VL-10B 在性能上与比其大10-20倍的模型(如GLM-4.6V-106B、Qwen3-VL-235B)相当或超越,并在顶级专有旗舰产品(如Gemini 2.5 Pro和Seed-1.5-VL)中表现出色。它在MMBench上记录了92.2%的得分,在MMMU上记录了80.11%的得分,同时在复杂推理方面分别取得了94.43%和75.95%的得分。我们发布了完整的模型套件,为社区提供了一个强大、高效且可复现的基础。
Summary / 总结
The research aims to develop a lightweight foundation model, STEP3-VL-10B, that balances compactness with frontier-level multimodal intelligence. It combines a unified, fully unfrozen pre-training strategy with a scaled post-training pipeline of over 1,000 reinforcement learning iterations, and uses Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute. Despite its 10B parameter size, STEP3-VL-10B rivals or surpasses models 10-20x larger as well as proprietary flagships, scoring 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.
研究旨在开发一种轻量级模型,平衡紧凑性和高级多模态智能。STEP3-VL-10B采用统一的预训练策略和扩展的后训练管道,性能与更大规模的模型相当。关键发现包括在MMBench上达到92.2%,在MMMU上达到80.11%,在AIME2025上达到94.43%。同时保持10B参数量。
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一点需要涵盖广泛领域的大型、开放词汇量视频动作数据集。我们介绍了Action100M,这是一个从1.2M互联网教学视频(14.6年时长)构建的大规模数据集,提供了O(100百万)个时间局部化片段,带有开放词汇量的动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述,组织为描述树,(iii)使用多轮Self-Refine程序下的推理模型(GPT-OSS-120B)聚合证据,输出结构化注释(简要/详细动作、演员、简要/详细描述)。在Action100M上训练VL-JEPA展示了在各种动作识别基准测试中一致的数据规模改进和强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
The research aims to develop a large-scale video action dataset to enhance machine intelligence in physical world applications. The method involves creating Action100M from 1.2 million instructional videos, using a fully automated pipeline for temporal segmentation, caption generation, and structured annotation. Key findings show consistent data-scaling improvements and strong zero-shot performance across various action recognition benchmarks, positioning Action100M as a foundational resource for video understanding and world modeling research.
研究旨在开发大规模视频动作数据集,以促进物理世界中的机器智能应用。主要方法是从120万条教学视频中创建Action100M,生成超过1亿个片段,带有开放词汇的动作监督和丰富的字幕,通过全自动流水线生成。关键发现显示,Action100M在各种动作识别基准测试中表现出一致的数据扩展改进和强大的零样本性能,确立了其作为视频理解和世界建模研究新基础的地位。
SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning
Authors: Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince Calhoun, Sergey Plis
Venue: Transactions on Machine Learning Research, 2026
First: 2024-05-15T02:13:51+00:00 · Latest: 2026-01-15T17:01:07+00:00
Comments: Published in Transactions on Machine Learning Research (TMLR), 2026
Abstract
In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training: parameter saliency scores are computed separately on each client's local data in non-IID scenarios and then aggregated to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy-sparsity trade-off, achieving more than 20\% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by $2 \times$ relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over $2.3 \times$ faster communication time, underscoring its practical efficiency.
中文标题/摘要
标题:SSFL:初始化时发现稀疏统一子网络以实现高效的联邦学习
在本文中,我们提出了显著稀疏联邦学习(SSFL),这是一种用于稀疏联邦学习的高效通信简化方法。SSFL 在训练前识别一个稀疏子网络,利用在非IID场景下分别在本地客户端数据上计算的参数显著性分数进行聚合,以确定一个全局掩码。仅在每次客户端与服务器之间通信时训练和传输稀疏模型权重。在包括CIFAR-10、CIFAR-100和Tiny-ImageNet的标准基准测试中,SSFL 一致地改善了准确性和稀疏性之间的权衡,在CIFAR-10上相对于最强的稀疏基线实现了超过20%的相对误差减少,同时将通信成本减少了2倍。最后,在实际的联邦学习部署中,SSFL 实现了超过2.3倍的更快通信时间,突显了其实用效率。
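The mask-construction step described in the SSFL abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the saliency score `|weight * gradient|`, the plain mean across clients, and the function names `client_saliency` and `global_mask` are all assumptions, since the abstract does not specify them.

```python
from typing import List


def client_saliency(weights: List[float], grads: List[float]) -> List[float]:
    """Per-parameter saliency on one client; |w * g| is an assumed score."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def global_mask(client_scores: List[List[float]], density: float) -> List[int]:
    """Average client scores and keep the top `density` fraction of parameters."""
    n_params = len(client_scores[0])
    mean_scores = [
        sum(scores[i] for scores in client_scores) / len(client_scores)
        for i in range(n_params)
    ]
    keep = max(1, round(density * n_params))
    # Indices of the `keep` highest-scoring parameters.
    top = sorted(range(n_params), key=lambda i: mean_scores[i], reverse=True)[:keep]
    mask = [0] * n_params
    for i in top:
        mask[i] = 1
    return mask


# Two clients with different (non-IID) gradients over the same initial weights.
weights = [0.5, -1.0, 0.1, 2.0]
scores_a = client_saliency(weights, [0.2, 0.1, 0.9, 0.0])
scores_b = client_saliency(weights, [0.4, 0.3, 0.1, 0.1])
mask = global_mask([scores_a, scores_b], density=0.5)  # keeps parameters 0 and 1
```

Once the mask is fixed, only the entries where `mask[i] == 1` would need to be trained and exchanged each round, which is where the communication saving comes from.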
Searching for Quantum Effects in the Brain: A Bell-Type Test for Nonclassical Latent Representations in Autoencoders
Authors: I. K. Kominis, C. Xie, S. Li, M. Skotiniotis, G. P. Tsironis
First: 2026-01-15T16:59:40+00:00 · Latest: 2026-01-15T16:59:40+00:00
Comments: 6 pages, 2 figures
Abstract
Whether neural information processing is entirely classical or involves quantum-mechanical elements remains an open question. Here we propose a model-agnostic, information-theoretic test of nonclassicality that bypasses microscopic assumptions and instead probes the structure of neural representations themselves. Using autoencoders as a transparent model system, we introduce a Bell-type consistency test in latent space, and ask whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. By shifting the search for quantum-like signatures in neural systems from microscopic dynamics to experimentally testable constraints on information processing, this work opens a new route for probing the fundamental physics of neural computation.
中文标题/摘要
标题:在大脑中寻找量子效应:自编码器中非经典潜在表示的贝尔型测试
神经信息处理是否完全经典或涉及量子力学元素仍是一个开放问题。在这里,我们提出了一种模型无关的信息论非经典性测试,该测试绕过了微观假设,而是直接探测神经表示本身的结构。利用自编码器作为透明的模型系统,我们引入了潜在空间中的贝尔型一致性测试,并询问在多种读出上下文中获得的解码统计是否可以由单一的正潜在变量分布联合解释。通过将对神经系统中似量子特征的搜索从微观动力学转移到可实验测试的信息处理约束上,这项工作为探索神经计算的基本物理学开辟了一条新途径。
Summary / 总结
This study asks whether neural information processing involves quantum-mechanical elements and proposes a model-agnostic, information-theoretic test to probe it. Using autoencoders as a model system, the researchers introduce a Bell-type consistency test in latent space, examining whether decoding statistics from multiple readout contexts can be jointly explained by a single positive latent-variable distribution. The main contribution is methodological: the approach shifts the search for quantum-like signatures from microscopic dynamics to experimentally testable constraints on information processing, potentially opening a new avenue for understanding the fundamental physics of neural computation.
该研究旨在通过基于信息理论的方法,探索神经信息处理中是否存在量子效应。方法是使用自编码器和潜空间中的贝尔一致性测试,检查解码统计是否可以由单一的正潜变量分布来解释。主要发现表明,这种方法允许在不假设微观动力学的情况下测试神经表示的非经典性,从而开辟了一条探索神经计算基本物理的新途径。
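For readers unfamiliar with Bell-type tests, the textbook CHSH inequality gives a feel for what "jointly explained by a single positive latent-variable distribution" means. The sketch below is the standard CHSH form and is only an assumption about the flavour of test involved; the abstract does not say which inequality the authors use. Any set of correlations that a single classical latent-variable distribution can explain satisfies S <= 2, while quantum correlations can reach 2*sqrt(2).

```python
import math


def chsh(e_ab: float, e_ab2: float, e_a2b: float, e_a2b2: float) -> float:
    """CHSH value S = |E(a,b) + E(a,b') + E(a',b) - E(a',b')|."""
    return abs(e_ab + e_ab2 + e_a2b - e_a2b2)


# Classical example: both readouts always return +1, so every correlation is 1.
s_classical = chsh(1.0, 1.0, 1.0, 1.0)

# Quantum-like example: correlations of magnitude 1/sqrt(2) with the sign
# pattern of the optimal quantum strategy (Tsirelson bound).
r = 1.0 / math.sqrt(2.0)
s_quantum = chsh(r, r, r, -r)

violates_classical_bound = s_quantum > 2.0
```

A measured S above 2 rules out any single positive latent-variable model for those four readout contexts.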
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉-语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(基于图像复杂性的自适应检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而为复杂的图像通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在图像表示和文本嵌入在相同的语义空间中保持兼容性的同时,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据进行评估,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并且保留了 95% 的实例级性能,从而实现了视觉-语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach for vision transformers in vision-language models, which uses less compute for simple images and processes complex images through the full network depth. ICAR maintains cross-modal alignment through dual-path training, which produces compatible embeddings from both the early-exit and full-depth paths. A companion classifier, ConvNeXt-IC, assesses image complexity to decide how much compute to use; it reaches a Pearson correlation of 0.959 with human labelling while predicting complexity 4.4x faster. ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance. Unlike existing two-stage approaches, ICAR allows direct image-text matching without additional overhead.
论文提出了一种名为ICAR(Image Complexity-Aware Retrieval)的方法,该方法使视觉变换器在视觉语言模型中能够为简单图像使用较少的计算资源,而为复杂图像使用全网络深度。该方法通过双路径训练保持跨模态对齐,确保图像表示和文本嵌入在相同的语义空间中兼容。ICAR在标准基准上的图像编码速度提高了20%,同时保持了类别级别的性能和95%的实例级别性能。ConvNeXt-IC作为一种分类器骨干网络,提供了4.4倍更快的图像复杂性评估,同时保持了高准确性。
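The routing idea in the ICAR entry above can be sketched in a few lines. This is a toy illustration under stated assumptions: the threshold, the exit depth, the name `adaptive_encode`, and the stand-in "blocks" are all hypothetical, and the sketch does not model the dual-path training that makes early-exit and full-depth embeddings compatible.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def adaptive_encode(
    image: List[float],
    layers: Sequence[Layer],
    complexity: float,
    threshold: float = 0.5,
    exit_layer: int = 2,
) -> Tuple[List[float], int]:
    """Encode with fewer layers when the image is judged simple.

    Returns the embedding and the number of layers actually executed.
    """
    depth = exit_layer if complexity < threshold else len(layers)
    x = image
    for layer in layers[:depth]:
        x = layer(x)
    return x, depth


# Toy stand-in for a 4-block encoder: each "block" just adds 1 to every value.
blocks: List[Layer] = [lambda x: [v + 1.0 for v in x] for _ in range(4)]

simple_emb, simple_depth = adaptive_encode([0.0, 0.0], blocks, complexity=0.2)
complex_emb, complex_depth = adaptive_encode([0.0, 0.0], blocks, complexity=0.9)
```

In a real system the `complexity` value would come from a predictor such as ConvNeXt-IC, and the saving depends on how many images fall below the threshold.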
Combinatorial Optimization Augmented Machine Learning
Authors: Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
First: 2026-01-15T16:55:19+00:00 · Latest: 2026-01-15T16:55:19+00:00
Abstract
Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
中文标题/摘要
标题:组合优化增强机器学习
组合优化增强机器学习(COAML)最近作为一种强大的范式,将预测模型与组合决策制定相结合而崭露头角。通过将组合优化预言机(oracles)嵌入到学习管道中,COAML能够构建既数据驱动又符合可行性的策略,连接机器学习、运筹学和随机优化的传统。本文提供了COAML最新状态的全面概述。我们介绍了一个统一的COAML管道框架,描述了其方法论构建块,并形式化了其与经验成本最小化的关系。然后,我们根据不确定性形式和决策结构发展了一个分类体系。使用这个分类体系,我们回顾了静态和动态问题的算法方法,概述了跨调度、车辆路线、随机规划和强化学习等领域的应用,并从经验成本最小化、模仿学习和强化学习的角度综合了方法论贡献。最后,我们确定了关键的研究前沿。本文综述旨在既作为该领域的教程性介绍,又作为未来研究的路线图,该研究领域介于组合优化和机器学习之间。
Summary / 总结
The paper explores the COAML paradigm, which integrates combinatorial optimization with machine learning to create feasible and data-driven policies. It introduces a unified framework for COAML pipelines, discusses their components, and connects them to empirical cost minimization. The study reviews algorithmic approaches for both static and dynamic problems, covers applications in various domains, and outlines key research directions. This work serves as both an introduction to the field and a roadmap for future research.
论文探讨了组合优化增强机器学习(COAML),该方法将预测模型与组合决策相结合。它引入了COAML管道的统一框架,回顾了静态和动态问题的算法方法,并在调度、车辆路线、随机规划和强化学习等多个领域进行了应用综述。主要发现包括将机器学习、运筹学和随机优化相结合,并指出了研究前沿。
From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA
Authors: Kimia Abedini, Farzad Shami, Gianmaria Silvello
First: 2026-01-15T16:54:11+00:00 · Latest: 2026-01-15T16:54:11+00:00
Comments: Accepted paper by the 48th European Conference on Information Retrieval (ECIR'26)
Abstract
Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.
中文标题/摘要
标题:从单体到多智能体推理:推进GeneGPT在基因组问答中的应用
理解基因组信息对于生物医学研究至关重要,但从复杂分布的数据库中提取数据仍然具有挑战性。大型语言模型(LLMs)为基因组问答(QA)提供了潜在解决方案,但受限于对领域特定数据库的访问限制。GeneGPT是当前最先进的系统,通过使用专门的API调用增强LLMs,但它受到固定API依赖性和有限适应性的限制。我们复制了GeneGPT并提出了GenomAgent,这是一种多智能体框架,能够高效地协调专门的智能体以处理复杂的基因组查询。在GeneTuring基准测试的九项任务上,GenomAgent的平均性能比GeneGPT高出12%,其灵活的架构还适用于需要专家知识提取的各种科学领域。
Summary / 总结
The research aims to improve genomic question answering by addressing the limitations of existing systems like GeneGPT, which are constrained by API dependencies and limited adaptability. The study introduces GenomAgent, a multi-agent framework that coordinates specialized agents to handle complex genomics queries more effectively. GenomAgent outperforms GeneGPT by 12% on average across nine tasks from the GeneTuring benchmark and offers a flexible architecture applicable to various scientific domains requiring expert knowledge extraction.
研究旨在通过解决现有系统如GeneGPT依赖于固定API接口的局限性,提高基因组问答的质量。研究引入了GenomAgent,这是一种多智能体框架,通过协调专门的智能体来处理复杂的基因组查询。GenomAgent在来自GeneTuring基准的九项任务中平均比GeneGPT高出12%,并且其灵活的架构适用于需要专业知识提取的各种科学领域。
How Quantization Shapes Bias in Large Language Models
Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
First: 2025-08-25T14:48:26+00:00 · Latest: 2026-01-15T16:30:08+00:00
Abstract
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
中文标题/摘要
标题:量化如何塑造大型语言模型中的偏差
本研究全面评估了量化对模型偏差的影响,特别关注其对个体人口子群体的影响。我们专注于权重和激活量化策略,并在包括刻板印象、公平性、毒性及情感在内的多种偏差类型中考察其影响。我们使用概率和生成文本的指标,在13个基准上评估了不同架构家族和推理能力的模型。我们的研究结果表明,量化对偏差的影响是复杂的:虽然它可以减少模型的毒性,且对情感影响不大,但它倾向于在生成任务中略微增加刻板印象和不公平性,尤其是在激进压缩下。这些趋势在不同的人口类别和子群体以及不同模型类型中通常是一致的,尽管其程度取决于具体环境。总体而言,我们的结果强调了在实践中应用量化时在效率和伦理考虑之间仔细平衡的重要性。
Summary / 总结
This study evaluates how quantization influences model bias, particularly focusing on its effects on demographic subgroups. The research examines weight and activation quantization across various bias types, using probability- and generated text-based metrics across 13 benchmarks. Key findings indicate that quantization can reduce model toxicity and does not significantly affect sentiment, but it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, though the magnitude varies depending on the specific setting.
这项研究评估了量化如何影响模型偏见,特别是在个体人口统计子群体中的影响,通过考察权重和激活量化策略在各种偏见类型上的效果。研究使用了13个基准的概率和生成文本指标,发现虽然量化可以减少模型毒性且对情感影响不大,但它会稍微增加刻板印象和不公平性,尤其是在激进压缩下。这些趋势在不同的人口统计子群体和模型类型中是一致的,但具体影响的程度取决于特定的环境设置。
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Authors: Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque
First: 2026-01-15T16:28:14+00:00 · Latest: 2026-01-15T16:28:14+00:00
Comments: 16 pages, 4 figures
Abstract
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ $\approx$ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
中文标题/摘要
标题:基于表示感知的激活签名去学习:从抑制到知识签名消除
从大型语言模型中选择性地删除知识对于遵守GDPR和模型安全性至关重要,但当前的去学习方法将行为抑制与真正的知识删除混淆,允许潜在能力在表面拒绝之下持续存在。在本文中,我们通过引入知识免疫框架(KIF),一种基于表示感知的架构,通过针对内部激活签名而不是表面输出来区分真正的删除与混淆,来解决这一挑战。我们的方法结合了针对特定主题的表示的动态抑制与参数高效的适应,能够在无需完全重新训练模型的情况下实现持久的去学习。KIF 实现了接近完美的删除(FQ 约为 0.99,而理想值为 1.00),同时保持了与理想水平相当的实用性(MU = 0.62),从而打破了所有先前工作都受到限制的稳定性和删除之间的权衡。我们评估了从标准基础模型(Llama 和 Mistral)到推理优先模型(Qwen 和 DeepSeek)的多种模型,参数范围从 3B 到 14B。我们的观察表明,标准模型表现出与规模无关的真实删除(<3% 的实用性漂移),而推理优先模型揭示了基本架构的差异。我们综合了表面泄漏与潜在痕迹持续性的双重评估协议,操作化了混淆与删除之间的区分,并首次系统地诊断了不同模型家族和规模下的机制级遗忘行为。
Summary / 总结
This paper addresses the challenge of selective knowledge erasure in LLMs by introducing the Knowledge Immunization Framework (KIF), which distinguishes genuine erasure from obfuscation through targeting internal activation signatures. KIF combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, achieving near-oracle erasure while preserving utility. The study evaluates KIF on various models, demonstrating scale-independent true erasure in standard models and revealing architectural divergence in reasoning-prior models. The evaluation protocol combines surface-level leakage with latent trace persistence to diagnose mechanism-level forgetting behavior.
本文通过引入知识免疫框架(KIF),针对大型语言模型(LLMs)中的选择性知识擦除问题,KIF 目标于内部激活签名以实现真正的知识去除,而无需进行全面模型重新训练。KIF 结合了对特定主题表示的动态抑制与参数高效的适应,实现了接近完美的擦除效果,同时保持了实用性。研究在多种模型上评估了 KIF,展示了标准模型的规模无关的真实擦除,并揭示了推理先验模型的架构差异。评估协议结合了表面级泄漏和潜在痕迹持久性,诊断了不同模型家族和规模下的遗忘行为。
Kolmogorov Arnold Networks and Multi-Layer Perceptrons: A Paradigm Shift in Neural Modelling
Authors: Aradhya Gaonkar, Nihal Jain, Vignesh Chougule, Nikhil Deshpande, Sneha Varur, Channabasappa Muttal
First: 2026-01-15T16:26:49+00:00 · Latest: 2026-01-15T16:26:49+00:00
Comments: 13 pages, 8 figures, 2 tables
Abstract
The research undertakes a comprehensive comparative analysis of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP), highlighting their effectiveness in solving essential computational challenges like nonlinear function approximation, time-series prediction, and multivariate classification. Rooted in Kolmogorov's representation theorem, KANs utilize adaptive spline-based activation functions and grid-based structures, providing a transformative approach compared to traditional neural network frameworks. Utilizing a variety of datasets spanning mathematical function estimation (quadratic and cubic) to practical uses like predicting daily temperatures and categorizing wines, the proposed research thoroughly assesses model performance via accuracy measures like Mean Squared Error (MSE) and computational expense assessed through Floating Point Operations (FLOPs). The results indicate that KANs reliably exceed MLPs in every benchmark, attaining higher predictive accuracy with significantly reduced computational costs. Such an outcome highlights their ability to maintain a balance between computational efficiency and accuracy, rendering them especially beneficial in resource-limited and real-time operational environments. By elucidating the architectural and functional distinctions between KANs and MLPs, the paper provides a systematic framework for selecting the most suitable neural architectures for specific tasks. Furthermore, the proposed study highlights the transformative capabilities of KANs in progressing intelligent systems, influencing their use in situations that require both interpretability and computational efficiency.
中文标题/摘要
标题:科莫戈罗夫-阿诺德网络与多层感知机:神经建模范式的转变
研究对科莫戈罗夫-阿诺德网络(KAN)和多层感知机(MLP)进行了全面的比较分析,突显了它们在解决非线性函数逼近、时间序列预测和多元分类等关键计算挑战方面的有效性。基于科莫戈罗夫表示定理,KANs 使用自适应样条激活函数和网格结构,提供了与传统神经网络框架相比的变革性方法。通过涵盖数学函数估计(二次和三次)到实际应用如预测每日温度和葡萄酒分类等多种数据集,研究通过均方误差(MSE)等准确度指标和通过浮点运算(FLOPs)评估计算成本,全面评估了模型性能。结果表明,KANs 在所有基准测试中都超过了MLPs,以显著降低的计算成本实现了更高的预测准确性。这种结果突显了它们在计算效率和准确性之间保持平衡的能力,特别是在资源有限和实时操作环境中尤其有益。通过阐明KANs和MLPs在架构和功能上的区别,论文提供了一个系统框架,用于为特定任务选择最合适的神经网络架构。此外,提出的研究所揭示的KANs在推进智能系统方面的变革能力,影响了其在需要可解释性和计算效率的情况下的应用。
Summary / 总结
This research compares Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP) in terms of their effectiveness in nonlinear function approximation, time-series prediction, and multivariate classification. KANs, based on Kolmogorov's representation theorem, use adaptive spline-based activation functions and grid-based structures, offering a more efficient approach than traditional MLPs. Across various datasets, KANs demonstrated superior predictive accuracy with reduced computational costs, making them particularly advantageous in resource-limited environments. The study provides a framework for selecting the most appropriate neural architecture for specific tasks.
该研究对比了Kolmogorov-Arnold网络(KAN)和多层感知机(MLP)在非线性函数逼近、时间序列预测和多变量分类中的应用效果。KAN基于Kolmogorov表示定理,采用自适应样条激活函数和网格结构,展示了在各种数据集上的优越性能,在准确性和计算效率方面均优于MLP。研究发现,KAN在各个基准测试中实现了更高的预测精度,并且计算资源消耗显著减少,使其成为资源受限环境的理想选择。
Process-Guided Concept Bottleneck Model
Authors: Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan
First: 2026-01-15T16:25:55+00:00 · Latest: 2026-01-15T16:25:55+00:00
Comments: 13 pages with 7 figures and 1 table, Supplementary Materials 10 pages with 3 figures
Abstract
Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.
中文标题/摘要
标题:过程导向的概念瓶颈模型
概念瓶颈模型(CBMs)通过引入中间语义概念来提高黑盒深度学习(DL)的可解释性。然而,标准CBMs往往忽视了特定领域的关系和因果机制,并且依赖完整的概念标签限制了其在监督稀疏但过程定义良好的科学领域中的应用。为了解决这个问题,我们提出了过程导向的概念瓶颈模型(PG-CBM),这是一种扩展的CBM,通过生物物理意义明确的中间概念约束学习,遵循领域定义的因果机制。以地球观测数据估计地上生物量密度为例,我们展示了PG-CBM相比多个基准模型减少了误差和偏差,同时利用多源异构训练数据并产生可解释的中间输出。除了提高准确性,PG-CBM还增强了透明度,能够检测虚假学习,并提供科学见解,代表了在科学应用中更可信赖的人工智能系统的一个步骤。
Summary / 总结
The research aims to improve the explainability and applicability of Concept Bottleneck Models (CBMs) in scientific domains by addressing their limitations in handling domain-specific causal mechanisms and sparse supervision. The Process-Guided Concept Bottleneck Model (PG-CBM) is proposed, which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. The study demonstrates that PG-CBM reduces error and bias in estimating above ground biomass density from Earth Observation data, while also providing interpretable intermediate outputs and enhancing transparency and scientific insights.
研究旨在通过解决概念瓶颈模型(CBMs)在处理特定领域因果机制和稀疏监督方面的局限性,提高其解释性和适用性。提出了过程引导的概念瓶颈模型(PG-CBM),该模型通过生物物理意义明确的中间概念来约束学习,遵循领域定义的因果机制。研究显示,PG-CBM 在从地球观测数据估计地上生物量密度时减少了误差和偏差,同时利用了多源异质训练数据并生成了可解释的中间输出,从而增强了透明度和科学洞察。
Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems
Authors: Xi Shi, Mengxin Zheng, Qian Lou
First: 2026-01-15T16:23:53+00:00 · Latest: 2026-01-15T16:23:53+00:00
Comments: Preprint
Abstract
Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS
中文标题/摘要
标题:学习面向延迟感知的并行多智能体系统编排
多智能体系统(MAS)通过协调多个智能体实现复杂推理,但由于多步执行和重复模型调用,往往导致较高的推理延迟,严重限制了其在时间敏感场景中的可扩展性和实用性。大多数现有方法主要优化任务性能和推理成本,并显式或隐式假设顺序执行,使得它们在并行执行下控制延迟方面不太理想。在本文中,我们研究了在并行执行下具有显式延迟监督的多智能体系统学习式编排。我们提出了面向延迟的多智能体系统(LAMaS),这是一种面向延迟的多智能体编排框架,能够实现并行执行,并显式优化关键执行路径,使控制器能够在并行执行下构建具有较低延迟的执行拓扑图。我们的实验表明,与多个基准上的最新基线方法相比,我们的方法在多智能体架构搜索中将关键路径长度减少了38-46%,同时保持或甚至提高了任务性能。这些结果突显了在设计高效多智能体系统时显式优化并行执行下延迟的重要性。代码可在https://github.com/xishi404/LAMaS获取
Summary / 总结
This work addresses the high latency issue in multi-agent systems (MAS) by proposing LAMaS, a latency-aware orchestration framework that optimizes the critical execution path under parallel execution. The approach reduces the critical path length by 38-46% compared to existing methods while maintaining or improving task performance across various benchmarks. This demonstrates the necessity of explicitly optimizing latency for efficient MAS design in time-sensitive scenarios.
该研究通过提出LAMaS框架解决多智能体系统(MAS)中的高延迟问题,LAMaS针对并行执行优化关键执行路径,并构建低延迟的执行拓扑图,相比现有方法减少关键路径长度38-46%,同时保持任务性能。这强调了在设计高效MAS时显式优化延迟的重要性。
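The LAMaS entry above turns on the difference between total work and the critical path. The sketch below illustrates that distinction on a hypothetical agent topology; the agent names, latencies, and the helper `critical_path_latency` are assumptions for illustration and are not taken from the LAMaS code base.

```python
from functools import lru_cache
from typing import Dict, List


def critical_path_latency(latency: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Longest latency-weighted path through a DAG of agent calls."""

    @lru_cache(maxsize=None)
    def finish_time(node: str) -> float:
        # A node starts once all of its dependencies have finished.
        start = max((finish_time(d) for d in deps.get(node, [])), default=0.0)
        return start + latency[node]

    return max(finish_time(n) for n in latency)


# Hypothetical topology: two solvers run in parallel, then a judge merges them.
latency = {"plan": 1.0, "solver_a": 3.0, "solver_b": 2.0, "judge": 1.5}
deps = {"solver_a": ["plan"], "solver_b": ["plan"], "judge": ["solver_a", "solver_b"]}

sequential = sum(latency.values())               # 7.5 if run one after another
parallel = critical_path_latency(latency, deps)  # 5.5 along plan -> solver_a -> judge
```

Under parallel execution, speeding up `solver_b` would not change end-to-end latency at all, which is why an orchestrator that only minimises total cost can miss the latency-optimal topology.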
DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery
Authors: Constantin Selzer, Fabian B. Flohr
Venue: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 2024, pp. 221-227
First: 2026-01-15T16:18:42+00:00 · Latest: 2026-01-15T16:18:42+00:00
Abstract
The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements of up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
中文标题/摘要
标题:DeepUrban:基于无人机影像的交互感知轨迹预测与规划
自主驾驶系统的有效性在很大程度上取决于其预测和规划能力的稳健性。然而,当前的基准测试受到一个明显问题的阻碍,即缺乏密集交通场景,这对于理解并建模道路使用者之间的复杂交互至关重要。为了解决这一问题,我们与工业合作伙伴DeepScenario合作,开发了DeepUrban——一个新的无人机数据集,旨在增强密集城市环境下的轨迹预测和规划基准。DeepUrban提供了从城市交叉口高空拍摄的高分辨率图像中提取的丰富3D交通对象。数据集进一步丰富了全面的地图和场景信息,以支持高级建模和仿真任务。我们评估了最先进的(SOTA)预测和规划方法,并进行了泛化能力实验。我们的研究结果表明,将DeepUrban添加到nuScenes中可以提高车辆预测和规划的准确性,ADE / FDE指标上的改进幅度最高可达44.1% / 44.3%。网站:https://iv.ee.hm.edu/deepurban
Summary / 总结
The research aims to improve autonomous driving systems by addressing the scarcity of dense traffic scenarios in current benchmarks. The study introduces DeepUrban, a new drone dataset that captures 3D traffic objects from urban intersections, enhancing trajectory prediction and planning. Experiments show that integrating DeepUrban into nuScenes improves prediction accuracy by up to 44.1% and 44.3% on ADE and FDE metrics respectively, demonstrating its effectiveness in supporting advanced modeling and simulation tasks.
研究旨在通过解决当前基准中密集交通场景稀缺的问题来提升自动驾驶系统的性能。研究引入了DeepUrban,这是一个新的无人机数据集,从城市交叉口捕获3D交通对象。对最先进的方法在DeepUrban数据集上的评估显示了显著的改进,当与nuScenes数据集结合时,ADE和FDE指标分别提高了44.1%和44.3%,证明了该数据集在提升轨迹预测和规划能力方面的有效性。
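The ADE / FDE figures quoted for DeepUrban above are displacement errors between predicted and ground-truth trajectories. A minimal sketch of how such metrics are commonly computed follows. It assumes 2D positions sampled at matching time steps and reads "improvement" as a relative reduction in error; the entry itself states neither convention, and the function names are illustrative only.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(pred: Sequence[Point], truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory.

    ADE is the mean Euclidean error over all time steps,
    FDE is the Euclidean error at the final time step.
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [math.hypot(px - tx, py - ty)
              for (px, py), (tx, ty) in zip(pred, truth)]
    return sum(errors) / len(errors), errors[-1]


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage reduction of an error metric (assumed reading of 'improvement')."""
    return 100.0 * (baseline - improved) / baseline


# Toy trajectories, not data from the paper.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted, observed)
```

Lower values are better for both metrics, so a positive `relative_improvement` means the second model has the smaller error.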
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
First: 2026-01-15T16:18:00+00:00 · Latest: 2026-01-15T16:18:00+00:00
Comments: 22 pages, 10 figures
Abstract
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, which allows test-time compute to be scaled for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, as validated by a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
中文标题/摘要
标题:视频生成模型与潜在世界模型的推理时物理对齐
最先进的视频生成模型能够生成令人印象深刻的视觉内容,但往往违反基本的物理原理,限制了它们的应用。虽然有人认为这种缺陷源于预训练时对物理理解不足,但我们发现,物理合理性不足也源于不理想的推理策略。因此,我们引入了WMReward,并将提高视频生成的物理合理性视为一种推理时的对齐问题。具体而言,我们利用潜在世界模型(这里为VJEPA-2)的强物理先验作为奖励,搜索和引导多个候选去噪轨迹,从而实现测试时计算量的扩展,以提高生成性能。实验上,我们的方法在图像条件、多帧条件和文本条件生成设置中显著提高了物理合理性,得到了人类偏好研究的验证。值得注意的是,在ICCV 2025感知测试PhysicsIQ挑战中,我们获得了62.64%的最终得分,获得第一名,并且优于之前最先进的方法7.42%。我们的工作证明了使用潜在世界模型提高视频生成的物理合理性的可行性,超越了这一特定实例或参数化。
Summary / 总结
This paper addresses the issue of video generative models violating basic physics principles by introducing WMReward, which treats physics plausibility as an inference-time alignment problem. By leveraging the strong physics prior of a latent world model, the method searches and steers multiple candidate denoising trajectories, enhancing generation performance. The approach significantly improves physics plausibility across various generation settings and achieves a final score of 62.64% in the ICCV 2025 Perception Test PhysicsIQ Challenge, winning first place and outperforming the previous state of the art by 7.42%.
研究通过引入WMReward,将物理合理性视为推理时的对齐问题,利用潜在世界模型的强大物理先验来引导多个去噪轨迹,从而提高生成性能。该方法在各种生成设置中显著提高了物理合理性,并在ICCV 2025 Perception Test PhysicsIQ挑战赛中获得第一名,得分为62.64%,超越了之前的最先进水平7.42%。
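The idea of using a reward to search over several candidate generations, as described for WMReward above, can be illustrated with best-of-N selection, the simplest form of reward-guided search. This is a sketch under stated assumptions and not the paper's method: `toy_generate` and `toy_reward` are hypothetical stand-ins for a video sampler and a world-model score, and steering during denoising is not shown.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[float]


def best_of_n(generate: Callable[[random.Random], Trajectory],
              reward: Callable[[Trajectory], float],
              n: int,
              seed: int = 0) -> Tuple[Trajectory, float]:
    """Sample n candidates and keep the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]


def toy_generate(rng: random.Random) -> Trajectory:
    """Hypothetical sampler: heights of a falling object plus noise."""
    return [10.0 - 0.5 * t * t + rng.gauss(0.0, 0.3) for t in range(6)]


def toy_reward(traj: Trajectory) -> float:
    """Hypothetical plausibility score.

    Under constant acceleration the second differences of the heights are
    constant, so their spread is penalized. The score is at most 0.
    """
    second = [traj[i + 2] - 2.0 * traj[i + 1] + traj[i]
              for i in range(len(traj) - 2)]
    mean = sum(second) / len(second)
    return -sum((s - mean) ** 2 for s in second)


best_traj, best_score = best_of_n(toy_generate, toy_reward, n=8)
```

With a fixed seed, the first 4 candidates of a 16-sample run are the same as in a 4-sample run, so the best score can only stay equal or improve as `n` grows. That is the sense in which extra test-time compute buys better samples in this toy setting.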
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知方面的潜力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面却常常力不从心,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体来说,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一个双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在全面的新建城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve the automated perception of urban roadside infrastructure for smart city management. It proposes a domain-adapted framework that combines open-vocabulary fine-tuning of Grounding DINO for asset localization, LoRA-based adaptation of Qwen-VL for attribute reasoning, and a dual-modality RAG module that retrieves industry standards and visual exemplars at inference time. On a new dataset of urban roadside scenes, the framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%.
研究旨在通过改进通用模型的局限性,提升对城市路边基础设施的自动化感知,以支持智能城市管理。提出的领域适配框架通过开放词汇量微调和知识导向的推理机制对大型视觉语言模型进行细调,并引入双模态检索增强生成模块,以确保符合工程标准。实验结果表明,检测性能达到58.9 mAP,属性识别准确率为95.5%,表明这是一种稳健的智能基础设施监控解决方案。
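The retrieval step of the RAG module described above can be pictured with a small text-only sketch. Everything here is an assumption made for illustration: the bag-of-words cosine similarity, the placeholder rule strings, and the prompt layout are hypothetical and are not taken from the paper, which also retrieves visual exemplars.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, clauses: List[str], k: int = 1) -> List[str]:
    """Return the k clauses most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(clauses,
                    key=lambda c: cosine(q, Counter(c.lower().split())),
                    reverse=True)
    return ranked[:k]


def build_prompt(detected_label: str, clauses: List[str], k: int = 1) -> str:
    """Prepend the retrieved rules to the question about a detected asset."""
    context = "\n".join(retrieve(detected_label, clauses, k))
    return f"Reference rules:\n{context}\nDescribe the state of: {detected_label}"


# Placeholder strings standing in for real standards text.
RULES = [
    "placeholder rule about manhole cover condition",
    "placeholder rule about street lamp tilt",
    "placeholder rule about guardrail damage",
]
prompt = build_prompt("street lamp", RULES)
```

Grounding the prompt in retrieved rules is what lets the language model answer against a written standard instead of its own prior, which is the stated purpose of the module.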
Spatial As Deep: Spatial CNN for Traffic Scene Understanding
Authors: Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang
Venue: AAAI 2018
First: 2017-12-17T09:37:52+00:00 · Latest: 2026-01-15T16:01:43+00:00
Comments: Accepted to AAAI 2018
Abstract
Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown a strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous shape structures or large objects with strong spatial relationships but few appearance cues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and the Cityscapes dataset. The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
中文标题/摘要
标题:空间作为深度:用于交通场景理解的空间CNN
卷积神经网络(CNN)通常通过逐层堆叠卷积操作来构建。尽管CNN在从原始像素中提取语义方面表现出强大的能力,但其捕捉图像行和列中像素的空间关系的能力尚未得到充分探索。这些关系对于学习具有强烈形状先验但外观一致性较弱的语义对象(如交通车道)非常重要,如图1(a)所示。本文中,我们提出了一种空间CNN(SCNN),它将传统的逐层卷积推广到特征图内的切片间卷积,从而在层内使行和列中的像素之间能够进行消息传递。这种SCNN特别适用于具有强烈空间关系但外观线索较少的长连续形状结构或大型对象,如交通车道、杆和墙。我们在一个新发布的非常具有挑战性的交通车道检测数据集和Cityscapse数据集上应用了SCNN。结果表明,SCNN能够学习结构输出的空间关系,并显著提高了性能。我们展示了SCNN在车道检测数据集上的性能优于基于递归神经网络(RNN)的ReNet和MRF+CNN(MRFNet),分别提高了8.7%和4.6%。此外,我们的SCNN在TuSimple基准车道检测挑战中获得了第一名,准确率为96.53%。
Summary / 总结
This paper introduces Spatial CNN (SCNN), which extends traditional CNNs by enabling message passing between pixels across rows and columns within feature maps, enhancing the ability to capture spatial relationships. Applied to traffic lane detection and Cityscapes datasets, SCNN significantly improves performance, outperforming RNN-based ReNet and MRFNet by 8.7% and 4.6% respectively, achieving 96.53% accuracy on the TuSimple Benchmark Lane Detection Challenge.
本文提出了Spatial CNN (SCNN),通过在特征图内实现跨行和列的像素间消息传递,扩展了传统的CNN。这种方法特别适用于学习交通场景中的空间关系,如交通车道。SCNN在挑战性的交通车道检测数据集和Cityscapes数据集上进行了测试,显示出比ReNet和MRFNet等现有方法显著的改进,分别提高了8.7%和4.6%的性能。此外,SCNN在TuSimple基准车道检测挑战中获得了96.53%的最高准确率。
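The slice-by-slice message passing described for SCNN above can be sketched for one direction (top to bottom) on a single-channel feature map. The update used here, each row adding a ReLU-activated 1D convolution of the already-updated row above it, is an assumed simplification for illustration. The abstract does not give the exact rule, and the real model works on multi-channel tensors in several directions.

```python
from typing import List

Row = List[float]


def conv1d_same(row: Row, kernel: Row) -> Row:
    """1D convolution with zero padding; the kernel length is assumed odd."""
    half = len(kernel) // 2
    out = []
    for j in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = j + k - half
            if 0 <= idx < len(row):
                acc += w * row[idx]
        out.append(acc)
    return out


def scnn_downward(feature_map: List[Row], kernel: Row) -> List[Row]:
    """Pass messages from the top row to the bottom row, one slice at a time.

    Row i receives a message computed from the updated row i - 1, so
    information from the first row can reach every row in a single pass.
    """
    out = [list(r) for r in feature_map]
    for i in range(1, len(out)):
        message = conv1d_same(out[i - 1], kernel)
        out[i] = [x + max(m, 0.0) for x, m in zip(out[i], message)]
    return out


# A single active pixel in the top row propagates straight down
# when the kernel is the identity.
fmap = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
result = scnn_downward(fmap, kernel=[0.0, 1.0, 0.0])
```

The sequential dependency between rows is the point of the design: a stack of ordinary convolutions would need many layers to carry evidence from the top of the image to the bottom, whereas one such pass does it within a single layer.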
Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning
Authors: Oscar H. Ramírez-Agudelo, Akshay N. Shewatkar, Edoardo Milana, Roland C. Aydin, Kai Franke
Venue: SPIE Vol. 12675 126750A-12, 2023
First: 2026-01-15T15:59:12+00:00 · Latest: 2026-01-15T15:59:12+00:00
Comments: 17 pages, 10 figures, 6 tables, SPIE Applications of Machine Learning 2023, San Diego, US
Abstract
Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructure and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted by light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset containing over 14,000 images was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for each of the haze and smoke datasets. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models still achieve interesting results. There are two reasons for the gap. First, images captured in the presence of smoke are more difficult to enhance given its inhomogeneity and high density. Second, FFA-Net and AECR-Net were designed to dehaze, not to desmoke, images. This work shows that deep learning architectures can greatly improve the quality of analog gauge images captured in smoke and haze scenes. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
中文标题/摘要
标题:通过深度学习提高烟雾和雾霾场景中测度图像的质量
在雾霾和烟雾环境中拍摄的图像由于能见度降低,给基础设施监测带来了挑战,并在紧急情况下妨碍了应急服务。本研究探讨了使用深度学习模型提高烟雾环境中测度图像的自动可读性,准确的测度数据解释对于一线救援人员来说是一个有价值的工具。研究使用了两种深度学习架构,FFA-Net和AECR-Net,以提高被烟雾和雾霾污染的测度图像的可见度。由于缺乏模拟测度图像的基准数据集,使用Unreal Engine生成了一个新的合成数据集,包含超过14,000张图像。模型分别以80%的训练集、10%的验证集和10%的测试集进行了训练。对于合成雾霾数据集,SSIM和PSNR指标分别为0.98和43 dB,与最先进的结果相当。此外,AECR-Net比FFA-Net获得了更稳健的结果。虽然烟雾数据集的合成结果较差,但训练模型仍取得了有趣的结果。总体而言,在烟雾环境中成像更难提高,因为烟雾的不均匀性和高密度。其次,FFA-Net和AECR-Net被实现用于去雾,而不是去烟。本研究表明,使用深度学习架构可以极大地提高烟雾和雾霾场景中模拟测度图像的质量。最后,增强后的输出图像可以成功后处理以实现自动自主读取测度
Summary / 总结
This study addresses the challenge of reduced visibility in images captured in hazy and smoky environments, which can hinder infrastructure monitoring and emergency services. It uses two deep learning models, FFA-Net and AECR-Net, to enhance the readability of gauge images, training them on a new synthetic dataset of over 14,000 images generated with the Unreal Engine. On the synthetic haze dataset, SSIM and PSNR reach about 0.98 and 43 dB, and AECR-Net gives more robust results than FFA-Net. Results on the synthetic smoke dataset are poorer, but the enhanced images can still be post-processed for automatic gauge reading.
研究旨在通过深度学习模型提高烟雾和雾霾环境中模拟表盘图像的可读性。使用Unreal Engine生成了超过14,000张图像的合成数据集,训练了FFA-Net和AECR-Net两种架构。模型在SSIM和PSNR指标上表现良好,AECR-Net显示出更稳健的结果。尽管合成烟雾数据集的结果较差,但增强后的图像仍可进行自动表盘读取,展示了深度学习在改善恶劣条件下图像质量方面的潜力。
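The PSNR figure and the 80% / 10% / 10% split quoted for the gauge dataset above can be made concrete with a short sketch. It assumes 8-bit grayscale images stored as nested lists and a random shuffle before splitting; the entry does not say how the authors computed PSNR or partitioned the data, and SSIM is omitted because it needs windowed statistics.

```python
import math
import random
from typing import List, Tuple

Image = List[List[float]]


def psnr(reference: Image, enhanced: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = [p for row in reference for p in row]
    enh = [p for row in enhanced for p in row]
    if len(ref) != len(enh) or not ref:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, enh)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def split_80_10_10(n_items: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle item indices and cut them into train / validation / test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * 8 // 10
    n_val = n_items // 10
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


# Toy 2x2 images, not data from the paper: one pixel differs by 10 grey levels.
clean = [[0.0, 255.0], [255.0, 0.0]]
restored = [[0.0, 255.0], [255.0, 10.0]]
score_db = psnr(clean, restored)
train_idx, val_idx, test_idx = split_80_10_10(1000)
```

Higher PSNR means the enhanced image is closer to the clean reference, and the scale is logarithmic, so a gain of 10 dB corresponds to a tenfold reduction in mean squared error.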
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
Authors: Stefano Cerri, Asbjørn Munk, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen
First: 2025-06-17T11:48:05+00:00 · Latest: 2026-01-15T15:58:31+00:00
Abstract
We present FOMO300K, a large-scale, heterogeneous dataset of 318,877 brain Magnetic Resonance Imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO300K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
中文标题/摘要
标题:用于自我监督学习的大规模异构3D磁共振脑成像数据集
我们提出了FOMO300K,这是一个包含318,877个脑磁共振成像(MRI)扫描的数据集,来自82,678个MRI会话和59,969个受试者,汇总自920个公开可用的来源。该数据集包括临床级和研究级图像、多种MRI序列以及广泛的解剖和病理变异性,包括具有大脑异常的扫描。对原始图像特征进行了最小预处理,以降低新用户的入门门槛。提供了用于自我监督预训练和微调的配套代码,以及预训练模型。FOMO300K旨在支持大规模医学影像中自我监督学习方法的开发和基准测试。