arXiv 论文速递

2026-01-18 03:24
Snapshot: 20260118_0324
WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Authors: Xuweiyi Chen, Wentao Zhou, Zezhou Cheng
First: 2026-01-15T18:59:58+00:00 · Latest: 2026-01-15T18:59:58+00:00
Comments: Project Page: https://wild-rayzer.cs.virginia.edu/
Abstract
We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
中文标题/摘要
标题:WildRayZer:动态环境中的自监督大规模视图合成
我们提出了WildRayZer,一种在动态环境中进行新颖视图合成(NVS)的自监督框架,其中相机和物体都在移动。动态内容破坏了静态NVS模型依赖的多视图一致性,导致鬼影、虚假几何结构和不稳定的姿态估计。WildRayZer 通过执行分析-合成测试来解决这一问题:仅相机的静态渲染器解释刚性结构,其残差揭示了瞬态区域。从这些残差中,我们构建伪运动掩码,提炼出一个运动估计器,并使用它来屏蔽输入标记和门控损失梯度,使监督集中在跨视图背景完成上。为了实现大规模训练和评估,我们整理了包含15000个随意捕捉的动态序列的真实世界数据集Dynamic RealEstate10K(D-RE10K),以及D-RE10K-iPhone,这是一个稀疏视图瞬态感知NVS的瞬态和干净基准。实验表明,WildRayZer 在瞬态区域去除和全帧NVS质量方面,单次前向传递时均优于基于优化和前馈的基本方法。
Summary / 总结
WildRayZer is a self-supervised framework for novel view synthesis in dynamic environments, addressing issues like ghosting and unstable pose estimation. It uses a camera-only static renderer to explain rigid structure and residuals to construct pseudo motion masks, which help in focusing supervision on background completion. Experiments show that WildRayZer outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.
WildRayZer 是一种用于动态环境中的新颖视图合成的自监督框架,解决了鬼影和姿态估计不稳定等问题。它通过静态渲染器解释刚性结构,并利用残差构建伪运动掩码,帮助估计运动并聚焦于背景完成的监督。实验表明,WildRayZer 在去除动态区域和全帧 NVS 质量方面均优于基于优化和前馈的基线方法,且只需一次前馈传递。
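The residual-to-mask step described in the WildRayZer entry above can be illustrated with a toy example. This is only a sketch under stated assumptions: the function names, the fixed threshold of 0.2, and the tiny grayscale "images" are hypothetical, and the thresholded residual is used directly as the mask, whereas the abstract describes distilling a learned motion estimator from such pseudo masks.

```python
def pseudo_motion_mask(rendered, target, threshold=0.2):
    """Mark a pixel as transient (1) when the static-render residual exceeds the threshold."""
    return [
        [1 if abs(r - t) > threshold else 0 for r, t in zip(r_row, t_row)]
        for r_row, t_row in zip(rendered, target)
    ]


def masked_l1_loss(pred, target, mask):
    """Average L1 error over static pixels only; transient pixels are gated out."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == 0:  # static pixel: contributes to the loss
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


# Hypothetical 2x3 grayscale frames: the last column contains a moving object
# that a camera-only static renderer cannot explain.
target = [[0.5, 0.5, 0.9], [0.5, 0.5, 0.9]]
rendered = [[0.5, 0.4, 0.1], [0.5, 0.5, 0.2]]

mask = pseudo_motion_mask(rendered, target)
loss = masked_l1_loss(rendered, target, mask)
```

Gating the loss this way means the large residuals in the transient column no longer push the renderer to explain moving content, which is the intuition behind focusing supervision on the static background.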
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过在推理步骤与外部工具交互之间交替,赋予大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优势值。这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时程多轮场景中。为了解决这个问题,我们提出了MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明MatchTIR的优越性。值得注意的是,我们的4B模型超过了大多数8B竞争对手,特别是在长时程多轮任务中。我们的代码可在 https://github.com/quchangle1/MatchTIR 获取。
Summary / 总结
MatchTIR is a framework that improves Tool-Integrated Reasoning by assigning fine-grained turn-level rewards through bipartite matching between predicted and ground-truth traces, distinguishing effective tool calls from redundant or erroneous ones. It uses dual-level advantage estimation to balance local step precision with global task success. On three benchmarks, the 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks.
论文提出MatchTIR框架,通过二分匹配分配细粒度的回合级奖励,并结合双层优势估计来区分有效的工具调用和冗余调用。实验表明,MatchTIR在长时程和多轮任务中优于现有方法,4B模型尤其超过了大多数8B竞争对手。
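To make the matching idea in the MatchTIR entry above concrete, here is a small sketch of turn-level reward assignment via a hard one-to-one matching. Everything specific is an assumption for illustration: the `similarity` scoring rule, the dictionary layout of a tool call, and the brute-force search (a real system would likely use a proper assignment solver). It is not the authors' implementation, and the abstract's second assignment strategy is not modeled.

```python
from itertools import permutations


def similarity(pred, gold):
    """Hypothetical score: 0 if tool names differ, else 0.5 plus 0.5 times the fraction of gold args matched."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    if not gold["args"]:
        return 1.0
    hits = sum(1 for k, v in gold["args"].items() if pred["args"].get(k) == v)
    return 0.5 + 0.5 * hits / len(gold["args"])


def turn_rewards(preds, golds):
    """Give each predicted turn the score of its matched gold call (0 if unmatched).

    Brute-forces the one-to-one assignment with maximum total score; ties go to
    the first assignment found, which favors earlier turns. Small inputs only.
    """
    # Each predicted turn takes a distinct gold index or a "no match" slot (None).
    slots = list(range(len(golds))) + [None] * len(preds)
    best_total, best_rewards = -1.0, [0.0] * len(preds)
    for assignment in permutations(slots, len(preds)):
        rewards = [
            similarity(preds[i], golds[j]) if j is not None else 0.0
            for i, j in enumerate(assignment)
        ]
        if sum(rewards) > best_total:
            best_total, best_rewards = sum(rewards), rewards
    return best_rewards


golds = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"expr": "1+1"}},
]
preds = [
    {"tool": "search", "args": {"q": "a"}},     # correct call
    {"tool": "search", "args": {"q": "a"}},     # redundant repeat
    {"tool": "calc", "args": {"expr": "1+2"}},  # right tool, wrong argument
]
rewards = turn_rewards(preds, golds)
```

Because each ground-truth call can be matched at most once, the redundant second search receives no credit, which a single trajectory-level reward would not distinguish.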
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 显著提高了性能,确立了CLI 作为一种可扩展范式,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitations of static vision-language models (VLMs) by introducing Cross-Layer Injection (CLI), a dynamic framework that creates a many-to-many connection between vision and language. CLI includes an Adaptive Multi-Projection (AMP) module for feature harmonization and an Adaptive Gating Fusion (AGF) mechanism for selective visual information injection. Experiments show CLI improves performance on 18 diverse benchmarks, enhancing LLMs' ability to integrate visual details with language reasoning.
论文通过提出一种动态框架Cross-Layer Injection (CLI),解决了现有视觉-语言模型(VLMs)的局限性,CLI使视觉和语言模态之间能够实现多对多的连接。CLI 包含一个用于特征协调的Adaptive Multi-Projection (AMP) 模块和一个用于选择性注入视觉信息的Adaptive Gating Fusion (AGF) 机制。实验表明,CLI 在18个不同基准上的表现显著提升,增强了LLMs对视觉和语言信息的综合理解能力。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD) scenarios. We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17 of 64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
中文标题/摘要
标题:少看,开好车:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,基础模型提取的补丁对齐特征训练出的策略在分布外(OOD)场景下表现更好。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性普遍强烈。在这些重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了一种简单而有效的随机补丁选择(SPS)方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将其提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每随机子集的补丁像不同的,但仍然合理的世界投影。策略基于不变于哪些特定标记幸存的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均提高6.2%,在闭环模拟中最高提高20.4%,同时速度提高2.4倍。我们在遮蔽率和补丁特征重组、训练和评估9个系统中进行了消融研究,其中8个系统均超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
Summary / 总结
The research aims to improve the robustness and generalizability of end-to-end autonomous driving policies by reducing redundancy in patch features extracted from foundation models. The method, Stochastic-Patch-Selection (SPS), randomly masks a fraction of patch descriptors in each frame, leading to more robust and generalizable policies. Experiments show that SPS outperforms the state of the art, achieving up to 20.4% improvement in closed-loop simulations and being 2.4 times faster.
研究旨在通过减少patch特征中的冗余性来提高端到端自动驾驶策略的鲁棒性和通用性。方法Stochastic-Patch-Selection (SPS) 在每帧中随机遮掩部分patch描述符,从而产生更鲁棒和高效的策略。实验结果显示平均改进6.2%,最高达20.4%的闭环仿真性能,同时比现有最佳方法快2.4倍。
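The per-frame masking step described in the SPS entry above can be sketched as follows. The function name, the `mask_rate` parameter, and the choice to represent "preserving the spatial layout" by carrying each kept patch's original grid index are assumptions made for illustration; the abstract does not specify how the layout is encoded.

```python
import random


def stochastic_patch_selection(patches, mask_rate, rng):
    """Randomly drop a fraction of patch descriptors for one frame.

    Returns (grid_index, descriptor) pairs for the surviving patches, in their
    original order, so each kept patch still knows where it sits in the grid.
    """
    n = len(patches)
    n_keep = max(1, round(n * (1.0 - mask_rate)))  # always keep at least one patch
    keep = sorted(rng.sample(range(n), n_keep))
    return [(i, patches[i]) for i in keep]


# Hypothetical frame with an 8x8 grid of patch descriptors (strings stand in for vectors).
rng = random.Random(0)
patches = [f"p{i}" for i in range(64)]
kept = stochastic_patch_selection(patches, mask_rate=0.5, rng=rng)
```

Calling the function again with the same frame yields a different subset, so over training the policy sees many partial views of the same scene and cannot rely on any single token surviving.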
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
中文标题/摘要
标题:将代理记忆扎根于上下文意图
在长期目标导向的交互中部署大型语言模型仍然具有挑战性,因为相似的实体和事实会在不同的潜在目标和约束下反复出现,导致记忆系统检索到上下文不匹配的证据。我们提出了STITCH(基于上下文历史的结构化意图跟踪),这是一种代理记忆系统,通过结构化的检索提示、上下文意图对每个轨迹步骤进行索引,并通过匹配当前步骤的意图来检索历史。上下文意图提供了紧凑的信号,消除了重复提及的歧义并减少了干扰:(1) 当前定义主题段落的潜在目标,(2) 动作类型,以及(3) 重要的实体类型,锚定哪些属性是相关的。在推理过程中,STITCH通过意图兼容性筛选和优先级排序记忆片段,抑制语义相似但上下文不匹配的历史记录。 为了评估,我们引入了CAME-Bench,这是一个基准测试,用于在现实、动态的目标导向轨迹中进行上下文感知检索。在CAME-Bench和LongMemEval上,STITCH达到了最先进的性能,比最强基线高出35.6%,随着轨迹长度的增加,性能提升最大。我们的分析表明,意图索引显著减少了检索噪声,支持意图感知的记忆,以实现稳健的长期推理。
Summary / 总结
The paper addresses the challenge of deploying large language models in long-horizon, goal-oriented interactions by proposing STITCH, an agentic memory system that uses contextual intent to disambiguate repeated mentions and reduce interference. STITCH indexes each trajectory step with a structured retrieval cue (latent goal, action type, and salient entity types) and filters and prioritizes memory snippets by intent compatibility. Across CAME-Bench and LongMemEval it outperforms the strongest baseline by 35.6%, with the largest gains as trajectory length increases.
研究解决了在长期目标导向交互中,相似实体和事实可能导致记忆系统检索到上下文不匹配信息的挑战。提出了STITCH,一种使用上下文意图进行索引和检索相关历史的代理记忆系统。STITCH在CAME-Bench和LongMemEval上的表现优于现有方法35.6%,特别是在长期推理方面显示出显著改进。
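A minimal sketch of intent-indexed retrieval, in the spirit of the STITCH entry above, is shown below. The `Intent` and `MemoryItem` classes, the scoring weights, and the hard filter on the goal field are all hypothetical choices for illustration; the abstract only states that memory is filtered and prioritized by intent compatibility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types


@dataclass
class MemoryItem:
    text: str
    intent: Intent


def compatibility(query, stored):
    """Hypothetical score: 0 for a different goal, else 1 + action match + shared entity types."""
    if query.goal != stored.goal:
        return 0.0
    score = 1.0
    if query.action == stored.action:
        score += 1.0
    return score + len(query.entity_types & stored.entity_types)


def retrieve(memory, query, top_k=2):
    """Drop intent-incompatible items, then return the top_k texts by score (ties keep insertion order)."""
    scored = [(compatibility(query, m.intent), i, m) for i, m in enumerate(memory)]
    kept = [s for s in scored if s[0] > 0]
    kept.sort(key=lambda s: (-s[0], s[1]))
    return [m.text for _, _, m in kept[:top_k]]


memory = [
    MemoryItem("Booked a Paris hotel for the conference trip",
               Intent("conference trip", "book", frozenset({"hotel"}))),
    MemoryItem("Booked a Paris hotel for the family vacation",
               Intent("family vacation", "book", frozenset({"hotel"}))),
    MemoryItem("Compared flight prices for the conference trip",
               Intent("conference trip", "search", frozenset({"flight"}))),
]
query = Intent("conference trip", "book", frozenset({"hotel"}))
results = retrieve(memory, query)
```

The two hotel bookings are nearly identical as text, so a purely semantic retriever could return either; indexing by intent suppresses the one recorded under a different goal.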
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较,来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的解释性基准测试,具有参考目标)。LIBERTy基于明确定义的结构化因果模型(SCMs)的文本生成,对概念的干预通过SCM传播,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们在五种模型上评估了各种方法的范围,并确定了改进概念驱动解释的大量空间。LIBERTy还使系统分析模型对干预的敏感性成为可能:我们发现,专有LLM在对人口统计概念的敏感性方面明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Data-driven stochastic reduced-order modeling of parametrized dynamical systems
Authors: Andrew F. Ilersich, Kevin Course, Prasanth B. Nair
First: 2026-01-15T18:50:18+00:00 · Latest: 2026-01-15T18:50:18+00:00
Abstract
Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.
中文标题/摘要
标题:基于数据驱动的参数化动力系统随机降阶建模
在不同条件下建模复杂的动力系统计算强度大,常常使高保真模拟变得不可行。虽然降阶模型(ROMs)提供了有希望的解决方案,但当前的方法往往难以处理随机动力学,并且无法量化预测不确定性,限制了其在稳健决策环境中的应用。为了解决这些挑战,我们提出了一种基于数据驱动的学习连续时间随机ROMs的方法,该方法可以在参数空间和激励条件下泛化。我们的方法基于摊销随机变分推断(amortized stochastic variational inference),利用马尔可夫高斯过程的重参数化技巧,在训练过程中消除昂贵的前向求解器的需要。这使我们能够联合学习概率自编码器和控制潜在动力学的随机微分方程,计算成本与数据集大小和系统刚度无关。此外,如果可用,我们的方法还提供了纳入物理信息先验的灵活性。我们通过三个具有挑战性的测试问题进行了数值研究,展示了在未见过的参数组合和激励条件下的出色泛化能力,并与现有方法相比实现了显著的效率提升。
Summary / 总结
This study addresses the computational challenges of modeling complex dynamical systems under varying conditions by introducing a data-driven framework for learning continuous-time stochastic reduced-order models (ROMs). The approach uses amortized stochastic variational inference with a reparametrization trick for Markov Gaussian processes, enabling efficient training and generalization across parameter spaces and forcing conditions. The method jointly learns a probabilistic autoencoder and stochastic differential equations, offering significant computational efficiency and the flexibility to incorporate physics-informed priors. The framework demonstrates excellent generalization and efficiency gains in numerical studies for three challenging test problems.
该研究通过引入一种数据驱动的方法来学习连续时间的随机降阶模型(ROMs),以应对在不同条件下建模复杂动力系统的计算挑战。该方法使用了马尔可夫高斯过程的重参数化技巧和摊销随机变分推断,避免了在训练过程中使用昂贵的前向求解器,从而能够高效地联合学习概率自编码器和随机微分方程。研究在三个具有挑战性的测试问题上展示了该方法在未见过的参数组合和激励条件下的出色泛化能力,并且相比现有方法实现了显著的效率提升。
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
Hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure of extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势和潜在的失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且会暂时或永远被困在那里。所有事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理的随机性扩展猜测的数量)和模型自举(通过利用训练的随机性扩展猜测的数量)。从实用角度来看,通过结合所有方法,我们开发了增强的HRM,在数独-极限上的准确率从54.5%提升到96.9%。从科学角度来看,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
The study investigates the reasoning patterns of hierarchical reasoning models (HRM) and finds that HRM can fail on simple puzzles due to a violation of the fixed point property, exhibits 'grokking' dynamics where the answer is suddenly correct, and can get trapped in incorrect fixed points. These findings suggest that HRM is more like 'guessing' than 'reasoning'. To improve HRM, the study proposes strategies such as data augmentation, input perturbation, and model bootstrapping, leading to a significant boost in Sudoku-Extreme accuracy from 54.5% to 96.9%.
研究分析了层次推理模型(HRM)的推理模式,发现HRM在简单谜题上会失败,这是由于违反了固定点性质;推理过程中存在‘顿悟’现象,即答案会突然变得正确;并且HRM可能会陷入错误的固定点。这些发现表明,HRM更像是在‘猜测’而不是‘推理’。为了改进HRM,研究提出了数据增强、输入扰动和模型自举等策略,使得数独极端难度的准确率从54.5%提升到了96.9%。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:实现长时程进展感知的一致进化
大型语言模型(LLMs)已成为进化搜索的强大算子,然而高效搜索支架的设计仍然是临时性的。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,即实验历史使未来的候选生成产生偏差;模式崩溃,即由于探索与利用平衡不佳,代理在局部最小值中停滞;以及弱协作,即僵化的交叉策略无法有效利用并行搜索轨迹。我们引入了进展感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以解决上下文污染;基于动量的回溯(MBB)以逃离局部最小值;以及一种自适应采样策略,统一了回溯和交叉以进行动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一条系统的路径,以实现一致的长时程自我改进,其在LLM-SR和KernelBench上取得了最先进的结果,同时发现了超越Modded NanoGPT记录的解决方案。
Summary / 总结
The research addresses the lack of a systematic approach to managing LLM-driven evolutionary search by introducing PACEvolve, a framework that tackles three failure modes: context pollution, mode collapse, and weak collaboration. PACEvolve combines hierarchical context management with pruning, momentum-based backtracking, and a self-adaptive sampling policy that unifies backtracking and crossover. Experimental results show that PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench and discovers solutions surpassing the record on Modded NanoGPT.
研究旨在解决由大型语言模型(LLMs)管理的进化搜索过程中的低效问题,重点关注三个失败模式:上下文污染、模式崩溃和弱协作。研究引入了PACEvolve框架,该框架使用分层上下文管理、动量回溯和自适应采样策略来缓解这些问题。实验结果表明,PACEvolve在LLM-SR和KernelBench上达到了最先进的性能,并发现了超越Modded NanoGPT记录的解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床叙述记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699份单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为由(事件,时间)对组成的结构化文本时间线。该语料库包含超过560万条带时间戳的事件,以及提取的人口统计学信息和诊断信息。技术验证使用了临床专家编纂的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和通过对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持再现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和代码在公共存储库中公开可用。
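The temporal concordance measure mentioned in the PMOA-TTS entry above can be illustrated with a generic pairwise concordance index over matched events. This is a standard textbook-style formulation written as a sketch; the function name, the tie handling (predicted ties count 0.5), and the toy timelines in hours are assumptions, and the corpus authors' exact computation may differ.

```python
def concordance_index(reference_times, predicted_times):
    """Fraction of comparable event pairs whose predicted order agrees with the reference order.

    Pairs with equal reference times are skipped; pairs with equal predicted
    times count as half-concordant.
    """
    concordant, comparable = 0.0, 0
    n = len(reference_times)
    for i in range(n):
        for j in range(i + 1, n):
            ref_diff = reference_times[i] - reference_times[j]
            if ref_diff == 0:
                continue  # order undefined in the reference, so not comparable
            comparable += 1
            pred_diff = predicted_times[i] - predicted_times[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (pred_diff > 0) == (ref_diff > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Hypothetical matched events for one case report, times in hours from admission.
reference = [0, 24, 48, 72]
predicted = [0, 30, 20, 72]  # the second and third events are swapped
c_index = concordance_index(reference, predicted)
```

Here five of the six comparable pairs keep their order, so the index is 5/6; a value of 1.0 would mean the extracted timeline orders every pair of events as the clinician-curated reference does.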
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:文化与多语言长视频推理基准
近期视频模型的发展取得了显著进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,导致评估中存在显著的偏差。为解决这一问题,我们引入了CURVE(视频评价中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图,并提出了一种使用这些图的新型迭代策略,以识别推理中的细粒度错误。我们的评估表明,最先进的视频大模型面临巨大挑战,其性能远低于人类水平,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 公开可用。
Summary / 总结
The research aims to address the bias in current video understanding benchmarks by introducing CURVE, a new benchmark for multicultural and multilingual video reasoning. CURVE includes high-quality, human-generated annotations from diverse cultural videos across 18 global locales, with questions and answers in native languages. The study finds that state-of-the-art video language models perform poorly on CURVE, with errors mainly due to difficulties in visual perception of cultural elements, indicating a need for deeper cultural understanding in video reasoning systems.
研究旨在通过引入CURVE,一个新的多文化多语言视频推理基准,来解决当前视频理解基准中的偏见问题。CURVE 包含来自18个全球区域的多样化文化视频的高质量、人工生成的注释,并以本地语言提出问题和答案。研究发现,最先进的视频LLMs的表现远低于人类水平,主要原因是视觉感知文化元素的困难。作者提出了一种使用推理痕迹的迭代策略来识别推理中的错误。该基准将公开发布在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 。
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3-4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
中文标题/摘要
标题:STEM:使用嵌入模块扩展变换器
细粒度稀疏性在不按比例增加每词计算量的情况下承诺了更高的参数容量,但通常会遭受训练不稳定性、负载均衡和通信开销的问题。我们引入了STEM(使用嵌入模块扩展变换器),这是一种静态、按词索引的方法,用局部嵌入查找替换FFN的上投影,同时保持门和下投影密集。这消除了运行时路由,允许CPU卸载和异步预取,并将容量与每词FLOPs和跨设备通信脱钩。实验证明,尽管稀疏性极端,STEM仍能稳定训练。它在减少每词FLOPs和参数访问次数的同时,提高了下游性能(消除大约三分之一FFN参数)。STEM学习具有大角度扩展的嵌入空间,增强了其知识存储容量。更有趣的是,这种增强的知识容量伴随着更好的可解释性。STEM嵌入的按词索引性质允许以简单的方式在不干预输入文本或增加额外计算的情况下进行知识编辑和知识注入。此外,STEM增强了长上下文性能:随着序列长度的增长,更多的不同参数被激活,从而实现实际的测试时容量扩展。在350M和1B模型规模下,STEM总体上提供了高达约3-4%的准确率改进,特别是在知识和推理密集型基准(ARC-Challenge、OpenBookQA、GSM8K、MMLU)上取得了显著进步。总体而言,STEM是一种有效的方法,可以扩展参数内存,同时提供更好的可解释性、更好的训练稳定性和更高的效率。
Summary / 总结
STEM introduces a static, token-indexed approach to scaling Transformers: it replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity and improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses. It also supports interpretable knowledge editing and injection and strengthens long-context performance. Across 350M and 1B model scales, it delivers accuracy improvements of up to roughly 3-4% overall, with notable gains on knowledge and reasoning-heavy benchmarks.
STEM 通过用层局部的嵌入查找替换 FFN 上投影,实现了 Transformer 的扩展,同时允许 CPU 卸载并减少通信开销。尽管极度稀疏,STEM 仍然能够稳定训练,并且在减少每 token FLOPs 和参数访问的同时提高下游性能。它还增强了知识存储容量和可解释性,并随着序列长度的增长改善了长上下文性能。在 350M 和 1B 模型规模上,总体准确率提升高达约 3-4%,在知识和推理密集型基准上提升尤为明显。
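To make the STEM substitution concrete, here is a minimal sketch in plain Python. It assumes a SwiGLU-style gated FFN, which the abstract does not spell out; the function and weight names (`stem_ffn`, `W_gate`, `W_down`, `E`) are hypothetical. The only change from a dense gated FFN is that the up-projection `W_up @ x` is replaced by a per-layer table row `E[token_id]`.

```python
import math

def silu(v):
    """SiLU activation (assumed here; the abstract does not name the activation)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def stem_ffn(x, token_id, W_gate, W_down, E):
    """Gated FFN whose up-projection is replaced by a token-indexed lookup.

    x        : hidden state for one token, length d_model
    token_id : vocabulary index of that token
    W_gate   : dense gate projection, shape d_ff x d_model
    W_down   : dense down projection, shape d_model x d_ff
    E        : layer-local embedding table, shape vocab x d_ff (hypothetical name)
    """
    gate = [silu(g) for g in matvec(W_gate, x)]   # dense, depends on the hidden state
    up = E[token_id]                              # static lookup, no matmul and no routing
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # dense down-projection
```

Because `E[token_id]` depends only on the token, the rows needed for a batch are known before the forward pass, which is what makes prefetching from CPU memory plausible. Editing one row changes the layer's behavior for that token alone.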
CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Authors: Chengfeng Zhao, Jiazhi Shu, Yubo Zhao, Tianyu Huang, Jiahao Lu, Zekai Gu, Chengwei Ren, Zhiyang Dou, Qing Shuai, Yuan Liu
First: 2026-01-15T17:52:29+00:00 · Latest: 2026-01-15T17:52:29+00:00
Comments: Project Page: https://igl-hkust.github.io/CoMoVi/
Abstract
In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
中文标题/摘要
标题:CoMoVi: 共生成3D人体动作和逼真视频
在本文中,我们发现3D人体动作生成和2D人体视频生成是内在耦合的。3D动作提供了视频中合理性和一致性的结构先验,而预训练的视频模型为动作提供了强大的泛化能力,这需要耦合它们的生成过程。基于此,我们提出了CoMoVi,一种将两个视频扩散模型(VDMs)耦合在一起,在单一扩散去噪循环中同步生成3D人体动作和视频的共生成框架。为此,我们首先提出了一种有效的2D人体动作表示,可以继承预训练VDMs的强大先验。然后,我们设计了一种双分支扩散模型,通过相互特征交互和3D-2D交叉注意力来耦合人体动作和视频生成过程。此外,我们构建了CoMoVi数据集,这是一个包含文本和动作注释的大规模真实世界人体视频数据集,涵盖了多样且具有挑战性的动作。广泛的实验表明,我们的方法在3D人体动作和视频生成任务中均具有有效性。
Summary / 总结
The paper argues that 3D human motion generation and 2D human video generation are intrinsically coupled and proposes CoMoVi, a co-generative framework that couples two video diffusion models to generate 3D motions and videos synchronously within a single denoising loop. It introduces a 2D human motion representation that inherits the prior of pre-trained video diffusion models, and a dual-branch diffusion model with mutual feature interaction and 3D-2D cross attentions. The authors also curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations. Experiments demonstrate the method's effectiveness in both 3D motion and video generation tasks.
研究旨在通过利用3D动作的结构先验和预训练视频模型的泛化能力,将3D人体动作和2D人体视频的生成过程耦合起来。CoMoVi是一个共生成框架,将两个视频扩散模型耦合在一起,同步生成3D人体动作和视频。方法包括有效的2D人体动作表示和具有相互特征交互和3D-2D交叉注意力的双分支扩散模型。实验表明,CoMoVi能够有效生成3D人体动作和逼真的视频。
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:基于双重不确定性指导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂的推理还是模糊的感知,从而无法有针对性地分配探索或学习信号。为解决这一问题,我们引入了DUPL,这是一种用于多模态RLVR的双重不确定性指导策略学习方法,它量化并利用感知不确定性(通过对称KL散度)和输出不确定性(通过策略熵)来指导策略更新。通过建立一个以不确定性驱动的反馈循环并采用动态分支优先机制,DUPL重新校准策略优势,使学习专注于具有高感知或决策模糊性的状态,从而实现有效的有针对性的探索,超越被动的数据增强。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提升高达11.2%,通用领域推理任务上的准确率提升高达7.1%,并且始终优于GRPO。这些结果表明,双重不确定性指导的策略学习是一种有效且可泛化的多模态RLVR方法。
Summary / 总结
The paper introduces DUPL, a dual-uncertainty guided policy learning approach for multimodal reinforcement learning with verifiable rewards (RLVR). It quantifies perceptual and output uncertainties to guide policy updates, focusing on states with high ambiguity. DUPL improves Qwen2.5-VL 3B and 7B models by up to 11.2% on visual math tasks and 7.1% on general-domain reasoning tasks, outperforming GRPO across six benchmarks.
论文提出了DUPL,一种用于多模态强化学习与可验证奖励(RLVR)的双重不确定性引导策略学习方法。它使用对称KL散度量化感知不确定性,使用策略熵量化输出不确定性来引导策略更新。DUPL在六个基准测试中提高了Qwen2.5-VL 3B和7B模型的表现,最高分别在视觉数学任务和通用领域推理任务上提升了11.2%和7.1%,并优于GRPO。
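The two uncertainty measures named in the DUPL abstract are standard quantities and can be sketched as below. This is a minimal sketch, not the authors' implementation: the abstract does not say which distributions are compared for the perceptual term or how the two signals enter the advantage. Comparing the policy's output distribution under an original and a perturbed view of the image is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL as the sum of both directions (some definitions halve this)."""
    return kl(p, q) + kl(q, p)

def entropy(p):
    """Shannon entropy of a discrete policy distribution, in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical example: next-token distributions under two views of the same image.
p_original = [0.7, 0.2, 0.1]
p_perturbed = [0.4, 0.4, 0.2]
perceptual_uncertainty = symmetric_kl(p_original, p_perturbed)  # large if the view matters
output_uncertainty = entropy(p_original)                        # large if the policy is unsure
```

A state where the first quantity is high and the second is low would suggest that the ambiguity lies in perception rather than in reasoning, which is the distinction the abstract draws.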
Data-Driven Dynamic Factor Modeling via Manifold Learning
Authors: Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
First: 2025-06-24T18:40:40+00:00 · Latest: 2026-01-15T17:50:32+00:00
Abstract
We introduce a data-driven dynamic factor framework for modeling the joint evolution of high-dimensional covariates and responses without parametric assumptions. Standard factor models applied to covariates alone often lose explanatory power for responses. Our approach uses anisotropic diffusion maps, a manifold learning technique, to learn low-dimensional embeddings that preserve both the intrinsic geometry of the covariates and the predictive relationship with responses. For time series arising from Langevin diffusions in Euclidean space, we show that the associated graph Laplacian converges to the generator of the underlying diffusion. We further establish a bound on the approximation error between the diffusion map coordinates and linear diffusion processes, and we show that ergodic averages in the embedding space converge under standard spectral assumptions. These results justify using Kalman filtering in diffusion-map coordinates for predicting joint covariate-response evolution. We apply this methodology to equity-portfolio stress testing using macroeconomic and financial variables from Federal Reserve supervisory scenarios, achieving mean absolute error improvements of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
中文标题/摘要
标题:基于流形学习的数据驱动动态因子建模
我们提出了一种数据驱动的动态因子框架,用于在无需参数假设的情况下建模高维协变量和响应的联合演变。单独应用于协变量的标准因子模型往往在解释响应方面失去效力。我们的方法使用各向异性扩散映射,这是一种流形学习技术,来学习低维嵌入,同时保留协变量的内在几何结构和与响应的预测关系。对于来自欧几里得空间朗之万扩散的时间序列,我们证明了相关的图拉普拉斯算子收敛于底层扩散的生成元。我们进一步建立了扩散映射坐标与线性扩散过程之间近似误差的上界,并证明在嵌入空间中的遍历平均值在标准谱假设下收敛。这些结果证明了在扩散映射坐标中使用卡尔曼滤波预测协变量-响应联合演变的合理性。我们使用美联储监管情景中的宏观经济和金融变量将该方法应用于股票组合压力测试,实现了相对于经典情景分析高达55%的平均绝对误差改进,以及相对于主成分分析基准高达39%的改进。
Summary / 总结
This paper presents a data-driven dynamic factor framework that models the joint evolution of high-dimensional covariates and responses without making parametric assumptions. It uses anisotropic diffusion maps to learn low-dimensional embeddings that preserve the intrinsic geometry of covariates and their predictive relationship with responses. The methodology is applied to equity-portfolio stress testing, showing improvements in mean absolute error of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
研究提出了一种无需参数假设的数据驱动动态因子框架,使用各向异性扩散映射学习低维嵌入,同时保留协变量的内在几何结构及其与响应的预测关系。研究证明,对于来自朗之万扩散的时间序列,图拉普拉斯算子收敛于底层扩散的生成元,并建立了扩散映射坐标与线性扩散过程之间的近似误差界。该方法应用于基于宏观经济和金融变量的股票组合压力测试,平均绝对误差相对于经典情景分析和主成分分析基准分别最多改善55%和39%。
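The abstract justifies running a Kalman filter in diffusion-map coordinates. As a rough illustration of that filtering step only, the sketch below applies the textbook scalar Kalman recursion to a single embedding coordinate. It assumes the coordinate follows a linear model x_{t+1} = a·x_t + noise and is observed with additive noise; the parameters `a`, `q`, `r` and the function name are hypothetical, and the diffusion-map embedding itself is not shown.

```python
def kalman_step(x, P, y, a=1.0, q=0.0, r=1.0):
    """One predict/update step of a scalar Kalman filter.

    x, P : previous state estimate and its variance
    y    : new noisy observation of the coordinate
    a    : linear dynamics coefficient (assumed)
    q, r : process and observation noise variances (assumed)
    """
    # Predict: propagate the estimate through the linear dynamics.
    x_pred = a * x
    P_pred = a * a * P + q
    # Update: blend prediction and observation using the Kalman gain.
    K = P_pred / (P_pred + r)
    x_new = x_pred + K * (y - x_pred)
    P_new = (1.0 - K) * P_pred
    return x_new, P_new

# Filtering a short, made-up sequence of one embedding coordinate.
x_est, P_est = 0.0, 1.0
for y_obs in [0.9, 1.1, 1.0]:
    x_est, P_est = kalman_step(x_est, P_est, y_obs, a=1.0, q=0.01, r=0.25)
```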
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们建立了两个发现。首先,置信度阈值化提供了在分布内进行机制性控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率
Summary / 总结
The research aims to enable reliable error control in high-stakes deployment of vision-language models by allowing systems to abstain when uncertain. The study investigates whether confidence-based abstention gives reliable control over error rates in video question answering, and whether that control holds under distribution shift. Using NExT-QA and Gemini 2.0 Flash, it finds that confidence thresholding provides mechanistic control in-distribution: sweeping the threshold epsilon produces smooth risk-coverage tradeoffs that reduce error rates.
该研究探讨了在视频问答中使用基于置信度的弃权机制来控制错误率,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA数据集和Gemini 2.0 Flash模型,研究发现置信度阈值在分布内提供了机制性的控制:随着阈值的变化,风险和覆盖率之间存在平滑的权衡关系。
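The threshold sweep described above can be sketched as follows. This is a minimal sketch under one assumption: the system answers when its confidence is at least epsilon and abstains otherwise. The function name and the toy data are hypothetical. Coverage is the fraction of questions answered, and risk is the error rate among answered questions.

```python
def risk_coverage(confidences, correct, epsilon):
    """Return (risk, coverage) when abstaining on items with confidence < epsilon."""
    answered = [ok for c, ok in zip(confidences, correct) if c >= epsilon]
    coverage = len(answered) / len(confidences)
    # Risk is undefined when nothing is answered; report 0.0 in that case.
    risk = sum(1 for ok in answered if not ok) / len(answered) if answered else 0.0
    return risk, coverage

# Made-up confidences and correctness labels for four questions.
confidences = [0.9, 0.8, 0.6, 0.4]
correct = [True, True, False, False]

# Sweeping epsilon traces out the risk-coverage curve.
curve = [(eps, *risk_coverage(confidences, correct, eps)) for eps in (0.0, 0.5, 0.7, 0.95)]
```

Raising epsilon trades coverage for lower risk. Whether the same epsilon yields the same risk on shifted data is the separate question the paper raises.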
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,有效从中提炼,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是通过像素跟踪。即使是私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、用于微调的自由形式视频问答数据集、一种新的具有复杂查询的对象跟踪数据集以及一种创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
Molmo2 is a new family of vision-language models that is state-of-the-art among open-source models and shows new capabilities in point-driven grounding on single-image, multi-image, and video tasks. The research addresses the lack of open foundations for improving video-language models by contributing 9 new datasets (7 video and 2 multi-image), collected without closed VLMs, together with a training recipe. The 8B model outperforms other open-weight, open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding it significantly outperforms open-weight models such as Qwen3-VL and surpasses proprietary models such as Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking).
Molmo2 是一个新的视觉-语言模型家族,在开源模型中达到最先进水平,并在单图像、多图像和视频任务中展示了基于指点的定位新能力。研究通过提供 9 个新数据集(7 个视频数据集和 2 个多图像数据集,均未使用封闭 VLM 收集)和训练方法来解决开源社区缺乏基础的问题。8B 模型在短视频、计数和字幕生成方面优于其他开放权重和数据的模型,在长视频方面具有竞争力;在视频定位方面显著优于 Qwen3-VL 等开放权重模型,并在部分任务上超越了 Gemini 3 Pro 等专有模型。
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造性问题解决的LLM独特性意识强化学习
强化学习(RL)已成为后训练大型语言模型(LLMs)的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在一小套主导的推理模式上,提高了pass@1,但限制了rollout级别的多样性和pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了独特性意识强化学习,这是一种rollout级别的目标,明确奖励那些表现出罕见高级策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的rollout根据其高级解决方案策略聚类,忽略表面差异,并按聚类大小成反比地重新加权策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@k,并增加了pass@k曲线下的面积(AUC@K),同时不牺牲pass@1,并保持探索、大规模揭示更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant strategies. It introduces Uniqueness-Aware Reinforcement Learning, which rewards correct solutions that exhibit rare high-level strategies, using an LLM-based judge to cluster rollouts and reweight policy advantages inversely with cluster size. The method improves pass@k across various reasoning benchmarks and increases AUC@K without compromising pass@1, promoting exploration and diversity in solutions.
本文解决了强化学习在大型语言模型中探索坍塌的问题,即策略倾向于关注少数主导推理模式。作者提出了基于独特性意识的强化学习方法,该方法通过LLM裁判对同一问题的rollout按高层策略进行聚类,并按聚类大小反比重新加权策略优势,从而奖励表现出罕见高层策略的正确解决方案。该方法在各种推理基准测试中提高了pass@k,增加了pass@k曲线下的面积(AUC@K),同时保持了pass@1并维持了探索多样性。
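The reweighting idea can be sketched in a few lines. This is a minimal sketch, not the paper's exact objective. It assumes a group-relative advantage (reward minus the group mean) and a weight of 1 / cluster size, and it takes the strategy clusters as given labels rather than calling an LLM judge. The function name is hypothetical.

```python
from collections import Counter

def uniqueness_weighted_advantages(rewards, clusters):
    """Scale each rollout's advantage inversely with the size of its strategy cluster.

    rewards  : one scalar reward per rollout for the same problem
    clusters : one strategy label per rollout (would come from an LLM judge)
    """
    mean_reward = sum(rewards) / len(rewards)   # group baseline (assumed)
    sizes = Counter(clusters)                   # number of rollouts per strategy
    return [(r - mean_reward) / sizes[c] for r, c in zip(rewards, clusters)]

# Four rollouts: two correct ones share strategy "a", one correct one uses rare strategy "b".
advantages = uniqueness_weighted_advantages(
    rewards=[1.0, 1.0, 1.0, 0.0],
    clusters=["a", "a", "b", "c"],
)
```

In this toy case the correct rollout with the rare strategy keeps its full advantage, while the two correct rollouts that share a strategy each receive half.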
Multi-Objective Pareto-Front Optimization for Efficient Adaptive VVC Streaming
Authors: Angeliki Katsenou, Vignesh V. Menon, Guoda Laurinaviciute, Benjamin Bross, Detlev Marpe
First: 2026-01-15T17:23:39+00:00 · Latest: 2026-01-15T17:23:39+00:00
Comments: 19 pages
Abstract
Adaptive video streaming has facilitated improved video streaming over the past years. A balance among coding performance objectives such as bitrate, video quality, and decoding complexity is required to achieve efficient, content- and codec-dependent, adaptive video streaming. This paper proposes a multi-objective Pareto-front (PF) optimization framework to construct quality-monotonic, content-adaptive bitrate ladders for Versatile Video Coding (VVC) streaming that jointly optimize video quality, bitrate, and decoding time, which is used as a practical proxy for decoding energy. Two strategies are introduced: the Joint Rate-Quality-Time Pareto Front (JRQT-PF) and the Joint Quality-Time Pareto Front (JQT-PF), each exploring different tradeoff formulations and objective prioritizations. The ladders are constructed under quality monotonicity constraints during adaptive streaming to ensure a consistent Quality of Experience (QoE). Experiments are conducted on a large-scale UHD dataset (Inter-4K), with quality assessed using PSNR, VMAF, and XPSNR, and complexity measured via decoding time and energy consumption. Compared to a widely used fixed ladder at the same XPSNR, the JQT-PF method achieves 11.76% average bitrate savings while reducing average decoding time by 0.29%. More aggressive configurations yield up to 27.88% bitrate savings at the cost of increased complexity. The JRQT-PF strategy, on the other hand, offers more controlled tradeoffs, achieving 6.38% bitrate savings and 6.17% decoding time reduction. This framework outperforms existing methods, including fixed ladders, VMAF- and XPSNR-based dynamic resolution selection, and complexity-aware benchmarks. The results confirm that PF optimization with decoding time constraints enables sustainable, high-quality streaming tailored to network and device capabilities.
中文标题/摘要
标题:多目标帕累托前沿优化以实现高效的自适应VVC流媒体
自适应视频流媒体在过去几年中促进了视频流媒体的改进。为了实现高效、内容和编解码器依赖的自适应视频流媒体,需要在比特率、视频质量和解码复杂性等编码性能目标之间取得平衡。本文提出了一种多目标帕累托前沿(PF)优化框架,以构建质量单调、内容自适应的Versatile Video Coding (VVC)流媒体比特率阶梯,该框架联合优化视频质量、比特率和解码时间,后者用作解码能耗的实用代理。介绍了两种策略:联合速率-质量-时间帕累托前沿(JRQT-PF)和联合质量-时间帕累托前沿(JQT-PF),每种策略探索不同的权衡形式和目标优先级。在自适应流媒体过程中,根据质量单调性约束构建阶梯,以确保一致的体验质量(QoE)。在大规模UHD数据集(Inter-4K)上进行了实验,使用PSNR、VMAF和XPSNR评估质量,使用解码时间和能耗测量复杂性。与广泛使用的固定阶梯相比,在保持相同XPSNR的情况下,JQT-PF方法平均节省了11.76%的比特率,同时将平均解码时间减少了0.29%。更激进的配置以复杂性增加为代价,可节省高达27.88%的比特率。另一方面,JRQT-PF策略提供了更可控的权衡,实现了6.38%的比特率节省和6.17%的解码时间减少。该框架优于现有方法,包括固定阶梯、基于VMAF和XPSNR的动态分辨率选择以及复杂性感知基准。结果表明,带解码时间约束的PF优化能够实现针对网络和设备能力的可持续、高质量流媒体。
Summary / 总结
This paper proposes a multi-objective Pareto-front optimization framework for adaptive Versatile Video Coding (VVC) streaming that balances bitrate, video quality, and decoding time. Two strategies, JRQT-PF and JQT-PF, explore different tradeoffs. Experiments on a large-scale UHD dataset (Inter-4K) show that, compared with a widely used fixed ladder at the same XPSNR, the JQT-PF method saves 11.76% bitrate on average while reducing average decoding time by 0.29%. The JRQT-PF strategy offers more controlled tradeoffs, with 6.38% bitrate savings and a 6.17% decoding time reduction. The framework outperforms existing methods, including fixed ladders.
该论文提出了一种多目标帕累托前沿优化框架,用于自适应Versatile Video Coding (VVC) 流媒体,旨在平衡码率、视频质量和解码时间。引入了两种策略JRQT-PF和JQT-PF来探索不同的权衡。实验结果表明,与广泛使用的固定码率阶梯相比,JQT-PF方法在保持相同XPSNR的情况下平均节省了11.76%的码率,同时减少了0.29%的解码时间,优于现有方法。JRQT-PF策略实现了6.38%的码率节省和6.17%的解码时间减少,具有更可控的权衡。
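The core of a Pareto-front selection over encodings can be sketched as a non-dominated filter. This is a minimal sketch of generic Pareto filtering over (bitrate, quality, decoding time), not the JRQT-PF or JQT-PF procedures themselves. It omits the quality-monotonicity constraint, and the `Encoding` type and the sample numbers are hypothetical.

```python
from typing import List, NamedTuple

class Encoding(NamedTuple):
    name: str
    bitrate: float   # kbps, lower is better
    quality: float   # e.g. XPSNR in dB, higher is better
    time: float      # decoding time in seconds, lower is better

def dominates(a: Encoding, b: Encoding) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a.bitrate <= b.bitrate and a.quality >= b.quality and a.time <= b.time
    better = a.bitrate < b.bitrate or a.quality > b.quality or a.time < b.time
    return no_worse and better

def pareto_front(points: List[Encoding]) -> List[Encoding]:
    """Keep only encodings that no other encoding dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up candidate encodings for one video.
candidates = [
    Encoding("A", 1000, 40.0, 5.0),
    Encoding("B", 1200, 40.0, 6.0),   # dominated by A: more bits and time, same quality
    Encoding("C", 2000, 45.0, 7.0),
    Encoding("D", 500, 35.0, 3.0),
]
front = pareto_front(candidates)
```

A bitrate ladder would then be chosen from `front`, for example by sorting by bitrate and keeping rungs whose quality increases with bitrate.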
RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
Authors: Peng Chen, Xiaobao Wei, Yi Yang, Naiming Yao, Hui Chen, Feng Tian
First: 2026-01-15T17:23:19+00:00 · Latest: 2026-01-15T17:23:19+00:00
Abstract
Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
中文标题/摘要
标题:RSATalker:面向多轮对话的逼真且具社会感知的说话头生成
说话头生成在虚拟现实(VR)中越来越重要,尤其是在涉及多轮对话的社交场景中。现有方法面临显著的局限性:基于网格的3D方法可以建模双人对话,但缺乏逼真的纹理,而基于大型模型的2D方法产生自然外观但计算成本高昂。最近,基于3D高斯泼溅(3DGS)的方法实现了高效且逼真的渲染,但仍仅支持单个说话人且忽略了社会关系。我们引入了RSATalker,这是第一个利用3DGS进行逼真且具社会感知的说话头生成并支持多轮对话的框架。我们的方法首先由语音驱动基于网格的3D面部运动,然后将3D高斯绑定到网格面片以渲染高保真2D头像视频。为了捕捉人际动态,我们提出了一种社会感知模块,通过可学习的查询机制将社会关系,包括血缘和非血缘以及平等和不平等,编码为高级嵌入。我们设计了三阶段训练范式,并构建了包含社会关系标注的RSATalker数据集,附带语音-网格-图像三元组。大量实验表明,RSATalker在逼真度和社会感知方面均达到最先进的性能。代码和数据集将被发布。
Summary / 总结
RSATalker is a framework for generating realistic and socially-aware talking heads for multi-turn conversations in VR. It uses 3D Gaussian Splatting to render high-fidelity 2D avatar videos, overcoming the limitations of existing methods. The socially-aware module encodes social relationships into high-level embeddings, and a three-stage training paradigm is employed. Experiments show that RSATalker outperforms existing methods in both realism and social awareness.
RSATalker 是一种面向多轮对话的逼真且具社会感知的说话头生成框架。它使用 3D 高斯泼溅来渲染高保真的 2D 头像视频,克服了现有方法的局限性。社会感知模块将社会关系编码为高级嵌入,并采用三阶段训练范式。实验表明,RSATalker 在逼真度和社会感知方面均优于现有方法。
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:面向推理的协作式多智能体测试时强化学习
多智能体系统已成为许多应用中的实用LLM驱动合作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应导致非平稳性,奖励通常稀疏且高方差。因此,我们引入了多智能体测试时强化学习(MATTRL)框架,在推理时向多智能体协商注入结构化文本经验。MATTRL 组建一个由专家组成的多专家团队,进行多轮讨论,检索和整合测试时的经验,并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池,然后将其重新注入对话。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 的准确率比多智能体基线平均提高了3.67%,比可比的单智能体基线提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需微调即可实现对分布偏移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. Across benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline and by 8.67% over comparable single-agent baselines.
论文提出了Multi-Agent Test-Time Reinforcement Learning (MATTRL),该方法在推理时注入结构化的文本经验,以提高鲁棒性和准确性。MATTRL 组建一个多专家团队进行多轮讨论,检索和整合测试时的经验,并达成共识进行最终决策。实验结果显示,在医学、数学和教育领域的基准测试中,MATTRL 的准确率比多智能体基线平均提高了 3.67%,比可比的单智能体基线提高了 8.67%。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自主驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然知之甚少。在本工作中,我们系统地研究了在上游视觉感知受控退化下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性感知模块。我们引入了感知现实的退化,这些退化仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误判,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了需要评估框架的重要性,这些框架能够明确考虑安全关键应用中的感知不确定性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) to perceptual degradation in autonomous driving and embodied AI systems. By introducing controlled corruptions to semantic segmentation, the research reveals severe failures in VLMs, such as hallucinations and omissions of critical entities, despite only moderate drops in segmentation metrics. The authors propose new language-level misalignment metrics to quantify these issues and highlight the need for evaluation frameworks that consider perception uncertainty in safety-critical applications.
该研究探讨了视觉-语言模型(VLMs)在自主驾驶和具身AI系统中的感知退化鲁棒性。通过在Cityscapes数据集上应用控制化的语义分割退化,研究发现VLMs存在严重的失败现象,如幻觉和关键实体的遗漏。作者引入了语言层面的不一致度量来量化这些问题,并发现像素级鲁棒性和多模态语义可靠性之间存在明显的断层,强调了需要更好的评估框架来考虑感知不确定性在关键安全应用中的影响。
STEP3-VL-10B Technical Report
Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
First: 2026-01-14T17:58:24+00:00 · Latest: 2026-01-15T17:06:04+00:00
Comments: 50 pages
Abstract
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10×-20× larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
中文标题/摘要
标题:STEP3-VL-10B 技术报告
我们提出了STEP3-VL-10B,这是一种轻量级开源基础模型,旨在重新定义紧凑效率与前沿级多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现:首先,一种统一的、完全解冻的预训练策略,基于1.2万亿多模态token,整合了语言对齐的感知编码器与Qwen3-8B解码器,以建立内在的视觉-语言协同作用;其次,一个规模化的后训练流水线,包含超过1000次强化学习迭代。关键的是,我们实现了并行协调推理(PaCoRe)以扩展测试时计算,将资源分配给可扩展的感知推理,探索和综合多种视觉假设。因此,尽管参数量仅为紧凑的10B,STEP3-VL-10B 在性能上与比其大10-20倍的模型(如GLM-4.6V-106B、Qwen3-VL-235B)以及顶级专有旗舰模型(如Gemini 2.5 Pro和Seed-1.5-VL)相当或更优。它在MMBench上取得了92.2%的得分,在MMMU上为80.11%,在复杂推理方面在AIME2025上达到94.43%,在MathVision上达到75.95%。我们发布了完整的模型套件,为社区提供了一个强大、高效且可复现的基线。
Summary / 总结
The research aims to develop a compact yet powerful multimodal foundation model, STEP3-VL-10B, through a unified, fully unfrozen pre-training strategy and a scaled post-training pipeline with over 1k iterations of reinforcement learning. The model integrates a language-aligned Perception Encoder with a Qwen3-8B decoder and uses Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute. Despite its 10B parameters, STEP3-VL-10B rivals or surpasses models 10-20 times larger as well as top-tier proprietary models, achieving 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.
研究旨在开发一种轻量级基础模型,平衡紧凑性和先进的多模态智能。STEP3-VL-10B采用统一的预训练策略和规模化的后训练流水线,包括超过1000次迭代的强化学习和并行协调推理(PaCoRe)。尽管参数量仅为100亿(10B),该模型的表现可与大10-20倍的模型相当或更优,在MMBench、MMMU、AIME2025和MathVision等基准测试中分别取得了92.2%、80.11%、94.43%和75.95%的成绩。
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一目标需要涵盖广泛领域的大规模、开放词汇视频动作数据集。我们介绍了Action100M,这是一个由120万个互联网教学视频(总时长14.6年)构建的大规模数据集,产生了约1亿(O(100 million))个带时间定位的片段,具有开放词汇的动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述,并组织成描述树(Tree-of-Captions),(iii)在多轮Self-Refine流程下使用推理模型(GPT-OSS-120B)聚合证据,以输出结构化注释(简要/详细动作、动作执行者、简要/详细描述)。在Action100M上训练VL-JEPA,显示出随数据规模增长的一致改进,并在各种动作识别基准测试中取得强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
The research motivation is to develop a large-scale video action dataset for advancing machine intelligence in physical world applications. The main method involves constructing Action100M from 1.2 million instructional videos, using a fully automated pipeline for hierarchical temporal segmentation, multi-level captioning, and structured annotation refinement. Key experimental findings show consistent data-scaling improvements and strong zero-shot performance across various action recognition benchmarks, establishing Action100M as a valuable resource for video understanding and world modeling research.
研究动机是开发大规模视频动作数据集,以推动机器智能在物理世界中的应用。主要方法是从120万条教学视频中创建Action100M,并使用全自动流水线进行层次时间分割、多级字幕生成和结构化注释精炼。关键实验发现表明,在各种动作识别基准测试中,数据量的增加表现出一致的改进,并且具有强大的零样本性能,确立了Action100M作为视频理解和世界建模研究的基础资源的地位。
SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning
Authors: Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince Calhoun, Sergey Plis
Venue: Transactions on Machine Learning Research, 2026
First: 2024-05-15T02:13:51+00:00 · Latest: 2026-01-15T17:01:07+00:00
Comments: Published in Transactions on Machine Learning Research (TMLR), 2026
Abstract
In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy-sparsity trade-off, achieving more than 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by 2× relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over 2.3× faster communication time, underscoring its practical efficiency.
中文标题/摘要
标题:SSFL:初始化时发现稀疏统一子网络以实现高效的联邦学习
在本文中,我们提出了显著稀疏联邦学习(SSFL),这是一种用于稀疏联邦学习的高效通信简化方法。SSFL 在训练前识别一个稀疏子网络,利用在非IID场景下分别在本地客户端数据上计算的参数显著性分数进行聚合,以确定全局掩码。每次在客户端和服务器之间仅训练和通信稀疏模型权重。在包括CIFAR-10、CIFAR-100和Tiny-ImageNet的标准基准测试中,SSFL 一致地改善了准确性和稀疏性之间的权衡,在CIFAR-10上相对于最强的稀疏基线实现了超过20%的相对误差减少,同时将通信成本减少了2倍。最后,在实际的联邦学习部署中,SSFL 实现了超过2.3倍的更快通信时间,突显了其实用效率。
Summary / 总结
The research aims to improve the efficiency of federated learning by identifying a sparse subnetwork at initialization using parameter saliency scores. The method involves computing saliency scores locally on client data and aggregating them to form a global mask, which is then used to train and communicate only sparse model weights. Key findings show that SSFL enhances the accuracy-sparsity trade-off, achieving over 20% relative error reduction on CIFAR-10 compared to the best sparse baseline, while reducing communication costs by 2 times compared to dense federated learning. In practical deployment, SSFL also reduces communication time by over 2.3 times.
研究动机是在保持或提高模型准确性的同时降低通信成本,从而提高联邦学习的效率。SSFL在初始化时利用各客户端本地数据上计算的参数显著性分数识别稀疏子网络,并将这些分数聚合形成全局掩码。该方法只训练和通信稀疏模型权重,与最强的稀疏基线相比,在CIFAR-10上的相对误差降低超过20%,通信成本相对于稠密联邦学习减少了2倍。此外,在实际的联邦学习部署中,SSFL的通信速度提升超过2.3倍。
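The mask-selection step described in the SSFL abstract (local saliency scores, aggregated into one global mask) can be illustrated with a minimal sketch. The abstract does not specify the aggregation rule or the selection criterion, so the dataset-size-weighted average and the top-k threshold below are assumptions, and the function names `aggregate_saliency` and `global_mask` are hypothetical.

```python
from typing import List


def aggregate_saliency(client_scores: List[List[float]],
                       client_sizes: List[int]) -> List[float]:
    """Average per-parameter saliency across clients.

    Assumption: clients are weighted by local dataset size; the
    abstract only says that the scores are "aggregated".
    """
    total = sum(client_sizes)
    n_params = len(client_scores[0])
    return [
        sum(scores[i] * size
            for scores, size in zip(client_scores, client_sizes)) / total
        for i in range(n_params)
    ]


def global_mask(saliency: List[float], density: float) -> List[int]:
    """Keep the top `density` fraction of parameters (assumed top-k rule)."""
    k = max(1, round(density * len(saliency)))
    ranked = sorted(range(len(saliency)),
                    key=lambda i: saliency[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(saliency))]


# Two clients, four parameters, keep half of the weights.
scores = [[0.9, 0.1, 0.5, 0.2], [0.1, 0.8, 0.6, 0.1]]
mask = global_mask(aggregate_saliency(scores, [1, 1]), density=0.5)
```

Because every client trains under the same mask, only the kept weights need to be exchanged each round, which is where the communication saving reported in the abstract would come from.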
Searching for Quantum Effects in the Brain: A Bell-Type Test for Nonclassical Latent Representations in Autoencoders
Authors: I. K. Kominis, C. Xie, S. Li, M. Skotiniotis, G. P. Tsironis
First: 2026-01-15T16:59:40+00:00 · Latest: 2026-01-15T16:59:40+00:00
Comments: 6 pages, 2 figures
Abstract
Whether neural information processing is entirely classical or involves quantum-mechanical elements remains an open question. Here we propose a model-agnostic, information-theoretic test of nonclassicality that bypasses microscopic assumptions and instead probes the structure of neural representations themselves. Using autoencoders as a transparent model system, we introduce a Bell-type consistency test in latent space, and ask whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. By shifting the search for quantum-like signatures in neural systems from microscopic dynamics to experimentally testable constraints on information processing, this work opens a new route for probing the fundamental physics of neural computation.
中文标题/摘要
标题:在大脑中寻找量子效应:非经典潜变量表示在自编码器中的贝尔型测试
神经信息处理是否完全经典或涉及量子力学元素仍是一个开放问题。在这里,我们提出了一种模型无关的信息论非经典性测试,该测试绕过了微观假设,而是直接探测神经表示本身的结构。利用自编码器作为透明的模型系统,我们引入了潜空间中的贝尔型一致性测试,并询问在多种读出上下文中获得的解码统计是否可以由单一正潜变量分布联合解释。通过将对神经系统中似量子特征的搜索从微观动力学转移到可实验测试的信息处理约束上,这项工作为探索神经计算的基本物理学开辟了一条新途径。
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(基于图像复杂性的自适应检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而为复杂的图像通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练来解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在相同的语义空间中保持了图像表示和文本嵌入之间的兼容性,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据进行评估,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并且保留了 95% 的实例级性能,从而实现了视觉语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach for vision transformers in vision-language models that uses less compute for simple images while processing complex images through the full network depth. Dual-path training produces compatible embeddings from both the early-exit and full-depth paths, maintaining cross-modal alignment. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. The ConvNeXt-IC model, used for image complexity assessment, attains a Pearson correlation of 0.959 with human labelling while predicting complexity 4.4x faster. On standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance.
论文提出了ICAR(图像复杂性感知检索)方法,该方法使视觉语言模型中的视觉变换器能够为简单图像使用较少的计算资源,并对复杂图像使用全深度网络处理。通过双路径训练来解决跨模态对齐的挑战。ICAR在基准测试中实现了20%的图像编码加速,同时保持类别级性能和95%的实例级性能。ConvNeXt-IC采用现代分类器骨干网络,图像复杂性预测速度提升4.4倍,与人工标注的皮尔逊相关系数达到0.959。
Combinatorial Optimization Augmented Machine Learning
Authors: Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
First: 2026-01-15T16:55:19+00:00 · Latest: 2026-01-15T16:55:19+00:00
Abstract
Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
中文标题/摘要
标题:组合优化增强机器学习
组合优化增强机器学习(COAML)最近作为一种强大的范式崭露头角,它将预测模型与组合决策相结合。通过将组合优化预言机(oracles)嵌入到学习管道中,COAML能够构建既数据驱动又保持可行性的策略,连接了机器学习、运筹学和随机优化的传统。本文全面概述了COAML的最新进展。我们介绍了一个统一的COAML管道框架,描述了其方法论构建块,并形式化了其与经验成本最小化的关系。然后,我们根据不确定性的形式和决策结构建立了问题设定的分类法。使用这种分类法,我们回顾了静态和动态问题的算法方法,综述了调度、车辆路径、随机规划和强化学习等领域的应用,并从经验成本最小化、模仿学习和强化学习的角度综合了方法论贡献。最后,我们确定了关键的研究前沿。本综述旨在既作为该领域的教程性介绍,又作为组合优化与机器学习交叉领域未来研究的路线图。
Summary / 总结
The paper is motivated by the need to integrate combinatorial optimization with machine learning to build policies that are both data-driven and feasibility-preserving. As a survey, it introduces a unifying framework for COAML pipelines, in which combinatorial optimization oracles are embedded into learning pipelines, and formalizes their connection to empirical cost minimization. Its main contributions are a taxonomy of problem settings based on the form of uncertainty and decision structure, a review of algorithmic approaches for static and dynamic problems, and a survey of applications in domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning.
论文探讨了组合优化增强机器学习(COAML),该方法将预测模型与组合决策相结合。它引入了COAML管道的统一框架,回顾了静态和动态问题的算法方法,并在调度、车辆路径、随机规划和强化学习等多个领域进行了应用综述。主要发现包括COAML与经验成本最小化、模仿学习和强化学习之间的联系,以及该领域研究前沿的识别。
From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA
Authors: Kimia Abedini, Farzad Shami, Gianmaria Silvello
First: 2026-01-15T16:54:11+00:00 · Latest: 2026-01-15T16:54:11+00:00
Comments: Accepted paper by the 48th European Conference on Information Retrieval (ECIR'26)
Abstract
Comprehending genomic information is essential for biomedical research, yet extracting data from complex distributed databases remains challenging. Large language models (LLMs) offer potential for genomic Question Answering (QA) but face limitations due to restricted access to domain-specific databases. GeneGPT is the current state-of-the-art system that enhances LLMs by utilizing specialized API calls, though it is constrained by rigid API dependencies and limited adaptability. We replicate GeneGPT and propose GenomAgent, a multi-agent framework that efficiently coordinates specialized agents for complex genomics queries. Evaluated on nine tasks from the GeneTuring benchmark, GenomAgent outperforms GeneGPT by 12% on average, and its flexible architecture extends beyond genomics to various scientific domains needing expert knowledge extraction.
中文标题/摘要
标题:从单体到多智能体推理:推进GeneGPT在基因组问答中的应用
理解基因组信息对于生物医学研究至关重要,但从复杂分布的数据库中提取数据仍然具有挑战性。大型语言模型(LLMs)为基因组问答(QA)提供了潜在解决方案,但受限于对领域特定数据库的访问限制。GeneGPT是当前最先进的系统,通过使用专门的API调用增强LLMs,但它受到固定API依赖性和有限适应性的限制。我们复制了GeneGPT并提出了GenomAgent,这是一种多智能体框架,能够高效地协调专门的智能体以处理复杂的基因组查询。在GeneTuring基准测试的九项任务上,GenomAgent的平均性能比GeneGPT高出12%,其灵活的架构超越了基因组学,适用于需要专家知识提取的各种科学领域。
Summary / 总结
The research aims to improve genomic question answering by addressing the limitations of existing large language models and specialized API dependencies. The study introduces GenomAgent, a multi-agent framework that coordinates specialized agents for complex genomics queries. GenomAgent outperforms GeneGPT, the current state-of-the-art system, by 12% on average across nine tasks from the GeneTuring benchmark and offers a flexible architecture for various scientific domains requiring expert knowledge extraction.
研究旨在通过克服现有大型语言模型的限制来提升基因组问答能力。该研究引入了GenomAgent,这是一种多智能体系统,通过协调专门的智能体克服了GeneGPT的僵化性。在GeneTuring基准测试中,GenomAgent在平均性能上比GeneGPT提高了12%,并且展示了在需要专家知识提取的更广泛科学领域中的应用潜力。
How Quantization Shapes Bias in Large Language Models
Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
First: 2025-08-25T14:48:26+00:00 · Latest: 2026-01-15T16:30:08+00:00
Abstract
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
中文标题/摘要
标题:量化如何塑造大型语言模型中的偏差
本研究全面评估了量化对模型偏差的影响,特别关注其对个体人口子群体的影响。我们专注于权重和激活量化策略,并在包括刻板印象、公平性、毒性及情感在内的多种偏差类型上考察其影响。我们使用概率和生成文本的指标,在13个基准上评估了不同架构家族和推理能力的模型。我们的研究结果表明,量化对偏差的影响是复杂的:虽然它可以减少模型的毒性且对情感影响不大,但它往往会增加生成任务中的刻板印象和不公平性,尤其是在激进压缩下。这些趋势在不同的人口类别和子群体以及不同模型类型中通常是一致的,尽管其程度取决于具体环境。总体而言,我们的结果强调了在实践中应用量化时在效率和伦理考虑之间仔细平衡的重要性。
Summary / 总结
This study evaluates how quantization influences model bias, particularly focusing on its effects on demographic subgroups. The research examines weight and activation quantization strategies across various bias types, using both probability- and generated text-based metrics on 13 benchmarks. The findings indicate that while quantization can reduce model toxicity and does not significantly affect sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories, subgroups, and model types, though the magnitude varies depending on the specific setting.
这项研究评估了量化如何影响模型偏见,特别是对不同人口子群体的影响。研究考察了权重和激活量化在各种偏见类型上的效果,使用了13个基准的基于概率和生成文本的度量标准。研究发现,虽然量化可以减少模型的毒性且对情感影响不大,但它往往会略微增加生成任务中的刻板印象和不公平性,尤其是在激进压缩下。这些趋势在不同的人口类别、子群体和模型类型中大体一致,但具体影响的大小取决于特定的设置。
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Authors: Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque
First: 2026-01-15T16:28:14+00:00 · Latest: 2026-01-15T16:28:14+00:00
Comments: 16 pages, 4 figures
Abstract
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
中文标题/摘要
标题:基于表示感知的去学习:从抑制到知识签名消除
从大型语言模型中选择性地删除知识对于满足GDPR合规性和模型安全性至关重要,但当前的去学习方法将行为抑制与真正的知识删除混淆,允许潜在能力在表面拒绝之下持续存在。在本文中,我们通过引入知识免疫框架(KIF),一种基于表示感知的架构,通过针对内部激活签名而不是表面输出来区分真正的删除与混淆,来解决这一挑战。我们的方法结合了针对特定主体的表示的动态抑制与参数高效的适应,从而在无需完全重新训练模型的情况下实现持久的去学习。KIF 实现了接近完美的删除(FQ 约为 0.99,与 1.00 相比)同时保持了与 oracle 水平相当的实用性(MU = 0.62),有效地打破了所有先前工作都受到限制的稳定性和删除之间的权衡。我们在 3B 到 14B 参数的标准基础模型(Llama 和 Mistral)和推理优先模型(Qwen 和 DeepSeek)上进行了评估。我们的观察表明,标准模型表现出与规模无关的真实删除(<3% 实用性漂移),而推理优先模型揭示了基本的架构差异。我们综合的双指标评估协议结合了表面泄漏与潜在痕迹持续性,操作化了混淆与删除之间的区分,并首次系统地诊断了不同模型家族和规模下的机制级遗忘行为。
Summary / 总结
This paper addresses the challenge of selective knowledge erasure from large language models (LLMs) by introducing the Knowledge Immunization Framework (KIF), which targets internal activation signatures to achieve true knowledge removal without full model retraining. KIF combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, achieving near-oracle erasure while preserving utility. The study evaluates KIF on various models, demonstrating scale-independent true erasure for standard models and revealing architectural divergence in reasoning-prior models. The dual-metric evaluation protocol helps diagnose mechanism-level forgetting behavior across different model families and scales.
本文针对大型语言模型(LLMs)中的选择性知识擦除问题,引入了知识免疫框架(KIF),通过靶向内部激活签名来区分真正的擦除和混淆。KIF 结合对特定主体表示的动态抑制与参数高效的适应,在保持实用性的同时实现了接近 oracle 水平的擦除效果。研究在多种模型上评估了 KIF,发现标准模型表现出与规模无关的真正擦除效果,而推理优先模型揭示了架构上的根本差异。双指标评估协议有助于诊断不同模型家族和规模下的机制级遗忘行为。
Kolmogorov Arnold Networks and Multi-Layer Perceptrons: A Paradigm Shift in Neural Modelling
Authors: Aradhya Gaonkar, Nihal Jain, Vignesh Chougule, Nikhil Deshpande, Sneha Varur, Channabasappa Muttal
First: 2026-01-15T16:26:49+00:00 · Latest: 2026-01-15T16:26:49+00:00
Comments: 13 pages, 8 figures, 2 tables
Abstract
The research undertakes a comprehensive comparative analysis of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptrons (MLP), highlighting their effectiveness in solving essential computational challenges like nonlinear function approximation, time-series prediction, and multivariate classification. Rooted in Kolmogorov's representation theorem, KANs utilize adaptive spline-based activation functions and grid-based structures, providing a transformative approach compared to traditional neural network frameworks. Utilizing a variety of datasets spanning mathematical function estimation (quadratic and cubic) to practical uses like predicting daily temperatures and categorizing wines, the proposed research thoroughly assesses model performance via accuracy measures like Mean Squared Error (MSE) and computational expense assessed through Floating Point Operations (FLOPs). The results indicate that KANs reliably exceed MLPs in every benchmark, attaining higher predictive accuracy with significantly reduced computational costs. Such an outcome highlights their ability to maintain a balance between computational efficiency and accuracy, rendering them especially beneficial in resource-limited and real-time operational environments. By elucidating the architectural and functional distinctions between KANs and MLPs, the paper provides a systematic framework for selecting the most suitable neural architectures for specific tasks. Furthermore, the proposed study highlights the transformative capabilities of KANs in progressing intelligent systems, influencing their use in situations that require both interpretability and computational efficiency.
中文标题/摘要
标题:科莫戈罗夫-阿诺德网络与多层感知机:神经建模范式的转变
研究对科莫戈罗夫-阿诺德网络(KAN)和多层感知机(MLP)进行了全面的比较分析,突显了它们在解决非线性函数逼近、时间序列预测和多元分类等关键计算挑战方面的有效性。基于科莫戈罗夫表示定理,KANs 使用自适应样条激活函数和基于网格的结构,提供了与传统神经网络框架相比具有变革性的方法。通过涵盖从数学函数估计(二次和三次)到实际应用如预测每日温度和分类葡萄酒等多种数据集,研究通过均方误差(MSE)等准确性指标和通过浮点运算(FLOPs)评估的计算成本,全面评估了模型性能。结果表明,KANs 在所有基准测试中都可靠地超过了 MLPs,具有更高的预测准确性和显著降低的计算成本。这一结果突显了它们在计算效率和准确性之间保持平衡的能力,特别是在资源有限和实时操作环境中尤其有益。通过阐明 KANs 和 MLPs 的架构和功能差异,论文提供了一种系统的方法来选择最适合特定任务的神经网络架构。此外,提出的研究所强调的 KANs 在推进智能系统方面的变革能力,影响了其在需要可解释性和计算效率的情况下使用。
Summary / 总结
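The study compares Kolmogorov-Arnold Networks (KANs) and Multi-Layer Perceptrons (MLPs) on nonlinear function approximation, time-series prediction, and multivariate classification. KANs, rooted in Kolmogorov's representation theorem, use adaptive spline-based activation functions and grid-based structures. Across datasets ranging from quadratic and cubic function estimation to daily temperature prediction and wine classification, models are assessed by Mean Squared Error (MSE) and Floating Point Operations (FLOPs). The authors report that KANs exceed MLPs on every benchmark, with higher predictive accuracy at significantly lower computational cost.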
该研究对比了Kolmogorov-Arnold网络(KAN)和多层感知机(MLP)在非线性函数近似、时间序列预测和多变量分类中的应用效果。KANs基于Kolmogorov表示定理,采用自适应样条激活函数和网格结构,展示了在各种数据集上的优越性能,在准确性和计算效率方面均优于MLP。研究发现,KANs在资源受限和实时应用环境中能够实现更高的预测精度,同时计算资源消耗显著减少。
Process-Guided Concept Bottleneck Model
Authors: Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan
First: 2026-01-15T16:25:55+00:00 · Latest: 2026-01-15T16:25:55+00:00
Comments: 13 pages with 7 figures and 1 table, Supplementary Materials 10 pages with 3 figures
Abstract
Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.
中文标题/摘要
标题:过程导向的概念瓶颈模型
概念瓶颈模型(CBMs)通过引入中间语义概念来提高黑盒深度学习(DL)的可解释性。然而,标准CBMs往往忽视了特定领域的关系和因果机制,并且依赖完整的概念标签限制了其在监督稀疏但过程定义良好的科学领域中的应用。为了解决这个问题,我们提出了过程导向的概念瓶颈模型(PG-CBM),这是一种扩展的CBM,通过生物物理意义明确的中间概念来约束学习,使其遵循领域定义的因果机制。以地球观测数据估计地上生物量密度为例,我们展示了PG-CBM相比多个基准模型减少了误差和偏差,同时利用多源异构训练数据并产生可解释的中间输出。除了提高准确性,PG-CBM还增强了透明度,能够检测虚假学习,并提供科学见解,代表了在科学应用中迈向更可信赖的人工智能系统的一个步骤。
Summary / 总结
The motivation for the Process-Guided Concept Bottleneck Model (PG-CBM) is to improve the explainability and applicability of Concept Bottleneck Models (CBMs) in scientific domains where complete concept labels are sparse but domain-specific causal mechanisms are well defined. PG-CBM constrains learning to follow these causal mechanisms through biophysically meaningful intermediate concepts. The key experimental finding is that PG-CBM reduces error and bias in above ground biomass density estimation compared to multiple benchmarks, while leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs.
研究旨在通过解决概念瓶颈模型(CBMs)在处理特定领域因果机制和稀疏监督方面的局限性,提高其解释性和适用性。提出了过程导向的概念瓶颈模型(PG-CBM),该模型通过遵循生物物理意义的因果机制来约束学习。在基于地球观测数据估算地上生物量密度的案例研究中,PG-CBM在减少误差和偏差方面优于基准模型,同时利用多源数据并产生可解释的输出。
Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems
Authors: Xi Shi, Mengxin Zheng, Qian Lou
First: 2026-01-15T16:23:53+00:00 · Latest: 2026-01-15T16:23:53+00:00
Comments: Preprint
Abstract
Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS
中文标题/摘要
标题:学习面向延迟感知的并行多智能体系统编排
多智能体系统(MAS)通过协调多个智能体实现复杂的推理,但由于多步执行和重复模型调用,往往导致较高的推理延迟,严重限制了其在时间敏感场景中的可扩展性和可用性。大多数现有方法主要优化任务性能和推理成本,并显式或隐式假设顺序执行,使得它们在并行执行下控制延迟方面不太理想。在本文中,我们研究了在并行执行下具有显式延迟监督的多智能体系统学习式编排。我们提出了面向延迟的多智能体系统(LAMaS),这是一种面向延迟的多智能体编排框架,能够实现并行执行,并显式优化关键执行路径,使控制器能够在并行执行下构建具有较低延迟的执行拓扑图。我们的实验表明,与多个基准上的最新基线方法相比,我们的方法在多智能体架构搜索中将关键路径长度减少了38-46%,同时保持或甚至提高了任务性能。这些结果突显了在设计高效多智能体系统时显式优化并行执行下延迟的重要性。代码可在https://github.com/xishi404/LAMaS获取
Summary / 总结
This work addresses the high inference latency in multi-agent systems (MAS) by proposing LAMaS, a latency-aware orchestration framework. LAMaS optimizes the critical execution path for parallel execution, reducing the critical path length by 38-46% compared to existing methods while maintaining task performance. This highlights the necessity of explicitly optimizing latency in parallel MAS design.
该研究通过提出LAMaS,一种针对并行执行的延迟感知多智能体系统编排框架,解决了多智能体系统(MAS)中的高推理延迟问题。实验结果显示,与最先进的基线相比,LAMaS将关键路径长度减少了38-46%,同时保持或提升了任务性能。这突显了在设计高效MAS时明确优化并行执行下的延迟的重要性。
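To make the "critical execution path" in the LAMaS abstract concrete, the sketch below computes end-to-end latency of an agent dependency graph when independent agents run in parallel. It is an illustration under assumptions, not the LAMaS controller: the graph, the per-agent latencies, and the function name `critical_path_latency` are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List


def critical_path_latency(latency: Dict[str, float],
                          deps: Dict[str, List[str]]) -> float:
    """Longest latency-weighted path through an acyclic agent graph.

    Under parallel execution an agent starts once all of its
    dependencies have finished, so total latency equals the finish
    time of the slowest chain rather than the sum over all agents.
    """
    finish: Dict[str, float] = {}

    def finish_time(node: str) -> float:
        if node not in finish:
            start = max((finish_time(d) for d in deps.get(node, [])),
                        default=0.0)
            finish[node] = start + latency[node]
        return finish[node]

    return max(finish_time(n) for n in latency)


# Hypothetical topology: "search" and "code" both wait on "plan"
# and run in parallel; "merge" waits on both.
latency = {"plan": 1.0, "search": 3.0, "code": 2.0, "merge": 1.0}
deps = {"search": ["plan"], "code": ["plan"], "merge": ["search", "code"]}

parallel = critical_path_latency(latency, deps)   # plan -> search -> merge
sequential = sum(latency.values())                # every agent in series
```

In this toy graph, shortening "code" would not change the parallel latency because it is off the critical path, which illustrates why optimizing total inference cost alone can fail to control latency under parallel execution.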
DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery
Authors: Constantin Selzer, Fabian B. Flohr
Venue: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 2024, pp. 221-227
First: 2026-01-15T16:18:42+00:00 · Latest: 2026-01-15T16:18:42+00:00
Abstract
The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements of up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
中文标题/摘要
标题:DeepUrban:基于航拍影像的面向自动驾驶的交互感知轨迹预测与规划
自动驾驶系统的有效性在很大程度上依赖于强大的预测和规划能力。然而,当前的基准测试明显缺乏密集交通场景,而这类场景对于理解并建模道路使用者之间的复杂交互至关重要。为了解决这一问题,我们与工业合作伙伴DeepScenario合作,开发了DeepUrban,这是一个新的无人机数据集,旨在增强面向密集城市环境的轨迹预测和规划基准。DeepUrban提供了丰富的3D交通对象,这些对象提取自在城市交叉口上空约100米高度拍摄的高分辨率图像。数据集还包含全面的地图和场景信息,以支持高级建模和仿真任务。我们评估了最先进的(SOTA)预测和规划方法,并进行了泛化能力实验。我们的研究结果表明,将DeepUrban添加到nuScenes中可以提高车辆预测和规划的准确性,在ADE / FDE指标上的改进幅度最高可达44.1% / 44.3%。网站:https://iv.ee.hm.edu/deepurban
Summary / 总结
The research aims to improve autonomous driving systems by addressing the scarcity of dense traffic scenarios in current benchmarks. The study introduces DeepUrban, a new drone dataset that captures 3D traffic objects from urban intersections, enhancing trajectory prediction and planning. Experiments show that integrating DeepUrban into nuScenes improves prediction accuracy by up to 44.1% and 44.3% on ADE and FDE metrics, respectively.
研究旨在通过提高轨迹预测和规划能力,尤其是针对密集的城市环境,来增强自动驾驶系统。为了解决当前基准中缺乏密集交通场景的问题,作者开发了DeepUrban,这是一个新的无人机数据集。DeepUrban从约100米高空拍摄的高分辨率图像中提取3D交通对象,并包含详细的地图和场景信息。实验结果显示,将DeepUrban加入nuScenes后,车辆预测和规划在ADE和FDE指标上分别最多提升44.1%和44.3%。
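The ADE / FDE metrics cited in the DeepUrban results are the standard average and final displacement errors for trajectory prediction. The sketch below computes them for a single 2D trajectory; the function names and sample points are hypothetical, and multi-agent or multi-modal variants (for example minimum over several predicted modes) are not covered.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def ade(pred: List[Point], gt: List[Point]) -> float:
    """Average Displacement Error: mean Euclidean distance per timestep."""
    if not gt or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def fde(pred: List[Point], gt: List[Point]) -> float:
    """Final Displacement Error: Euclidean distance at the last timestep."""
    if not gt or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and equal length")
    return math.dist(pred[-1], gt[-1])


# Prediction drifts 0, 1, and 2 units from the ground truth over three steps.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
```

Both metrics are errors, so lower is better; an "improvement of up to 44.1%" on ADE corresponds to a reduction of the error by that fraction.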
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
First: 2026-01-15T16:18:00+00:00 · Latest: 2026-01-15T16:18:00+00:00
Comments: 22 pages, 10 figures
Abstract
State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
中文标题/摘要
标题:视频生成模型与潜在世界模型的推理时物理对齐
最先进的视频生成模型能够生成令人印象深刻的视觉内容,但往往违反基本的物理原理,限制了它们的应用。虽然有人认为这种缺陷源于预训练时对物理理解不足,但我们发现物理合理性不足也源于不理想的推理策略。因此,我们引入了WMReward,并将提高视频生成的物理合理性视为一种推理时的对齐问题。具体来说,我们利用潜在世界模型(这里为VJEPA-2)的强物理先验作为奖励,搜索和引导多个候选去噪轨迹,从而实现测试时计算量的扩展,以提高生成性能。实验证明,我们的方法在图像条件、多帧条件和文本条件生成设置中显著提高了物理合理性,得到了人类偏好研究的验证。值得注意的是,在ICCV 2025感知测试PhysicsIQ挑战中,我们获得了62.64%的最终得分,获得第一名,并且超越了之前的最先进的技术水平7.42%。我们的工作证明了使用潜在世界模型提高视频生成的物理合理性的可行性,超越了这一特定实例或参数化。
Summary / 总结
The research addresses video generative models that violate basic physics principles by introducing WMReward, which treats physics plausibility as an inference-time alignment problem. The method uses the strong physics prior of a latent world model (VJEPA-2) as a reward to search over and steer multiple candidate denoising trajectories, trading extra test-time compute for better generations. The approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned settings, and won first place in the ICCV 2025 Perception Test PhysicsIQ Challenge with a score of 62.64%, surpassing the previous state of the art by 7.42%.
研究通过引入WMReward,将物理合理性视为推理时的对齐问题,利用潜在世界模型的强大物理先验搜索和引导多个候选去噪轨迹,提升生成性能。该方法在不同生成设置中显著提高了物理合理性,并在ICCV 2025 Perception Test PhysicsIQ挑战赛中获得62.64%的分数,超越了之前的最佳水平7.42%,证明了使用潜在世界模型提高视频生成物理合理性的可行性。
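The idea of using a reward to search over several candidate generations can be illustrated with plain best-of-N selection. This is only a sketch of the general pattern, not the WMReward procedure: the abstract does not describe how trajectories are steered or how VJEPA-2 is queried, so `generate` and `reward` below are hypothetical stand-ins.

```python
import random

def best_of_n(generate, reward, n, seed=0):
    """Draw n candidates and keep the one with the highest reward."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy stand-ins: a "sample" is a number in [0, 1) and the reward
# prefers samples close to 0.5. A real system would generate videos
# and score them with a learned world model (assumption).
def toy_generate(rng):
    return rng.random()

def toy_reward(x):
    return -abs(x - 0.5)

sample, score = best_of_n(toy_generate, toy_reward, n=8)
```

Increasing `n` spends more test-time compute and can only raise (never lower) the best reward found for a fixed seed sequence.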
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知能力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面常常遇到困难,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体来说,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一个双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在全面的新建城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve the automated perception of urban roadside infrastructure for smart city management by addressing the limitations of general-purpose models. The proposed domain-adapted framework fine-tunes large vision-language models using open-vocabulary techniques and LoRA-based adaptation, integrating a dual-modality RAG module for retrieving industry standards and visual exemplars. This approach achieves a detection performance of 58.9 mAP and 95.5% attribute recognition accuracy, showcasing a robust solution for intelligent infrastructure monitoring.
研究旨在通过智能感知城市路边基础设施来提升智慧城市管理。提出了一种领域适应框架,该框架通过数据高效微调和知识导向的推理机制来优化大型视觉语言模型。该框架在检测方面达到了58.9 mAP,在属性识别方面达到了95.5%的准确率,展示了智能基础设施监控的稳健性能。
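The LoRA-based adaptation mentioned above keeps a pretrained weight matrix frozen and learns a low-rank update. A minimal sketch of the standard LoRA formulation, W' = W + (alpha / r) * B A, is given below in pure Python; the matrix sizes, `alpha`, and the rank are illustrative assumptions, since the abstract does not state the configuration used for Qwen-VL.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update with the same shape as W
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]     # frozen 2x3 weight (hypothetical values)
a = [[1.0, 2.0, 3.0]]     # trainable A, shape r x k = 1 x 3
b = [[0.5],
     [0.0]]               # trainable B, shape d x r = 2 x 1
adapted = lora_weight(w, a, b, alpha=2.0)
```

Only `a` and `b` would be trained, which is 5 numbers here instead of the 6 in `w`; the saving grows quickly for realistic layer sizes.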
Spatial As Deep: Spatial CNN for Traffic Scene Understanding
Authors: Xingang Pan, Xiaohang Zhan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang
Venue: AAAI 2018
First: 2017-12-17T09:37:52+00:00 · Latest: 2026-01-15T16:01:43+00:00
Comments: Accepted to AAAI 2018
Abstract
Convolutional neural networks (CNNs) are usually built by stacking convolutional operations layer-by-layer. Although CNNs have shown strong capability to extract semantics from raw pixels, their capacity to capture spatial relationships of pixels across rows and columns of an image is not fully explored. These relationships are important for learning semantic objects with strong shape priors but weak appearance coherence, such as traffic lanes, which are often occluded or not even painted on the road surface, as shown in Fig. 1 (a). In this paper, we propose Spatial CNN (SCNN), which generalizes traditional deep layer-by-layer convolutions to slice-by-slice convolutions within feature maps, thus enabling message passing between pixels across rows and columns in a layer. SCNN is particularly suitable for long continuous structures or large objects with strong spatial relationships but few appearance cues, such as traffic lanes, poles, and walls. We apply SCNN to a newly released, very challenging traffic lane detection dataset and to the Cityscapes dataset. The results show that SCNN can learn the spatial relationships for structured output and significantly improves performance. We show that SCNN outperforms the recurrent neural network (RNN) based ReNet and MRF+CNN (MRFNet) on the lane detection dataset by 8.7% and 4.6% respectively. Moreover, our SCNN won 1st place in the TuSimple Benchmark Lane Detection Challenge, with an accuracy of 96.53%.
中文标题/摘要
标题:空间卷积神经网络:用于交通场景理解的空间CNN
卷积神经网络(CNN)通常通过逐层堆叠卷积操作来构建。尽管CNN在从原始像素中提取语义方面表现出强大的能力,但其捕捉图像行和列中像素的空间关系的能力尚未得到充分探索。这些关系对于学习具有强烈形状先验但外观一致性较弱的语义对象(如交通车道)非常重要,如图1(a)所示。在本文中,我们提出了一种空间CNN(SCNN),它将传统的逐层卷积推广到特征图内的切片间卷积,从而在层内使行和列中的像素之间能够进行消息传递。这种SCNN特别适用于具有强烈空间关系但外观线索较少的长连续形状结构或大型对象,如交通车道、杆和墙。我们在一个新发布的非常具有挑战性的交通车道检测数据集和Cityscapes数据集上应用了SCNN。结果表明,SCNN能够学习结构输出的空间关系,并显著提高了性能。我们展示了SCNN在车道检测数据集上分别比基于递归神经网络(RNN)的ReNet和MRF+CNN(MRFNet)高出8.7%和4.6%。此外,我们的SCNN在TuSimple基准车道检测挑战中获得了第一名,准确率为96.53%。
Summary / 总结
This paper introduces Spatial CNN (SCNN), which extends traditional layer-by-layer convolutions to slice-by-slice convolutions within feature maps, so that messages pass between pixels across rows and columns of the same layer. SCNN is evaluated on a traffic lane detection dataset and on Cityscapes. On the lane detection dataset it outperforms the RNN-based ReNet and the MRF+CNN model (MRFNet) by 8.7% and 4.6% respectively, and it reaches 96.53% accuracy in the TuSimple Benchmark Lane Detection Challenge, where it took 1st place.
本文提出了一种Spatial CNN (SCNN),它将传统的逐层卷积扩展为在特征图内进行逐片卷积,允许像素在行和列之间进行消息传递。SCNN特别适用于学习交通车道等长连续形状结构的空间关系。实验结果表明,SCNN在挑战性的交通车道检测数据集和Cityscapes上的表现优于基于RNN的ReNet和MRFNet,分别高出8.7%和4.6%,并且在TuSimple基准车道检测挑战中取得了96.53%的最高准确率。
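The slice-by-slice message passing described above can be sketched for the downward direction on a single-channel feature map. This is a simplified illustration under stated assumptions: the paper operates on multi-channel tensors with learned kernels and several directions, whereas the helper names, the one-channel layout, and the fixed kernel below are hypothetical.

```python
def conv1d_same(row, kernel):
    """1D convolution along the width with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, wk in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(row):
                acc += wk * row[j]
        out.append(acc)
    return out

def scnn_downward(feature, kernel):
    """Pass messages from each row (slice) to the row below it.

    Each updated row is convolved, passed through ReLU, and added to the
    next row, so information flows sequentially from top to bottom.
    """
    out = [list(feature[0])]
    for row in feature[1:]:
        message = conv1d_same(out[-1], kernel)
        out.append([x + max(0.0, m) for x, m in zip(row, message)])
    return out

# A single activation in the top row, e.g. the visible start of a lane
feature = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
kernel = [0.0, 1.0, 0.0]  # identity kernel, chosen only for readability
result = scnn_downward(feature, kernel)
```

With the identity kernel the top-row activation is carried into every row beneath it, which mirrors how a long thin structure can be completed through rows where local appearance evidence is missing.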
Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning
Authors: Oscar H. Ramírez-Agudelo, Akshay N. Shewatkar, Edoardo Milana, Roland C. Aydin, Kai Franke
Venue: SPIE Vol. 12675 126750A-12, 2023
First: 2026-01-15T15:59:12+00:00 · Latest: 2026-01-15T15:59:12+00:00
Comments: 17 pages, 10 figures, 6 tables, SPIE Applications of Machine Learning 2023, San Diego, US
Abstract
Images captured in hazy and smoky environments suffer from reduced visibility, which poses a challenge when monitoring infrastructure and hinders emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted by light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset containing over 14,000 images was generated using the Unreal Engine. For both the haze and the smoke dataset, the models were trained with an 80% train, 10% validation, and 10% test split. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results on the synthetic smoke dataset are poorer, the trained models still achieve interesting results. In general, images captured in the presence of smoke are more difficult to enhance given the inhomogeneity and high density of smoke. Secondly, FFA-Net and AECR-Net are designed to dehaze, not to desmoke, images. This work shows that the use of deep learning architectures can improve the quality of analog gauge images captured in smoke and haze scenes immensely. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
中文标题/摘要
标题:通过深度学习提高烟雾和雾霾场景中仪表图像的质量
在雾霾和烟雾环境中拍摄的图像能见度降低,这给基础设施监测带来了挑战,并在紧急情况下妨碍应急服务。本研究探讨了使用深度学习模型提高烟雾环境中仪表图像的自动机器可读性,准确的仪表数据解读对于应急响应人员来说是一个有价值的工具。研究使用了FFA-Net和AECR-Net两种深度学习架构,以提高受轻度至浓密雾霾和烟雾影响的仪表图像的可见度。由于缺乏模拟仪表图像的基准数据集,使用Unreal Engine生成了一个新的合成数据集,包含超过14,000张图像。对于雾霾和烟雾数据集,模型均按80%训练集、10%验证集和10%测试集的划分进行训练。对于合成雾霾数据集,SSIM和PSNR指标分别约为0.98和43 dB,与最先进的结果相当。此外,AECR-Net比FFA-Net获得了更稳健的结果。虽然合成烟雾数据集的结果较差,但训练后的模型仍取得了有趣的结果。总体而言,由于烟雾的不均匀性和高密度,烟雾中拍摄的图像更难增强。其次,FFA-Net和AECR-Net是为去雾霾而非去烟雾而设计的。本研究表明,使用深度学习架构可以极大地提高烟雾和雾霾场景中模拟仪表图像的质量。最后,增强后的输出图像可以成功地进行后处理,以实现仪表的自动读取。
Summary / 总结
The research aims to improve the machine readability of gauge images in hazy and smoky environments using deep learning models. Two architectures, FFA-Net and AECR-Net, were trained on a new synthetic dataset of over 14,000 images generated with Unreal Engine, using an 80/10/10 train/validation/test split. On the synthetic haze dataset, SSIM and PSNR reach about 0.98 and 43 dB, with AECR-Net giving more robust results than FFA-Net. Results on the synthetic smoke dataset are poorer, partly because both networks were designed for dehazing rather than desmoking, but the enhanced images can still be post-processed for automatic gauge reading.
研究旨在通过深度学习模型提高烟雾和雾霾环境中模拟表盘图像的可读性。使用Unreal Engine生成了超过14,000张图像的新合成数据集,训练了FFA-Net和AECR-Net两种架构。模型在SSIM和PSNR指标上表现良好,AECR-Net显示出更稳健的结果。尽管合成烟雾数据集的结果较差,但模型仍然显著提高了表盘图像的质量,使其更容易被应急服务和基础设施监控自动读取。
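The PSNR value quoted above follows the standard definition, PSNR = 10 log10(MAX^2 / MSE). A minimal sketch is shown below, assuming 8-bit pixel intensities flattened into a list; the function name and the sample pixel values are illustrative and do not come from the paper's dataset.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical ground-truth and dehazed pixel values
clean = [100, 150, 200, 250]
dehazed = [101, 149, 201, 249]
value = psnr(clean, dehazed)
```

Every pixel in this toy pair is off by one intensity level, so the mean squared error is 1 and the result depends only on the chosen `max_value`.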
A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning
Authors: Stefano Cerri, Asbjørn Munk, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen
First: 2025-06-17T11:48:05+00:00 · Latest: 2026-01-15T15:58:31+00:00
Abstract
We present FOMO300K, a large-scale, heterogeneous dataset of 318,877 brain Magnetic Resonance Imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO300K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
中文标题/摘要
标题:用于自我监督学习的大规模异构3D磁共振脑成像数据集
我们提出了FOMO300K,这是一个包含318,877个脑磁共振成像(MRI)扫描的数据集,来自82,678个MRI会话和59,969个受试者,汇总自920个公开可用的来源。该数据集包括临床级和研究级图像、多种MRI序列以及广泛的解剖和病理变异性,包括具有大脑异常的扫描。对原始图像特征进行了最少的预处理,以降低新用户的入门门槛。提供了用于自我监督预训练和微调的配套代码以及预训练模型。FOMO300K旨在支持大规模医学影像中自我监督学习方法的开发和基准测试。
Summary / 总结
The work introduces FOMO300K, a large-scale, heterogeneous dataset of 318,877 3D brain MRI scans from 82,678 sessions and 59,969 subjects, aggregated from 920 publicly available sources. It includes clinical- and research-grade images, multiple MRI sequences, and wide anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while lowering the entry barrier for new users. Companion code for self-supervised pretraining and finetuning is provided together with pretrained models, so the dataset can support the development and benchmarking of self-supervised learning methods in medical imaging at scale.
研究介绍了FOMO300K,这是一个包含318,877个3D MRI脑影像扫描的数据集,来自82,678个会话和59,969个受试者,数据来源于920个公开来源。该数据集包括临床和研究级别的图像,具有多种MRI序列和广泛的解剖及病理变化。对原始图像特征进行了最小预处理以保持其原始特性。研究旨在促进大规模医学影像中自监督学习方法的发展和评估。主要发现包括成功创建了一个全面的数据集,可用于自监督预训练和模型微调,支持大规模医学影像研究。