arXiv Paper Digest

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Authors: Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang

First: 2026-01-14T18:59:59+00:00 · Latest: 2026-01-14T18:59:59+00:00

Comments: Project page: https://jasper0314-huang.github.io/fast-thinkact/

Abstract

Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.

Summary

Fast-ThinkAct is an efficient framework for Vision-Language-Action tasks that replaces lengthy explicit chain-of-thought with compact, verbalizable latent reasoning. It learns this latent reasoning by distilling from a teacher under a preference-guided objective that aligns manipulation trajectories, transferring both linguistic and visual planning capabilities to embodied control. Experiments show inference latency reduced by up to 89.3% relative to state-of-the-art reasoning VLAs while retaining long-horizon planning, few-shot adaptation, and failure recovery.
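
To put the reported latency figure in more familiar terms, the short sketch below converts a fractional latency reduction into a speedup factor. It is only arithmetic on the 89.3% number quoted in the abstract; the function name is a hypothetical helper, and the result is an upper bound because the abstract says "up to".

```python
def speedup_from_latency_reduction(reduction: float) -> float:
    """Convert a fractional latency reduction (0 <= r < 1) into a speedup factor.

    If latency drops by a fraction r, the new latency is (1 - r) times the old
    one, so the speedup is 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# 89.3% is the best-case reduction reported in the abstract.
speedup = speedup_from_latency_reduction(0.893)
print(f"{speedup:.2f}x")  # prints "9.35x"
```

In other words, an 89.3% latency reduction corresponds to inference that is roughly nine times faster in the best reported case.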

Value-Aware Numerical Representations for Transformer Language Models

Authors: Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu

First: 2026-01-14T18:59:14+00:00 · Latest: 2026-01-14T18:59:14+00:00

Abstract

Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model's input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.

Summary

Transformer-based language models score well on mathematical reasoning benchmarks yet remain fragile on basic numerical understanding and arithmetic, because their token embeddings do not explicitly encode numerical value. The study proposes a value-aware numerical representation that adds to the standard tokenized input a prefix token whose embedding is conditioned on the underlying numerical value. On arithmetic tasks, this approach outperforms baselines across numerical formats, tasks, and operand lengths, suggesting that explicitly encoding numerical value improves the basic numerical robustness of language models.
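
The idea of a value-conditioned prefix token can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's design: the `<num>` token name, the 8-dimensional embedding, and the choice of sign plus log-magnitude plus sinusoidal features are all hypothetical. The sketch only shows the general shape of the mechanism: each numeric token is preceded by an extra token whose vector is computed from the number's value rather than looked up by token id.

```python
import math
from typing import List, Optional, Tuple

EMBED_DIM = 8  # hypothetical embedding size, chosen only for the example


def value_embedding(value: float, dim: int = EMBED_DIM) -> List[float]:
    """Map a number to a fixed-size vector that carries its sign and magnitude.

    Illustrative feature choice (an assumption): sign, log10 magnitude, then
    sine/cosine features of the log magnitude at increasing frequencies.
    """
    sign = 1.0 if value >= 0 else -1.0
    log_mag = math.log10(abs(value) + 1.0)
    feats = [sign, log_mag]
    k = 1
    while len(feats) < dim:
        feats.append(math.sin(log_mag * k))
        if len(feats) < dim:
            feats.append(math.cos(log_mag * k))
        k += 1
    return feats


def add_value_prefix(tokens: List[str]) -> List[Tuple[str, Optional[List[float]]]]:
    """Insert a hypothetical <num> prefix token before every numeric token.

    Ordinary tokens carry None here, standing in for a normal embedding lookup.
    """
    out: List[Tuple[str, Optional[List[float]]]] = []
    for tok in tokens:
        try:
            value = float(tok)
        except ValueError:
            out.append((tok, None))  # not a number: leave the token unchanged
            continue
        out.append(("<num>", value_embedding(value)))
        out.append((tok, None))
    return out


augmented = add_value_prefix(["12", "+", "3400"])
print([tok for tok, _ in augmented])
```

The original tokens are left untouched, which is what keeps such a scheme compatible with an existing tokenizer: the magnitude information travels only in the extra prefix vector.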

SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3

Authors: Ruiqi Shen, Chang Liu, Henghui Ding

First: 2026-01-14T18:52:14+00:00 · Latest: 2026-01-14T18:52:14+00:00

Comments: Code: https://github.com/FudanCVL/SAM3-DMS

Abstract

Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.

Summary

The research targets multi-target video segmentation in complex scenes, where SAM3's group-level memory selection makes one synchronized decision for all targets based on their average performance. SAM3-DMS is a training-free, decoupled strategy that selects memory for each object individually, improving identity preservation and tracking stability. Its advantage over the original implementation grows as target density increases.
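
The contrast between a group-level decision and a decoupled, per-object decision can be sketched in a few lines. This is a toy model under stated assumptions, not SAM3's or SAM3-DMS's actual code: the per-object confidence scores, the 0.6 threshold, and both function names are hypothetical, and "selection" is reduced to a yes/no choice about whether a frame is written to an object's memory.

```python
from typing import Dict


def group_level_selection(scores: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """One synchronized decision for all objects, based on their average score."""
    mean_score = sum(scores.values()) / len(scores)
    decision = mean_score >= threshold
    return {obj: decision for obj in scores}


def decoupled_selection(scores: Dict[str, float], threshold: float) -> Dict[str, bool]:
    """An independent decision per object, based on that object's own score."""
    return {obj: score >= threshold for obj, score in scores.items()}


# Hypothetical per-object confidences for a single frame.
frame_scores = {"person_1": 0.95, "person_2": 0.90, "dog": 0.20}

group = group_level_selection(frame_scores, threshold=0.6)
decoupled = decoupled_selection(frame_scores, threshold=0.6)

print(group)      # the unreliable "dog" frame is kept because the average is high
print(decoupled)  # only the two reliable objects keep this frame
```

In this toy frame the two confident objects pull the average above the threshold, so the group-level rule also stores the low-confidence object, which is the kind of overlooked individual reliability the abstract describes.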

Causality-enhanced Decision-Making for Autonomous Mobile Robots in Dynamic Environments

Authors: Luca Castri, Gloria Beraldo, Nicola Bellotto

First: 2025-04-16T09:26:04+00:00 · Latest: 2026-01-14T18:52:06+00:00

Comments: Causal Discovery and Inference - Robot Autonomy - Human-Robot Spatial Interaction - Decision-Making

Abstract

The growing integration of robots in shared environments - such as warehouses, shopping centres, and hospitals - demands a deep understanding of the underlying dynamics and human behaviours, including how, when, and where individuals engage in various activities and interactions. This knowledge goes beyond simple correlation studies and requires a more comprehensive causal analysis. By leveraging causal inference to model cause-and-effect relationships, we can better anticipate critical environmental factors and enable autonomous robots to plan and execute tasks more effectively. To this end, we propose a novel causality-based decision-making framework that reasons over a learned causal model to assist the robot in deciding when and how to complete a given task. In the examined use case - i.e., a warehouse shared with people - we exploit the causal model to estimate battery usage and human obstructions as factors influencing the robot's task execution. This reasoning framework supports the robot in making informed decisions about task timing and strategy. To achieve this, we also developed PeopleFlow, a new Gazebo-based simulator designed to model context-sensitive human-robot spatial interactions in shared workspaces. PeopleFlow features realistic human and robot trajectories influenced by contextual factors such as time, environment layout, and robot state, and can simulate a large number of agents. While the simulator is general-purpose, in this paper we focus on a warehouse-like environment as a case study, where we conduct an extensive evaluation benchmarking our causal approach against a non-causal baseline. Our findings demonstrate the efficacy of the proposed solutions, highlighting how causal reasoning enables autonomous robots to operate more efficiently and safely in dynamic environments shared with humans.

Summary

This paper uses causal inference to help autonomous mobile robots understand and anticipate human behaviour and environment dynamics in shared spaces. The authors propose a causality-based decision-making framework that reasons over a learned causal model to help the robot decide when and how to carry out a task. In a warehouse case study built on PeopleFlow, a new Gazebo-based simulator, the causal model estimates battery usage and human obstructions, and the causal approach outperforms a non-causal baseline in efficiency and safety.

COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

Authors: Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

First: 2026-01-14T18:50:17+00:00 · Latest: 2026-01-14T18:50:17+00:00

Abstract

3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.

Summary

COMPOSE addresses multi-view 3D human pose estimation by formulating cross-view correspondence matching as a hypergraph partitioning problem instead of relying on pairwise associations. This formulation better preserves global consistency and is more robust to spurious associations, and a geometric pruning strategy substantially reduces the search space of the resulting integer linear program. The method improves average precision by up to 23% over previous optimization-based methods and by up to 11% over self-supervised end-to-end learned methods.

Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering

Authors: Jieying Chen, Jeffrey Hu, Joan Lasenby, Ayush Tewari

First: 2026-01-14T18:50:06+00:00 · Latest: 2026-01-14T18:50:06+00:00

Comments: Project page: https://ayushtewari.com/projects/srender/

Abstract

Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.

Summary

This paper addresses the computational inefficiency of diffusion-based video generation, which often needs minutes of GPU time for a few seconds of video. SRENDER uses a diffusion model to generate a sparse set of keyframes and then synthesizes the full video through 3D reconstruction and rendering. For 20 seconds of video it is more than 40 times faster than the diffusion-based baseline while maintaining visual fidelity and temporal stability.

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Authors: Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

First: 2026-01-14T18:45:08+00:00 · Latest: 2026-01-14T18:45:08+00:00

Comments: ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025

Abstract

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

Summary

This work introduces ConGLUDe, a single contrastive geometric model that unifies structure- and ligand-based training for drug design. By aligning ligands with both global protein representations and multiple candidate binding sites, ConGLUDe supports ligand-conditioned pocket prediction, virtual screening, and target fishing. Across diverse benchmarks it achieves state-of-the-art zero-shot virtual screening performance when no binding pocket is given as input and substantially outperforms existing methods on a challenging target fishing task, demonstrating the benefits of unified structure-ligand training.

Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection

Authors: Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal

First: 2026-01-14T18:43:32+00:00 · Latest: 2026-01-14T18:43:32+00:00

Comments: Code: https://github.com/tianyiniu/RoutingGenData

Abstract

Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.

Summary

The paper addresses training LLM routers without ground-truth labeled data, which is often unavailable in practice. It introduces Routing with Generated Data (RGD), a setting in which routers are trained only on queries and answers that generator LLMs produce from high-level task descriptions. Evaluating query-answer and query-only routers across four benchmarks and 12 models, the study finds that query-only routers degrade more slowly as generator quality drops, and it identifies two characteristics of effective generators. The authors propose CASCAL, a query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering; trained on weak generator data, it outperforms the best query-answer router by 4.6% absolute accuracy.
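
Estimating model correctness from consensus rather than from labels can be sketched as follows. This is a simplified illustration, not CASCAL itself: it assumes the majority answer among the model pool serves as a pseudo-label, it ignores the hierarchical clustering step entirely, and the model names and answers are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def consensus_correctness(answers: Dict[str, str]) -> Dict[str, bool]:
    """Mark each model as 'correct' if it agrees with the majority answer.

    Ties are broken in favour of the answer seen first, which is how
    Counter.most_common orders equal counts.
    """
    majority_answer, _ = Counter(answers.values()).most_common(1)[0]
    return {model: ans == majority_answer for model, ans in answers.items()}


def estimated_accuracy(per_query_answers: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of queries on which each model agrees with the consensus."""
    agreed: Counter = Counter()
    for answers in per_query_answers:
        for model, ok in consensus_correctness(answers).items():
            agreed[model] += int(ok)
    n_queries = len(per_query_answers)
    return {model: agreed[model] / n_queries for model in per_query_answers[0]}


# Hypothetical answers from three models on three unlabeled queries.
queries = [
    {"model_a": "4", "model_b": "4", "model_c": "5"},
    {"model_a": "x", "model_b": "y", "model_c": "y"},
    {"model_a": "7", "model_b": "7", "model_c": "7"},
]

accuracy = estimated_accuracy(queries)
print(accuracy)
```

No generated answer from the generator LLM is consulted here, only the queries and the pool's own responses, which is the sense in which such a router is "query-only".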

Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection

Authors: Ziyu Yang, Guibin Chen, Yuxin Yang, Aoxiong Zeng, Xiangquan Yang

First: 2026-01-14T18:36:22+00:00 · Latest: 2026-01-14T18:36:22+00:00

Comments: preprint

Abstract

Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape's capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.

Summary

The paper addresses task conflicts in multi-task learning with Low-Rank Adaptation (LoRA) by proposing Ortho-LoRA, a gradient projection method. To mitigate negative transfer, it projects conflicting task gradients onto the orthogonal complement of each other within the LoRA subspace. On the GLUE benchmark, Ortho-LoRA outperforms standard joint training and recovers 95% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
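
Projecting one gradient onto the orthogonal complement of another is a short computation, sketched below on plain Python lists. This is a generic conflict-projection step in the style of PCGrad, offered as an assumption about what such a projection looks like; it is not the Ortho-LoRA implementation, and it does not model how the method treats the two LoRA factors. The gradients are toy two-dimensional vectors.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g_i: Sequence[float], g_j: Sequence[float]) -> List[float]:
    """Remove from g_i its component along g_j, but only if the two conflict.

    Two gradients conflict when their inner product is negative. In that case
    g_i is replaced by its projection onto the orthogonal complement of g_j:
        g_i - (<g_i, g_j> / ||g_j||^2) * g_j
    """
    inner = dot(g_i, g_j)
    norm_sq = dot(g_j, g_j)
    if inner >= 0.0 or norm_sq == 0.0:
        return list(g_i)  # no conflict (or zero gradient): leave g_i unchanged
    scale = inner / norm_sq
    return [a - scale * b for a, b in zip(g_i, g_j)]


# Toy gradients for two tasks that pull in partly opposite directions.
g_task1 = [1.0, 0.0]
g_task2 = [-1.0, 1.0]

p1 = project_out_conflict(g_task1, g_task2)  # [0.5, 0.5]
p2 = project_out_conflict(g_task2, g_task1)  # [0.0, 1.0]
combined = [a + b for a, b in zip(p1, p2)]   # [0.5, 1.5]
print(p1, p2, combined)
```

After projection, each task's gradient has zero inner product with the other task's original gradient, so summing them no longer cancels progress along the conflicting direction.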

Automating Supply Chain Disruption Monitoring via an Agentic AI Approach

Authors: Sara AlMahri, Liming Xu, Alexandra Brintrup

First: 2026-01-14T18:28:31+00:00 · Latest: 2026-01-14T18:28:31+00:00

Abstract

Modern supply chains are increasingly exposed to disruptions ranging from geopolitical events, demand shocks, and trade restrictions to natural disasters. While many of these disruptions originate deep in the supply network, most companies still lack visibility beyond Tier-1 suppliers, leaving upstream vulnerabilities undetected until the impact cascades downstream. To overcome this blind spot and move from reactive recovery to proactive resilience, we introduce a minimally supervised agentic AI framework that autonomously monitors, analyses, and responds to disruptions across extended supply networks. The architecture comprises seven specialised agents powered by large language models and deterministic tools that jointly detect disruption signals from unstructured news, map them to multi-tier supplier networks, evaluate exposure based on network structure, and recommend mitigations such as alternative sourcing options. We evaluate the framework across 30 synthesised scenarios covering three automotive manufacturers and five disruption classes. The system achieves high accuracy across core tasks, with F1 scores between 0.962 and 0.991, and performs full end-to-end analyses in a mean of 3.83 minutes at a cost of $0.0836 per disruption. Relative to industry benchmarks of multi-day, analyst-driven assessments, this represents a reduction of more than three orders of magnitude in response time. A real-world case study of the 2022 Russia-Ukraine conflict further demonstrates operational applicability. This work establishes a foundational step toward building resilient, proactive, and autonomous supply chains capable of managing disruptions across deep-tier networks.

Summary

The paper introduces a minimally supervised agentic AI framework that autonomously monitors and responds to supply chain disruptions. Seven specialised agents detect disruption signals in unstructured news, map them to multi-tier supplier networks, evaluate exposure, and recommend mitigations. Across 30 synthesised scenarios, the system reaches F1 scores between 0.962 and 0.991 and completes an end-to-end analysis in a mean of 3.83 minutes at a cost of $0.0836 per disruption, a reduction of more than three orders of magnitude in response time relative to multi-day, analyst-driven industry benchmarks.
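
The "three orders of magnitude" claim can be checked with simple arithmetic. The abstract gives the mean runtime (3.83 minutes) but only describes the manual baseline as "multi-day", so the baseline durations below (1, 3, and 5 calendar days of 24 hours each) are assumptions chosen purely to show how the ratio behaves; they are not figures from the paper.

```python
import math

MEAN_RUNTIME_MINUTES = 3.83  # mean end-to-end analysis time quoted in the abstract


def speedup_vs_manual(manual_days: float) -> float:
    """Ratio of an assumed manual assessment time to the reported mean runtime.

    Assumes calendar days of 24 hours; the manual duration is hypothetical.
    """
    manual_minutes = manual_days * 24 * 60
    return manual_minutes / MEAN_RUNTIME_MINUTES


for days in (1, 3, 5):  # hypothetical manual assessment durations
    ratio = speedup_vs_manual(days)
    print(f"{days} day(s): {ratio:.0f}x faster, {math.log10(ratio):.2f} orders of magnitude")
```

Under these assumptions, a manual assessment of roughly 2.7 calendar days or longer corresponds to a ratio above 1000, which is consistent with the abstract's "more than three orders of magnitude" for multi-day baselines.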

OptiMind: Teaching LLMs to Think Like Optimization Experts

Authors: Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, Sirui Li

First: 2025-09-26T22:23:12+00:00 · Latest: 2026-01-14T18:26:45+00:00

Abstract

Mathematical programming -- the task of expressing operations and decision-making problems in precise mathematical language -- is fundamental across domains, yet remains a skill-intensive process requiring operations research expertise. Recent advances in large language models for complex reasoning have spurred interest in automating this task, translating natural language into executable optimization models. Current approaches, however, achieve limited accuracy, hindered by scarce and noisy training data without leveraging domain knowledge. In this work, we systematically integrate optimization expertise to improve formulation accuracy for mixed-integer linear programming, a key family of mathematical programs. Our OptiMind framework leverages semi-automated, class-based error analysis to guide both training and inference, explicitly preventing common mistakes within each optimization class. Our resulting fine-tuned LLM significantly improves formulation accuracy by 20.7% across multiple optimization benchmarks, with consistent gains under test-time scaling methods such as self-consistency and multi-turn feedback, enabling further progress toward robust LLM-assisted optimization formulation.

Summary

The research aims to improve how accurately large language models (LLMs) translate natural language into optimization models by integrating optimization expertise. The OptiMind framework uses semi-automated, class-based error analysis to guide both training and inference, preventing common mistakes within each optimization class. The fine-tuned LLM improves formulation accuracy by 20.7% across multiple benchmarks, with consistent gains under test-time scaling methods such as self-consistency and multi-turn feedback.

Deep Hybrid Model for Region of Interest Detection in Omnidirectional Videos

Authors: Sana Alamgeer, Mylene Farias, Marcelo Carvalho

First: 2025-11-24T07:52:06+00:00 · Latest: 2026-01-14T18:25:35+00:00

Comments: I need to withdraw this as it contains some confidential information related to FAPESP funding agency

Abstract

The main goal of the project is to design a new model that predicts regions of interest in 360° videos. The region of interest (ROI) plays an important role in 360° video streaming. For example, ROIs are used to predict view-ports and to cut videos intelligently for live streaming, among other uses, so that less bandwidth is used. Detecting view-ports in advance helps reduce head movement while streaming and watching a video on a head-mounted device, whereas intelligent cuts improve the efficiency of streaming the video to users and enhance the quality of their viewing experience. This report covers the secondary task of identifying ROIs, for which we design, train, and test a hybrid saliency model. In this work, we use saliency regions to represent the regions of interest. The method consists of the following steps: preprocessing the video to obtain frames, developing a hybrid saliency model for predicting the region of interest, and finally post-processing the output predictions of the hybrid saliency model to obtain the output region of interest for each frame. Then, we compare the performance of the proposed method with the subjective annotations of the 360RAT dataset.

Summary / 总结

The project aims to design a new model for detecting regions of interest (ROIs) in 360-degree videos to improve streaming efficiency and the viewing experience. The method preprocesses the video to obtain frames, applies a hybrid saliency model to predict ROIs, and post-processes the predictions to obtain an ROI for each frame. The proposed method is evaluated against the subjective annotations of the 360RAT dataset.

研究旨在设计一种模型来检测360度视频中的感兴趣区域（ROIs），以优化流媒体效率和用户体验。方法包括预处理视频以获取帧，开发混合显著性模型来预测ROIs，并后处理预测结果。所提出的方法与360RAT数据集的主观注释进行了比较。

VIGIL: Defending LLM Agents Against Tool Stream Injection via Verify-Before-Commit

Authors: Junda Lin, Zhaomeng Zhou, Zhi Zheng, Shuochen Liu, Tong Xu, Yong Chen, Enhong Chen

First: 2026-01-09T12:19:49+00:00 · Latest: 2026-01-14T18:19:43+00:00

Abstract

LLM agents operating in open environments face escalating risks from indirect prompt injection, particularly within the tool stream where manipulated metadata and runtime feedback hijack execution flow. Existing defenses encounter a critical dilemma as advanced models prioritize injected rules due to strict alignment while static protection mechanisms sever the feedback loop required for adaptive reasoning. To reconcile this conflict, we propose VIGIL, a framework that shifts the paradigm from restrictive isolation to a verify-before-commit protocol. By facilitating speculative hypothesis generation and enforcing safety through intent-grounded verification, VIGIL preserves reasoning flexibility while ensuring robust control. We further introduce SIREN, a benchmark comprising 959 tool stream injection cases designed to simulate pervasive threats characterized by dynamic dependencies. Extensive experiments demonstrate that VIGIL outperforms state-of-the-art dynamic defenses by reducing the attack success rate by over 22% while more than doubling the utility under attack compared to static baselines, thereby achieving an optimal balance between security and utility.
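
The abstract does not describe how VIGIL is implemented. The following toy sketch illustrates only the general verify-before-commit idea: proposed tool calls are treated as hypotheses, checked against the user's stated intent, and executed only if they pass. The `ToolCall` type, the allow-list check, and all tool names are hypothetical stand-ins; the paper's intent-grounded verification is presumably much richer than a set lookup.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class ToolCall:
    """A proposed (not yet executed) tool invocation. Hypothetical type."""
    tool: str
    argument: str


def verify_before_commit(
    proposals: List[ToolCall],
    intended_tools: Set[str],
    execute: Callable[[ToolCall], str],
) -> List[str]:
    """Execute only the proposals that match the user's intent."""
    results = []
    for call in proposals:
        # Verification happens before any side effect is committed.
        if call.tool in intended_tools:
            results.append(execute(call))
    return results


def run_tool(call: ToolCall) -> str:
    # Stand-in executor that only records what would have run.
    return f"ran {call.tool}({call.argument})"


# The user asked to read a calendar; an injected tool response asks for an email.
proposals = [
    ToolCall("read_calendar", "today"),
    ToolCall("send_email", "attacker@example.com"),
]
committed = verify_before_commit(proposals, {"read_calendar"}, run_tool)
print(committed)  # ['ran read_calendar(today)']
```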

中文标题/摘要

标题：VIGIL：通过验证后再提交防御LLM代理工具流注入

在开放环境中的LLM代理面临来自间接提示注入的不断升级风险，尤其是在工具流中，篡改的元数据和运行时反馈劫持执行流程。现有防御措施遇到一个关键困境，即高级模型因严格对齐而优先处理注入规则，而静态保护机制则切断了适应性推理所需的反馈循环。为解决这一冲突，我们提出了一种名为VIGIL的框架，该框架从限制性隔离转向验证后再提交协议。通过促进推测性假设生成并通过意图导向验证确保安全性，VIGIL在保持推理灵活性的同时确保了稳健的控制。我们还引入了包含959个工具流注入案例的SIREN基准，旨在模拟由动态依赖性特征定义的普遍威胁。大量实验表明，VIGIL优于最先进的动态防御措施，将攻击成功率降低了22%以上，同时在攻击下的实用性比静态基线高出一倍以上，从而实现了安全性和实用性的最佳平衡。

Summary / 总结

VIGIL is a framework that protects LLM agents from tool stream injection attacks through a verify-before-commit protocol. It combines speculative hypothesis generation with intent-grounded verification to preserve reasoning flexibility while ensuring robust control. The authors also introduce SIREN, a benchmark of 959 tool stream injection cases. In experiments, VIGIL reduces the attack success rate by over 22% relative to state-of-the-art dynamic defenses and more than doubles utility under attack compared to static baselines.

VIGIL 是一个框架，通过 verify-before-commit（先验证后提交）协议来保护 LLM 代理免受工具流注入攻击。它结合推测性假设生成和基于意图的验证，在保持推理灵活性的同时确保稳健的控制。作者还引入了包含959个工具流注入案例的SIREN基准。实验表明，与最先进的动态防御措施相比，VIGIL 将攻击成功率降低了22%以上；与静态基线相比，其在攻击下的实用性提高了一倍以上。

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

Authors: Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo

Venue: NeurIPS 2025 Spotlight

First: 2025-09-24T23:34:01+00:00 · Latest: 2026-01-14T18:11:20+00:00

Comments: NeurIPS 2025 Spotlight

Abstract

Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy's decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
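
As a rough illustration of selecting a skill by trajectory similarity, the sketch below stores example states per skill and, at query time, picks the skill whose mean state is closest to that of the subtask. Comparing means by Euclidean distance is an assumed stand-in for the paper's trajectory distribution similarity, and `select_skill`, `mean_vector`, and the skill names are hypothetical.

```python
import math
from typing import Dict, List

Trajectory = List[List[float]]  # a list of state vectors


def mean_vector(points: Trajectory) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def select_skill(subtask_states: Trajectory, skill_library: Dict[str, Trajectory]) -> str:
    """Pick the skill whose mean state is nearest to the subtask's mean state.
    Nothing is fitted in advance; all comparison happens at query time."""
    query = mean_vector(subtask_states)
    return min(
        skill_library,
        key=lambda name: math.dist(query, mean_vector(skill_library[name])),
    )


library = {
    "reach": [[0.0, 0.0], [1.0, 0.0]],
    "grasp": [[5.0, 4.0], [6.0, 5.0]],
}
subtask = [[5.0, 5.0], [6.0, 6.0]]
before = select_skill(subtask, library)

# A newly learned skill is added; the caller's code does not change.
library["grasp_v2"] = [[5.0, 5.0], [6.0, 5.5]]
after = select_skill(subtask, library)
print(before, after)  # grasp grasp_v2
```

Because matching is deferred to query time, adding a skill to the library changes which skill a subtask resolves to without retraining or modifying the caller, which is the kind of compatibility the abstract describes.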

中文标题/摘要

标题：通过懒学习接口实现策略兼容的技能增量学习

技能增量学习（SIL）是指具身智能体利用与环境交互获得的经验或通过整合额外数据，随时间扩展和细化其技能集的过程。SIL 促进了基于可重用技能的分层策略的高效获取，以用于下游任务。然而，随着技能库的演化，它可能会破坏与现有基于技能的策略的兼容性，限制它们的可重用性和泛化能力。在本文中，我们提出了一种新的框架 SIL-C，该框架确保技能-策略兼容性，使得增量学习技能的改进能够提升下游策略的性能，而无需重新训练策略或进行结构适配。SIL-C 使用基于双边懒学习的映射技术，动态对齐策略所引用的子任务空间与解码为智能体行为的技能空间。这使得策略在分解复杂任务时得到的每个子任务，都能根据轨迹分布相似性选择合适的技能来执行。我们在多种SIL场景中评估了SIL-C，并证明它在技能不断演化的同时保持了与下游策略的兼容性，并确保了整个学习过程的效率。

Summary / 总结

The research addresses the challenge of maintaining compatibility between evolving skills and existing policies in Skill Incremental Learning (SIL). The proposed SIL-C framework uses a lazy learning-based mapping technique to dynamically align subtask spaces with skill spaces, ensuring that improvements in skills enhance downstream policies without retraining. Experiments across various SIL scenarios show that SIL-C maintains compatibility and efficiency in the learning process.

研究解决了技能增量学习（SIL）中不断进化的技能与现有策略之间保持兼容性的挑战。提出了一种名为SIL-C的新框架，利用懒学习映射技术动态对齐子任务空间和技能空间，确保技能改进能够提升下游策略而无需重新训练。实验表明，SIL-C在各种SIL场景中保持了兼容性和效率。

A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering

Authors: Shahana Yasmin Chowdhury, Bithi Banik, Md Tamjidul Hoque, Shreya Banerjee

Venue: HHAI-WS 2025 Workshops at the Fourth International Conference on Hybrid Human-Artificial Intelligence (HHAI), June, 2025, Pisa, Italy

First: 2025-07-09T17:07:45+00:00 · Latest: 2026-01-14T18:07:43+00:00

Comments: 17 pages, 11 figures

Abstract

Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model recognizes seven emotions (neutral, happy, sad, angry, fear, disgust, and surprise) and is trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and CREMA-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on CREMA-D, and a perfect 100% on both TESS and EmoDB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve an overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.

中文标题/摘要

标题：一种用于语音情感识别的新型混合深度学习技术结合特征工程

如今，语音情感识别（SER）在人机交互（HCI）和人工智能（AI）的发展中发挥着重要作用。我们提出的DCRF-BiLSTM模型用于识别七种情感：中性、快乐、悲伤、愤怒、恐惧、厌恶和惊讶，这些情感在RAVDESS（R）、TESS（T）、SAVEE（S）、EmoDB（E）和Crema-D（C）五个数据集上进行训练。该模型在各个数据集上均实现了高精度，包括在RAVDESS上达到97.83%，在SAVEE上达到97.02%，在CREMA-D上达到95.10%，在TESS和EMO-DB上达到完美的100%。对于合并的（R+T+S）数据集，其准确率为98.82%，优于之前报道的结果。据我们所知，没有现有研究在同一时间评估单一SER模型在所有五个基准数据集（即R+T+S+C+E）上的表现。在我们的工作中，我们引入了这种全面的组合，并实现了令人瞩目的总体准确率93.76%。这些结果证实了我们DCRF-BiLSTM框架在不同数据集上的稳健性和通用性。

Summary / 总结

The research aims to improve speech emotion recognition (SER) for human-computer interaction and artificial intelligence. The DCRF-BiLSTM model recognizes seven emotions and is trained on five datasets (RAVDESS, TESS, SAVEE, EmoDB, and CREMA-D). It achieves high accuracy on the individual datasets and 98.82% on the combined R+T+S datasets, outperforming previously reported results. On the combination of all five datasets, which according to the authors no prior study has evaluated with a single model, it reaches an overall accuracy of 93.76%.

研究旨在提高人机交互和人工智能中的语音情感识别（SER）。DCRF-BiLSTM模型用于识别七种情感，并在RAVDESS、TESS、SAVEE、EmoDB和CREMA-D五个数据集上训练。该模型在各单个数据集上的准确率都很高，在合并的(R+T+S)数据集上的准确率达到98.82%，优于此前报道的结果。在五个数据集的组合上（据作者所知，此前尚无研究用单一模型进行过这样的评估），总体准确率达到93.76%。

STEP3-VL-10B Technical Report

Authors: Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge

First: 2026-01-14T17:58:24+00:00 · Latest: 2026-01-14T17:58:24+00:00

Comments: 50 pages

Abstract

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

中文标题/摘要

标题：STEP3-VL-10B 技术报告

我们提出了STEP3-VL-10B，这是一种轻量级开源基础模型，旨在重新定义紧凑效率与前沿多模态智能之间的权衡。STEP3-VL-10B 通过两个战略转变实现：首先，一种统一的、完全解冻的预训练策略，使用1.2万亿多模态令牌，结合语言对齐的感知编码器和Qwen3-8B解码器，建立内在的视觉-语言协同作用；其次，一个扩大的后训练流水线，包含超过1000次强化学习迭代。关键的是，我们实现了并行协调推理（PaCoRe）来扩展测试时的计算能力，将资源分配给可扩展的感知推理，探索和综合多种视觉假设。因此，尽管规模仅为紧凑的10B，STEP3-VL-10B 的性能仍可媲美或超越比其大10-20倍的模型（如GLM-4.6V-106B、Qwen3-VL-235B）以及顶级专有旗舰模型（如Gemini 2.5 Pro和Seed-1.5-VL）。它取得了同类最佳的性能，在MMBench上达到92.2%，在MMMU上达到80.11%，同时在复杂推理方面表现突出，在AIME2025上达到94.43%，在MathVision上达到75.95%。我们发布了完整的模型套件，为社区提供一个强大、高效且可复现的基线。

Summary / 总结

The research develops a compact yet capable multimodal foundation model, STEP3-VL-10B, through a unified pre-training strategy and a scaled post-training pipeline. The model integrates a language-aligned Perception Encoder with a Qwen3-8B decoder and uses Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute. Despite its 10B-parameter size, STEP3-VL-10B rivals or surpasses models 10 to 20 times larger, achieving 92.2% on MMBench, 80.11% on MMMU, 94.43% on AIME2025, and 75.95% on MathVision.

研究旨在通过统一的预训练策略和扩展的后训练流水线，开发一个紧凑但强大的多模态基础模型STEP3-VL-10B。该模型结合了语言对齐的感知编码器和Qwen3-8B解码器，并采用并行协调推理（PaCoRe）来扩展测试时计算。尽管参数量仅为100亿（10B），STEP3-VL-10B 的表现可媲美或超越比其大10-20倍的模型，在MMBench、MMMU、AIME2025和MathVision上分别取得了92.2%、80.11%、94.43%和75.95%的成绩。

Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning

Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park

First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-14T17:57:43+00:00

Comments: Work in Progress

Abstract

Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.

中文标题/摘要

标题：协作多智能体推理的测试时强化学习

多智能体系统已成为许多应用中的实用LLM驱动合作者，通过多样性和交叉验证获得鲁棒性。然而，多智能体强化学习（MARL）训练资源密集且不稳定：队友的共同适应引入了非平稳性，奖励通常稀疏且高方差。因此，我们引入了多智能体测试时强化学习（MATTRL）框架，在推理时向多智能体协商注入结构化文本经验。MATTRL 形成一个由专家组成的多专家团队，进行多轮讨论，检索和整合测试时的经验，并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池，然后将其重新注入对话。在医学、数学和教育领域的具有挑战性的基准测试中，MATTRL 在多智能体基线上的准确率平均提高了3.67%，在可比的单智能体基线上的准确率提高了8.67%。消融研究探讨了不同的信用分配方案，并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径，无需调整即可实现分布转移鲁棒的多智能体推理。

Summary / 总结

The research introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which enhances multi-agent decision-making by integrating structured textual experience during inference. MATTRL forms a multi-expert team for discussions, retrieves relevant experiences, and reaches a consensus. Experiments show MATTRL improves accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines across various domains. Ablation studies explore different credit-assignment methods and their impacts on training outcomes.

研究引入了Multi-Agent Test-Time Reinforcement Learning (MATTRL)，通过在推理过程中整合结构化的文本经验来提升多智能体决策能力。MATTRL 形成一个多专家团队进行讨论，检索相关经验并达成共识。实验结果显示，MATTRL 在不同领域中分别比多智能体基线提高了 3.67%，比单智能体基线提高了 8.67% 的准确性。消融研究探讨了不同信用分配方法及其对训练结果的影响。

SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

Authors: Yuchen Wu, Jiahe Li, Xiaohan Yu, Lina Yu, Jin Zheng, Xiao Bai

First: 2026-01-14T17:57:08+00:00 · Latest: 2026-01-14T17:57:08+00:00

Abstract

Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet suffers from scale drift, i.e., the gradual divergence of estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift due to the lack of global constraints among independent windows. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings, which are learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation that leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment that anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.

中文标题/摘要

标题：SCE-SLAM：基于场景坐标嵌入的尺度一致单目SLAM

单目视觉SLAM能够从互联网视频中进行3D重建并在资源受限平台上实现自主导航，但会遭受尺度漂移的问题，即在长时间序列中估计的尺度逐渐发散。现有的帧到帧方法通过局部优化实现实时性能，但由于缺乏独立窗口之间的全局约束，会累积尺度漂移。为了解决这一问题，我们提出了一种SCE-SLAM端到端SLAM系统，通过场景坐标嵌入保持尺度一致性，这些嵌入是学习得到的局部表示，编码在标准尺度参考下的3D几何关系。该框架包含两个关键模块：几何引导聚合，利用3D空间邻近性通过几何调制注意力传播历史观测中的尺度信息；场景坐标束调整，通过从场景坐标嵌入中解码的显式3D坐标约束将当前估计锚定到参考尺度。在KITTI、Waymo和vKITTI上的实验表明，我们的方法在KITTI上将绝对轨迹误差减少了8.36米，同时保持36 FPS，并在大规模场景中实现了尺度一致性。

Summary / 总结

SCE-SLAM is an end-to-end monocular SLAM system that addresses scale drift by using scene coordinate embeddings to maintain global scale consistency. It includes geometry-guided aggregation for propagating scale information and scene coordinate bundle adjustment for anchoring current estimates to a canonical scale. Experiments show that SCE-SLAM reduces absolute trajectory error by 8.36 meters on KITTI compared to previous methods, while maintaining real-time performance and scale consistency across large scenes.

SCE-SLAM 是一种端到端的单目 SLAM 系统，通过使用场景坐标嵌入来维护尺度一致性以解决尺度漂移问题。该系统包括几何引导聚合模块用于传播尺度信息，以及场景坐标束调整模块用于将当前估计锚定到参考尺度。实验表明，SCE-SLAM 在 KITTI 上将绝对轨迹误差减少了 8.36m，同时保持了实时性能并在大规模场景中实现了尺度一致性。

Self-Supervised Animal Identification for Long Videos

Authors: Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell

First: 2026-01-14T17:53:59+00:00 · Latest: 2026-01-14T17:53:59+00:00

Comments: 11 pages, 1 figure

Abstract

Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.
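
The in-batch assignment step can be illustrated as follows. Given feature vectors for the same N individuals detected in two frames, a one-to-one matching that maximizes total similarity yields pseudo-labels. This sketch searches all permutations, which finds the same optimum the Hungarian algorithm would but only scales to small N. It is a stand-in for illustration, not the authors' code; `match_detections` and the toy features are assumptions.

```python
from itertools import permutations
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_detections(
    frame_a: List[Sequence[float]], frame_b: List[Sequence[float]]
) -> List[int]:
    """Return assignment where assignment[i] is the index in frame_b matched
    to detection i of frame_a, maximizing the summed similarity.
    Brute force over permutations: O(N!), fine only for a handful of animals."""
    n = len(frame_a)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(dot(frame_a[i], frame_b[perm[i]]) for i in range(n)),
    )
    return list(best)


# Toy 2-D features for three detections in each of two sampled frames.
frame_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
frame_b = [[0.1, 0.9], [0.6, 0.8], [0.9, 0.1]]
assignment = match_detections(frame_a, frame_b)
print(assignment)  # [2, 0, 1]
```

Detection i in the first frame and detection `assignment[i]` in the second would then share a pseudo-identity for the loss computed on that batch.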

中文标题/摘要

标题：长视频中自我监督的动物识别

在长时间视频中识别个体动物对于行为生态学、野生动物监测和畜牧管理至关重要。传统方法需要大量手动标注，而现有的自我监督方法计算量大且不适用于长序列，因为存在内存限制和时间误差传播问题。我们提出了一种高效且自我监督的方法，将动物识别重新定义为全局聚类任务，而不是顺序跟踪问题。我们的方法假设视频中个体数量已知且固定——这是实践中常见的场景——仅需边界框检测和总数。通过采样帧对、使用冻结的预训练骨干网络，并利用匈牙利算法进行批量内伪标签分配，我们的方法在没有身份标签的情况下学习判别特征。我们从视觉-语言模型中适应二元交叉熵损失，使准确率达到97%以上，同时每个批次消耗的GPU内存少于1 GB——比标准对比方法少一个数量级。在具有挑战性的现实世界数据集（3D-POP鸽子和8头牛进食视频）上评估，我们的框架与在超过1000帧上进行监督训练的基线相当或更优，有效地消除了手动标注瓶颈。这项工作使在消费级硬件上实现高精度动物识别成为可能，具有在资源受限的研究环境中广泛应用的潜力。本文所有代码可在https://huggingface.co/datasets/tonyFang04/8-calves 获取。

Summary / 总结

The research aims to develop an efficient self-supervised method for identifying individual animals in long-duration videos, addressing the limitations of manual annotation and computational demands of existing methods. The method reframes animal identification as a global clustering task, using bounding box detections and a self-bootstrapping mechanism with the Hungarian algorithm to learn discriminative features without identity labels. It achieves state-of-the-art accuracy over 97% while consuming less than 1 GB of GPU memory per batch, matching or surpassing supervised baselines on real-world datasets. This work enables practical, high-accuracy animal identification on consumer-grade hardware, suitable for resource-constrained research settings.

研究旨在开发一种高效自监督方法，用于识别长视频中的个体动物，解决手动标注和现有方法计算需求高的问题。该方法将动物识别重新定义为全局聚类任务，使用边界框检测和匈牙利算法进行自举标注机制来学习区分性特征，无需身份标签。该方法在GPU内存消耗不到1 GB/批次的情况下实现了超过97%的准确率，远低于标准对比方法。在真实世界数据集上的评估表明，该框架能够匹配或超越基于超过1,000个标注帧的监督基线，展示了在消费级硬件上的实用高精度动物识别能力。

Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

Authors: Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee

Venue: NeurIPS 2025

First: 2025-08-27T22:02:56+00:00 · Latest: 2026-01-14T17:50:26+00:00

Comments: 31 pages, 4 figures, accepted to NeurIPS 2025

Abstract

Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI's latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.

中文标题/摘要

标题：动态对齐的潜在想象在上下文世界模型中的零样本泛化

现实世界的强化学习需要在无需昂贵重训练的情况下适应未见过的环境条件。上下文马尔可夫决策过程（cMDP）对这一挑战进行了建模，但现有方法通常需要显式的上下文变量（例如，摩擦力，重力），当上下文是潜在的或难以测量时，限制了它们的使用。我们引入了动态对齐的潜在想象（DALI），这是一种集成在Dreamer架构中的框架，能够从智能体-环境交互中推断潜在的上下文表示。通过训练一个自监督编码器来预测前向动力学，DALI 生成可操作的表示，用于条件化世界模型和策略，连接感知和控制。我们理论上证明了此编码器对于高效上下文推断和稳健泛化是必不可少的。DALI 的潜在空间具有反事实一致性：扰动编码重力的维度，会以物理上合理的方式改变想象的推演。在具有挑战性的cMDP基准测试中，DALI 相比上下文无关基线取得了显著的改进，并且在外推任务中常常超越上下文感知基线，实现了对未见过的上下文变化的零样本泛化。

Summary / 总结

The paper addresses the challenge of zero-shot generalization in reinforcement learning under unseen environmental conditions. It introduces Dynamics-Aligned Latent Imagination (DALI), which uses a self-supervised encoder to infer latent context representations from agent-environment interactions. This method enhances the Dreamer architecture, enabling it to generate actionable representations that condition the world model and policy, thereby improving perception and control. Experiments on cMDP benchmarks show that DALI outperforms context-unaware baselines and often surpasses context-aware baselines in extrapolation tasks, demonstrating its capability for zero-shot generalization to unseen contextual variations.

研究旨在解决强化学习中适应未见过的环境条件而不重新训练的挑战。引入了Dynamics-Aligned Latent Imagination (DALI)框架，通过训练自监督编码器预测前向动力学，从代理与环境的交互中推断出潜在的上下文表示。该方法生成可操作的表示，条件化世界模型和策略，增强感知和控制。关键实验发现表明，DALI在超越无上下文感知基线的同时，经常在外推任务中超越有上下文感知的基线，实现对未见过的上下文变化的零样本泛化。

Exploring Fine-Tuning for Tabular Foundation Models

Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Vinay Kumar Sankarapu

First: 2026-01-14T17:40:46+00:00 · Latest: 2026-01-14T17:40:46+00:00

Abstract

Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.

中文标题/摘要

标题：探索表格基础模型的微调

表格基础模型（TFMs）最近在结构化数据上展示了强大的上下文学习能力，实现了与传统机器学习方法相当的零样本性能。我们发现，零样本TFMs已经达到了很强的性能，而微调的好处则高度依赖于模型和数据。元学习和PEFT在特定条件下提供了适度的增益，而全面的监督微调（SFT）通常会降低准确度或校准质量。本研究首次在TALENT、OpenML-CC18和TabZilla等基准上对TFMs的微调进行全面研究。我们比较了零样本、元学习、监督（SFT）和参数高效（PEFT）方法，并分析了数据集因素如不平衡、大小和维度对结果的影响。我们的研究涵盖了性能、校准和公平性，提供了关于何时微调最有益及其局限性的实用指南。

Summary / 总结

The study explores fine-tuning methods for Tabular Foundation Models (TFMs) and finds that zero-shot TFMs already perform well, with fine-tuning benefits varying depending on the model and data. Meta-learning and parameter-efficient fine-tuning (PEFT) provide moderate gains under specific conditions, while full supervised fine-tuning (SFT) often degrades performance or calibration. The research covers benchmarks like TALENT, OpenML-CC18, and TabZilla, analyzing how dataset factors such as imbalance, size, and dimensionality impact fine-tuning outcomes, and provides practical guidelines on when fine-tuning is most beneficial and its limitations.

研究探讨了表格式基础模型（TFMs）的微调方法，发现零样本TFMs已经表现出色，微调的益处取决于模型和数据。元学习和参数高效微调（PEFT）在特定条件下提供适度的改进，而全面的监督微调（SFT）往往会降低性能或校准质量。研究涵盖了如TALENT、OpenML-CC18和TabZilla等基准，分析了数据集不平衡、大小和维度等因素对性能、校准和公平性的影响。

AquaFeat+: an Underwater Vision Learning-based Enhancement Method for Object Detection, Classification, and Tracking

Authors: Emanuel da Costa Silva, Tatiana Taís Schein, José David García Ramos, Eduardo Lawson da Silva, Stephanie Loi Brião, Felipe Gomes de Oliveira, Paulo Lilles Jorge Drews-Jr

First: 2026-01-14T17:38:41+00:00 · Latest: 2026-01-14T17:38:41+00:00

Abstract

Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.

中文标题/摘要

标题：AquaFeat+:一种基于水下视觉学习的对象检测、分类和跟踪增强方法

水下视频分析由于低光照、颜色失真和浑浊等因素特别具有挑战性，这些因素会损害视觉数据质量并直接影响机器人应用中感知模块的性能。本文提出了一种名为AquaFeat+的即插即用流水线，旨在专门增强自动化视觉任务的特征，而不是为了提高人类的感知质量。该架构包括颜色校正模块、分层特征增强模块和自适应残差输出模块，这些模块是端到端训练的，并直接由最终应用的损失函数引导。在FishTrack23数据集上训练和评估后，AquaFeat+在对象检测、分类和跟踪指标上取得了显著的改进，验证了其在水下机器人应用中增强感知任务的有效性。

Summary / 总结

AquaFeat+ is a feature enhancement method for underwater vision tasks, addressing issues like low lighting and color distortion. It includes color correction, hierarchical feature enhancement, and an adaptive residual output, all trained end-to-end. Evaluated on the FishTrack23 dataset, AquaFeat+ shows notable improvements in object detection, classification, and tracking, demonstrating its effectiveness for underwater robotic applications.

AquaFeat+ 是一种针对水下视觉任务的特征增强方法，解决低光照和颜色失真等问题。它包含颜色校正、分层特征增强和自适应残差输出，所有模块均端到端训练。在 FishTrack23 数据集上评估，AquaFeat+ 在目标检测、分类和跟踪指标上取得了显著改进，证明了其在水下机器人应用中的有效性。

SPGD: Steepest Perturbed Gradient Descent Optimization

Authors: Amir M. Vahedi, Horea T. Ilies

Venue: ASME. J. Mech. Des. (January 14, 2026)

First: 2024-11-07T18:23:30+00:00 · Latest: 2026-01-14T17:20:45+00:00

Comments: 28 pages, 26 figures, submitted to Journal of Mechanical Design

Abstract

Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as trapping in local minima, saddle points, and plateaus (flat regions), which makes the convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD's superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD's potential as a versatile tool for a wide range of optimization problems.
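
A minimal sketch of the idea, under stated assumptions: the abstract does not define "steepest loss difference", so here it is taken to mean the sampled candidate with the lowest loss, accepted only if it improves on the current point. The function name `spgd`, all hyperparameters, the finite-difference gradient, and the double-well test function are illustrative choices, not taken from the paper.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def numerical_gradient(loss: Callable[[Vector], float], x: Sequence[float],
                       eps: float = 1e-6) -> Vector:
    """Central-difference estimate of the gradient of loss at x."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((loss(hi) - loss(lo)) / (2 * eps))
    return grad


def spgd(loss: Callable[[Vector], float], x0: Sequence[float], lr: float = 0.01,
         iters: int = 300, period: int = 10, n_candidates: int = 20,
         radius: float = 3.0, seed: int = 0) -> Vector:
    """Gradient descent with a periodic uniform perturbation step."""
    rng = random.Random(seed)
    x = list(x0)
    for step in range(1, iters + 1):
        if step % period == 0:
            # Sample candidates uniformly in a box around the current point.
            candidates = [[xi + rng.uniform(-radius, radius) for xi in x]
                          for _ in range(n_candidates)]
            best = min(candidates, key=loss)
            if loss(best) < loss(x):
                x = best  # jump to the candidate with the largest loss drop
                continue
        # Otherwise take an ordinary gradient descent step.
        grad = numerical_gradient(loss, x)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def double_well(x: Vector) -> float:
    """1-D function with a shallow local minimum and a deeper global one."""
    return x[0] ** 4 - 3 * x[0] ** 2 + x[0]


result = spgd(double_well, [1.5])
print(result, double_well(result))
```

Plain gradient descent started at x = 1.5 with this step size settles in the shallower right-hand well near x ≈ 1.13, whereas the periodic sampling gives the search a chance to reach the deeper well near x ≈ -1.30.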

中文标题/摘要

标题：SPGD：最陡扰动梯度下降优化

优化算法在推动各个科学和工业领域的发展中至关重要，但常常会遇到陷入局部极小值、鞍点和高原（平坦区域）等障碍，这使得收敛到合理或接近最优解变得尤为困难。本文提出了一种新颖的算法——最陡扰动梯度下降（SPGD），该算法创新性地结合了梯度下降法的原则和周期性均匀扰动采样，有效地克服了这些障碍，并在可能的情况下导向更好的解。SPGD的独特设计是生成一组候选解，并选择相对于当前解损失差异最陡峭的那个。它通过整合一种战略性探索机制，显著增加了从次优局部极小值中逃逸和有效导航复杂优化景观的可能性。我们的方法不仅保留了梯度下降法的定向效率，还利用了随机扰动的探索优势，从而能够在多种问题空间中更全面地搜索全局最优解。我们通过解决3D组件装箱问题展示了SPGD的有效性，这是一个NP难问题。初步结果显示，SPGD在复杂地形的响应曲面上和多维非凸连续优化问题中显著优于四种现有方法。与现有的二维基准函数进行比较分析，突显了SPGD的优越性能，展示了其在复杂优化景观中导航的能力。这些结果强调了SPGD作为广泛优化问题的多功能工具的潜力。

Summary / 总结

SPGD is a novel optimization algorithm that combines gradient descent with periodic uniform perturbation to improve convergence and escape local minima. It generates candidate solutions and selects the one with the steepest loss difference, enhancing traditional gradient descent by integrating exploration benefits. SPGD demonstrates superior performance in solving the 3D component packing problem and multidimensional optimization challenges, showing substantial improvements over existing methods.

论文提出了一种名为SPGD的新优化算法，该算法结合了梯度下降法和周期性均匀扰动采样，以提高收敛到最优解的能力。SPGD通过集成探索机制增强了传统的梯度下降法，有助于跳出局部极小值并有效导航复杂景观。实验结果显示，SPGD在解决3D组件装箱问题和其他非凸优化挑战中优于四种现有方法，特别是在复杂地形中表现出色。

Coupled Data and Measurement Space Dynamics for Enhanced Diffusion Posterior Sampling

Authors: Shayan Mohajer Hamidi, En-Hui Yang, Ben Liang

First: 2025-10-08T18:59:16+00:00 · Latest: 2026-01-14T17:17:22+00:00

Abstract

Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space $\{\boldsymbol{y}_t\}$, evolving in parallel with the data-space diffusion $\{\boldsymbol{x}_t\}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}_{t-1} \mid \boldsymbol{x}_t, \boldsymbol{y}_{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.

Summary / 总结

The paper addresses the challenge of recovering unknown signals from noisy or incomplete measurements in inverse problems, which are crucial in fields like medical imaging and remote sensing. It introduces C-DPS, a novel framework that couples data and measurement space diffusion processes to avoid the need for constraint tuning or likelihood approximation. Experiments show that C-DPS outperforms existing methods in both qualitative and quantitative evaluations across various benchmarks.

论文针对从噪声或不完整测量中恢复未知信号的逆问题挑战，这些逆问题在医学成像和计算生物学等领域至关重要。提出了一种名为耦合数据和测量空间扩散后验采样（C-DPS）的新框架，该框架避免了约束调整或似然近似的需求。C-DPS通过将测量空间中的前向随机过程与数据空间扩散耦合，实现了基于明确后验分布的准确和递归采样。实验结果表明，C-DPS在多个逆问题基准测试中均优于现有方法，在定性和定量评估上均表现出色。

Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES

Authors: Oliver Goldstein, Samuel March

First: 2025-01-23T13:06:35+00:00 · Latest: 2026-01-14T17:15:35+00:00

Comments: 3 Figures

Abstract

Algebraic data types (ADTs) let a representation specify at the type level what molecular values are valid and what transformations are meaningful. We propose a molecular representation as a family of typed ADTs that separates (i) constitution (Dietz style bonding systems), (ii) 3D coordinates and stereochemistry, and (iii) electronic structure annotations. This separation makes invariants explicit, supports deterministic local edits, and provides hooks for symmetry aware and Bayesian modeling. These data structures allow us to consider how the representation constrains operations which may be performed over them. Types make invalid manipulations unrepresentable and make it easier to define meaningful priors/likelihoods over generative models (programs with sample and score operations). Unlike string based formats, the ADT exposes chemical structure directly; validity conditions (e.g., valence and symmetry constraints) can be enforced by construction and checked deterministically during transformations. We optionally attach electronic structure annotations (shell/subshell/orbital metadata) to atoms when such information is available; we do not attempt to compute orbitals in this work. We sketch Bayesian probabilistic programming via an integration with LazyPPL, a lazy probabilistic programming library; molecules can be made instances of a group under rotation to support geometric learning settings where molecular properties are invariant under rigid motions and relabellings; and the framework's flexibility is demonstrated through an extension to represent chemical reactions. We provide a Haskell library implementing the representation, released under an OSI approved open source license and archived with a DOI.

中文标题/摘要

标题：使用代数数据类型表示分子：超越SMILES和SELFIES

代数数据类型（ADTs）允许在类型级别上指定哪些分子值是有效的以及哪些转换是有意义的。我们提出了一种分子表示法，作为类型化的ADT族，将（i）组成（Dietz风格的键合系统），（ii）3D坐标和立体化学，以及（iii）电子结构注释分离。这种分离使不变量变得明确，支持确定性的局部编辑，并提供对对称性和贝叶斯建模的钩子。这些数据结构使我们能够考虑表示如何限制在其上执行的操作。类型使无效操作不可表示，并使定义生成模型（具有采样和评分操作的程序）的先验/似然性更加容易。与基于字符串的格式不同，ADT直接暴露了化学结构；通过构造可以强制执行有效性条件（例如，价和对称性约束），并在转换期间确定性地检查这些条件。当可用时，我们可选地将电子结构注释（壳层/亚壳层/轨道元数据）附加到原子上；我们在此工作中没有尝试计算轨道。我们通过与LazyPPL（惰性概率编程库）的集成简要介绍了贝叶斯概率编程；分子可以作为旋转群的实例，以支持几何学习设置，其中分子性质在刚体运动和重新标记下不变；并通过扩展表示化学反应展示了框架的灵活性。我们提供了一个Haskell库实现该表示，该库在OSI批准的开源许可证下发布，并通过DOI存档。

Summary / 总结

This paper proposes a molecular representation based on algebraic data types (ADTs) that specifies valid molecular values and meaningful transformations at the type level. The representation separates constitution, 3D coordinates and stereochemistry, and electronic structure annotations, making invariants explicit and supporting deterministic local edits. Because the ADT exposes chemical structure directly, validity conditions can be enforced by construction, and priors and likelihoods over generative models become easier to define than with string-based formats. The paper also sketches Bayesian probabilistic programming via LazyPPL and extends the representation to chemical reactions.

本文提出了一种新的分子表示方法，使用代数数据类型（ADTs）在类型级别上指定分子值和变换。该方法将分子构成、三维坐标、立体化学和电子结构注释分离，使不变量明确，并支持确定性的局部编辑。主要实验发现包括能够强制执行有效性条件并为生成模型定义有意义的先验，这在基于字符串的格式中是不可能的。该方法还支持贝叶斯概率编程，并能灵活地表示化学反应。

Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery

Authors: Jiayin Liu, Yulong Yang, Vineet Bansal, Christine Allen-Blanchette

First: 2025-09-26T23:46:55+00:00 · Latest: 2026-01-14T17:15:25+00:00

Abstract

From metronomes to celestial bodies, mechanics underpins how the world evolves in time and space. With this in mind, a number of recent neural network models leverage inductive biases from classical mechanics to encourage model interpretability and ensure forecasted states are physically plausible. However, in general, these models are designed to capture the dynamics of a single system with fixed physical parameters, from state-space measurements of a known configuration space. In this paper we introduce Symplectic Phase Space GAN (SPS-GAN), which can capture the dynamics of multiple systems and generalize to unseen physical parameters. Moreover, SPS-GAN does not require prior knowledge of the system configuration space. In fact, SPS-GAN can discover the configuration space structure of the system from arbitrary measurement types (e.g., state-space measurements, video frames). To achieve physically plausible generation, we introduce a novel architecture which embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. To discover the structure of the configuration space, we optimize the conditional time-series GAN objective with an additional physically motivated term that encourages a sparse representation of the configuration space. We demonstrate the utility of SPS-GAN for trajectory prediction, video generation and symmetry discovery. Our approach captures multiple systems and achieves performance on par with supervised models designed for single systems.

中文标题/摘要

标题：物理上可验证的多系统轨迹生成与对称性发现

从摆钟到天体，力学是世界在时间和空间中演变的基础。基于此，许多近期的神经网络模型利用经典力学的归纳偏置来促进模型的可解释性，并确保预测状态是物理上的。然而，通常这些模型设计用于捕捉单一系统在固定物理参数下的动力学，从已知配置空间的状态空间测量中。在本文中，我们引入了辛相空间生成对抗网络（SPS-GAN），它可以捕捉多个系统的动力学，并能够从任意测量类型（例如状态空间测量、视频帧）中泛化到未见过的物理参数。实际上，SPS-GAN可以从任意测量类型中发现系统的配置空间结构，而不需要先验的系统配置空间知识。为了实现物理上可验证的生成，我们引入了一种新颖的架构，将哈密顿神经网络递归模块嵌入到条件生成对抗网络的主干中。为了发现配置空间的结构，我们通过增加一个物理上动机的额外项来优化条件时间序列生成对抗网络目标，以鼓励配置空间的稀疏表示。我们展示了SPS-GAN在轨迹预测、视频生成和对称性发现方面的实用性。我们的方法能够捕捉多个系统，并且在性能上与专门为单一系统设计的监督模型相当。

Summary / 总结

This paper introduces Symplectic Phase Space GAN (SPS-GAN), which captures the dynamics of multiple systems and generalizes to unseen physical parameters without requiring prior knowledge of the system configuration space. SPS-GAN uses a Hamiltonian neural network recurrent module in a conditional GAN backbone to ensure physically plausible generation and optimizes the conditional time-series GAN objective with a physically motivated term to discover the configuration space structure. The approach demonstrates utility in trajectory prediction, video generation, and symmetry discovery, achieving performance comparable to supervised models designed for single systems.

本文提出了Symplectic Phase Space GAN (SPS-GAN)，它可以捕捉多个系统的动态并泛化到未见过的物理参数，无需事先了解系统的配置空间。通过在条件GAN骨干中嵌入Hamiltonian神经网络模块，SPS-GAN生成物理上可验证的轨迹。该方法通过在条件时间序列GAN目标中加入一个物理上动机的项来鼓励配置空间的稀疏表示。实验表明，SPS-GAN在轨迹预测、视频生成和对称性发现方面表现出色，与专门为单个系统设计的监督模型的性能相当。

PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records

Authors: Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

First: 2026-01-14T17:12:48+00:00 · Latest: 2026-01-14T17:12:48+00:00

Abstract

While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users' more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents' ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS. Further results show that HIM-Agent significantly improves execution and proactive performance by 15.7% and 7.3%, respectively.

中文标题/摘要

标题：PersonalAlign：基于长期用户中心记录的分层隐式意图对齐个性化GUI代理

虽然GUI代理在明确和完成指令方面表现出强大的性能，但在实际部署中需要与用户更复杂的隐式意图对齐。在本研究中，我们强调了个性化GUI代理的分层隐式意图对齐（PersonalAlign）这一新任务，要求代理利用长期用户的记录作为持续的上下文，解决模糊指令中遗漏的偏好，并通过用户状态预测潜在的例行程序，以提供主动帮助。为了促进这项研究，我们引入了AndroidIntent，这是一个基准测试，用于评估代理在通过长期用户记录进行推理以解决模糊指令和提供主动建议方面的能力。我们为评估标注了来自不同用户20000条长期记录中的775个用户特定偏好和215个例行程序。此外，我们引入了分层意图记忆代理（HIM-Agent），它维护一个不断更新的个人记忆，并分层组织用户的偏好和例行程序以实现个性化。最后，我们在AndroidIntent上评估了一系列GUI代理，包括GPT-5、Qwen3-VL和UI-TARS，进一步的结果表明，HIM-Agent在执行和主动性能方面分别提高了15.7%和7.3%。

Summary / 总结

The research aims to align GUI agents with users' implicit intents by leveraging long-term user records. It defines the PersonalAlign task, introduces the AndroidIntent benchmark, and proposes a Hierarchical Intent Memory Agent (HIM-Agent) that maintains and hierarchically organizes user preferences and routines. In an evaluation that also covers GPT-5, Qwen3-VL, and UI-TARS, HIM-Agent improves execution and proactive performance by 15.7% and 7.3%, respectively.

研究旨在解决GUI代理与用户复杂隐含意图对齐的问题，这些意图比显式指令更复杂。研究引入了PersonalAlign任务，要求代理使用长期用户记录来理解被忽略的偏好并预测用户习惯。为了评估这一点，作者创建了AndroidIntent基准，测试代理处理模糊指令和提供主动建议的能力。开发了层次意图记忆代理（HIM-Agent）来维护个性化记忆并按层次组织用户的偏好和习惯。评估结果显示，HIM-Agent在执行和主动协助方面分别比其他代理提高了15.7%和7.3%。

LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach

Authors: Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo

First: 2026-01-14T17:09:57+00:00 · Latest: 2026-01-14T17:09:57+00:00

Comments: Updated version of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5329027

Abstract

Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs' text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. This design alleviates the downstream agent's burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open-source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work are available at https://github.com/CoraLiang01/lean-llm-opt.

中文标题/摘要

标题：大规模优化模型自动构建的LLM方法：一种轻量级少样本学习方法

大规模优化是现代商业决策的关键支柱。然而，构建这些模型往往耗时且劳动密集。我们通过提出LEAN-LLM-OPT，一种用于LLM辅助的大规模优化自动构建的轻量级智能体式（agentic）工作流构建框架来解决这一问题。LEAN-LLM-OPT 接收问题描述和相关数据集作为输入，并协调一组LLM代理生成优化模型。具体来说，当接收到查询时，两个上游LLM代理会动态构建一个工作流，逐步说明如何为类似问题构建优化模型。下游LLM代理随后遵循此工作流生成最终输出。利用LLM的文本处理能力和通用建模实践，该工作流将建模任务分解为一系列结构化子任务，并将机械数据处理操作卸载到辅助工具上。这种设计减轻了下游代理在规划和数据处理方面的负担，使其能够专注于那些无法标准化的最具挑战性的组件。广泛的模拟显示，使用GPT-4.1和开源gpt-oss-20B实例化的LEAN-LLM-OPT在大规模优化建模任务中表现出色，并且与最先进的方法具有竞争力。此外，在新加坡航空公司基于选择的收益管理案例中，LEAN-LLM-OPT通过在各种场景中实现领先性能展示了其实用价值。在此过程中，我们引入了Large-Scale-OR和Air-NRM，这是大规模优化自动构建的第一个全面基准。此工作的代码和数据可在https://github.com/CoraLiang01/lean-llm-opt/获得。

Enhancing Federated Class-Incremental Learning via Spatial-Temporal Statistics Aggregation

Authors: Zenghao Guan, Guojun Zhu, Yucan Zhou, Wu Liu, Weiping Wang, Jiebo Luo, Xiaoyan Gu

Venue: WWW 2026

First: 2025-06-02T05:14:57+00:00 · Latest: 2026-01-14T16:57:26+00:00

Comments: WWW 2026

Abstract

Federated Class-Incremental Learning (FCIL) enables Class-Incremental Learning (CIL) from distributed data. Existing FCIL methods typically integrate old knowledge preservation into local client training. However, these methods cannot avoid spatial-temporal client drift caused by data heterogeneity and often incur significant computational and communication overhead, limiting practical deployment. To address these challenges simultaneously, we propose a novel approach, Spatial-Temporal Statistics Aggregation (STSA), which provides a unified framework to aggregate feature statistics both spatially (across clients) and temporally (across stages). The aggregated feature statistics are unaffected by data heterogeneity and can be used to update the classifier in closed form at each stage. Additionally, we introduce STSA-E, a communication-efficient variant with theoretical guarantees, achieving similar performance to STSA with much lower communication overhead. Extensive experiments on three widely used FCIL datasets, with varying degrees of data heterogeneity, show that our method outperforms state-of-the-art FCIL methods in terms of performance, flexibility, and both communication and computation efficiency. The code is available at https://github.com/Yuqin-G/STSA.

中文标题/摘要

标题：通过空间-时间统计聚合增强联邦类增量学习

联邦类增量学习（FCIL）使分布式数据中的类增量学习（CIL）成为可能。现有FCIL方法通常将旧知识保留融入到本地客户端训练中。然而，这些方法无法避免由数据异质性引起的客户端空间-时间漂移，并且通常会带来显著的计算和通信开销，限制了其实用部署。为同时解决这些挑战，我们提出了一种新颖的方法，空间-时间统计聚合（STSA），它提供了一个统一框架，用于在客户端之间（空间上）和训练阶段之间（时间上）聚合特征统计。聚合的特征统计不受数据异质性影响，并且可以在每个阶段以封闭形式更新分类器。此外，我们引入了STSA-E，这是一种具有理论保证的通信高效变体，其性能与STSA相当，但通信开销要低得多。在三个广泛使用的FCIL数据集上进行的大量实验表明，我们的方法在性能、灵活性以及通信和计算效率方面均优于最先进的FCIL方法。代码可在https://github.com/Yuqin-G/STSA获取。

Summary / 总结

The paper addresses the challenges of Federated Class-Incremental Learning (FCIL) by proposing a novel method, Spatial-Temporal Statistics Aggregation (STSA), which aggregates feature statistics both spatially and temporally to mitigate data heterogeneity and reduce computational and communication overhead. Experiments on three datasets show that STSA outperforms existing methods in terms of performance, flexibility, and efficiency.

论文提出了一种名为Spatial-Temporal Statistics Aggregation (STSA)的新方法，以解决Federated Class-Incremental Learning (FCIL)中的挑战。STSA在客户端之间和阶段之间同时聚合特征统计信息，以缓解数据异质性并减少计算和通信开销。实验结果表明，STSA在性能、灵活性和效率方面优于现有方法。
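
The claim that aggregated feature statistics are unaffected by data heterogeneity and permit a closed-form classifier update can be illustrated with a small sketch. The abstract does not say which statistics STSA aggregates, so this example assumes a ridge-regression classifier built from a feature Gram matrix and a feature-label cross-correlation matrix; the function names and the `ridge` parameter are hypothetical. Because both statistics are plain sums over samples, adding them across clients (or across stages) gives the same result as computing them on the pooled data, however the data is partitioned.

```python
import numpy as np


def client_statistics(features, onehot_labels):
    """Per-client sufficient statistics (assumed form, not taken from the paper)."""
    gram = features.T @ features          # d x d feature Gram matrix
    cross = features.T @ onehot_labels    # d x C feature-label correlation
    return gram, cross


def aggregate(stat_list):
    """Sum statistics over clients; the same sum also works across stages."""
    gram = sum(g for g, _ in stat_list)
    cross = sum(c for _, c in stat_list)
    return gram, cross


def closed_form_classifier(gram, cross, ridge=1e-3):
    """Ridge-regression weights W = (G + ridge * I)^-1 C."""
    dim = gram.shape[0]
    return np.linalg.solve(gram + ridge * np.eye(dim), cross)
```

Splitting a dataset into label-sorted shards (a strongly non-IID partition) and aggregating the per-shard statistics yields the same weights as solving on the full dataset, which is the property this sketch is meant to show.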

CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems

Authors: Yonglin Tian, Qiyao Zhang, Wei Xu, Yutong Wang, Yihao Wu, Xinyi Li, Xingyuan Dai, Hui Zhang, Zhiyong Cui, Baoqing Guo, Zujun Yu, Yisheng Lv

First: 2026-01-14T16:36:26+00:00 · Latest: 2026-01-14T16:36:26+00:00

Abstract

Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.

中文标题/摘要

标题：CogRail：评估认知入侵感知在智能铁路运输系统中的VLMs基准

准确且早期地感知潜在入侵目标对于确保铁路运输系统的安全至关重要。然而，现有的大多数系统仅专注于固定视觉范围内的对象分类，并通过基于规则的启发式方法来确定入侵状态，往往忽略了潜在入侵风险的目标。预测这些风险需要认知感兴趣对象（OOI）的空间上下文和时间动态，这给传统的视觉模型带来了挑战。为了促进深度入侵感知，我们引入了一个新的基准CogRail，该基准结合了精心策划的开源数据集和认知驱动的问答注释，以支持时空推理和预测。在此基准之上，我们使用多模态提示系统地评估了最先进的视觉语言模型（VLMs），以识别它们在该领域的优势和局限性。此外，我们对VLMs进行了微调以提高性能，并提出了一种联合微调框架，该框架整合了三个核心任务：位置感知、运动预测和威胁分析，从而促进通用基础模型的有效适应，以适应专门针对认知入侵感知的模型。广泛的实验表明，当前的大规模多模态模型在认知入侵感知任务所需的复杂时空推理方面存在困难，突显了现有基础模型在这一关键安全领域的局限性。相比之下，我们提出的联合微调框架通过使模型能够针对特定领域的推理需求进行目标化适应，显著提高了模型性能，突显了结构化多任务学习在提高准确性和可解释性方面的优势。代码将在https://github.com/Hub-Tian/CogRail/上提供。

Summary / 总结

The research aims to improve the safety of railway transportation systems by developing a benchmark, CogRail, to evaluate visual-language models (VLMs) in cognitive intrusion perception. The method involves using multimodal prompts and curated datasets with cognitive annotations to assess VLMs' capabilities in spatio-temporal reasoning. Key findings show that current VLMs struggle with complex spatial-temporal reasoning, but a joint fine-tuning framework significantly improves their performance and interpretability for this task.

研究旨在通过开发CogRail基准来评估视觉语言模型(VLMs)在感知潜在入侵方面的能力，以提高铁路运输系统的安全性。研究使用多模态提示和微调来增强VLMs的认知入侵感知能力，重点关注空间和时间推理。关键发现表明，当前的VLMs在复杂的空间-时间推理方面存在困难，而提出的联合微调框架显著提高了模型性能和可解释性，以适应特定领域的推理需求。

DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing

Authors: Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li

First: 2026-01-14T16:30:20+00:00 · Latest: 2026-01-14T16:30:20+00:00

Abstract

Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.

中文标题/摘要

标题：DPWriter：多样化规划分支的强化学习在创造性写作中的应用

基于强化学习（RL）的大语言模型（LLMs）增强通常会导致输出多样性降低，这在开放性任务如创造性写作中削弱了它们的实用性。当前方法缺乏明确的机制来引导多样性的探索，而是优先考虑优化效率和性能而非多样性。本文提出了一种基于半结构化长链思维（CoT）的RL框架，在生成过程中将过程分解为明确规划的中间步骤。我们引入了一种多样化规划分支方法，在规划阶段战略性地引入差异，基于多样性变化，并引入群体意识多样性奖励以鼓励不同的轨迹。在创造性写作基准上的实验结果表明，我们的方法在不牺牲生成质量的情况下显著提高了输出多样性，始终优于现有基线。

Summary / 总结

The research aims to enhance the diversity of outputs from large language models (LLMs) in creative writing tasks, which is often compromised by reinforcement learning (RL) methods that prioritize efficiency over diversity. The study introduces DPWriter, an RL framework that decomposes the generation process into planned intermediate steps and employs a Diverse Planning Branching method to introduce strategic divergence. This method, combined with a group-aware diversity reward, improves output diversity without sacrificing quality, outperforming existing baselines on creative writing benchmarks.

论文针对强化学习（RL）增强的大语言模型（LLMs）在创意写作任务中输出多样性降低的问题，提出了一种新的RL框架，将生成过程分解为计划中的中间步骤，并采用多样规划分支方法在规划阶段引入战略性的发散。该方法还使用群体意识的多样性奖励来鼓励不同的轨迹。实验表明，这种方法在不牺牲生成质量的情况下显著提高了输出多样性，优于现有基线。

GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis

Authors: Manning Gao, Leheng Zhang, Shiqin Han, Haifeng Hu, Yuncheng Jiang, Sijie Mai

First: 2026-01-14T16:26:44+00:00 · Latest: 2026-01-14T16:26:44+00:00

Abstract

Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.

中文标题/摘要

标题：GRCF：多模态情感分析的两阶段组内排序与校准框架

大多数多模态情感分析研究集中在点估计回归上。虽然这种方法简单直接，但它对标签噪声敏感，并且忽略了样本之间的相对顺序，导致预测不稳定且相关性对齐不佳。对偶序数学习框架应运而生，通过比较学习来捕捉相对顺序。然而，它们引入了两个新的权衡：首先，它们对所有比较赋予相同的权重，无法适应性地关注难以排序的样本；其次，它们采用静态排名间隔，无法反映不同情感组之间的语义距离变化。为了解决这些问题，我们提出了一种两阶段组内排序与校准框架（GRCF），该框架借鉴了组相对策略优化（GRPO）的思想。我们的框架通过同时保持相对序数结构、确保绝对评分校准以及适应性地关注困难样本来解决这些权衡。具体而言，第一阶段引入了一种基于优势加权动态间隔排序损失的GRPO启发式方法，以构建精细的序数结构。第二阶段则采用MAE驱动的目标来对齐预测幅度。为了验证其普适性，我们将GRCF扩展到分类任务，包括多模态幽默检测和讽刺检测。GRCF在核心回归基准测试中达到了最先进的性能，同时在分类任务中也表现出强大的普适性。

Summary / 总结

The research aims to improve multimodal sentiment analysis by addressing the limitations of point-wise regression and pairwise ordinal learning. It proposes a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that uses a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss in Stage 1 to build a fine-grained ordinal structure and an MAE-driven objective in Stage 2 to align prediction magnitudes. GRCF outperforms existing methods on regression benchmarks and shows strong generalizability in classification tasks such as multimodal humor and sarcasm detection.

论文通过提出两阶段组内排序和校准框架（GRCF）来解决多模态情感分析中点估计回归的局限性。GRCF 在第一阶段使用 GRPO 启发的优势加权动态边际排序损失来构建细粒度的序数结构，在第二阶段使用 MAE 驱动的目标来对齐预测幅度。该框架提高了稳定性和相关性对齐。实验表明，GRCF 在回归基准上优于现有方法，并在幽默和讽刺检测等分类任务中表现出强大的泛化能力。

Iterative Differential Entropy Minimization (IDEM) method for fine rigid pairwise 3D Point Cloud Registration: A Focus on the Metric

Authors: Emmanuele Barberi, Felice Sfravara, Filippo Cucinotta

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, Available in IEEE Xplore

First: 2026-01-14T16:16:51+00:00 · Latest: 2026-01-14T16:16:51+00:00

Abstract

Point cloud registration is a central theme in computer vision, with alignment algorithms continuously improving for greater robustness. Commonly used methods evaluate Euclidean distances between point clouds and minimize an objective function, such as Root Mean Square Error (RMSE). However, these approaches are most effective when the point clouds are well-prealigned and issues such as differences in density, noise, holes, and limited overlap can compromise the results. Traditional methods, such as Iterative Closest Point (ICP), require choosing one point cloud as fixed, since Euclidean distances lack commutativity. When only one point cloud has issues, adjustments can be made, but in real scenarios, both point clouds may be affected, often necessitating preprocessing. The authors introduce a novel differential entropy-based metric, designed to serve as the objective function within an optimization framework for fine rigid pairwise 3D point cloud registration, denoted as Iterative Differential Entropy Minimization (IDEM). This metric does not depend on the choice of a fixed point cloud and, during transformations, reveals a clear minimum corresponding to the best alignment. Multiple case studies are conducted, and the results are compared with those obtained using RMSE, Chamfer distance, and Hausdorff distance. The proposed metric proves effective even with density differences, noise, holes, and partial overlap, where RMSE does not always yield optimal alignment.

中文标题/摘要

标题：迭代微分熵最小化（IDEM）方法在细粒度刚性配对3D点云注册中的应用：聚焦于度量

点云注册是计算机视觉中的核心主题，对齐算法不断改进以提高鲁棒性。常用方法评估点云之间的欧几里得距离并最小化目标函数，如均方根误差（RMSE）。然而，这些方法在点云预对齐良好时最有效，而密度差异、噪声、空洞和重叠有限等问题会损害结果。传统方法，如迭代最近点（ICP），需要选择一个固定点云，因为欧几里得距离缺乏交换律。当仅一个点云存在问题时，可以进行调整，但在实际场景中，两个点云都可能受到影响，通常需要预处理。作者提出了一种新颖的基于微分熵的度量，旨在作为优化框架中的目标函数，用于细粒度刚性配对3D点云注册，称为迭代微分熵最小化（IDEM）。该度量不依赖于固定点云的选择，在变换过程中揭示出对应最佳对齐的明确最小值。进行了多个案例研究，并将结果与使用RMSE、Chamfer距离和豪斯多夫距离获得的结果进行了比较。所提出的度量在密度差异、噪声、空洞和部分重叠的情况下证明有效，而RMSE并不总是能获得最佳对齐。

Summary / 总结

The paper introduces IDEM, a method for fine rigid pairwise 3D point cloud registration that minimizes a differential entropy-based metric. Unlike Iterative Closest Point (ICP), which must treat one point cloud as fixed, the metric does not depend on the choice of a fixed point cloud. In case studies comparing it with RMSE, Chamfer distance, and Hausdorff distance, the metric remains effective under density differences, noise, holes, and partial overlap, where RMSE does not always yield the optimal alignment.

研究引入了Iterative Differential Entropy Minimization (IDEM)方法，用于精细刚性配对3D点云注册，解决了传统方法如ICP和RMSE的局限性。IDEM使用基于差异熵的度量，不需要固定点云，并在变换过程中提供最优对齐的明确最小值。实验表明，IDEM在密度差异、噪声、孔洞和部分重叠等场景中优于RMSE、Chamfer距离和Hausdorff距离。
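
To give a feel for why an entropy-based objective needs no "fixed" cloud, here is a minimal sketch. The abstract does not define the IDEM metric, so this example substitutes a simple stand-in: the differential entropy of a single Gaussian fitted to the union of both clouds, h = 0.5 * ln((2*pi*e)^d * det(Sigma)). The function names are hypothetical, and the paper's actual metric may differ substantially (for instance, it may be computed locally rather than globally).

```python
import numpy as np


def gaussian_differential_entropy(points):
    """Differential entropy of a Gaussian fitted to an (n, d) point set."""
    points = np.asarray(points, dtype=float)
    dim = points.shape[1]
    cov = np.cov(points, rowvar=False)       # d x d sample covariance
    sign, logdet = np.linalg.slogdet(cov)    # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    return 0.5 * (dim * np.log(2 * np.pi * np.e) + logdet)


def joint_entropy(cloud_a, cloud_b):
    """Entropy of the merged cloud; neither input plays the role of a fixed reference."""
    return gaussian_differential_entropy(np.vstack([cloud_a, cloud_b]))
```

Under this stand-in, translating one copy of a cloud away from the other inflates the covariance of the union and therefore raises the entropy, so the aligned configuration is a minimum. Swapping the two arguments leaves the value unchanged up to floating-point rounding, in contrast to a nearest-neighbour distance evaluated from one cloud to the other.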

Epistemic Skills: Reasoning about Knowledge and Oblivion

Authors: Xiaolong Liang, Yì N. Wáng

First: 2025-04-02T13:41:42+00:00 · Latest: 2026-01-14T16:03:35+00:00

Abstract

This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an ``epistemic skills'' metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of ``knowability'' and ``forgettability,'' defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

中文标题/摘要

标题：知识与遗忘的认知技能：知识获取与遗忘的动力学

本文提出了一类知识逻辑，捕捉了获取知识和陷入遗忘的动力学过程，同时融入了群体知识的概念。该方法基于加权模型系统，引入了“认知技能”度量来表示与知识更新相关的认知能力。在此框架中，知识获取被建模为技能提升的过程，而遗忘则被表示为技能下降的后果。该框架还允许探索“可知性”和“可遗忘性”，分别定义为通过技能提升获得知识和通过技能下降陷入遗忘的可能性。此外，它还支持对关于认知的de re和de dicto表达式的区别进行详细分析。模型检查和可满足性问题的计算复杂性被研究，提供了其理论基础和实际应用的见解。

Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings

Authors: Yanru Wu, Jianning Wang, Xiangyu Chen, Enming Zhang, Yang Tan, Hanbing Liu, Yang Li

Venue: NeurIPS 2025

First: 2025-02-17T09:52:19+00:00 · Latest: 2026-01-14T15:58:22+00:00

Comments: 28 pages, 5 figures, accepted by NeurIPS 2025

Abstract

Continual learning (CL) has been a critical topic in contemporary deep neural network applications, where higher levels of both forward and backward transfer are desirable for effective CL performance. Existing CL strategies primarily focus on task models, either by regularizing model updates or by separating task-specific and shared components, while often overlooking the potential of leveraging inter-task relationships to enhance transfer. To address this gap, we propose a transferability-aware task embedding, termed H-embedding, and construct a hypernet framework under its guidance to learn task-conditioned model weights for CL tasks. Specifically, H-embedding is derived from an information theoretic measure of transferability and is designed to be online and easy to compute. Our method is also characterized by notable practicality, requiring only the storage of a low-dimensional task embedding per task and supporting efficient end-to-end training. Extensive evaluations on benchmarks including CIFAR-100, ImageNet-R, and DomainNet show that our framework performs favorably compared to various baseline and SOTA approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships. Our code is publicly available at https://github.com/viki760/Hembedding_Guided_Hypernet.

中文标题/摘要

标题：利用转移性感知任务嵌入在持续学习中利用任务关系

持续学习（CL）在当前深度神经网络应用中是一个关键话题，其中前向和后向转移的较高水平对于有效的CL性能是必要的。现有的CL策略主要集中在任务模型上，通过正则化模型更新或分离任务特定和共享组件，但往往忽视了利用任务间关系来增强转移的潜力。为了解决这一差距，我们提出了一种转移性感知任务嵌入，称为H-嵌入，并在其指导下构建了一个超网络框架，以学习CL任务的条件模型权重。具体而言，H-嵌入是从转移性的信息论度量中导出的，并设计为在线且易于计算。我们的方法还具有显著的实用性，只需要为每个任务存储一个低维任务嵌入，并支持高效的端到端训练。在包括CIFAR-100、ImageNet-R和DomainNet的基准测试中，我们的框架与各种基线和SOTA方法相比表现出色，展示了在捕获和利用内在任务关系方面的强大潜力。我们的代码可在https://github.com/viki760/Hembedding_Guided_Hypernet上公开获取。

Summary / 总结

The paper addresses the challenge of leveraging inter-task relationships in continual learning (CL) to enhance transferability. It introduces H-embedding, a transferability-aware task embedding, and a hypernet framework to learn task-conditioned model weights. The method is practical, requiring minimal storage and supporting efficient training. Experiments on CIFAR-100, ImageNet-R, and DomainNet show that the proposed framework outperforms existing approaches in capturing and utilizing task relationships for CL tasks.

该论文通过提出H-嵌入，一种转移性感知的任务嵌入方法，解决了持续学习的挑战。该方法构建了一个超网络框架来学习任务条件下的模型权重，从而增强前向和后向转移。实验结果表明，该框架在CIFAR-100、ImageNet-R和DomainNet等基准上的表现优于各种基线和最先进的方法，展示了在捕捉和利用任务关系方面的强大潜力。

Energy-Entropy Regularization: The True Power of Minimal Looped Transformers

Authors: Wai-Lun Lam

First: 2026-01-14T15:56:35+00:00 · Latest: 2026-01-14T15:56:35+00:00

Comments: 19 pages, 2 figures

Abstract

Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum. The internal mechanisms of these single-head looped Transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve the induction head task with an input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.

中文标题/摘要

标题：能量-熵正则化：最小环形变换器的真正力量

近期研究表明，环形变换器在推理能力方面优于标准深度架构。当前对单头环形架构在基准任务上的训练方法经常失败或表现不佳，这主要是由于损失景观的高度非凸性和不规则性。在这种情况下，优化往往在损失景观的较差局部极小值和鞍点处停滞，阻止模型发现全局最小值点。这些单头环形变换器模型的内部机制仍然不甚了解，从零开始训练它们仍然是一个重大挑战。在本文中，我们提出了一种新的训练框架，利用Tsallis熵和哈密顿动力学来改变损失景观的几何结构。通过将参数更新视为物理流，我们成功地训练了一个模型维度为$d = 8$的单头环形变换器，以解决输入序列长度为1000个标记的归纳头任务。这一成功揭示了其优越推理能力背后的内部机制。

Summary / 总结

This paper addresses the difficulty of training single-head looped Transformers from scratch by proposing a training framework that uses Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape, so that optimization is less likely to stall in poor local minima and saddle points. The authors report training a single-head looped Transformer with model dimension 8 to solve an induction head task with an input sequence length of 1000 tokens, which they present as revealing the internal mechanism behind the architecture's reasoning capability.

本文提出了一种新的训练框架，利用Tsallis熵和哈密顿动力学来改善优化景观，解决单头循环Transformer训练中的挑战。该方法通过改变损失景观的几何结构，使模型能够跳出较差的局部极小值和鞍点。作者成功地训练了一个8维的单头循环Transformer，解决了包含1000个词元输入序列的归纳头任务，展示了该模型的优越推理能力。
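
For readers unfamiliar with the entropy named in the abstract above, the following is a minimal sketch of the standard Tsallis entropy, S_q = (1 - sum_i p_i^q) / (q - 1), which recovers Shannon entropy as q approaches 1. The function name `tsallis_entropy` and the restriction to q > 0 are assumptions made for illustration. The abstract does not say how the paper uses this quantity as a regularizer, so this is not the author's training objective.

```python
import math


def tsallis_entropy(probs, q):
    """Standard Tsallis entropy of a discrete distribution (illustrative helper).

    Assumes q > 0; zero-probability outcomes contribute nothing to the sum.
    """
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be a probability distribution")
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1 is the Shannon entropy (natural logarithm).
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)
```

For a uniform distribution over four outcomes and q = 2, the sum of squared probabilities is 0.25, so the entropy is 0.75; a one-hot distribution gives 0 for any q > 0.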

LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Authors: Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval

First: 2025-12-04T08:25:16+00:00 · Latest: 2026-01-14T15:55:32+00:00

Comments: 46 pages, 17 figures, 16 tables

Abstract

Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.

中文标题/摘要

标题：LeMat-GenBench：晶体生成模型统一评估框架

生成机器学习（ML）模型在通过无机晶体的逆设计加速材料发现方面具有巨大潜力，能够以前所未有的方式探索化学空间。然而，缺乏标准化的评估框架使得有意义地评估、比较和进一步发展这些ML模型变得具有挑战性。在本文中，我们介绍了LeMat-GenBench，这是一个面向晶体材料生成模型的统一基准，并配有一套评估指标，旨在更好地指导模型开发和下游应用。我们发布了开源评估套件和Hugging Face上的公共排行榜，并对12个最近的生成模型进行了基准测试。结果表明，平均而言，稳定性的提高会导致新颖性和多样性降低，没有一个模型在所有维度上都表现出色。总体而言，LeMat-GenBench为公平的模型比较建立了可重复和可扩展的基础，并旨在指导开发更可靠、面向发现的晶体材料生成模型。

Summary / 总结

LeMat-GenBench is a unified evaluation framework for generative models of crystalline materials, addressing the need for standardized evaluation metrics. The framework benchmarks 12 recent generative models and finds that increased stability reduces novelty and diversity, with no model excelling in all dimensions. This work aims to facilitate fair model comparison and guide the development of more reliable generative models for materials discovery.

LeMat-GenBench 是一个统一的基准框架，用于评估晶体材料生成模型，解决了标准化评价指标的缺失问题。该框架包含开源的评价套件和公共排行榜，对12个最新的生成模型进行了基准测试。结果显示，稳定性提高会降低新颖性和多样性，没有模型在所有维度上都表现出色。这项工作旨在促进公平的模型比较，并指导开发更可靠的用于材料发现的生成模型。

egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks

Authors: Matthias Jammot, Björn Braun, Paul Streli, Rafael Wampfler, Christian Holz

Venue: NeurIPS 2025

First: 2025-10-25T03:04:51+00:00 · Latest: 2026-01-14T15:52:28+00:00

Comments: Accepted for publication at NeurIPS 2025

Abstract

Understanding affect is central to anticipating human behavior, yet current egocentric vision benchmarks largely ignore the person's emotional states that shape their decisions and actions. Existing tasks in egocentric perception focus on physical activities, hand-object interactions, and attention modeling, assuming neutral affect and uniform personality. This limits the ability of vision systems to capture key internal drivers of behavior. In this paper, we present egoEMOTION, the first dataset that couples egocentric visual and physiological signals with dense self-reports of emotion and personality across controlled and real-world scenarios. Our dataset includes over 50 hours of recordings from 43 participants, captured using Meta's Project Aria glasses. Each session provides synchronized eye-tracking video, head-mounted photoplethysmography, inertial motion data, and physiological baselines for reference. Participants completed emotion-elicitation tasks and naturalistic activities while self-reporting their affective state using the Circumplex Model and Mikels' Wheel as well as their personality via the Big Five model. We define three benchmark tasks: (1) continuous affect classification (valence, arousal, dominance); (2) discrete emotion classification; and (3) trait-level personality inference. We show that a classical learning-based method, as a simple baseline in real-world affect prediction, produces better estimates from signals captured on egocentric vision systems than processing physiological signals. Our dataset establishes emotion and personality as core dimensions in egocentric perception and opens new directions in affect-driven modeling of behavior, intent, and interaction.

中文标题/摘要

标题：egoEMOTION：主观视角的视觉和生理信号用于识别真实任务中的情绪和个性

理解情绪是预测人类行为的关键，但当前的主观视角基准数据大多忽略了塑造决策和行动的情绪状态。现有主观感知任务主要关注物理活动、手物交互和注意力建模，假设中性情绪和统一的个性。这限制了视觉系统捕捉行为关键内部驱动因素的能力。在本文中，我们提出了egoEMOTION，这是第一个将主观视角的视觉和生理信号与控制和真实场景中的密集自我报告的情绪和个性相结合的数据集。我们的数据集包括来自43名参与者超过50小时的记录，使用Meta的Project Aria眼镜捕获。每个会话提供同步的眼球追踪视频、头戴式光电容积描记图、惯性运动数据和生理基线作为参考。参与者完成了情绪诱发任务和自然活动，并使用Circumplex模型和Mikels的轮盘以及Big Five模型自我报告其情绪状态。我们定义了三个基准任务：（1）连续情绪分类（正负性、唤醒度、支配性）；（2）离散情绪分类；（3）特质水平的个性推断。我们展示了经典的基于学习的方法，在真实情绪预测中作为简单基线，从主观视角系统捕获的信号中产生更好的估计，而不是处理生理信号。我们的数据集将情绪和个性确立为主观感知的核心维度，并为基于情绪的行为、意图和交互建模开辟了新方向。

Summary / 总结

The paper introduces egoEMOTION, a dataset combining egocentric visual and physiological signals with self-reports of emotion and personality across controlled and real-world scenarios. It includes over 50 hours of recordings from 43 participants using Meta's Project Aria glasses. The dataset supports three benchmark tasks: continuous affect classification, discrete emotion classification, and trait-level personality inference. The study finds that a classical learning-based method performs better in predicting affect from egocentric vision signals compared to physiological signals alone.

该研究介绍了egoEMOTION数据集，该数据集结合了主观视觉信号和生理信号，以及情绪和人格的自我报告，涵盖了受控和现实世界的场景。数据集包括43名参与者超过50小时的记录，使用Meta的Project Aria眼镜。研究定义了三个基准任务：连续情绪分类、离散情绪分类和特质水平的人格推断。研究结果显示，基于经典学习的方法在现实世界的情绪预测中，从主观视觉信号中获得的估计比从生理信号中获得的更好。该数据集增强了对行为内部驱动因素在主观感知系统中的理解。

FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures

Authors: Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao, Yufei Huang

First: 2026-01-12T21:57:52+00:00 · Latest: 2026-01-14T15:49:01+00:00

Abstract

Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.

中文标题/摘要

标题：FigEx2：基于视觉条件的复合科学图表面板检测与描述

科学复合图表将多个标记面板合并为单个图像，但在实际流程中，描述往往缺失或仅提供图级摘要，使得面板级理解变得困难。本文提出FigEx2，一种基于视觉条件的框架，直接从复合图表中定位面板并生成面板级描述。为缓解开放式描述中多样表达的影响，引入了一种噪声感知门控融合模块，自适应地过滤词元级特征以稳定检测查询空间。此外，采用结合监督学习与强化学习（RL）的分阶段优化策略，利用基于CLIP的对齐和基于BERTScore的语义奖励来强制执行严格的多模态一致性。为支持高质量监督，我们构建了BioSci-Fig-Cap，一个针对面板级定位的精炼基准，以及跨学科的物理和化学测试套件。实验结果表明，FigEx2在检测上的mAP@0.5:0.95达到0.726，并在METEOR上比Qwen3-VL-8B高0.51、在BERTScore上高0.24。值得注意的是，FigEx2在无需任何微调的情况下，对分布外的科学领域表现出显著的零样本迁移能力。

Summary / 总结

The research aims to improve panel-level understanding of scientific compound figures with FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. It introduces a noise-aware gated fusion module to stabilize the detection query space and employs a staged optimization strategy combining supervised and reinforcement learning. FigEx2 reaches 0.726 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.

研究旨在通过提出FigEx2框架，增强对科学复合图中各面板的理解，该框架能够定位面板并生成面板级别的描述。它引入了一个噪声感知门控融合模块以稳定检测查询空间，并采用结合监督学习和强化学习的分阶段优化策略来确保多模态一致性。实验结果表明，FigEx2在METEOR和BERTScore上优于Qwen3-VL-8B，并且在新科学领域中表现出显著的零样本迁移能力。
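
The detection metric quoted above, mAP@0.5:0.95, conventionally averages precision over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU and threshold part of that convention for a single predicted panel box; it is not a full mAP implementation and is not taken from the FigEx2 code. The names `box_iou`, `IOU_THRESHOLDS`, and `thresholds_passed` are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds 0.50, 0.55, ..., 0.95 implied by the "0.5:0.95" notation.
IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Count how many of the ten thresholds a prediction satisfies."""
    iou = box_iou(pred_box, gt_box)
    return sum(1 for t in IOU_THRESHOLDS if iou >= t)
```

A predicted panel box that is shifted by a fifth of its width relative to the ground truth has an IoU of about 0.67, so it counts as a match at the four loosest thresholds and as a miss at the six stricter ones, which is why this averaged metric rewards tight localization.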

Multimodal Signal Processing For Thermo-Visible-Lidar Fusion In Real-time 3D Semantic Mapping

Authors: Jiajun Sun, Yangyi Ou, Haoyuan Zheng, Chao Yang, Yue Ma

First: 2026-01-14T15:46:57+00:00 · Latest: 2026-01-14T15:46:57+00:00

Comments: 5 pages, 7 figures. Under review

Abstract

In complex environments, autonomous robot navigation and environmental perception pose higher requirements for SLAM technology. This paper presents a novel method for semantically enhancing 3D point cloud maps with thermal information. By first performing pixel-level fusion of visible and infrared images, the system projects real-time LiDAR point clouds onto this fused image stream. It then segments heat source features in the thermal channel to instantly identify high temperature targets and applies this temperature information as a semantic layer on the final 3D map. This approach generates maps that not only have accurate geometry but also possess a critical semantic understanding of the environment, making it highly valuable for specific applications like rapid disaster assessment and industrial preventive maintenance.

中文标题/摘要

标题：多模态信号处理在实时3D语义建图中的热视-激光雷达融合

在复杂环境中，自主机器人导航和环境感知对SLAM技术提出了更高的要求。本文提出了一种利用热信息语义增强3D点云地图的新方法。首先在像素级融合可见光和红外图像，系统将实时LiDAR点云投影到该融合图像流中。然后在热通道中分割热源特征，以即时识别高温目标，并将此温度信息作为语义层应用到最终的3D地图上。这种方法生成的地图不仅具有准确的几何结构，还具有对环境的关键语义理解，对于快速灾害评估和工业预防性维护等特定应用具有很高的价值。

Summary / 总结

The paper targets semantic enrichment of 3D point cloud maps for autonomous robot navigation in complex environments. The method fuses visible and infrared images at the pixel level, projects real-time LiDAR point clouds onto the fused image stream, and segments heat-source features in the thermal channel to identify high-temperature targets, which are added to the final 3D map as a semantic layer. According to the abstract, the resulting maps combine accurate geometry with semantic understanding of the environment, which the authors consider valuable for rapid disaster assessment and industrial preventive maintenance.

论文旨在通过融合可见光和红外图像来增强3D点云地图，以提高自主机器人导航的性能。系统首先将实时LiDAR点云投影到融合图像流上，然后在热通道中分割热源特征以识别高温目标，并将温度信息作为语义层集成到最终的3D地图中。关键发现表明，该方法生成的地图不仅几何精度高，还具有对环境的语义理解，适用于快速灾害评估和工业预防维护等应用。
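
The projection-and-labeling step described in this entry can be sketched as follows, under several assumptions that the abstract does not confirm: the LiDAR points are already expressed in the camera frame (extrinsic calibration is omitted), the camera follows a simple pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`, and the thermal channel is available as a 2D grid of temperatures aligned with the fused image. The function name `attach_temperature` and the fixed `hot_threshold` are hypothetical stand-ins for the paper's heat-source segmentation.

```python
import math


def attach_temperature(points_cam, thermal, fx, fy, cx, cy, hot_threshold):
    """Project camera-frame 3D points onto a thermal grid and tag hot ones.

    Returns (x, y, z, temperature, is_hot) for every point that lands inside
    the image; points behind the camera or outside the view are dropped.
    """
    height, width = len(thermal), len(thermal[0])
    labeled = []
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera plane, cannot be projected
            continue
        # Pinhole projection to pixel coordinates (column u, row v).
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            temp = thermal[v][u]
            labeled.append((x, y, z, temp, temp >= hot_threshold))
    return labeled
```

A simple threshold is used here only to keep the example short; any per-pixel segmentation mask could replace the `temp >= hot_threshold` test without changing the projection logic.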