arXiv Paper Digest

3AM: Segment Anything with Geometric Consistency in Videos

Authors: Yang-Che Sun, Cheng Sun, Chin-Yang Lin, Fu-En Yang, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu

First: 2026-01-13T18:59:54+00:00 · Latest: 2026-01-13T18:59:54+00:00

Comments: Project page: https://jayisaking.github.io/3AM-Page/

Abstract

Video object segmentation methods like SAM2 achieve strong performance through memory-based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi-level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry-consistent recognition grounded in both spatial position and visual similarity. We propose a field-of-view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide-baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Positive IoU on ScanNet++'s Selected Subset, improving over state-of-the-art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM-Page/

Summary

3AM is a training-time enhancement that integrates 3D-aware features from MUSt3R into SAM2 to achieve geometry-consistent video object segmentation. A lightweight Feature Merger fuses multi-level MUSt3R features with SAM2's appearance features, so recognition is grounded in both spatial position and visual similarity. A separate field-of-view aware sampling strategy ensures that frames observe spatially consistent object regions for reliable 3D correspondence learning. Inference requires only RGB input. On ScanNet++'s Selected Subset, 3AM achieves 90.6% IoU and 71.7% Positive IoU, improving over state-of-the-art VOS methods by +15.9 and +30.4 points respectively.
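The IoU figures reported for 3AM are mask-overlap scores. As a minimal sketch of how such a number can be computed, the code below averages per-frame IoU over a clip. The source does not define Positive IoU, so the `positive_only` flag is an assumption: it averages only over frames whose ground-truth mask is non-empty. The function names and the nested-list mask format are illustrative and not taken from the paper.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equal-shaped nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, gts, positive_only=False):
    """Average per-frame IoU; optionally skip frames with an empty ground truth.

    positive_only=True is a hypothetical reading of "Positive IoU".
    """
    scores = []
    for pred, gt in zip(preds, gts):
        gt_is_empty = not any(any(row) for row in gt)
        if positive_only and gt_is_empty:
            continue
        scores.append(mask_iou(pred, gt))
    return sum(scores) / len(scores) if scores else float("nan")
```

Under this reading, frames where the object is absent can inflate plain IoU (an empty prediction scores 1.0), which is one reason a metric restricted to positive frames would be reported separately.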

Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System

Authors: Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang

First: 2026-01-13T18:59:17+00:00 · Latest: 2026-01-13T18:59:17+00:00

Comments: In submission. The first two authors contributed equally

Abstract

In this work, we explore Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.

Summary

This study investigates the dynamics of Large Language Model (LLM) agent reviewers in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. The study compares a baseline setting with conditions that add Elo ratings and reviewer memory. Key findings are that incorporating Elo improves Area Chair decision accuracy, and that reviewers develop adaptive review strategies that exploit the Elo system without improving review effort. The simulation code is available on GitHub.
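The abstract does not say how reviewer ratings are updated, so the following is only the textbook two-player Elo rule, shown to make the term concrete. The K-factor of 32, the 400-point scale, and the idea that one reviewer "wins" a comparison (for example, as judged by the Area Chair) are assumptions for illustration, not details from the paper.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Because the update is zero-sum, a reviewer can gain rating only at another reviewer's expense, which is the kind of incentive an adaptive agent might learn to exploit.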

Motion Attribution for Video Generation

Authors: Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine

First: 2026-01-13T18:59:09+00:00 · Latest: 2026-01-13T18:59:09+00:00

Comments: See the project website at https://research.nvidia.com/labs/sil/projects/MOTIVE/

Abstract

Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.

Summary

The research aims to understand how data influences motion in video generation models. It introduces Motive, a motion-centric, gradient-based data attribution framework, to study which fine-tuning clips improve or degrade temporal dynamics. By isolating temporal dynamics from static appearance with motion-weighted loss masks, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. Fine-tuning on Motive-selected high-influence data improves motion smoothness and dynamic degree on VBench and achieves a 74.1% human preference win rate against the pretrained base model. The authors describe it as the first framework to attribute motion rather than visual appearance in video generative models.
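To make the two ingredients named in the abstract concrete, here is a rough sketch of a motion-weighted loss and a gradient-based influence score. Everything specific is an assumption: the absolute frame difference stands in for a real motion estimate, the loss is plain squared error on flattened frames, and influence is a first-order gradient dot product. The abstract does not give Motive's actual formulation.

```python
def motion_weights(prev_frame, next_frame, eps=1e-8):
    """Per-pixel weights proportional to absolute frame difference.

    Frames are flat lists of pixel intensities. The frame difference is a
    crude, hypothetical proxy for motion magnitude.
    """
    diffs = [abs(b - a) for a, b in zip(prev_frame, next_frame)]
    total = sum(diffs)
    if total < eps:
        # Static clip: fall back to uniform weights.
        return [1.0 / len(diffs)] * len(diffs)
    return [d / total for d in diffs]


def motion_weighted_mse(pred, target, weights):
    """Squared error where moving pixels contribute more than static ones."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))


def influence(grad_train, grad_query):
    """First-order influence proxy: dot product of two gradient vectors."""
    return sum(a * b for a, b in zip(grad_train, grad_query))
```

In this sketch, gradients of the motion-weighted loss (rather than the ordinary loss) would be fed to `influence`, so that a training clip scores highly only when it moves the model in a direction that matters for moving regions.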

SemiETPicker: Fast and Label-Efficient Particle Picking for CryoET Tomography Using Semi-Supervised Learning

Authors: Linhan Wang, Jianwen Dou, Wang Li, Shengkun Wang, Zhiwu Xie, Chang-Tien Lu, Yinlin Chen

Venue: ISBI

First: 2025-10-25T23:09:22+00:00 · Latest: 2026-01-13T18:57:24+00:00

Comments: IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Abstract

Cryogenic Electron Tomography (CryoET) combined with sub-volume averaging (SVA) is the only imaging modality capable of resolving protein structures inside cells at molecular resolution. Particle picking, the task of localizing and classifying target proteins in 3D CryoET volumes, remains the main bottleneck. Due to the reliance on time-consuming manual labels, the vast reserve of unlabeled tomograms remains underutilized. In this work, we present a fast, label-efficient semi-supervised framework that exploits this untapped data. Our framework consists of two components: (i) an end-to-end heatmap-supervised detection model inspired by keypoint detection, and (ii) a teacher-student co-training mechanism that enhances performance under sparse labeling conditions. Furthermore, we introduce multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation strategy to further boost performance. Extensive evaluations on the large-scale CZII dataset show that our approach improves F1 by 10% over supervised baselines, underscoring the promise of semi-supervised learning for leveraging unlabeled CryoET data.

Summary

SemiETPicker is a fast, label-efficient semi-supervised framework for particle picking in 3D CryoET volumes that addresses the manual-labeling bottleneck by using both labeled and unlabeled tomograms. It combines an end-to-end heatmap-supervised detection model with a teacher-student co-training mechanism, and adds multi-view pseudo-labeling and a CryoET-specific DropBlock augmentation. Evaluations on the CZII dataset show a 10% F1 improvement over supervised baselines, highlighting the potential of semi-supervised learning for unlabeled CryoET data.
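Heatmap supervision, as used in keypoint detection, typically trains the network to regress a volume that peaks at each annotated center. The sketch below builds such a target for a 3D volume. The Gaussian shape, the `sigma` value, and taking the maximum where blobs overlap are common conventions assumed here for illustration; the abstract does not specify them.

```python
import math


def gaussian_heatmap_3d(shape, centers, sigma=1.0):
    """Build a (depth, height, width) heatmap with a Gaussian blob per center.

    shape   -- (depth, height, width) of the output volume
    centers -- iterable of (z, y, x) particle coordinates
    Overlapping blobs are combined with max so each peak stays at 1.0.
    """
    depth, height, width = shape
    volume = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    for cz, cy, cx in centers:
        for z in range(depth):
            for y in range(height):
                for x in range(width):
                    dist_sq = (z - cz) ** 2 + (y - cy) ** 2 + (x - cx) ** 2
                    value = math.exp(-dist_sq / (2.0 * sigma ** 2))
                    volume[z][y][x] = max(volume[z][y][x], value)
    return volume
```

At inference time, local maxima of the predicted heatmap would be read off as particle locations, which is what makes the detector end-to-end.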

FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation

Authors: Zhifeng Xie, Keyi Zhang, Yiye Yan, Yuling Guo, Fan Yang, Jiting Zhou, Mengtian Li

First: 2025-11-24T14:00:40+00:00 · Latest: 2026-01-13T18:51:21+00:00

Abstract

Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. On the other hand, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.

Summary

FilmSceneDesigner is an automated system that generates film scenes from natural language descriptions, addressing the labor-intensive nature of traditional film set design. An agent-based chaining framework produces structured parameters, and a procedural generation pipeline uses them to generate floorplans and structures, assign materials, place doors and windows, and retrieve and lay out objects. The system draws on SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations indicate that the generated scenes are structurally sound and cinematically faithful, supporting downstream tasks such as virtual previs, construction drawing, and mood board creation.

MemRec: Collaborative Memory-Augmented Agentic Recommender System

Authors: Weixin Chen, Yuhan Zhao, Jingyuan Huang, Zihe Ye, Clark Mingxuan Ju, Tong Zhao, Neil Shah, Li Chen, Yongfeng Zhang

First: 2026-01-13T18:51:16+00:00 · Latest: 2026-01-13T18:51:16+00:00

Abstract

The evolution of recommender systems has shifted preference storage from rating matrices and dense embeddings to semantic memory in the agentic era. Yet existing agents rely on isolated memory, overlooking crucial collaborative signals. Bridging this gap is hindered by the dual challenges of distilling vast graph contexts without overwhelming reasoning agents with cognitive load, and evolving the collaborative memory efficiently without incurring prohibitive computational costs. To address this, we propose MemRec, a framework that architecturally decouples reasoning from memory management to enable efficient collaborative augmentation. MemRec introduces a dedicated, cost-effective LM_Mem to manage a dynamic collaborative memory graph, serving synthesized, high-signal context to a downstream LLM_Rec. The framework operates via a practical pipeline featuring efficient retrieval and cost-effective asynchronous graph propagation that evolves memory in the background. Extensive experiments on four benchmarks demonstrate that MemRec achieves state-of-the-art performance. Furthermore, architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy through support for diverse deployments, including local open-source models. Code: https://github.com/rutgerswiselab/memrec and Homepage: https://memrec.weixinchen.com

Summary

MemRec is a collaborative memory-augmented agentic recommender system that addresses two challenges: distilling large graph contexts without overloading the reasoning agent, and evolving collaborative memory efficiently. It decouples reasoning from memory management, using a dedicated, cost-effective LM_Mem to manage a dynamic collaborative memory graph and serve synthesized context to a downstream LLM_Rec. Experiments on four benchmarks show that MemRec achieves state-of-the-art performance, and architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy.

Reasoning Matters for 3D Visual Grounding

Authors: Hsiang-Wei Huang, Kuang-Ming Chen, Wenhao Chai, Cheng-Yen Yang, Jen-Hao Cheng, Jenq-Neng Hwang

Venue: CVPR

First: 2026-01-13T18:48:41+00:00 · Latest: 2026-01-13T18:48:41+00:00

Comments: 2025 CVPR Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

Abstract

The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most current methods incorporate a text encoder and a visual feature encoder to generate fused cross-modal features and predict the referred object. These models often require supervised training on extensive 3D annotation data. On the other hand, recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and disproportionate to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.

Summary

This work proposes a data pipeline that automatically synthesizes 3D visual grounding data together with the corresponding reasoning process. The generated data is used to fine-tune an LLM, yielding Reason3DVG-8B, which outperforms the previous LLM-based method 3D-GRAND while using only 1.6% of its training data, highlighting the importance of reasoning in 3D visual grounding.

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Authors: Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu

First: 2026-01-13T18:48:00+00:00 · Latest: 2026-01-13T18:48:00+00:00

Comments: 21 pages. Code available at https://github.com/GMLR-Penn/Multiplex-Thinking

Abstract

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.

Summary

This paper introduces Multiplex Thinking, a stochastic soft reasoning mechanism for large language models. At each thinking step, it samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token, which represents a distribution over plausible next steps while preserving the vocabulary embedding prior. The mechanism is self-adaptive: the multiplex token is nearly discrete when the model is confident and compactly represents several plausible next steps when it is uncertain. On challenging math reasoning benchmarks, Multiplex Thinking outperforms strong discrete Chain-of-Thought and reinforcement learning baselines from Pass@1 through Pass@1024 while producing shorter sequences.
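The branch-and-merge step can be illustrated with a small sketch: draw K token ids from the next-token distribution, then merge their embeddings into one vector. The abstract does not say how the embeddings are aggregated, so the uniform average below is an assumption, as are sampling with replacement and the function name `multiplex_token`.

```python
import random


def multiplex_token(probs, embeddings, k, rng):
    """Sample k token ids from probs and average their embeddings.

    probs      -- next-token probabilities, one per vocabulary entry
    embeddings -- list of embedding vectors, one per vocabulary entry
    rng        -- a random.Random instance, passed in for reproducibility
    Returns (sampled_ids, merged_embedding). Averaging is an assumed
    aggregation rule, not necessarily the paper's.
    """
    token_ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [0.0] * dim
    for token_id in token_ids:
        for i in range(dim):
            merged[i] += embeddings[token_id][i] / k
    return token_ids, merged
```

The self-adaptive behavior described in the abstract falls out of this construction: with a peaked distribution all K samples tend to be the same token, so the merged vector is close to that token's ordinary embedding, while a flat distribution yields a blend of several candidates in a single sequence position.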

S3-CLIP: Video Super Resolution for Person-ReID

Authors: Tamas Endrei, Gyorgy Cserey

First: 2026-01-13T18:46:37+00:00 · Latest: 2026-01-13T18:46:37+00:00

Comments: Accepted to the 2026 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), VReID-XFD Challenge

Abstract

Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.

Summary

This paper treats tracklet quality as a limiting factor in person re-identification and introduces S3-CLIP, a video super-resolution-based CLIP-ReID framework. The method combines super-resolution networks with task-driven super-resolution pipelines to enhance tracklet quality, especially under challenging cross-view conditions. S3-CLIP performs competitively with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios, and in the ground-to-aerial setting it improves Rank-1, Rank-5, and Rank-10 by 11.24%, 13.48%, and 17.98% respectively.
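For readers unfamiliar with the ReID metrics quoted above, the sketch below shows the usual per-query definitions of Rank-k and average precision on a gallery already sorted by similarity. mAP is the mean of the per-query average precision. The function names and the identity-list input format are illustrative; the challenge's exact evaluation protocol (for example, any camera or view filtering) is not described in the source and is not modeled here.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a gallery item with the query's identity appears in the top k."""
    return any(gallery_id == query_id for gallery_id in ranked_ids[:k])


def average_precision(ranked_ids, query_id):
    """Average precision for one query over a ranked gallery identity list."""
    hits = 0
    precisions = []
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            # Precision at this cut-off, recorded at each correct match.
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """rankings is a list of (ranked_ids, query_id) pairs, one per query."""
    scores = [average_precision(ids, qid) for ids, qid in rankings]
    return sum(scores) / len(scores) if scores else 0.0
```

Rank-k rewards getting at least one correct match near the top, while mAP rewards ranking all correct matches highly, which is why a method can gain substantially on Rank-k while staying close to the baseline on mAP.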

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Authors: Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su

First: 2025-12-08T11:12:39+00:00 · Latest: 2026-01-13T18:44:27+00:00

Abstract

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, and 900 multi-hop QA tasks from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment integrating multiple tools for LRM interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 35.60% correctness, and most models have issues with completeness (average 60.32%) and faithfulness (average 30.72%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at https://localsearchbench.github.io/.

Summary

LocalSearchBench is a benchmark for agentic search in local life services, where queries are often ambiguous and require multi-hop reasoning across merchants and products. It includes a database of over 1.3 million merchant entries and 900 multi-hop QA tasks drawn from real user queries. Experiments show that even state-of-the-art large reasoning models struggle: the best model achieves only 35.60% correctness, and most models have issues with completeness and faithfulness. This underscores the need for specialized benchmarks and domain-specific agent training in local life services. Code, benchmark, and leaderboard are available at https://localsearchbench.github.io/.

APEX-SWE

Authors: Abhi Kottamasu, Akul Datta, Aakash Barthwal, Chirag Mahapatra, Ajay Arun, Adarsh Hiremath, Brendan Foody, Bertie Vidgen

First: 2026-01-13T18:44:08+00:00 · Latest: 2026-01-13T18:44:08+00:00

Abstract

We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).

Summary

APEX-SWE is a benchmark for evaluating whether frontier AI models can perform economically valuable software engineering work, comprising 100 Integration tasks and 100 Observability tasks. Eight frontier models were evaluated, with Gemini 3 Pro (Thinking = High) performing best at a Pass@1 score of 25%. Strong performance is attributed primarily to epistemic reasoning, the ability to distinguish assumptions from verified facts, combined with the agency to resolve uncertainty before acting. The evaluation harness and a dev set (n=50) are open-sourced.
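The abstract reports Pass@1 without saying how it is estimated. A widely used choice in code-generation evaluation is the unbiased pass@k estimator, which takes n sampled attempts per task, c of which pass, and computes the probability that at least one of k randomly chosen attempts passes. The sketch below assumes that estimator; whether APEX-SWE uses it is not stated in the source.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: 1 - C(n - c, k) / C(n, k).

    n -- number of attempts sampled for the task
    c -- number of those attempts that passed
    k -- budget of attempts being scored (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, so a benchmark-level Pass@1 is simply the mean per-task success rate.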

Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning

Authors: Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li

First: 2026-01-12T17:45:31+00:00 · Latest: 2026-01-13T18:39:13+00:00

Abstract

Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor's algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, an RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.

Summary / 总结

Free-RBF-KAN is an RBF-based Kolmogorov-Arnold Network (KAN) that uses adaptive learning grids and trainable smoothness to close the accuracy gap between standard RBF-KANs and the original B-spline-based KAN. It learns RBF shapes freely so that the grid aligns with activation patterns, giving expressive and adaptive function approximation. Experiments show accuracy comparable to the original B-spline-based KAN with faster training and inference, which makes it attractive for high-dimensional structured modeling tasks.

Free-RBF-KAN 是一种基于 RBF 的 Kolmogorov-Arnold 网络，通过自适应学习网格和可训练的平滑度来提高计算效率和准确性。它动态调整 RBF 形状以与激活模式对齐，相比传统的 B-样条基 KAN，提供更快的训练和推理速度。实验表明，Free-RBF-KAN 在保持同等准确性的前提下更为高效，特别适用于高维结构化建模任务。
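
As a rough illustration of the idea described above, the following sketch shows the forward pass of a single RBF-based KAN layer in which both the grid points and the kernel widths are ordinary trainable arrays. It is a minimal sketch, not the authors' implementation: the Gaussian kernel, the log-width parameterization of "smoothness", and the names `rbf_kan_layer`, `centers`, `log_widths`, and `coeffs` are all assumptions made for this example.

```python
import numpy as np

def rbf_kan_layer(x, centers, log_widths, coeffs):
    """Forward pass of one RBF-based KAN layer (illustrative sketch only).

    x:          (batch, d_in) inputs
    centers:    (d_in, n_basis) grid points, treated as free parameters
    log_widths: (d_in, n_basis) log of the kernel widths (stand-in for smoothness)
    coeffs:     (d_in, n_basis, d_out) mixing coefficients of the edge functions
    """
    widths = np.exp(log_widths)                       # exponentiate so widths stay positive
    z = (x[:, :, None] - centers[None]) / widths[None]
    phi = np.exp(-z ** 2)                             # Gaussian RBF, shape (batch, d_in, n_basis)
    # Each output is a sum of univariate edge functions over all inputs.
    return np.einsum("bik,iko->bo", phi, coeffs)

# Hypothetical sizes and random parameters, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_in, n_basis, d_out = 2, 5, 3
x = rng.normal(size=(4, d_in))
centers = np.tile(np.linspace(-1.0, 1.0, n_basis), (d_in, 1))
log_widths = np.zeros((d_in, n_basis))
coeffs = rng.normal(size=(d_in, n_basis, d_out))
y = rbf_kan_layer(x, centers, log_widths, coeffs)     # shape (4, 3)
```

In a training loop, `centers` and `log_widths` would receive gradients alongside `coeffs`, which is the sense in which the grid and the smoothness are learned jointly with the weights.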

DentalX: Context-Aware Dental Disease Detection with Radiographs

Authors: Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li

Venue: ISBI 2026

First: 2026-01-13T18:32:28+00:00 · Latest: 2026-01-13T18:32:28+00:00

Comments: Accepted at ISBI 2026

Abstract

Diagnosing dental diseases from radiographs is time-consuming and challenging due to the subtle nature of diagnostic evidence. Existing methods, which rely on object detection models designed for natural images with more distinct target patterns, struggle to detect dental diseases that present with far less visual support. To address this challenge, we propose DentalX, a novel context-aware dental disease detection approach that leverages oral structure information to mitigate the visual ambiguity inherent in radiographs. Specifically, we introduce a structural context extraction module that learns an auxiliary task: semantic segmentation of dental anatomy. The module extracts meaningful structural context and integrates it into the primary disease detection task to enhance the detection of subtle dental diseases. Extensive experiments on a dedicated benchmark demonstrate that DentalX significantly outperforms prior methods in both tasks. This mutual benefit arises naturally during model optimization, as the correlation between the two tasks is effectively captured. Our code is available at https://github.com/zhiqin1998/DentYOLOX.

中文标题/摘要

标题：DentalX：基于口腔结构信息的放射影像牙科疾病检测

从放射影像诊断牙科疾病耗时且具有挑战性，因为诊断证据十分细微。现有方法依赖于为目标模式更为明显的自然图像而设计的目标检测模型，难以检测视觉依据少得多的牙科疾病。为解决这一挑战，我们提出了DentalX，一种新型的上下文感知牙科疾病检测方法，利用口腔结构信息减轻放射影像中固有的视觉模糊性。具体而言，我们引入了一个结构上下文提取模块，学习一个辅助任务：牙科解剖结构的语义分割。该模块提取有意义的结构上下文，并将其整合到主要的疾病检测任务中，以增强对细微牙科疾病的检测。在专用基准上的广泛实验表明，DentalX在两个任务上均显著优于先前的方法。这种互惠互利在模型优化过程中自然产生，因为两个任务之间的相关性被有效捕捉。我们的代码可在 https://github.com/zhiqin1998/DentYOLOX 获取。

Summary / 总结

DentalX is a context-aware dental disease detection approach that leverages oral structure information to improve the detection of subtle dental diseases from radiographs. It introduces a structural context extraction module for semantic segmentation of dental anatomy, which enhances the primary disease detection task. Experiments show that DentalX outperforms previous methods in both tasks, addressing the visual ambiguity in radiographs.

DentalX 提出了一种利用口腔结构信息的上下文感知方法来诊断牙科疾病，通过引入用于牙齿解剖学语义分割的结构上下文提取模块，增强对细微疾病的检测。实验表明，DentalX 在两个任务上均优于先前方法，优化过程中自然捕捉到了两个任务之间的关联性。

DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Authors: Dongxu Liu, Jiahui Zhu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao

First: 2025-06-11T12:01:03+00:00 · Latest: 2026-01-13T18:26:39+00:00

Abstract

Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, addressing the training instability caused by GAN remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with Diffusion Models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.

中文标题/摘要

标题：DGAE：基于扩散的自动编码器以实现高效的潜在表示学习

自动编码器通过视觉标记化将像素压缩到潜在空间中，从而增强最先进的图像和视频生成模型。尽管最近的进步在高压缩比下缓解了自动编码器的性能下降，但由GAN引起的训练不稳定性仍然是一个开放的挑战。在提高空间压缩的同时，我们还旨在最小化潜在空间的维度，以实现更高效和紧凑的表示。为了解决这些挑战，我们专注于提高解码器的表达能力。具体而言，我们提出了DGAE，它使用扩散模型来引导解码器恢复从潜在表示中未完全解码的信息信号。通过这种设计，DGAE有效地缓解了在高空间压缩率下的性能下降。同时，DGAE实现了2倍更小的潜在空间的最先进性能。当与扩散模型结合使用时，DGAE在ImageNet-1K上的图像生成任务中表现出竞争力，并表明这种紧凑的潜在表示有助于扩散模型的更快收敛。

Summary / 总结

The research aims to improve the performance and efficiency of autoencoders in image and video generative models by addressing training instability and high compression ratios. The proposed DGAE method uses a diffusion model to enhance the decoder's expressiveness, leading to better recovery of informative signals from the latent space. This approach mitigates performance degradation under high compression and achieves state-of-the-art results with a 2x smaller latent space. When integrated with diffusion models, DGAE shows competitive performance on ImageNet-1K image generation and facilitates faster convergence of the diffusion model.

研究旨在通过解决训练不稳定性和高压缩率问题，提高图像和视频生成模型中自动编码器的性能和效率。所提出的DGAE方法利用扩散模型增强解码器的表达能力，从而更好地从潜在空间恢复信息信号。这种方法在高压缩率下减轻了性能下降，并实现了2倍更小的潜在空间的最优结果。当与扩散模型结合时，DGAE在ImageNet-1K图像生成上表现出竞争力，并促进了扩散模型的更快收敛。

Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems

Authors: Ibrahim K. Ozaslan, Panagiotis Patrinos, Mihailo R. Jovanović

First: 2024-08-28T17:43:18+00:00 · Latest: 2026-01-13T18:25:29+00:00

Comments: 32 pages; 4 figures

Abstract

We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we develop a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of primal-dual dynamics as well as (linear) convergence of discrete-time methods such as standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed approach for parallel and distributed computing applications.

中文标题/摘要

标题：多块凸优化问题的原始对偶梯度流动力学的稳定性

我们研究了原始-对偶梯度流动力学的稳定性性质，所针对的是广义共识约束下目标函数含有多个（可能非光滑）项的复合凸优化问题。所提出的动力学基于近端增广拉格朗日函数，为ADMM提供了可行的替代方案，而ADMM在大规模多块场景下的分析和实现都面临重大挑战。与具有个别收敛保证的定制算法不同，我们开发了一种系统的方法来求解一大类具有挑战性的复合优化问题。我们利用各种结构特性，为所提出的动力学建立了全局（指数）收敛保证。我们的假设比证明原始-对偶动力学的（指数）稳定性以及离散时间方法（如标准两块和多块ADMM以及EXTRA算法）的（线性）收敛所需的假设要弱得多。最后，我们证明了某些结构假设对于指数稳定性是必要的，并提供了计算实验以展示所提出方法在并行和分布式计算应用中的便利性。

Summary / 总结

This paper investigates the stability of primal-dual gradient flow dynamics for solving composite convex optimization problems with multiple, possibly nonsmooth, terms. The method is based on the proximal augmented Lagrangian and offers a systematic approach for a broad class of optimization problems, which is more general than customized algorithms. The proposed dynamics are shown to have global exponential convergence guarantees under weaker assumptions than previous methods. The experiments demonstrate the approach's suitability for parallel and distributed computing applications.

本文研究了用于求解包含多个可能非光滑项的复合凸优化问题的原始-对偶梯度流动力学的稳定性。该方法基于近端增广拉格朗日函数，并提供了一种适用于广泛优化问题的系统方法，比定制算法更具通用性。所提出的动力学在比以前方法更弱的假设下具有全局指数收敛保证。实验表明该方法适用于并行和分布式计算应用。
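
To make the notion of primal-dual gradient flow concrete, the sketch below integrates the flow for a small equality-constrained problem with forward Euler steps. It deliberately uses the ordinary Lagrangian of a smooth, strongly convex objective, not the proximal augmented Lagrangian studied in the paper, so it illustrates only the basic descent-ascent structure. The function name `primal_dual_flow`, the step size, and the toy problem are assumptions made for this example.

```python
import numpy as np

def primal_dual_flow(grad_f, B, c, x0, y0, dt=0.05, steps=2000):
    """Forward-Euler discretization of the primal-dual gradient flow
        x' = -(grad f(x) + B^T y),   y' = B x - c
    for the problem  minimize f(x)  subject to  B x = c.
    """
    x = np.array(x0, dtype=float)
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        dx = -(grad_f(x) + B.T @ y)   # primal variable descends on the Lagrangian
        dy = B @ x - c                # dual variable ascends on the Lagrangian
        x, y = x + dt * dx, y + dt * dy
    return x, y

# Toy problem (hypothetical): minimize 0.5 * ||x - a||^2 subject to x1 + x2 = 1.
a = np.array([1.0, 2.0])
B = np.array([[1.0, 1.0]])
c = np.array([1.0])
x_star, y_star = primal_dual_flow(lambda x: x - a, B, c, x0=[0.0, 0.0], y0=[0.0])
# The iterates approach the KKT point x = [0, 1], y = [1].
```

For this toy problem the KKT point can be checked by hand: projecting `a` onto the constraint gives `x = [0, 1]`, and stationarity `x - a + B^T y = 0` then gives `y = 1`.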

Aggregating Diverse Cue Experts for AI-Generated Image Detection

Authors: Lei Tan, Shuwei Li, Mohan Kankanhalli, Robby T. Tan

Venue: AAAI 2026

First: 2026-01-13T18:23:42+00:00 · Latest: 2026-01-13T18:23:42+00:00

Comments: Accepted by AAAI 2026

Abstract

The rapid emergence of image synthesis models poses challenges to the generalization of AI-generated image detectors. However, existing methods often rely on model-specific features, leading to overfitting and poor generalization. In this paper, we introduce the Multi-Cue Aggregation Network (MCAN), a novel framework that integrates different yet complementary cues in a unified network. MCAN employs a mixture-of-encoders adapter to dynamically process these cues, enabling more adaptive and robust feature representation. Our cues include the input image itself, which represents the overall content, and high-frequency components that emphasize edge details. Additionally, we introduce a Chromatic Inconsistency (CI) cue, which normalizes intensity values and captures noise information introduced during the image acquisition process in real images, making these noise patterns more distinguishable from those in AI-generated content. Unlike prior methods, MCAN's novelty lies in its unified multi-cue aggregation framework, which integrates spatial, frequency-domain, and chromaticity-based information for enhanced representation learning. These cues are intrinsically more indicative of real images, enhancing cross-model generalization. Extensive experiments on the GenImage, Chameleon, and UniversalFakeDetect benchmarks validate the state-of-the-art performance of MCAN. On the GenImage dataset, MCAN outperforms the best state-of-the-art method by up to 7.4% in average ACC across eight different image generators.

中文标题/摘要

标题：利用多种线索专家聚合AI生成图像检测

图像合成模型的迅速发展给AI生成图像检测器的泛化带来了挑战。然而，现有方法往往依赖于特定模型的特征，导致过拟合和泛化能力差。本文提出了一种新的框架——多线索聚合网络（MCAN），该框架将不同但互补的线索整合到一个统一的网络中。MCAN采用混合编码器适配器动态处理这些线索，使特征表示更具适应性和鲁棒性。我们的线索包括输入图像本身，代表整体内容，以及高频分量，强调边缘细节。此外，我们还引入了色度不一致性（CI）线索，该线索对强度值进行归一化并捕捉真实图像在图像获取过程中引入的噪声信息，使这些噪声模式与AI生成内容中的噪声模式更加可区分。与以往方法不同，MCAN的创新之处在于其统一的多线索聚合框架，该框架结合了空间、频域和色度信息，以增强表示学习。这些线索本质上更能反映真实图像的特征，从而增强跨模型的泛化能力。在GenImage、Chameleon和UniversalFakeDetect基准测试上的大量实验验证了MCAN的最先进性能。在GenImage数据集上，MCAN在八种不同图像生成器的平均ACC上比最佳的现有方法最多高出7.4%。

Summary / 总结

This paper addresses the challenge of detecting AI-generated images by proposing the Multi-Cue Aggregation Network (MCAN), which integrates various complementary cues such as the input image, high-frequency components, and chromatic inconsistency. MCAN uses a mixture-of-encoders adapter to dynamically process these cues, leading to more adaptive and robust feature representation. Experiments on benchmark datasets show that MCAN outperforms existing methods, achieving up to 7.4% higher average ACC in the GenImage dataset compared to the best state-of-the-art method.

本文提出了Multi-Cue Aggregation Network (MCAN)，通过整合输入图像、高频分量和Chromatic Inconsistency (CI) 等多种线索来改进特征表示。MCAN 使用混合编码器适配器动态处理这些线索，从而在不同图像生成器之间实现更好的泛化。在 GenImage、Chameleon 和 UniversalFakeDetect 数据集上的实验表明，MCAN 的性能优于现有方法，最高可比最佳现有方法高出 7.4% 的平均 ACC。

The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang

First: 2026-01-09T18:39:01+00:00 · Latest: 2026-01-13T18:21:01+00:00

Comments: Preprint

Abstract

Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that, in a unified view, effective and learnable Long CoT trajectories feature stable molecular-like structures formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

中文标题/摘要

标题：思维的分子结构：长链推理拓扑映射

大型语言模型（LLMs）往往难以从人类或非长链推理（Long CoT）LLMs模仿中学习有效的长链推理。为了理解这一点，我们提出，有效的可学习的长链推理轨迹具有在统一视图中形成的稳定分子状结构，这些结构由三种交互类型组成：深度推理（共价键状）、自我反思（氢键状）和自我探索（范德华力状）。对提炼轨迹的分析表明，这些结构源自长链推理微调，而非关键词模仿。我们引入了有效语义异构体，并表明只有促进快速熵收敛的键支持稳定的长链推理学习，而结构竞争会损害训练。基于这些发现，我们提出了Mole-Syn方法，这是一种分布转移图方法，用于指导有效长链推理结构的合成，从而在基准测试中提升性能和强化学习稳定性。

Summary / 总结

The research aims to understand why large language models struggle with learning effective long chain-of-thought reasoning. It proposes that stable molecular-like structures, formed by three interaction types (Deep-Reasoning, Self-Reflection, and Self-Exploration), are key to effective Long CoT learning. The study finds that these structures emerge from Long CoT fine-tuning rather than keyword imitation. Mole-Syn, a method that guides the synthesis of these effective structures, is introduced to boost performance and reinforcement learning stability across benchmarks.

研究旨在理解大型语言模型为何难以学习有效的长链推理。研究提出，长链推理轨迹中的稳定分子结构，由三种类型的相互作用（深层推理、自我反思和自我探索）形成，是有效学习的关键。分析表明，这些结构来自长链推理微调，而非关键词模仿。研究引入了Mole-Syn方法，指导有效长链推理结构的合成，提升跨基准的性能和强化学习稳定性。

SafePro: Evaluating the Safety of Professional-Level AI Agents

Authors: Kaiwen Zhou, Shreedhar Jangam, Ashwin Nagarajan, Tejas Polu, Suhas Oruganti, Chengzhi Liu, Ching-Chen Kuo, Yuting Zheng, Sravana Narayanaraju, Xin Eric Wang

First: 2026-01-10T19:53:09+00:00 · Latest: 2026-01-13T18:20:33+00:00

Abstract

Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce SafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.

中文标题/摘要

标题：SafePro：评估专业级AI代理的安全性

基于大型语言模型的代理正在迅速从简单的对话助手演变为能够在各种领域执行复杂专业任务的自主系统。尽管这些进步有望带来显著的生产率提升，但也引入了关键的安全风险，这些风险目前尚未得到充分探索。现有的安全性评估主要集中在简单的日常辅助任务上，未能捕捉到专业环境中复杂的决策过程以及未对齐行为的潜在后果。为了解决这一差距，我们引入了SafePro，这是一个全面的基准测试，旨在评估执行专业活动的AI代理的安全对齐情况。SafePro 包含了一个跨多种专业领域的高复杂度任务数据集，这些任务具有安全风险，并通过严格的迭代创建和审查过程开发。我们对最先进的AI模型的评估揭示了显著的安全漏洞，并在专业环境中发现了新的不安全行为。我们进一步表明，这些模型在执行复杂专业任务时表现出安全判断不足和安全对齐薄弱。此外，我们还研究了提高这些场景中代理安全性的安全缓解策略，并观察到令人鼓舞的改进。我们的研究结果共同强调了为下一代专业AI代理量身定制的稳健安全机制的迫切需求。

Summary / 总结

SafePro is a benchmark designed to evaluate the safety alignment of AI agents performing professional tasks. It addresses the lack of safety evaluations for complex, professional-level tasks by featuring a dataset of high-complexity tasks across various domains. The evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and unsafe behaviors in professional contexts, indicating insufficient safety judgment and weak safety alignment. The study also explores safety mitigation strategies, showing promising improvements. This work underscores the need for robust safety mechanisms for professional AI agents.

SafePro 是一个基准，旨在评估 AI 代理在执行专业任务时的安全对齐情况。它弥补了现有安全评估的不足，重点关注复杂的高风险场景。对最先进的 AI 模型的评估揭示了显著的安全漏洞和专业情境中新的不安全行为。研究还探讨了安全缓解策略，显示出代理安全性的改进。这项工作强调了为专业 AI 代理设计稳健安全机制的迫切需求。

Uncovering Political Bias in Large Language Models using Parliamentary Voting Records

Authors: Jieying Chen, Karen de Jong, Andreas Poole, Jan Burakowski, Elena Elderson Nosti, Joep Windt, Chendi Wang

First: 2026-01-13T18:18:25+00:00 · Latest: 2026-01-13T18:18:25+00:00

Abstract

As large language models (LLMs) become deeply embedded in digital platforms and decision-making systems, concerns about their political biases have grown. While substantial work has examined social biases such as gender and race, systematic studies of political bias remain limited, despite their direct societal impact. This paper introduces a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records. We instantiate this methodology in three national case studies: PoliBiasNL (2,701 Dutch parliamentary motions and votes from 15 political parties), PoliBiasNO (10,584 motions and votes from 9 Norwegian parties), and PoliBiasES (2,480 motions and votes from 10 Spanish parties). Across these benchmarks, we assess ideological tendencies and political entity bias in LLM behavior. As part of our evaluation framework, we also propose a method to visualize the ideology of LLMs and political parties in a shared two-dimensional CHES (Chapel Hill Expert Survey) space by linking their voting-based positions to the CHES dimensions, enabling direct and interpretable comparisons between models and real-world political actors. Our experiments reveal fine-grained ideological distinctions: state-of-the-art LLMs consistently display left-leaning or centrist tendencies, alongside clear negative biases toward right-conservative parties. These findings highlight the value of transparent, cross-national evaluation grounded in real parliamentary behavior for understanding and auditing political bias in modern LLMs.

中文标题/摘要

标题：利用议会投票记录揭示大型语言模型中的政治偏见

随着大型语言模型（LLMs）在数字平台和决策系统中的深入应用，对其政治偏见的担忧日益增加。尽管已经进行了大量关于社会偏见（如性别和种族）的研究，但对政治偏见的系统性研究仍然有限，尽管它们对社会有直接影响。本文介绍了一种通过将模型生成的投票预测与经过核实的议会投票记录对齐来构建政治偏见基准的一般方法。我们在三个国家案例研究中实例化了这种方法：PoliBiasNL（15个政党2,701份荷兰议会动议和投票记录）、PoliBiasNO（9个政党10,584份挪威动议和投票记录）和PoliBiasES（10个政党2,480份西班牙动议和投票记录）。在这些基准中，我们评估了LLM行为中的意识形态倾向和政治实体偏见。作为评估框架的一部分，我们还提出了一种方法，通过将基于投票的位置与CHES（教堂山专家调查，Chapel Hill Expert Survey）维度链接起来，在共享的二维CHES空间中可视化LLM和政党的意识形态，从而实现模型与现实世界政治行为者的直接和可解释的比较。我们的实验揭示了细微的意识形态差异：最先进的LLMs一致表现出左倾或中间派倾向，并且对右翼保守政党有明显的负面偏见。这些发现突显了基于实际议会行为进行透明、跨国评估对于理解并审计现代LLM中的政治偏见的价值。

Summary / 总结

This paper addresses the growing concern about political biases in large language models (LLMs) by developing a methodology to align model-generated voting predictions with parliamentary voting records. The study evaluates ideological tendencies and political entity bias in LLMs across three national case studies: PoliBiasNL, PoliBiasNO, and PoliBiasES. The experiments show that state-of-the-art LLMs exhibit left-leaning or centrist tendencies and have clear negative biases towards right-conservative parties, emphasizing the need for transparent, cross-national evaluations to understand and audit political bias in LLMs.

该研究通过将模型生成的投票预测与议会投票记录对齐的方法，来解决大型语言模型（LLM）中日益增长的政治偏见问题。研究在三个国家案例研究中评估了LLM的意识形态倾向和政治实体偏见：PoliBiasNL、PoliBiasNO和PoliBiasES。研究发现，最先进的LLM表现出左倾或中间派的倾向，并对右翼保守政党表现出明显的负面偏见，强调了进行透明的跨国评估以理解LLM中的政治偏见的重要性。
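
The core comparison described above, aligning a model's predicted votes with recorded party votes, can be sketched as a per-party agreement rate. This is a minimal sketch under stated assumptions: the function name `agreement_rates`, the vote labels, the motion identifiers, and the party names are all hypothetical, and the paper's actual scoring and its projection into CHES space are not reproduced here.

```python
def agreement_rates(model_votes, party_votes):
    """Fraction of shared motions on which the model's vote matches each party's vote.

    model_votes: dict mapping motion id -> vote label predicted by the model
    party_votes: dict mapping party name -> dict of motion id -> recorded vote label
    """
    rates = {}
    for party, votes in party_votes.items():
        shared = [motion for motion in votes if motion in model_votes]
        if not shared:
            continue  # no overlap, so no rate can be computed for this party
        matches = sum(model_votes[motion] == votes[motion] for motion in shared)
        rates[party] = matches / len(shared)
    return rates

# Hypothetical toy data, invented purely for illustration.
model_votes = {"m1": "for", "m2": "against", "m3": "for", "m4": "against"}
party_votes = {
    "Party A": {"m1": "for", "m2": "against", "m3": "for", "m4": "for"},
    "Party B": {"m1": "against", "m2": "for", "m3": "for"},
}
rates = agreement_rates(model_votes, party_votes)
# Party A agrees on 3 of 4 motions, Party B on 1 of 3.
```

A systematically higher agreement rate with one bloc of parties than another is the kind of signal such a benchmark would surface as an ideological tendency.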

On the use of graph models to achieve individual and group fairness

Authors: Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo

First: 2026-01-13T18:17:43+00:00 · Latest: 2026-01-13T18:17:43+00:00

Comments: 75 pages, 46 figures

Abstract

Machine Learning algorithms are ubiquitous in key decision-making contexts such as justice, healthcare and finance, which has spawned a great demand for fairness in these procedures. However, the theoretical properties of such models in relation to fairness are still poorly understood, and the intuition behind the relationship between group and individual fairness is still lacking. In this paper, we provide a theoretical framework based on Sheaf Diffusion to leverage tools based on dynamical systems and homology to model fairness. Concretely, the proposed method projects input data into a bias-free space that encodes fairness constraints, resulting in fair solutions. Furthermore, we present a collection of network topologies handling different fairness metrics, leading to a unified method capable of dealing with both individual and group bias. The resulting models have a layer of interpretability in the form of closed-form expressions for their SHAP values, consolidating their place in the responsible Artificial Intelligence landscape. Finally, these intuitions are tested on a simulation study and standard fairness benchmarks, where the proposed methods achieve satisfactory results. More concretely, the paper showcases the performance of the proposed models in terms of accuracy and fairness, studying available trade-offs on the Pareto frontier, checking the effects of changing the different hyper-parameters, and delving into the interpretation of its outputs.

中文标题/摘要

标题：关于使用图模型实现个体和群体公平性的研究

机器学习算法在司法、医疗和金融等关键决策领域无处不在，这催生了对这些程序中公平性的巨大需求。然而，关于此类模型与公平性理论性质之间的关系仍然知之甚少，群体公平性和个体公平性之间关系的直觉也尚未建立。在本文中，我们基于Sheaf Diffusion提供了一个理论框架，利用动力系统和同调理论的工具来建模公平性。具体而言，所提出的方法将输入数据投影到一个无偏见的空间中，该空间编码了公平性约束，从而产生公平的解决方案。此外，我们还提供了一组处理不同公平性指标的网络拓扑结构，从而形成了一种能够同时处理个体和群体偏见的统一方法。所得到的模型具有解释性的一层，表现为SHAP值的闭式表达式，巩固了其在负责任的人工智能领域的地位。最后，这些直觉在模拟研究和标准公平性基准测试中进行了测试，所提出的方法在准确性和公平性方面取得了令人满意的结果。更具体地说，本文展示了所提出模型在准确性和公平性方面的性能，研究了帕累托前沿上的可用权衡，检查了改变不同超参数的影响，并深入探讨了其输出的解释。

Summary / 总结

This paper aims to enhance fairness in machine learning models used in critical decision-making areas like justice, healthcare, and finance. It introduces a theoretical framework based on Sheaf Diffusion to model fairness using graph models. The method projects input data into a bias-free space, enabling the handling of both individual and group fairness. Experimental results on simulation studies and standard fairness benchmarks demonstrate satisfactory performance in terms of accuracy and fairness, with interpretable SHAP values.

本文旨在提高在司法、医疗和金融等关键决策领域中使用的机器学习模型的公平性。它引入了一个基于Sheaf Diffusion的理论框架来建模公平性，将输入数据投影到一个无偏见的空间中，该空间编码了公平性约束。该方法使用网络拓扑来处理各种公平性指标，提供了一种同时处理个体和群体公平性的统一方法。实验结果表明，这些模型在准确性和公平性方面的表现良好，具有可解释的SHAP值。

Fast and explainable clustering in the Manhattan and Tanimoto distance

Authors: Stefan Güttel, Kaustubh Roy

First: 2026-01-13T18:14:03+00:00 · Latest: 2026-01-13T18:14:03+00:00

Abstract

The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor-Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.

中文标题/摘要

标题：基于曼哈顿距离和Tanimoto距离的快速且可解释的聚类

CLASSIX算法是一种快速且可解释的数据聚类方法。在原始形式中，该算法利用按第一主成分对数据点排序来截断对邻近数据点的搜索，其中邻近性由欧几里得距离定义。在这里，我们将CLASSIX扩展到其他距离度量，包括曼哈顿距离和Tanimoto距离。我们不再使用主成分，而是以数据向量的适当范数作为排序标准，并结合三角不等式来终止搜索。对于Tanimoto距离，我们使用一个可证明更紧的交集不等式来进一步提升新算法的性能。在真实的化学指纹基准测试中，CLASSIX Tanimoto比Taylor-Butina算法快约30倍，比DBSCAN快约80倍，同时在两种情况下都得到更高质量的聚类。

Summary / 总结

The research extends the CLASSIX algorithm, a fast and explainable clustering method, from the Euclidean distance to the Manhattan and Tanimoto distances. The method sorts data points by an appropriate norm and applies the triangle inequality to terminate the neighbor search early. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is approximately 30 times faster than the Taylor-Butina algorithm and 80 times faster than DBSCAN, while producing higher-quality clusters.

研究旨在将CLASSIX算法扩展到曼哈顿距离和Tanimoto距离，实现快速且可解释的聚类。方法包括按数据向量的适当范数进行排序，并使用三角不等式来终止搜索。关键发现表明，CLASSIX Tanimoto比Taylor-Butina算法快约30倍，比DBSCAN快约80倍，同时生成更高质量的聚类。
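
The search-termination trick for the Manhattan case can be sketched as follows. By the reverse triangle inequality, the difference of two 1-norms is a lower bound on the Manhattan distance between the points, so once points are sorted by 1-norm the inner loop can stop as soon as the norm gap exceeds the radius. This is only the pruned neighbor search, not the full CLASSIX aggregation and merging procedure, and the function name `neighbors_within_radius` and the random test data are assumptions made for this example.

```python
import numpy as np

def neighbors_within_radius(data, radius):
    """Return index pairs (i, j) whose Manhattan distance is at most `radius`.

    Points are sorted by 1-norm; since | ||x||_1 - ||y||_1 | <= ||x - y||_1,
    the inner loop can terminate once the norm gap exceeds the radius.
    """
    norms = np.abs(data).sum(axis=1)
    order = np.argsort(norms, kind="stable")
    pts, nrm = data[order], norms[order]
    pairs = []
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            if nrm[j] - nrm[i] > radius:
                break  # every later point has an even larger norm gap
            if np.abs(pts[i] - pts[j]).sum() <= radius:
                pairs.append((int(order[i]), int(order[j])))
    return pairs

# Hypothetical data for the demonstration.
rng = np.random.default_rng(1)
data = rng.normal(size=(60, 3))
pairs = neighbors_within_radius(data, radius=0.8)
```

The pruning never discards a true neighbor because the norm gap is a lower bound on the distance; it only skips candidates that are provably too far away.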

Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards

Authors: Tengjun Jin, Yoojin Choi, Yuxuan Zhu, Daniel Kang

First: 2026-01-13T18:09:06+00:00 · Latest: 2026-01-13T18:09:06+00:00

Comments: 18 pages, 14 figures, 9 tables

Abstract

Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial. In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.

中文标题/摘要

标题：普遍的标注错误破坏了文本到SQL基准和排行榜

研究人员提出了许多文本到SQL技术以简化数据分析并加速数据库驱动应用的开发。为了比较这些技术并选择最适合部署的最佳技术，社区依赖于公开的基准和排行榜。由于这些基准在问题构建和答案评估过程中高度依赖于人工标注，标注的有效性至关重要。在本文中，我们进行了一项实证研究，(i) 对两个广泛使用的文本到SQL基准BIRD和Spider 2.0-Snow的标注错误率进行了基准测试，(ii) 修正了BIRD开发集的一部分以测量标注错误对文本到SQL代理性能和排行榜排名的影响。通过专家分析，我们展示了BIRD Mini-Dev和Spider 2.0-Snow的错误率分别为52.8%和62.8%。我们重新评估了BIRD排行榜上的所有16个开源代理在原始和修正后的BIRD开发集子集上的性能。我们展示了性能变化范围从-7%到31%（相对而言），排名变化范围从-9到+9位。我们进一步评估了这些影响是否适用于完整的BIRD开发集。我们发现，未修正子集上的代理排名与完整开发集上的排名相关性很强（Spearman's $r_s$=0.85，$p$=3.26e-5），而与修正子集上的排名相关性较弱（Spearman's $r_s$=0.32，$p$=0.23）。这些发现表明，标注错误可以显著扭曲报告的性能和排名，可能误导研究方向或部署选择。我们的代码和数据可在https://github.com/uiuc-kang-lab/text_to_sql_benchmarks/获取。

Summary / 总结

This paper investigates the impact of annotation errors on text-to-SQL benchmarks by analyzing the error rates in two widely used benchmarks, BIRD and Spider 2.0-Snow. The study corrects a subset of the BIRD development set and re-evaluates 16 open-source agents, showing significant changes in performance and rankings. The findings indicate that annotation errors can distort reported results, potentially misleading research and deployment decisions.

本文通过分析两个广泛使用的基准BIRD和Spider 2.0-Snow中的注释错误率，研究注释错误对文本到SQL基准的影响。研究纠正了BIRD开发集的一部分，并重新评估了16个开源代理，结果显示性能变化范围从-7%到31%，排名变化从-9到+9位。研究发现，注释错误可以扭曲报告的性能和排名，可能误导研究和部署决策。
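
The ranking comparison reported above relies on Spearman's rank correlation. As a small worked illustration, the sketch below computes it from scratch for score lists without ties, using the classical formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The accuracy numbers are hypothetical and are not taken from the paper, and the function name `spearman_no_ties` is introduced only for this example.

```python
import numpy as np

def spearman_no_ties(scores_a, scores_b):
    """Spearman rank correlation for two score lists that contain no ties."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    rank_a = np.argsort(np.argsort(a))   # 0-based rank of each entry in a
    rank_b = np.argsort(np.argsort(b))   # 0-based rank of each entry in b
    d = rank_a - rank_b
    n = len(a)
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Hypothetical accuracies of four agents on an original and a corrected subset.
original = [70.0, 65.0, 60.0, 55.0]
corrected = [62.0, 68.0, 50.0, 58.0]
r_s = spearman_no_ties(original, corrected)   # 0.6 for these made-up numbers
```

A value near 1 means the two evaluations rank the agents almost identically, while a low value, such as the 0.32 the abstract reports between the uncorrected and corrected subsets, means the leaderboard order changes substantially once annotations are fixed.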

Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling

Authors: Yang Cai, Weiqiang Zheng

First: 2026-01-13T18:08:06+00:00 · Latest: 2026-01-13T18:08:06+00:00

Abstract

Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
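
To get a feel for the optimal rate $f(k)=\frac{k}{k+1}$ stated in the abstract, the following sketch tabulates the guaranteed win rate for a few values of $k$ and inverts the formula to find how many candidate responses are needed for a target win rate. The helper names `optimal_win_rate` and `samples_needed` are hypothetical; only the rate itself comes from the abstract.

```python
from fractions import Fraction
import math

def optimal_win_rate(k):
    """Optimal robust win rate f(k) = k / (k + 1) for a k-output model."""
    return Fraction(k, k + 1)

def samples_needed(target):
    """Smallest k >= 1 with k / (k + 1) >= target, for 0 <= target < 1.

    Pass `target` as a Fraction or a decimal string to avoid float rounding.
    Rearranging k / (k + 1) >= t gives k >= t / (1 - t).
    """
    t = Fraction(target)
    return max(1, math.ceil(t / (1 - t)))

rates = {k: optimal_win_rate(k) for k in (1, 3, 9)}   # 1/2, 3/4, 9/10
k_for_90 = samples_needed("0.9")                      # 9
k_for_99 = samples_needed(Fraction(99, 100))          # 99
```

The inversion shows that the guarantee approaches 1 only slowly: closing the gap to 1 by a factor of ten requires roughly ten times as many candidate responses.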

Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs

Authors: Manideep Reddy Chinthareddy

First: 2026-01-13T18:03:41+00:00 · Latest: 2026-01-13T18:03:41+00:00

Comments: 46 pages, 2 figures

Abstract

Retrieval-Augmented Generation for software engineering often relies on vector similarity search, which captures topical similarity but can fail on multi-hop architectural reasoning such as controller to service to repository chains, interface-driven wiring, and inheritance. This paper benchmarks three retrieval pipelines on Java codebases (Shopizer, with additional runs on ThingsBoard and OpenMRS Core): (A) vector-only No-Graph RAG, (B) an LLM-generated knowledge graph RAG (LLM-KB), and (C) a deterministic AST-derived knowledge graph RAG (DKB) built with Tree-sitter and bidirectional traversal. Using 15 architecture and code-tracing queries per repository, we measure indexing time, query latency, corpus coverage, cost, and answer correctness. DKB builds its graph in seconds, while LLM-KB requires much longer graph generation. LLM-KB also shows indexing incompleteness: on Shopizer, 377 files are skipped or missed, reducing embedded chunk coverage and graph size compared to DKB. End-to-end cost is modest for DKB relative to the vector-only baseline but much higher for LLM-KB, especially as repository scale increases. Query latency is similar for No-Graph and DKB, while LLM-KB is slower and more variable. On the Shopizer question suite, DKB achieves the highest correctness, LLM-KB is close behind, and the vector-only baseline performs worst on upstream architectural queries and has the highest hallucination risk. Overall, deterministic AST-derived graphs provide more reliable coverage and multi-hop grounding than LLM-extracted graphs at substantially lower indexing cost.

中文标题/摘要

标题：可靠图-RAG在代码库中的应用：AST派生图与LLM提取的知识图谱

软件工程中的检索增强生成通常依赖于向量相似性搜索，这可以捕捉主题相似性，但在多跳架构推理（如控制器到服务到存储库链、接口驱动的连接和继承）方面可能会失败。本文在Java代码库（Shopizer，额外运行于ThingsBoard和OpenMRS Core）上对三种检索管道进行了基准测试：(A) 仅向量的无图RAG，(B) 由LLM生成的知识图谱RAG（LLM-KB），以及(C) 使用Tree-sitter和双向遍历构建的确定性AST派生知识图谱RAG（DKB）。使用每个仓库15个架构和代码追踪查询，我们测量了索引时间、查询延迟、语料库覆盖率、成本和答案正确性。DKB在几秒钟内构建其图，而LLM-KB需要更长的图生成时间。LLM-KB还显示了索引不完整性：在Shopizer上，有377个文件被跳过或遗漏，导致嵌入片段覆盖率和图大小低于DKB。相对于仅向量基线，DKB的端到端成本增加不多，而LLM-KB的成本则高得多，尤其是随着仓库规模的增加。查询延迟对于无图和DKB相似，而LLM-KB更慢且更不稳定。在Shopizer问题集中，DKB的正确性最高，LLM-KB紧随其后，仅向量基线在上游架构查询中表现最差且具有最高的幻觉风险。总体而言，确定性AST派生图在索引成本显著降低的情况下提供了比LLM提取图更可靠的覆盖和多跳定位。

Summary / 总结

This paper benchmarks three retrieval pipelines on Java codebases: vector-only No-Graph RAG, LLM-generated knowledge graph RAG (LLM-KB), and deterministic AST-derived knowledge graph RAG (DKB). DKB builds its graph in seconds, while LLM-KB takes much longer and skips or misses files during indexing, which reduces coverage. DKB adds only modest cost over the vector-only baseline and achieves the highest correctness on the Shopizer question suite, with LLM-KB close behind; the vector-only baseline performs worst on upstream architectural queries.

本文评估了三种针对Java代码库的检索管道：仅向量的No-Graph RAG、LLM生成的知识图谱RAG（LLM-KB）和基于AST的确定性知识图谱RAG（DKB）。DKB快速构建其图，而LLM-KB耗时更长且存在索引不完整的问题，减少了覆盖率。DKB在成本效益和正确性方面表现更佳，特别是在上游架构查询方面，优于LLM-KB和向量基线。

STELP: Secure Transpilation and Execution of LLM-Generated Programs

Authors: Swapnil Shinde, Sahil Wadhwa, Andy Luo, Akshay Gupta, Mohammad Shahed Sorower

First: 2026-01-09T01:49:41+00:00 · Latest: 2026-01-13T17:55:11+00:00

Abstract

Rapid evolution of Large Language Models (LLMs) has achieved major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks using such LLMs place them at the center of solving software development-related tasks such as code generation. However, direct use of LLM generated code in production software development systems is problematic. The code could be unstable or erroneous and contain vulnerabilities such as data poisoning, malicious attacks, and hallucinations that could lead to widespread system malfunctions. This prohibits the adoption of LLM generated code in production AI systems where human code reviews and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss safety and reliability problems with the execution of LLM generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems involving code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. This includes applications such as headless code generation-execution and LLMs that produce executable code snippets as an action plan to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: This paper contains malicious code snippets that should be run with caution.

中文标题/摘要

标题：STELP：安全转换和执行LLM生成的程序

大型语言模型（LLMs）的快速进化在推理、规划和函数调用能力方面取得了重大进展。使用此类LLMs的多智能体协作框架将它们置于解决软件开发相关任务（如代码生成）的核心位置。然而，直接在生产软件开发系统中使用LLM生成的代码存在问题。这些代码可能是不稳定的或错误的，并且可能包含数据中毒、恶意攻击和幻觉等漏洞，可能导致系统广泛故障。这阻碍了在需要人工代码审查和传统安全测试工具不可行或不可信的生产AI系统中采用LLM生成的代码。在本文中，我们讨论了执行LLM生成代码的安全性和可靠性问题，并提出了一种安全转换和执行LLM生成程序（STELP）的方法，能够以受控和安全的方式执行LLM生成的代码，填补了传统安全测试方法和人工监督的不切实际或局限性所留下的关键空白。这包括无头代码生成执行和LLM生成可执行代码片段作为实时执行的操作计划的应用。我们贡献了一个经过人工验证的不安全代码片段数据集，并在公开可用的数据集上对我们的方法进行了正确性、安全性和延迟的基准测试。我们的结果表明，与现有方法相比，我们的方法在安全执行风险代码片段方面表现出显著优势。警告：本文包含恶意代码片段，应谨慎运行。

Summary / 总结

This paper addresses the challenges of using Large Language Model (LLM) generated code in production systems, where the code may be unstable, erroneous, or contain vulnerabilities. It proposes STELP, a secure transpiler and executor for LLM-generated programs, designed to execute such code in a controlled and safe manner. The approach is benchmarked for correctness, safety, and latency on publicly available datasets, alongside a contributed human-validated dataset of insecure code snippets, and it outperforms an existing method by a significant margin, particularly at safely executing risky code snippets.

本文探讨了使用大型语言模型（LLMs）生成的代码在生产系统中面临的风险，如不稳定、错误和安全漏洞。它提出了STELP，一种安全的LLM生成程序编译器和执行器，旨在安全地执行这些代码。实验表明，STELP在基准数据集上的表现优于现有方法，特别是在处理风险代码片段方面，在正确性、安全性和延迟方面表现更佳。

Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge

Authors: Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, João Paulo Papa, Tae-Min Choi, Tae Kyeong Jeong, Juyoun Park, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Runzhi Wu, Mengya Xu, An Wang, Long Bai, Hongliang Ren, Amine Yamlahi, Jakob Hennighausen, Lena Maier-Hein, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Shu Yang, Yihui Wang, Hao Chen, Santiago Rodríguez, Nicolás Aparicio, Leonardo Manrique, Juan Camilo Lyons, Olivia Hosie, Nicolás Ayobi, Pablo Arbeláez, Yiping Li, Yasmina Al Khalil, Sahar Nasirihaghighi, Stefanie Speidel, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm

First: 2025-07-22T13:10:42+00:00 · Latest: 2026-01-13T17:49:42+00:00

Comments: A challenge report pre-print accepted by the journal Medical Image Analysis (MedIA), containing 37 pages, 15 figures, and 14 tables

Abstract

Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.

中文标题/摘要

标题：PhaKIR 2024 挑战赛中手术阶段识别、器械关键点估计和器械实例分割的比较验证：结果

在计算机辅助和机器人辅助微创手术（RAMIS）中，可靠地识别和定位内镜视频记录中的手术器械对于多种应用至关重要，包括手术培训、技能评估和自主辅助。然而，在实际条件下的稳健性能仍然是一个重大挑战。将手术上下文（如当前的程序阶段）纳入考虑，已成为提高稳健性和可解释性的有前途的策略。为应对这些挑战，我们在MICCAI 2024的内镜视觉（EndoVis）挑战中组织了手术程序阶段、关键点和器械识别（PhaKIR）子挑战。我们引入了一个新颖的多中心数据集，包含来自三个不同医疗机构的十三段完整的腹腔镜胆囊切除术视频，并统一标注了三个相关任务：手术阶段识别、器械关键点估计和器械实例分割。与现有数据集不同，我们的数据集允许在同一数据中联合研究器械定位和程序上下文，并支持整个程序中的时间信息集成。我们根据生物医学图像分析挑战的BIAS指南报告了结果和发现。PhaKIR子挑战通过提供一种独特的基准，推动了开发具有时间意识和上下文驱动方法在RAMIS领域的进展，并提供了一个高质量的资源，以支持未来对手术场景理解的研究。

Summary / 总结

The study addresses the challenge of reliably recognizing and localizing surgical instruments in endoscopic videos, crucial for applications in computer- and robot-assisted minimally invasive surgery. It introduces a novel multi-center dataset for the PhaKIR 2024 challenge, including thirteen laparoscopic cholecystectomy videos with unified annotations for surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. The results advance the field by providing a benchmark for developing temporally aware, context-driven methods in RAMIS.

研究旨在可靠地识别和定位内镜视频中的手术器械，这对于微创手术中的培训和自主辅助应用至关重要。研究引入了一个新的多中心数据集，用于三项任务：手术阶段识别、器械关键点估计和器械实例分割。该数据集支持对器械定位和程序上下文的联合研究。该子挑战为在RAMIS中开发时间感知、上下文驱动的方法提供了基准。

Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi

First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-13T17:48:43+00:00

Comments: Work in Progress

Abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
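The reweighting step described above can be sketched in a few lines. This is a minimal illustration, assuming a plain $1/\text{cluster size}$ weight; the abstract only says advantages are reweighted inversely with cluster size, so the exact formula, the function name, and the cluster labels below are assumptions. In the paper the labels come from an LLM-based judge, which is replaced here by a hand-written list.

```python
from collections import Counter

def reweight_advantages(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster)."""
    sizes = Counter(cluster_ids)
    return [a / sizes[c] for a, c in zip(advantages, cluster_ids)]

# Four correct rollouts for one problem: three share strategy "A", one uses "B".
advantages = [1.0, 1.0, 1.0, 1.0]
cluster_ids = ["A", "A", "A", "B"]  # labels a judge model might assign (hypothetical)
print(reweight_advantages(advantages, cluster_ids))
```

Under this weighting the lone rollout in cluster "B" keeps its full advantage, while each of the three redundant "A" rollouts receives one third of it, so the rare strategy dominates the update.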

中文标题/摘要

标题：奖励稀有性：面向创造性问题解决的LLM独特性感知RL

强化学习（RL）已成为后训练大型语言模型（LLMs）的核心范式，特别是在复杂推理任务中，但它经常遭受探索崩溃的问题：策略过早地集中于一小套主导的推理模式，提高了pass@1，但限制了rollout级别的多样性和pass@k的收益。我们认为这种失败源于对局部token行为的正则化，而不是对解决方案集的多样性。为了解决这个问题，我们提出了独特性感知强化学习，这是一种rollout级别的目标，明确奖励那些表现出罕见高层策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的rollout根据其高层解决方案策略聚类，忽略表面差异，并根据集群大小反向重置策略优势。因此，正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中，我们的方法在大规模采样预算下始终如一地提高了pass@$k$，增加了pass@$k$曲线下的面积（AUC@$K$），同时不牺牲pass@1，并在大规模下维持探索并发现更多多样化的解决方案策略。

Summary / 总结

The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant strategies. It introduces Uniqueness-Aware Reinforcement Learning, which rewards solutions that exhibit rare high-level strategies, using an LLM-based judge to cluster similar strategies and reweight policy advantages. This method improves pass@$k$ across various reasoning benchmarks and increases the AUC@$K$ without compromising pass@1, promoting exploration and uncovering diverse solution strategies.

论文解决了强化学习中大型语言模型探索坍缩的问题，即策略倾向于聚焦于少数主导推理模式。提出了一种新颖的强化学习方法，即独特性感知强化学习，该方法通过LLM基评估器对策略进行聚类，并根据集群大小重新加权策略优势，奖励表现出罕见高级策略的正确解决方案。该方法在各种推理基准测试中提高了pass@$k$，增加了AUC@$K$，同时保持了pass@1，并促进了大规模下的多样化解决方案策略。

MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification

Authors: Yingying Feng, Jie Li, Jie Hu, Yukang Zhang, Lei Tan, Jiayi Ji

Venue: NeurIPS 2025

First: 2025-10-27T13:08:46+00:00 · Latest: 2026-01-13T17:44:49+00:00

Comments: Accepted by NeurIPS 2025

Abstract

Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. Notably, MDReID achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in general modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively. The code is available at: https://github.com/stone96123/MDReID.

中文标题/摘要

标题：MDReID: 不同模态学习的任意到任意多模态物体重识别

现实世界中的物体重识别（ReID）系统经常面临模态不一致的问题，其中查询和画廊图像来自不同的传感器（例如，RGB、NIR、TIR）。然而，大多数现有方法假设模态匹配的条件，这限制了它们在实际应用中的鲁棒性和可扩展性。为了解决这一挑战，我们提出了MDReID，这是一种灵活的任意到任意图像级ReID框架，旨在同时处理模态匹配和模态不匹配的场景。MDReID基于这样一个洞察：模态信息可以分解为两个部分：模态共享特征，这些特征是可预测和可转移的，以及模态特定特征，这些特征捕捉了独特的、模态依赖的特性。为了有效利用这一点，MDReID引入了两个关键组件：模态解耦学习（MDL）和模态感知度量学习（MML）。具体来说，MDL明确地将模态特征分解为模态共享和模态特定表示，使在模态对齐和不匹配的场景中都能有效检索。MML是一种定制的度量学习策略，进一步确保了两个组件之间的正交性和互补性，以增强跨模态的判别力。在三个具有挑战性的多模态ReID基准数据集（RGBNT201、RGBNT100、MSVR310）上进行的大量实验一致地证明了MDReID的优势。值得注意的是，MDReID在一般模态匹配场景中实现了9.8%、3.0%和11.5%的显著mAP改进，在模态不匹配场景中分别实现了3.4%、11.8%和10.9%的平均收益。代码可在：https://github.com/stone96123/MDReID 获取。

Summary / 总结

MDReID is a flexible ReID framework designed to handle both modality-matched and modality-mismatched scenarios. It decomposes modality features into shared and specific components and uses Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML) to enhance retrieval. Experiments on three benchmarks show that MDReID outperforms existing methods, with significant improvements in both modality-matched and mismatched scenarios, achieving mAP gains of up to 11.5% and 11.8% respectively.

MDReID 是一种灵活的 ReID 框架，能够处理模态匹配和模态不匹配的情况。它将模态特征分解为共享和特定组件，并使用模态解耦学习（MDL）和模态感知度量学习（MML）来增强检索效果。在三个基准上的实验表明，MDReID 在模态匹配和不匹配场景中均优于现有方法，分别实现了高达 11.5% 和 11.8% 的 mAP 提升。

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Authors: Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan

First: 2026-01-13T17:42:27+00:00 · Latest: 2026-01-13T17:42:27+00:00

Comments: 40 pages, 8 pages

Abstract

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.

中文标题/摘要

标题：M3CoTBench：医学影像理解中MLLMs的链式思维基准

链式思维（CoT）推理已被证明能有效提升大型语言模型，通过鼓励逐步的中间推理。最近的进展将这一范式扩展到了多模态大型语言模型（MLLMs）。在医学领域，诊断决策依赖于细微的视觉线索和顺序推理，CoT 与临床思维过程自然契合。然而，当前用于医学影像理解的基准测试通常只关注最终答案，而忽略了推理路径。一个不透明的过程缺乏可靠的判断基础，难以帮助医生进行诊断。为解决这一问题，我们引入了一个新的M3CoTBench基准测试，专门用于评估医学影像理解中CoT推理的正确性、效率、影响和一致性。M3CoTBench包括1）涵盖24种检查类型的多样化、多层次难度数据集，2）13种不同难度的任务，3）一套针对临床推理的CoT特定评估指标（正确性、效率、影响和一致性），以及4）多种MLLMs的性能分析。M3CoTBench系统地评估了CoT推理在各种医学成像任务中的表现，揭示了MLLMs在生成可靠且临床可解释的推理方面的当前局限性，并旨在促进透明、可信且诊断准确的AI系统的开发，以服务于医疗保健。项目页面：https://juntaojianggavin.github.io/projects/M3CoTBench/

Summary / 总结

M3CoTBench is a new benchmark designed to evaluate the CoT reasoning of MLLMs in medical image understanding. It addresses the gap in current benchmarks by focusing on the reasoning process rather than just the final answer. The benchmark includes a diverse dataset with 24 examination types and 13 varying-difficulty tasks, along with specific metrics for evaluating correctness, efficiency, impact, and consistency. Key findings show that current MLLMs struggle to generate reliable and clinically interpretable reasoning, highlighting the need for more transparent and trustworthy AI systems in healthcare.

M3CoTBench 是一个新基准，旨在评估 MLLMs 在医学影像理解中的 CoT 推理能力。它包含一个多样化的数据集，涵盖 24 种检查类型和 13 个不同难度的任务，以及针对临床推理的正确性、效率、影响和一致性等特定评估指标。该基准揭示了 MLLMs 在生成可靠且临床可解释的推理方面的当前局限性，旨在促进透明和可信赖的 AI 系统在医疗保健中的发展。

Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis

Authors: Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado

First: 2025-07-22T14:39:20+00:00 · Latest: 2026-01-13T17:34:14+00:00

Comments: 35 pages, 7 figures, color figures

Abstract

A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the Noisy Intermediate-Scale Quantum (NISQ) era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. This is a circuit-aware reward, in contrast to the current trend of works on this topic, which are primarily fidelity-based. By leveraging sparse matrix representations and state-space discretization, the method enables practical navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set still yields low-depth circuits, highlighting the algorithm's robustness and adaptability. The results confirm that this RL-driven approach, with our completely circuit-aware method, efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.

中文标题/摘要

标题：混合奖励驱动的强化学习在高效量子电路合成中的应用

提出了一种强化学习（RL）框架，用于从固定初始状态高效合成生成指定目标量子态的量子电路，解决了噪声中等规模量子（NISQ）时代和未来容错量子计算中的核心挑战。该方法利用基于动作序列的表格Q学习，在离散化的量子态空间中有效管理空间维度的指数增长。该框架引入了一种混合奖励机制，结合了静态、领域导向的奖励，引导代理向目标状态发展，以及可定制的动态惩罚，以防止诸如门拥堵和重复状态访问等低效电路结构。这是一种电路感知的奖励，与当前该领域工作的主要基于保真度的方法不同。通过利用稀疏矩阵表示和状态空间离散化，该方法能够在最小化计算开销的同时，实现高维环境的有效导航。在最多七量子比特的图态准备任务上进行基准测试，我们证明该算法能够一致地发现具有优化门计数的最小深度电路。此外，将该框架扩展到通用门集仍然能够生成低深度电路，突显了该算法的鲁棒性和适应性。结果表明，这种由RL驱动的方法，结合我们完全电路感知的方法，能够高效探索复杂的量子态空间，并合成接近最优的量子电路，为量子电路优化提供了一种资源高效的基石。

Summary / 总结

The research introduces a reinforcement learning framework for synthesizing quantum circuits that generate specific target quantum states from a fixed initial state, addressing challenges in both NISQ and future quantum computing. It employs tabular Q-learning with a hybrid reward mechanism that includes a static, domain-informed reward and dynamic penalties to guide the agent towards the target state while avoiding inefficient circuits. The method uses sparse matrix representations and state-space discretization to navigate high-dimensional spaces efficiently. Experiments on graph-state preparation tasks show that the algorithm consistently finds minimal-depth circuits with optimized gate counts, even when extended to a universal gate set, demonstrating the algorithm's robustness and adaptability.

研究提出了一种基于强化学习的框架，用于从固定初始状态生成特定目标量子态的量子电路合成，以应对NISQ时代和未来量子计算中的挑战。该方法采用基于动作序列的表格Q学习，并结合静态、领域导向的奖励和动态惩罚来引导代理向目标状态移动，同时避免无效电路。该方法利用稀疏矩阵表示和状态空间离散化来高效导航高维空间。实验结果表明，该算法在图态准备任务中能够一致地找到具有优化门数的最小深度电路，即使扩展到通用门集，也展示了算法的鲁棒性和适应性。

Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions

Authors: Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Omobayode Fagbohungbe, Tianyi Chen

First: 2025-02-10T09:56:15+00:00 · Latest: 2026-01-13T17:33:05+00:00

Abstract

As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamics, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. Ideally, the conductance changes by a constant in response to each pulse; in reality, the change is scaled by asymmetric and non-linear response functions, leading to non-ideal training dynamics. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions. We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome this issue, we propose the Residual Learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We demonstrate that the proposed method can be extended to address other hardware imperfections, such as limited response granularity. To our knowledge, this is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusions are supported by simulations validating our theoretical insights.
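The effect of an asymmetric response can be seen in a toy scalar example. This is a sketch under stated assumptions, not the paper's model or its Residual Learning algorithm: the linear response functions $1 - w/\tau$ (for increases) and $1 + w/\tau$ (for decreases), the value $\tau = 1$, and the function names are all chosen for illustration. The gradients alternate between $+1$ and $-1$, so an ideal device ends each pair of steps where it started.

```python
def analog_step(w, grad, lr, tau=1.0):
    """One pulse-style update whose size depends on direction and current weight."""
    delta = -lr * grad
    # Hypothetical asymmetric linear response: upward steps shrink as w grows,
    # downward steps shrink as w falls.
    response = (1.0 - w / tau) if delta > 0 else (1.0 + w / tau)
    return w + delta * response

def ideal_step(w, grad, lr):
    """Reference update with a constant response."""
    return w - lr * grad

w_analog = w_ideal = 0.5
for _ in range(200):
    for grad in (+1.0, -1.0):  # gradients cancel in pairs
        w_analog = analog_step(w_analog, grad, lr=0.1)
        w_ideal = ideal_step(w_ideal, grad, lr=0.1)

print(round(w_ideal, 6), round(w_analog, 6))
```

The ideal weight stays at 0.5, while the analog weight drifts toward the region near $w = 0$ where the two response functions are equal. In this toy setting the drift acts like a pull on the weight that the gradients never asked for, which is one way to picture the implicit penalty the abstract describes.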

中文标题/摘要

标题：通用非理想电阻元件的模拟内存训练：响应函数的影响

随着训练和部署大型视觉或语言模型的经济和环境成本急剧增加，模拟内存计算(AIMC)作为一种节能解决方案脱颖而出。然而，从训练视角来看，尤其是其训练动态，尚未得到充分探索。在AIMC硬件中，可训练权重由电阻元件的导电性表示，并通过连续的电脉冲进行更新。虽然每次脉冲后导电性会恒定变化，但实际上变化会受到不对称和非线性响应函数的缩放，导致非理想的训练动态。本文为在具有非理想响应函数的AIMC硬件上基于梯度的训练提供了理论基础。我们证明了不对称的响应函数会对模拟SGD产生负面影响，施加一个隐含的目标惩罚。为了解决这一问题，我们提出了残差学习算法，该算法通过求解双层优化问题可证明地收敛到一个临界点。我们证明了所提出的方法可以扩展以解决其他硬件缺陷，如响应粒度有限的问题。据我们所知，这是第一篇研究一类通用非理想响应函数影响的论文。结论由验证我们理论洞察的仿真支持。

Summary / 总结

This paper investigates the impact of non-ideal response functions on analog in-memory training, providing a theoretical foundation for gradient-based training in AIMC hardware. The study shows that asymmetric response functions negatively affect Analog SGD by imposing an implicit penalty. To address this, the authors propose a Residual Learning algorithm that converges to a critical point by solving a bilevel optimization problem. Simulations validate the theoretical findings, demonstrating the method's effectiveness in handling other hardware imperfections as well.

该论文探讨了非理想响应函数对模拟内存训练的影响，这是一种训练大型模型的有前景的节能方法。研究表明，非对称响应函数会通过施加隐含惩罚来负面影响Analog SGD。为了解决这一问题，作者提出了一个残差学习算法，通过求解双层优化问题可以精确收敛到一个临界点。仿真验证了理论发现的有效性，还表明该方法可以处理其他硬件缺陷。

Grid-Aware Charging and Operational Optimization for Mixed-Fleet Public Transit

Authors: Rishav Sen, Amutheezan Sivagnanam, Aron Laszka, Ayan Mukhopadhyay, Abhishek Dubey

Venue: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), 2024

First: 2026-01-13T17:30:25+00:00 · Latest: 2026-01-13T17:30:25+00:00

Comments: 7 pages, 7 figures, 4 algorithms. Published in the Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)

Abstract

The rapid growth of urban populations and the increasing need for sustainable transportation solutions have prompted a shift towards electric buses in public transit systems. However, the effective management of mixed fleets consisting of both electric and diesel buses poses significant operational challenges. One major challenge is coping with dynamic electricity pricing, where charging costs vary throughout the day. Transit agencies must optimize charging assignments in response to such dynamism while accounting for secondary considerations such as seating constraints. This paper presents a comprehensive mixed-integer linear programming (MILP) model to address these challenges by jointly optimizing charging schedules and trip assignments for mixed (electric and diesel bus) fleets while considering factors such as dynamic electricity pricing, vehicle capacity, and route constraints. We address the potential computational intractability of the MILP formulation, which can arise even with relatively small fleets, by employing a hierarchical approach tailored to the fleet composition. By using real-world data from the city of Chattanooga, Tennessee, USA, we show that our approach can result in significant savings in the operating costs of the mixed transit fleets.
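The core trade-off, charging when electricity is cheap while respecting trip commitments, can be shown on a toy instance. The sketch below uses exhaustive enumeration over binary slot choices as a stand-in; it is not the paper's MILP formulation or its hierarchical approach, and the prices, slot counts, and function name are hypothetical. Enumeration is only viable at this tiny scale, which is the intractability concern the abstract raises for real fleets.

```python
from itertools import combinations

def cheapest_charging_plan(prices, slots_needed, busy_slots):
    """Enumerate every feasible set of charging slots and return the cheapest."""
    free = [t for t in range(len(prices)) if t not in busy_slots]
    best = None
    for plan in combinations(free, slots_needed):
        cost = sum(prices[t] for t in plan)
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best  # None when no feasible plan exists

prices = [0.30, 0.10, 0.20, 0.15]  # hypothetical price per time slot
cost, plan = cheapest_charging_plan(prices, slots_needed=2, busy_slots={1})
print(plan, round(cost, 2))
```

Here the cheapest slot (index 1) is taken by a trip, so the bus charges in slots 2 and 3 instead. A real model would couple many buses, trips, and chargers, which is where an integer programming solver replaces the loop.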

中文标题/摘要

标题：混合车队公共交通的电网感知充电与运营优化

随着城市人口的快速增长和对可持续交通解决方案需求的增加，公共交通系统正逐渐转向电动巴士。然而，管理由电动巴士和柴油巴士组成的混合车队带来了显著的运营挑战。一个主要挑战是应对动态电价，即充电成本随时间变化。交通机构必须根据这种动态性优化充电分配，同时考虑如座位限制等次要因素。本文提出了一种综合混合整数线性规划（MILP）模型，通过同时优化混合（电动和柴油巴士）车队的充电时间表和行程分配，考虑动态电价、车辆容量和路线限制等因素来应对这些挑战。我们通过一种针对车队组成定制的分层方法来解决MILP公式可能带来的潜在计算不可行性问题，即使车队规模相对较小也是如此。通过使用美国田纳西州查塔努加市的实际数据，我们展示了我们的方法可以显著降低混合公交车队的运营成本。

Summary / 总结

This paper addresses the operational challenges of managing mixed fleets of electric and diesel buses in public transit systems, particularly the need to optimize charging schedules in response to dynamic electricity pricing. The authors propose a mixed-integer linear programming (MILP) model that jointly optimizes charging schedules and trip assignments, considering factors like vehicle capacity and route constraints. Using real-world data from Chattanooga, Tennessee, the study demonstrates significant savings in operating costs for mixed transit fleets.

本文探讨了在公共交通系统中管理混合车队（包括电动和柴油巴士）所面临的运营挑战，特别是如何根据动态电价优化充电时间表。研究提出了一种混合整数线性规划（MILP）模型，该模型同时优化充电时间和行程分配，考虑了车辆容量和路线约束等因素。通过使用美国田纳西州查塔努加市的实际数据，研究显示这种方法可以显著降低混合巴士车队的运营成本。

MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation

Authors: Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, Liujuan Cao

First: 2026-01-11T11:44:07+00:00 · Latest: 2026-01-13T17:29:39+00:00

Comments: Project Website: https://sosppxo.github.io/mvggt.github.io/

Abstract

Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at https://mvggt.github.io.

中文标题/摘要

标题：MVGGT：多模态视觉几何接地变换器用于多视图3D指示表达分割

大多数现有的3D指示表达分割（3DRES）方法依赖于密集的高质量点云，而现实世界的代理如机器人和手机仅能使用少量稀疏的RGB视图并具有严格的延迟限制。我们提出了多视图3D指示表达分割（MV-3DRES），其中模型必须直接从稀疏多视图图像中恢复场景结构并分割指示的对象。传统的两阶段管道首先重建点云，然后进行分割，通常会导致低质量的几何结构，产生粗略或退化的目标区域，并且运行速度较慢。我们提出了多模态视觉几何接地变换器（MVGGT），这是一种高效的端到端框架，通过双分支设计将语言信息整合到稀疏视图几何推理中。在这种设置下进行训练暴露出一个关键的优化障碍，称为前景梯度稀释（FGD），其中稀疏的3D信号导致弱监督。为了解决这个问题，我们引入了视图无目标抑制优化（PVSO），它提供了更强且更平衡的梯度，使学习更加稳定和高效。为了支持一致的评估，我们构建了MVRefer，这是一个基准，定义了MV-3DRES的标准设置和指标。实验表明，MVGGT建立了第一个强大的基线，并实现了高精度和快速推理，优于现有方法。代码和模型可在https://mvggt.github.io/公开获取。

Summary / 总结

The research addresses a limitation of existing 3DRES methods, which rely on dense, high-quality point clouds, by segmenting the referred object directly from sparse multi-view RGB images. It introduces MVGGT, an end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Per-view No-target Suppression Optimization (PVSO) counters the weak supervision caused by sparse 3D signals. Experiments on the MVRefer benchmark show that MVGGT achieves high accuracy and fast inference, outperforming existing alternatives.

研究针对现有依赖密集点云的3D参照表达分割方法在实际应用中难以满足稀疏RGB视图和严格延迟约束的限制。提出的MVGGT框架将语言信息整合到稀疏视图几何推理中，通过视图无目标抑制优化（PVSO）克服了弱监督的优化障碍。实验表明，MVGGT在准确性和推理速度上均优于现有方法，建立了MV-3DRES的第一个强基线。

To Retrieve or To Think? An Agentic Approach for Context Evolution

Authors: Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li

First: 2026-01-13T17:25:57+00:00 · Latest: 2026-01-13T17:25:57+00:00

Abstract

Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
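The orchestrator's retrieve-or-think decision can be sketched as a majority vote over several sampled judgments. This is a minimal illustration: the action names, the function `orchestrate`, and the tie-breaking rule (fall back to retrieval) are assumptions, since the abstract does not specify how votes are collected or how ties are resolved.

```python
from collections import Counter

def orchestrate(votes, default="retrieve"):
    """Pick the action with the most votes; fall back to default on a tie."""
    counts = Counter(votes).most_common()
    if not counts:
        return default
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]

# Hypothetical votes from several orchestrator samples for one reasoning step.
print(orchestrate(["think", "retrieve", "think"]))
print(orchestrate(["think", "retrieve"]))
```

In a full loop, a "retrieve" outcome would call the retriever agent and append new evidence, while "think" would hand the current context to the reasoner agent for refinement.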

中文标题/摘要

标题：检索还是思考？一种代理导向的上下文演化方法

当前的上下文增强方法，如检索增强生成，对于解决知识密集型推理任务至关重要。然而，它们通常遵循一种僵化的、粗暴的策略，在每一步都执行检索。这种不分青红皂白的方法不仅导致不必要的计算成本，还会通过饱和上下文中的无关噪音而降低性能。为了解决这些局限性，我们引入了代理导向的上下文演化（ACE），这是一种受人类元认知启发的框架，能够动态决定是寻求新证据还是利用现有知识进行推理。ACE 通过一个中央协调代理，通过多数投票策略性地做出决策。它旨在交替激活检索代理进行外部检索和推理代理进行内部分析和优化。通过消除冗余的检索步骤，ACE 维持了一个简洁且演化的上下文。在具有挑战性的多跳问答基准测试中的广泛实验表明，ACE 在准确率上显著优于竞争性基线，同时实现了高效的令牌消耗。我们的工作为复杂、知识密集型任务的上下文演化生成提供了宝贵的见解。

Summary / 总结

The paper addresses the limitations of current context augmentation methods, which often perform unnecessary retrieval at every step, leading to increased computational costs and degraded performance. It introduces Agentic Context Evolution (ACE), a framework that uses a central agent to decide between retrieval and reasoning based on majority voting. ACE alternates between a retriever and a reasoner to maintain a concise context. Experiments show that ACE outperforms existing methods in accuracy while consuming fewer tokens, providing insights for improving context-evolved generation for complex tasks.

论文针对当前上下文增强方法，如检索增强生成，存在的计算成本过高和因无关噪声导致性能下降的问题，提出了Agentic Context Evolution (ACE)框架。ACE通过中央协调器代理决定是检索新证据还是利用现有知识进行推理，交替使用检索和推理代理，减少冗余检索步骤，保持简洁的上下文。实验表明，ACE在准确性和token消耗效率上优于现有方法，为复杂知识密集型任务的上下文演化生成提供了有价值的见解。
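
The retrieve-or-think decision described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `orchestrate` and `evolve_context`, the voter/retriever/reasoner callables, the tie-breaking rule and the fixed step budget are all hypothetical.

```python
from collections import Counter


def orchestrate(votes):
    """Return the majority action among 'retrieve' / 'think' votes."""
    counts = Counter(votes)
    # Assumption: ties fall back to 'think' so that no extra retrieval is spent.
    return "retrieve" if counts["retrieve"] > counts["think"] else "think"


def evolve_context(question, voters, retriever, reasoner, max_steps=4):
    """Alternate between retrieval and reasoning, one majority vote per step."""
    context, actions = [], []
    for _ in range(max_steps):
        # Each voter is a callable (question, context) -> 'retrieve' | 'think'.
        action = orchestrate([vote(question, context) for vote in voters])
        actions.append(action)
        if action == "retrieve":
            # Hypothetical retriever: returns one new piece of evidence.
            context.append(retriever(question, context))
        else:
            # Hypothetical reasoner: returns a refined context.
            context = reasoner(question, context)
    return context, actions
```

In a real system each voter would be a sampled LLM judgment; here any callable with the same shape works, which keeps the control flow testable without a model.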

GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification

Authors: Qiao Li, Jie Li, Yukang Zhang, Lei Tan, Jing Chen, Jiayi Ji

Venue: NeurIPS 2025

First: 2025-10-25T12:16:10+00:00 · Latest: 2026-01-13T17:19:03+00:00

Comments: Accepted by NeurIPS 2025

Abstract

Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting.

中文标题/摘要

标题：GSAlign：航空-地面人员再识别的几何和语义对齐网络

航空-地面人员再识别（AG-ReID）是一项新兴且具有挑战性的任务，旨在匹配从视角差异极大的设备（通常是无人机（UAV）和地面监控摄像头）拍摄的行人图像。由于极端视角差异、遮挡以及航空与地面图像之间的领域差距，该任务面临重大挑战。尽管先前的工作通过学习跨视角表示取得了进展，但它们在处理严重的姿态变化和空间错位方面仍然有限。为了解决这些问题，我们提出了一种针对AG-ReID的几何和语义对齐网络（GSAlign）。GSAlign引入了两个关键组件，以同时解决航空-地面匹配中的几何失真和语义错位问题：可学习的薄板样条（LTPS）模块和动态对齐模块（DAM）。LTPS模块根据一组学习到的关键点自适应地扭曲行人特征，有效补偿了极端视角变化引起的几何变化。同时，DAM估计可见性感知的表示掩码，在语义层面突出可见的身体区域，从而减轻遮挡和部分观察对跨视角对应的负面影响。在CARGO上使用四种匹配协议进行的全面评估证明了GSAlign的有效性：在航空-地面设置下，相对于之前的最先进方法，mAP提高了18.8%，Rank-1精度提高了16.8%。

Summary / 总结

The paper introduces GSAlign, a Geometric and Semantic Alignment Network designed to address the challenges of aerial-ground person re-identification (AG-ReID). GSAlign uses a Learnable Thin Plate Spline (LTPS) Module to adaptively warp pedestrian features and a Dynamic Alignment Module (DAM) to estimate visibility-aware representation masks. Together, the two modules handle geometric distortion and semantic misalignment in cross-view matching. Experimental results on the CARGO dataset show that GSAlign outperforms previous methods, achieving an 18.8% increase in mAP and a 16.8% increase in Rank-1 accuracy on the aerial-ground setting.

研究旨在通过开发几何和语义对齐网络（GSAlign）来解决航拍-地面人员再识别（AG-ReID）中的极端视角差异、遮挡和领域差距问题。GSAlign 包含一个可学习的薄板样条（LTPS）模块用于几何失真校正，以及一个动态对齐模块（DAM）用于语义对齐。研究在 CARGO 数据集上使用四种匹配协议进行了评估，结果显示 GSAlign 在平均精度（mAP）和 Rank-1 准确率上分别比之前的方法提高了 18.8% 和 16.8%。
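
For readers unfamiliar with thin plate splines, the sketch below fits a classical 2D TPS that maps a set of source keypoints exactly onto target keypoints and can then warp any other coordinate. It is only an illustration of the underlying interpolation: the keypoints here are fixed by hand rather than learned, it warps coordinates rather than feature maps, and the names `fit_tps`, `_kernel` and `_solve` are hypothetical, so it should not be read as the LTPS module itself.

```python
import math


def _kernel(r2):
    # TPS radial basis U(r) = r^2 * log(r), expressed through r^2.
    return 0.0 if r2 == 0.0 else 0.5 * r2 * math.log(r2)


def _solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_tps(src, dst):
    """Return warp(x, y) that sends each src keypoint onto its dst keypoint."""
    n = len(src)
    # Block system [[K, P], [P^T, 0]] with K the kernel matrix and P = [1, x, y].
    A = [[0.0] * (n + 3) for _ in range(n + 3)]
    for i, (xi, yi) in enumerate(src):
        for j, (xj, yj) in enumerate(src):
            A[i][j] = _kernel((xi - xj) ** 2 + (yi - yj) ** 2)
        A[i][n:n + 3] = [1.0, xi, yi]
        A[n][i], A[n + 1][i], A[n + 2][i] = 1.0, xi, yi
    # One weight vector per output coordinate (x, then y).
    params = [_solve(A, [p[d] for p in dst] + [0.0, 0.0, 0.0]) for d in range(2)]

    def warp(x, y):
        out = []
        for w in params:
            value = w[n] + w[n + 1] * x + w[n + 2] * y  # affine part
            value += sum(w[i] * _kernel((x - sx) ** 2 + (y - sy) ** 2)
                         for i, (sx, sy) in enumerate(src))  # non-rigid part
            out.append(value)
        return tuple(out)

    return warp
```

The source keypoints must be distinct and not all collinear, otherwise the linear system is singular. When `dst` equals `src`, the fitted warp reduces to the identity.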

TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback

Authors: Prithwish Jana, Sam Davidson, Bhavana Bhasker, Andrey Kan, Anoop Deoras, Laurent Callot

First: 2026-01-13T17:08:30+00:00 · Latest: 2026-01-13T17:08:30+00:00

Comments: The paper has been published at the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE 2026), Rio de Janeiro, Brazil, April 12-18, 2026

Abstract

Automating Infrastructure-as-Code (IaC) is challenging, and large language models (LLMs) often produce incorrect configurations from natural language (NL). We present TerraFormer, a neuro-symbolic framework for IaC generation and mutation that combines supervised fine-tuning with verifier-guided reinforcement learning, using formal verification tools to provide feedback on syntax, deployability, and policy compliance. We curate two large, high-quality NL-to-IaC datasets, TF-Gen (152k instances) and TF-Mutn (52k instances), via multi-stage verification and iterative LLM self-correction. Evaluations against 17 state-of-the-art LLMs, including ~50x larger models like Sonnet 3.7, DeepSeek-R1, and GPT-4.1, show that TerraFormer improves correctness over its base LLM by 15.94% on IaC-Eval, 11.65% on TF-Gen (Test), and 19.60% on TF-Mutn (Test). It outperforms larger models on both TF-Gen (Test) and TF-Mutn (Test), ranks third on IaC-Eval, and achieves top best-practices and security compliance.

中文标题/摘要

标题：TerraFormer：通过策略导向验证反馈微调的自动化基础设施即代码

自动化基础设施即代码（IaC）具有挑战性，大型语言模型（LLMs）经常从自然语言（NL）生成错误的配置。我们提出了TerraFormer，这是一种用于IaC生成和变异的神经符号框架，结合了监督微调和验证器引导的强化学习，并使用形式验证工具提供关于语法、可部署性和策略合规性的反馈。我们通过多阶段验证和迭代LLM自我修正，构建了两个大型高质量的NL到IaC数据集，TF-Gen（152k实例）和TF-Mutn（52k实例）。针对17个最先进的LLM进行评估，包括约50倍更大的模型如Sonnet 3.7、DeepSeek-R1和GPT-4.1，结果显示TerraFormer相较其基础LLM，在IaC-Eval上的正确性提高了15.94%，在TF-Gen（测试）上提高了11.65%，在TF-Mutn（测试）上提高了19.60%。它在TF-Gen（测试）和TF-Mutn（测试）上优于更大的模型，在IaC-Eval上排名第三，并在最佳实践和安全合规方面取得最高水平。

Summary / 总结

TerraFormer is a neuro-symbolic framework for Infrastructure-as-Code (IaC) generation and mutation that combines supervised fine-tuning with verifier-guided reinforcement learning. It improves correctness over its base LLM by 15.94% on IaC-Eval, 11.65% on TF-Gen (Test), and 19.60% on TF-Mutn (Test). It outperforms larger models on both TF-Gen (Test) and TF-Mutn (Test), ranks third on IaC-Eval, and achieves the top best-practices and security compliance.

TerraFormer 是一个结合监督微调和验证器引导强化学习的神经-符号框架，用于自动化基础设施即代码（IaC）生成和变异。与基础语言模型相比，它在 IaC-Eval 上提高了 15.94% 的正确性，在 TF-Gen (Test) 上提高了 11.65%，在 TF-Mutn (Test) 上提高了 19.60%。它在 TF-Gen (Test) 和 TF-Mutn (Test) 上优于更大规模的模型，并在 IaC-Eval 上排名第三，同时实现了最佳实践和安全合规性。
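
One way to picture verifier feedback on syntax, deployability and policy compliance is as a staged reward. The sketch below is an assumption-laden illustration, not the paper's reward: the staging order, the weights, the function name `verifier_reward` and the three toy string checks (stand-ins for real verification tools) are all hypothetical.

```python
def verifier_reward(config, checks, weights):
    """Add up stage weights in order, stopping at the first failed stage."""
    reward = 0.0
    for check, weight in zip(checks, weights):
        if not check(config):
            break  # later stages are meaningless if an earlier one fails
        reward += weight
    return reward


# Toy stand-ins for real verifiers (assumptions for illustration only).
def toy_syntax_ok(config):
    return config.count("{") == config.count("}")


def toy_deployable(config):
    return "resource" in config


def toy_policy_ok(config):
    return "0.0.0.0/0" not in config  # e.g. forbid world-open CIDR ranges


CHECKS = [toy_syntax_ok, toy_deployable, toy_policy_ok]
WEIGHTS = [0.2, 0.3, 0.5]
```

With these toy checks, a configuration that parses and deploys but opens `0.0.0.0/0` earns partial credit (0.5), which gives a denser learning signal than a single pass/fail bit.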

Learning from Demonstrations via Capability-Aware Goal Sampling

Authors: Yuanlin Duan, Yuning Wang, Wenjie Qiu, He Zhu

Venue: NeurIPS 2025

First: 2026-01-13T17:03:31+00:00 · Latest: 2026-01-13T17:03:31+00:00

Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract

Despite its promise, imitation learning often fails in long-horizon environments where perfect replication of demonstrations is unrealistic and small errors can accumulate catastrophically. We introduce Cago (Capability-Aware Goal Sampling), a novel learning-from-demonstrations method that mitigates the brittle dependence on expert trajectories for direct imitation. Unlike prior methods that rely on demonstrations only for policy initialization or reward shaping, Cago dynamically tracks the agent's competence along expert trajectories and uses this signal to select intermediate steps--goals that are just beyond the agent's current reach--to guide learning. This results in an adaptive curriculum that enables steady progress toward solving the full task. Empirical results demonstrate that Cago significantly improves sample efficiency and final performance across a range of sparse-reward, goal-conditioned tasks, consistently outperforming existing learning-from-demonstrations baselines.

中文标题/摘要

标题：基于能力感知目标采样的演示学习

尽管具有潜力，但模仿学习在长时序环境中往往失败，因为完美复制演示是不现实的，而小错误可能会灾难性地累积。我们提出了Cago（能力感知目标采样），这是一种新颖的从演示学习方法，可以缓解直接模仿对专家轨迹的脆弱依赖。与先前仅依赖演示进行策略初始化或奖励塑造的方法不同，Cago 动态跟踪代理在专家轨迹上的能力，并使用此信号选择中间步骤（即略超出代理当前能力范围的目标）来引导学习。这形成了一种自适应课程，使代理能够稳步向完成整个任务迈进。实验结果表明，Cago 在一系列稀疏奖励、目标条件任务上显著提高了样本效率和最终性能，并持续优于现有的从演示学习基线方法。

Summary / 总结

The paper addresses the challenge of imitation learning in long-horizon environments where direct replication of expert demonstrations is impractical. It introduces Cago, a capability-aware goal sampling method that dynamically selects intermediate goals for the agent to achieve, based on its current competence. This approach leads to an adaptive curriculum that facilitates steady progress towards the full task. Experiments show that Cago improves sample efficiency and final performance across various sparse-reward, goal-conditioned tasks, outperforming existing baselines.

研究针对长时域环境中的模仿学习难题，其中完美复制专家演示是不现实的。提出了一种名为Cago的方法，该方法基于代理当前的能力动态采样中间目标，形成一个自适应的学习课程。实验表明，Cago在各种稀疏奖励任务中提高了样本效率和最终性能，优于现有基线。
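
The idea of picking goals "just beyond the agent's current reach" can be sketched with a per-step success estimate along one demonstration. This is a simplified illustration under assumptions, not Cago's actual procedure: the moving-average competence estimate, the 0.8 threshold and the names `update_competence` and `sample_goal` are hypothetical.

```python
def update_competence(success_rate, goal_idx, reached, alpha=0.2):
    """Exponential moving average of success at one demonstration step."""
    target = 1.0 if reached else 0.0
    success_rate[goal_idx] = (1 - alpha) * success_rate[goal_idx] + alpha * target


def sample_goal(success_rate, threshold=0.8):
    """Return the first demo step the agent does not yet reach reliably."""
    for idx, rate in enumerate(success_rate):
        if rate < threshold:
            return idx
    # Every intermediate goal is mastered: aim for the end of the demo.
    return len(success_rate) - 1
```

As the agent keeps reaching the current goal, its estimate crosses the threshold and the sampled goal moves further along the expert trajectory, which is what produces the curriculum effect.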

Salience-SGG: Enhancing Unbiased Scene Graph Generation with Iterative Salience Estimation

Authors: Runfeng Qu, Ole Hall, Pia K Bideau, Julie Ouerfelli-Ethier, Martin Rolfs, Klaus Obermayer, Olaf Hellwich

First: 2026-01-13T16:57:09+00:00 · Latest: 2026-01-13T16:57:09+00:00

Abstract

Scene Graph Generation (SGG) suffers from a long-tailed distribution, where a few predicate classes dominate while many others are underrepresented, leading to biased models that underperform on rare relations. Unbiased-SGG methods address this issue by implementing debiasing strategies, but often at the cost of spatial understanding, resulting in an over-reliance on semantic priors. We introduce Salience-SGG, a novel framework featuring an Iterative Salience Decoder (ISD) that emphasizes triplets with salient spatial structures. To support this, we propose semantic-agnostic salience labels guiding ISD. Evaluations on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves existing Unbiased-SGG methods in their spatial understanding as demonstrated by the Pairwise Localization Average Precision

中文标题/摘要

标题：Salience-SGG：通过迭代显著性估计增强无偏场景图生成

场景图生成（SGG）受到长尾分布的影响，其中少数谓词类别占主导地位，而许多其他类别代表性不足，导致模型存在偏差，在稀有关系上表现不佳。无偏-SGG方法通过实施去偏策略来解决这一问题，但往往以空间理解为代价，导致过度依赖语义先验。我们提出了Salience-SGG，这是一种新颖的框架，包含一个迭代显著性解码器（ISD），强调具有显著空间结构的三元组。为此，我们提出了语义无关的显著性标签来指导ISD。在Visual Genome、Open Images V6和GQA-200上的评估表明，Salience-SGG达到了最先进的性能，并提升了现有无偏-SGG方法的空间理解能力，这一点由成对定位平均精度所体现

Summary / 总结

The research aims to address the bias in Scene Graph Generation (SGG) models caused by the long-tailed distribution of predicate classes. It introduces Salience-SGG, which uses an Iterative Salience Decoder (ISD), guided by semantic-agnostic salience labels, to emphasize triplets with salient spatial structures. Experiments on Visual Genome, Open Images V6, and GQA-200 show that Salience-SGG achieves state-of-the-art performance and improves the spatial understanding of existing Unbiased-SGG methods, as measured by Pairwise Localization Average Precision.

研究旨在解决场景图生成（SGG）模型因谓词类别的长尾分布而导致的偏差问题。提出了一种新的框架Salience-SGG，通过迭代显著性解码器（ISD）关注具有显著空间结构的三元组，提升空间理解能力。在Visual Genome、Open Images V6和GQA-200上的实验表明，Salience-SGG达到了最先进的性能，并提升了现有Unbiased-SGG方法的空间理解能力。

Spike-timing-dependent Hebbian learning as noisy gradient descent

Authors: Niklas Dexheimer, Sascha Gaudlitz, Johannes Schmidt-Hieber

First: 2025-05-15T13:23:16+00:00 · Latest: 2026-01-13T16:54:42+00:00

Abstract

Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.

中文标题/摘要

标题：脉冲时序依赖的赫布学习作为含噪梯度下降

赫布学习是生物神经网络中学习的关键原则。我们将一种赫布式脉冲时序依赖可塑性规则与概率单纯形上关于非凸损失函数的有噪声梯度下降联系起来。尽管不断注入噪声且底层优化问题是非凸的，仍可严格证明所考虑的赫布学习动态能够识别出活动最高的突触前神经元，并且收敛速度关于迭代次数是指数级的。这是非标准且令人惊讶的，因为固定噪声水平的有噪声梯度下降通常仅收敛到一个平稳状态，在该状态下噪声会使动态在极小值点附近波动。

Summary / 总结

The study relates a spike-timing-dependent Hebbian learning rule to noisy gradient descent with respect to a non-convex loss on the probability simplex. Despite the constant injection of noise and the non-convexity of the optimization problem, the authors prove that this Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and converges exponentially fast in the number of iterations. This is noteworthy because noisy gradient descent with a fixed noise level typically converges only to a stationary regime in which the dynamic fluctuates around a minimizer.

研究探讨了脉冲时序依赖的赫布学习与有噪声的梯度下降之间的关系。结果显示，尽管存在持续的噪声和非凸性，赫布学习仍能识别最活跃的突触前神经元，并且收敛速度是指数级的。这与标准的有噪声梯度下降不同，后者通常只收敛到在极小值点附近波动的平稳状态。
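
The qualitative claim, that a noisy Hebbian-style update on the probability simplex concentrates its weight on the most active presynaptic neuron, can be seen in a toy simulation. The multiplicative update below is an assumption chosen for simplicity and is not the plasticity rule analysed in the paper; `hebbian_simplex`, the Bernoulli spike model and all parameter values are hypothetical.

```python
import random


def hebbian_simplex(rates, eta=0.1, steps=2000, seed=0):
    """Toy noisy Hebbian dynamic: strengthen weights of inputs that spiked.

    rates[i] is the per-step spike probability of presynaptic neuron i.
    The weights stay on the probability simplex through renormalisation.
    """
    rng = random.Random(seed)
    n = len(rates)
    w = [1.0 / n] * n
    for _ in range(steps):
        # Noise enters through the random presynaptic spikes.
        spikes = [1.0 if rng.random() < p else 0.0 for p in rates]
        w = [wi * (1.0 + eta * s) for wi, s in zip(w, spikes)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w
```

In this toy model the log-ratio between two weights drifts linearly in the number of steps, so the weight of every neuron other than the most active one decays exponentially despite the noise never being switched off.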

Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts

Authors: Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis

First: 2026-01-13T16:53:40+00:00 · Latest: 2026-01-13T16:53:40+00:00

Abstract

Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of "optimality" depends on the environment's statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limitation. By allowing the network's function approximation to incorporate temporal information, the agent can estimate value functions consistent with the process's intrinsic growth rate. This improvement does not require altering the environmental feedback, such as reward transformations or modified objective functions, but arises naturally from the agent's exposure to temporal trajectories. Our results contribute to the growing body of research on reinforcement learning methods for non-ergodic systems.

中文标题/摘要

标题：非遍历环境下深度强化学习的模型无关解决方案

强化学习（RL）仍然是机器学习中的核心优化框架。尽管RL代理可以收敛到最优解，但“最优性”的定义取决于环境的统计特性。贝尔曼方程是大多数RL算法的核心，它以未来奖励的期望值来表述。然而，当遍历性被打破时，长期结果取决于具体的轨迹而非整体平均值。在这种情况下，整体平均值与个体代理经历的时间平均增长率相偏离，以期望值为基础的表述会产生系统性的次优策略。先前的研究表明，传统的RL架构在非遍历环境中无法恢复真正的最优解。我们将这种分析扩展到深度RL实现，并证明在非遍历动力学下，这些架构也会产生次优策略。将显式的时间依赖性引入学习过程可以纠正这一局限。通过允许网络的功能近似包含时间信息，代理可以估计与过程固有增长率一致的价值函数。这种改进不需要改变环境反馈，如奖励转换或修改目标函数，而是自然地源于代理对时间轨迹的暴露。我们的结果为非遍历系统下的强化学习方法研究做出了贡献。

Summary / 总结

The paper addresses the challenge of reinforcement learning in non-ergodic environments where traditional RL algorithms fail to converge to optimal policies. It demonstrates that deep RL methods also suffer from this issue and introduces a method to incorporate explicit time dependence into the learning process, allowing agents to estimate value functions consistent with the intrinsic growth rate of the process. This approach does not require modifying environmental feedback but improves policy performance naturally through exposure to temporal trajectories.

论文探讨了在非遍历环境中强化学习的挑战，传统RL方法在这种环境中无法收敛到最优策略。它提出了一种方法，将显式的时间依赖性引入学习过程，使代理能够估计与环境固有增长速率一致的价值函数。关键发现表明，这种方法在不修改环境反馈的情况下提高了策略性能，从而克服了传统RL架构在非遍历环境中的局限性。
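
The gap between the ensemble average and the time-average growth rate is easiest to see in a multiplicative coin-toss gamble. This is a standard textbook illustration rather than an experiment from the paper; the payoff factors 1.5 and 0.6 and the function names are assumptions.

```python
import math


def ensemble_growth(up, down, p=0.5):
    """Expected per-round wealth factor, averaged over many parallel agents."""
    return p * up + (1 - p) * down


def time_average_growth(up, down, p=0.5):
    """Per-round factor experienced by one agent over a long trajectory.

    This is the geometric mean of the factors: exp(E[log factor]).
    """
    return math.exp(p * math.log(up) + (1 - p) * math.log(down))


# Wealth is multiplied by 1.5 on heads and by 0.6 on tails.
expected = ensemble_growth(1.5, 0.6)        # 1.05: looks favourable
realised = time_average_growth(1.5, 0.6)    # sqrt(0.9), below 1: wealth decays
```

An agent that maximises the expected value accepts this gamble every round, yet almost every individual trajectory shrinks over time, which is the kind of mismatch the abstract attributes to expected-value formulations under broken ergodicity.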