arXiv Paper Digest

2026-01-20 03:28
Snapshot: 20260120_0328
UniX: Unifying Autoregression and Diffusion for Chest X-Ray Understanding and Generation
Authors: Ruiheng Zhang, Jingfeng Yao, Huangxuan Zhao, Hao Yan, Xiao He, Lei Chen, Zhou Wei, Yong Luo, Zengmao Wang, Lefei Zhang, Dacheng Tao, Bo Du
First: 2026-01-16T18:59:58+00:00 · Latest: 2026-01-16T18:59:58+00:00
Comments: Codes and models are available at https://github.com/ZrH42/UniX
Abstract
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at https://github.com/ZrH42/UniX.
中文标题/摘要
标题:UniX:统一自回归和扩散模型以理解与生成胸部X光片
尽管取得了进展,但医疗基础模型仍然难以统一视觉理解和生成,因为这两个任务具有固有的冲突目标:语义抽象与像素级重建。现有方法通常基于参数共享的自回归架构,经常导致在其中一个或两个任务上的性能妥协。为了解决这个问题,我们提出了UniX,这是一种用于胸部X光片理解和生成的新一代统一医疗基础模型。UniX 将两个任务分别拆分为一个自回归分支用于理解,一个扩散分支用于高保真生成。关键地,引入了一种跨模态自注意力机制,以动态地用理解特征引导生成过程。结合严格的去噪数据处理管道和多阶段训练策略,该架构能够使任务之间协同合作,同时利用扩散模型的优势以实现更优的生成。在两个代表性基准上,UniX 在理解性能(Micro-F1)上提高了46.1%,在生成质量(FD-RadDino)上提高了24.2%,仅使用LLM-CXR参数的四分之一。通过达到与任务特定模型相当的性能,我们的工作确立了一种可扩展的医疗图像理解和生成协同范式。代码和模型可在 https://github.com/ZrH42/UniX 获取。
Summary / 总结
UniX is designed to unify visual understanding and generation for chest X-rays by decoupling these tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. It introduces a cross-modal self-attention mechanism to dynamically guide generation with understanding features. On benchmarks, UniX improves understanding performance by 46.1% (Micro-F1) and generation quality by 24.2% (FD-RadDino), using only a quarter of the parameters of LLM-CXR. This work establishes a scalable paradigm for synergistic medical image understanding and generation.
UniX 通过将视觉理解任务和高保真生成任务分别用自回归分支和扩散分支来解耦,引入了跨模态自注意力机制以动态地用理解特征来引导生成过程。在基准测试中,UniX 的理解性能提高了 46.1%(Micro-F1),生成质量提高了 24.2%(FD-RadDino),并且仅使用了 LLM-CXR 参数的四分之一。这项工作为协同的医学图像理解和生成建立了一个可扩展的范式。
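The understanding result above is reported as Micro-F1. As a reminder of what that metric measures, here is a minimal sketch of micro-averaged F1 for multi-label predictions. The label layout and the toy data are hypothetical; the abstract does not describe the benchmark's label set or evaluation protocol.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over all (sample, label) pairs; entries are 0 or 1."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # finding present and predicted
            elif t == 0 and p == 1:
                fp += 1  # finding predicted but absent
            elif t == 1 and p == 0:
                fn += 1  # finding present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: two reports, three findings each.
truth = [[1, 0, 1], [0, 1, 0]]
preds = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(truth, preds)  # tp=2, fp=1, fn=1 -> 4 / 6
```

Micro-averaging pools counts across all labels before computing F1, so frequent findings weigh more than rare ones.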
Do explanations generalize across large reasoning models?
Authors: Koyena Pal, David Bau, Chandan Singh
First: 2026-01-16T18:55:29+00:00 · Latest: 2026-01-16T18:55:29+00:00
Abstract
Large reasoning models (LRMs) produce a textual chain of thought (CoT) in the process of solving a problem, which serves as a potentially powerful tool to understand the problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e. whether they capture general patterns about the underlying problem rather than patterns which are esoteric to the LRM. This is a crucial question in understanding or discovering new concepts, e.g. in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e. they increase consistency between LRMs) and that this increased generalization is correlated with human preference rankings and post-training with reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to yield new insights and outline a framework for characterizing LRM explanation generalization.
中文标题/摘要
标题:大型推理模型的解释是否具有普适性?
大型推理模型(LRMs)在解决问题的过程中会产生一个文本推理链(CoT),这可以作为一种强大的工具来理解问题,通过提供一种可读的自然语言解释。然而,尚不清楚这些解释是否具有普适性,即它们是否捕捉到了问题的普遍模式,而不是仅限于LRM的特殊模式。这对于理解或发现新概念至关重要,例如在科学的人工智能中。我们通过评估一种特定的普适性概念来研究这个问题:一种LRM生成的解释是否会在提供给其他LRM时产生相同的行为。我们发现CoT解释通常表现出这种形式的普适性(即它们增加了LRM之间的一致性),并且这种增加的普适性与人类的偏好排名和强化学习后的训练相关。我们进一步分析了解释产生一致答案的条件,并提出了一种简单的句子级集成策略,以提高一致性。综上所述,这些结果建议在使用LRM解释以获得新见解时要谨慎,并概述了表征LRM解释普适性的框架。
Summary / 总结
This study investigates whether textual chain of thought (CoT) explanations generated by large reasoning models (LRMs) generalize across different models. The research finds that CoT explanations often lead to consistent behavior when provided to other LRMs, which is correlated with human preference rankings and post-training with reinforcement learning. The study also proposes a sentence-level ensembling strategy to improve consistency and suggests caution when using LRM explanations to derive new insights.
研究探讨了大型推理模型(LRM)生成的文本链式思考(CoT)解释是否能在不同模型间泛化。研究发现,当将CoT解释提供给其他LRM时,这些解释通常会导致一致的行为,这种一致性与人类的偏好排名和强化学习后的训练有关。研究还提出了一种简单的句子级集成策略来提高一致性,并建议在使用LRM解释来获得新见解时要谨慎。
ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Authors: Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel
First: 2026-01-16T18:51:24+00:00 · Latest: 2026-01-16T18:51:24+00:00
Comments: Project Page: http://facebookresearch.github.io/ShapeR Video: https://www.youtube.com/watch?v=EbY30KAA55I
Abstract
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
中文标题/摘要
标题:ShapeR:基于随意捕捉的稳健条件3D形状生成
近期在3D形状生成方面的进展取得了令人印象深刻的成果,但大多数现有方法依赖于干净、未遮挡和良好分割的输入。在现实世界场景中,这些条件很少被满足。我们提出了ShapeR,一种新颖的方法,用于从随意捕捉的序列中生成条件3D对象形状。给定一个图像序列,我们利用现成的视觉-惯性SLAM、3D检测算法和视觉-语言模型,为每个对象提取一组稀疏的SLAM点、多视角图像和机器生成的描述。一种经过训练以有效利用这些模态的矫正流变换器随后生成高保真度的度量3D形状。为了确保对随意捕捉数据挑战的鲁棒性,我们采用了包括实时组合增强、跨越对象和场景数据集的课程训练方案以及处理背景杂乱的策略。此外,我们引入了一个新的评估基准,包括7个真实世界场景中的178个野外对象,带有几何注释。实验表明,在这种具有挑战性的设置中,ShapeR 显著优于现有方法,与最先进的方法相比,Chamfer距离改善了2.7倍。
Summary / 总结
ShapeR is a novel approach for generating 3D object shapes from casually captured sequences. It uses off-the-shelf visual-inertial SLAM, 3D detection, and vision-language models to extract sparse SLAM points, posed multi-view images, and machine-generated captions for each object. A rectified flow transformer conditioned on these modalities then generates high-fidelity metric 3D shapes. On a new benchmark of 178 in-the-wild objects across 7 real-world scenes, ShapeR achieves a 2.7x improvement in Chamfer distance over the state of the art.
ShapeR 是一种从随意拍摄的序列中生成 3D 物体形状的新方法。它使用视觉惯性 SLAM、3D 检测和视觉语言模型来提取稀疏的 SLAM 点、多视角图像和机器生成的描述。然后,一个整流流(rectified flow)Transformer 生成高保真的 3D 形状。ShapeR 在具有挑战性的现实世界场景中优于现有方法,Chamfer 距离相比最先进方法改善了 2.7 倍。
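The headline result above is stated in terms of Chamfer distance. The following is a minimal sketch of one common form, the symmetric Chamfer distance with squared Euclidean distances averaged in both directions. The abstract does not say which variant (squared or unsquared, summed or averaged) the benchmark uses, so that choice is an assumption, and the brute-force nearest-neighbour search is only suitable for small point sets.

```python
from typing import Sequence

Point = Sequence[float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Mean squared nearest-neighbour distance from a to b, plus from b to a."""
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


# Hypothetical example: a two-point shape and a copy shifted by 1 along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, reference)
```

Lower values indicate closer agreement between the generated and reference geometry, so an improvement corresponds to a smaller distance.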
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Authors: Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni
First: 2026-01-16T18:45:19+00:00 · Latest: 2026-01-16T18:45:19+00:00
Abstract
Indoor environments evolve as objects move, appear, or disappear. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high-frequency temporal measurements that are uncommon in the longer-horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. It explores strategies to share information across observations, demonstrating that this shared context not only enables consistent instance tracking but also improves standard 3DSIS quality. To evaluate this task, we define a new metric, t-mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
中文标题/摘要
标题:ReScene4D:演化的室内三维场景时空一致语义实例分割
室内环境随着物体的移动、出现或消失而演变。捕捉这些动态需要在间歇性捕获的3D扫描中保持时空一致的实例身份,即使在未观察到变化时也是如此。我们引入并形式化了时空稀疏的4D室内语义实例分割(SIS)任务,该任务联合分割、识别和时空关联物体实例。这一设置对现有的3DSIS方法构成了挑战,因为它们缺乏时间推理,需要离散匹配步骤;对4D LiDAR方法也构成了挑战,因为它们依赖于高频率的时间测量,而在室内环境长时间演变中这些测量并不常见。我们提出了一种名为ReScene4D的新方法,该方法无需密集观测即可适应4DSIS架构。它探索了在观测之间共享信息的策略,证明这种共享上下文不仅能够实现一致的实例跟踪,还能提高标准3DSIS的质量。为了评估这一任务,我们定义了一个新的度量标准t-mAP,该标准扩展了mAP以奖励时间身份一致性。ReScene4D在3RScan数据集上达到了最先进的性能,建立了理解演化的室内场景的新基准。
Summary / 总结
The research aims to capture the evolving dynamics of indoor environments by maintaining temporally consistent instance identities across 3D scans. The method, ReScene4D, adapts 3D semantic instance segmentation architectures for 4D settings, enabling consistent instance tracking without dense observations. Key findings include improved temporal identity consistency and state-of-the-art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.
研究旨在通过在稀疏时间戳的3D扫描之间保持实例身份的一致性来捕捉室内环境的动态变化。方法ReScene4D将3D语义实例分割(3DSIS)架构适应4D设置,无需密集观测即可实现一致的实例跟踪。关键发现包括ReScene4D在3RScan数据集上达到最先进的性能,并引入了t-mAP新指标来评估时间身份一致性。
MetaboNet: The Largest Publicly Available Consolidated Dataset for Type 1 Diabetes Management
Authors: Miriam K. Wolff, Peter Calhoun, Eleonora Maria Aiello, Yao Qin, Sam F. Royston
First: 2026-01-16T18:38:33+00:00 · Latest: 2026-01-16T18:38:33+00:00
Comments: 22 pages, 5 figures, 7 supplementary figures, submitted to JDST
Abstract
Progress in Type 1 Diabetes (T1D) algorithm development is limited by the fragmentation and lack of standardization across existing T1D management datasets. Current datasets differ substantially in structure and are time-consuming to access and process, which impedes data integration and reduces the comparability and generalizability of algorithmic developments. This work aims to establish a unified and accessible data resource for T1D algorithm development. Multiple publicly available T1D datasets were consolidated into a unified resource, termed the MetaboNet dataset. Inclusion required the availability of both continuous glucose monitoring (CGM) data and corresponding insulin pump dosing records. Additionally, auxiliary information such as reported carbohydrate intake and physical activity was retained when present. The MetaboNet dataset comprises 3135 subjects and 1228 patient-years of overlapping CGM and insulin data, making it substantially larger than existing standalone benchmark datasets. The resource is distributed as a fully public subset available for immediate download at https://metabo-net.org/, and as a Data Use Agreement (DUA)-restricted subset whose datasets are accessible through their respective application processes. For the datasets in the latter subset, processing pipelines are provided to automatically convert the data into the standardized MetaboNet format. A consolidated public dataset for T1D research is presented, and the access pathways for both its unrestricted and DUA-governed components are described. The resulting dataset covers a broad range of glycemic profiles and demographics and thus can yield more generalizable algorithmic performance than individual datasets.
中文标题/摘要
标题:MetaboNet:最大的公共综合型1型糖尿病管理数据集
1型糖尿病(T1D)算法开发的进步受限于现有T1D管理数据集的碎片化和缺乏标准化。当前的数据集在结构上差异很大,获取和处理这些数据耗时,这阻碍了数据整合,降低了算法开发的可比性和普遍性。这项工作旨在建立一个统一且易于访问的数据资源,用于T1D算法开发。多个公开可用的T1D数据集被整合成一个统一的资源,称为MetaboNet数据集。纳入标准要求同时提供持续葡萄糖监测(CGM)数据和相应的胰岛素泵剂量记录。此外,当存在时,保留辅助信息,如报告的碳水化合物摄入量和体力活动。MetaboNet数据集包括3135名受试者和1228患者年的重叠CGM和胰岛素数据,使其比现有的独立基准数据集大得多。该资源以完全公开的子集形式提供,可通过https://metabo-net.org/立即下载,而受数据使用协议(DUA)限制的子集则通过各自的申请流程访问。对于后者子集中的数据集,提供了处理管道,可自动将数据转换为标准化的MetaboNet格式。介绍了用于T1D研究的综合公共数据集,并描述了其不受限制和DUA管理部分的访问途径。该数据集涵盖了广泛的血糖谱和人口统计学特征,因此可以产生比单个数据集更普遍的算法性能。
Summary / 总结
This study addresses the fragmentation and lack of standardization in existing Type 1 Diabetes (T1D) management datasets by consolidating multiple publicly available datasets into a unified resource called MetaboNet. Inclusion requires both continuous glucose monitoring (CGM) data and corresponding insulin pump dosing records; auxiliary information such as reported carbohydrate intake and physical activity is retained when present. The resulting dataset comprises 3135 subjects and 1228 patient-years of overlapping CGM and insulin data, making it substantially larger than existing standalone benchmark datasets. It covers a broad range of glycemic profiles and demographics, which the authors argue can yield more generalizable algorithmic performance than individual datasets.
研究旨在解决现有1型糖尿病(T1D)管理数据集的碎片化和缺乏标准化问题,这阻碍了算法的发展。MetaboNet数据集将多个公开可用的T1D数据集整合在一起,包括连续葡萄糖监测(CGM)数据和胰岛素泵剂量记录,使其成为最大的统一资源,包含3135名受试者和1228患者年的数据。该数据集提供了一种标准化格式,并可通过不受限制和数据使用协议(DUA)治理的途径获取,从而在广泛的血糖谱和人口统计学特征范围内实现更通用的算法性能。
Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
Authors: Jiahe Jin, Abhijay Paladugu, Chenyan Xiong
First: 2025-10-08T00:20:35+00:00 · Latest: 2026-01-16T18:30:29+00:00
Abstract
Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, imposing unique challenges on their reasoning capabilities. However, what constitutes effective reasoning for agentic search and how it can be learned remains unclear. In this work, we first investigate the reasoning behaviors that enable success in agentic search. By comparing successful and failed trajectories via an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on collected trajectories exhibiting the identified behaviors to cultivate these behaviors, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct RL of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline that uses outcome-correct trajectories for fine-tuning. Crucially, we show that these reasoning behaviors matter more than outcome correctness in the priming stage prior to RL. Further analysis reveals that Behavior Priming enhances exploration (pass@8) and test-time scaling (search step number), providing a robust foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search.
中文标题/摘要
标题:代理搜索中的有益推理行为及其有效后训练获取方法
代理搜索要求大型语言模型(LLMs)执行多步搜索以解决复杂的信息检索任务,对它们的推理能力提出了独特的挑战。然而,有效的代理搜索推理构成要素及其如何学习仍然不清楚。在本工作中,我们首先研究使代理搜索成功的推理行为。通过基于LLM的分析管道比较成功的和失败的轨迹,我们确定了四种有益的行为:信息验证、权威评估、适应性搜索和错误恢复。在此基础上,我们提出了一种行为引导的训练方法,该方法在强化学习(RL)之前为代理搜索模型配备了这些推理行为。具体而言,它首先对表现出所识别行为的轨迹进行监督微调(SFT),以培养这些行为,然后应用标准RL进一步提高任务性能。在Qwen3-1.7B和Llama3.2-3B-Instruct上的实验表明,行为引导相较于直接RL在三个网页基准上提高了37.2%,在七个多跳问答基准上提高了6.2%,并且在使用结果正确的轨迹进行微调时优于SFT-然后-RL基线。至关重要的是,我们表明,在RL之前的引导阶段,这些推理行为比结果正确性更为重要。进一步的分析表明,行为引导增强了探索(pass@8)和测试时的扩展(搜索步骤数),为RL提供了稳健的基础。我们的代码可在https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search/ 获取。
Summary / 总结
This study investigates the reasoning behaviors that enable success in agentic search, identifying four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. It proposes Behavior Priming, a training approach that first applies supervised fine-tuning (SFT) on trajectories exhibiting these behaviors and then applies standard reinforcement learning (RL). Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show relative improvements over direct RL of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks; Behavior Priming also outperforms an SFT-then-RL baseline fine-tuned on outcome-correct trajectories. The study finds that, in the priming stage before RL, these reasoning behaviors matter more than outcome correctness.
研究探讨了在代理搜索中成功所需的推理行为,确定了四种有益的行为:信息验证、权威评估、适应性搜索和错误恢复。提出了一种名为行为引导的训练方法,该方法结合了对这些行为的监督微调和强化学习,以提高任务性能。实验表明,行为引导在网页基准测试和七个多跳问答基准测试中分别比直接强化学习和监督微调后强化学习基线高出37.2%和6.2%。研究强调了这些推理行为在强化学习之前的重要性,超过了结果正确性。
On the Probability of First Success in Differential Evolution: Hazard Identities and Tail Bounds
Authors: Dimitar Nedanovski, Svetoslav Nenov, Dimitar Pilev
First: 2026-01-16T18:24:24+00:00 · Latest: 2026-01-16T18:24:24+00:00
Comments: All code is publicly available at https://github.com/snenovgmailcom/lshade_hazard_project
Abstract
We study first-hitting times in Differential Evolution (DE) through a conditional hazard framework. Instead of analyzing convergence via Markov-chain transition kernels or drift arguments, we express the survival probability of a measurable target set $A$ as a product of conditional first-hit probabilities (hazards) $p_t=\mathbb{P}(E_t\mid\mathcal F_{t-1})$. This yields distribution-free identities for survival and explicit tail bounds whenever deterministic lower bounds on the hazard hold on the survival event. For the L-SHADE algorithm with current-to-$p$best/1 mutation, we construct a checkable algorithmic witness event $\mathcal L_t$ under which the conditional hazard admits an explicit lower bound depending only on sampling rules, population size, and crossover statistics. This separates theoretical constants from empirical event frequencies and explains why worst-case constant-hazard bounds are typically conservative. We complement the theory with a Kaplan--Meier survival analysis on the CEC2017 benchmark suite. Across functions and budgets, we identify three distinct empirical regimes: (i) strongly clustered success, where hitting times concentrate in short bursts; (ii) approximately geometric tails, where a constant-hazard model is accurate; and (iii) intractable cases with no observed hits within the evaluation horizon. The results show that while constant-hazard bounds provide valid tail envelopes, the practical behavior of L-SHADE is governed by burst-like transitions rather than homogeneous per-generation success probabilities.
Summary / 总结
The study investigates first-hitting times in Differential Evolution (DE) using a conditional hazard framework. Instead of traditional Markov-chain analysis, the survival probability of a target set is expressed as a product of conditional first-hit probabilities. For the L-SHADE algorithm, a checkable witness event is constructed to provide explicit lower bounds on the hazard, which helps explain the conservatism of worst-case constant-hazard bounds. Empirical analysis on the CEC2017 benchmark suite reveals three distinct regimes: strongly clustered success, approximately geometric tails, and intractable cases, indicating that L-SHADE's practical behavior is characterized by burst-like transitions rather than homogeneous per-generation success probabilities.
本文使用条件风险框架研究了差分进化(DE)中的首次击中时间。不同于传统的马尔可夫链分析,目标集的生存概率被表达为条件首次击中概率的乘积。对于L-SHADE算法,构造了一个可验证的见证事件,以提供明确的下界,这有助于解释最坏情况下的常数风险边界为何通常保守。在CEC2017基准套件上的实证分析揭示了三种不同的模式:强聚集的成功、几何尾部近似以及不可解的情况,表明L-SHADE的实际行为是由突发过渡而非每代恒定的成功概率所支配的。
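The hazard view described above can be illustrated numerically. This is a minimal sketch under a simplifying assumption: the hazards are treated as fixed numbers rather than random conditional probabilities, so survival to generation t is the running product of (1 - p_s), and a uniform lower bound p_min on every hazard gives the geometric tail envelope (1 - p_min)^t. The function names and the toy hazard sequence are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence


def survival_from_hazards(hazards: Sequence[float]) -> List[float]:
    """Return P(T > t) for t = 1..n, given deterministic hazards p_1..p_n."""
    survival = []
    running = 1.0
    for p in hazards:
        running *= 1.0 - p  # no hit at this step, given no hit before
        survival.append(running)
    return survival


def geometric_tail_bound(p_min: float, t: int) -> float:
    """Upper bound (1 - p_min)^t on P(T > t) when every hazard is >= p_min."""
    return (1.0 - p_min) ** t


# Hypothetical burst-like hazards: mostly small, with one large spike.
hazards = [0.01, 0.01, 0.60, 0.01]
survival = survival_from_hazards(hazards)
envelope = [geometric_tail_bound(min(hazards), t) for t in range(1, len(hazards) + 1)]
```

In this toy sequence the envelope stays valid but becomes loose after the spike, which mirrors the abstract's point that constant-hazard bounds are conservative when success arrives in bursts.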
The Poisoned Apple Effect: Strategic Manipulation of Mediated Markets via Technology Expansion of AI Agents
Authors: Eilam Shapira, Roi Reichart, Moshe Tennenholtz
First: 2026-01-16T18:18:03+00:00 · Latest: 2026-01-16T18:18:03+00:00
Abstract
The integration of AI agents into economic markets fundamentally alters the landscape of strategic interaction. We investigate the economic implications of expanding the set of available technologies in three canonical game-theoretic settings: bargaining (resource division), negotiation (asymmetric information trade), and persuasion (strategic information transmission). We find that simply increasing the choice of AI delegates can drastically shift equilibrium payoffs and regulatory outcomes, often creating incentives for regulators to proactively develop and release technologies. Conversely, we identify a strategic phenomenon termed the "Poisoned Apple" effect: an agent may release a new technology, which neither they nor their opponent ultimately uses, solely to manipulate the regulator's choice of market design in their favor. This strategic release improves the releaser's welfare at the expense of their opponent and the regulator's fairness objectives. Our findings demonstrate that static regulatory frameworks are vulnerable to manipulation via technology expansion, necessitating dynamic market designs that adapt to the evolving landscape of AI capabilities.
中文标题/摘要
标题:毒苹果效应:通过AI代理技术扩展对媒介市场进行战略操控
将AI代理融入经济市场从根本上改变了战略互动的格局。我们研究了在三种经典的博弈论框架下扩展可用技术集对经济影响:讨价还价(资源分配)、谈判(不对称信息交易)和说服(战略信息传递)。我们发现,仅仅增加AI代理的选择就能大幅改变均衡收益和监管结果,经常促使监管者主动开发和发布技术。相反,我们发现了一种战略现象,称为“毒苹果”效应:一个代理可能会发布一种新技术,这种技术他们和对手最终都不会使用,只是为了操纵监管者对市场设计的选择以利于自己。这种战略发布会提高发布者的福利,同时损害对手和监管者的公平目标。我们的研究结果表明,静态的监管框架容易受到技术扩展的操控,需要动态的市场设计来适应AI能力不断变化的格局。
Summary / 总结
The study examines how the expansion of AI technologies in economic markets affects strategic interactions in bargaining, negotiation, and persuasion. By increasing the choice of AI delegates, the equilibrium payoffs and regulatory outcomes are significantly altered, often leading regulators to develop and release technologies proactively. The research identifies a strategic phenomenon called the 'Poisoned Apple' effect, where an agent releases a new technology that neither they nor their opponent uses, to manipulate the regulator's choice of market design in their favor, thereby improving their welfare at the expense of their opponent and the regulator's fairness objectives. This highlights the vulnerability of static regulatory frameworks to technology expansion and suggests the need for dynamic market designs.
研究探讨了AI技术在经济市场中的扩展如何影响讨价还价、谈判和说服中的战略互动。通过增加AI代理的选择,均衡收益和监管结果会显著改变,通常促使监管者主动开发和发布技术。研究发现了一种被称为‘毒苹果’效应的战略现象,即一个代理人发布一种新技术,但该技术既不被他们自己也不被对手使用,而是为了操纵监管者对市场设计的选择,从而在损害对手和监管者公平目标的同时,提高自己的福利。这表明静态监管框架容易受到技术扩展的操纵,并强调了需要动态市场设计以适应AI能力的演变。
BoxMind: Closed-loop AI strategy optimization for elite boxing validated in the 2024 Olympics
Authors: Kaiwen Wang, Kaili Zheng, Rongrong Deng, Qingmin Fan, Milin Zhang, Zongrui Li, Xuesi Zhou, Bo Han, Liren Chen, Chenyi Guo, Ji Wu
First: 2026-01-16T18:14:46+00:00 · Latest: 2026-01-16T18:14:46+00:00
Abstract
Competitive sports require sophisticated tactical analysis, yet combat disciplines like boxing remain underdeveloped in AI-driven analytics due to the complexity of action dynamics and the lack of structured tactical representations. To address this, we present BoxMind, a closed-loop AI expert system validated in elite boxing competition. By defining atomic punch events with precise temporal boundaries and spatial and technical attributes, we parse match footage into 18 hierarchical technical-tactical indicators. We then propose a graph-based predictive model that fuses these explicit technical-tactical profiles with learnable, time-variant latent embeddings to capture the dynamics of boxer matchups. Modeling match outcome as a differentiable function of technical-tactical indicators, we turn winning probability gradients into executable tactical adjustments. Experiments show that the outcome prediction model achieves state-of-the-art performance, with 69.8% accuracy on BoxerGraph test set and 87.5% on Olympic matches. Using this predictive model as a foundation, the system generates strategic recommendations that demonstrate proficiency comparable to human experts. BoxMind is validated through a closed-loop deployment during the 2024 Paris Olympics, directly contributing to the Chinese National Team's historic achievement of three gold and two silver medals. BoxMind establishes a replicable paradigm for transforming unstructured video data into strategic intelligence, bridging the gap between computer vision and decision support in competitive sports.
中文标题/摘要
标题:BoxMind:精英拳击中的闭环AI策略优化在2024年奥运会上验证
竞技体育需要复杂的战术分析,然而,由于动作动态的复杂性和缺乏结构化的战术表示,搏击类运动在AI驱动的分析方面仍处于落后状态。为解决这一问题,我们提出了BoxMind,这是一种在精英拳击比赛中验证过的闭环AI专家系统。通过定义精确的时间边界和空间及技术属性的原子性击打事件,我们将比赛视频解析为18个层次的技术-战术指标。然后,我们提出了一种基于图的预测模型,将这些明确的技术-战术特征与可学习的时间变化潜在嵌入融合,以捕捉拳手对决的动力学。将比赛结果建模为技术-战术指标的可微函数,我们将获胜概率梯度转化为可执行的战术调整。实验表明,结果预测模型达到了最先进的性能,在BoxerGraph测试集上准确率为69.8%,在奥运比赛中为87.5%。基于此预测模型,系统生成的战略建议展示了与人类专家相当的水平。BoxMind在2024年巴黎奥运会上通过闭环部署得到了验证,直接促成了中国国家队历史上三金两银的辉煌成就。BoxMind建立了一个可复制的范例,将未结构化的视频数据转化为战略智能,填补了计算机视觉与决策支持之间的差距,应用于竞技体育。
Summary / 总结
BoxMind is a closed-loop AI expert system that optimizes boxing strategy by parsing match footage into 18 hierarchical technical-tactical indicators and turning winning-probability gradients into tactical adjustments. It defines atomic punch events with precise temporal, spatial, and technical attributes and uses a graph-based model to predict match outcomes, achieving 69.8% accuracy on the BoxerGraph test set and 87.5% on Olympic matches. The system was deployed during the 2024 Paris Olympics, contributing to the Chinese National Team's three gold and two silver medals.
BoxMind 是一个通过分析比赛视频和生成战术调整来优化拳击策略的人工智能系统。它定义了精确属性的击拳事件,并使用基于图的模型来预测比赛结果,测试集的准确率为 69.8%,奥运比赛的准确率为 87.5%。该系统在 2024 年巴黎奥运会上得到了验证,为中国国家拳击队带来了三枚金牌和两枚银牌的成功。
CTest-Metric: A Unified Framework to Assess Clinical Validity of Metrics for CT Report Generation
Authors: Vanshali Sharma, Andrea Mia Bejar, Gorkem Durak, Ulas Bagci
Venue: ISBI 2026
First: 2026-01-16T18:09:19+00:00 · Latest: 2026-01-16T18:09:19+00:00
Comments: Accepted at ISBI 2026
Abstract
In the generative AI era, where even critical medical tasks are increasingly automated, radiology report generation (RRG) continues to rely on suboptimal metrics for quality assessment. Developing domain-specific metrics has therefore been an active area of research, yet it remains challenging due to the lack of a unified, well-defined framework to assess their robustness and applicability in clinical contexts. To address this, we present CTest-Metric, a first unified metric assessment framework with three modules determining the clinical feasibility of metrics for CT RRG. The modules test: (i) Writing Style Generalizability (WSG) via LLM-based rephrasing; (ii) Synthetic Error Injection (SEI) at graded severities; and (iii) Metrics-vs-Expert correlation (MvE) using clinician ratings on 175 "disagreement" cases. Eight widely used metrics (BLEU, ROUGE, METEOR, BERTScore-F1, F1-RadGraph, RaTEScore, GREEN Score, CRG) are studied across seven LLMs built on a CT-CLIP encoder. Using our novel framework, we found that lexical NLG metrics are highly sensitive to stylistic variations; GREEN Score aligns best with expert judgments (Spearman 0.70), while CRG shows negative correlation; and BERTScore-F1 is least sensitive to factual error injection. We will release the framework, code, and allowable portion of the anonymized evaluation data (rephrased/error-injected CT reports), to facilitate reproducible benchmarking and future metric development.
中文标题/摘要
标题:CTest-Metric:一种统一框架以评估CT报告生成中临床有效性的度量标准
在生成式AI时代,即使关键医疗任务也越来越多地自动化,放射学报告生成(RRG)仍然依赖于次优的评估质量指标。因此,开发特定领域的度量标准一直是研究的活跃领域,但由于缺乏一个统一且定义良好的框架来评估其在临床环境中的稳健性和适用性,这仍然是一个挑战。为了解决这个问题,我们提出了CTest-Metric,这是一种统一的度量标准评估框架,包含三个模块以确定度量标准在CT RRG中的临床可行性。这些模块测试:(i) 通过基于LLM的重写测试写作风格的一般性(WSG);(ii) 在不同严重程度上注入合成错误(SEI);(iii) 使用临床医生对175个“分歧”案例的评级测试度量标准与专家的一致性(MvE)。八个广泛使用的度量标准(BLEU、ROUGE、METEOR、BERTScore-F1、F1-RadGraph、RaTEScore、GREEN Score、CRG)在七个基于CT-CLIP编码器构建的LLM上进行了研究。使用我们的新型框架,我们发现词汇型NLG度量标准对风格变化非常敏感;GREEN Score与专家判断最一致(斯皮尔曼相关系数约为0.70),而CRG显示出负相关;BERTScore-F1对事实错误注入的敏感性最低。我们将发布该框架、代码以及匿名评估数据的部分(重写/错误注入的CT报告),以促进可重复基准测试和未来度量标准的发展。
Summary / 总结
CTest-Metric is a unified framework for assessing the clinical validity of metrics used in CT report generation. It evaluates metrics through three modules: Writing Style Generalizability, Synthetic Error Injection, and Metrics-vs-Expert correlation. The study found that lexical NLG metrics are highly sensitive to stylistic variations, GREEN Score best aligns with expert judgments, CRG shows a negative correlation, and BERTScore-F1 is least sensitive to factual errors.
CTest-Metric 是一个统一的框架,用于评估用于 CT 报告生成的指标的临床有效性。它包括三个模块:写作风格通用性 (WSG)、合成错误注入 (SEI) 和指标与专家判断 (MvE)。研究测试了八个指标在七个 LLM 上的结果,发现词汇型 NLG 指标对风格变化非常敏感,GREEN Score 最好地与专家判断一致,CRG 显示负相关,而 BERTScore-F1 对事实错误注入最不敏感。
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study
Authors: Xiaoran Fan, Zhichao Sun, Yangfan Gao, Jingfei Xiong, Hang Yan, Yifei Cao, Jiajun Sun, Shuo Li, Zhihao Zhang, Zhiheng Xi, Yuhao Zhou, Senjie Jin, Changhao Jiang, Junjie Ye, Ming Zhang, Rui Zheng, Zhenhua Han, Yunke Zhang, Demei Yan, Shaokang Dong, Tao Ji, Tao Gui
First: 2025-06-14T15:26:31+00:00 · Latest: 2026-01-16T17:59:34+00:00
Abstract
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
中文标题/摘要
标题:什么造就了适合LLM中心语音生成的优质语音分词器?一项系统研究
语音语言模型(SLMs)为统一语音和文本的理解与生成提供了有希望的途径。然而,在实现有效的跨模态对齐和高质量的语音生成方面仍存在挑战。在本工作中,我们系统地研究了在LLM中心的SLMs中语音分词器设计的作用,这些模型通过语音头和说话人建模进行增强。我们在公平的SLM框架下比较了耦合、半解耦和完全解耦的语音分词器,并发现解耦分词显著提高了对齐和合成质量。为了解决语音和文本之间信息密度的不匹配,我们引入了多令牌预测(MTP)到SLMs中,使每个隐藏状态能够解码多个语音令牌。这导致了高达12倍的解码速度提升,并且词错误率大幅下降(从6.07降至3.01)。此外,我们提出了一种基于说话人的生成范式,并引入了RoleTriviaQA,这是一个包含多种说话人身份的大规模角色扮演知识问答基准。实验表明,我们的方法提高了知识理解和说话人一致性。
Summary / 总结
This study investigates the impact of different speech tokenizer designs on LLM-centric SLMs, finding that decoupled tokenization improves alignment and synthesis quality. The research introduces multi-token prediction (MTP) to address the information density mismatch between speech and text, achieving up to 12 times faster decoding and a significant reduction in word error rate. Additionally, a speaker-aware generation paradigm and RoleTriviaQA benchmark are proposed to enhance knowledge understanding and speaker consistency in SLMs.
该研究系统地探讨了不同语音分词设计对LLM为中心的语音生成模型的影响。通过比较耦合、半解耦和完全解耦的分词器,研究发现解耦分词可以显著提高对齐和合成质量。引入多令牌预测(MTP)到SLM中,实现了更快的解码和显著降低词错误率。此外,提出了一个基于角色的生成范式和一个新的大规模角色扮演知识问答基准RoleTriviaQA,以增强语音生成模型的知识理解和说话人一致性。
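The effect of multi-token prediction on decoding length, and the size of the reported word error rate drop, can be illustrated with a little arithmetic. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and counting decoding steps is only a rough proxy for wall-clock speed.

```python
import math


def decoding_steps(num_speech_tokens: int, tokens_per_step: int) -> int:
    """Autoregressive steps needed when each hidden state emits several tokens."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    return math.ceil(num_speech_tokens / tokens_per_step)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Hypothetical utterance of 600 speech tokens.
baseline_steps = decoding_steps(600, 1)
mtp_steps = decoding_steps(600, 12)

# WER values taken from the abstract (6.07 -> 3.01).
wer_drop = relative_reduction(6.07, 3.01)
```

Under these assumptions, emitting 12 tokens per hidden state cuts the step count by a factor of 12, and the reported WER change corresponds to a relative reduction of roughly one half.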
UCB-type Algorithm for Budget-Constrained Expert Learning
Authors: Ilgam Latypov, Alexandra Suvorikova, Alexey Kroshnin, Alexander Gasnikov, Yuriy Dorn
First: 2025-10-26T12:36:17+00:00 · Latest: 2026-01-16T17:59:33+00:00
Abstract
In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, a learner must select one predictor among $K$ adaptive experts to make a prediction, while being able to update at most $M \le K$ of them under a fixed training budget. We address this problem in the stochastic setting and introduce M-LCB, a computationally efficient UCB-style meta-algorithm that provides anytime regret guarantees. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde{O}(T^{\alpha})$, then M-LCB ensures overall regret bounded by $\tilde{O}\bigl(\sqrt{KT/M} + (K/M)^{1-\alpha}\,T^{\alpha}\bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how M-LCB extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
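To see how the training budget enters the stated regret bound, it may help to substitute the two extreme budgets. This is only a direct substitution into the bound quoted in the abstract, not an additional result from the paper.

```latex
% Full budget: every expert can be updated each round (M = K).
\tilde{O}\!\left(\sqrt{\tfrac{KT}{M}} + \left(\tfrac{K}{M}\right)^{1-\alpha} T^{\alpha}\right)
\Big|_{M=K}
= \tilde{O}\!\left(\sqrt{T} + T^{\alpha}\right)

% Minimal budget: only one expert can be updated each round (M = 1).
\tilde{O}\!\left(\sqrt{\tfrac{KT}{M}} + \left(\tfrac{K}{M}\right)^{1-\alpha} T^{\alpha}\right)
\Big|_{M=1}
= \tilde{O}\!\left(\sqrt{KT} + K^{1-\alpha}\, T^{\alpha}\right)
```

With $M = K$ the dependence on the number of experts disappears from the bound, while with $M = 1$ the first term takes the familiar $\sqrt{KT}$ form of classical multi-armed bandit bounds.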
Generative Scenario Rollouts for End-to-End Autonomous Driving
Authors: Rajeev Yasarla, Deepti Hegde, Shizhong Han, Hsin-Pai Cheng, Yunxiao Shi, Meysam Sadeghigooghari, Shweta Mahajan, Apratim Bhattacharyya, Litian Liu, Risheek Garrepalli, Thomas Svantesson, Fatih Porikli, Hong Cai
First: 2026-01-16T17:59:28+00:00 · Latest: 2026-01-16T17:59:28+00:00
Abstract
Vision-Language-Action (VLA) models are emerging as highly effective planning models for end-to-end autonomous driving systems. However, current works mostly rely on imitation learning from sparse trajectory annotations and under-utilize their potential as generative models. We propose Generative Scenario Rollouts (GeRo), a plug-and-play framework for VLA models that jointly performs planning and generation of language-grounded future traffic scenes through an autoregressive rollout strategy. First, a VLA model is trained to encode ego vehicle and agent dynamics into latent tokens under supervision from planning, motion, and language tasks, facilitating text-aligned generation. Next, GeRo performs language-conditioned autoregressive generation. Given multi-view images, a scenario description, and ego-action questions, it generates future latent tokens and textual responses to guide long-horizon rollouts. A rollout-consistency loss stabilizes predictions using ground truth or pseudo-labels, mitigating drift and preserving text-action alignment. This design enables GeRo to perform temporally consistent, language-grounded rollouts that support long-horizon reasoning and multi-agent planning. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively. By integrating reinforcement learning with generative rollouts, GeRo achieves state-of-the-art closed-loop and open-loop performance, demonstrating strong zero-shot robustness. These results highlight the promise of generative, language-conditioned reasoning as a foundation for safer and more interpretable end-to-end autonomous driving.
中文标题/摘要
标题:生成场景展开在端到端自动驾驶中的应用
视觉-语言-动作(VLA)模型正在成为端到端自动驾驶系统中高度有效的规划模型。然而,当前的工作主要依赖于稀疏轨迹注释的模仿学习,并且未能充分利用其作为生成模型的潜力。我们提出了生成场景展开(GeRo),这是一种插件式框架,通过自回归展开策略联合执行基于语言的未来交通场景的规划和生成。首先,训练一个VLA模型将自我车辆和代理的动力学编码为在规划、运动和语言任务监督下的潜在标记,促进文本对齐的生成。接下来,GeRo执行基于语言的自回归生成。给定多视角图像、场景描述和自我动作问题,它生成未来潜在标记和文本响应以引导长期展开。展开一致性损失使用真实值或伪标签稳定预测,减轻漂移并保持文本-动作对齐。此设计使GeRo能够执行时间一致、基于语言的展开,支持长期推理和多智能体规划。在Bench2Drive上,GeRo的驾驶得分和成功率分别提高了15.7%和26.2%。通过将强化学习与生成展开相结合,GeRo实现了最先进的闭环和开环性能,展示了强大的零样本鲁棒性。这些结果突显了生成性、基于语言的推理作为端到端自动驾驶安全性和可解释性基础的潜力。
Summary / 总结
The research aims to enhance end-to-end autonomous driving systems by leveraging Vision-Language-Action (VLA) models as generative models. The proposed Generative Scenario Rollouts (GeRo) framework trains VLA models to encode dynamics into latent tokens and performs autoregressive generation of future traffic scenes. On Bench2Drive, GeRo improves driving score and success rate by +15.7 and +26.2, respectively, and achieves state-of-the-art closed-loop and open-loop performance with strong zero-shot robustness by integrating reinforcement learning with generative rollouts.
研究旨在通过利用Vision-Language-Action (VLA) 模型作为生成模型来提升端到端自动驾驶系统。提出的Generative Scenario Rollouts (GeRo) 框架训练VLA模型将车辆动态编码为潜在令牌,并进行未来交通场景的自回归生成。GeRo 在 Bench2Drive 上将驾驶得分和成功率分别提高了15.7%和26.2%,并在闭环和开环场景中均实现了最先进的性能,展示了强大的零样本鲁棒性。
Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs
Authors: Alessandro Padella, Massimiliano de Leoni, Marlon Dumas
First: 2026-01-16T17:54:55+00:00 · Latest: 2026-01-16T17:54:55+00:00
Comments: 19 pages, 4 figures, TMIS journal submission
Abstract
Predictive Process Monitoring is a branch of process mining that aims to predict the outcome of an ongoing process. Recently, it has leveraged machine learning and deep learning architectures. In this paper, we extend our prior LLM-based Predictive Process Monitoring framework, which was initially focused on total time prediction via prompting. The extension consists of comprehensively evaluating its generality, semantic leverage, and reasoning mechanisms across multiple Key Performance Indicators. Empirical evaluations conducted on three distinct event logs and across the Key Performance Indicators of Total Time and Activity Occurrence prediction indicate that, in data-scarce settings with only 100 traces, the LLM surpasses the benchmark methods. Furthermore, the experiments show that the LLM exploits both its embodied prior knowledge and the internal correlations among training traces. Finally, we examine the reasoning strategies employed by the model, demonstrating that the LLM does not merely replicate existing predictive methods but performs higher-order reasoning to generate the predictions.
中文标题/摘要
标题:探索基于LLM的预测过程监控在小规模事件日志中的特征
预测过程监控是过程挖掘的一个分支,旨在预测正在进行的过程的结果。最近,它利用了机器学习和深度学习架构。在本文中,我们扩展了我们之前基于LLM的预测过程监控框架,该框架最初专注于通过提示进行总时间预测。扩展包括全面评估其通用性、语义利用和推理机制,以及跨多个关键绩效指标。在三个不同的事件日志和总时间和活动发生预测的关键绩效指标上进行的实证评估表明,在只有100条轨迹的数据稀缺设置中,LLM超过了基准方法。此外,实验还表明,LLM利用了其内置的先验知识以及训练轨迹之间的内部关联。最后,我们研究了模型采用的推理策略,证明LLM不仅复制现有的预测方法,还进行更高层次的推理以生成预测。
Summary / 总结
This paper extends a prior LLM-based Predictive Process Monitoring framework, evaluating its generality and reasoning mechanisms across multiple Key Performance Indicators. Empirical evaluations on three event logs show that the LLM outperforms benchmark methods in data-scarce settings with only 100 traces, leveraging both prior knowledge and internal trace correlations to generate predictions through higher-order reasoning.
本文扩展了先前基于LLM的预测过程监控框架,最初专注于总时间预测。它在多个关键绩效指标上评估了该框架的通用性、语义利用和推理机制。实验证明,在只有100条轨迹的数据稀缺环境中,LLM在基准方法中表现出色,它利用了先验知识和训练轨迹之间的内部关联进行推理。LLM进行高层次的推理来生成预测,而不仅仅是复制现有的预测方法。
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui
First: 2026-01-16T17:45:34+00:00 · Latest: 2026-01-16T17:45:34+00:00
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
中文标题/摘要
标题:MHA2MLA-VLM:使DeepSeek的经济型多头潜在注意力适用于视觉-语言模型
随着视觉-语言模型(VLMs)处理越来越复杂和多模态的任务,关键-值(KV)缓存的快速增长在推理过程中造成了显著的内存和计算瓶颈。虽然多头潜在注意力(MLA)提供了一种有效的压缩KV缓存和加速推理的方法,但如何在不进行昂贵的预训练的情况下将现有的VLMs适应到MLA架构中仍然鲜有探索。在本文中,我们提出了MHA2MLA-VLM,这是一种参数高效且多模态感知的框架,用于将现成的VLMs转换为MLA。我们的方法包含两个核心技术:(1)一种适应模态的部分-RoPE策略,该策略通过选择性地屏蔽非必要维度支持传统的和多模态设置,(2)一种模态解耦的低秩近似方法,该方法独立地压缩了视觉和文本的KV空间。此外,我们引入了参数高效的微调以最小化适应成本,并证明了最小化输出激活误差而非参数距离可以显著减少性能损失。在三个代表性VLMs上的广泛实验表明,MHA2MLA-VLM在最少的监督数据下恢复了原始模型性能,显著减少了KV缓存的占用空间,并与KV量化无缝集成。
Summary / 总结
The research aims to address the memory and computational challenges posed by the Key-Value (KV) cache in vision-language models (VLMs) by introducing MHA2MLA-VLM, a parameter-efficient framework for converting existing VLMs to Multi-Head Latent Attention (MLA). The method employs a modality-adaptive partial-RoPE strategy and a modality-decoupled low-rank approximation to compress the KV cache and minimize adaptation cost. Experiments on three VLMs show that MHA2MLA-VLM can restore original model performance with minimal supervised data and significantly reduce KV cache size.
研究针对视觉语言模型(VLM)中关键值缓存带来的内存和计算瓶颈,提出了MHA2MLA-VLM,这是一种参数高效的框架,用于将现有VLM转换为多头潜在注意力(MLA)。该方法包括一种模态自适应部分RoPE策略和一种模态解耦低秩近似方法,以支持传统和多模态设置,并采用参数高效的微调来最小化适应成本。实验表明,MHA2MLA-VLM在少量监督数据下恢复了原始性能,显著减少了关键值缓存的大小,并与关键值量化无缝集成。
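The KV cache saving that motivates converting multi-head attention (MHA) to MLA can be estimated with simple counting. The sketch below assumes the MLA cache layout described for DeepSeek's MLA, where each layer stores one compressed latent vector plus one shared RoPE key per token. All dimensions are hypothetical and are not taken from the MHA2MLA-VLM paper.

```python
def mha_cache_elems(n_layers: int, n_heads: int, head_dim: int) -> int:
    """Cached elements per token for standard MHA: one key and one value per head."""
    return 2 * n_layers * n_heads * head_dim


def mla_cache_elems(n_layers: int, latent_dim: int, rope_dim: int) -> int:
    """Cached elements per token for MLA (assumed layout: latent + shared RoPE key)."""
    return n_layers * (latent_dim + rope_dim)


# Hypothetical model: 32 layers, 32 heads of dimension 128.
mha = mha_cache_elems(n_layers=32, n_heads=32, head_dim=128)
# Hypothetical MLA setting: 512-dim latent, 64-dim RoPE key.
mla = mla_cache_elems(n_layers=32, latent_dim=512, rope_dim=64)

compression_ratio = mha / mla
```

With these illustrative numbers the per-token cache shrinks by roughly 14 times. The real saving depends on the latent and RoPE dimensions chosen during conversion.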
Probabilistic Mission Design for Neuro-Symbolic Unmanned Aircraft Systems
Authors: Simon Kohaut, Benedict Flade, Daniel Ochs, Devendra Singh Dhami, Julian Eggert, Kristian Kersting
First: 2024-12-25T11:04:00+00:00 · Latest: 2026-01-16T17:27:13+00:00
Comments: arXiv admin note: text overlap with arXiv:2406.03454
Abstract
Advanced Air Mobility (AAM) is a growing field that demands accurate and trustworthy models of legal concepts and restrictions for navigating Unmanned Aircraft Systems (UAS). In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of UAS beyond visual line of sight (BVLOS) is an endearing task that promises to significantly enhance today's logistics and emergency response capabilities. Hence, we propose Probabilistic Mission Design (ProMis), a novel neuro-symbolic approach to navigating UAS within legal frameworks. ProMis is an interpretable and adaptable system architecture that links uncertain geospatial data and noisy perception with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. To inform planning with legal restrictions and uncertainty in mind, ProMis yields Probabilistic Mission Landscapes (PML). These scalar fields quantify the belief that the HPLP is satisfied across the agent's state space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many AAM scenarios.
中文标题/摘要
标题:神经符号无人驾驶航空系统中的概率任务设计
先进空中移动(AAM)是一个快速增长的领域,需要准确可靠的法律概念和限制模型来导航无人驾驶航空系统(UAS)。此外,任何AAM的实现都需要面对动态和不确定的人类居住空间带来的挑战。然而,超越视距(BVLOS)的UAS应用是一个令人向往的任务,有望显著提升当今的物流和应急响应能力。因此,我们提出了概率任务设计(ProMis),这是一种新颖的神经符号方法,用于在法律框架内导航UAS。ProMis 是一个可解释且适应性强的系统架构,将不确定的地理空间数据和嘈杂的感知与声明性混合概率逻辑程序(HPLP)连接起来,以推理代理的状态空间及其合法性。为了在规划中考虑法律限制和不确定性,ProMis 生成了概率任务景观(PML)。这些标量场量化了HPLP在代理状态空间中得到满足的信念。通过扩展ProMis推理能力和计算特性的先前工作,我们展示了其与强大的机器学习模型(如大型语言模型LLM和基于变换器的视觉模型)的集成。因此,我们的实验证明了ProMis在多模态输入数据下的应用及其方法如何应用于许多AAM场景。
Summary / 总结
The paper proposes Probabilistic Mission Design (ProMis), a neuro-symbolic approach for navigating Unmanned Aircraft Systems (UAS) within legal frameworks. ProMis integrates uncertain geospatial data and noisy perception with Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and legality, producing Probabilistic Mission Landscapes (PML) that quantify the belief that the HPLP is satisfied. Experiments demonstrate ProMis's integration with machine learning models such as Large Language Models and Transformer-based vision models, its use with multi-modal input data, and its applicability to various Advanced Air Mobility (AAM) scenarios.
论文提出了Probabilistic Mission Design (ProMis),这是一种神经符号方法,用于在法律框架内导航无人驾驶航空系统(UAS),以应对动态和不确定环境的挑战。ProMis 使用混合概率逻辑程序(HPLP)来推理代理的状态空间和合法性,生成概率任务景观(PML),量化HPLP满足性的信念。实验展示了ProMis与机器学习模型的集成能力,证明了其处理多模态输入数据和适用于各种先进空中移动(AAM)场景的能力。
PRISM-CAFO: Prior-conditioned Remote-sensing Infrastructure Segmentation and Mapping for CAFOs
Authors: Oishee Bintey Hoque, Nibir Chandra Mandal, Kyle Luong, Amanda Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga
First: 2026-01-16T17:16:26+00:00 · Latest: 2026-01-16T17:16:26+00:00
Abstract
Large-scale livestock operations pose significant risks to human health and the environment, while also being vulnerable to threats such as infectious diseases and extreme weather events. As the number of such operations continues to grow, accurate and scalable mapping has become increasingly important. In this work, we present an infrastructure-first, explainable pipeline for identifying and characterizing Concentrated Animal Feeding Operations (CAFOs) from aerial and satellite imagery. Our method (1) detects candidate infrastructure (e.g., barns, feedlots, manure lagoons, silos) with a domain-tuned YOLOv8 detector, then derives SAM2 masks from these boxes and filters component-specific criteria, (2) extracts structured descriptors (e.g., counts, areas, orientations, and spatial relations) and fuses them with deep visual features using a lightweight spatial cross-attention classifier, and (3) outputs both CAFO type predictions and mask-level attributions that link decisions to visible infrastructure. Through comprehensive evaluation, we show that our approach achieves state-of-the-art performance, with Swin-B+PRISM-CAFO surpassing the best performing baseline by up to 15%. Beyond strong predictive performance across diverse U.S. regions, we run systematic gradient-activation analyses that quantify the impact of domain priors and show ho
中文标题/摘要
标题:PRISM-CAFO:基于先验条件的遥感基础设施分割与映射管道
大规模的畜禽养殖场对人类健康和环境构成了重大风险,同时这些养殖场也容易受到传染病和极端天气事件的威胁。随着此类养殖场数量的不断增加,准确和可扩展的映射变得越来越重要。在本工作中,我们提出了一种以基础设施为主导、可解释的管道,用于从航空和卫星图像中识别和表征集中饲养场(CAFO)。我们的方法(1)使用领域调优的YOLOv8检测器检测候选基础设施(例如,畜舍、饲料场、粪池、筒仓),然后从这些框中推导SAM2掩码并根据组件特定标准进行筛选,(2)提取结构化描述符(例如,数量、面积、方向和空间关系),并使用轻量级空间交叉注意力分类器与深度视觉特征进行融合,以及(3)输出CAFO类型预测和掩码级别的归因,将决策与可见基础设施联系起来。通过全面评估,我们展示了我们的方法达到了最先进的性能,Swin-B+PRISM-CAFO超越了最佳基线高达15%。除了在多样化的美国地区表现出强大的预测性能外,我们还进行了系统梯度-激活分析,量化了领域先验的影响,并展示了
Summary / 总结
The research aims to accurately and scalably map Concentrated Animal Feeding Operations (CAFOs) to address environmental and health risks. The method uses a domain-tuned YOLOv8 detector to identify candidate infrastructure, derives SAM2 masks, and applies component-specific criteria. It then extracts structured descriptors and fuses them with deep visual features using a lightweight spatial cross-attention classifier. The approach achieves state-of-the-art performance, surpassing the best baseline by up to 15%, and provides mask-level attributions linking decisions to visible infrastructure. Comprehensive evaluations across diverse U.S. regions demonstrate its effectiveness and predictive power.
研究旨在准确且规模化地绘制集中式动物饲养场(CAFOs),以应对环境和健康风险。方法使用领域调优的YOLOv8检测器识别候选基础设施,生成SAM2掩码,并应用组件特定标准。然后提取结构化描述符,并使用轻量级空间交叉注意力分类器融合深度视觉特征。该方法达到最先进的性能,比最佳基线高出15%以上,并提供掩码级别的归因,将决策与可见基础设施联系起来。全面评估表明其在不同美国地区的预测能力和有效性。
Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
Authors: Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez-Pellitero, Youngkyoon Jang
First: 2026-01-16T17:02:46+00:00 · Latest: 2026-01-16T17:02:46+00:00
Abstract
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: Metric Cognitive Map (Metric-CogMap) and Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building upon the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy using only half the supervision, closely matching the 60.9% baseline trained with the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on the VSI-Bench.
中文标题/摘要
标题:Map2Thought:基于度量认知图的显式三维空间推理
我们提出Map2Thought框架,使其能够为3D VLMs提供显式且可解释的空间推理。该框架基于两个关键组件:度量认知图(Metric-CogMap)和认知思维链(Cog-CoT)。度量认知图通过将离散网格用于关系推理与连续的度量尺度表示用于精确的几何理解,提供了一种统一的空间表示。基于度量认知图,认知思维链通过确定性操作进行显式的几何推理,包括向量操作、边界框距离以及遮挡感知的外观顺序提示,生成基于3D结构的可解释推理轨迹。实验结果表明,Map2Thought能够实现可解释的3D理解,仅使用一半的监督数据即可达到59.9%的准确率,接近使用完整数据集训练的60.9%基线。在10%、25%和50%训练子集上,它分别比最先进的方法高出5.3%、4.8%和4.0%,在VSI-Bench上表现更优。
Summary / 总结
Map2Thought is a framework that enhances 3D vision-language models with explicit spatial reasoning through Metric Cognitive Maps and Cognitive Chain-of-Thought. It integrates discrete and continuous representations for precise geometric understanding and interpretable reasoning. Experiments on VSI-Bench show that Map2Thought achieves 59.9% accuracy with half the supervision, close to the 60.9% baseline trained on the full dataset, and outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively.
Map2Thought 是一个框架,通过结合 Metric Cognitive Maps 和 Cognitive Chain-of-Thought 来增强 3D 视觉和语言模型的空间推理能力。它将离散和连续表示相结合,实现精确的几何理解和可解释的推理。实验表明,Map2Thought 在仅使用一半监督数据的情况下达到了 59.9% 的准确率,并且在不同训练子集大小下分别优于最先进的方法 5.3%、4.8% 和 4.0%。
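One of the deterministic operations mentioned in the abstract, bounding-box distance, can be sketched as follows. This is an illustrative version that assumes axis-aligned 3D boxes given as (min corner, max corner) pairs. The function name and box format are hypothetical, and the paper's exact definition may differ.

```python
import math


def aabb_distance(box_a, box_b):
    """Shortest Euclidean distance between two axis-aligned boxes.

    Each box is a pair (min_corner, max_corner) of 3D coordinates.
    Returns 0.0 when the boxes touch or overlap.
    """
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    # Per-axis gap: positive only when the intervals do not overlap.
    gaps = [
        max(0.0, b_lo - a_hi, a_lo - b_hi)
        for a_lo, a_hi, b_lo, b_hi in zip(a_min, a_max, b_min, b_max)
    ]
    return math.sqrt(sum(g * g for g in gaps))


# Hypothetical objects in metric (metre-scale) coordinates.
chair = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
table = ((4.0, 5.0, 1.0), (5.0, 6.0, 2.0))
chair_to_table = aabb_distance(chair, table)
```

Because the operation is deterministic, the same inputs always give the same distance, which is what makes this kind of reasoning trace easy to inspect.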
GenDA: Generative Data Assimilation on Complex Urban Areas via Classifier-Free Diffusion Guidance
Authors: Francisco Giral, Álvaro Manzano, Ignacio Gómez, Ricardo Vinuesa, Soledad Le Clainche
First: 2026-01-16T17:02:00+00:00 · Latest: 2026-01-16T17:02:00+00:00
Abstract
Urban wind flow reconstruction is essential for assessing air quality, heat dispersion, and pedestrian comfort, yet remains challenging when only sparse sensor data are available. We propose GenDA, a generative data assimilation framework that reconstructs high-resolution wind fields on unstructured meshes from limited observations. The model employs a multiscale graph-based diffusion architecture trained on computational fluid dynamics (CFD) simulations and interprets classifier-free guidance as a learned posterior reconstruction mechanism: the unconditional branch learns a geometry-aware flow prior, while the sensor-conditioned branch injects observational constraints during sampling. This formulation enables obstacle-aware reconstruction and generalization across unseen geometries, wind directions, and mesh resolutions without retraining. We consider both sparse fixed sensors and trajectory-based observations using the same reconstruction procedure. When evaluated against supervised graph neural network (GNN) baselines and classical reduced-order data assimilation methods, GenDA reduces the relative root-mean-square error (RRMSE) by 25-57% and increases the structural similarity index (SSIM) by 23-33% across the tested meshes. Experiments are conducted on Reynolds-averaged Navier-Stokes (RANS) simulations of a real urban neighbourhood in Bristol, United Kingdom, at a characteristic Reynolds number of $\mathrm{Re}\approx2\times10^{7}$, featuring complex building geometry and irregular terrain. The proposed framework provides a scalable path toward generative, geometry-aware data assimilation for environmental monitoring in complex domains.
Summary / 总结
GenDA is a generative data assimilation framework that reconstructs high-resolution wind fields from limited sensor data using a multiscale graph-based diffusion architecture. It learns a geometry-aware flow prior and incorporates observational constraints during sampling to enable obstacle-aware reconstruction and generalization across different geometries and mesh resolutions. Experiments show that GenDA outperforms supervised graph neural network baselines and classical data assimilation methods, reducing RRMSE by 25-57% and increasing SSIM by 23-33%. The framework was tested on RANS simulations of a real urban neighborhood in Bristol, UK, with complex building geometry and irregular terrain.
GenDA 是一种生成式数据同化框架,通过多尺度图基扩散架构从有限的传感器数据中重建高分辨率风场。该框架学习几何感知的流场先验,并在采样过程中注入观测约束,以实现对不同几何形状和网格分辨率的障碍物感知重建和泛化。实验表明,GenDA 在 RRMSE 上优于监督图神经网络基线和经典数据同化方法,降低了 25-57% 的 RRMSE,并在 SSIM 上提高了 23-33%。该框架在英国布里斯托尔市一个具有复杂建筑几何形状和不规则地形的真实城市街区的 RANS 模拟中进行了测试。
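The classifier-free guidance step and the RRMSE metric referred to above can be sketched in a few lines. This is a generic illustration rather than GenDA's code: the guidance formula is the standard one, the RRMSE definition used here (error norm divided by the norm of the reference field) is one common convention, and all names and values are hypothetical.

```python
import math


def guided_prediction(uncond, cond, guidance_scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def rrmse(pred, truth):
    """Relative RMSE, assumed here to be ||pred - truth|| / ||truth||."""
    num = sum((p - t) ** 2 for p, t in zip(pred, truth))
    den = sum(t ** 2 for t in truth)
    return math.sqrt(num / den)


# Hypothetical per-node outputs of the two branches at one sampling step.
prior_branch = [0.0, 0.0]    # unconditional (geometry-aware prior)
sensor_branch = [1.0, 2.0]   # sensor-conditioned
blended = guided_prediction(prior_branch, sensor_branch, guidance_scale=2.0)
```

A guidance scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger values weight the observations more strongly.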
Inter-patient ECG Arrhythmia Classification with LGNs and LUTNs
Authors: Wout Mommen, Lars Keuninckx, Paul Detterer, Achiel Colpaert, Piet Wambacq
First: 2026-01-16T16:55:36+00:00 · Latest: 2026-01-16T16:55:36+00:00
Abstract
Deep Differentiable Logic Gate Networks (LGNs) and Lookup Table Networks (LUTNs) are demonstrated to be suitable for the automatic classification of electrocardiograms (ECGs) using the inter-patient paradigm. The methods are benchmarked using the MIT-BIH arrhythmia data set, achieving up to 94.28% accuracy and a $jκ$ index of 0.683 on a four-class classification problem. Our models use between 2.89k and 6.17k FLOPs, including preprocessing and readout, which is three to six orders of magnitude fewer than SOTA methods. A novel preprocessing method is utilized that attains superior performance compared to existing methods for both the mixed-patient and inter-patient paradigms. A novel method for training the Lookup Tables (LUTs) in LUTNs is also devised that uses the Boolean equation of a multiplexer (MUX). Rate coding was utilized for the first time in these LGNs and LUTNs, enhancing the performance of LGNs. It is also the first time that LGNs and LUTNs have been benchmarked on the MIT-BIH arrhythmia dataset using the inter-patient paradigm. Using an Artix 7 FPGA, between 2000 and 2990 LUTs were needed, and between 5 and 7 mW (i.e. 50 pJ to 70 pJ per inference) was estimated for running these models. The performance in terms of both accuracy and $jκ$ index is significantly higher compared to previous LGN results. These positive results suggest that one can utilize LGNs and LUTNs for the detection of arrhythmias at extremely low power and high speeds in heart implants or wearable devices, even for patients not included in the training set.
中文标题/摘要
标题:基于LGNs和LUTNs的跨患者心电图心律失常分类
深度可微逻辑门网络(LGNs)和查找表网络(LUTNs)被证明适用于使用跨患者范式自动分类心电图(ECGs)。该方法使用MIT-BIH心律失常数据集进行基准测试,在四类分类问题上达到94.28%的准确率和0.683的$jκ$指数。我们的模型包括预处理和读取在内的计算量在2890到6170 FLOPs之间,比当前最佳方法低三个到六个数量级。我们使用了一种新型预处理方法,该方法在混合患者和跨患者范式中均表现出优于现有方法的性能。此外,我们还提出了一种用于训练LUTNs中的查找表(LUTs)的新方法,该方法使用多路复用器(MUX)的布尔方程。此外,这是首次在LGNs和LUTNs中使用速率编码,提高了LGNs的性能。此外,这是首次使用跨患者范式在MIT-BIH心律失常数据集上对LGNs和LUTNs进行基准测试。使用Artix 7 FPGA,需要2000到2990个LUT,并估计每推理需要5到7毫瓦(即每推理50皮焦到70皮焦)。在准确率和$jκ$-指数方面,性能显著高于之前的LGN结果。这些积极的结果表明,可以利用LGNs和LUTNs在心脏植入物或可穿戴设备中以极低的功耗和高速度检测心律失常,即使对于未包含在训练集中的患者也是如此。
Summary / 总结
The study demonstrates the effectiveness of Deep Differentiable Logic Gate Networks (LGNs) and Lookup Table Networks (LUTNs) for inter-patient ECG arrhythmia classification using the MIT-BIH arrhythmia dataset. The models achieve up to 94.28% accuracy and a $jκ$ index of 0.683 at a computational cost of only 2.89k to 6.17k FLOPs, three to six orders of magnitude fewer than SOTA methods. A novel preprocessing method and a new training method for LUTs based on the Boolean equation of a multiplexer are introduced. Rate coding is used for the first time in LGNs and LUTNs and enhances the performance of LGNs. Accuracy and $jκ$ index are significantly higher than in previous LGN results, making the models suitable for low-power, high-speed arrhythmia detection in heart implants or wearable devices.
研究展示了Deep Differentiable Logic Gate Networks (LGNs)和Lookup Table Networks (LUTNs)在MIT-BIH数据集上用于跨患者心电图(ECG)心律失常分类的有效性,最高准确率达到94.28%,$jκ$指数为0.683。模型的计算量在2.89k到6.17k FLOPs之间,远低于当前最先进的方法。新颖的预处理和训练方法提升了性能,模型在准确率和$jκ$-指数方面优于之前的LGN结果,表明这些模型在心脏植入设备或可穿戴设备中进行低功耗心律失常检测的潜力。
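The idea of expressing a lookup table through the Boolean equation of a multiplexer can be illustrated with a small sketch. It assumes a real-valued relaxation of the 2:1 MUX equation, out = (NOT s AND a) OR (s AND b), written as (1 - s) * a + s * b. This relaxation, the function names, and the 2-input table are illustrative assumptions, not the training procedure from the paper.

```python
def mux(select, a, b):
    """Relaxed 2:1 multiplexer: returns a when select is 0 and b when select is 1.
    For values in [0, 1] it interpolates, which keeps the expression differentiable."""
    return (1.0 - select) * a + select * b


def lut2(table, x0, x1):
    """2-input lookup table built as a tree of multiplexers.
    `table[i]` is the output for the input index i = x0 + 2 * x1."""
    low = mux(x0, table[0], table[1])   # entries used when x1 = 0
    high = mux(x0, table[2], table[3])  # entries used when x1 = 1
    return mux(x1, low, high)


# A table that encodes XOR. During training, entries could be real-valued.
xor_table = [0.0, 1.0, 1.0, 0.0]
outputs = [lut2(xor_table, x0, x1) for x1 in (0, 1) for x0 in (0, 1)]
```

For binary inputs the tree reproduces the table exactly. For inputs strictly between 0 and 1 it yields a smooth blend of the entries, which is what would allow gradient-based updates of the table under this assumption.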
Entropy Production in Machine Learning Under Fokker-Planck Probability Flow
Authors: Lennon Shikhman
First: 2026-01-02T04:01:57+00:00 · Latest: 2026-01-16T16:53:45+00:00
Comments: 10 pages, 4 figures, 1 table
Abstract
Machine learning models deployed in nonstationary environments inevitably experience performance degradation due to data drift. While numerous drift detection heuristics exist, most lack a dynamical interpretation and provide limited guidance on how retraining decisions should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium statistical physics. Interpreting drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using relative entropy and show that its time derivative admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. Guided by this theory, we implement an entropy-triggered retraining policy using an exponentially weighted moving-average (EWMA) control statistic applied to a streaming kernel density estimator of the Kullback-Leibler divergence. We evaluate this approach across multiple nonstationary data streams. In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by one to two orders of magnitude. However, in a challenging biomedical ECG setting, the entropy-based trigger underperforms the maximum-frequency baseline, highlighting limitations of feature-space entropy monitoring under complex label-conditional drift.
中文标题/摘要
标题:机器学习在非稳态环境中马尔可夫过程概率流中的熵生产
部署在非稳态环境中的机器学习模型不可避免地会因数据漂移而性能下降。虽然存在许多漂移检测启发式方法,但大多数缺乏动力学解释,且对重新训练决策如何平衡运营成本提供有限的指导。在本文中,我们提出了一种基于非平衡统计物理的熵为基础的重新训练框架。将漂移解释为由Fokker-Planck方程控制的概率流,我们使用相对熵量化模型与数据的不匹配,并表明其时间导数可以分解为一个由概率电流驱动的非负熵生产项。受该理论的指导,我们使用指数加权移动平均(EWMA)控制统计量应用于Kullback-Leibler散度的流式核密度估计器,实现了一个熵触发的重新训练策略。我们在多个非稳态数据流中评估了该方法。在合成、金融和网络流量领域,基于熵的重新训练在预测性能上与频繁重新训练相当,同时将重新训练频率降低了1到2个数量级。然而,在具有挑战性的生物医学ECG设置中,基于熵的触发器表现不如最大频率基线,这突显了在复杂标签条件漂移下特征空间熵监控的局限性。
Summary / 总结
This work addresses the challenge of performance degradation in machine learning models due to data drift by proposing an entropy-based retraining framework. The method interprets drift as probability flow governed by a Fokker-Planck equation and uses relative entropy to quantify model-data mismatch. The key experimental finding is that entropy-based retraining achieves comparable predictive performance to frequent retraining but with significantly reduced retraining frequency in synthetic, financial, and web-traffic domains, though it underperforms in a biomedical ECG setting due to complex label-conditional drift.
本文提出了一种基于熵的重训练框架,用于非平稳环境下的机器学习模型。通过将数据漂移解释为由Fokker-Planck方程控制的概率流,作者使用相对熵量化模型与数据的不匹配,并展示了其时间导数可以分解为一个非负的熵生产项。他们使用指数加权移动平均控制统计量实现熵触发的重训练策略,并在多个领域进行了评估。该方法在各种领域中实现了与频繁重训练相当的预测性能,但因复杂标签条件性漂移而在生物医学ECG场景中表现不佳。
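The entropy-triggered retraining policy can be sketched as an EWMA control statistic over a stream of divergence estimates. In this simplified illustration the divergence values are given directly rather than produced by a streaming kernel density estimator. The smoothing factor, the threshold, and the reset after each trigger are all assumptions made for the example.

```python
def ewma_triggers(divergences, lam=0.2, threshold=0.5):
    """Return the time indices at which retraining would be triggered.

    z_t = lam * d_t + (1 - lam) * z_{t-1}; a trigger fires when z_t exceeds
    `threshold`, after which the statistic is reset (an assumption here).
    """
    z = 0.0
    triggers = []
    for t, d in enumerate(divergences):
        z = lam * d + (1.0 - lam) * z
        if z > threshold:
            triggers.append(t)
            z = 0.0  # model retrained, so the mismatch statistic starts over
    return triggers


# Hypothetical stream: no drift for five steps, then sustained drift.
stream = [0.0] * 5 + [1.0] * 5
retrain_at = ewma_triggers(stream, lam=0.5, threshold=0.6)
```

Smoothing means a single noisy spike does not trigger retraining, while sustained drift does. That is one way to trade retraining frequency against responsiveness.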
The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents
Authors: Ziyu Wang, Chenyuan Liu, Yushun Xiang, Runhao Zhang, Qingbo Hao, Hongliang Lu, Houyu Chen, Zhizhong Feng, Kaiyue Zheng, Dehao Ye, Xianchao Zeng, Xinyu Zhou, Boran Wen, Jiaxin Li, Mingyu Zhang, Kecheng Zheng, Qian Zhu, Ran Cheng, Yong-Lu Li
First: 2026-01-16T16:42:05+00:00 · Latest: 2026-01-16T16:42:05+00:00
Abstract
Recently, with the rapid development of robot learning and imitation learning, numerous datasets and methods have emerged. However, these datasets and their task designs often lack systematic consideration and principles. This raises important questions: Do the current datasets and task designs truly advance the capabilities of robotic agents? Do evaluations on a few common tasks accurately reflect the differentiated performance of various methods proposed by different teams and evaluated on different tasks? To address these issues, we introduce the Great March 100 (GM-100) as the first step towards a robot learning Olympics. GM-100 consists of 100 carefully designed tasks that cover a wide range of interactions and long-tail behaviors, aiming to provide a diverse and challenging set of tasks to comprehensively evaluate the capabilities of robotic agents and promote diversity and complexity in robot dataset task designs. These tasks are developed through systematic analysis and expansion of existing task designs, combined with insights from human-object interaction primitives and object affordances. We collect a large amount of trajectory data on different robotic platforms and evaluate several baseline models. Experimental results demonstrate that the GM-100 tasks are 1) feasible to execute and 2) sufficiently challenging to effectively differentiate the performance of current VLA models. Our data and code are available at https://rhos.ai/research/gm-100.
中文标题/摘要
标题:伟大的3月100:100项细致任务评估具身AI代理
近年来,随着机器人学习和模仿学习的快速发展,出现了大量数据集和方法。然而,这些数据集及其任务设计往往缺乏系统的考虑和原则。这提出了重要问题:当前的数据集和任务设计是否真正推动了机器人代理的能力?在少数几个常见任务上的评估是否能准确反映不同团队提出的不同方法在不同任务上的差异化表现?为了解决这些问题,我们引入了伟大的3月100(GM-100),作为迈向机器人学习奥运会的第一步。GM-100 包含100个精心设计的任务,涵盖了广泛的交互和长尾行为,旨在提供一个多样且具有挑战性的任务集,全面评估机器人代理的能力,并促进机器人数据集任务设计的多样性和复杂性。这些任务通过系统分析和扩展现有任务设计,并结合人类物体交互基本原理和物体功能的知识开发而成。我们在不同的机器人平台上收集了大量的轨迹数据,并评估了几种基线模型。实验结果表明,GM-100 任务是1)可执行的,2)足够具有挑战性,能够有效区分当前VLA模型的性能。我们的数据和代码可在https://rhos.ai/research/gm-100/获取。
Summary / 总结
The research aims to address the lack of systematic consideration in current robotic datasets and task designs. It introduces the Great March 100 (GM-100), a set of 100 detailed tasks designed to comprehensively evaluate robotic agents' capabilities. The tasks cover a wide range of interactions and long-tail behaviors, and are evaluated through trajectory data collection and baseline model testing. The results show that GM-100 tasks are feasible and sufficiently challenging to differentiate the performance of current VLA models.
研究旨在解决当前用于评估机器人代理的数据库缺乏系统性考虑的问题。引入了Great March 100 (GM-100),包含100个详细任务,涵盖了各种交互和罕见行为,以全面评估机器人代理的能力。这些任务通过系统分析和人类物体交互的见解来开发。实验表明,GM-100既可行又足够具有挑战性,能够区分当前模型的表现。
Do Sparse Autoencoders Identify Reasoning Features in Language Models?
Authors: George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
First: 2026-01-09T09:54:36+00:00 · Latest: 2026-01-16T16:27:07+00:00
Abstract
We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
中文标题/摘要
标题:稀疏自编码器能否识别语言模型中的推理特征?
我们探讨了稀疏自编码器(SAEs)是否能够识别大型语言模型(LLMs)中的真正推理特征。通过简单的理论分析,我们展示了$\ell_1$正则化SAEs本质上倾向于低维模式,这为为什么浅层语言线索可能优先于分布式推理行为被捕捉提供了机制解释。受这种偏差的启发,我们引入了一种基于因果令牌注入和LLM引导反证的评估框架,以测试特征激活是否反映推理过程或表面语言相关性。在涵盖多个模型家族、层和推理数据集的20种配置中,我们发现通过对比方法识别的特征对令牌级干预非常敏感,当少量相关令牌被注入非推理文本时,45%到90%的特征会被激活。对于剩余的特征,LLM引导反证始终产生激活该特征的非推理输入和不激活该特征的推理输入,没有分析的特征满足我们对真正推理行为的标准。引导这些特征在基准性能上没有改进。总体而言,我们的结果表明,当前对比方法识别的SAE特征主要捕捉推理的语言相关性而非实际的推理计算。代码可在https://github.com/GeorgeMLP/reasoning-probing获取。
Summary / 总结
The study investigates whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Through theoretical analysis and a falsification-oriented evaluation framework combining causal token injection and LLM-guided falsification, the research finds that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a few associated tokens are injected into non-reasoning text. LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, indicating that SAE features primarily capture linguistic correlates rather than genuine reasoning computations. No feature satisfied the criteria for genuine reasoning behavior, and steering these features did not improve benchmark performance.
研究探讨了稀疏自编码器(SAEs)是否能够识别大型语言模型(LLMs)中的真正推理特征。通过理论分析和结合因果令牌注入及LLM引导的反证框架的评估方法,研究发现,通过对比方法识别的特征对令牌级干预高度敏感,当注入少量相关令牌到非推理文本中时,45%到90%的特征会被激活。LLM引导的反证方法始终能够生成激活特征的非推理输入和不激活特征的推理输入,表明SAE特征主要捕捉的是推理的语义关联而非真正的推理计算。没有一个特征满足真正推理行为的标准,且引导这些特征也没有提升基准性能。
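The token-injection test described in the abstract above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the authors' code: the two "features" are hand-written stand-ins for SAE features, and `inject_tokens`, `is_token_driven`, the threshold, and the cue-token lists are hypothetical names and values chosen for illustration only.

```python
import random

def inject_tokens(text_tokens, cue_tokens, k, rng):
    """Insert k distinct cue tokens at random positions in a token list."""
    out = list(text_tokens)
    for tok in rng.sample(cue_tokens, k):
        out.insert(rng.randrange(len(out) + 1), tok)
    return out

def is_token_driven(activation_fn, cue_tokens, baseline_texts, k, threshold, rng):
    """True if injection flips the feature on for most non-reasoning texts."""
    flips = 0
    for text in baseline_texts:
        before = activation_fn(text)
        after = activation_fn(inject_tokens(text, cue_tokens, k, rng))
        if before <= threshold < after:
            flips += 1
    return flips > len(baseline_texts) / 2

# Toy stand-ins for SAE features (assumptions, not real features).
CUES = ["therefore", "thus", "hence"]
lexical_feature = lambda toks: float(any(t in CUES for t in toks))  # fires on cue words
length_feature = lambda toks: float(len(toks) > 50)                 # ignores cue words

features = {
    "lexical": (lexical_feature, CUES),
    "length": (length_feature, ["step", "because", "so"]),
}
non_reasoning_texts = [
    "the cat sat on the mat".split(),
    "rain is expected in the north tomorrow".split(),
    "she bought apples and bread".split(),
]

rng = random.Random(0)
driven = [
    name for name, (fn, cues) in features.items()
    if is_token_driven(fn, cues, non_reasoning_texts, k=2, threshold=0.5, rng=rng)
]
rate = len(driven) / len(features)  # share of features activated by injection alone
```

In this toy setup only the lexical feature flips, so `rate` plays the role of the 45% to 90% figure reported in the abstract: the share of candidate features that a handful of injected tokens can switch on in text with no reasoning content.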
Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation
Authors: Tao Tang, Shijie Xu, Jionglong Su, Zhixiang Lu
Venue: ICASSP 2026
First: 2025-07-04T13:52:16+00:00 · Latest: 2026-01-16T16:16:45+00:00
Comments: Accepted by IEEE ICASSP 2026
Abstract
The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
中文标题/摘要
标题:Causal-SAM-LLM:作为因果推理器的大型语言模型在稳健医学分割中的应用
深度学习模型在医学图像分割中的临床应用受到其无法泛化到未见领域的限制。这一问题通常源于模型学习了解剖内容与领域特定成像风格之间的虚假相关性。为克服这一根本挑战,我们提出了Causal-SAM-LLM,一种新颖的框架,将大型语言模型(LLMs)提升为因果推理器的角色。该框架基于冻结的分割一切皆可能模型(SAM)编码器,结合了两种协同创新。首先,语言对抗解耦(LAD)利用视觉-语言模型生成丰富的文本描述,以混淆图像风格。通过训练分割模型的特征与这些风格描述对比性地不同,它学习到一个不含非因果信息的表示。其次,测试时因果干预(TCI)提供了一种交互机制,其中LLM解释临床医生的自然语言命令,实时调节分割解码器的特征,实现有针对性的错误修正。我们在四个公开数据集(BTCV、CHAOS、AMOS、BraTS)组成的综合基准上进行了广泛的实证评估,评估了跨扫描仪、跨模态和跨解剖结构设置下的泛化能力。Causal-SAM-LLM 在离群值稳健性方面建立了新的基准,平均Dice分数提高了6.2个百分点,Hausdorff距离减少了15.8毫米,同时使用了不到9%的完整模型可训练参数。我们的工作为构建稳健、高效且可交互控制的医疗AI系统开辟了新途径。
Summary / 总结
Causal-SAM-LLM is a framework that uses Large Language Models (LLMs) as causal reasoners to improve the robustness of medical image segmentation models. It combines Linguistic Adversarial Disentanglement (LAD), which trains segmentation features to be contrastively dissimilar to textual descriptions of confounding imaging styles, with Test-Time Causal Intervention (TCI), which lets an LLM translate a clinician's natural language command into real-time adjustments of the decoder features. The framework sets a new state of the art in out-of-distribution robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, while using less than 9% of the full model's trainable parameters.
Causal-SAM-LLM 是一种框架,利用大型语言模型(LLMs)作为因果推理者来增强医学图像分割模型的鲁棒性。它使用 Linguistic Adversarial Disentanglement (LAD) 生成混淆图像风格的文本描述,并训练分割模型以抵御这些风格的影响。此外,Test-Time Causal Intervention (TCI) 允许根据临床医生的命令实时调整分割特征。该框架在分布外鲁棒性方面取得了显著改进,与最强基线相比,实现了最高 6.2 分点更高的 Dice 分数和 15.8 毫米更低的 Hausdorff 距离,同时仅使用了模型可训练参数的不到 9%。
SME-YOLO: A Real-Time Detector for Tiny Defect Detection on PCB Surfaces
Authors: Meng Han
First: 2026-01-16T16:14:56+00:00 · Latest: 2026-01-16T16:14:56+00:00
Abstract
Surface defects on Printed Circuit Boards (PCBs) directly compromise product reliability and safety. However, achieving high-precision detection is challenging because PCB defects are typically characterized by tiny sizes, high texture similarity, and uneven scale distributions. To address these challenges, this paper proposes a novel framework based on YOLOv11n, named SME-YOLO (Small-target Multi-scale Enhanced YOLO). First, we employ the Normalized Wasserstein Distance Loss (NWDLoss). This metric effectively mitigates the sensitivity of Intersection over Union (IoU) to positional deviations in tiny objects. Second, the original upsampling module is replaced by the Efficient Upsampling Convolution Block (EUCB). By utilizing multi-scale convolutions, the EUCB gradually recovers spatial resolution and enhances the preservation of edge and texture details for tiny defects. Finally, this paper proposes the Multi-Scale Focused Attention (MSFA) module. Tailored to the specific spatial distribution of PCB defects, this module adaptively strengthens perception within key scale intervals, achieving efficient fusion of local fine-grained features and global context information. Experimental results on the PKU-PCB dataset demonstrate that SME-YOLO achieves state-of-the-art performance. Specifically, compared to the baseline YOLOv11n, SME-YOLO improves mAP by 2.2% and Precision by 4%, validating the effectiveness of the proposed method.
中文标题/摘要
标题:SME-YOLO:PCB表面微小缺陷检测的实时检测器
PCB表面的缺陷直接损害产品的可靠性和安全性。然而,实现高精度检测极具挑战性,因为PCB缺陷通常表现为微小尺寸、高纹理相似性和不均匀的尺度分布。为应对这些挑战,本文提出了一种基于YOLOv11n的新框架,命名为SME-YOLO(Small-target Multi-scale Enhanced YOLO)。首先,我们采用归一化 Wasserstein 距离损失(NWDLoss)。该度量有效地减轻了交并比(IoU)对微小物体位置偏差的敏感性。其次,原始上采样模块被高效上采样卷积块(EUCB)所取代。通过利用多尺度卷积,EUCB逐步恢复空间分辨率并增强微小缺陷边缘和纹理细节的保留。最后,本文提出了多尺度聚焦注意力(MSFA)模块。该模块针对PCB缺陷的特定空间分布进行定制,能够适应性地加强关键尺度区间内的感知,实现局部精细特征和全局上下文信息的有效融合。在PKU-PCB数据集上的实验结果表明,SME-YOLO达到了最先进的性能。具体而言,与基线YOLOv11n相比,SME-YOLO的mAP提高了2.2%,精度提高了4%,验证了所提方法的有效性。
Summary / 总结
The paper addresses the challenge of detecting tiny defects on PCB surfaces, which is crucial for product reliability. It proposes SME-YOLO, which uses NWDLoss to reduce sensitivity to positional deviations, EUCB for multi-scale convolution to enhance edge and texture details, and MSFA for adaptive feature fusion. Experimental results show that SME-YOLO outperforms the baseline YOLOv11n, with a 2.2% improvement in mAP and 4% in Precision.
本文提出SME-YOLO以应对PCB表面微小缺陷检测的挑战,该方法使用NWDLoss减少对位置偏差的敏感性,使用EUCB进行多尺度卷积以保留边缘和纹理细节,并使用MSFA进行自适应特征融合。实验结果表明,SME-YOLO在PKU-PCB数据集上优于基线YOLOv11n,mAP提高了2.2%,精确度提高了4%。
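To make the motivation for NWDLoss concrete, the sketch below compares IoU with a Normalized Wasserstein Distance for a tiny box. It follows the commonly used formulation in which each box (cx, cy, w, h) is modelled as a 2D Gaussian; the constant `c` is dataset-dependent and its value here is an arbitrary assumption, and the exact loss used in SME-YOLO may differ.

```python
import math

def iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union

def nwd(a, b, c=12.8):
    """Normalized Wasserstein distance between boxes modelled as Gaussians.

    Each box maps to N([cx, cy], diag(w^2/4, h^2/4)); the 2-Wasserstein
    distance then reduces to a Euclidean distance on (cx, cy, w/2, h/2).
    """
    w2 = math.sqrt(
        (a[0] - b[0]) ** 2
        + (a[1] - b[1]) ** 2
        + ((a[2] - b[2]) / 2) ** 2
        + ((a[3] - b[3]) / 2) ** 2
    )
    return math.exp(-w2 / c)

def nwd_loss(a, b, c=12.8):
    """Loss that is 0 for identical boxes and grows smoothly with distance."""
    return 1.0 - nwd(a, b, c)

tiny = (10.0, 10.0, 4.0, 4.0)      # a 4x4 pixel defect
shifted = (14.0, 10.0, 4.0, 4.0)   # same box moved 4 pixels to the right

iou_value = iou(tiny, shifted)     # boxes no longer overlap
nwd_value = nwd(tiny, shifted)     # still a graded similarity
```

For the 4-pixel shift the boxes stop overlapping, so IoU gives no gradient signal about how far apart they are, whereas the NWD value remains strictly between 0 and 1 and decreases smoothly with distance.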
Factored Value Functions for Graph-Based Multi-Agent Reinforcement Learning
Authors: Ahmed Rashwan, Keith Briggs, Chris Budd, Lisa Kreusser
First: 2026-01-16T16:11:50+00:00 · Latest: 2026-01-16T16:11:50+00:00
Abstract
Credit assignment is a core challenge in multi-agent reinforcement learning (MARL), especially in large-scale systems with structured, local interactions. Graph-based Markov decision processes (GMDPs) capture such settings via an influence graph, but standard critics are poorly aligned with this structure: global value functions provide weak per-agent learning signals, while existing local constructions can be difficult to estimate and ill-behaved in infinite-horizon settings. We introduce the Diffusion Value Function (DVF), a factored value function for GMDPs that assigns to each agent a value component by diffusing rewards over the influence graph with temporal discounting and spatial attenuation. We show that DVF is well-defined, admits a Bellman fixed point, and decomposes the global discounted value via an averaging property. DVF can be used as a drop-in critic in standard RL algorithms and estimated scalably with graph neural networks. Building on DVF, we propose Diffusion A2C (DA2C) and a sparse message-passing actor, Learned DropEdge GNN (LD-GNN), for learning decentralised algorithms under communication costs. Across the firefighting benchmark and three distributed computation tasks (vector graph colouring and two transmit power optimisation problems), DA2C consistently outperforms local and global critic baselines, improving average reward by up to 11%.
中文标题/摘要
标题:基于图的多智能体强化学习中的因子价值函数
信用分配是多智能体强化学习(MARL)中的核心挑战,尤其是在具有结构化、局部交互的大规模系统中。图基化马尔可夫决策过程(GMDPs)通过影响图捕捉此类设置,但标准评论者与这种结构不匹配:全局价值函数为每个智能体提供的学习信号较弱,而现有的局部构造在无限期设置中难以估计且行为不佳。我们引入了扩散价值函数(DVF),这是一种针对GMDPs的因子价值函数,通过在影响图上随时间折扣和空间衰减扩散奖励,为每个智能体分配一个价值分量。我们证明DVF是明确定义的,具有贝尔曼不动点,并通过平均性质分解全局折扣价值。DVF可以作为标准RL算法中的即插即用评论者使用,并可通过图神经网络高效估计。基于DVF,我们提出了扩散A2C(DA2C)和稀疏消息传递演员,学习通信成本下的去中心化算法,Learned DropEdge GNN(LD-GNN)。在灭火基准测试以及三个分布式计算任务(向量图着色和两个传输功率优化问题)中,DA2C始终优于局部和全局评论者基线,平均奖励提高高达11%。
Summary / 总结
This paper addresses the challenge of credit assignment in multi-agent reinforcement learning (MARL) by introducing the Diffusion Value Function (DVF), which assigns value components to agents based on the influence graph. DVF uses temporal discounting and spatial attenuation to diffuse rewards, providing a well-defined and scalable way to estimate values. The proposed Diffusion A2C (DA2C) algorithm, which uses DVF, outperforms both local and global critic baselines in various MARL benchmarks, achieving up to 11% higher average reward.
论文通过引入扩散价值函数(DVF)解决了多智能体强化学习(MARL)中的信用分配问题,DVF 根据影响图将价值组件分配给每个智能体,改进了现有方法,提供了更好的学习信号和可扩展性。作者提出了基于扩散的 A2C(DA2C)和稀疏消息传递演员,LD-GNN,用于在通信成本下的去中心化学习。实验结果显示,DA2C 在各种任务中优于局部和全局批评基线,平均奖励提高了最多 11%。
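The idea of diffusing rewards over an influence graph with temporal discounting and spatial attenuation can be illustrated with a toy computation. The form used below is an assumption made purely for illustration, not the paper's definition of DVF: rewards are spread with a symmetric, doubly stochastic mixing matrix on a ring, the reward vector is held fixed over time, and the names `ring_mixing_matrix` and `diffused_values` are hypothetical.

```python
def ring_mixing_matrix(n, alpha):
    """Doubly stochastic matrix on a ring: keep 1 - alpha, give alpha/2 to each neighbour."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] += 1.0 - alpha
        P[i][(i - 1) % n] += alpha / 2
        P[i][(i + 1) % n] += alpha / 2
    return P

def matvec(P, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def diffused_values(P, rewards, gamma, horizon):
    """Per-agent value: sum over t of gamma^t * (P^t r)_i (assumed toy form)."""
    values = [0.0] * len(rewards)
    spread = list(rewards)
    for t in range(horizon):
        for i, s in enumerate(spread):
            values[i] += (gamma ** t) * s
        spread = matvec(P, spread)  # one more diffusion step over the graph
    return values

rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # only agent 0 is rewarded
gamma, horizon = 0.9, 50
P = ring_mixing_matrix(len(rewards), alpha=0.5)

values = diffused_values(P, rewards, gamma, horizon)
mean_value = sum(values) / len(values)
# Discounted sum of the average reward over the same horizon.
expected_mean = (sum(rewards) / len(rewards)) * (1 - gamma ** horizon) / (1 - gamma)
```

Because the mixing matrix preserves the total reward at every step, the average of the per-agent components equals the discounted value of the average reward in this toy model, which mirrors in spirit the averaging property mentioned in the abstract. Agents closer to the rewarded node receive larger components than distant ones.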
Wetland mapping from sparse annotations with satellite image time series and temporal-aware segment anything model
Authors: Shuai Yuan, Tianwu Lin, Shuang Chen, Yu Xia, Peng Qin, Xiangyu Liu, Xiaoqing Xu, Nan Xu, Hongsheng Zhang, Jie Wang, Peng Gong
First: 2026-01-16T16:10:32+00:00 · Latest: 2026-01-16T16:10:32+00:00
Abstract
Accurate wetland mapping is essential for ecosystem monitoring, yet dense pixel-level annotation is prohibitively expensive, so practical applications usually rely on sparse point labels, under which existing deep learning models perform poorly. Strong seasonal and inter-annual wetland dynamics further render single-date imagery inadequate and lead to significant mapping errors. Although foundation models such as SAM show promising generalization from point prompts, they are inherently designed for static images and fail to model temporal information, resulting in fragmented masks in heterogeneous wetlands. To overcome these limitations, we propose WetSAM, a SAM-based framework that integrates satellite image time series for wetland mapping from sparse point supervision through a dual-branch design. A temporally prompted branch extends SAM with hierarchical adapters and dynamic temporal aggregation to disentangle wetland characteristics from phenological variability, and a spatial branch employs a temporally constrained region-growing strategy to generate reliable dense pseudo-labels, while a bidirectional consistency regularization jointly optimizes both branches. Extensive experiments across eight global regions of approximately 5,000 km² each demonstrate that WetSAM substantially outperforms state-of-the-art methods, achieving an average F1-score of 85.58% and delivering accurate and structurally consistent wetland segmentation with minimal labeling effort. These results highlight its strong generalization capability and potential for scalable, low-cost, high-resolution wetland mapping.
中文标题/摘要
标题:基于稀疏注释的卫星图像时间序列和时序感知的分割一切模型的湿地制图
准确的湿地制图对于生态系统监测至关重要,但密集的像素级注释成本高昂,实际应用通常依赖于稀疏的点标签,在这种情况下,现有的深度学习模型表现不佳,而强烈的季节性和年际湿地动态进一步使单日期影像不足,导致显著的制图误差;尽管基础模型如SAM从点提示中展示了良好的泛化能力,但它们本质上是为静态图像设计的,无法建模时间信息,导致异质湿地中分割不连续。为克服这些限制,我们提出了一种基于SAM的框架WetSAM,该框架通过双分支设计结合卫星图像时间序列,从稀疏点监督中进行湿地制图,其中时间提示分支通过层次适配器和动态时间聚合扩展SAM,以从物候变异中分离湿地特征,空间分支采用时间约束的区域生长策略生成可靠的密集伪标签,而双向一致性正则化联合优化两个分支。在八个大约5000平方公里的全球区域中进行的广泛实验表明,WetSAM显著优于最先进的方法,平均F1分数为85.58%,实现了准确且结构一致的湿地分割,同时最小化了标注工作量,突显了其强大的泛化能力和大规模、低成本、高分辨率湿地制图的潜力。
Summary / 总结
The research aims to improve wetland mapping accuracy using satellite image time series and a temporal-aware segment anything model (WetSAM) to address the limitations of sparse annotations and seasonal dynamics. WetSAM integrates a temporally prompted branch and a spatial branch, with a bidirectional consistency regularization to optimize both branches. Experiments show that WetSAM outperforms existing methods, achieving an average F1-score of 85.58% and providing accurate and structurally consistent wetland segmentation with minimal labeling effort.
研究旨在利用卫星图像和稀疏标注提高湿地测绘的准确性。提出了一种名为WetSAM的框架,结合时间序列卫星图像和时间感知的分割任何模型。该模型采用双分支设计,处理季节性变化并生成可靠的密集伪标签,平均F1得分为85.58%,在多个地区显著优于现有方法,且标注工作量小。
Spectral invariance and maximality properties of the frequency spectrum of quantum neural networks
Authors: Patrick Holzer, Ivica Turkalj
First: 2024-02-22T13:04:50+00:00 · Latest: 2026-01-16T16:09:12+00:00
Abstract
We analyze the frequency spectrum of Quantum Neural Networks (QNNs) using Minkowski sums, which yields a compact algebraic description and permits explicit computation. Using this description, we prove several maximality results for broad classes of QNN architectures. Under some mild technical conditions we establish a bijection between classes of models with the same area $A:=R\cdot L$ that preserves the frequency spectrum, where $R$ denotes the number of qubits and $L$ the number of layers, which we consequently call spectral invariance under area-preserving transformations. With this we explain the symmetry in $R$ and $L$ in the results often observed in the literature and show that the maximal frequency spectrum depends only on the area $A=RL$ and not on the individual values of $R$ and $L$. Moreover, we collect and extend existing results and specify the maximum possible frequency spectrum of a QNN with an arbitrary number of layers as a function of the spectrum of its generators. In the case of arbitrary dimensional generators, where our two introduced notions of maximality differ, we extend existing Golomb ruler based results and introduce a second novel approach based on a variation of the turnpike problem, which we call the relaxed turnpike problem. We clarify comprehensively how the generators of a QNN must be chosen in order to obtain a maximal frequency spectrum for a given area $A$, thereby contributing to a deeper theoretical understanding. However, our numerical experiments show that trainability depends not only on $A = RL$, but also on the choice of $(R,L)$, so that knowledge of the maximum frequency spectrum alone is not sufficient to ensure good trainability.
中文标题/摘要
标题:量子神经网络频谱的谱不变性和最大化性质
我们使用闵可夫斯基和分析量子神经网络(QNN)的频谱,这提供了一个紧凑的代数描述并允许显式计算。利用这种描述,我们证明了广泛类别的QNN架构的若干最大化结果。在一些温和的技术条件下,我们建立了具有相同面积$A:=R\cdot L$的模型类之间保持频谱不变的双射,其中$R$表示量子比特数,$L$表示层数,我们将其称为面积保持变换下的谱不变性。我们由此解释了文献结果中经常观察到的$R$和$L$的对称性,并表明最大频谱仅取决于面积$A=RL$,而不取决于$R$和$L$的个别值。此外,我们收集并扩展了现有结果,并给出了具有任意层数的QNN的最大可能频谱,作为其生成器频谱的函数。在生成器具有任意维数的情况下,我们引入的两种最大性概念不同,我们扩展了现有的基于Golomb尺(Golomb ruler)的结果,并引入了第二种基于收费公路问题(turnpike problem)变体的新方法,我们称之为松弛收费公路问题。我们全面澄清了QNN的生成器必须如何选择,以获得给定面积$A$的最大频谱,从而为更深入的理论理解做出贡献。然而,我们的数值实验表明,可训练性不仅取决于$A=RL$,还取决于$(R,L)$的选择,因此仅知道最大频谱不足以确保良好的可训练性。
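The Minkowski-sum description of the frequency spectrum can be tried out on a simple case. The sketch below assumes a deliberately restricted architecture that is not taken from the paper: every qubit in every layer carries exactly one encoding gate of the form exp(-i x H) with the same generator H, so each gate contributes the set of eigenvalue differences of H and the full spectrum is the Minkowski sum over all R·L gates. The function names are hypothetical.

```python
from fractions import Fraction

def minkowski_sum(a, b):
    """All pairwise sums of elements from two frequency sets."""
    return {x + y for x in a for y in b}

def gate_spectrum(eigenvalues):
    """Frequencies contributed by one encoding gate: all eigenvalue differences."""
    return {p - q for p in eigenvalues for q in eigenvalues}

def qnn_spectrum(eigenvalues, n_qubits, n_layers):
    """Spectrum for R qubits and L layers with one identical encoding gate per slot."""
    spectrum = {Fraction(0)}
    per_gate = gate_spectrum(eigenvalues)
    for _ in range(n_qubits * n_layers):
        spectrum = minkowski_sum(spectrum, per_gate)
    return spectrum

# Generator Z/2 has eigenvalues +1/2 and -1/2, so one gate yields {-1, 0, 1}.
pauli_half = [Fraction(1, 2), Fraction(-1, 2)]

wide = qnn_spectrum(pauli_half, n_qubits=3, n_layers=2)   # R = 3, L = 2
deep = qnn_spectrum(pauli_half, n_qubits=2, n_layers=3)   # R = 2, L = 3
```

Both configurations have area A = RL = 6 and produce the same set of integer frequencies from -6 to 6, a small instance of the symmetry in R and L that the abstract attributes to area-preserving transformations. Exact fractions are used so that set comparisons are not affected by floating-point rounding.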
SUG-Occ: An Explicit Semantics and Uncertainty Guided Sparse Learning Framework for Real-Time 3D Occupancy Prediction
Authors: Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Nanren Bao, Bo Qian, Hao Si, Manabu Tsukada
First: 2026-01-16T16:07:38+00:00 · Latest: 2026-01-16T16:07:38+00:00
Abstract
As autonomous driving moves toward full scene understanding, 3D semantic occupancy prediction has emerged as a crucial perception task, offering voxel-level semantics beyond traditional detection and segmentation paradigms. However, such a refined representation for scene understanding incurs prohibitive computation and memory overhead, posing a major barrier to practical real-time deployment. To address this, we propose SUG-Occ, an explicit Semantics and Uncertainty Guided Sparse Learning Enabled 3D Occupancy Prediction Framework, which exploits the inherent sparsity of 3D scenes to reduce redundant computation while maintaining geometric and semantic completeness. Specifically, we first utilize semantic and uncertainty priors to suppress projections from free space during view transformation while employing an explicit unsigned distance encoding to enhance geometric consistency, producing a structurally consistent sparse 3D representation. Secondly, we design a cascade sparse completion module via hyper cross sparse convolution and generative upsampling to enable efficient coarse-to-fine reasoning. Finally, we devise an object contextual representation (OCR) based mask decoder that aggregates global semantic context from sparse features and refines voxel-wise predictions via lightweight query-context interactions, avoiding expensive attention operations over volumetric features. Extensive experiments on the SemanticKITTI benchmark demonstrate that the proposed approach outperforms the baselines, achieving a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
中文标题/摘要
标题:SUG-Occ:一种显式语义和不确定性引导的稀疏学习框架,用于实时3D占用预测
随着自动驾驶向全场景理解迈进,3D语义占用预测已成为一项关键的感知任务,提供了超越传统检测和分割范式的体素级语义。然而,这种精细的表示形式对场景理解带来的计算和内存开销是巨大的,成为实际实时部署的主要障碍。为了解决这个问题,我们提出了SUG-Occ,一种显式语义和不确定性引导的稀疏学习使能的3D占用预测框架,利用3D场景的固有稀疏性来减少冗余计算,同时保持几何和语义完整性。具体来说,我们首先利用语义和不确定性先验在视图变换过程中抑制自由空间的投影,同时采用显式的无符号距离编码增强几何一致性,生成一个结构上一致的稀疏3D表示。其次,我们通过超交叉稀疏卷积和生成上采样设计了一个级联稀疏完成模块,以实现高效的粗到细推理。最后,我们设计了一个基于对象上下文表示(OCR)的掩码解码器,从稀疏特征中聚合全局语义上下文,并通过轻量级查询-上下文交互细化体素级预测,避免在体素特征上进行昂贵的注意力操作。在SemanticKITTI基准上的广泛实验表明,所提出的方法优于基线方法,在准确率上提高了7.34%,效率提高了57.8%。
Summary / 总结
The research aims to address the computational and memory challenges of real-time 3D semantic occupancy prediction in autonomous driving. SUG-Occ, a sparse learning framework, uses semantic and uncertainty priors to reduce redundant computations while maintaining geometric and semantic accuracy. The method includes a cascade sparse completion module and an object contextual representation-based mask decoder, which together enhance geometric consistency and refine predictions efficiently. Experiments show SUG-Occ outperforms baselines with a 7.34% improvement in accuracy and a 57.8% gain in efficiency.
研究旨在解决自动驾驶中实时3D语义占用预测的计算和内存挑战。SUG-Occ框架利用语义和不确定性先验减少不必要的计算,同时保持几何和语义准确性。它通过使用无符号距离编码和级联稀疏完成模块以及轻量级掩码解码器进行高效推理。实验表明,SUG-Occ在准确性和效率上分别比基线提高了7.34%和57.8%。
Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning
Authors: Haomiao Tang, Jinpeng Wang, Minyi Zhao, Guanghao Meng, Ruisheng Luo, Long Chen, Shu-Tao Xia
Venue: AAAI 2026 oral presentation
First: 2026-01-16T16:05:49+00:00 · Latest: 2026-01-16T16:05:49+00:00
Comments: Accepted for publication and oral presentation at AAAI 2026
Abstract
Composed Image Retrieval (CIR) enables image search by combining a reference image with modification text. Intrinsic noise in CIR triplets incurs intrinsic uncertainty and threatens the model's robustness. Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatment of queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. HUG utilizes a fine-grained probabilistic learning framework, where queries and targets are represented by Gaussian embeddings that capture detailed concepts and uncertainties. We customize heterogeneous uncertainty estimations for multi-modal queries and uni-modal targets. Given a query, we capture uncertainties not only regarding uni-modal content quality but also multi-modal coordination, followed by a provable dynamic weighting mechanism to derive comprehensive query uncertainty. We further design uncertainty-guided objectives, including query-target holistic contrast and fine-grained contrasts with comprehensive negative sampling strategies, which effectively enhance discriminative learning. Experiments on benchmarks demonstrate HUG's effectiveness beyond state-of-the-art baselines, with faithful analysis justifying the technical contributions.
中文标题/摘要
标题:异质不确定性引导的细粒度概率学习组合图像检索
组合图像检索(CIR)通过结合参考图像和修改文本实现图像搜索。CIR三元组中的固有噪声导致固有不确定性,威胁模型的鲁棒性。概率学习方法在解决此类问题方面显示出潜力;然而,它们由于实例级的整体建模和查询和目标的同质处理而无法满足CIR的需求。本文引入了一种异质不确定性引导(HUG)范式来克服这些限制。HUG利用细粒度的概率学习框架,其中查询和目标由高斯嵌入表示,捕捉详细的概念和不确定性。我们为多模态查询和单模态目标定制了异质不确定性估计。给定一个查询,我们不仅捕捉单模态内容质量的不确定性,还捕捉多模态协调的不确定性,然后通过可证明的动态加权机制推导出综合查询不确定性。我们进一步设计了不确定性引导的目标,包括查询-目标整体对比和细粒度对比,以及全面的负样本策略,这有效地增强了区分学习。基准上的实验表明HUG超越了最先进的基线方法,忠实的分析证明了技术贡献。
Summary / 总结
This paper addresses the challenges in Composed Image Retrieval (CIR) by introducing a Heterogeneous Uncertainty-Guided (HUG) paradigm. The method uses a fine-grained probabilistic learning framework with Gaussian embeddings to capture detailed concepts and uncertainties for queries and targets. It customizes uncertainty estimations for multi-modal queries and uni-modal targets, and employs a dynamic weighting mechanism to derive comprehensive query uncertainty. The approach also includes uncertainty-guided objectives for enhancing discriminative learning. Experiments show HUG outperforms existing methods on benchmarks, validating its technical contributions.
本文通过引入异质不确定性引导(HUG)范式解决了组合图像检索(CIR)中的挑战。HUG采用细粒度的概率学习框架,使用高斯嵌入捕捉查询和目标的详细概念和不确定性。该方法针对多模态查询和单模态目标定制了异质不确定性估计,并使用动态加权机制来推导全面的查询不确定性。此外,它还设计了不确定性引导的目标来增强区分性学习。实验表明HUG在基准测试中优于现有方法,验证了其技术贡献。
Theorem Prover as a Judge for Synthetic Data Generation
Authors: Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
Venue: ACL 2025
First: 2025-02-18T18:57:09+00:00 · Latest: 2026-01-16T15:59:43+00:00
Abstract
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmark results with only 3,508 samples, achieving a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
中文标题/摘要
标题:定理证明器作为合成数据生成的裁判
由于其在增强大型语言模型(LLM)数学能力方面的潜力,数学推理中合成数据的需求增加。然而,确保中间推理步骤的有效性仍然是一个重大挑战,影响数据质量。虽然通过定理证明器的形式验证可以有效验证LLM推理,但数学证明的自动形式化仍然容易出错。为此,我们引入了迭代自动形式化的方法,该方法通过迭代细化定理证明器的形式化来减少错误,从而将Lean证明器的执行率从60%提高到87%。在此基础上,我们引入了定理证明器作为裁判(TP-as-a-Judge)的方法,该方法利用定理证明器的形式化来严格评估LLM的中间推理,有效地将自动形式化与合成数据生成相结合。最后,我们提出了基于定理证明器反馈的强化学习框架(RLTPF),该框架用定理证明器反馈替换人类注释,以替代基于人类反馈的强化学习(RLHF)。在多个LLM上应用TP-as-a-Judge和RLTPF,仅使用3,508个样本即可提高基准测试,Mistral-7B在MultiArith上的准确率提高了5.56%,Llama-2-7B在SVAMP上的准确率提高了6.00%,Llama-3.1-8B在AQUA上的准确率提高了3.55%。
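The iterative autoformalisation loop can be sketched as a generic refine-until-it-checks procedure. This is a minimal illustration under stated assumptions, not the authors' pipeline: `formalise` and `check` are hypothetical callables standing in for an LLM call and a Lean invocation, and the stubs at the bottom only mimic their behaviour with string checks.

```python
from typing import Callable, Optional, Tuple

def iterative_autoformalise(
    problem: str,
    formalise: Callable[[str, Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Re-formalise a problem until the prover accepts it or rounds run out.

    Returns the accepted formalisation (or None) and the number of rounds used.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = formalise(problem, feedback)
        ok, message = check(candidate)
        if ok:
            return candidate, round_no
        feedback = message  # feed the prover's error back into the next attempt
    return None, max_rounds

# --- Stubs for illustration only (assumptions, not real LLM or Lean calls) ---
def stub_formalise(problem: str, feedback: Optional[str]) -> str:
    if feedback is None:
        return "theorem demo : 2 + 2 = 4 := by"      # first attempt: proof left incomplete
    return "theorem demo : 2 + 2 = 4 := by rfl"      # revised after seeing the error

def stub_check(candidate: str) -> Tuple[bool, str]:
    if candidate.endswith("rfl"):
        return True, ""
    return False, "error: expected a tactic after 'by'"

formal, rounds = iterative_autoformalise("Show that 2 + 2 = 4.", stub_formalise, stub_check)
```

With these stubs the first candidate is rejected and the second is accepted, so the loop stops after two rounds. In a real setting the acceptance test would be whether the theorem prover executes the candidate without error, which is the execution rate the abstract reports rising from 60% to 87%.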
Efficient LLM Collaboration via Planning
Authors: Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin
First: 2025-06-13T08:35:50+00:00 · Latest: 2026-01-16T15:28:18+00:00
Abstract
Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large proprietary models (e.g., models with over 100B parameters) achieve remarkable results across diverse tasks, they are often accessible through costly APIs, making frequent use too costly for many applications. In contrast, small open-source models (e.g., models with fewer than 3B parameters) are freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
中文标题/摘要
标题:通过规划实现高效的LLM协作
近年来,大型语言模型(LLMs)在从简单到复杂的各种任务中都表现出强大的性能。然而,虽然大型专有模型(例如,参数超过1000亿的模型)在多种任务中取得了显著成果,但它们通常通过昂贵的API访问,使得频繁使用成本过高。相比之下,小型开源模型(例如,参数少于30亿的模型)可以免费获取并易于本地部署,但在复杂任务上的表现仍然有限。这种权衡引出了一个自然的问题:小模型和大模型如何高效协作以结合各自的优点?为了解决这一权衡问题,我们提出了COPE,一种测试时协作框架。首先,一个规划模型生成一个轻量级的中间计划,该计划指导下游执行模型。小模型和大模型交替担任规划者和执行者,通过多阶段级联交换计划以协作解决任务。通过涵盖数学推理、代码生成、开放性任务和代理任务基准的全面实验,我们证明COPE在性能上可与大型专有模型媲美,同时大幅降低了推理API成本。这些结果突显了规划作为成本效益推理的有效先验。
Summary / 总结
The paper addresses the trade-off between the performance and accessibility of large and small language models. It proposes COPE, a collaboration framework where a planner model generates plans that guide an executor model, enabling small and large models to take turns and collaboratively solve tasks. Experiments show that COPE matches the performance of large proprietary models while significantly reducing inference costs.
论文探讨了大型专有模型性能与使用成本之间的权衡,以及小型开源模型性能有限的问题。提出了一种测试时协作框架COPE,其中规划模型生成计划以指导下游执行模型。小型和大型模型交替担任规划者和执行者,通过多阶段交换计划来解决任务。实验表明,COPE在性能上与大型专有模型相当,同时大幅降低了推理成本。
A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X-Enabled Autonomous Driving
Authors: Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada
First: 2025-06-20T13:58:10+00:00 · Latest: 2026-01-16T15:26:55+00:00
Abstract
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, its effectiveness is inherently constrained in single-vehicle setups by occlusions, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information, thereby enhancing the completeness and accuracy of predictions. Despite its potential, research on collaborative 3D semantic occupancy prediction is hindered by the lack of dedicated datasets. To bridge this gap, we design a high-resolution semantic voxel sensor in CARLA to produce dense and comprehensive annotations. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. Experimental results demonstrate the superior performance of our baseline, with increasing gains observed as range expands. Our code is available at https://github.com/tlab-wide/Co3SOP.
中文标题/摘要
标题:一种用于V2X使能自动驾驶中协作3D语义占用预测的合成基准
3D语义占用预测是自动驾驶中新兴的感知范式,提供几何细节和语义类别的体素级表示。然而,其有效性在单车辆设置中受到遮挡、传感器范围限制和狭窄视角的内在约束。为了解决这些限制,协作感知使互补信息得以交换,从而提高预测的完整性和准确性。尽管具有潜力,但由于缺乏专用数据集,关于协作3D语义占用预测的研究受到阻碍。为弥补这一差距,我们设计了一个高分辨率语义体素传感器在CARLA中生成密集且全面的注释。我们进一步开发了一个基线模型,通过空间对齐和注意力聚合进行跨代理特征融合。此外,我们建立了具有不同预测范围的基准,旨在系统评估空间范围对协作预测的影响。实验结果表明,我们的基线性能优越,随着范围的扩大,性能提升更为显著。我们的代码可在https://github.com/tlab-wide/Co3SOP获取。
Summary / 总结
The research aims to improve 3D semantic occupancy prediction in autonomous driving by leveraging collaborative perception to overcome limitations such as occlusions and sensor range. A high-resolution semantic voxel sensor in CARLA was developed to generate detailed annotations, and a baseline model using inter-agent feature fusion was created. Experiments showed superior performance for the baseline model, with gains that increase as the prediction range expands, highlighting the benefits of collaborative perception. The code is available at https://github.com/tlab-wide/Co3SOP.
研究旨在通过协作感知来改善自动驾驶中的3D语义占用预测,以克服诸如遮挡和传感器范围等限制。在CARLA中开发了一个高分辨率语义体素传感器以生成详细的注释,并创建了一个使用跨代理特征融合的基线模型。实验表明,随着预测范围的增加,基线模型的性能更好,突显了协作感知的优势。代码可在https://github.com/tlab-wide/Co3SOP获得。
Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
Authors: Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, Zhenbo Luo
First: 2026-01-16T15:14:04+00:00 · Latest: 2026-01-16T15:14:04+00:00
Comments: Accepted by ICASSP2026
Abstract
Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting accuracy by up to 6.9%, and is capable of achieving comparable accuracy with 50% less inference time cost, highlighting both the efficiency and efficacy of TCS on long video understanding.
中文标题/摘要
标题:Think-Clip-Sample: 慢速-快速帧选择的视频理解
多模态大型语言模型(MLLMs)的近期进展显著提升了视频理解能力。然而,它们在长视频上的表现仍受限于计算约束和不理想的帧选择。我们提出了Think-Clip-Sample(TCS),这是一种无需训练的框架,通过两个关键组件增强长视频理解:(i) 多查询推理,生成多个查询以捕捉问题和视频的互补方面;(ii) 剪辑级慢速-快速采样,自适应地平衡密集的局部细节和稀疏的全局上下文。在MLVU、LongVideoBench和VideoMME上的广泛实验表明,TCS在不同MLLMs上的一致性改进,最高可提升6.9%的准确率,并且能够在减少50%推理时间成本的情况下达到相当的准确率,突显了TCS在长视频理解上的高效性和有效性。
Summary / 总结
The research aims to address the limitations of multi-modal large language models (MLLMs) in understanding long-form videos due to computational constraints and suboptimal frame selection. The Think-Clip-Sample (TCS) framework is introduced, which includes Multi-Query Reasoning and Clip-level Slow-Fast Sampling. Multi-Query Reasoning generates multiple queries to capture different aspects of the question and video, while Clip-level Slow-Fast Sampling adaptively balances local details and global context. Experiments show that TCS improves performance across various MLLMs, increasing accuracy by up to 6.9%, and can reach comparable accuracy with 50% less inference time.
研究旨在通过解决计算限制和不理想的帧选择问题,提高多模态大型语言模型(MLLMs)在理解长视频方面的性能。Think-Clip-Sample (TCS) 框架引入了多查询推理来生成多个查询以捕捉问题和视频的互补方面,并通过剪辑级别的慢速-快速采样来适应性地平衡密集的局部细节和稀疏的全局上下文。实验表明,TCS 能够在不同的 MLLMs 上增强性能,准确率最高提升 6.9%,并能在推理时间减少 50% 的情况下达到相当的准确率。
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Authors: Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding, Yunhan Zhao, Zilong Wang, Jiabin Hua, Ming Wen, Jianan Liu, Ranjie Duan, Yifeng Gao, Yingshui Tan, Yunhao Chen, Hui Xue, Xin Wang, Wei Cheng, Jingjing Chen, Zuxuan Wu, Bo Li, Yu-Gang Jiang
First: 2026-01-15T15:52:52+00:00 · Latest: 2026-01-16T15:04:58+00:00
Comments: 41 pages, 22 figures
Abstract
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results under standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
中文标题/摘要
标题:GPT-5.2、Gemini 3 Pro、Qwen3-VL、Grok 4.1 Fast、Nano Banana Pro 和 Seedream 4.5 的安全性报告
大型语言模型(LLMs)和多模态大型语言模型(MLLMs)的快速进化在语言和视觉推理、感知和生成方面带来了显著进步,但这些进步是否转化为可比的安全性改进仍不清楚,部分原因是由于碎片化的评估,这些评估主要集中在孤立的模态或威胁模型上。本报告对六种前沿模型——GPT-5.2、Gemini 3 Pro、Qwen3-VL、Grok 4.1 Fast、Nano Banana Pro 和 Seedream 4.5——进行了综合安全性评估,使用统一的评估协议,结合基准测试、对抗性测试、多语言通用性和合规性评估,对每个模型在语言、视觉语言和图像生成方面的表现进行了评估。通过汇总结果形成安全排行榜和模型概况,我们揭示了一个高度不均衡的安全景观:虽然GPT-5.2表现出一致的强平衡性能,但其他模型在基准安全性、对抗性鲁棒性、多语言通用性和监管合规性方面存在明显的权衡。尽管在标准基准测试下表现出色,所有模型在对抗性测试中仍然高度脆弱,最坏情况下的安全性率低于6%。文本到图像模型在受监管的视觉风险类别中表现出略强的对齐,但在面对对抗性或语义模糊的提示时仍然脆弱。总体而言,这些发现突显了前沿模型中的安全性是多维度的——由模态、语言和评估设计所塑造,强调了需要标准化、全面的安全评估以更好地反映现实世界的风险并指导负责任的部署。
Summary / 总结
This report evaluates the safety of six frontier models (GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5) across language, vision-language, and image generation, using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. The results reveal a highly uneven safety landscape: GPT-5.2 shows consistently strong and balanced performance, while the other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. All models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories but remain fragile when faced with adversarial or semantically ambiguous prompts.
本报告使用基准、对抗、多语言和合规性测试的一体化协议,评估了六种先进模型——GPT-5.2、Gemini 3 Pro、Qwen3-VL、Grok 4.1 Fast、Nano Banana Pro和Seedream 4.5的安全性。研究发现,虽然GPT-5.2表现稳定,但其他模型在安全性、对抗鲁棒性、多语言泛化和合规性方面存在权衡。所有模型在对抗测试中表现脆弱,最坏情况下安全率低于6%。文本到图像模型在受监管的视觉风险类别中表现出更好的一致性,但在对抗或语义模糊的提示下仍然脆弱。
AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
Authors: Weiyi Wang, Xinchi Chen, Jingjing Gong, Xuanjing Huang, Xipeng Qiu
First: 2026-01-16T15:02:41+00:00 · Latest: 2026-01-16T15:02:41+00:00
Abstract
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving their performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating on a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
中文标题/摘要
标题:AstroReason-Bench:评估跨异构空间规划问题的统一代理规划能力
近期在代理大型语言模型(LLMs)方面的进展使它们成为能够在多种任务中进行推理和行动的通用规划者。然而,现有的代理基准主要集中在符号或弱语境环境上,而对其在物理约束的现实世界领域的表现则研究不足。我们引入了AstroReason-Bench,这是一个全面的基准,用于评估代理在空间规划问题(SPP)中的规划能力,SPP是一类具有异构目标、严格物理约束和长期决策的高风险问题。AstroReason-Bench 结合了多种调度模式,包括地面站通信和灵活的地球观测,并提供了一个统一的代理导向交互协议。在多种最先进的开源和闭源代理LLM系统上进行评估,我们发现当前的代理在性能上远逊于专门的求解器,突显了在现实约束下通用规划的关键局限性。AstroReason-Bench 提供了一个具有挑战性和诊断性的测试平台,以供未来代理研究使用。
Summary / 总结
This work evaluates agentic Large Language Models (LLMs) on Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. It introduces AstroReason-Bench, a comprehensive benchmark that integrates multiple scheduling regimes, including ground station communication and agile Earth observation, behind a unified agent-oriented interaction protocol. Across a range of open- and closed-source agentic LLM systems, current agents substantially underperform specialized solvers, which exposes key limitations of generalist planning under realistic constraints.
研究动机是评估大型语言模型(LLMs)在包含异质目标、严格物理约束和长期决策的Space Planning Problems (SPP)中的表现。主要方法是创建AstroReason-Bench,一个综合基准,整合了多种调度模式,并提供了一个统一的代理导向交互协议。关键实验发现表明,当前的代理型LLMs在现实约束下的表现不如专门的求解器,显示出在通用规划方面的显著局限性。
Offline Reinforcement-Learning-Based Power Control for Application-Agnostic Energy Efficiency
Authors: Akhilesh Raj, Swann Perarnau, Aniruddha Gokhale, Solomon Bekele Abera
First: 2026-01-16T15:00:17+00:00 · Latest: 2026-01-16T15:00:17+00:00
Comments: 11 pages, 5 figures, 3 tables and unpublished
Abstract
Energy efficiency has become an integral aspect of modern computing infrastructure design, impacting the performance, cost, scalability, and durability of production systems. The incorporation of power actuation and sensing capabilities in CPU designs is indicative of this, enabling the deployment of system software that can actively monitor and adjust energy consumption and performance at runtime. While reinforcement learning (RL) would seem ideal for the design of such energy efficiency control systems, online training presents challenges ranging from the lack of proper models for setting up an adequate simulated environment, to perturbation (noise) and reliability issues, if training is deployed on a live system. In this paper we discuss the use of offline reinforcement learning as an alternative approach for the design of an autonomous CPU power controller, with the goal of improving the energy efficiency of parallel applications at runtime without unduly impacting their performance. Offline RL sidesteps the issues incurred by online RL training by leveraging a dataset of state transitions collected from arbitrary policies prior to training. Our methodology applies offline RL to a gray-box approach to energy efficiency, combining online application-agnostic performance data (e.g., heartbeats) and hardware performance counters to ensure that the scientific objectives are met with limited performance degradation. Evaluating our method on a variety of compute-bound and memory-bound benchmarks and controlling power on a live system through Intel's Running Average Power Limit, we demonstrate that such an offline-trained agent can substantially reduce energy consumption at a tolerable performance degradation cost.
中文标题/摘要
标题:基于离线强化学习的无应用依赖能量效率功率控制
能量效率已成为现代计算基础设施设计中的一个核心方面,影响着生产系统的性能、成本、可扩展性和耐用性。CPU设计中集成功率执行和传感能力表明了这一点,使得能够部署系统软件在运行时主动监控和调整能源消耗和性能。虽然强化学习(RL)似乎非常适合设计这样的能量效率控制系统,但在线训练存在诸多挑战,包括缺乏适当的模型来设置合适的模拟环境,以及如果在实时系统上部署训练时出现的扰动(噪声)和可靠性问题。 在本文中,我们讨论了使用离线强化学习作为设计自主CPU功率控制器的替代方法,旨在在不影响其性能的情况下,在运行时提高并行应用程序的能量效率。离线RL通过利用在训练前从任意策略收集的状态转换数据集,绕过了在线RL训练带来的问题。 我们的方法将离线RL应用于灰盒能量效率方法,结合在线应用无关的性能数据(例如心跳)和硬件性能计数器,以确保在有限的性能退化下满足科学目标。通过Intel的运行平均功率限制控制功率,并在多种计算和内存绑定基准上评估我们的方法,我们证明了这种离线训练的代理可以在可接受的性能退化成本下显著降低能源消耗。
Summary / 总结
The paper applies offline reinforcement learning (RL) to the design of an autonomous CPU power controller that improves the energy efficiency of parallel applications at runtime without unduly impacting their performance. Online RL training is difficult in this setting: adequate models for a simulated environment are lacking, and training on a live system raises perturbation (noise) and reliability issues. Offline RL sidesteps these problems by learning from a dataset of state transitions collected from arbitrary policies before training. The gray-box method combines application-agnostic performance data (e.g., heartbeats) with hardware performance counters. On compute-bound and memory-bound benchmarks, with power controlled through Intel's Running Average Power Limit, the offline-trained agent substantially reduces energy consumption at a tolerable performance degradation cost.
本文探讨了使用离线强化学习提高现代计算系统能源效率的方法。动机在于在不牺牲应用程序性能的情况下提升能源效率,而在线RL由于环境和可靠性问题难以实现。方法是预先收集各种策略的状态转换,并使用它们来训练离线RL代理。关键发现表明,离线训练的代理可以显著减少能源消耗,同时仅轻微降低性能。
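To make the offline setting concrete, here is a minimal sketch of batch (tabular) Q-learning that learns only from logged transitions and never touches a live system. The abstract does not name the paper's offline RL algorithm, so this is a stand-in technique, and the states, actions, and reward values below are invented toy data (assumptions), not measurements.

```python
from collections import defaultdict


def offline_q_learning(transitions, actions, gamma=0.9, lr=0.5, epochs=200):
    """Batch Q-learning over logged (state, action, reward, next_state) tuples.

    There is no environment interaction: the fixed dataset is replayed.
    """
    q = defaultdict(float)  # (state, action) -> estimated return
    for _ in range(epochs):
        for s, a, r, s_next in transitions:
            best_next = max(q[(s_next, b)] for b in actions)
            q[(s, a)] += lr * (r + gamma * best_next - q[(s, a)])
    return q


def greedy_policy(q, states, actions):
    """Pick the highest-valued action for every state."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}


# Hypothetical log: reward = -(energy cost + slowdown penalty), toy numbers.
STATES = ["memory_bound", "compute_bound"]
ACTIONS = ["low_cap", "high_cap"]
logged = [
    ("memory_bound", "low_cap", -1.0, "memory_bound"),    # cheap, little slowdown
    ("memory_bound", "high_cap", -2.0, "memory_bound"),   # wasted power
    ("compute_bound", "low_cap", -3.0, "compute_bound"),  # heavy slowdown
    ("compute_bound", "high_cap", -2.0, "compute_bound"),
]
policy = greedy_policy(offline_q_learning(logged, ACTIONS), STATES, ACTIONS)
```

With these toy rewards the learned policy lowers the power cap for the memory-bound phase and keeps it high for the compute-bound phase, which mirrors the intended behavior of trading power for little performance loss.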
FEATHer: Fourier-Efficient Adaptive Temporal Hierarchy Forecaster for Time-Series Forecasting
Authors: Jaehoon Lee, Seungwoo Lee, Younghwi Kim, Dohee Kim, Sunghyun Sim
First: 2026-01-16T14:57:41+00:00 · Latest: 2026-01-16T14:57:41+00:00
Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Abstract
Time-series forecasting is fundamental in industrial domains like manufacturing and smart factories. As systems evolve toward automation, models must operate on edge devices (e.g., PLCs, microcontrollers) with strict constraints on latency and memory, limiting parameters to a few thousand. Conventional deep architectures are often impractical here. We propose the Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) for accurate long-term forecasting under severe limits. FEATHer introduces: (i) ultra-lightweight multiscale decomposition into frequency pathways; (ii) a shared Dense Temporal Kernel using projection-depthwise convolution-projection without recurrence or attention; (iii) frequency-aware branch gating that adaptively fuses representations based on spectral characteristics; and (iv) a Sparse Period Kernel reconstructing outputs via period-wise downsampling to capture seasonality. FEATHer maintains a compact architecture (as few as 400 parameters) while outperforming baselines. Across eight benchmarks, it achieves the best ranking, recording 60 first-place results with an average rank of 2.05. These results demonstrate that reliable long-range forecasting is achievable on constrained edge hardware, offering a practical direction for industrial real-time inference.
中文标题/摘要
标题:FEATHer:傅里叶高效自适应时空层次预测器用于时间序列预测
时间序列预测在工业领域如制造业和智能工厂中至关重要。随着系统向自动化发展,模型必须在具有严格延迟和内存限制的边缘设备(如PLC、微控制器)上运行,限制参数数量在几千以内。传统的深度架构在这里往往不切实际。我们提出了傅里叶高效自适应时空层次预测器(FEATHer),以在严重限制条件下实现准确的长期预测。FEATHer 引入了:(i) 超轻量级多尺度分解为频率路径;(ii) 共享密集时空内核,使用投影深度卷积投影,无需循环或注意力机制;(iii) 频率感知分支门控,根据频谱特性自适应融合表示;以及 (iv) 稀疏周期核,通过周期性下采样重构输出以捕捉季节性。FEATHer 维持紧凑的架构(少至400个参数),同时优于基线模型。在八个基准测试中,它取得了最佳排名,记录了60个第一,并且平均排名为2.05。这些结果表明,可靠的长距离预测在受限的边缘硬件上是可行的,为工业实时推理提供了实用的方向。
Summary / 总结
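FEATHer targets accurate long-term time-series forecasting on edge devices (e.g., PLCs, microcontrollers) with strict latency and memory constraints. It combines four components: an ultra-lightweight multiscale decomposition into frequency pathways, a shared Dense Temporal Kernel built from projection, depthwise convolution, and projection without recurrence or attention, frequency-aware branch gating, and a Sparse Period Kernel that reconstructs outputs via period-wise downsampling. The architecture stays compact (as few as 400 parameters) while outperforming baselines: across eight benchmarks it achieves the best ranking, with 60 first-place results and an average rank of 2.05.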
FEATHer 旨在为具有严格延迟和内存限制的边缘设备进行准确的长期时间序列预测。它引入了多尺度频域分解、共享密集时间内核、基于频谱特性的分支门控以及周期内核稀疏重构来保持一个紧凑的架构,参数数量可低至400个。在八个基准测试中,FEATHer 出色地超越了基线模型,取得了最佳排名,平均排名为2.05,并且获得了60个第一。
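Period-wise downsampling can be illustrated with a short sketch. It splits a series into phase-aligned subsequences, predicts each phase separately, and interleaves the results. The averaging rule below is a placeholder assumption standing in for a learned kernel, the function names are hypothetical, and the series is assumed to contain at least one full period. This is not the FEATHer implementation.

```python
def period_downsample(series, period):
    """Split a series into `period` phase-aligned subsequences."""
    return [series[p::period] for p in range(period)]


def seasonal_forecast(series, period, horizon, window=2):
    """Forecast by averaging the last `window` values observed at each phase.

    The average is a stand-in (assumption) for a learned per-phase kernel.
    Requires len(series) >= period so that every phase is non-empty.
    """
    phases = period_downsample(series, period)
    preds = [sum(ph[-window:]) / len(ph[-window:]) for ph in phases]
    n = len(series)
    # Future step n + h falls on phase (n + h) % period.
    return [preds[(n + h) % period] for h in range(horizon)]


# A perfectly periodic toy series with period 4.
history = [1, 2, 3, 4] * 3
forecast = seasonal_forecast(history, period=4, horizon=6)
```

Because each phase is handled on its own short subsequence, the per-phase operator needs very few parameters, which is one way seasonality can be captured cheaply.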
Dynamic Prototype Rehearsal for Continual ECG Arrhythmia Detection
Authors: Sana Rahmani, Reetam Chatterjee, Ali Etemad, Javad Hashemi
Venue: ICASSP 2025
First: 2025-01-13T18:37:10+00:00 · Latest: 2026-01-16T14:55:40+00:00
Comments: Accepted to 2025 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Abstract
Continual Learning (CL) methods aim to learn from a sequence of tasks while avoiding the challenge of forgetting previous knowledge. We present DREAM-CL, a novel CL method for ECG arrhythmia detection that introduces dynamic prototype rehearsal memory. DREAM-CL selects representative prototypes by clustering data based on learning behavior during each training session. Within each cluster, we apply a smooth sorting operation that ranks samples by training difficulty, compressing extreme values and removing outliers. The more challenging samples are then chosen as prototypes for the rehearsal memory, ensuring effective knowledge retention across sessions. We evaluate our method on time-incremental, class-incremental, and lead-incremental scenarios using two widely used ECG arrhythmia datasets, Chapman and PTB-XL. The results demonstrate that DREAM-CL outperforms the state-of-the-art in CL for ECG arrhythmia detection. Detailed ablation and sensitivity studies are performed to validate the different design choices of our method.
中文标题/摘要
标题:持续心电图心律失常检测的动态原型排练
持续学习(CL)方法旨在从一系列任务中学习,同时避免遗忘先前知识的挑战。我们提出了DREAM-CL,这是一种用于心电图心律失常检测的新型CL方法,引入了动态原型排练记忆。DREAM-CL在每次训练会话中根据学习行为对数据进行聚类,选择代表性的原型。在每个聚类内,我们应用平滑排序操作,按训练难度对样本进行排序,压缩极端值并去除异常值。然后选择更具挑战性的样本作为排练记忆中的原型,确保在会话之间有效保留知识。我们在两个广泛使用的ECG心律失常数据集Chapman和PTB-XL上对时间增量、类别增量和导联增量场景进行了评估。结果表明,DREAM-CL在ECG心律失常检测的持续学习方面优于最新技术。进行了详细的消融和敏感性研究以验证我们方法的不同设计选择。
Summary / 总结
DREAM-CL is a novel Continual Learning method for ECG arrhythmia detection built around a dynamic prototype rehearsal memory. In each training session it clusters data by learning behavior, applies a smooth sorting operation within each cluster that ranks samples by training difficulty while compressing extreme values and removing outliers, and stores the more challenging samples as prototypes. On the Chapman and PTB-XL datasets, DREAM-CL outperforms state-of-the-art CL methods in time-incremental, class-incremental, and lead-incremental scenarios.
DREAM-CL 是一种用于 ECG 心律失常检测的新型持续学习方法,使用动态原型复现记忆。它通过在每次训练会话中基于学习行为对数据进行聚类来选择代表原型,并应用平滑排序操作按训练难度对样本进行排序,选择最难的样本作为原型。实验结果表明,DREAM-CL 在时间增量、类别增量和导联增量场景下,使用 Chapman 和 PTB-XL 数据集的表现优于现有最先进的方法。
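The prototype selection step can be sketched as follows. This is an illustrative approximation only: cluster labels are assumed to be computed upstream, difficulty is assumed to be a per-sample score such as mean training loss, and the z-score cutoff plus `tanh` squashing are stand-ins for the paper's smooth sorting operation, whose exact form the abstract does not give.

```python
import math
from collections import defaultdict


def select_prototypes(difficulty, cluster_of, per_cluster, z_max=2.0):
    """Return sample ids chosen as rehearsal prototypes.

    difficulty: sample id -> difficulty score (assumption: e.g. mean loss)
    cluster_of: sample id -> cluster label (clustering assumed done upstream)
    """
    groups = defaultdict(list)
    for sid, label in cluster_of.items():
        groups[label].append(sid)

    chosen = []
    for label in sorted(groups):
        ids = groups[label]
        vals = [difficulty[i] for i in ids]
        mean = sum(vals) / len(vals)
        std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        # Drop outliers, then squash z-scores with tanh to compress extremes.
        kept = [(math.tanh((difficulty[i] - mean) / std), i)
                for i in ids if abs(difficulty[i] - mean) / std <= z_max]
        kept.sort(reverse=True)  # hardest samples first
        chosen.extend(i for _, i in kept[:per_cluster])
    return chosen


# Toy data: sample 5 is an extreme outlier in cluster "A".
toy_difficulty = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4, 4: 0.5, 5: 9.0,
                  6: 1.0, 7: 2.0, 8: 3.0}
toy_clusters = {0: "A", 1: "A", 2: "A", 3: "A", 4: "A", 5: "A",
                6: "B", 7: "B", 8: "B"}
prototypes = select_prototypes(toy_difficulty, toy_clusters, per_cluster=2)
```

In the toy data the outlier is discarded and the two hardest remaining samples per cluster are kept, so the memory favors challenging but representative examples.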
Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning
Authors: Jian Lu, Yi Luo
First: 2025-11-24T08:22:50+00:00 · Latest: 2026-01-16T14:34:45+00:00
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we are returning to the strategy of separating inference and training deployment, and by introducing improvements in the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework, which allows for demand-driven, independent, and elastic scaling of each component, while the accuracy of the algorithm remains completely equivalent to the synchronization method, with both belonging to the on-policy strategy. It is worth emphasizing that we apply a unified tri-model architecture in the training phase, and we also proposed a shared-prompt attention mask to reduce repetitive computation. In practice, our approach consistently delivers significant end-to-end training efficiency improvements on NPU platforms, indicating its potential for widespread application.
中文标题/摘要
标题:周期异步性:一种加速大语言模型强化学习的在线策略方法
自GRPO算法的引入以来,强化学习(RL)引起了越来越多的关注,人们不断努力进行再现和应用。然而,训练效率仍然是一个关键挑战。在主流的RL框架中,推理和训练通常部署在同一设备上。虽然这种方法通过资源整合降低了成本,但其同步执行的计算耦合限制了推理和训练的同时进行。在本研究中,我们重新采用了分离推理和训练部署的策略,并通过改进数据加载器,将传统的同步架构转变为周期异步框架,这使得每个组件可以根据需求独立、弹性地扩展,同时算法的准确性与同步方法完全等价,两者都属于在线策略。值得注意的是,在训练阶段,我们应用了统一的三模型架构,并提出了共享提示注意掩码以减少重复计算。在实践中,我们的方法在NPU平台上持续提供了显著的端到端训练效率提升,表明其具有广泛的应用潜力。
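The abstract mentions a shared-prompt attention mask without defining it. One plausible reading, sketched below purely as an assumption, is that several sampled responses are packed behind a single copy of their common prompt, and each response token may attend to the prompt and to earlier tokens of its own response, but never to sibling responses. The packing layout and the function name are hypothetical.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Build a boolean mask where mask[i][j] means token i may attend to token j.

    Assumed layout: [prompt | response_1 | response_2 | ...].
    """
    total = prompt_len + sum(response_lens)
    # Segment id: 0 for the prompt, k for the k-th response.
    seg = [0] * prompt_len
    for k, n in enumerate(response_lens, start=1):
        seg += [k] * n
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: current and earlier positions only
            # Visible if j is a prompt token or belongs to the same response.
            mask[i][j] = seg[j] == 0 or seg[j] == seg[i]
    return mask


# Two prompt tokens shared by two responses of lengths 2 and 1.
mask = shared_prompt_mask(prompt_len=2, response_lens=[2, 1])
```

Under this layout the prompt is encoded once instead of once per response, which is where the saving in repeated computation would come from. A real implementation would also need per-response position ids, which this sketch leaves out.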