Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
Authors: Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang
First: 2026-01-06T18:59:57+00:00 · Latest: 2026-01-06T18:59:57+00:00
Comments: Project page: https://luhexiao.github.io/Muses.github.io/
Abstract
We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and show its potential for flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.
中文标题/摘要
标题:缪斯:无需训练的前馈范式下幻想3D生物生成设计
我们提出了缪斯,这是首个无需训练的用于生成幻想3D生物的前馈方法。以往的方法依赖于部分感知优化、手动组装或2D图像生成,由于精细部分级操作的挑战和域外生成的限制,往往会产生不现实或不连贯的3D资产。相比之下,缪斯利用了3D骨架,这是一种生物形态的基本表示,以明确和理性的方式组合多样元素。这种骨骼基础将3D内容创作形式化为一种结构感知的设计、组合和生成流水线。缪斯首先通过图约束推理构建一个具有连贯布局和比例的创造性3D骨架。然后,该骨架指导在结构化潜在空间内的体素组装过程,整合来自不同对象的区域。最后,在骨骼条件下应用图像引导的外观建模,以生成与组装形状风格一致且和谐的纹理。大量实验表明,缪斯在视觉保真度和与文本描述的一致性方面达到了最先进的性能,并且在灵活的3D对象编辑方面具有潜力。项目页面:https://luhexiao.github.io/Muses.github.io/
Summary / 总结
Muses is a training-free method for generating fantastic 3D creatures in a feed-forward manner. It uses a 3D skeleton to compose and generate diverse elements, addressing the limitations of previous methods that often produce unrealistic 3D assets. Muses constructs a coherent 3D skeleton through graph-constrained reasoning, guides voxel-based assembly in a structured latent space, and applies image-guided appearance modeling to generate a harmonious texture. Experiments show Muses outperforms existing methods in visual fidelity and alignment with textual descriptions, and demonstrates potential for flexible 3D object editing.
Muses 是一种无需训练的 3D 幻想生物生成方法,采用前馈方式。它利用 3D 骨架来组合和生成多样元素,避免了以往方法在部件级操作和跨域生成方面的问题。Muses 通过图约束推理构建一个连贯的 3D 骨架,指导基于体素的组装过程,并应用图像引导的外观建模来生成风格一致的纹理。实验表明,Muses 在视觉保真度和与文本描述的一致性方面优于以往方法,并展示了灵活的 3D 对象编辑潜力。
Aligning Text, Images, and 3D Structure Token-by-Token
Authors: Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari
First: 2025-06-09T17:59:37+00:00 · Latest: 2026-01-06T18:58:50+00:00
Comments: Project webpage: https://glab-caltech.github.io/kyvo/
Abstract
Creating machines capable of understanding the world in 3D is essential in assisting designers who build and edit 3D environments and robots that navigate and interact within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We show our model's effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
中文标题/摘要
标题:逐词元对齐文本、图像和3D结构
在帮助设计师构建和编辑3D环境以及机器人在三维空间中导航和互动方面,理解3D世界的机器是必不可少的。受语言和图像建模进展的启发,我们研究了自回归模型在新模态——结构化3D场景中的潜力。为此,我们提出了一种统一的LLM框架,将语言、图像和3D场景对齐,并详细阐述了实现最佳训练和性能的关键设计选择,包括数据表示、模态特定目标等。我们展示了如何对复杂3D对象进行分词,以纳入我们的结构化3D场景模态。我们在四个核心3D任务——渲染、识别、指令跟随和问答——以及四个3D数据集(合成和真实世界)上评估了性能。我们展示了我们的模型在从单张图像重建包含复杂对象的完整3D场景以及真实世界3D对象识别任务上的有效性。项目网页:https://glab-caltech.github.io/kyvo/
Summary / 总结
The research aims to develop machines capable of understanding 3D environments, which is crucial for designers and robots. The authors propose a unified LLM framework that aligns text, images, and 3D structures, and evaluate its performance on four core 3D tasks using both synthetic and real-world datasets. The model effectively reconstructs complete 3D scenes from a single image and performs well in real-world 3D object recognition tasks.
研究旨在开发能够理解3D环境的机器,这对于设计师和机器人至关重要。作者提出了一种统一的LLM框架,将文本、图像和3D结构对齐,并使用合成和真实世界的数据集评估其在四个核心3D任务上的性能。该模型能够从单张图像重建完整的3D场景,并在3D物体识别任务中表现出色。
InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Authors: Hao Yu, Haotong Lin, Jiawei Wang, Jiaxin Li, Yida Wang, Xueyang Zhang, Yue Wang, Xiaowei Zhou, Ruizhen Hu, Sida Peng
First: 2026-01-06T18:57:06+00:00 · Latest: 2026-01-06T18:57:06+00:00
Comments: 19 pages, 13 figures
Abstract
Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder the recovery of geometric detail. This paper introduces InfiniDepth, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method's capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Extensive experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.
中文标题/摘要
标题:InfiniDepth:基于神经隐式场的任意分辨率和细粒度深度估计
现有的深度估计方法本质上只能在离散的图像网格上预测深度。这种表示方式限制了其对任意输出分辨率的扩展性,并阻碍了几何细节的恢复。本文引入了InfiniDepth,它将深度表示为神经隐式场。通过一个简单而有效的局部隐式解码器,我们可以在连续的2D坐标处查询深度,从而实现任意分辨率和细粒度的深度估计。为了更好地评估我们方法的能力,我们从五个不同的游戏中收集了一个高质量的4K合成基准,涵盖了多种具有丰富几何和外观细节的场景。广泛的实验表明,InfiniDepth在合成和真实世界基准上的相对和度量深度估计任务中均达到了最先进的性能,在精细细节区域表现尤为出色。此外,它还有助于大视角变化下的新视图合成任务,生成高质量的结果,且孔洞和伪影更少。
Summary / 总结
The research motivation is to overcome the limitations of existing depth estimation methods that are restricted to discrete image grids, which hinder scalability and geometric detail recovery. InfiniDepth represents depth as neural implicit fields, allowing depth estimation at continuous 2D coordinates and achieving arbitrary-resolution and fine-grained depth estimation. Experiments on a high-quality 4K synthetic benchmark and real-world benchmarks show that InfiniDepth outperforms existing methods, especially in fine-detail regions, and improves novel view synthesis under large viewpoint shifts with fewer holes and artifacts.
InfiniDepth通过将深度表示为神经隐式场来解决现有深度估计方法的局限性,使其能够实现任意分辨率和精细的深度估计。该方法使用局部隐式解码器在连续的2D坐标上查询深度。实验表明,InfiniDepth在高质量的4K合成基准和真实世界数据集上优于先前的方法,特别是在精细细节区域表现更佳,并且在大视角变化下的新颖视图合成中表现出色。
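The idea of querying depth at continuous 2D coordinates can be illustrated with a small sketch. The NumPy code below is not InfiniDepth's decoder: the feature grid, the two-layer MLP, its random weights, and the choice of bilinear feature sampling are all hypothetical stand-ins. It only shows why a decoder that takes coordinates as input is not tied to the resolution of the grid it reads from.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Sample an (H, W, C) feature grid at continuous (x, y) points in [0, 1]^2."""
    H, W, _ = feat.shape
    x = xy[:, 0] * (W - 1)
    y = xy[:, 1] * (H - 1)
    # Index of the top-left neighbour, clipped so that x0 + 1 and y0 + 1 stay in range.
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x0 + 1] * wx
    bottom = feat[y0 + 1, x0] * (1 - wx) + feat[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bottom * wy

def query_depth(feat, xy, params):
    """Hypothetical decoder: interpolated local feature + coordinate -> scalar depth."""
    W1, b1, W2, b2 = params
    inputs = np.concatenate([bilinear_sample(feat, xy), xy], axis=1)
    hidden = np.maximum(inputs @ W1 + b1, 0.0)  # ReLU
    return (hidden @ W2 + b2)[:, 0]

def make_grid(n):
    """n x n query points covering [0, 1]^2, returned as an (n * n, 2) array."""
    xs = np.linspace(0.0, 1.0, n)
    gx, gy = np.meshgrid(xs, xs)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))  # stand-in for encoder features (assumed shape)
params = (rng.normal(size=(18, 32)), np.zeros(32),
          rng.normal(size=(32, 1)), np.zeros(1))  # random, untrained weights

depth_coarse = query_depth(feat, make_grid(4), params)   # 16 query points
depth_fine = query_depth(feat, make_grid(64), params)    # 4096 query points, same features
```

The same 8 x 8 feature grid serves both the 4 x 4 and the 64 x 64 query, which is the property the abstract refers to as arbitrary-resolution estimation.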
A Versatile Multimodal Agent for Multimedia Content Generation
Authors: Daoan Zhang, Wenlin Yao, Xiaoyang Wang, Yebowen Hu, Jiebo Luo, Dong Yu
First: 2026-01-06T18:49:47+00:00 · Latest: 2026-01-06T18:49:47+00:00
Abstract
With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most can only serve as individual components within specific application scenarios and cannot complete tasks end-to-end in real-world applications. In practice, editing experts often work with a wide variety of image and video inputs and produce multimodal outputs -- a video typically includes audio, text, and other elements. Current models cannot effectively achieve this level of integration across modalities. The rise of agent-based systems, however, has made it possible to use AI tools to tackle complex content generation tasks. To handle these scenarios, we propose MultiMedia-Agent, an agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce skill acquisition theory to model training data curation and agent training. We design a two-stage correlation strategy for plan optimization, comprising self-correlation and model preference correlation. Additionally, we use the generated plans to train the MultiMedia-Agent via a three-stage approach comprising base plan fine-tuning, success plan fine-tuning, and preference optimization. Comparison results demonstrate that our approach is effective and that the MultiMedia-Agent generates better multimedia content than novel models.
中文标题/摘要
标题:一种多功能多模态代理用于多媒体内容生成
随着AIGC(AI生成内容)技术的进步,越来越多的生成模型正在革新视频编辑、音乐生成乃至电影制作等领域。然而,由于当前AIGC模型的局限性,大多数模型只能在特定应用场景中作为单一组件发挥作用,无法在实际应用中端到端地完成任务。在实际应用中,编辑专家通常需要处理各种各样的图像和视频输入,生成多模态输出——视频通常包括音频、文本和其他元素。当前模型难以实现这种多模态之间的有效整合。然而,基于代理系统的兴起使得使用AI工具应对复杂的生成任务成为可能。为了应对复杂的场景,本文提出了一种多媒体代理,旨在自动化复杂内容的创作。我们的代理系统包括数据生成流水线、内容创作工具库以及用于评估偏好对齐的一系列指标。值得注意的是,我们引入了技能获取理论来建模训练数据的收集和代理训练。我们设计了两阶段相关策略进行计划优化,包括自我相关和模型偏好相关。此外,我们通过三个阶段的方法利用生成的计划来训练多媒体代理,包括基础/成功计划微调和偏好优化。比较结果表明,我们的方法是有效的,多媒体代理能够生成比新型模型更好的多媒体内容。
Summary / 总结
The research aims to address the limitations of current AIGC models in handling multimodal content generation tasks. The authors propose a MultiMedia-Agent that integrates a data generation pipeline, a content creation tool library, and evaluation metrics. The agent uses skill acquisition theory for training data curation and agent training, and employs a two-stage correlation strategy for plan optimization. Experimental results show that the MultiMedia-Agent generates better multimedia content than the novel models it is compared with.
本文提出了一种MultiMedia-Agent,以解决当前AIGC模型在处理多模态内容生成方面的局限性。该代理系统包括数据生成管道、工具库和评估指标。作者引入了技能获取理论来指导训练数据的收集和代理训练,并采用两阶段相关策略进行计划优化。实验结果表明,MultiMedia-Agent在生成多媒体内容方面优于新型模型。
VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Authors: Di Wu, Yixin Wan, Kai-Wei Chang
First: 2025-05-26T17:59:33+00:00 · Latest: 2026-01-06T18:46:16+00:00
Abstract
Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a simple yet effective perspective for advancing text-image retrieval. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
中文标题/摘要
标题:VisRet:可视化提高知识密集型文本到图像检索
文本到图像检索(T2I检索)仍然具有挑战性,因为跨模态嵌入通常表现为概念的集合,未能充分表示如姿态和视角等结构化的视觉关系。我们提出了一种名为Visualize-then-Retrieve(VisRet)的检索范式,以缓解跨模态相似性对齐的这一局限性。VisRet 首先通过T2I生成将文本查询投影到图像模态,然后在图像模态内进行检索,以绕过跨模态检索器在识别细微的视觉空间特征方面的弱点。在四个基准测试(Visual-RAG、INQUIRE-Rerank、Microsoft COCO以及我们的新基准Visual-RAG-ME,包含多实体比较)上,VisRet 显著优于跨模态相似性匹配和将T2I检索重新表述为文本到文本相似性匹配的基线,使用CLIP作为检索器时,平均nDCG@30提高了0.125,使用E5-V时提高了0.121。对于下游问答任务,VisRet 在Visual-RAG和Visual-RAG-ME上的top-1检索准确率分别提高了3.8%和15.7%,top-10检索准确率分别提高了3.9%和11.1%。消融研究显示,VisRet 与不同的T2I指令LLM、T2I生成模型和下游LLM兼容。VisRet 提供了一种简单而有效的视角,以推进文本图像检索。我们的代码和新基准已公开发布在https://github.com/xiaowu0162/Visualize-then-Retrieve。
Summary / 总结
The paper addresses the challenge of text-to-image retrieval by proposing VisRet, which first converts textual queries into images through text-to-image generation and then performs retrieval within the image modality. This method outperforms cross-modal similarity matching and text-to-text approaches across four benchmarks, improving nDCG@30 by an average of 0.125 with CLIP and 0.121 with E5-V. For downstream question answering, it raises accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7%, respectively, in top-1 retrieval and by 3.9% and 11.1% in top-10 retrieval. Ablation studies confirm its compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet offers a straightforward yet effective approach to advancing text-image retrieval.
论文提出VisRet方法,首先通过文本到图像生成将文本查询转换为图像,然后在图像模态内进行检索。该方法在四个基准上优于跨模态相似性匹配和文本到文本的方法,使用CLIP和E5-V作为检索器时,nDCG@30平均分别提高了0.125和0.121。此外,在下游问答任务中,它在Visual-RAG和Visual-RAG-ME上的准确率在top-1检索中分别提高了3.8%和15.7%,在top-10检索中分别提高了3.9%和11.1%。消融研究证实了其与各种模型的兼容性。VisRet提供了一种简单而有效的文本图像检索方法。
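The retrieval gains above are reported in nDCG@30. As a reminder of what that number measures, here is a minimal sketch of nDCG@k using the common linear-gain, log2-discount form. Whether VisRet's evaluation uses exactly this variant is an assumption, and the relevance list in the usage lines is made-up toy data. The ideal ranking is taken over the supplied list only, so the list is assumed to contain every candidate's relevance label.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items of a ranked relevance list."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy ranking: relevant items sit at ranks 2 and 4 (hypothetical labels).
ranked_relevance = [0, 1, 0, 1]
score = ndcg_at_k(ranked_relevance, k=30)
```

Because the score is normalized to the range 0 to 1, an average improvement of 0.125 is an absolute change on that scale.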
Shallow-circuit Supervised Learning on a Quantum Processor
Authors: Luca Candelori, Swarnadeep Majumder, Antonio Mezzacapo, Javier Robledo Moreno, Kharen Musaelian, Santhanam Nagarajan, Sunil Pinnamaneni, Kunal Sharma, Dario Villani
First: 2026-01-06T18:26:53+00:00 · Latest: 2026-01-06T18:26:53+00:00
Abstract
Quantum computing has long promised transformative advances in data analysis, yet practical quantum machine learning has remained elusive due to fundamental obstacles such as a steep quantum cost for the loading of classical data and poor trainability of many quantum machine learning algorithms designed for near-term quantum hardware. In this work, we show that one can overcome these obstacles by using a linear Hamiltonian-based machine learning method which provides a compact quantum representation of classical data via ground state problems for k-local Hamiltonians. We use the recent sample-based Krylov quantum diagonalization method to compute low-energy states of the data Hamiltonians, whose parameters are trained to express classical datasets through local gradients. We demonstrate the efficacy and scalability of the methods by performing experiments on benchmark datasets using up to 50 qubits of an IBM Heron quantum processor.
中文标题/摘要
标题:浅电路监督学习在量子处理器上的应用
量子计算长期以来一直有望在数据分析方面带来变革性的进步,但由于加载经典数据的量子成本高昂以及许多为近期内量子硬件设计的量子机器学习算法训练效果不佳等根本障碍,实用的量子机器学习仍然难以实现。在本文中,我们展示了可以通过使用基于线性哈密顿量的机器学习方法来克服这些障碍,该方法通过k局部哈密顿量的基态问题提供了一种经典数据的紧凑量子表示。我们使用最近的基于样本的Krylov量子对角化方法来计算数据哈密顿量的低能态,通过局部梯度训练其参数以表达经典数据集。我们通过在IBM Heron量子处理器上使用多达50个量子比特进行实验,展示了该方法的有效性和可扩展性。
Summary / 总结
This work addresses the challenges in practical quantum machine learning by proposing a linear Hamiltonian-based method that uses ground state problems for k-local Hamiltonians to represent classical data compactly. The method employs the sample-based Krylov quantum diagonalization to compute low-energy states and trains the parameters to express classical datasets. Experiments on benchmark datasets using up to 50 qubits of an IBM Heron quantum processor demonstrate the method's efficacy and scalability.
该研究提出一种基于线性哈密顿量的方法,利用k局部哈密顿量的基态问题紧凑地表示经典数据,以应对实用量子机器学习中的挑战。该方法使用基于样本的Krylov量子对角化计算低能态,并通过局部梯度训练参数以表达经典数据集。在IBM Heron量子处理器上使用多达50个量子比特对基准数据集进行的实验表明了该方法的有效性和可扩展性。
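For readers unfamiliar with Krylov methods, the following is a purely classical sketch of the underlying idea: project a Hermitian matrix onto the subspace spanned by v, Hv, H^2 v, ... and take the lowest eigenvalue of the small projected matrix as an estimate of the ground-state energy. It is not the sample-based quantum algorithm used in the paper, and the random 12 x 12 matrix stands in for a data Hamiltonian only as an assumption for illustration.

```python
import numpy as np

def krylov_ground_energy(H, v0, m):
    """Lowest Ritz value of symmetric H in the m-dimensional Krylov space of v0."""
    n = H.shape[0]
    Q = np.zeros((n, m))
    Q[:, 0] = v0 / np.linalg.norm(v0)
    k = 1
    for j in range(1, m):
        w = H @ Q[:, j - 1]
        # Two Gram-Schmidt passes keep the basis numerically orthonormal.
        for _ in range(2):
            w = w - Q[:, :j] @ (Q[:, :j].T @ w)
        norm = np.linalg.norm(w)
        if norm < 1e-12:  # the subspace is already invariant; stop early
            break
        Q[:, j] = w / norm
        k = j + 1
    Q = Q[:, :k]
    T = Q.T @ H @ Q          # small projected matrix
    T = (T + T.T) / 2.0      # remove rounding asymmetry
    return np.linalg.eigvalsh(T)[0]

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 12))
H = (A + A.T) / 2.0                      # random symmetric test matrix
v0 = rng.normal(size=12)

exact = np.linalg.eigvalsh(H)[0]         # reference ground-state energy
e4 = krylov_ground_energy(H, v0, 4)      # small subspace: upper bound on exact
e12 = krylov_ground_energy(H, v0, 12)    # full-dimensional subspace
```

By the variational principle the estimate can only approach the true ground-state energy from above as the subspace grows, which is what makes a truncated subspace useful.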
LTX-2: Efficient Joint Audio-Visual Foundation Model
Authors: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman
First: 2026-01-06T18:24:41+00:00 · Latest: 2026-01-06T18:24:41+00:00
Abstract
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
中文标题/摘要
标题:LTX-2:高效联合音频-视觉基础模型
近期的文本到视频扩散模型可以生成引人入胜的视频序列,但它们仍然无声,缺少音频提供的语义、情感和氛围提示。我们引入了LTX-2,这是一个开源的基础模型,能够以统一的方式生成高质量、时间同步的音频-视觉内容。LTX-2 由一个不对称的双流Transformer组成,包含一个140亿参数的视频流和一个50亿参数的音频流,通过带有时间位置嵌入的双向音频-视频交叉注意力层以及用于共享时间步条件的跨模态AdaLN进行耦合。这种架构使统一音频-视觉模型的高效训练和推理成为可能,同时为视频生成分配了比音频生成更多的容量。我们使用多语言文本编码器以获得更广泛的提示理解,并引入了一种模态感知的无分类器引导机制(modality-CFG),以提高音频-视觉对齐和可控性。除了生成语音,LTX-2 还生成丰富、连贯的音频轨道,跟随每个场景的角色、环境、风格和情感,并包含自然的背景音和拟音元素。在我们的评估中,该模型在开源系统中实现了最先进的音频-视觉质量和提示一致性,同时以远低于专有模型的计算成本和推理时间取得了与其相当的结果。所有模型权重和代码均已公开发布。
Summary / 总结
LTX-2 is designed to generate high-quality, temporally synchronized audiovisual content using an asymmetric dual-stream transformer coupled through bidirectional audio-video cross-attention layers. The architecture, with a 14B-parameter video stream and a 5B-parameter audio stream, enables efficient training and inference of a unified audiovisual model while allocating more capacity to video generation than to audio generation. Experimental results show that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time.
LTX-2 是一个开源的音频-视觉基础模型,旨在生成高质量的同步音频和视频内容。它使用了一个不对称的双流Transformer,为视频生成分配了更多的参数,并结合了双向交叉注意力和时间位置嵌入。该模型在音频视觉质量和提示遵循性方面优于其他开源系统,并以更低的计算成本和推理时间取得了与专有模型相当的结果。它可以生成与场景中的角色、环境、风格和情绪相匹配的连贯音频轨道,包括自然背景音和拟音效果。
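The abstract mentions a modality-aware classifier-free guidance mechanism but does not define it. As background only, the sketch below shows standard classifier-free guidance, which extrapolates from an unconditional prediction toward a conditional one. Applying it to each stream with its own scale, as in the usage lines, is a hypothetical illustration and not LTX-2's modality-CFG; the array shapes and scale values are likewise made up.

```python
import numpy as np

def classifier_free_guidance(pred_uncond, pred_cond, scale):
    """Standard CFG: move from the unconditional prediction toward the conditional one.

    scale = 0 returns the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 extrapolates beyond it.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
# Stand-ins for one denoising step's predictions in each stream (assumed shapes).
video_uncond, video_cond = rng.normal(size=(2, 4, 8))
audio_uncond, audio_cond = rng.normal(size=(2, 4, 2))

# Hypothetical: a separate guidance scale per modality.
video_guided = classifier_free_guidance(video_uncond, video_cond, scale=5.0)
audio_guided = classifier_free_guidance(audio_uncond, audio_cond, scale=3.0)
```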
AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise
Authors: Tara Bogavelli, Roshnee Sharma, Hari Subramani
First: 2025-09-13T01:18:23+00:00 · Latest: 2026-01-06T18:18:48+00:00
Abstract
While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address these gaps by providing a comprehensive enterprise-specific benchmark evaluating 18 distinct agentic configurations across state-of-the-art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model-specific architectural preferences that challenge the prevalent one-size-fits-all paradigm in agentic AI systems. It also reveals significant weaknesses in overall agentic performance on enterprise tasks with the highest scoring models achieving a maximum of only 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.
中文标题/摘要
标题:AgentArch:评估企业中代理架构的全面基准
虽然代理架构的各个组件已被单独研究,但对不同设计维度在复杂多代理系统中如何相互作用的实证理解仍然有限。本研究旨在通过提供一个全面的企业特定基准来填补这些空白,该基准在最先进的大型语言模型上评估了18种不同的代理配置。我们考察了四个关键的代理系统维度:编排策略、代理提示实现(ReAct与函数调用)、记忆架构以及思维工具集成。我们的基准揭示了显著的模型特定架构偏好,挑战了代理AI系统中普遍存在的一刀切范式。它还揭示了代理在企业任务中的整体表现存在显著弱点:得分最高的模型在较复杂任务上的成功率最高仅为35.3%,在较简单任务上为70.8%。我们希望这些发现能够使关于架构组件和模型选择的决策更具实证支持,从而指导未来代理系统的设计。
Summary / 总结
This study aims to evaluate how different design dimensions interact in complex multi-agent systems by providing a comprehensive benchmark for 18 agentic configurations across state-of-the-art large language models. The benchmark examines four critical dimensions: orchestration strategy, agent prompt implementation, memory architecture, and thinking tool integration. Key findings include significant model-specific architectural preferences and substantial weaknesses in overall agentic performance on enterprise tasks, with the highest scoring models achieving only 35.3% success on complex tasks and 70.8% on simpler tasks. These results challenge the one-size-fits-all paradigm in agentic AI systems and provide insights for future design decisions.
该研究旨在通过提供一个全面的企业特定基准来评估不同设计维度在复杂多代理系统中的交互情况。该基准评估了18种不同的代理配置,涵盖四大关键维度:编排策略、代理提示实现、记忆架构和思维工具集成。主要发现包括显著的模型特定架构偏好以及在企业任务中整体代理性能的显著不足:得分最高的模型在复杂任务上的成功率仅为35.3%,在较简单任务上为70.8%。这些结果挑战了一刀切的代理AI系统范式,并为未来的设计决策提供了见解。
Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models
Authors: Kartik Bose, Abhinandan Kumar, Raghuraman Soundararajan, Priya Mudgil, Samonee Ralmilay, Niharika Dutta, Manphool Singhal, Arun Kumar, Saugata Sen, Anurima Patra, Priya Ghosh, Abanti Das, Amit Gupta, Ashish Verma, Dipin Sudhakaran, Ekta Dhamija, Himangi Unde, Ishan Kumar, Krithika Rangarajan, Prerna Garg, Rachel Sequeira, Sudhin Shylendran, Taruna Yadav, Tej Pal, Pankaj Gupta
First: 2026-01-06T18:18:44+00:00 · Latest: 2026-01-06T18:18:44+00:00
Abstract
Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes. Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting. Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot. Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.
Summary / 总结
The study aimed to create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, to evaluate the validity and accuracy of open-weight and proprietary language models for RADS assignment. The dataset includes 1,600 synthetic reports across 10 RADS frameworks. Evaluating 41 quantized small language models and GPT-5.2, the study found that under guided prompting, GPT-5.2 achieved 99.8% validity and 81.1% accuracy, while pooled SLMs achieved 96.8% validity and 61.1% accuracy. Performance improved with model size and guided prompting, but gaps remained for higher-complexity RADS schemes.
研究旨在创建一个由放射科医生验证的多RADS合成基准RXL-RADSet,以评估各种语言模型在RADS分配中的有效性和准确性。该数据集包含1,600份跨10个RADS框架的合成报告。评估了41个小语言模型和GPT-5.2,研究发现,在引导提示下,GPT-5.2实现了99.8%的有效性和81.1%的准确性,而聚合的小语言模型实现了96.8%的有效性和61.1%的准确性。性能随着模型大小的增加和引导提示相比零样本提示有所提高。
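The two primary endpoints can be made concrete with a short sketch. The definitions used here are assumptions, since the abstract does not spell them out: validity is taken as the share of outputs that are an allowed category label, and accuracy as the share of all outputs that match the reference label, so an invalid output counts as wrong. The predictions and labels are toy values, not data from RXL-RADSet.

```python
def validity_and_accuracy(predictions, labels, allowed):
    """Return (validity, accuracy) over paired predictions and reference labels."""
    total = len(labels)
    valid = sum(1 for p in predictions if p in allowed)
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return valid / total, correct / total

# Toy example using BI-RADS assessment categories 0 to 6 as the allowed label set.
allowed_categories = {"0", "1", "2", "3", "4", "5", "6"}
toy_predictions = ["2", "4", "7", "3"]   # "7" is not an allowed category
toy_labels = ["2", "5", "4", "3"]
validity, accuracy = validity_and_accuracy(toy_predictions, toy_labels, allowed_categories)

# Size of the pooled SLM evaluation stated in the abstract: 41 models x 1,600 reports.
pooled_predictions = 41 * 1600
```

In the toy example three of four outputs are valid and two of four are correct, giving a validity of 0.75 and an accuracy of 0.5; the pooled count works out to the 65,600 predictions quoted in the abstract.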
The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization
Authors: Ruixing Zhang, Zihan Liu, Leilei Sun, Tongyu Zhu, Weifeng Lv
First: 2026-01-06T18:13:24+00:00 · Latest: 2026-01-06T18:13:24+00:00
Abstract
Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric that quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations on 16 ALMs show that audio geo-localization capability has emerged in ALMs. We find that closed-source models substantially outperform open-source models, and that linguistic clues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may advance ALMs with better geospatial reasoning capability.
中文标题/摘要
标题:声纳时刻:音频语言模型在音频地理定位中的基准测试
地理定位旨在推断给定信号的地理来源。在计算机视觉中,地理定位已成为对组合推理能力的严苛基准测试,并与公共安全相关。相比之下,由于缺乏高质量的音频-位置配对,音频地理定位的进步受到限制。为了解决这一差距,我们引入了AGL1K,这是第一个面向音频语言模型(ALMs)的音频地理定位基准,覆盖了72个国家和地区。为了从众包平台中提取可靠可地理定位的样本,我们提出了音频可地理定位度量,该度量量化了每个录音的信息量,生成了1,444个精选音频片段。对16个ALMs的评估显示,ALMs已经具备了音频地理定位的能力。我们发现,闭源模型显著优于开源模型,语言线索往往成为预测的主要支撑。我们进一步分析了ALMs的推理痕迹、区域偏见、错误原因以及可地理定位度量的可解释性。总体而言,AGL1K为音频地理定位建立了基准,并可能促进具有更好地理空间推理能力的ALMs的发展。
Summary / 总结
The paper aims to benchmark audio-language models (ALMs) in audio geo-localization, a task of inferring the geographic origin of audio signals. To address the lack of high-quality audio-location pairs, the authors introduce AGL1K, a benchmark dataset spanning 72 countries and territories, and propose the Audio Localizability metric to curate 1,444 audio clips. Evaluations on 16 ALMs reveal that closed-source models outperform open-source models, and linguistic clues are often crucial for predictions. The study also analyzes reasoning traces, regional biases, and error causes, highlighting the need for better geospatial reasoning capabilities in ALMs.
论文旨在评估音频语言模型(ALMs)在音频地理定位任务中的表现,即推断音频信号的地理来源。为了解决高质量音频-位置配对数据不足的问题,作者引入了AGL1K基准数据集,覆盖72个国家和地区,包含1,444个精选音频片段。对16种ALMs的评估显示,闭源模型优于开源模型,语言线索往往是预测的关键。研究还分析了ALMs的推理过程、区域偏见和错误原因,有助于提高ALMs的地理空间推理能力。
From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
Authors: Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J. Zico Kolter, Andrew Gordon Wilson
First: 2026-01-06T18:04:03+00:00 · Latest: 2026-01-06T18:04:03+00:00
Abstract
Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed from merely applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and fail to target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduce epiplexity, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself. We also present practical procedures to estimate epiplexity which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation for data selection, guiding how to select, generate, or transform data for learning systems.
中文标题/摘要
标题:从熵到Epiplexity:重思计算能力有限的智能的信息
我们能否从数据中学习到比生成过程本身更多的信息?仅通过应用确定性变换能否从现有数据中构建新的有用信息?数据中的可学习内容能否在不考虑下游任务的情况下进行评估?针对这些问题,香农信息和柯尔莫哥洛夫复杂性几乎无能为力,部分原因是它们假设观察者具有无限的计算能力,并未能针对有用的信息内容。在本文中,我们识别并举例说明了信息理论中的三个看似矛盾之处:(1)信息不能通过确定性变换增加;(2)信息与数据的顺序无关;(3)似然建模仅仅是分布匹配。为了阐明这些结果与现代实践之间的张力,并量化数据的价值,我们引入了Epiplexity,这是一种计算能力有限的观察者可以从数据中学习到的信息的正式化定义。Epiplexity捕获了数据中的结构内容,同时排除了时间限制下的熵,即伪随机数生成器和混沌动力系统中体现的随机不可预测内容。借助这些概念,我们展示了如何通过计算创造信息,信息如何依赖于数据的顺序,以及如何通过似然建模生成比数据生成过程本身更复杂的程序。我们还提出了估计Epiplexity的实用方法,这些方法表明能够捕捉数据来源之间的差异,与下游性能相关,并突出那些提高离分布泛化的数据集干预措施。与模型选择原则不同,Epiplexity为数据选择提供了一个理论基础,指导如何选择、生成或转换用于学习系统的数据。
Summary / 总结
This work addresses the limitations of traditional information measures like Shannon information and Kolmogorov complexity by introducing epiplexity, a measure of information that accounts for the computational constraints of observers. The study identifies three paradoxes in information theory and proposes epiplexity to quantify the learnable content in data. Key findings include the ability to create new information through computation, the dependency of information on data ordering, and the potential for likelihood modeling to generate more complex programs than the data generating process. The authors also provide practical methods to estimate epiplexity, which track with downstream performance and highlight interventions that improve out-of-distribution generalization.
本文通过引入epiplexity,一种用于计算能力有限的观察者从数据中学习的内容的正式化信息,解决了传统信息度量如香农信息和柯尔莫哥洛夫复杂性所面临的局限性。研究指出了信息理论中的三个悖论,并展示了信息可以通过计算创建、依赖于数据的顺序,并且可以生成比数据生成过程更复杂的程序。关键发现包括估计epiplexity的实用方法,这些方法能够捕捉不同数据源之间的差异、与下游性能相关,并且能够突出改进数据外推泛化的数据干预措施。
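The abstract's point about pseudorandom generators can be illustrated with an off-the-shelf compressor acting as a stand-in for a computationally bounded observer. This sketch does not compute epiplexity or time-bounded entropy as the paper defines them; using zlib, and the particular byte streams below, are assumptions chosen for illustration. A seeded pseudorandom stream has a very short description (the generator plus its seed), yet the compressor cannot find it, whereas a simple periodic stream compresses almost completely.

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more structure found)."""
    return len(zlib.compress(data, 9)) / len(data)

n = 100_000
rng = random.Random(0)  # fixed seed: the whole stream is determined by a few bytes
pseudorandom = bytes(rng.getrandbits(8) for _ in range(n))
periodic = bytes(i % 7 for i in range(n))  # obvious repeating structure

ratio_pseudorandom = compression_ratio(pseudorandom)
ratio_periodic = compression_ratio(periodic)
```

To this bounded observer the pseudorandom stream looks like incompressible noise even though it was produced deterministically, which is the kind of content the abstract says epiplexity excludes.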
Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion
Authors: Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv
Venue: ICLR 2026
First: 2026-01-06T17:52:02+00:00 · Latest: 2026-01-06T17:52:02+00:00
Comments: Preprint. Under review at ICLR 2026
Abstract
Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.
中文标题/摘要
标题:文本到图像扩散中的评论者引导强化遗忘
文本到图像扩散模型中的机器遗忘旨在移除目标概念同时保持整体实用性。先前的扩散遗忘方法通常依赖于监督权重编辑或全局惩罚;强化学习(RL)方法虽然灵活,但通常优化稀疏的轨迹末尾奖励,导致高方差更新和弱的信用分配。我们提出了一种通用的RL框架用于扩散遗忘,将去噪视为一个顺序决策过程,并引入了具有噪声时间步奖励的时间步感知评论者。具体而言,我们基于CLIP训练奖励预测器于噪声潜变量上,并使用其每步信号来计算策略梯度更新逆向扩散核的优势估计。我们的算法易于实现,支持离策重用,并可插入标准文本到图像主干。在多个概念上,该方法在遗忘效果上优于或与强大的基线相当,同时保持图像质量和良性提示保真度;消融实验表明,(i) 每步评论者和(ii) 噪声条件奖励对于稳定性和有效性至关重要。我们发布了代码和评估脚本以促进可重复性和基于RL的扩散遗忘的未来研究。
Summary / 总结
The paper addresses the challenge of unlearning targeted concepts in text-to-image diffusion models while preserving image quality and prompt fidelity. It introduces a reinforcement learning (RL) framework that treats denoising as a sequential decision process, using a timestep-aware critic with noisy-step rewards. The method trains a CLIP-based reward predictor on noisy latents and uses its per-step signal to compute advantage estimates for policy-gradient updates, achieving forgetting that is better than or comparable to strong baselines while maintaining image quality and benign-prompt fidelity. Ablation studies confirm the importance of per-step critics and noisy-conditioned rewards for stability and effectiveness.
论文旨在解决从文本到图像的扩散模型中去除特定概念的同时保持图像质量的问题。提出了一种基于强化学习(RL)的框架,将去噪视为一个顺序决策过程,使用具有噪声步奖励的时间步感知评论家来引导去学习过程。该方法在去除目标概念的同时保持图像质量和提示保真度方面优于强基线。
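The abstract above describes turning per-step reward signals into advantage estimates, but does not say which estimator is used. A minimal sketch, assuming a one-step temporal-difference advantage and a critic that returns one value per denoising step (the function name, the `gamma` parameter, and the example numbers are all hypothetical):

```python
def td_advantages(rewards, values, gamma=1.0):
    """One-step TD advantages: A_t = r_t + gamma * V_{t+1} - V_t.

    rewards: per-step rewards r_0 .. r_{T-1} (e.g. from a reward predictor)
    values:  critic values V_0 .. V_T, with one extra entry for the final state
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


# Hypothetical three-step trajectory.
advantages = td_advantages([0.1, 0.0, 0.5], [0.5, 0.25, 0.5, 0.0])
```

Compared with a single end-of-trajectory reward, every step here receives its own signal, which is the credit-assignment property the abstract emphasizes.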
Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers
Authors: Yue Kang, Zhuoyi Huang, Benji Schussheim, Diana Licon, Dina Atia, Shixing Cao, Jacob Danovitch, Kunho Kim, Billy Norcilien, Jonah Karpman, Mahmound Sayed, Mike Taylor, Tao Sun, Pavel Metrikov, Vipul Agarwal, Chris Quirk, Ye-Yi Wang, Nick Craswell, Irene Shaffer, Tianwei Chen, Sulaiman Vesal, Soundar Srinivasan
First: 2026-01-06T17:48:40+00:00 · Latest: 2026-01-06T17:48:40+00:00
Abstract
In enterprise search, building high-quality datasets at scale remains a central challenge due to the difficulty of acquiring labeled data. To resolve this challenge, we propose an efficient approach to fine-tune small language models (SLMs) for accurate relevance labeling, enabling high-throughput, domain-specific labeling whose quality is comparable to, or even better than, that of state-of-the-art large language models (LLMs). To overcome the lack of high-quality and accessible datasets in the enterprise domain, our method leverages synthetic data generation. Specifically, we employ an LLM to synthesize realistic enterprise queries from a seed document, apply BM25 to retrieve hard negatives, and use a teacher LLM to assign relevance scores. The resulting dataset is then distilled into an SLM, producing a compact relevance labeler. We evaluate our approach on a high-quality benchmark consisting of 923 enterprise query-document pairs annotated by trained human annotators, and show that the distilled SLM achieves agreement with human judgments on par with or better than the teacher LLM. Furthermore, our fine-tuned labeler substantially improves throughput, achieving a 17-fold increase while also being 19 times more cost-effective. This approach enables scalable and cost-effective relevance labeling for enterprise-scale retrieval applications, supporting rapid offline evaluation and iteration in real-world settings.
中文标题/摘要
标题:微调小型语言模型作为高效的企业搜索相关性标注器
在企业搜索中,由于难以获取高质量的标注数据,大规模构建高质量数据集仍然是一个核心挑战。为解决这一挑战,我们提出了一种有效的方法,通过微调小型语言模型(SLMs)来进行准确的相关性标注,从而实现高通量、领域特定的标注,其质量和最先进的大型语言模型(LLMs)相当甚至更好。为了克服企业领域中高质量和可访问数据集的缺乏,我们的方法利用合成数据生成。具体来说,我们使用LLM从种子文档生成现实的企业查询,使用BM25检索困难的负样本,并使用教师LLM分配相关性评分。生成的数据集随后被提炼成SLM,产生一个紧凑的相关性标注器。我们在一个由923个企业查询-文档对组成、由训练有素的人标注的高质量基准上评估了我们的方法,并展示了微调后的SLM在与人类判断的一致性方面与教师LLM相当甚至更好。此外,我们的标注器显著提高了吞吐量,实现了17倍的提升,同时成本效益提高了19倍。这种方法使大规模检索应用中的相关性标注变得可扩展且成本效益高,支持在实际场景中的快速离线评估和迭代。
Summary / 总结
The research aims to address the challenge of building high-quality datasets for enterprise search by proposing a method to fine-tune small language models (SLMs) for accurate relevance labeling. The method uses an LLM to generate synthetic enterprise queries, retrieves hard negatives using BM25, and assigns relevance scores with a teacher LLM. The resulting dataset is distilled into an SLM, which is then used as a compact relevance labeler. The approach achieves agreement with human judgments comparable to or better than the teacher LLM, with a 17-fold increase in throughput and 19 times greater cost-effectiveness, enabling scalable relevance labeling for enterprise search.
研究旨在通过提出一种方法来解决企业搜索中构建高质量数据集的挑战,即通过微调小型语言模型(SLM)进行准确的相关性标注。该方法使用LLM生成合成数据,使用BM25检索困难的负样本,并使用教师LLM分配相关性评分。微调后的SLM在人类判断的协议上与教师LLM相当或更好,同时提高了17倍的吞吐量和19倍的成本效益。这种方法支持企业规模检索应用中的可扩展和成本效益高的相关性标注,支持实际环境中的快速离线评估和迭代。
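The BM25 hard-negative step mentioned in the abstract above can be sketched as follows. This is an illustrative, self-contained version that assumes whitespace tokenization, a non-empty document list, and the Lucene-style IDF variant; the function names, default parameters, and example documents are hypothetical and not the authors' implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (Lucene-style IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hard_negatives(query, docs, positive_idx, top_k=2):
    """Return indices of the top-scoring documents other than the positive."""
    scores = bm25_scores(query, docs)
    candidates = [i for i in range(len(docs)) if i != positive_idx]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]


docs = [
    "vacation policy for full time employees",  # seed (positive) document
    "expense policy for travel",
    "office lunch menu",
    "vacation photos from the team offsite",
]
negatives = hard_negatives("vacation policy", docs, positive_idx=0)
```

Documents that share query terms with the seed document but are not the seed rank highest, which is what makes them "hard" negatives for the teacher LLM to score.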
UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward
Authors: Yile Liu, Yixian Liu, Zongwei Li, Yufei Huang, Xinhua Feng, Zhichao Hu, Jinglu Hu, Jianfeng Yan, Fengzong Lian, Yuhong Liu
First: 2026-01-06T17:41:32+00:00 · Latest: 2026-01-06T17:41:32+00:00
Comments: 19 pages, 6 figures, 7 tables
Abstract
While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.
中文标题/摘要
标题:UltraLogic:通过大规模数据合成和双极浮点奖励提升LLM推理能力
尽管大型语言模型(LLMs)在自然语言处理方面展现了显著潜力,但在多步逻辑、规划和验证等复杂通用推理方面仍存在关键瓶颈。尽管可验证奖励强化学习(RLVR)在特定领域取得了成功,但该领域缺乏大规模、高质量且难度校准的数据以支持通用推理。为解决这一问题,我们提出了UltraLogic框架,该框架通过基于代码的求解方法将问题的逻辑核心与其自然语言表达分离开来,以自动化生产高质量数据。该框架包括数百种独特的任务类型和跨十个难度级别的自动化校准管道。此外,为缓解二元奖励稀疏性和非负奖励陷阱,我们引入了双极浮点奖励(BFR)机制,利用分级惩罚有效区分完美响应与逻辑错误的响应。我们的实验表明,任务多样性是推理提升的主要驱动力,而BFR与难度匹配策略相结合,显著提高了训练效率,引导模型向全局逻辑最优解发展。
Summary / 总结
UltraLogic aims to enhance LLM reasoning by addressing the bottleneck of complex multi-step reasoning through large-scale data synthesis and a Bipolar Float Reward mechanism. The framework uses a Code-based Solving methodology to automate high-quality data production for various task types and difficulty levels. Experiments show that task diversity is crucial for reasoning improvement, and the Bipolar Float Reward, along with a difficulty matching strategy, enhances training efficiency and guides models towards logical optima.
UltraLogic旨在通过解决通用推理缺乏大规模、高质量和难度校准的数据问题来增强LLM的推理能力。它采用基于代码的求解方法来自动化生产数百种独特的任务类型,并通过自动校准流水线在十个难度级别上进行校准。此外,它引入了双极浮点奖励(BFR)机制来缓解二元奖励稀疏性和非负奖励陷阱,使用分级惩罚来区分完美的响应和逻辑错误的响应。实验表明,任务多样性是推理增强的关键,而BFR与难度匹配策略结合使用可以提高训练效率,引导模型向全局逻辑最优解发展。
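The abstract above gives no formula for the Bipolar Float Reward, only that graded penalties separate perfect responses from flawed ones. The sketch below is therefore purely illustrative: it assumes a hypothetical grader that returns a correctness score in [0, 1] and maps every imperfect score to a negative value. The function name and the mapping are assumptions, not the paper's definition.

```python
def bipolar_float_reward(score):
    """Map a graded correctness score in [0, 1] to a signed reward.

    Assumed mapping (illustrative only):
      score == 1.0 -> +1.0            (perfect response)
      score <  1.0 -> score - 1.0     (graded penalty in [-1, 0))
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 1.0:
        return 1.0
    return score - 1.0


rewards = [bipolar_float_reward(s) for s in (1.0, 0.75, 0.0)]
```

Under a plain binary reward the last two responses would both receive 0; here a nearly correct response is penalized less than a fully wrong one, and neither receives a non-negative reward.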
DIP: Dynamic In-Context Planner For Diffusion Language Models
Authors: Yang Li, Han Meng, Chenan Wang, Haipeng Chen
First: 2026-01-06T17:24:16+00:00 · Latest: 2026-01-06T17:24:16+00:00
Comments: 4 pages
Abstract
Diffusion language models (DLMs) have shown strong potential for general natural language tasks with in-context examples. However, due to the bidirectional attention mechanism, DLMs incur substantial computational cost as context length increases. This work addresses this issue with a key discovery: unlike the sequential generation in autoregressive language models (ARLMs), the diffusion generation paradigm in DLMs allows efficient dynamic adjustment of the context during generation. Building on this insight, we propose Dynamic In-Context Planner (DIP), a context-optimization method that dynamically selects and inserts in-context examples during generation, rather than providing all examples in the prompt upfront. Results show DIP maintains generation quality while achieving up to 12.9× inference speedup over standard inference and 1.17× over KV cache-enhanced inference.
中文标题/摘要
标题:DIP:动态上下文规划器用于扩散语言模型
扩散语言模型(DLMs)在具有上下文示例的情况下展示了强大的通用自然语言任务潜力。然而,由于双向注意力机制,DLMs在上下文长度增加时会带来巨大的计算成本。这项工作通过一个关键发现解决了这一问题:与自回归语言模型(ARLMs)的顺序生成不同,DLMs的扩散生成范式允许在生成过程中进行\textit{高效的动态上下文调整}。基于这一洞察,我们提出了\textbf{D}动态\textbf{I}上下文\textbf{P}规划器(DIP),这是一种上下文优化方法,在生成过程中动态选择和插入上下文示例,而不是在提示中一次性提供所有示例。结果显示,DIP在保持生成质量的同时,相对于标准推理实现了高达12.9$\times$的推理加速,相对于KV缓存增强的推理实现了1.17$\times$的加速。
Summary / 总结
This work addresses the computational cost issue in diffusion language models (DLMs) as context length increases by proposing DIP, a dynamic in-context planner. DIP dynamically selects and inserts in-context examples during generation, rather than providing all examples upfront. The method achieves up to 12.9 times inference speedup over standard inference and 1.17 times over KV cache-enhanced inference while maintaining generation quality.
该研究通过提出动态上下文规划器DIP来解决扩散语言模型(DLMs)的计算成本问题。与自回归语言模型不同,DLMs在生成过程中可以高效地调整上下文。DIP在生成过程中动态选择并插入上下文示例,从而实现最高12.9倍的推理速度提升,同时保持生成质量。与KV缓存增强的推理相比,DIP实现了1.17倍的加速。
Empowering Reliable Visual-Centric Instruction Following in MLLMs
Authors: Weilei He, Feng Ju, Zhiyuan Fan, Rui Min, Minhao Cheng, Yi R. Fung
First: 2026-01-06T17:23:33+00:00 · Latest: 2026-01-06T17:23:33+00:00
Comments: Submitted to ARR Jan
Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
中文标题/摘要
标题:增强可靠视觉中心指令遵循能力的MLLMs
评估多模态大型语言模型(MLLMs)的指令遵循(IF)能力对于严格评估模型输出与用户指定意图的一致性至关重要。然而,现有的MLLMs指令遵循能力评估基准主要集中在文本模态的口头指令上。这些限制阻碍了对指令遵循能力的全面分析,因为它们忽略了嵌入在语义丰富的视觉模态中的隐式约束。为了解决这一差距,我们引入了VC-IFEval,这是一个新的基准,附带一个系统构建的数据集,用于评估MLLMs在多模态设置下的指令遵循能力。我们的基准系统地将视觉依赖性约束纳入指令设计中,使我们能够更严格和细致地评估MLLMs如何与视觉输入和文本指令对齐。此外,通过在我们的数据集上微调MLLMs,我们实现了视觉指令遵循准确性和一致性的显著提升。通过在代表性MLLMs上的广泛评估,我们提供了关于当前模型优势和局限性的新见解。
Summary / 总结
The research aims to evaluate the instruction-following capabilities of Multimodal Large Language Models (MLLMs) by introducing VC-IFEval, a new benchmark that includes a systematically constructed dataset. This benchmark evaluates MLLMs' ability to follow both visual and textual instructions, addressing the limitations of existing benchmarks that focus solely on textual instructions. By fine-tuning MLLMs on this dataset, the study achieves significant improvements in visual instruction-following accuracy and adherence, providing new insights into the strengths and limitations of current models.
研究旨在通过引入VC-IFEval这一新基准来评估Multimodal Large Language Models (MLLMs)的指令遵循能力,该基准包含一个系统构建的数据集,能够评估MLLMs在遵循视觉和文本指令方面的表现,解决了现有基准仅关注文本指令的局限性。通过在该数据集上对MLLMs进行微调,研究取得了显著的视觉指令遵循准确性和一致性提升,提供了对当前模型优缺点的新见解。
D^3ETOR: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations
Authors: Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras
First: 2025-12-23T11:16:16+00:00 · Latest: 2026-01-06T17:16:18+00:00
Abstract
Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D^3ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D^3ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.
中文标题/摘要
标题:D^3ETOR:辩论增强的伪标签生成和频率感知渐进去偏见方法在带有涂鸦注释的弱监督伪装目标检测中的应用
弱监督伪装目标检测(WSCOD)旨在定位和分割其周围场景中视觉上被隐藏的目标,仅依赖稀疏监督,如涂鸦注释。尽管取得了进展,但现有WSCOD方法仍远落后于完全监督方法,主要由于两个限制:(1)由通用分割模型(如SAM)生成并通过规则过滤的伪掩码往往不可靠,因为这些模型缺乏COD所需的特定语义理解;(2)忽视涂鸦注释中的固有注释偏差,这妨碍了模型捕捉伪装目标的全局结构。为克服这些挑战,我们提出D^3ETOR,一种两阶段WSCOD框架,包括辩论增强的伪标签生成和频率感知渐进去偏见。在第一阶段,我们引入自适应熵驱动的点采样方法和多智能体辩论机制,以增强SAM在目标检测中的能力,提高伪掩码的可解释性和精度。在第二阶段,我们设计FADeNet,通过逐步融合多级频率感知特征来平衡全局语义理解和局部细节建模,同时动态调整监督强度以缓解涂鸦偏差。通过同时利用伪掩码和涂鸦语义的监督信号,D^3ETOR显著缩小了弱监督和完全监督目标检测之间的差距,实现了多个基准上的最佳性能。
Summary / 总结
The paper addresses the challenges in Weakly-Supervised Camouflaged Object Detection (WSCOD) by proposing D^3ETOR, a two-stage framework. The first stage enhances pseudo labeling using an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to improve the reliability of pseudo masks. The second stage introduces FADeNet, which fuses multi-level frequency-aware features to balance global and local details while dynamically adjusting supervision strength to reduce annotation bias. Experiments show that D^3ETOR outperforms existing methods and achieves state-of-the-art results on multiple benchmarks.
论文提出${D}^{3}$ETOR框架,解决弱监督伪装目标检测(WSCOD)中的挑战。该框架分为两个阶段:第一阶段通过自适应熵驱动的点采样方法和多智能体辩论机制增强伪标签的可靠性;第二阶段引入FADeNet,融合多级频率感知特征以平衡全局和局部细节,同时动态调整监督强度以减少注释偏差。实验结果表明,${D}^{3}$ETOR在多个基准上超越了现有方法,达到了最先进的性能。
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Authors: Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao
First: 2026-01-06T17:15:50+00:00 · Latest: 2026-01-06T17:15:50+00:00
Abstract
While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
中文标题/摘要
标题:UniCorn:通过自我生成的监督实现自我提升的统一多模态模型
虽然统一多模态模型(UMMs)在跨模态理解方面取得了显著成功,但在利用这种内部知识进行高质量生成方面仍存在显著差距。我们将这种差异正式化为传导性失语症,这是一种现象,即模型能够准确解释多模态输入,但在将其理解转化为忠实且可控的合成方面存在困难。为了解决这一问题,我们提出了UniCorn,这是一种简单而优雅的自我提升框架,无需外部数据或教师监督。通过将单一UMM划分为三个协作角色:提案者、解决者和裁判,UniCorn通过自我博弈生成高质量的交互,并利用认知模式重构将潜在理解提炼为明确的生成信号。为了验证多模态一致性的恢复,我们引入了基于文本到图像再到文本重建循环的UniCycle循环一致性基准。广泛的实验表明,UniCorn在六个通用图像生成基准上实现了全面和显著的改进。值得注意的是,它在TIIF(73.8)、DPG(86.8)、CompBench(88.5)和UniCycle上达到了SOTA性能,并进一步在WISE上实现了+5.0的显著提升,在OneIG上实现了+6.5的显著提升。这些结果表明,我们的方法显著增强了T2I生成能力,同时保持了稳健的理解能力,展示了统一多模态智能完全自我监督细化的可扩展性。
Summary / 总结
The research aims to improve Unified Multimodal Models (UMMs) by addressing their inability to translate accurate cross-modal understanding into high-quality generation. UniCorn, a self-improvement framework that needs no external data or teacher supervision, partitions a single UMM into Proposer, Solver, and Judge roles to generate high-quality interactions through self-play. Experiments show that UniCorn improves substantially over the base model across six general image generation benchmarks, achieving state-of-the-art performance on TIIF, DPG, CompBench, and the newly introduced UniCycle benchmark, with further gains of +5.0 on WISE and +6.5 on OneIG.
UniCorn通过将单一UMM划分为提案者、解决者和裁判者三个角色,并通过自我博弈生成高质量的交互,解决了统一多模态模型在跨模态理解与高质量生成之间的差距。该方法无需外部数据或监督,提高了多模态的一致性,并在六个通用图像生成基准上达到了最先进的性能,包括TIIF、DPG和CompBench,同时增强了从文本到图像的生成能力并保持了稳健的理解能力。
AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Authors: Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
First: 2026-01-06T17:13:23+00:00 · Latest: 2026-01-06T17:13:23+00:00
Abstract
Multimodal medical large language models have shown impressive progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model explicitly designed for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning tasks compared to existing approaches. Code and the pretrained model are available at https://github.com/aneesurhashmi/anatomix
中文标题/摘要
标题:AnatomiX,一种解剖学意识的多模态大型语言模型,用于胸部X光解释
多模态医疗大型语言模型在胸部X光解释方面取得了显著进展,但仍面临空间推理和解剖理解方面的挑战。尽管现有的接地技术提高了整体性能,但它们往往无法建立真正的解剖对应关系,导致在医疗领域出现错误的解剖理解。为了解决这一差距,我们引入了AnatomiX,这是一种明确为解剖学接地的胸部X光解释设计的多任务多模态大型语言模型。受放射学工作流程的启发,AnatomiX采用两阶段方法:首先,它识别解剖结构并提取其特征,然后利用大型语言模型执行多种下游任务,如短语接地、报告生成、视觉问答和图像理解。在多个基准上的广泛实验表明,AnatomiX在解剖推理方面表现出色,与现有方法相比,在解剖接地、短语接地、基于解剖的诊断和基于解剖的描述任务上的性能提高了超过25%。代码和预训练模型可在https://github.com/aneesurhashmi/anatomix获取
Summary / 总结
AnatomiX is a multitask multimodal large language model specifically designed for anatomically grounded chest X-ray interpretation. It uses a two-stage approach to identify anatomical structures and extract their features, followed by a large language model for various downstream tasks. Experiments show that AnatomiX outperforms existing methods by more than 25% on tasks such as anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning, addressing the limitations of spatial reasoning and anatomical understanding in existing models.
AnatomiX 是一种专门用于胸部 X 光解剖导向解释的多任务多模态大型语言模型。它采用两阶段方法来识别解剖结构并提取其特征,然后通过大型语言模型执行各种下游任务。实验表明,AnatomiX 在解剖定位、短语定位、基于解剖的诊断和基于解剖的描述等任务上比现有方法高出 25% 以上,解决了现有模型在空间推理和解剖理解方面的局限性。
Leveraging the true depth of LLMs
Authors: Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret
First: 2025-02-05T00:26:27+00:00 · Latest: 2026-01-06T17:11:03+00:00
Abstract
The remarkable capabilities of Large Language Models (LLMs) are overshadowed by their immense computational cost. While recent work has shown that many LLM layers can be reordered or even removed with minimal impact on accuracy, these insights have not been translated into significant inference speedups. To bridge this gap, we introduce a novel method that restructures the computational graph by grouping and evaluating consecutive layer pairs in parallel. This approach, requiring no retraining, yields a 1.19x throughput gain on Llama 2 7B while reducing the average benchmark accuracy by only 1.5%. We demonstrate the practical value of this method for large-scale LLM deployment and show that some of the lost accuracy can be recovered with lightweight fine-tuning of the parallelized layers.
中文标题/摘要
标题:利用大型语言模型的真正深度
大型语言模型(LLMs)的卓越能力被其巨大的计算成本所掩盖。尽管最近的研究表明,许多LLM层可以重新排序甚至移除而对准确性影响甚微,但这些见解尚未转化为显著的推理速度提升。为弥合这一差距,我们提出了一种新颖的方法,通过分组并行评估连续的层对来重新结构计算图。这种方法无需重新训练,在Llama 2 7B上实现了1.19倍的吞吐量提升,同时平均基准准确性仅下降1.5%。我们展示了此方法在大规模LLM部署中的实用价值,并表明可以通过轻量级的并行层微调来恢复部分丢失的准确性。
Summary / 总结
The research aims to enhance the efficiency of Large Language Models (LLMs) without significantly compromising their accuracy. The method involves restructuring the computational graph by evaluating consecutive layer pairs in parallel, which results in a 1.19x throughput gain on Llama 2 7B with only a 1.5% reduction in average benchmark accuracy. Some of the lost accuracy can be recovered through lightweight fine-tuning of the parallelized layers.
研究旨在提高大型语言模型(LLMs)的效率,同时不显著牺牲其准确性。方法是通过并行评估连续的层对来重新结构计算图,不需要重新训练。这种方法在Llama 2 7B上实现了1.19倍的吞吐量提升,平均基准准确率仅降低了1.5%。此外,通过轻量级微调并行化层,可以恢复部分丢失的准确性。
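The idea of evaluating a consecutive layer pair in parallel can be illustrated on toy residual blocks. The sketch below assumes each layer has the residual form x + f(x) and that the parallel version feeds the same input to both layers and sums their outputs; the abstract above does not spell out the exact formulation, so the function names and this particular approximation are assumptions.

```python
def sequential_pair(x, f1, f2):
    """Standard execution: the second layer sees the first layer's output."""
    h = [xi + yi for xi, yi in zip(x, f1(x))]
    return [hi + yi for hi, yi in zip(h, f2(h))]


def parallel_pair(x, f1, f2):
    """Approximate execution: both layers read the same input x.

    f1(x) and f2(x) no longer depend on each other, so they could be
    computed concurrently (for example on separate devices).
    """
    y1, y2 = f1(x), f2(x)
    return [xi + a + b for xi, a, b in zip(x, y1, y2)]


# Toy "layers": small linear maps standing in for transformer blocks.
def layer_a(v):
    return [0.1 * vi for vi in v]


def layer_b(v):
    return [0.2 * vi for vi in v]


x = [1.0, 2.0]
seq_out = sequential_pair(x, layer_a, layer_b)
par_out = parallel_pair(x, layer_a, layer_b)
```

In this toy case the two outputs differ only by the cross term f2 applied to f1(x), which is small when each layer makes a small update to the residual stream. That is one plausible reading of why the reported accuracy drop is modest.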
Decentralized Autoregressive Generation
Authors: Stepan Maschan, Haoxuan Qu, Jun Liu
First: 2026-01-06T17:07:27+00:00 · Latest: 2026-01-06T17:07:27+00:00
Comments: Work in progress
Abstract
We present a theoretical analysis of decentralization of autoregressive generation. We define the Decentralized Discrete Flow Matching objective by expressing the probability-generating velocity as a linear combination of expert flows. We also conduct experiments demonstrating the equivalence between decentralized and centralized training settings for multimodal language models across a diverse set of benchmarks. Specifically, we compare two distinct paradigms: LLaVA and InternVL 2.5-1B, which uses a fixed CLIP vision encoder and performs full-parameter fine-tuning (ViT+MLP+LLM) during the instruction tuning stage.
中文标题/摘要
标题:去中心化自回归生成
我们对自回归生成的去中心化进行了理论分析。我们定义了去中心化离散流匹配目标,通过将概率生成速度表示为专家流的线性组合。我们还进行了实验,展示了去中心化和中心化训练设置在多种基准测试中的等效性,特别是针对多模态语言模型。具体来说,我们比较了两种不同的范式:LLaVA 和 InternVL 2.5-1B,后者使用固定 CLIP 视觉编码器,并在指令调优阶段进行全参数微调(ViT+MLP+LLM)。
Summary / 总结
The paper analyzes the decentralization of autoregressive generation, introducing the Decentralized Discrete Flow Matching objective. It compares decentralized and centralized training for multimodal language models using LLaVA and InternVL 2.5-1B, showing that the two training settings are equivalent across various benchmarks, with InternVL 2.5-1B involving a fixed CLIP vision encoder and full-parameter fine-tuning during instruction tuning.
论文分析了自回归生成的去中心化问题,提出了去中心化离散流匹配目标。研究使用LLaVA和InternVL 2.5-1B进行对比,后者在指令调优阶段使用固定CLIP视觉编码器并进行全参数微调,结果显示两种训练方式在多种基准测试中是等效的。
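The phrase "linear combination of expert flows" in the abstract above can be written schematically as follows. The notation is assumed for illustration (K experts, expert velocities u_t^{(k)}, mixing weights w_k); the abstract does not specify the form of the weights or any constraints on them.

```latex
% Assumed notation: K experts, each with its own velocity field u_t^{(k)},
% combined through mixing weights w_k whose exact form is not given in the abstract.
u_t(x) = \sum_{k=1}^{K} w_k(x, t)\, u_t^{(k)}(x)
```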
DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation
Authors: Jiajun Jiao, Haowei Zhu, Puyuan Yang, Jianghui Wang, Ji Liu, Ziqiong Liu, Dong Li, Yuejian Fang, Junhai Yong, Bin Wang, Emad Barsoum
Venue: AAAI 2026
First: 2026-01-06T16:55:55+00:00 · Latest: 2026-01-06T16:55:55+00:00
Comments: Accepted to AAAI 2026
Abstract
Diffusion models have achieved remarkable success in image and video generation. However, their inherently multi-step inference process imposes substantial computational overhead, hindering real-world deployment. Accelerating diffusion models is therefore essential, yet determining how to combine multiple model acceleration techniques remains a significant challenge. To address this issue, we introduce a framework driven by large language models (LLMs) for automated acceleration code generation and evaluation. First, we present DiffBench, a comprehensive benchmark that implements a three-stage automated evaluation pipeline across diverse diffusion architectures, optimization combinations, and deployment scenarios. Second, we propose DiffAgent, an agent that generates optimal acceleration strategies and code for arbitrary diffusion models. DiffAgent employs a closed-loop workflow in which a planning component and a debugging component iteratively refine the output of a code generation component, while a genetic algorithm extracts performance feedback from the execution environment to guide subsequent code refinements. We provide a detailed explanation of the DiffBench construction and the design principles underlying DiffAgent. Extensive experiments show that DiffBench offers a thorough evaluation of generated code and that DiffAgent significantly outperforms existing LLMs in producing effective diffusion acceleration strategies.
中文标题/摘要
标题:DiffBench 遇上 DiffAgent:端到端由大规模语言模型驱动的扩散模型加速代码生成
扩散模型在图像和视频生成方面取得了显著的成功。然而,它们固有的多步推理过程带来了巨大的计算开销,阻碍了实际部署。因此,加速扩散模型变得至关重要,但如何结合多种模型加速技术仍然是一个重大挑战。为了解决这个问题,我们提出了一种由大规模语言模型(LLMs)驱动的自动化加速代码生成和评估框架。首先,我们介绍了DiffBench,这是一个全面的基准,实现了跨多种扩散架构、优化组合和部署场景的三阶段自动化评估管道。其次,我们提出了DiffAgent,这是一种生成任意扩散模型最优加速策略和代码的代理。DiffAgent 采用了一个闭环工作流,在该工作流中,规划组件和调试组件迭代优化代码生成组件的输出,同时遗传算法从执行环境中提取性能反馈,以指导后续代码的优化。我们详细解释了DiffBench的构建和DiffAgent的设计原则。广泛的实验表明,DiffBench 提供了对生成代码的全面评估,而DiffAgent 显著优于现有的LLMs,能够生成有效的扩散加速策略。
Summary / 总结
The paper introduces DiffBench and DiffAgent to address the computational overhead of diffusion models. DiffBench is a comprehensive benchmark that evaluates diffusion models across various architectures and scenarios, while DiffAgent generates optimal acceleration strategies through a closed-loop workflow involving planning, debugging, and genetic algorithm feedback. Experiments demonstrate that DiffAgent outperforms existing LLMs in generating effective acceleration strategies for diffusion models.
论文通过引入DiffBench和DiffAgent来解决扩散模型的计算开销问题。DiffBench是一个全面的基准,用于在各种架构和部署场景下评估扩散模型,而DiffAgent使用包含规划、调试和遗传算法反馈的闭环工作流生成最优加速策略和代码。实验表明,DiffBench能够对生成的代码进行全面评估,而DiffAgent在生成有效的加速策略方面优于现有LLM。
Predicting Time Pressure of Powered Two-Wheeler Riders for Proactive Safety Interventions
Authors: Sumit S. Shevtekar, Chandresh K. Maurya, Gourab Sil, Subasish Das
First: 2026-01-06T16:52:09+00:00 · Latest: 2026-01-06T16:52:09+00:00
Comments: 13 pages, 8 figures
Abstract
Time pressure critically influences risky maneuvers and crash proneness among powered two-wheeler riders, yet its prediction remains underexplored in intelligent transportation systems. We present a large-scale dataset of 129,000+ labeled multivariate time-series sequences from 153 rides by 51 participants under No, Low, and High Time Pressure conditions. Each sequence captures 63 features spanning vehicle kinematics, control inputs, behavioral violations, and environmental context. Our empirical analysis shows High Time Pressure induces 48% higher speeds, 36.4% greater speed variability, 58% more risky turns at intersections, 36% more sudden braking, and 50% higher rear brake forces versus No Time Pressure. To benchmark this dataset, we propose MotoTimePressure, a deep learning model combining convolutional preprocessing, dual-stage temporal attention, and Squeeze-and-Excitation feature recalibration, achieving 91.53% accuracy and 98.93% ROC AUC, outperforming eight baselines. Since time pressure cannot be directly measured in real time, we demonstrate its utility in collision prediction and threshold determination. Using MTPS-predicted time pressure as features improves Informer-based collision risk accuracy from 91.25% to 93.51%, approaching oracle performance (93.72%). Thresholded time pressure states capture rider cognitive stress and enable proactive ITS interventions, including adaptive alerts, haptic feedback, V2I signaling, and speed guidance, supporting safer two-wheeler mobility under the Safe System Approach.
中文标题/摘要
标题:基于动力两轮车骑乘者时间压力预测的主动安全干预
时间压力对动力两轮车骑乘者进行危险操作和事故倾向性影响巨大,但在智能交通系统中的预测研究仍相对不足。我们提供了一个包含129,000多个标记的多变量时间序列数据集,来自51名参与者在无时间压力、低时间压力和高时间压力条件下进行的153次骑行。每个序列包含63个特征,涵盖车辆动力学、控制输入、行为违规和环境背景。我们的实证分析表明,高时间压力导致48%的更高速度、36.4%的速度变化率增加、58%的更多危险交叉口转弯、36%的更多紧急制动以及50%更高的后制动力量,相较于无时间压力情况。为了评估该数据集,我们提出了一种结合卷积预处理、双阶段时间注意力和Squeeze-and-Excitation特征重校准的深度学习模型MotoTimePressure,其准确率为91.53%,ROC AUC为98.93%,优于八个基线模型。由于时间压力无法实时直接测量,我们展示了其在碰撞预测和阈值确定中的应用价值。使用MTPS预测的时间压力作为特征,基于Informer的碰撞风险准确性从91.25%提高到93.51%,接近于理想性能(93.72%)。阈值化的时间压力状态捕捉骑乘者的认知压力,支持主动ITS干预,包括自适应警报、触觉反馈、V2I信号和速度指导,以支持更安全的两轮车移动,符合安全系统方法。
Summary / 总结
This study addresses the critical impact of time pressure on powered two-wheeler riders' behavior and safety. It presents a large dataset of 129,000+ labeled sequences from 153 rides, capturing 63 features including vehicle kinematics and environmental context. The empirical analysis reveals that high time pressure leads to higher speeds, greater speed variability, more risky turns, and more sudden braking. A deep learning model, MotoTimePressure, is proposed, achieving 91.53% accuracy and 98.93% ROC AUC, outperforming eight baselines. The model's utility is demonstrated in collision prediction and proactive safety interventions, improving collision risk accuracy and enabling adaptive alerts and haptic feedback for safer mobility.
该研究关注动力两轮车骑行者时间压力的预测问题,这会影响他们的碰撞风险。使用了包含129,000+条标签序列的数据集,来自51名参与者153次骑行,捕捉了63个特征。关键发现表明,高时间压力会使速度增加48%,速度变化增加36.4%,在交叉口进行危险转弯增加58%,突然刹车增加36%,后刹车力增加50%,与无时间压力相比。提出的MotoTimePressure模型是一种深度学习方法,实现了91.53%的准确率和98.93%的ROC AUC,优于八个基线。该模型用于改进碰撞风险预测,并实现主动安全干预,如适应性警报和V2I信号等。
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Authors: Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang
Venue: NeurIPS 2025
First: 2025-05-18T11:08:32+00:00 · Latest: 2026-01-06T16:40:19+00:00
Comments: Accepted to NeurIPS 2025
Abstract
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.
中文标题/摘要
标题:DisCO:用区分性约束优化强化大型推理模型
最近,DeepSeek-R1的成功和开放性引起了对组相对策略优化(GRPO)作为大型推理模型(LRMs)的强化学习方法的广泛关注。本文分析了在二元奖励设置下GRPO的目标,并揭示了问题级别难度偏差的固有限制。我们还发现GRPO与监督学习中的传统区分性方法之间存在联系。基于这些见解,我们提出了一个新的区分性约束优化(DisCO)框架,以强化LRMs,该框架基于区分性学习的原则。DisCO与GRPO及其最近变体的主要区别在于:(1) 用由评分函数定义的区分性目标取代了组相对目标;(2) 放弃基于剪裁的替代目标,转而使用非剪裁的RL替代目标作为评分函数;(3) 采用简单的有效约束优化方法来强制执行KL散度约束。结果,DisCO在GRPO及其变体上提供了显著的优势:(i) 通过采用区分性目标完全消除了难度偏差;(ii) 通过使用非剪裁评分函数和约束优化方法解决了GRPO及其变体中的熵不稳定性,从而获得长期稳定的训练动态;(iii) 允许整合先进的区分性学习技术来解决数据不平衡问题,在训练过程中,大量问题的生成答案中负回答多于正回答。我们在增强SFT微调模型的数学推理能力方面的实验表明,DisCO显著优于GRPO及其改进变体DAPO,在1.5B模型的六个基准任务中,平均分别比GRPO和DAPO高出7%和6%。
Summary / 总结
This work addresses the limitations of Group Relative Policy Optimization (GRPO) in reinforcing large reasoning models (LRMs) by introducing a new framework called Discriminative Constrained Optimization (DisCO). DisCO replaces the group relative objective with a discriminative objective and uses non-clipping RL surrogate objectives, which helps eliminate difficulty bias and stabilize training dynamics. Experiments show that DisCO outperforms GRPO and its variants, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for a 1.5B model.
该研究通过引入新的Discriminative Constrained Optimization (DisCO)框架来解决Group Relative Policy Optimization (GRPO)在大型推理模型中的局限性。DisCO用判别性目标替换群组相对目标,并使用非剪辑的RL替代目标,从而减少难度偏差并改善训练动态。实验表明,DisCO在六个基准任务上优于GRPO及其改进版本DAPO,1.5B模型的平均提升分别为7%和6%。
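To make the difficulty bias mentioned in the DisCO abstract more concrete, the sketch below computes group-normalised advantages under a binary reward. It assumes the commonly cited GRPO form (reward minus group mean, divided by group standard deviation); the population standard deviation, the `eps` term, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalised advantages: (r - mean) / (std + eps).

    Assumption: population standard deviation over the sampled group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

def advantage_summary(num_correct, group_size=8):
    """Return (advantage of a correct answer, advantage of a wrong answer)
    for a question where num_correct of group_size samples earn reward 1."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    adv = grpo_advantages(rewards)
    return adv[0], adv[-1]

if __name__ == "__main__":
    for k in (1, 4, 7, 8):
        pos, neg = advantage_summary(k)
        print(f"{k}/8 correct: positive={pos:+.3f}, negative={neg:+.3f}")
```

Under these assumptions the size of the update signal depends on the fraction of correct answers for each question, and a question whose samples are all correct (or all wrong) contributes zero advantage. This is one way to read the question-level bias the abstract refers to.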
Dynamic Hyperparameter Importance for Efficient Multi-Objective Optimization
Authors: Daphne Theodorakopoulos, Marcel Wever, Marius Lindauer
Venue: IJCAI 2026
First: 2026-01-06T16:37:44+00:00 · Latest: 2026-01-06T16:37:44+00:00
Comments: Submitted to IJCAI 2026
Abstract
Choosing a suitable ML model is a complex task that can depend on several objectives, e.g., accuracy, model size, fairness, inference time, or energy consumption. In practice, this requires trading off multiple, often competing, objectives through multi-objective optimization (MOO). However, existing MOO methods typically treat all hyperparameters as equally important, overlooking that hyperparameter importance (HPI) can vary significantly depending on the trade-off between objectives. We propose a novel dynamic optimization approach that prioritizes the most influential hyperparameters based on varying objective trade-offs during the search process, which accelerates empirical convergence and leads to better solutions. Building on prior work on HPI for MOO post-analysis, we now integrate HPI, calculated with HyperSHAP, into the optimization. For this, we leverage the objective weightings naturally produced by the MOO algorithm ParEGO and adapt the configuration space by fixing the unimportant hyperparameters, allowing the search to focus on the important ones. Eventually, we validate our method with diverse tasks from PyMOO and YAHPO-Gym. Empirical results demonstrate improvements in convergence speed and Pareto front quality compared to baselines.
中文标题/摘要
标题:动态超参数重要性在高效多目标优化中的应用
选择合适的机器学习模型是一个复杂的任务,可能依赖于多个目标,例如准确性、模型大小、公平性、推理时间和能耗。实践中,这需要通过多目标优化(MOO)在多个常常相互竞争的目标之间进行权衡。然而,现有的MOO方法通常将所有超参数视为同等重要,忽视了超参数重要性(HPI)在不同目标权衡下的显著差异。我们提出了一种新颖的动态优化方法,在搜索过程中根据目标权衡的变化优先考虑最具影响力的超参数,从而加速实证收敛并获得更好的解决方案。基于先前关于MOO后分析中HPI的工作,我们现在将使用HyperSHAP计算的HPI整合到优化中。为此,我们利用ParEGO等MOO算法自然产生的目标权重,并通过固定不重要的超参数来适应配置空间,使搜索能够专注于重要的超参数。最终,我们使用来自PyMOO和YAHPO-Gym的多种任务验证了我们的方法。实证结果表明,与基线相比,我们的方法在收敛速度和帕累托前沿质量方面有所改进。
Summary / 总结
The paper addresses the challenge of selecting appropriate machine learning models by optimizing multiple objectives such as accuracy and inference time. It introduces a dynamic optimization approach that adjusts the importance of hyperparameters based on the changing trade-offs between objectives, leading to faster convergence and better solutions. The method uses HyperSHAP to calculate hyperparameter importance and integrates it with the ParEGO algorithm to focus the search on the most influential hyperparameters, as demonstrated through experiments on various tasks from PyMOO and YAHPO-Gym.
论文针对选择合适的机器学习模型以优化多个目标(如准确性和推理时间)的挑战。它提出了一种动态的超参数重要性(HPI)方法,在优化过程中根据目标之间的变化调整重要性,从而加快收敛速度并获得更好的解决方案。该方法通过固定不重要的超参数,将HPI与优化过程集成,使搜索能够专注于最重要的超参数。实验结果表明,与传统方法相比,该方法在收敛速度和帕累托前沿质量上有所改进。
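The two mechanisms named in the abstract above can be sketched roughly as follows: a ParEGO-style scalarisation that turns several objectives into one value using a sampled weight vector, and a step that pins unimportant hyperparameters so the search covers only the important ones. The weight sampling scheme, `rho=0.05`, the `keep` cut-off, and the importance scores are illustrative assumptions; in particular, the scores here are plain numbers and do not come from HyperSHAP.

```python
import random

def sample_weights(num_objectives, rng):
    """Draw a random weight vector on the simplex (non-negative, sums to 1).

    Assumption: continuous sampling; ParEGO as originally described draws
    from a fixed set of evenly spaced weight vectors instead.
    """
    raw = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(raw)
    return [w / total for w in raw]

def augmented_tchebycheff(objectives, weights, rho=0.05):
    """Scalarise objectives that are minimised and already scaled to [0, 1]."""
    weighted = [w * f for w, f in zip(weights, objectives)]
    return max(weighted) + rho * sum(weighted)

def restrict_space(space, importance, defaults, keep=2):
    """Keep the `keep` most important hyperparameters searchable and
    replace every other range with its fixed default value."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    active = set(ranked[:keep])
    return {
        name: (bounds if name in active else defaults[name])
        for name, bounds in space.items()
    }

if __name__ == "__main__":
    rng = random.Random(0)
    weights = sample_weights(2, rng)
    print(weights, augmented_tchebycheff([0.2, 0.8], weights))
    space = {"lr": (1e-4, 1e-1), "depth": (1, 10), "dropout": (0.0, 0.5)}
    scores = {"lr": 0.6, "depth": 0.3, "dropout": 0.1}  # hypothetical HPI
    defaults = {"lr": 0.01, "depth": 4, "dropout": 0.1}
    print(restrict_space(space, scores, defaults))
```

Because the weight vector changes between iterations, the importance ranking would be recomputed for each new weighting, which is what makes the restriction dynamic rather than a one-off pruning step.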
On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
Authors: Shuai Jiang, Alexey Voronin, Eric Cyr, Ben Southworth
First: 2026-01-06T16:35:07+00:00 · Latest: 2026-01-06T16:35:07+00:00
Comments: 21 pages, 13 figures
Abstract
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
中文标题/摘要
标题:预条件梯度下降向丰富学习阶段收敛的行为研究
光谱偏差,神经网络倾向于先学习低频信号的倾向,既是福也是祸。虽然它通过抑制高频噪声增强了泛化能力,但在需要捕捉细小结构的科学任务中却是一个限制。已知的延迟泛化现象“悟解”是神经网络快速训练的另一个障碍。悟解被假定是学习从NTK过渡到特征丰富阶段时产生的。本文探讨了预条件梯度下降(PGD),如高斯-牛顿法,对光谱偏差和悟解现象的影响。我们通过理论和实验证据展示了PGD如何缓解与光谱偏差相关的问题。此外,基于丰富的学习阶段悟解假说,我们研究了如何使用PGD来减少与悟解相关的延迟。我们的猜想是,PGD在没有光谱偏差阻碍的情况下,能够在NTK阶段均匀探索参数空间。我们的实验结果证实了这一预测,提供了强有力的证据,表明悟解代表了由NTK特征描述的懒惰阶段和丰富阶段之间的过渡行为。这些发现加深了我们对优化动力学、光谱偏差和神经网络学习阶段之间相互作用的理解。
Summary / 总结
This paper investigates the impact of preconditioned gradient descent (PGD) on spectral bias and grokking phenomena in neural networks. The authors demonstrate through theoretical and empirical results that PGD can mitigate issues associated with spectral bias, enabling uniform exploration of the parameter space in the NTK regime. The study confirms that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime, providing strong evidence for the rich learning regime grokking hypothesis.
该研究探讨了预条件梯度下降(PGD),如高斯-牛顿,对神经网络中频谱偏置和“悟解”现象的影响。研究表明,PGD可以缓解频谱偏置,使参数空间在近似核(NTK)阶段实现均匀探索。实验结果证实,“悟解”现象是近似核阶段和丰富学习阶段之间的过渡行为,为优化动力学与神经网络学习阶段之间的相互作用提供了见解。
Rapid Augmentations for Time Series (RATS): A High-Performance Library for Time Series Augmentation
Authors: Wadie Skaf, Felix Kern, Aryamaan Basu Roy, Tejas Pradhan, Roman Kalkreuth, Holger Hoos
First: 2026-01-06T16:33:51+00:00 · Latest: 2026-01-06T16:33:51+00:00
Abstract
Time series augmentation is critical for training robust deep learning models, particularly in domains where labelled data is scarce and expensive to obtain. However, existing augmentation libraries for time series, mainly written in Python, suffer from performance bottlenecks, where running time grows exponentially as dataset sizes increase, which limits their applicability in large-scale, production-grade systems. We introduce RATS (Rapid Augmentations for Time Series), a high-performance library for time series augmentation written in Rust with Python bindings (RATSpy). RATS implements multiple augmentation methods spanning basic transformations, frequency-domain operations and time warping techniques, all accessible through a unified pipeline interface with built-in parallelisation. Comprehensive benchmarking of RATSpy versus a commonly used library (tsaug) on 143 datasets demonstrates that RATSpy achieves an average speedup of 74.5% over tsaug (up to 94.8% on large datasets), with up to 47.9% less peak memory usage.
中文标题/摘要
标题:Rapid Augmentations for Time Series (RATS): 一种高性能的时间序列扩增库
时间序列扩增对于训练稳健的深度学习模型至关重要,特别是在标注数据稀缺且获取成本高昂的领域。然而,现有的时间序列扩增库主要用Python编写,存在性能瓶颈,运行时间随着数据集规模的增加呈指数增长——这是限制其在大规模、生产级系统中应用的一个方面。我们引入了RATS(Rapid Augmentations for Time Series),这是一种用Rust编写的高性能时间序列扩增库,带有Python绑定(RATSpy)。RATS实现了多种扩增方法,包括基本变换、频域操作和时间扭曲技术,所有方法都通过一个统一的管道接口访问,并内置了并行化功能。对143个数据集进行的全面基准测试表明,与常用库(tsaug)相比,RATSpy在平均速度上提高了74.5%,在大型数据集上最多提高了94.8%,并且峰值内存使用量最多减少了47.9%。
Summary / 总结
RATS (Rapid Augmentations for Time Series) is a high-performance library for time series augmentation written in Rust, with Python bindings (RATSpy), designed to address the performance limitations of existing Python-based libraries. It supports various augmentation methods including basic transformations, frequency-domain operations, and time warping, and provides a unified pipeline interface with built-in parallelization. Benchmarks on 143 datasets show that RATSpy outperforms tsaug, with an average speedup of 74.5% (up to 94.8% on large datasets) and up to 47.9% less peak memory usage.
RATS(快速时间序列增强)是一个用Rust编写的高性能时间序列增强库,带有Python绑定(RATSpy),旨在解决现有基于Python的库的性能限制。它支持基本变换、频域操作和时间扭曲等多种增强方法,并提供了一个具有内置并行化的统一管道接口。在143个数据集上的基准测试显示,RATSpy的性能优于tsaug,平均加速74.5%(大型数据集上最高可达94.8%),峰值内存使用量最多减少47.9%。
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-01-06T16:25:52+00:00
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative Question-Answering pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. To facilitate further research, we publicly release MSU-Bench and all associated resources.
中文标题/摘要
标题:音乐谱理解基准:评估大型语言模型对完整音乐谱的理解能力
理解完整的音乐谱需要综合推理音高、节奏、和声以及大尺度结构,然而大型语言模型和视觉-语言模型对完整音乐记谱符号的解释能力仍缺乏充分的考察。我们引入了音乐谱理解基准(MSU-Bench),这是首个大规模、人工策展的跨文本(ABC 符号)和视觉(PDF)模态的乐谱级音乐理解基准。MSU-Bench 包含来自巴赫、贝多芬、肖邦、德彪西等作曲家的1,800个生成性问答对,按难度分为四个级别,从起始信息到织体和结构。超过十五个最先进的模型在零样本和微调设置下的评估揭示了模态间的显著差距、不稳定级别的表现以及多级正确性的挑战。微调在各模态中显著提高了结果,同时保留了通用知识,使MSU-Bench 成为未来多模态推理研究的坚实基础。为了促进进一步研究,我们公开发布了MSU-Bench及其所有相关资源。
Summary / 总结
The research aims to evaluate Large Language Models and Vision-Language Models in understanding complete musical scores, which involve complex reasoning over pitch, rhythm, harmony, and structure. The Musical Score Understanding Benchmark (MSU-Bench) was introduced, comprising 1,800 generative Question-Answering pairs from famous composers. Evaluations of over fifteen state-of-the-art models showed significant modality gaps and unstable performance across different levels of difficulty. Fine-tuning models improved performance across modalities while retaining general knowledge, highlighting the benchmark's utility for future research in multimodal reasoning.
研究旨在评估大型语言模型和视觉-语言模型在理解完整乐谱方面的能力,这些乐谱涉及对音高、节奏、和声和结构的复杂推理。引入了乐谱理解基准(MSU-Bench),包含来自著名作曲家的1,800个生成性问答对。对超过十五种最先进的模型的评估显示了显著的模态差距和不同难度级别的不稳定性能。微调模型在各个模态中提高了性能,同时保留了通用知识,突显了基准对未来多模态推理研究的实用性。
PersonaLedger: Generating Realistic Financial Transactions with Persona Conditioned LLMs and Rule Grounded Feedback
Authors: Dehao Yuan, Tyler Farnan, Stefan Tesliuc, Doron L Bergman, Yulun Wu, Xiaoyu Liu, Minghui Liu, James Montgomery, Nam H Nguyen, C. Bayan Bruss, Furong Huang
First: 2026-01-06T16:18:59+00:00 · Latest: 2026-01-06T16:18:59+00:00
Abstract
Strict privacy regulations limit access to real transaction data, slowing open research in financial AI. Synthetic data can bridge this gap, but existing generators do not jointly achieve behavioral diversity and logical groundedness. Rule-driven simulators rely on hand-crafted workflows and shallow stochasticity, which miss the richness of human behavior. Learning-based generators such as GANs capture correlations yet often violate hard financial constraints and still require training on private data. We introduce PersonaLedger, a generation engine that uses a large language model conditioned on rich user personas to produce diverse transaction streams, coupled with an expert configurable programmatic engine that maintains correctness. The LLM and engine interact in a closed loop: after each event, the engine updates the user state, enforces financial rules, and returns a context aware "nextprompt" that guides the LLM toward feasible next actions. With this engine, we create a public dataset of 30 million transactions from 23,000 users and a benchmark suite with two tasks, illiquidity classification and identity theft segmentation. PersonaLedger offers a realistic, privacy preserving resource that supports rigorous evaluation of forecasting and anomaly detection models. PersonaLedger offers the community a rich, realistic, and privacy preserving resource -- complete with code, rules, and generation logs -- to accelerate innovation in financial AI and enable rigorous, reproducible evaluation.
中文标题/摘要
标题:PersonaLedger:使用角色条件化的大语言模型和规则导向反馈生成现实的金融交易
严格的隐私法规限制了对真实交易数据的访问,减缓了金融AI的开放研究。合成数据可以弥补这一缺口,但现有的生成器无法同时实现行为多样性和逻辑一致性。基于规则的模拟器依赖于手工设计的工作流程和浅层的随机性,这未能捕捉到人类行为的丰富性。基于学习的生成器如生成对抗网络能够捕捉到相关性,但往往违反了严格的金融约束,并仍然需要在私人数据上进行训练。我们引入了PersonaLedger,这是一种使用大型语言模型根据丰富用户角色进行条件化以生成多样交易流的生成引擎,并结合了一个专家可配置的程序化引擎以保持正确性。LLM和引擎在一个闭环中交互:在每次事件后,引擎更新用户状态,执行金融规则,并返回一个上下文感知的“下一个提示”,以引导LLM向可行的下一步行动。借助此引擎,我们创建了一个包含3000万笔交易和23000名用户的公共数据集,并提供了一个基准套件,包含两项任务:非流动性分类和身份盗窃分割。PersonaLedger提供了一个现实、隐私保护的资源,支持对预测和异常检测模型进行严格的评估。PersonaLedger为社区提供了一个丰富、现实且隐私保护的资源——包括代码、规则和生成日志——以加速金融AI的创新并实现严格的、可重复的评估。
Summary / 总结
PersonaLedger is designed to generate realistic financial transactions by combining a large language model conditioned on user personas with a rule-grounded feedback mechanism. This approach ensures both behavioral diversity and logical correctness, overcoming limitations of existing generators. Key experimental findings include the creation of a public dataset of 30 million transactions from 23,000 users, which supports the evaluation of forecasting and anomaly detection models in financial AI. The system offers a privacy-preserving resource for the community to accelerate research in this field.
PersonaLedger 通过使用大型语言模型结合详细用户人设,并结合规则导向的反馈机制,生成现实的金融交易。系统通过语言模型和可编程引擎之间的闭环交互,更新用户状态并确保财务规则的正确性。实验结果包括生成了来自23,000个用户的3000万笔交易,并创建了一个包含非流动性分类和身份盗窃分割等任务的基准套件,支持金融AI模型的严格评估。
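The closed loop described in the PersonaLedger abstract (the LLM proposes an event, the engine checks rules, updates state, and returns a next prompt) can be sketched as below. Everything here is a hypothetical stand-in: `stub_llm` replaces the persona-conditioned language model with random proposals, the only rule enforced is "no purchase larger than the balance", and the prompt strings and starting balance are invented for illustration.

```python
import random
from dataclasses import dataclass, field

@dataclass
class UserState:
    balance: float
    history: list = field(default_factory=list)

def stub_llm(persona, next_prompt, rng):
    """Stand-in for the LLM: proposes one transaction at random.
    A real model would condition on both persona and next_prompt."""
    kind = rng.choice(["purchase", "deposit"])
    amount = round(rng.uniform(5, 400), 2)
    return {"kind": kind, "amount": amount}

def apply_rules(state, event):
    """Rule engine: reject infeasible purchases, otherwise update the
    state. Returns (accepted, next_prompt)."""
    if event["kind"] == "purchase":
        if event["amount"] > state.balance:
            return False, (
                f"Purchase of {event['amount']} rejected: balance is "
                f"{state.balance:.2f}. Propose a feasible action."
            )
        state.balance -= event["amount"]
    else:
        state.balance += event["amount"]
    state.history.append(event)
    return True, f"Balance is now {state.balance:.2f}. Propose the next action."

def simulate(persona, steps, seed=0):
    """Run the propose/check/update loop for a fixed number of steps."""
    rng = random.Random(seed)
    state = UserState(balance=500.0)
    prompt = "Start of month. Propose the first action."
    for _ in range(steps):
        event = stub_llm(persona, prompt, rng)
        _, prompt = apply_rules(state, event)
    return state

if __name__ == "__main__":
    final = simulate("graduate student on a stipend", steps=20)
    print(len(final.history), round(final.balance, 2))
```

The point of the split is that correctness lives entirely in `apply_rules`: whatever the generator proposes, only rule-consistent events reach the ledger.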
CSAI: Conditional Self-Attention Imputation for Healthcare Time-series
Authors: Linglong Qian, Joseph Arul Raj, Hugh Logan Ellis, Ao Zhang, Yuezhou Zhang, Tao Wang, Richard JB Dobson, Zina Ibrahim
First: 2023-12-27T20:42:40+00:00 · Latest: 2026-01-06T16:16:41+00:00
Abstract
We introduce the Conditional Self-Attention Imputation (CSAI) model, a novel recurrent neural network architecture designed to address the challenges of complex missing data patterns in multivariate time series derived from hospital electronic health records (EHRs). CSAI extends state-of-the-art neural network-based imputation by introducing key modifications specific to EHR data: a) attention-based hidden state initialisation to capture both long- and short-range temporal dependencies prevalent in EHRs, b) domain-informed temporal decay to mimic clinical data recording patterns, and c) a non-uniform masking strategy that models non-random missingness by calibrating weights according to both temporal and cross-sectional data characteristics. Comprehensive evaluation across four EHR benchmark datasets demonstrates CSAI's effectiveness compared to state-of-the-art architectures in data restoration and downstream tasks. CSAI is integrated into PyPOTS, an open-source Python toolbox designed for machine learning tasks on partially observed time series. This work significantly advances the state of neural network imputation applied to EHRs by more closely aligning algorithmic imputation with clinical realities.
中文标题/摘要
标题:CSAI:基于条件自注意力插补的医疗健康时间序列
我们介绍了条件自注意力插补(CSAI)模型,这是一种新颖的循环神经网络架构,旨在解决医院电子健康记录(EHRs)中多变量时间序列复杂缺失数据模式的挑战。CSAI通过引入特定于EHR数据的关键修改扩展了最先进的基于神经网络的插补方法:a) 基于注意力的隐藏状态初始化,以捕获EHRs中普遍存在的长短期时序依赖关系,b) 领域导向的时间衰减,以模拟临床数据记录模式,c) 非均匀遮罩策略,通过根据时间和横截面数据特征校准权重来建模非随机缺失性。在四个EHR基准数据集上的全面评估表明,与最先进的架构相比,CSAI在数据恢复和下游任务中更为有效。CSAI被集成到PyPOTS中,这是一个用于部分观测时间序列机器学习任务的开源Python工具箱。这项工作通过更紧密地将算法插补与临床现实相结合,显著推进了应用于EHRs的神经网络插补状态。
Summary / 总结
The research introduces CSAI, a novel recurrent neural network designed to handle complex missing data in healthcare time series from EHRs. It enhances existing neural network-based imputation methods by incorporating attention-based hidden state initialization, domain-informed temporal decay, and a non-uniform masking strategy. Experimental results on four EHR datasets show that CSAI outperforms state-of-the-art architectures in both data restoration and downstream tasks.
CSAI是一种新型递归神经网络,用于处理EHRs中的复杂缺失数据。它引入了基于注意力的隐藏状态初始化、领域导向的时间衰减以及非均匀遮罩策略。在四个EHR基准数据集上的实验表明,CSAI在数据恢复和下游任务中优于现有方法。
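For readers unfamiliar with temporal decay in recurrent imputation, the sketch below shows the generic form used by earlier models such as GRU-D and BRITS: a per-feature gap since the last observation is turned into a factor in (0, 1] that discounts stale information. This is the baseline idea only. The CSAI abstract describes a domain-informed variant whose exact form is not given here, and the `weight` and `bias` values are placeholders for learned parameters.

```python
import math

def time_gaps(timestamps, observed):
    """Time since the last observed value of one feature.

    The gap resets to the step size after an observed value and keeps
    accumulating while values are missing.
    """
    gaps = [0.0]
    for t in range(1, len(timestamps)):
        step = timestamps[t] - timestamps[t - 1]
        gaps.append(step if observed[t - 1] else step + gaps[-1])
    return gaps

def decay_factor(gap, weight=0.5, bias=0.0):
    """gamma = exp(-max(0, weight * gap + bias)), a value in (0, 1]."""
    return math.exp(-max(0.0, weight * gap + bias))

if __name__ == "__main__":
    ts = [0, 1, 2, 4]
    mask = [True, False, False, True]
    for gap in time_gaps(ts, mask):
        print(gap, round(decay_factor(gap), 4))
```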
Self-Verification is All You Need To Pass The Japanese Bar Examination
Authors: Andrew Shin
First: 2026-01-06T16:13:47+00:00 · Latest: 2026-01-06T16:13:47+00:00
Comments: https://github.com/shinandrew/self_verification
Abstract
Despite rapid advances in large language models (LLMs), achieving reliable performance on highly professional and structured examinations remains a significant challenge. The Japanese bar examination is a particularly demanding benchmark, requiring not only advanced legal reasoning but also strict adherence to complex answer formats that involve joint evaluation of multiple propositions. While recent studies have reported improvements by decomposing such questions into simpler true-false judgments, these approaches have not been systematically evaluated under the original exam format and scoring scheme, leaving open the question of whether they truly capture exam-level competence. In this paper, we present a self-verification model trained on a newly constructed dataset that faithfully replicates the authentic format and evaluation scale of the exam. Our model is able to exceed the official passing score when evaluated on the actual exam scale, marking the first demonstration, to our knowledge, of an LLM passing the Japanese bar examination without altering its original question structure or scoring rules. We further conduct extensive comparisons with alternative strategies, including multi-agent inference and decomposition-based supervision, and find that these methods fail to achieve comparable performance. Our results highlight the importance of format-faithful supervision and consistency verification, and suggest that carefully designed single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks. Our dataset and codes are publicly available.
中文标题/摘要
标题:自我验证即通过日本律师资格考试所需的一切
尽管大型语言模型(LLMs)取得了快速进步,但在高度专业和结构化的考试中实现可靠表现仍然是一个重大挑战。日本律师资格考试尤其具有挑战性,不仅需要高级法律推理,还需要严格遵守复杂的答案格式,涉及对多个命题的联合评估。虽然最近的研究报告称通过将此类问题分解为简单的真伪判断来取得改进,但这些方法尚未在原始考试格式和评分方案下系统评估,留下了一个问题,即它们是否真正捕捉到了考试水平的能力。在本文中,我们介绍了一个在新构建的数据集上训练的自我验证模型,该数据集忠实复制了考试的原始格式和评价标准。我们的模型能够在实际考试规模上超过官方通过分数线,这是我们所知的第一个证明,一个LLM在不改变原始问题结构或评分规则的情况下通过了日本律师资格考试。我们还进行了广泛的比较,包括多智能体推理和基于分解的监督,发现这些方法未能达到可比的性能。我们的结果强调了格式忠实监督和一致性验证的重要性,并表明精心设计的单模型方法在高风险专业推理任务中可能优于更复杂的系统。我们的数据集和代码已公开。
Summary / 总结
This paper addresses the challenge of achieving reliable performance on highly structured professional examinations using large language models (LLMs). It introduces a self-verification model trained on a new dataset that mirrors the authentic format and evaluation criteria of the Japanese bar examination. The model surpasses the official passing score when evaluated on the actual exam scale, demonstrating for the first time that an LLM can pass the Japanese bar examination without altering the original question structure or scoring rules. Comparative analysis with other strategies, such as multi-agent inference and decomposition-based supervision, shows that these methods fail to achieve similar performance. The study emphasizes the importance of format-faithful supervision and suggests that single-model approaches can outperform more complex systems in high-stakes professional reasoning tasks.
本文旨在解决大型语言模型(LLMs)在高度结构化的专业考试中实现可靠表现的挑战。研究引入了一种自验证模型,该模型在忠实复制日本司法考试真实格式和评分标准的新数据集上进行训练。该模型在实际考试评分标准下超过官方及格分数线,首次证明LLM可以在不改变原始问题结构或评分规则的情况下通过日本司法考试。研究还比较了该方法与多智能体推理和分解监督,发现这些方法未能达到可比的性能,强调了格式忠实监督和单一模型方法在高风险专业推理任务中的重要性。
Neuronal Attention Circuit (NAC) for Representation Learning
Authors: Waleed Razzaq, Izis Kanjaraway, Yun-Bo Zhao
First: 2025-12-11T04:49:44+00:00 · Latest: 2026-01-06T16:07:09+00:00
Comments: Ongoing work
Abstract
Attention improves representation learning over RNNs, but its discrete nature limits continuous-time (CT) modeling. We introduce Neuronal Attention Circuit (NAC), a novel, biologically inspired CT-Attention mechanism that reformulates attention logit computation as the solution to a linear first-order ODE with nonlinear interlinked gates derived from repurposing C. elegans Neuronal Circuit Policies (NCPs) wiring. NAC replaces dense projections with sparse sensory gates for key-query projections and a sparse backbone network with two heads for computing content-target and learnable time-constant gates, enabling efficient adaptive dynamics. To improve efficiency and reduce memory consumption, we implemented an adaptable subquadratic sparse Top-K pairwise concatenation mechanism that selectively curates key-query interactions. We provide rigorous theoretical guarantees, including state stability and bounded approximation errors. Empirically, we implemented NAC in diverse domains, including irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics. We observed that NAC matches or outperforms competing baselines in accuracy and occupies an intermediate position in runtime and memory consumption compared with several CT state-of-the-art baselines, while being interpretable at the neuron cell level.
中文标题/摘要
标题:神经注意电路(NAC)用于表示学习
注意力机制在表示学习中优于RNN,但其离散性质限制了连续时间(CT)建模。我们引入了神经注意电路(NAC),这是一种新颖的、受生物启发的CT-注意力机制,将注意力权重计算重新表述为线性一阶微分方程的解,该方程的非线性互连门控由重新利用C. elegans 神经电路策略(NCPs)的连接方式推导而来。NAC用稀疏感官门控取代密集投影,用稀疏骨干网络和两个头来计算内容目标和可学习的时间常数门控,从而实现高效的自适应动力学。为了提高效率和内存消耗,我们实现了一种可调节的亚二次稀疏Top-K成对连接机制,该机制有选择地筛选关键查询交互。我们提供了严格的理论保证,包括状态稳定性及有界逼近误差。实验上,我们在包括不规则时间序列分类、自动驾驶车辆车道保持和工业预测等多个领域实现了NAC。我们观察到,NAC在准确率上与竞争基线相当或优于基线,在运行时间和内存消耗上处于几种CT最新基线之间,同时在神经元细胞层面具有可解释性。
Summary / 总结
The research aims to improve continuous-time representation learning by addressing the limitations of discrete attention mechanisms in recurrent neural networks (RNNs). The Neuronal Attention Circuit (NAC) is introduced as a novel, biologically inspired mechanism that reformulates attention logit computation using a linear first-order ODE with nonlinear interlinked gates derived from C. elegans neuronal circuit policies. Key experimental findings show that NAC matches or outperforms competing baselines in accuracy across various domains such as irregular time-series classification, lane-keeping for autonomous vehicles, and industrial prognostics, while maintaining intermediate runtime and memory consumption compared to state-of-the-art continuous-time models.
研究旨在通过解决离散注意力机制在循环神经网络(RNN)中的局限性,来改进连续时间的表示学习。引入了神经注意电路(NAC),通过使用来自C.elegans神经元电路策略的线性一阶微分方程和非线性互连门来重新定义注意力权重的计算。NAC通过稀疏感官门和稀疏骨干网络中的两个头来计算内容目标和可学习的时间常数门,实现了高效的自适应动态。实验结果显示,NAC在各种领域中的准确度与竞争基线相当或优于基线,同时在运行时间和内存消耗上保持平衡,并且具有神经元细胞级别的可解释性。
Limited Linguistic Diversity in Embodied AI Datasets
Authors: Selma Wanna, Agnes Luhtaru, Jonathan Salfity, Ryan Barron, Juston Moore, Cynthia Matuszek, Mitch Pryor
First: 2026-01-06T16:06:47+00:00 · Latest: 2026-01-06T16:06:47+00:00
Abstract
Language plays a critical role in Vision-Language-Action (VLA) models, yet the linguistic characteristics of the datasets used to train and evaluate these systems remain poorly documented. In this work, we present a systematic dataset audit of several widely used VLA corpora, aiming to characterize what kinds of instructions these datasets actually contain and how much linguistic variety they provide. We quantify instruction language along complementary dimensions, including lexical variety, duplication and overlap, semantic similarity, and syntactic complexity. Our analysis shows that many datasets rely on highly repetitive, template-like commands with limited structural variation, yielding a narrow distribution of instruction forms. We position these findings as descriptive documentation of the language signal available in current VLA training and evaluation data, intended to support more detailed dataset reporting, more principled dataset selection, and targeted curation or augmentation strategies that broaden language coverage.
中文标题/摘要
标题:体态人工智能数据集中的语言多样性有限
语言在视觉-语言-行动(VLA)模型中起着关键作用,然而用于训练和评估这些系统的数据集的语言特征仍然记录不足。在本研究中,我们对几个广泛使用的VLA语料库进行了系统的数据集审计,旨在描述这些数据集实际包含的指令类型以及它们提供的语言多样性程度。我们从词汇多样性、重复和重叠、语义相似性和句法复杂性等互补维度量化指令语言。我们的分析表明,许多数据集依赖于高度重复的、模板化的命令,结构变化有限,导致指令形式分布狭窄。我们将这些发现定位为当前VLA训练和评估数据中可用语言信号的描述性文档,旨在支持更详细的数据集报告、更原则性的数据集选择以及有针对性的扩展语言覆盖范围的策展或增强策略。
Summary / 总结
This study investigates the linguistic diversity in Vision-Language-Action (VLA) datasets, finding that many datasets contain highly repetitive and template-like commands with limited structural variation. The research quantifies instruction language along dimensions such as lexical variety, duplication, semantic similarity, and syntactic complexity, revealing a narrow distribution of instruction forms. This work aims to provide descriptive documentation of the language signal available in current VLA data, supporting more detailed dataset reporting and targeted curation strategies to broaden language coverage.
研究分析了视觉-语言-行动(VLA)数据集中的语言多样性,发现许多数据集包含高度重复和模板化的命令,结构变化有限。研究通过词汇多样性、重复、语义相似性和句法复杂性等维度量化指令语言,揭示了指令形式分布狭窄。这项工作旨在提供当前VLA数据集中语言信号的描述性文档,以支持更详细的报告、更原则的数据集选择以及扩展语言覆盖范围的针对性策展或增强策略。
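Two of the dimensions named in the audit above, lexical variety and duplication, can be approximated with very simple statistics. The sketch below is an illustration under stated assumptions: whitespace tokenisation, lowercase exact matching for duplicates, and the type-token ratio as the variety measure. The paper's actual metrics may differ, and the example instructions are invented.

```python
from collections import Counter

def instruction_stats(instructions):
    """Return a type-token ratio and an exact-duplicate rate.

    type_token_ratio: distinct tokens / total tokens (lowercased,
    whitespace-split). duplicate_rate: share of instructions that
    repeat an earlier instruction exactly after normalisation.
    """
    tokens = [tok for text in instructions for tok in text.lower().split()]
    counts = Counter(text.strip().lower() for text in instructions)
    duplicates = sum(c - 1 for c in counts.values())
    return {
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "duplicate_rate": duplicates / len(instructions),
    }

if __name__ == "__main__":
    sample = [
        "pick up the cup",
        "pick up the cup",
        "pick up the block",
        "open the drawer",
    ]
    print(instruction_stats(sample))
```

A low type-token ratio together with a high duplicate rate is the kind of signature the abstract describes as repetitive, template-like commands.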
A Multidimensional AI-powered Framework for Analyzing Tourist Perception in Historic Urban Quarters: A Case Study in Shanghai
Authors: Kaizhen Tan, Yufan Wu, Yuxuan Liu, Haoran Zeng
First: 2025-09-04T02:35:14+00:00 · Latest: 2026-01-06T16:02:10+00:00
Abstract
Historic urban quarters play a vital role in preserving cultural heritage while serving as vibrant spaces for tourism and everyday life. Understanding how tourists perceive these environments is essential for sustainable, human-centered urban planning. This study proposes a multidimensional AI-powered framework for analyzing tourist perception in historic urban quarters using multimodal data from social media. Applied to twelve historic quarters in central Shanghai, the framework integrates focal point extraction, color theme analysis, and sentiment mining. Visual focus areas are identified from tourist-shared photos using a fine-tuned semantic segmentation model. To assess aesthetic preferences, dominant colors are extracted using a clustering method, and their spatial distribution across quarters is analyzed. Color themes are further compared between social media photos and real-world street views, revealing notable shifts. This divergence highlights potential gaps between visual expectations and the built environment, reflecting both stylistic preferences and perceptual bias. Tourist reviews are evaluated through a hybrid sentiment analysis approach combining a rule-based method and a multi-task BERT model. Satisfaction is assessed across four dimensions: tourist activities, built environment, service facilities, and business formats. The results reveal spatial variations in aesthetic appeal and emotional response. Rather than focusing on a single technical innovation, this framework offers an integrated, data-driven approach to decoding tourist perception and contributes to informed decision-making in tourism, heritage conservation, and the design of aesthetically engaging public spaces.
中文标题/摘要
标题:一种基于人工智能的多维度框架:以上海为例分析历史城市街区的游客感知
历史城市街区在保存文化遗产的同时,也是旅游和日常生活活跃的空间。理解游客对这些环境的感知对于可持续的人本城市规划至关重要。本研究提出了一种基于人工智能的多维度框架,利用社交媒体的多模态数据分析历史城市街区的游客感知。该框架应用于上海市中心的十二个历史街区,结合焦点提取、色彩主题分析和情感挖掘。通过微调的语义分割模型,从游客共享的照片中识别视觉焦点区域。为了评估审美偏好,使用聚类方法提取主导色彩,并分析其在街区中的空间分布。进一步将社交媒体照片中的色彩主题与实际街道视图进行比较,揭示出显著的变化。这种差异突显了视觉期望与建成环境之间的潜在差距,反映了风格偏好和感知偏差。通过结合基于规则的方法和多任务BERT模型的混合情感分析方法,评估游客评论。满意度在四个维度上进行评估:游客活动、建筑环境、服务设施和商业形式。结果揭示了审美吸引力和情感反应的空间变化。本框架不仅关注单一的技术创新,还提供了一种综合的数据驱动方法,用于解读游客感知,并为旅游业、文化遗产保护和设计具有审美吸引力的公共空间的决策提供支持。
Summary / 总结
This study aims to understand tourist perception in historic urban quarters by proposing a multidimensional AI-powered framework using multimodal data from social media. The framework integrates focal point extraction, color theme analysis, and sentiment mining. Applied to twelve historic quarters in central Shanghai, it identifies visual focus areas, analyzes dominant colors, and evaluates tourist satisfaction across four dimensions. The results show spatial variations in aesthetic appeal and emotional response, highlighting gaps between visual expectations and the built environment.
本研究提出了一种多维度的AI辅助框架,利用社交媒体多模态数据分析上海市中心十二个历史街区的游客感知。该框架包括焦点提取、色彩主题分析和情感分析。它从游客照片中识别视觉焦点区域,分析色彩偏好,并在多个维度上评估游客满意度。主要发现表明,在审美吸引力和情感反应方面存在空间差异,揭示了视觉期望与建成环境之间的差距。
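The abstract above says only that dominant colors are extracted "using a clustering method". As a minimal sketch, assuming plain k-means over RGB pixel values (the paper's actual algorithm, color space, and number of clusters are not stated in this digest), the step could look like the following. The function name `dominant_colors` and all of its parameters are hypothetical.

```python
import numpy as np

def dominant_colors(pixels, k=3, iters=20, seed=0):
    """Cluster RGB pixels with plain k-means.

    Returns (centers, shares): the cluster centers and the fraction of
    pixels in each cluster, both sorted from largest to smallest share.
    """
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    # Initialise the centers with k distinct pixels chosen at random.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()

    def assign(current_centers):
        # Nearest center by squared Euclidean distance in RGB space.
        dists = ((pixels[:, None, :] - current_centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)

    labels = assign(centers)
    shares = np.bincount(labels, minlength=k) / len(pixels)
    order = np.argsort(-shares)
    return centers[order], shares[order]
```

Running the same function on pixels from social media photos and on pixels from street-view images would give two palettes that can then be compared, which is the kind of comparison the abstract describes.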
Unified Thinker: A General Reasoning Modular Core for Image Generation
Authors: Sashuai Zhou, Qiang Zhou, Jijin Hu, Hanqing Yang, Yue Cao, Junpeng Ma, Yinchao Ma, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng, Zhou Zhao
First: 2026-01-06T15:59:33+00:00 · Latest: 2026-01-06T15:59:33+00:00
Abstract
Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning-execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap to current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of reasoning without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that optimize visual correctness over textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
中文标题/摘要
标题:统一思考者:通用推理模块化核心以生成图像
尽管在高保真图像合成方面取得了令人印象深刻的进展,生成模型仍然在逻辑密集型指令遵循方面挣扎,暴露了推理与执行之间的持续差距。同时,封闭源系统(例如Nano Banana)展示了强大的基于推理的图像生成能力,突显了与当前开源模型之间存在的巨大差距。我们认为,缩小这一差距不仅需要更好的视觉生成器,还需要可执行的推理:将高层意图分解为有依据、可验证且能直接引导生成过程的计划。为此,我们提出了统一思考者,这是一种面向通用图像生成的任务无关推理架构,设计为一个统一的规划核心,可以插接到各种生成器和工作流程中。统一思考者将专门的思考者与图像生成器分离,使得可以在不重新训练整个生成模型的情况下对推理进行模块化升级。我们进一步引入了两阶段训练范式:首先为思考者构建结构化的规划接口,然后使用强化学习将其策略与像素级反馈联系起来,鼓励优化视觉正确性而非文本合理性的计划。在文本到图像生成和图像编辑的广泛实验中,统一思考者显著提高了图像推理和生成质量。
Summary / 总结
The research aims to address the reasoning-execution gap in generative models for image synthesis by proposing Unified Thinker, a task-agnostic reasoning architecture. Unified Thinker is designed to decompose high-level intents into grounded plans that steer the generative process, and it is modular, allowing upgrades to the reasoning component without retraining the entire model. The method involves a two-stage training process: first, building a structured planning interface, then grounding the policy through reinforcement learning. Experiments show that Unified Thinker significantly enhances image reasoning and generation quality in text-to-image generation and image editing tasks.
研究旨在通过提出统一思考者(Unified Thinker),一种任务无关的推理架构,解决生成模型在图像合成中的推理执行差距问题。统一思考者将高层意图分解为可执行的计划,并引导生成过程,且该架构是模块化的,允许对推理进行升级而无需重新训练整个模型。研究引入了两阶段训练范式,首先构建结构化的规划接口,然后使用强化学习将策略与像素级反馈对接,鼓励优化视觉正确性而非文本合理性。实验表明,统一思考者在文本到图像生成和图像编辑任务中显著提高了图像推理和生成质量。
Information-Theoretic Generalization Bounds of Replay-based Continual Learning
Authors: Wen Wen, Tieliang Gong, Zeyu Gao, Yunjiao Zhang, Weizhan Zhang, Yong-Jin Liu
First: 2025-07-16T09:00:57+00:00 · Latest: 2026-01-06T15:55:45+00:00
Abstract
Continual learning (CL) has emerged as a dominant paradigm for acquiring knowledge from sequential tasks while avoiding catastrophic forgetting. Although many CL methods have shown impressive empirical performance, the theoretical understanding of their generalization behavior remains limited, particularly for replay-based approaches. This paper establishes a unified theoretical framework for replay-based CL, deriving a series of information-theoretic generalization bounds that explicitly elucidate the impact of the memory buffer alongside the current task on generalization performance. Specifically, our hypothesis-based bounds capture the trade-off between the number of selected exemplars and the information dependency between the hypothesis and the memory buffer. Our prediction-based bounds yield tighter and computationally tractable upper bounds on the generalization error by leveraging low-dimensional variables. Our theoretical analysis is general and applies to a wide range of learning algorithms, with stochastic gradient Langevin dynamics (SGLD) as a representative example. Comprehensive experimental evaluations demonstrate the effectiveness of our derived bounds in capturing the generalization dynamics in replay-based CL settings.
中文标题/摘要
标题:基于重放的连续学习的信息论泛化界
连续学习(CL)已成为一种主导范式,用于从顺序任务中获取知识并避免灾难性遗忘。尽管已经提出了许多CL方法以展示出色的实验性能,但对其泛化行为的理论理解仍然有限,特别是对于基于重放的方法。本文为基于重放的CL建立了一个统一的理论框架,推导出一系列信息论泛化界,明确阐明了记忆缓冲区与当前任务对泛化性能的影响。具体而言,基于假设的泛化界捕捉了所选示例数量与假设和记忆缓冲区之间信息依赖性的权衡。基于预测的泛化界通过利用低维变量,提供了更紧且计算上可处理的泛化误差上界。理论分析是通用的,广泛适用于各种学习算法,以随机梯度朗之万动力学(SGLD)为代表方法。全面的实验评估表明,我们推导出的泛化界在捕捉基于重放的CL设置中的泛化动态方面是有效的。
Summary / 总结
This paper aims to provide a theoretical understanding of the generalization behavior in replay-based continual learning (CL) methods, which are crucial for avoiding catastrophic forgetting. The authors develop a unified information-theoretic framework, deriving both hypothesis-based and prediction-based generalization bounds. These bounds highlight the trade-off between the number of selected exemplars and the information dependency between the hypothesis and the memory buffer. Experimental evaluations show that these bounds effectively capture the generalization dynamics in replay-based CL settings, providing a valuable tool for analyzing and improving CL methods.
该研究旨在为基于重放的持续学习(CL)方法的泛化行为提供理论理解,这对于避免灾难性遗忘至关重要。作者推导出的信息论泛化界突显了所选示例数量与假设和记忆缓冲区之间信息依赖性的权衡。理论分析适用于各种学习算法,以随机梯度朗之万动力学(SGLD)为例。实验结果表明,推导出的泛化界能够有效地捕捉基于重放的CL设置中的泛化动态。
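For readers unfamiliar with SGLD, the update it uses is a gradient step plus Gaussian noise: theta <- theta - lr * grad + sqrt(2 * lr / beta) * xi, with xi drawn from a standard normal. The sketch below applies that standard update to a toy linear regression whose minibatches mix current-task samples with samples from a memory buffer. It is an illustration only, not the paper's experimental setup: the model, the loss, the equal split between current and replayed samples, and every name and hyperparameter (`train_with_replay`, `inv_temp`, and so on) are assumptions.

```python
import numpy as np

def sgld_step(theta, grad, lr, inv_temp, rng):
    """One SGLD update: gradient descent plus noise of variance 2 * lr / inv_temp."""
    noise = rng.normal(size=theta.shape)
    return theta - lr * grad + np.sqrt(2.0 * lr / inv_temp) * noise

def mse_grad(theta, X, y):
    """Gradient of 0.5 * mean((X @ theta - y) ** 2) with respect to theta."""
    return X.T @ (X @ theta - y) / len(y)

def train_with_replay(X_cur, y_cur, X_mem, y_mem,
                      steps=500, lr=0.05, inv_temp=1e4, batch=16, seed=0):
    """Fit a linear model with SGLD on batches of current-task plus replayed data."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X_cur.shape[1])
    for _ in range(steps):
        # Sample (with replacement) from the current task and from the buffer.
        i = rng.choice(len(y_cur), size=batch)
        j = rng.choice(len(y_mem), size=min(batch, len(y_mem)))
        X = np.vstack([X_cur[i], X_mem[j]])
        y = np.concatenate([y_cur[i], y_mem[j]])
        theta = sgld_step(theta, mse_grad(theta, X, y), lr, inv_temp, rng)
    return theta
```

Shrinking the buffer (fewer rows in `X_mem`) or lowering `inv_temp` (more noise) are the kinds of knobs whose effect on generalization the bounds in the abstract are meant to describe.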
Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
Authors: Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo
First: 2025-06-30T16:50:48+00:00 · Latest: 2026-01-06T15:52:30+00:00
Abstract
Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long-lived, stateful LLM flows that interleave prefill and token-by-token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on-device intelligence, existing LLM engines assume static, single-shot inference and lack mechanisms for flow-level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed-criticality execution patterns of personal agents.
This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers unique SoC characteristics of operator-accelerator affinity, asymmetric DDR contention, and stage-divergent batching behaviors distinct from cloud-serving assumptions. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) capturing NPU/iGPU affinity and elastic operator binding; flow-aware NPU-iGPU coordination with stage elasticity, decoupling prefill and decode to reduce bandwidth contention and enforce priorities; and fine-grained preemption with slack-aware piggybacking to guarantee reactive responsiveness without starving proactive work. Across realistic personal-agent workloads, Agent.xpu delivers 1.2-4.9× proactive throughput and reduces reactive latency by at least 91%, compared with both an industrial iGPU-only serving engine and NPU-iGPU static inference with optimal tensor-partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.
中文标题/摘要
标题:Agent.xpu: 在异构SoC上高效调度代理型LLM工作负载
个人LLM代理越来越多地结合前景反应性交互与背景主动性监控,形成长期、状态化的LLM流,这些流交错着预填充和逐令牌解码。虽然现代异构SoC集成了CPU、iGPU和NPU以支持设备端智能,但现有的LLM引擎假设静态、单次推理,并缺乏针对流级并发、优先级和高效加速器协调的机制。因此,普通SoC仍然不适应个人代理动态、混合关键性执行模式。
本文介绍了Agent.xpu,这是第一个在普通SoC上协调并发的反应性和主动性LLM流的LLM引擎。详尽的剖析揭示了SoC操作符-加速器亲和性、非对称DDR争用以及阶段分歧批处理行为等独特特征,这些特征不同于云服务假设。Agent.xpu引入了三种关键技术:异构执行图(HEG),捕捉NPU/iGPU亲和性和弹性操作符绑定;流感知的NPU-iGPU协调,具有阶段弹性,解耦预填充和解码以减少带宽争用并强制执行优先级;以及细粒度抢占,通过感知余量的搭车来保证反应性响应性,而不使主动性工作饥饿。在现实的个人代理工作负载中,与工业级纯iGPU服务引擎和NPU-iGPU静态推理(采用最优张量分区方案)相比,Agent.xpu实现了1.2-4.9倍的主动性吞吐量,并将反应性延迟减少了至少91%。此外,Agent.xpu通过受控的iGPU使用来最小化能耗和图形干扰。
Summary / 总结
Agent.xpu is an LLM engine designed to efficiently manage concurrent reactive and proactive LLM flows on commodity SoCs. It addresses the limitations of existing LLM engines by introducing a heterogeneous execution graph, flow-aware coordination, and fine-grained preemption. Experimental results show that Agent.xpu achieves 1.2-4.9 times higher proactive throughput and at least 91% reduction in reactive latency compared to existing iGPU-only serving engines and static NPU-iGPU inference schemes.
Agent.xpu 是一种针对商品 SoC 设计的 LLM 引擎,旨在高效管理并发的反应性和前瞻性 LLM 流。它通过引入异构执行图、流感知 NPU-iGPU 协调和细粒度抢占来解决现有 LLM 引擎的局限性。实验结果表明,Agent.xpu 将前瞻性吞吐量提高了 1.2-4.9 倍,并将反应性延迟降低了至少 91%,与工业 iGPU 仅有的服务引擎和 NPU-iGPU 静态推理方案相比。
ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation
Authors: Peiran Li, Jan Fillies, Adrian Paschke
First: 2026-01-06T15:50:46+00:00 · Latest: 2026-01-06T15:50:46+00:00
Comments: This paper has been accepted to the main conference of EACL 2026
Abstract
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
中文标题/摘要
标题:ToxiGAN:通过LLM引导的方向对抗生成进行有毒数据增强
以可控且类别特定的方式增强有毒语言数据对于提高毒性分类的鲁棒性至关重要,但由于监督有限和分布偏斜,这仍然具有挑战性。我们提出ToxiGAN,这是一种基于对抗生成且具有类别意识的文本增强框架,结合了大型语言模型(LLMs)的语义指导。为了解决基于GAN的增强中常见的模式崩溃和语义漂移问题,ToxiGAN引入了一种两步方向性训练策略,并利用LLM生成的中性文本作为语义压舱物。与以往工作将LLMs视为静态生成器不同,我们的方法动态选择中性示例以提供平衡指导。有毒样本被明确优化以偏离这些示例,强化了类别特定的对比信号。在四个仇恨言论基准上的实验表明,ToxiGAN在宏F1和仇恨F1上均实现了最强的平均性能,始终优于传统和基于LLM的增强方法。消融和敏感性分析进一步证实了语义压舱物和方向性训练在增强分类器鲁棒性方面的益处。
Summary / 总结
ToxiGAN is a class-aware text augmentation framework that uses adversarial generation with semantic guidance from large language models to augment toxic language data. It addresses common issues like mode collapse and semantic drift by introducing a two-step directional training strategy and using LLM-generated neutral texts as semantic ballast. Experiments on four hate speech benchmarks show that ToxiGAN outperforms traditional and LLM-based augmentation methods in both macro-F1 and hate-F1 scores, demonstrating its effectiveness in enhancing classifier robustness.
ToxiGAN 是一种基于对抗生成并结合大型语言模型语义指导的文本增强框架,用于增强有毒语言数据。通过引入两步方向性训练策略和使用大型语言模型生成的中性文本作为语义锚定,它解决了常见的模式坍塌和语义漂移问题。实验结果显示,ToxiGAN 在四个仇恨言论基准测试中,无论是宏观 F1 分数还是仇恨 F1 分数,都优于传统和基于大型语言模型的增强方法,证明了其在增强分类器鲁棒性方面的有效性。
Gradient Coupling: The Hidden Barrier to Generalization in Agentic Reinforcement Learning
Authors: Jingyu Liu, Xiaopeng Wu, Jingquan Peng, Kehan Chen, Chuan Yu, Lizhong Ding, Yong Liu
First: 2025-09-28T13:24:38+00:00 · Latest: 2026-01-06T15:48:43+00:00
Abstract
Reinforcement learning (RL) is a dominant paradigm for training autonomous agents, yet these agents often exhibit poor generalization, failing to adapt to scenarios not seen during training. In this work, we identify a fundamental cause of this brittleness, a phenomenon which we term "gradient coupling." We hypothesize that in complex agentic tasks, the high similarity between distinct states leads to destructive interference between gradients. Specifically, a gradient update that reinforces an optimal action in one state can inadvertently increase the likelihood of a suboptimal action in a similar, yet different, state. To solve this, we propose a novel objective where the actor is trained to simultaneously function as a classifier that separates good and bad actions. This auxiliary pressure compels the model to learn disentangled embeddings for positive and negative actions, which mitigates negative gradient interference and improves generalization performance. Extensive experiments demonstrate the effectiveness of our method.
中文标题/摘要
标题:梯度耦合:代理强化学习中泛化的隐性障碍
强化学习(RL)是训练自主代理的主要范式,但这些代理往往表现出泛化能力差,无法适应训练中未见过的场景。在本研究中,我们识别出这种脆弱性的根本原因,我们将其称为“梯度耦合”。我们假设在复杂的代理任务中,不同状态之间的高相似性会导致梯度之间的破坏性干涉。具体来说,一个增强最优行为的梯度更新可能会无意中增加在相似但不同的状态下采取次优行为的可能性。为了解决这个问题,我们提出了一种新的目标,其中演员不仅作为执行者,还作为分类器,将好行为和坏行为区分开来。这种辅助压力促使模型学习正负行为的分离嵌入,从而减轻负梯度干涉并提高泛化性能。广泛的实验表明了我们方法的有效性。
Summary / 总结
The research aims to address the poor generalization of reinforcement learning agents, which often fail to adapt to unseen scenarios. The study identifies 'gradient coupling' as a key issue, where similar states cause destructive interference between gradients. To tackle this, the authors propose a new objective that trains the actor to act as a classifier, promoting disentangled embeddings for actions. Experiments show that this method improves generalization performance.
本文通过识别一种称为‘梯度耦合’的现象,解决了强化学习代理在泛化方面的不足问题。作者提出了一种新的目标,即训练演员将动作分类为好或坏,这有助于学习分离的嵌入并减轻负梯度干扰。实验表明,这种方法可以提高泛化性能。
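The coupling effect described in the abstract can be reproduced in a toy setting. The sketch below assumes a linear-softmax policy with shared weights and hand-picked state features; it is not the authors' model or their proposed objective. For such a policy, one gradient step on log pi(a | s1) shifts the logits at any other state s2 in proportion to the inner product of the two feature vectors, so a near-duplicate state inherits the update while an orthogonal one does not. All names (`reinforce_action`, `action_prob`) and feature values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_action(W, phi, action, lr=1.0):
    """One gradient-ascent step on log pi(action | state) for logits = W @ phi."""
    probs = softmax(W @ phi)
    onehot = np.eye(W.shape[0])[action]
    # d log pi(a|s) / dW = (onehot(a) - pi(.|s)) outer phi(s)
    return W + lr * np.outer(onehot - probs, phi)

def action_prob(W, phi, action):
    """Probability the policy assigns to `action` in the state with features phi."""
    return softmax(W @ phi)[action]

W = np.zeros((2, 3))                    # two actions, three features, uniform policy
s1 = np.array([1.0, 0.0, 0.2])          # state where action 0 is reinforced
s2 = np.array([0.9, 0.1, 0.25])         # similar state (suppose action 0 is bad here)
s3 = np.array([0.0, 1.0, 0.0])          # state orthogonal to s1

W_new = reinforce_action(W, s1, action=0)
for name, s in [("s1", s1), ("s2", s2), ("s3", s3)]:
    print(name, action_prob(W, s, 0), "->", action_prob(W_new, s, 0))
```

The probability of action 0 rises at `s2` almost as much as at `s1`, even though no data from `s2` was used, while `s3` is left untouched. The abstract's remedy, training the actor to also separate good from bad actions, targets exactly this overlap in the learned representation.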
One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling
Authors: Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
First: 2026-01-06T15:41:35+00:00 · Latest: 2026-01-06T15:41:35+00:00
Abstract
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually relies on thousands or more high-quality samples. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) With RL, a single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) The math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) An engineered synthetic sample that integrates multidisciplinary elements outperforms training with individual samples that naturally occur. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
中文标题/摘要
标题:一例通吃:RL扩展中的极端数据效率
大型语言模型(LLMs)的推理能力可以通过强化学习(RL)释放(OpenAI, 2024;DeepSeek-AI等, 2025a;Zeng等, 2025)。现有RL在LLMs中的成功通常依赖于数千甚至更多的高质量样本。本文通过展示单次学习的有效性挑战了RL对LLMs的数据需求假设。具体来说,我们提出了多才多艺学习框架,设计一个训练样本以引发多学科影响。我们提出了三个关键发现:(1)一个精心选择的数学推理样本可以在多个领域,包括物理、化学和生物学中通过RL产生显著的性能提升;(2)推理所需的数学技能表明了最优多才多艺样本的特征;(3)一个整合多学科元素的工程化合成样本优于使用自然出现的单个样本进行训练。我们的方法在各种推理基准测试中实现了优于更大数据集训练的性能,表明样本质量和设计而非数量可能是解锁语言模型增强推理能力的关键。我们的结果表明了一种被称为样本工程的转变,即转向精确设计训练样本而非简单增加数据量。
Summary / 总结
This paper explores the potential of extreme data efficiency in reinforcement learning (RL) for large language models (LLMs) by introducing polymath learning, a framework that uses a single, strategically selected sample to improve performance across multiple domains. Key findings include significant performance improvements in physics, chemistry, and biology with just one math reasoning sample, the importance of math skills in determining the optimal sample, and the superior performance of an engineered synthetic sample over natural samples. The study suggests that sample quality and design are more critical than quantity in enhancing reasoning capabilities in LLMs.
本文通过引入多才多艺学习框架,使用单一、精心选择的样本在多个领域(如物理、化学和生物学)中提高强化学习(RL)在大型语言模型(LLMs)中的性能,探索了极端数据效率的可能性。关键发现包括仅用一个数学推理样本就能在这些领域中取得显著性能提升,数学技能的重要性决定了最佳样本的特征,以及工程合成样本优于自然样本。研究结果表明,样本质量与设计比数量更为关键,以提升LLMs的推理能力。