arXiv Paper Digest

Towards Understanding Best Practices for Quantization of Vision-Language Models

Authors: Gautom Das, Vincent La, Ethan Lau, Abhinav Shrivastava, Matthew Gwilliam

First: 2026-01-21T18:59:51+00:00 · Latest: 2026-01-21T18:59:51+00:00

Comments: 15 pages, 12 figures, 1 table

Abstract

Large language models (LLMs) deliver impressive results for a variety of tasks, but state-of-the-art systems require fast GPUs with large amounts of memory. To reduce both the memory and latency of these systems, practitioners quantize their learned parameters, typically at half precision. A growing body of research focuses on preserving the model performance with more aggressive bit widths, and some work has been done to apply these strategies to other models, like vision transformers. In our study we investigate how a variety of quantization methods, including state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines comprised of vision models, language models, and their connectors. We address how performance on captioning, retrieval, and question answering can be affected by bit width, quantization method, and which portion of the pipeline the quantization is used for. Results reveal that ViT and LLM exhibit comparable importance in model performance, despite significant differences in parameter size, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings provide practical insights for efficient deployment of MLLMs and highlight the value of exploration for understanding component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq.
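As a rough illustration of what a lower bit width does to stored weights, the following is a minimal sketch of plain symmetric round-to-nearest quantization. It is not GPTQ or AWQ, which the abstract names and which are more sophisticated, and it is not taken from the authors' code. The function names and the example weights are hypothetical.

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization of a list of floats.

    Illustrative baseline only; GPTQ and AWQ choose codes and scales
    more carefully than this.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for the whole list
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_rtn(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.60]
for bits in (8, 4, 2):
    print(bits, "bits -> max abs error", max_abs_error(weights, bits))
```

With a single shared scale, the round-trip error of each weight is bounded by half the scale, so the bound grows as the bit width shrinks. The study's question is how much of that error each part of the pipeline tolerates.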

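Title / Abstract (translated from Chinese)

Title: Towards Understanding Best Practices for Quantization of Vision-Language Models

Large language models (LLMs) perform well across a wide range of tasks, but state-of-the-art systems need fast GPUs with large amounts of memory. To reduce the memory and latency of these systems, practitioners typically quantize the learned parameters to half precision. A growing body of research focuses on preserving model performance at more aggressive bit widths, and some work has applied these strategies to other models, such as vision transformers. In our study, we investigate how a variety of quantization methods, including the state-of-the-art GPTQ and AWQ, can be applied effectively to multimodal pipelines composed of vision models, language models, and their connectors. We examine how bit width, quantization method, and the portion of the pipeline that is quantized affect performance on captioning, retrieval, and question answering. The results show that the ViT and the LLM are comparably important to model performance despite significant differences in parameter count, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw). These findings offer practical insights for efficient deployment of multimodal large language models (MLLMs) and highlight the value of exploring component sensitivities in multimodal models. Our code is available at https://github.com/gautomdas/mmq/.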

Summary

This study investigates how quantization methods, including GPTQ and AWQ, apply to multimodal pipelines made up of vision models, language models, and their connectors. It examines how bit width, quantization method, and the quantized portion of the pipeline affect performance on captioning, retrieval, and question answering. Key findings are that the vision transformer (ViT) and the large language model (LLM) are comparably important to model performance despite significant differences in parameter count, and that lower-bit quantization of the LLM achieves high accuracy at reduced bits per weight (bpw).

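The study examines how quantization methods such as GPTQ and AWQ apply to multimodal pipelines that contain vision models, language models, and their connectors. It aims to understand how bit width and quantization technique affect performance on tasks such as captioning, retrieval, and question answering. Key findings are that the ViT and the LLM are comparably important to model performance despite significant differences in parameter count, and that quantizing the LLM at lower bit widths achieves high accuracy while reducing bits per weight. These findings offer practical guidance for deploying efficient multimodal large language models.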

Iterative Refinement Improves Compositional Image Generation

Authors: Shantanu Jaiswal, Mihir Prabhudesai, Nikash Bhardwaj, Zheyang Qin, Amir Zadeh, Chuan Li, Katerina Fragkiadaki, Deepak Pathak

First: 2026-01-21T18:59:40+00:00 · Latest: 2026-01-21T18:59:40+00:00

Comments: Project webpage: https://iterative-img-gen.github.io/

Abstract

Text-to-image (T2I) models have achieved remarkable progress, yet they continue to struggle with complex prompts that require simultaneously handling multiple objects, relations, and attributes. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing denoising steps, can improve prompt alignment but remain inadequate for richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations across multiple steps, guided by feedback from a vision-language model as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be flexibly applied to a wide range of image generators and vision-language models. Empirically, we demonstrate consistent gains on image generation across benchmarks: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category) and a 12.5% improvement on Visual Jenga scene decomposition compared to compute-matched parallel sampling. Beyond quantitative gains, iterative refinement produces more faithful generations by decomposing complex prompts into sequential corrections, with human evaluators preferring our method 58.7% of the time over 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/
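The generate, critique, refine loop described in the abstract can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: `generate` and `critique` are hypothetical callables standing in for a T2I model and a vision-language critic, and an "image" is modelled simply as the set of prompt constraints it satisfies.

```python
from typing import Callable, FrozenSet, List, Optional

Image = FrozenSet[str]  # toy "image": the set of prompt constraints it satisfies


def iterative_refine(
    prompt: List[str],
    generate: Callable[[List[str], Optional[Image], List[str]], Image],
    critique: Callable[[List[str], Image], List[str]],
    max_steps: int,
) -> Image:
    """Refine a generation step by step, guided by critic feedback."""
    image = generate(prompt, None, [])          # initial attempt
    for _ in range(max_steps):
        feedback = critique(prompt, image)      # list of unmet constraints
        if not feedback:                        # critic is satisfied: stop early
            break
        image = generate(prompt, image, feedback)
    return image


def toy_generate(prompt, previous, feedback):
    """Hypothetical generator: fixes one reported problem per call."""
    if previous is None:
        return frozenset(prompt[:1])            # first try satisfies one constraint
    return previous | {feedback[0]}


def toy_critique(prompt, image):
    """Hypothetical critic: reports constraints the image does not satisfy."""
    return [c for c in prompt if c not in image]


prompt = ["red cube", "blue sphere", "cube left of sphere"]
final = iterative_refine(prompt, toy_generate, toy_critique, max_steps=4)
print(sorted(final))
```

The early exit when the critic reports nothing is a design choice of this sketch; a compute-matched comparison against parallel sampling, as in the abstract, would fix the total number of generator calls instead.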

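Title / Abstract (translated from Chinese)

Title: Iterative Refinement Improves Compositional Image Generation

Text-to-image (T2I) models have made remarkable progress, yet they still struggle with complex prompts that require handling multiple objects, relations, and attributes at once. Existing inference-time strategies, such as parallel sampling with verifiers or simply increasing the number of denoising steps, can improve prompt alignment but remain inadequate in richly compositional settings where many constraints must be satisfied. Inspired by the success of chain-of-thought reasoning in large language models, we propose an iterative test-time strategy in which a T2I model progressively refines its generations over multiple steps, guided by feedback from a vision-language model acting as the critic in the loop. Our approach is simple, requires no external tools or priors, and can be applied flexibly to a wide range of image generators and vision-language models. Empirically, our method yields consistent gains across benchmarks relative to compute-matched parallel sampling: a 16.9% improvement in all-correct rate on ConceptMix (k=7), a 13.8% improvement on T2I-CompBench (3D-Spatial category), and a 12.5% improvement on Visual Jenga scene decomposition. Beyond quantitative gains, iterative refinement produces more faithful images by decomposing complex prompts into sequential corrections; human evaluators preferred our method 58.7% of the time, versus 41.3% for the parallel baseline. Together, these findings highlight iterative self-correction as a broadly applicable principle for compositional image generation. Results and visualizations are available at https://iterative-img-gen.github.io/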

Summary

The paper addresses the challenge of generating complex images from text prompts by proposing an iterative refinement strategy. This method involves a text-to-image model refining its output across multiple steps, guided by feedback from a vision-language model. The approach shows consistent improvements across various benchmarks, with a 16.9% increase in the all-correct rate on ConceptMix, 13.8% on T2I-CompBench, and 12.5% on Visual Jenga scene decomposition. It also produces more faithful images, with human evaluators preferring the iterative method over the parallel baseline 58.7% of the time.

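The paper proposes an iterative refinement strategy to address the challenge of generating complex images from text prompts. In this method, a text-to-image model improves its generations step by step over multiple iterations, guided by feedback from a vision-language model. Experiments show consistent gains on benchmarks, including a 16.9% improvement in all-correct rate on ConceptMix and a 13.8% improvement on T2I-CompBench. Iterative refinement also produces more faithful images, with human evaluators preferring this method over the parallel baseline 58.7% of the time.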

Walk through Paintings: Egocentric World Models from Internet Priors

Authors: Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, Martial Hebert

First: 2026-01-21T18:59:32+00:00 · Latest: 2026-01-21T18:59:32+00:00

Abstract

What if a video generation model could not only imagine a plausible future, but the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks, requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80 percent over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.

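Title / Abstract (translated from Chinese)

Title: Walk through Paintings: Egocentric World Models from Internet Priors

What if a video generation model could not only imagine a plausible future but also accurately reflect how the world changes with each action? We address this question with the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that turns any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we reuse the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This lets the model follow actions faithfully while preserving realism and strong generalization. Our approach extends naturally across embodiments and action spaces, from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric dynamics driven by joint angles is substantially harder. The model produces coherent rollouts for both navigation and manipulation tasks with only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80% over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.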

Summary

The research aims to develop a method for generating accurate and controllable future predictions by transforming pretrained video diffusion models into action-conditioned world models. The Egocentric World Model (EgoWM) repurposes Internet-scale video priors and injects motor commands through lightweight conditioning layers, enabling faithful action following while maintaining realism and generalization across various embodiments. Key findings include improved Structural Consistency Score by up to 80 percent over previous models, lower inference latency, and robust generalization to unseen environments, such as navigating inside paintings.

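The research introduces the Egocentric World Model (EgoWM) so that video generation models can predict the correct future scene conditioned on actions. EgoWM injects motor commands through lightweight conditioning layers, converting a pretrained video diffusion model into an action-conditioned world model. This lets the model follow actions faithfully while preserving realism and strong generalization, and it applies across different embodiments and action spaces. The model produces coherent rollouts in navigation and manipulation tasks with only modest fine-tuning. Measured by the Structural Consistency Score (SCS), which evaluates physical correctness, EgoWM improves on the previous best models by up to 80%, with lower inference latency and robust generalization to unseen environments, including navigation inside paintings.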

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

Authors: Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt

First: 2026-01-21T18:59:22+00:00 · Latest: 2026-01-21T18:59:22+00:00

Comments: Project page: https://luxremix.github.io

Abstract

We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.

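Title / Abstract (translated from Chinese)

Title: LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

We present a novel approach to interactive light editing in indoor scenes from a single multi-view scene capture. The method uses a generative image-based light decomposition model that factorizes complex indoor illumination into its constituent light sources. This factorization allows each light source to be manipulated independently, specifically its state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure that the lighting decomposition propagates consistently across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, which provides real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting across diverse indoor scenes. We evaluate the method on both synthetic and real-world datasets and compare it quantitatively and qualitatively with state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.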

Summary

The research aims to enable interactive light editing in indoor scenes using a single multi-view capture. The method uses a generative image-based light decomposition model to separate complex indoor lighting into individual light sources, allowing independent control over their state (on/off), chromaticity, and intensity. The approach also includes multi-view lighting harmonization to keep the decomposition consistent across all views, and integrates it into a relightable 3D Gaussian splatting representation for real-time control. The results show highly photorealistic lighting decomposition and relighting across diverse indoor scenes, and the method is compared quantitatively and qualitatively with state-of-the-art techniques on synthetic and real-world datasets. Interactive demos and video results are available on the project page.

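The research aims to enable interactive light editing of indoor scenes from a single multi-view capture. It proposes a generative image-based light decomposition model that separates complex indoor illumination into individual light sources, whose state, chromaticity, and intensity can be adjusted independently. The method also includes multi-view lighting harmonization to keep lighting consistent across all views. Experiments show highly realistic lighting results across a variety of indoor scenes, with quantitative and qualitative comparisons against existing techniques.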

Rethinking Video Generation Model for the Embodied World

Authors: Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

First: 2026-01-21T18:59:18+00:00 · Latest: 2026-01-21T18:59:18+00:00

Comments: Github: https://github.com/DAGroup-PKU/ReVidgen/ Project website: https://dagroup-pku.github.io/ReVidgen.github.io/

Abstract

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
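For readers unfamiliar with the agreement statistic quoted above, here is a minimal pure-Python sketch of the Spearman rank correlation: rank both score lists (averaging ranks over ties), then take the Pearson correlation of the ranks. The score lists are hypothetical and are not RBench data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                   # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical benchmark scores and human ratings for five models.
benchmark_scores = [0.61, 0.48, 0.75, 0.52, 0.90]
human_ratings = [3.4, 2.1, 4.0, 2.9, 4.8]
print(spearman(benchmark_scores, human_ratings))
```

Because only ranks matter, a benchmark can reach a high Spearman coefficient whenever it orders models the way human raters do, even if the two score scales differ.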

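Title / Abstract (translated from Chinese)

Title: Rethinking Video Generation Model for the Embodied World

Video generation models have significantly advanced embodied intelligence, opening new possibilities for generating diverse robot data that capture perception, reasoning, and action. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparison and progress. To address this gap, we introduce RBench, a comprehensive robotics benchmark designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. An evaluation of 25 representative models reveals significant deficiencies in generating physically realistic robot behaviors. In addition, the benchmark reaches a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the lens needed to identify these deficiencies, achieving physical realism requires going beyond evaluation to address the severe shortage of high-quality training data. Based on these insights, we introduce a refined four-stage data pipeline that produces RoVid-X, the largest open-source robotic dataset for video generation, with 4 million annotated video clips covering thousands of tasks and enriched with comprehensive physical property annotations. Together, this ecosystem of evaluation and data lays a solid foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.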

Summary

The paper addresses the challenge of generating high-quality videos that accurately reflect real-world robotic interactions, introducing RBench, a comprehensive robotics benchmark. RBench evaluates robot-oriented video generation across five task domains and four distinct embodiments, assessing both task-level correctness and visual fidelity. Evaluation of 25 models revealed significant deficiencies in generating physically realistic robot behaviors, and RBench achieved a high Spearman correlation coefficient of 0.96 with human evaluations. The authors then introduced RoVid-X, a large open-source robotic dataset with 4 million annotated video clips, to address the critical shortage of high-quality training data for video generation models.

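The paper addresses the challenge of generating high-quality videos that accurately reflect real-world robotic interactions and introduces RBench, a comprehensive robotics benchmark. RBench evaluates robot-oriented video generation across five task domains and four robot embodiments, assessing task-level correctness and visual fidelity. An evaluation of 25 models shows significant shortcomings in generating physically realistic robot behaviors, and RBench reaches a Spearman correlation coefficient of 0.96 with human evaluations. The authors then introduce RoVid-X, a large open-source robotic dataset with 4 million annotated video clips, to address the shortage of high-quality training data for video generation models.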

StableWorld: Towards Stable and Consistent Long Interactive Video Generation

Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Ziwei Liu, Chenyang Si

First: 2026-01-21T18:59:02+00:00 · Latest: 2026-01-21T18:59:02+00:00

Comments: 17 pages, 21 figures

Abstract

In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic and controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, often leading to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we first investigate the underlying causes of instability and identify that the major source of error accumulation originates from the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building upon this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld effectively prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation. Promising results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, demonstrate that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.
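The idea of filtering degraded frames out of the conditioning context can be illustrated with a toy sketch. The abstract does not specify how degradation or geometric consistency is measured, so everything here is an assumption: frames are short lists of numbers, `toy_consistency` is a hypothetical stand-in score, and the threshold is arbitrary. This is not the StableWorld algorithm itself.

```python
def evict_degraded_frames(context, reference, consistency, threshold):
    """Keep only context frames that stay consistent with a clean reference.

    `consistency` is a caller-supplied score in (0, 1]; frames scoring
    below `threshold` are evicted from the context.
    """
    return [frame for frame in context if consistency(reference, frame) >= threshold]


def toy_consistency(reference, frame):
    """Hypothetical score: 1.0 for identical frames, lower as they diverge."""
    diff = sum(abs(a - b) for a, b in zip(reference, frame)) / len(reference)
    return 1.0 / (1.0 + diff)


# Toy data: the reference plays the role of the initial clean state.
reference = [0.0, 0.0, 0.0]
context = [
    [0.0, 0.1, 0.0],   # close to the reference
    [0.5, 0.4, 0.6],   # drifted
    [0.1, 0.0, 0.1],   # close to the reference
    [2.0, 1.5, 1.8],   # heavily degraded
]
kept = evict_degraded_frames(context, reference, toy_consistency, threshold=0.8)
print(kept)
```

In an autoregressive generator, the kept frames would serve as conditioning for the next step, so drifted frames stop feeding their errors forward.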

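Title / Abstract (translated from Chinese)

Title: StableWorld: Towards Stable and Consistent Long Interactive Video Generation

In this paper, we explore the overlooked challenge of stability and temporal consistency in interactive video generation, which synthesizes dynamic, controllable video worlds through interactive behaviors such as camera movements and text prompts. Despite remarkable progress in world modeling, current methods still suffer from severe instability and temporal degradation, which often lead to spatial drift and scene collapse during long-horizon interactions. To better understand this issue, we first investigate the root causes of instability and find that the main source of error accumulation lies within the same scene, where generated frames gradually deviate from the initial clean state and propagate errors to subsequent frames. Building on this observation, we propose a simple yet effective method, StableWorld, a Dynamic Frame Eviction Mechanism. By continuously filtering out degraded frames while retaining geometrically consistent ones, StableWorld prevents cumulative drift at its source, improving the stability and temporal consistency of interactive generation. Results on multiple interactive video models, e.g., Matrix-Game, Open-Oasis, and Hunyuan-GameCraft, show that StableWorld is model-agnostic and can be applied to different interactive video generation frameworks to substantially improve stability, temporal consistency, and generalization across diverse interactive scenarios.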

Summary

The paper addresses the issue of instability and temporal inconsistency in long-horizon interactive video generation. It proposes StableWorld, a Dynamic Frame Eviction Mechanism that filters out degraded frames while retaining geometrically consistent ones, to prevent cumulative drift. Experiments on Matrix-Game, Open-Oasis, and Hunyuan-GameCraft show that StableWorld improves stability, temporal consistency, and generalization across different interactive scenarios.

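The paper examines stability and temporal consistency in interactive video generation, where current methods often exhibit spatial drift and scene collapse. The authors propose StableWorld, a Dynamic Frame Eviction Mechanism that filters out degraded frames while retaining geometrically consistent ones, which prevents cumulative drift and improves stability and temporal consistency across interactive video generation frameworks such as Matrix-Game, Open-Oasis, and Hunyuan-GameCraft.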

MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

Authors: Christoph Bartmann, Johannes Schimunek, Mykyta Ielanskyi, Philipp Seidl, Günter Klambauer, Sohvi Luukkonen

First: 2026-01-21T18:58:01+00:00 · Latest: 2026-01-21T18:58:01+00:00

Abstract

A molecule's properties are fundamentally determined by its composition and structure encoded in its molecular graph. Thus, reasoning about molecular properties requires the ability to parse and understand the molecular graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.

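Title / Abstract (translated from Chinese)

Title: MolecularIQ: Characterizing Chemical Reasoning Capabilities Through Symbolic Verification on Molecular Graphs

A molecule's properties are fundamentally determined by the composition and structure encoded in its molecular graph. Reasoning about molecular properties therefore requires the ability to parse and understand the molecular graph. Large language models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. Most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.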

Summary

The research aims to evaluate the chemical reasoning capabilities of Large Language Models (LLMs) by focusing on symbolically verifiable tasks in molecular structure reasoning. The method involves creating MolecularIQ, a benchmark that specifically tests the ability to parse and understand molecular graphs. Key findings show that models exhibit specific strengths and weaknesses in different tasks and molecular structures, highlighting the need for models that can reason accurately over molecular structures.

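The research evaluates the chemical reasoning capabilities of large language models (LLMs) by introducing the MolecularIQ benchmark, which focuses on symbolically verifiable tasks. The method is to design tasks whose answers can be verified by symbolic computation on molecular graphs. Key findings show that LLMs exhibit specific strengths and weaknesses when reasoning over molecular structure, which points to directions for improving models.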

RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

Authors: Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani

First: 2026-01-21T18:55:51+00:00 · Latest: 2026-01-21T18:55:51+00:00

Comments: Project page: https://rayrope.github.io/

Abstract

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention with multi-frequency similarity, and can be adaptive to the geometry of the underlying scene. We find that prior (absolute or relative) encoding schemes for multi-view attention do not meet the above desiderata, and present RayRoPE to address this gap. RayRoPE represents patch positions based on associated rays but leverages a predicted point along the ray instead of the direction for a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes query-frame projective coordinates for computing multi-frequency similarity. Lastly, as the 'predicted' 3D point along a ray may not be precise, RayRoPE presents a mechanism to analytically compute the expected position encoding under uncertainty. We validate RayRoPE on the tasks of novel-view synthesis and stereo depth estimation and show that it consistently improves over alternate position encoding schemes (e.g. 15% relative improvement on LPIPS in CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, resulting in even larger gains over alternatives that cannot positionally encode this information.
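As background for the rotary-style encoding in the paper's name, the sketch below shows standard 1-D rotary position embedding (RoPE), not RayRoPE itself. It illustrates the property such encodings build on: the query-key dot product depends only on the offset between positions, at several frequencies. The vectors and positions are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Standard 1-D RoPE: rotate consecutive pairs by position-dependent angles.

    Pair i (elements 2i and 2i+1) is rotated by position * base**(-2i/dim),
    so each pair carries a different frequency.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors (even dimension required).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions have the same offset (3), so the scores match.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(score_near, score_far)
```

According to the abstract, RayRoPE replaces the scalar position used here with query-frame projective coordinates of a predicted point along each patch's ray, which is what yields SE(3) invariance rather than invariance to shifts of a 1-D index.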

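Title / Abstract (translated from Chinese)

Title: RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows SE(3)-invariant attention, and can adapt to the geometry of the underlying scene. We find that existing (absolute or relative) encoding schemes for multi-view attention do not meet these requirements, and we present RayRoPE to close this gap. RayRoPE represents patch positions based on their associated rays, but uses a predicted point along the ray rather than the direction to obtain a geometry-aware encoding. To achieve SE(3) invariance, RayRoPE computes projective coordinates in the query frame for computing multi-frequency similarity. Finally, because the "predicted" 3D point along a ray may be imprecise, RayRoPE provides a mechanism to compute the expected position encoding under uncertainty analytically. We validate RayRoPE on novel-view synthesis and stereo depth estimation and show that it consistently improves over alternative position encoding schemes (e.g., a 15% relative improvement in LPIPS on CO3D). We also show that RayRoPE can seamlessly incorporate RGB-D input, yielding even larger gains over alternatives that cannot positionally encode this information.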

Summary

The paper addresses the need for effective positional encodings in multi-view transformers to process input images, focusing on unique patch encoding, SE(3)-invariance, and adaptability to scene geometry. It introduces RayRoPE, which represents patch positions using rays and a predicted point along the ray, ensuring geometry-aware encoding. RayRoPE computes query-frame projective coordinates for multi-frequency similarity and analytically computes position encodings under uncertainty. Experiments on novel-view synthesis and stereo depth estimation show RayRoPE outperforms other schemes, with a 15% relative improvement on LPIPS in CO3D. Additionally, RayRoPE can incorporate RGB-D input, leading to further improvements over alternatives.

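The paper aims to provide effective positional encodings for multi-view transformers that process input images, focusing on unique patch encoding, SE(3)-invariant attention, and geometry awareness. It introduces RayRoPE, which represents patch positions through rays and encodes a predicted point along each ray, and which computes projective coordinates for multi-frequency similarity to achieve SE(3) invariance. Experiments on novel-view synthesis and stereo depth estimation show a 15% relative improvement in LPIPS over other position encoding schemes. RayRoPE can also seamlessly incorporate RGB-D input, further outperforming alternatives that cannot encode this information.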

Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions

Authors: Yiran Hu, Huanghai Liu, Chong Wang, Kunran Li, Tien-Hsuan Wu, Haitao Li, Xinran Xu, Siqing Huo, Weihang Su, Ning Zheng, Siyuan Zheng, Qingyao Ai, Yun Liu, Renjun Bian, Yiqun Liu, Charles L. A. Clarke, Weixing Shen, Ben Kao

First: 2026-01-21T18:51:37+00:00 · Latest: 2026-01-21T18:51:37+00:00

Abstract

Large language models (LLMs) are being increasingly integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. While LLMs show strong potential in handling legal knowledge and tasks, their deployment in real-world legal settings raises critical concerns beyond surface-level accuracy, involving the soundness of legal reasoning processes and trustworthy issues such as fairness and reliability. Systematic evaluation of LLM performance in legal tasks has therefore become essential for their responsible adoption. This survey identifies key challenges in evaluating LLMs for legal tasks grounded in real-world legal practice. We analyze the major difficulties involved in assessing LLM performance in the legal domain, including outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review and categorize existing evaluation methods and benchmarks according to their task design, datasets, and evaluation metrics. We further discuss the extent to which current approaches address these challenges, highlight their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in legal domains.

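Title / Abstract (translated from Chinese)

Title: Evaluation of Large Language Models in Legal Applications: Challenges, Methods, and Future Directions

Large language models (LLMs) are increasingly being integrated into legal applications, including judicial decision support, legal practice assistance, and public-facing legal services. Although LLMs show strong potential for handling legal knowledge and tasks, deploying them in real-world legal settings raises critical concerns beyond surface-level accuracy, involving the soundness of legal reasoning and trustworthiness issues such as fairness and reliability. Systematic evaluation of LLM performance on legal tasks has therefore become essential for responsible adoption. This survey identifies the key challenges of evaluating LLMs on legal tasks grounded in real-world legal practice. We analyze the main difficulties in assessing LLM performance in the legal domain, including outcome correctness, reasoning reliability, and trustworthiness. Building on these challenges, we review existing evaluation methods and benchmarks and categorize them by task design, datasets, and evaluation metrics. We further discuss how far current approaches address these challenges, point out their limitations, and outline future research directions toward more realistic, reliable, and legally grounded evaluation frameworks for LLMs in the legal domain.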

Summary

This survey examines how large language models (LLMs) are evaluated in legal applications, where concerns extend beyond surface-level accuracy to the soundness of legal reasoning and to trustworthiness. It identifies key difficulties in assessing LLMs, including outcome correctness, reasoning reliability, and trustworthiness, and reviews existing evaluation methods and benchmarks by task design, datasets, and evaluation metrics. It highlights the limitations of current approaches and outlines future directions toward more realistic, reliable, and legally grounded evaluation frameworks.

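This survey examines how large language models (LLMs) are evaluated in legal applications, focusing on challenges such as the soundness of legal reasoning and trustworthiness. It identifies the key difficulties in evaluating LLMs, including outcome correctness and reasoning reliability, and reviews existing evaluation methods and benchmarks. It points out the limitations of current methods and proposes research directions toward more realistic and legally grounded evaluation frameworks.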

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

Authors: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

First: 2025-12-22T18:59:34+00:00 · Latest: 2026-01-21T18:48:54+00:00

Comments: Project codebase: https://github.com/junzeye/validate-medcalc-labels

Abstract

We examine the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and find that a substantial fraction are clinically misaligned. We introduce a phased stewardship procedure to amplify the positive impact of physician experts' feedback and then demonstrate, via a controlled RL experiment, how uncaught label bias can materially affect downstream LLM evaluation and alignment. Our results demonstrate that partially LLM-generated labels can embed systemic errors that distort not only evaluation but also downstream model alignment. By adopting a hybrid oversight system, we can prioritize scarce expert feedback to maintain benchmarks as living, clinically-grounded documents. Ensuring this alignment is a prerequisite for the safe deployment of LLMs in high-stakes medical decision support.

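Title / Abstract (translated from Chinese)

Title: Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight

We examine the reliability of a widely used clinical AI benchmark whose reference labels were partially generated by LLMs, and find that a substantial fraction of them are clinically misaligned. We introduce a phased stewardship procedure to amplify the positive impact of physician experts' feedback, and then show through a controlled RL experiment how uncaught label bias can materially affect downstream LLM evaluation and alignment. Our results show that partially LLM-generated labels can embed systemic errors that distort not only evaluation but also downstream model alignment. By adopting a hybrid oversight system, we can prioritize scarce expert feedback to maintain benchmarks as living, clinically grounded documents. Ensuring this alignment is a prerequisite for safely deploying LLMs in high-stakes medical decision support.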

Summary

The study addresses the reliability of a clinical AI benchmark where reference labels were partly generated by LLMs, finding significant clinical misalignment. It introduces a phased stewardship procedure involving physician oversight to correct these errors. Through a controlled RL experiment, the research shows that uncorrected label bias can negatively impact LLM evaluation and alignment. The findings highlight the importance of hybrid oversight to maintain benchmarks as accurate and clinically grounded documents, essential for safe LLM deployment in medical decision support.

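The study examines the reliability of a clinical AI benchmark whose labels were partially generated by LLMs and finds significant clinical misalignment. In response, it introduces a phased stewardship procedure involving physician oversight. The study shows that unaddressed label bias affects LLM evaluation and alignment. Using a hybrid oversight system, it prioritizes expert feedback to keep the benchmark clinically accurate and aligned, which is essential for safely deploying LLMs in medical decision support.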

DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration

Authors: Dominik Rößle, Xujun Xie, Adithya Mohan, Venkatesh Thirugnana Sambandham, Daniel Cremers, Torsten Schön

First: 2026-01-21T18:41:05+00:00 · Latest: 2026-01-21T18:41:05+00:00

Comments: Accepted to the IEEE Intelligent Vehicles Symposium 2026. For code and dataset, see https://github.com/cvims/DrivIng

Abstract

Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.

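Title / Abstract (translated from Chinese)

Title: DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration

Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, which limits systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluation. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. The dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured during day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Beyond the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while allowing realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.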

Summary

DrivIng is a large-scale multimodal driving dataset that includes a complete geo-referenced digital twin of an approximately 18 km route covering urban, suburban, and highway segments. It provides continuous recordings from six RGB cameras, one LiDAR, and high-precision localization, annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes. The dataset supports systematic testing and sim-to-real evaluation, including a 1-to-1 transfer of real traffic into simulation for realistic scenario testing. The authors benchmark state-of-the-art perception models on DrivIng and release the dataset, digital twin, HD map, and codebase.

研究旨在通过提供一个包含完整数字孪生的大规模多模态数据集来开发自动驾驶的稳健感知算法。DrivIng数据集包括六台RGB相机、一台LiDAR和高精度定位的连续记录，覆盖各种驾驶条件。关键发现表明，DrivIng能够支持现实和灵活的场景测试，促进与最先进的感知模型的模拟到现实的评估和可重复研究。

FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

Authors: Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng, Levent Burak Kara, Joaquim Jorge, Qun-Ce Xu

First: 2026-01-21T18:32:27+00:00 · Latest: 2026-01-21T18:32:27+00:00

Comments: Under Review

Abstract

Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.

中文标题/摘要

标题：FlowSSC：通过一步潜在扩散实现通用生成单目语义场景完成

从单目RGB图像中完成语义场景（SSC）是一项基本但具有挑战性的任务，因为从单个视角推断被遮挡的3D几何形状存在固有的不确定性。尽管前馈方法已经取得进展，但在生成被遮挡区域的合理细节和保持物体基本空间关系方面仍然存在困难。这种对整个3D空间的准确生成推理能力在实际应用中至关重要。在本文中，我们提出了FlowSSC，这是第一个直接应用于单目语义场景完成的生成框架。FlowSSC将SSC任务视为条件生成问题，并可以无缝集成到现有的前馈SSC方法中，显著提升其性能。为了在不牺牲质量的情况下实现实时推理，我们引入了捷径流匹配机制，该机制在紧凑的三平面潜在空间中操作。与需要数百步的标准扩散模型不同，我们的方法利用捷径机制在一步中实现高保真生成，使其实用部署在自主系统中成为可能。在SemanticKITTI上的大量实验表明，FlowSSC达到了最先进的性能，显著优于现有基线。

Summary / 总结

FlowSSC is a generative framework for monocular semantic scene completion that addresses the challenge of inferring occluded 3D geometry from a single image. It treats the task as a conditional generation problem and integrates with existing feed-forward methods to enhance their performance. By using a shortcut mechanism in a compact triplane latent space, FlowSSC can generate high-fidelity scenes in a single step, achieving state-of-the-art performance on SemanticKITTI and outperforming existing methods.

FlowSSC 是一种用于单目语义场景完成的生成框架，旨在从单张图像中推断被遮挡的 3D 几何结构。它将 SSC 任务视为条件生成问题，并与现有的前馈方法集成以提升其性能。通过在紧凑的三平面潜空间中使用快捷机制，FlowSSC 能够在单步中实现高保真生成，超越现有基线在 SemanticKITTI 上的表现，并支持自主系统的实时推理。

Diffusion In Diffusion: Reclaiming Global Coherence in Semi-Autoregressive Diffusion

Authors: Linrui Ma, Yufei Cui, Kai Han, Yunhe Wang

First: 2026-01-20T05:00:26+00:00 · Latest: 2026-01-21T18:21:39+00:00

Comments: Work In Progress

Abstract

One of the most compelling features of global discrete diffusion language models is their global bidirectional contextual capability. However, existing block-based diffusion studies tend to introduce autoregressive priors, which, while offering benefits, can cause models to lose this global coherence at the macro level. To regain global contextual understanding while preserving the advantages of the semi-autoregressive paradigm, we propose Diffusion in Diffusion, a 'draft-then-refine' framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models. Our approach first employs block diffusion to generate rapid drafts using small blocks, then refines these drafts through global bidirectional diffusion with a larger bidirectional receptive field. We utilize snapshot confidence remasking to identify the most critical tokens that require modification, and apply mix-scale training to expand the block diffusion model's global capabilities. Empirical results demonstrate that our approach sets a new benchmark for discrete diffusion models on the OpenWebText dataset. Using only 26% of the fine-tuning budget of baseline models, we reduce generative perplexity from 25.7 to 21.9, significantly narrowing the performance gap with autoregressive models.
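The refine stage depends on choosing which draft tokens to mask again. The abstract does not define snapshot confidence remasking, so the sketch below shows only generic low-confidence remasking and is not the authors' method: given per-token confidences (assumed here to be recorded when the draft was produced), the `k` least confident positions are replaced with a mask token for the global pass to rewrite. The `MASK` string, the function name, and the sample values are all hypothetical.

```python
MASK = "<mask>"  # hypothetical mask symbol, chosen for illustration


def remask_lowest_confidence(tokens, confidences, k):
    """Return a copy of `tokens` with the k least-confident positions masked."""
    # Sort positions by confidence, lowest first (sorted() is stable on ties).
    order = sorted(range(len(tokens)), key=lambda i: confidences[i])
    chosen = set(order[:k])
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


# Toy draft with made-up confidences.
draft = ["the", "cat", "sat", "on", "mat"]
conf = [0.99, 0.42, 0.95, 0.97, 0.31]
remasked = remask_lowest_confidence(draft, conf, k=2)
print(remasked)
```

A refinement model with bidirectional context would then fill only the masked positions, leaving the confident tokens untouched.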

中文标题/摘要

标题：扩散中的扩散：在半自回归扩散中重新获得全局一致性

全局离散扩散语言模型最引人注目的特征之一是其全局双向上下文能力。然而，现有的块基扩散研究倾向于引入自回归先验，虽然这提供了某些优势，但也会导致模型在宏观层面上失去全局一致性。为了在保持半自回归范式优势的同时重新获得全局上下文理解，我们提出了扩散中的扩散，这是一种“先草拟后润色”的框架，旨在克服块扩散模型固有的不可逆性和短视问题。我们的方法首先使用小块的块扩散快速生成草稿，然后通过具有更大双向感受野的全局双向扩散对这些草稿进行润色。我们利用快照置信度重新遮盖来识别需要修改的最关键令牌，并采用多尺度训练来扩展块扩散模型的全局能力。实验证明，我们的方法在OpenWebText数据集上为离散扩散模型设定了新的基准。仅使用基线模型微调预算的26%，我们使生成困惑度从25.7降低到21.9，显著缩小了与自回归模型的性能差距。

Summary / 总结

The paper proposes Diffusion in Diffusion, a 'draft-then-refine' framework to address the loss of global coherence in block-based diffusion models. It uses block diffusion for rapid drafts and global bidirectional diffusion for refinement, along with snapshot confidence remasking and mix-scale training. The approach improves generative perplexity on the OpenWebText dataset, reducing it from 25.7 to 21.9 with only 26% of the fine-tuning budget of baseline models, narrowing the performance gap with autoregressive models.

论文提出了一种名为Diffusion in Diffusion的“先草拟后润色”框架，以解决块基扩散模型中的全局一致性损失问题。该方法使用块扩散进行快速草稿生成，并通过全局双向扩散进行润色，利用快照置信度重新遮盖和多尺度训练来增强全局上下文理解。该方法在OpenWebText数据集上达到了新的基准，将生成困惑度从25.7降低到21.9，仅使用基线模型26%的微调预算。

Tracing 3D Anatomy in 2D Strokes: A Multi-Stage Projection Driven Approach to Cervical Spine Fracture Identification

Authors: Fabi Nahian Madhurja, Rusab Sarmun, Muhammad E. H. Chowdhury, Adam Mushtak, Israa Al-Hashimi, Sohaib Bassam Zoghoul

First: 2026-01-21T18:15:47+00:00 · Latest: 2026-01-21T18:15:47+00:00

Abstract

Cervical spine fractures are critical medical conditions requiring precise and efficient detection for effective clinical management. This study explores the viability of 2D projection-based vertebra segmentation for vertebra-level fracture detection in 3D CT volumes, presenting an end-to-end pipeline for automated analysis of cervical vertebrae (C1-C7). By approximating a 3D volume through optimized 2D axial, sagittal, and coronal projections, regions of interest are identified using the YOLOv8 model from all views and combined to approximate the 3D cervical spine area, achieving a 3D mIoU of 94.45 percent. This projection-based localization strategy reduces computational complexity compared to traditional 3D segmentation methods while maintaining high performance. It is followed by a DenseNet121-Unet-based multi-label segmentation leveraging variance- and energy-based projections, achieving a Dice score of 87.86 percent. Strategic approximation of 3D vertebral masks from these 2D segmentation masks enables the extraction of individual vertebra volumes. The volumes are analyzed for fractures using an ensemble of 2.5D Spatio-Sequential models incorporating both raw slices and projections per vertebra for complementary evaluation. This ensemble achieves vertebra-level and patient-level F1 scores of 68.15 and 82.26, and ROC-AUC scores of 91.62 and 83.04, respectively. We further validate our approach through an explainability study that provides saliency map visualizations highlighting anatomical regions relevant for diagnosis, and an interobserver variability analysis comparing our model's performance with expert radiologists, demonstrating competitive results.
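The localization step combines 2D detections from three projections into a 3D region and scores it with 3D IoU. The sketch below is one possible way to do that, not the authors' implementation. It assumes each view yields an axis-aligned box, that the volume is indexed as (z, y, x), and that the two estimates available for each axis are merged by intersection. The dictionary layout, function names, and all numbers are hypothetical.

```python
def intersect(a, b):
    """Intersect two half-open intervals (lo, hi); return None if empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def box3d_from_views(axial, coronal, sagittal):
    """Merge per-view 2D boxes into a 3D box ((z), (y), (x)).

    Assumed conventions: the axial view sees (y, x), the coronal view
    sees (z, x), and the sagittal view sees (z, y).
    """
    z = intersect(coronal["z"], sagittal["z"])
    y = intersect(axial["y"], sagittal["y"])
    x = intersect(axial["x"], coronal["x"])
    if z is None or y is None or x is None:
        return None  # the three views disagree completely
    return (z, y, x)


def volume(box):
    """Volume of a 3D box given as three (lo, hi) intervals."""
    v = 1
    for lo, hi in box:
        v *= hi - lo
    return v


def iou3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1
    for ia, ib in zip(a, b):
        overlap = intersect(ia, ib)
        if overlap is None:
            return 0.0
        inter *= overlap[1] - overlap[0]
    return inter / (volume(a) + volume(b) - inter)


# Made-up detections from the three views and a made-up ground-truth box.
pred = box3d_from_views(
    axial={"y": (10, 50), "x": (20, 60)},
    coronal={"z": (0, 30), "x": (22, 58)},
    sagittal={"z": (2, 32), "y": (12, 48)},
)
truth = ((0, 30), (10, 50), (20, 60))
score = iou3d(pred, truth)
```

Intersection is the conservative choice; averaging or taking the union of the two per-axis estimates would trade precision for recall.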

中文标题/摘要

标题：在二维笔划中追踪三维解剖结构：一种多阶段投影驱动的颈椎骨折识别方法

颈椎骨折是需要精确和高效检测的严重医疗状况，对于临床管理至关重要。本研究探讨了基于二维投影的椎骨分割在三维CT体积中椎骨水平骨折检测的可行性，提出了一种端到端的自动化分析颈椎（C1-C7）的管道。通过优化的2D轴向、矢状和冠状投影近似三维体积，使用YOLOv8模型从所有视图中识别感兴趣区域并结合以近似三维颈椎区域，实现了94.45%的3D mIoU。这种基于投影的定位策略与传统的三维分割方法相比，降低了计算复杂性，同时保持了高性能。随后使用DenseNet121-Unet基于多标签分割利用方差和能量投影，实现了87.86%的Dice分数。从这些二维分割掩码近似三维椎骨掩码，使个体椎骨体积的提取成为可能。使用结合原始切片和每个椎骨投影的2.5D时空序列模型的集成进行骨折分析，实现了椎骨水平和患者水平的F1分数分别为68.15和82.26，以及ROC-AUC分数分别为91.62和83.04。我们进一步通过解释性研究验证了该方法，提供了突出诊断相关解剖区域的显著性图可视化，并通过与专家放射科医生的性能比较分析了观察者间变异性，展示了具有竞争力的结果。

Summary / 总结

This study aims to improve the detection of cervical spine fractures by developing an end-to-end pipeline using optimized 2D projections and a multi-stage segmentation approach. The method involves approximating 3D volumes through 2D projections and using YOLOv8 for localization, followed by DenseNet121-Unet for segmentation, achieving a Dice score of 87.86%. The ensemble of 2.5D Spatio-Sequential models for fracture detection yields F1 scores of 68.15 and 82.26 at vertebra and patient levels, respectively, and ROC-AUC scores of 91.62 and 83.04. The approach also includes an explainability study and interobserver variability analysis, showing competitive results compared to expert radiologists.

该研究旨在通过开发基于2D投影的椎体分割端到端管道来提高颈椎骨折的检测。该方法使用YOLOv8进行感兴趣区域定位（3D mIoU为94.45%），并使用DenseNet121-Unet进行多标签分割（Dice分数为87.86%）。2.5D时空序列模型的集成用于骨折检测，椎骨水平和患者水平的F1分数分别为68.15和82.26，ROC-AUC分数分别为91.62和83.04；观察者间变异性分析显示其结果与专家放射科医生相比具有竞争力。

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Authors: Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

First: 2026-01-21T17:56:59+00:00 · Latest: 2026-01-21T17:56:59+00:00

Comments: Website: https://progresslm.github.io/ProgressLM/

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails.

中文标题/摘要

标题：PROGRESSLM：迈向视觉语言模型中的进度推理

估计任务进度需要推理长时动态，而不仅仅是识别静态视觉内容。尽管现代视觉语言模型（VLMs）在描述可见内容方面表现出色，但尚不清楚它们是否能够从部分观察中推断出任务的进展情况。为此，我们引入了Progress-Bench，用于系统评估VLMs中的进度推理。除了基准测试外，我们还通过无训练提示和基于精心构建的数据集ProgressLM-45K的训练方法，进一步探索了受人类启发的两阶段进度推理范式。在14个VLMs上的实验表明，大多数模型尚未准备好进行任务进度估计，表现出对演示模态和视角变化的敏感性，以及对无法回答的情况处理不佳。虽然无训练提示强制结构化的进度推理仅能获得有限且模型依赖的收益，但基于训练的ProgressLM-3B即使在小型模型规模下也能实现一致的改进，尽管其训练任务集与评估任务集完全不重叠。进一步的分析揭示了特征错误模式，并阐明了进度推理何时以及为何成功或失败。

Summary / 总结

The research aims to evaluate the ability of Vision-Language Models (VLMs) to estimate task progress by introducing Progress-Bench, a benchmark for progress reasoning. The study explores a two-stage human-inspired approach through both training-free prompting and a training-based method using the curated dataset ProgressLM-45K. Experiments on 14 VLMs reveal that most models struggle with task progress estimation, showing sensitivity to changes in demonstration modality and viewpoint, and difficulty with unanswerable cases. The training-based ProgressLM-3B model shows consistent improvements, even at a small scale, despite being trained on a task set disjoint from the evaluation tasks.

研究旨在通过引入Progress-Bench基准来评估视觉语言模型（VLM）在任务进度估计方面的能力。研究通过免训练提示和基于精心构建的ProgressLM-45K数据集的训练方法，探索了受人类启发的两阶段进度推理范式。在14个VLM上的实验结果显示，大多数模型在任务进度估计方面存在困难，表现出对演示模态和视角变化的敏感性，以及处理不可回答情况的困难。尽管训练任务集与评估任务集完全不重叠，基于训练的ProgressLM-3B模型即使在小规模模型上仍能表现出一致的改进。

ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation

Authors: Hanlei Guo, Jiahao Shao, Xinya Chen, Xiyang Tan, Sheng Miao, Yujun Shen, Yiyi Liao

First: 2026-01-21T17:53:21+00:00 · Latest: 2026-01-21T17:53:21+00:00

Abstract

Recent advancements in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods relying solely on 3D diffusion models tend to suffer a degradation in appearance details, while those utilizing only 2D diffusion models typically compromise camera controllability. To overcome this limitation, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can be optionally conditioned by specifying inputs such as 3D bounding boxes, road maps, or text prompts. Then, we train a 2D video diffusion model to enhance appearance details conditioned on rendered images from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates desired scenes based on input conditions and successfully adheres to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.

中文标题/摘要

标题：ScenDi: 3D到2D场景扩散级联方法用于城市生成

使用扩散模型生成3D物体的最新进展取得了显著成功，但生成逼真的3D城市场景仍然具有挑战性。现有方法仅依赖3D扩散模型往往会损失外观细节，而仅使用2D扩散模型的方法通常会牺牲相机可控性。为克服这一限制，我们提出了一种名为ScenDi的方法，该方法结合了3D和2D扩散模型以生成城市场景。我们首先训练一个3D潜在扩散模型以生成3D高斯分布，从而能够以相对较低的分辨率渲染图像。为了实现可控合成，3DGS生成过程可以有条件地指定输入，如3D边界框、道路图或文本提示。然后，我们训练一个2D视频扩散模型，以增强基于3D高斯分布渲染图像的外观细节。通过利用粗略的3D场景作为2D视频扩散的指导，ScenDi可以根据输入条件生成所需的场景，并成功遵循准确的相机轨迹。在两个具有挑战性的现实世界数据集Waymo和KITTI-360上的实验表明了我们方法的有效性。

Summary / 总结

The research aims to generate realistic 3D urban scenes by integrating 3D and 2D diffusion models. ScenDi first uses a 3D latent diffusion model to generate 3D Gaussians, which are then rendered into low-resolution images. These images are further enhanced by a 2D video diffusion model, conditioned on the rendered images, to improve appearance details. Experiments on Waymo and KITTI-360 datasets show that ScenDi can generate scenes with accurate camera trajectories and detailed appearance, overcoming limitations of previous methods.

研究旨在利用扩散模型生成逼真的3D城市场景。ScenDi结合了3D和2D扩散模型：首先使用3D潜扩散模型生成3D高斯分布，该过程可以由3D边界框、道路图或文本提示进行条件化。然后，基于渲染图像，2D视频扩散模型增强外观细节。在Waymo和KITTI-360数据集上的实验表明，ScenDi能够生成具有准确相机轨迹和详细外观的场景。

QueStER: Query Specification for Generative keyword-based Retrieval

Authors: Arthur Satouf, Yuxuan Zong, Habiboulaye Amadou-Boubacar, Pablo Piantanida, Benjamin Piwowarski

Venue: eACL 2026

First: 2025-11-07T15:01:38+00:00 · Latest: 2026-01-21T17:37:08+00:00

Abstract

Generative retrieval (GR) differs from the traditional index-then-retrieve pipeline by storing relevance in model parameters and generating retrieval cues directly from the query, but it can be brittle out of domain and expensive to scale. We introduce QueStER (QUEry SpecificaTion for gEnerative Keyword-Based Retrieval), which bridges GR and query reformulation by learning to generate explicit keyword-based search specifications. Given a user query, a lightweight LLM produces a keyword query that is executed by a standard retriever (BM25), combining the generalization benefits of generative query rewriting with the efficiency and scalability of lexical indexing. We train the rewriting policy with reinforcement learning techniques. Across in- and out-of-domain evaluations, QueStER consistently improves over BM25 and is competitive with neural IR baselines, while maintaining strong efficiency.
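The rewrite-then-retrieve flow can be illustrated with a small BM25 scorer. This is a minimal sketch under stated assumptions, not the authors' system: the keyword query is written by hand as a stand-in for the output of the lightweight LLM, the three documents are made up, and the scorer is a plain Okapi BM25 with the commonly used smoothed IDF and default-style parameters `k1=1.2`, `b=0.75`.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query_terms` with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "heart attack symptoms include chest pain",
    "stock market news today",
    "how to bake bread at home",
]
user_query = "what are signs my heart is in trouble"
# Hypothetical rewrite: in the paper an LLM would generate this keyword query.
keyword_query = ["heart", "attack", "symptoms", "chest", "pain"]

verbatim_scores = bm25_scores(user_query.lower().split(), docs)
keyword_scores = bm25_scores(keyword_query, docs)
```

Because the retriever stays purely lexical, the only learned component is the query generator, which is what keeps indexing and search cheap.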

中文标题/摘要

标题：QueStER：生成式关键词检索的查询规范

生成式检索（GR）与传统的索引后检索管道不同，它通过存储相关性在模型参数中并在查询中直接生成检索提示，但可能会在领域外表现脆弱且难以扩展。我们提出了QueStER（查询特定规范生成式关键词检索），它通过学习生成显式的关键词搜索规范来连接GR和查询重写。给定用户查询，一个轻量级的LLM生成一个关键词查询，该查询由标准检索器（BM25）执行，结合生成式查询重写的一般化优势和词法索引的效率和可扩展性。我们使用强化学习技术训练重写策略。在领域内和领域外的评估中，QueStER始终优于BM25，并且与神经信息检索基线相当，同时保持了强大的效率。

Summary / 总结

QueStER is designed to enhance generative retrieval by learning to generate explicit keyword-based search specifications. It leverages a lightweight LLM to produce keywords that are then used by a standard retriever (BM25), combining the benefits of generative query rewriting with the efficiency of lexical indexing. Experimental results show that QueStER outperforms BM25 and is competitive with neural IR baselines, while maintaining strong efficiency.

QueStER 通过学习从用户查询生成明确的关键字搜索规范来增强生成检索。它结合了生成查询重写的优势和词法索引的效率，使用轻量级的 LLM 和 BM25。该系统通过强化学习进行训练，并在跨领域评估中表现出对 BM25 的一致改进，同时与神经 IR 基线相当，保持了强大的效率。

ZENITH: Automated Gradient Norm Informed Stochastic Optimization

Authors: Dhrubo Saha

First: 2026-01-21T17:36:12+00:00 · Latest: 2026-01-21T17:36:12+00:00

Abstract

Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero-overhead Evolution using Norm-Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R-CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.

中文标题/摘要

标题：ZENITH：自动梯度范数指导的随机优化

训练深度计算机视觉模型需要人工监督或调整学习率（LR）计划。虽然现有的自适应优化器可以自动调度LR，但它们会遭受计算和内存开销、与正则化不兼容以及LR选择不佳的问题。在本工作中，我们引入了ZENITH（零开销进化利用范数训练历史）优化器，该优化器使用梯度范数的时间演变来调整LR。跨6种CNN架构和6个基准的图像分类实验表明，ZENITH在更短的墙钟时间内实现了更高的测试精度。此外，在使用R-CNN家族模型进行对象检测、关键点检测和实例分割时，它还实现了更好的mAP。此外，其与正则化的兼容性使其能够实现更好的泛化。

Summary / 总结

ZENITH is an optimizer that adapts the learning rate based on the temporal evolution of the gradient norm, aiming to reduce manual oversight and hyperparameter tuning in training deep computer vision models. Image classification experiments across 6 CNN architectures and 6 benchmarks show that ZENITH achieves higher test accuracy in lower wall-clock time than baselines. It also yields higher mAP in object detection, keypoint detection, and instance segmentation on MS COCO using R-CNN models, and its compatibility with regularization further improves generalization.

ZENITH 是一种基于梯度范数时间演变自动调整学习率的优化器，旨在减少训练深度计算机视觉模型时的手动监督和超参数调优。实验表明，ZENITH 在各种 CNN 架构和基准上的测试准确率更高，且训练时间更短。此外，它在 MS COCO 上使用 R-CNN 家族模型进行目标检测、关键点检测和实例分割时表现更优，并且与正则化兼容可以进一步提高泛化能力。

Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs

Authors: Mohammad Shahab Sepehri, Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi

First: 2025-07-16T05:54:37+00:00 · Latest: 2026-01-21T17:27:26+00:00

Abstract

Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each puzzle is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.

中文标题/摘要

标题：超幻象：评估多模态大语言模型的内心可视化能力基准

内心可视化，即构建和操控内部视觉表征的能力，是人类认知的核心组成部分，在涉及推理、预测和抽象的任务中发挥着重要作用。尽管多模态大语言模型（MLLMs）取得了快速进展，当前的基准主要评估被动的视觉感知，对模型内部构建视觉模式以支持问题解决的能力了解有限。然而，内心可视化是人类的一项关键认知技能，支持诸如空间导航、预测物理轨迹和通过想象模拟解决复杂视觉问题等能力。为了弥合这一差距，我们引入了超幻象，这是一种合成基准，旨在通过四个精心设计的谜题来评估MLLMs的内心可视化能力。每个谜题都是程序生成的，并以三个难度级别呈现，从而可以对模型在不断增加的复杂性下的表现进行受控分析。我们对最先进的模型的全面评估表明，人类和MLLMs在表现上存在显著差距。此外，我们还探讨了强化学习在提高视觉模拟能力方面的潜力。我们的研究结果表明，虽然一些模型在识别视觉模式方面表现出部分能力，但稳健的内心可视化仍然是当前MLLMs面临的开放挑战。

Summary / 总结

The paper introduces Hyperphantasia, a benchmark to evaluate the mental visualization capabilities of Multimodal Large Language Models (MLLMs). It addresses the gap in current benchmarks by focusing on the active construction of visual patterns, crucial for reasoning and problem-solving. The benchmark consists of four procedurally generated puzzles at varying difficulty levels, showing that state-of-the-art MLLMs significantly lag behind human performance in mental visualization. The study also explores reinforcement learning to enhance visual simulation capabilities, highlighting the ongoing challenge of robust mental visualization for MLLMs.

论文提出了Hyperphantasia，这是一个用于评估多模态大型语言模型（MLLMs）的内部视觉化能力的基准。它包含四个按难度分级生成的谜题，以评估MLLMs能否内部构建和操作视觉表示。评估结果显示，当前的MLLMs在视觉化能力上远不如人类，表明需要提高它们的内部视觉化能力。研究还探讨了使用强化学习来增强MLLMs的视觉模拟能力。

A Dynamic Prognostic Prediction Method for Colorectal Cancer Liver Metastasis

Authors: Wei Yang, Yiran Zhu, Yan su, Zesheng Li, Chengchang Pan, Honggang Qi

First: 2025-05-06T02:41:34+00:00 · Latest: 2026-01-21T17:26:46+00:00

Abstract

Colorectal cancer liver metastasis (CRLM) exhibits high postoperative recurrence and pronounced prognostic heterogeneity, challenging individualized management. Existing prognostic approaches often rely on static representations from a single postoperative snapshot, and fail to jointly capture tumor spatial distribution, longitudinal disease dynamics, and multimodal clinical information, limiting predictive accuracy. We propose DyPro, a deep learning framework that infers postoperative latent trajectories via residual dynamic evolution. Starting from an initial patient representation, DyPro generates a 12-step sequence of trajectory snapshots through autoregressive residual updates and integrates them to predict recurrence and survival outcomes. On the MSKCC CRLM dataset, DyPro achieves strong discrimination under repeated stratified 5-fold cross-validation, reaching a C-index of 0.755 for OS and 0.714 for DFS, with OS AUC@1y of 0.920 and OS IBS of 0.143. DyPro provides quantitative risk cues to support adjuvant therapy planning and follow-up scheduling.
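Two ideas in this abstract lend themselves to a short sketch: the autoregressive residual rollout and the C-index used to report discrimination. The code below is illustrative only and assumes several things not stated in the source: the residual update is a hand-written toy function rather than a learned network, the state is a plain list of floats, and the C-index is Harrell's pairwise version with risk ties counted as one half. All names and sample values are hypothetical.

```python
def rollout(h0, update, steps=12):
    """Generate `steps` snapshots with residual updates h <- h + update(h)."""
    snapshots = []
    h = list(h0)
    for _ in range(steps):
        delta = update(h)
        h = [hi + di for hi, di in zip(h, delta)]
        snapshots.append(h)
    return snapshots


def toy_update(h):
    """Stand-in for a learned residual block: shrink the state by 10%."""
    return [-0.1 * x for x in h]


def c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event
    before time j. It is concordant when i also has the higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


snapshots = rollout([1.0, -2.0], toy_update)

# Made-up cohort: follow-up time, event flag (1 = event, 0 = censored), risk.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.1, 0.5]
cidx = c_index(times, events, risks)
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported 0.755 (OS) and 0.714 (DFS).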

中文标题/摘要

标题：结直肠癌肝转移的动态预后预测方法

结直肠癌肝转移（CRLM）术后复发率高且预后异质性显著，挑战个体化管理。现有预后方法通常依赖于单次术后静态表示，未能联合捕捉肿瘤空间分布、纵向疾病动态和多模态临床信息，限制了预测准确性。我们提出DyPro，一种深度学习框架，通过残差动态演化推断术后潜在轨迹。从初始患者表示开始，DyPro 通过自回归残差更新生成12步轨迹快照序列，并整合预测复发和生存结果。在MSKCC CRLM数据集上，DyPro 在重复分层5折交叉验证中表现出强大的区分能力，OS C指数为0.755，DFS C指数为0.714，OS AUC@1y为0.920，OS IBS为0.143。DyPro 提供定量风险提示，支持辅助治疗规划和随访安排。

Summary / 总结

The study aims to improve the prediction of recurrence and survival outcomes for colorectal cancer liver metastasis (CRLM) by addressing the limitations of static prognostic approaches. DyPro, a deep learning framework, infers postoperative latent trajectories through residual dynamic evolution, generating a 12-step sequence of trajectory snapshots to predict recurrence and survival. On the MSKCC CRLM dataset, DyPro achieves a C-index of 0.755 for overall survival (OS) and 0.714 for disease-free survival (DFS), with an OS AUC@1y of 0.920 and an OS integrated Brier score of 0.143, demonstrating strong predictive accuracy.

研究旨在通过解决静态预后方法的局限性，提高对结直肠癌肝转移（CRLM）复发和生存结果的预测。DyPro是一种深度学习框架，通过残差动态演化推断术后潜在轨迹，生成12步轨迹快照序列以预测复发和生存。在MSKCC CRLM数据集上，DyPro的总体生存（OS）C指数为0.755，无病生存（DFS）C指数为0.714，1年OS AUC为0.920，OS综合Brier分数为0.143，显示出较强的预测准确性。

OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions

Authors: Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin

Venue: IROS

First: 2025-03-13T13:07:51+00:00 · Latest: 2026-01-21T17:25:25+00:00

Comments: Project page: https://be2rlab.github.io/OSMa-Bench/

Abstract

Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ, and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, informing future research directions for developing resilient and adaptable robotic systems. The project page is available at https://be2rlab.github.io/OSMa-Bench/.

中文标题/摘要

标题：OSMa-Bench：在不同照明条件下评估开放语义映射

开放语义映射（OSM）是机器人感知的关键技术，结合了语义分割和SLAM技术。本文介绍了一种由动态可配置和高度自动化的LLM/LVLM驱动的管道，用于评估OSM解决方案，称为OSMa-Bench（开放语义映射基准）。研究重点是在不同室内照明条件下评估最先进的语义映射算法，这是室内环境中的一个关键挑战。我们引入了一个新的数据集，包含模拟的RGB-D序列和地面真实3D重建，便于在不同照明条件下对映射性能进行严格的分析。通过在ConceptGraphs、BBQ和OpenScene等领先模型上进行实验，我们评估了物体识别和分割的语义准确性。此外，我们引入了一种场景图评估方法，以分析模型解释语义结构的能力。结果提供了这些模型鲁棒性的见解，为开发稳健和适应性强的机器人系统指明了未来的研究方向。项目页面可在https://be2rlab.github.io/OSMa-Bench/获取。

Summary / 总结

This paper introduces OSMa-Bench, an LLM/LVLM-powered pipeline for evaluating Open Semantic Mapping (OSM) algorithms under varying indoor lighting conditions. The study uses a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions to assess the semantic fidelity of object recognition and segmentation in leading models such as ConceptGraphs, BBQ, and OpenScene, and adds a Scene Graph evaluation method. The results give insight into how robust these models are to lighting changes and point to directions for future research on resilient robotic systems.

本文介绍了OSMa-Bench，一个用于评估Open Semantic Mapping (OSM)在不同室内光照条件下的管道。研究使用了一个包含模拟RGB-D序列和真实3D重建的新数据集，以评估ConceptGraphs、BBQ和OpenScene等领先模型在物体识别和分割中的语义准确性。研究结果展示了这些模型的鲁棒性，并为开发适应性强的机器人系统提供了未来研究方向的见解。

BBoxMaskPose v2: Expanding Mutual Conditioning to 3D

Authors: Miroslav Purkrabek, Constantin Kolomiiets, Jiri Matas

First: 2026-01-21T17:18:04+00:00 · Latest: 2026-01-21T17:18:04+00:00

Comments: GitHub repository: https://github.com/MiraPurkrabek/BBoxMaskPose/

Abstract

Most 2D human pose estimation benchmarks are nearly saturated, with the exception of crowded scenes. We introduce PMPose, a top-down 2D pose estimator that incorporates a probabilistic formulation and mask conditioning. PMPose improves crowded pose estimation without sacrificing performance on standard scenes. Building on this, we present BBoxMaskPose v2 (BMPv2), which integrates PMPose and an enhanced SAM-based mask refinement module. BMPv2 surpasses the state of the art by 1.5 average precision (AP) points on COCO and 6 AP points on OCHuman, becoming the first method to exceed 50 AP on OCHuman. We demonstrate that BMP's 2D prompting of a 3D model improves 3D pose estimation in crowded scenes and that advances in 2D pose quality directly benefit 3D estimation. Results on the new OCHuman-Pose dataset show that multi-person performance is more affected by pose prediction accuracy than by detection. The code, models, and data are available at https://MiraPurkrabek.github.io/BBox-Mask-Pose/.

中文标题/摘要

标题：BBoxMaskPose v2: 扩展互惠条件到3D

大多数2D人体姿态估计基准几乎饱和，除了拥挤的场景。我们引入了PMPose，这是一种自顶向下的2D人体姿态估计器，结合了概率公式和掩码条件。PMPose在提高拥挤姿态估计的同时，不牺牲标准场景的性能。在此基础上，我们提出了BBoxMaskPose v2 (BMPv2)，它结合了PMPose和改进的基于SAM的掩码细化模块。BMPv2在COCO上的平均精度（AP）高出1.5分，在OCHuman上高出6分，成为第一个在OCHuman上超过50 AP的方法。我们证明了BMP的2D提示3D模型在拥挤场景中提高了3D姿态估计，并且2D姿态质量的进步直接有利于3D估计。新OCHuman-Pose数据集上的结果显示，多人性能更多地受到姿态预测准确性的影响，而不是检测。代码、模型和数据可在https://MiraPurkrabek.github.io/BBox-Mask-Pose/获取。

Summary / 总结

The research aims to improve 2D human pose estimation in crowded scenes by introducing PMPose, which incorporates a probabilistic formulation and mask-conditioning. Building on this, BBoxMaskPose v2 integrates PMPose with an enhanced SAM-based mask refinement module, achieving a 1.5 AP point improvement on COCO and 6 AP points on OCHuman, surpassing state-of-the-art methods. The study shows that 2D pose quality significantly impacts 3D pose estimation in crowded scenes, and results on the OCHuman-Pose dataset indicate that multi-person performance is more influenced by pose prediction accuracy than detection accuracy.

研究引入了PMPose，这是一种自顶向下的人体姿态估计器，能够在拥挤场景中提高姿态估计性能，同时在标准场景中保持性能。在此基础上，BBoxMaskPose v2集成了增强的SAM基掩码精修模块，分别在COCO和OCHuman上提高了1.5和6个AP点，超越了最先进的方法。3D模型在拥挤场景中的2D提示增强了3D姿态估计，研究还表明，多人性能更多地依赖于姿态预测的准确性而非检测。

BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

First: 2026-01-21T17:15:22+00:00 · Latest: 2026-01-21T17:15:22+00:00

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments across SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
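A toy discrete example makes the PMI objective and the collapse it guards against concrete. The sketch below is illustrative and not the authors' training code. It assumes a hypothetical two-action space with hand-picked probabilities, and it uses the standard definition of conditional PMI as the log-ratio of the language-conditioned posterior $\pi(a \mid v, \ell)$ to the vision-only prior $p(a \mid v)$. Function and variable names are made up.

```python
import math


def pmi(posterior, prior, action):
    """Conditional PMI of one action: log pi(a | v, l) - log p(a | v)."""
    return math.log(posterior[action]) - math.log(prior[action])


def expected_pmi(posterior, prior):
    """Average PMI under the posterior, i.e. KL(posterior || prior)."""
    return sum(posterior[a] * pmi(posterior, prior, a) for a in posterior)


# Vision-only prior: the image alone does not favor either action.
prior = {"pick": 0.5, "place": 0.5}

# Grounded policy: the instruction shifts probability toward "pick".
grounded = {"pick": 0.9, "place": 0.1}

# Collapsed policy: the instruction is ignored, so posterior equals prior.
collapsed = {"pick": 0.5, "place": 0.5}

grounded_gain = pmi(grounded, prior, "pick")
collapsed_gain = pmi(collapsed, prior, "pick")
```

When the posterior equals the prior, the PMI of every action is zero, which is the Information Collapse case described in the abstract. Maximizing PMI rewards actions that the instruction makes more likely than vision alone would.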

Summary / 总结

BayesianVLA addresses the issue of information collapse in Vision-Language-Action models by proposing a Bayesian decomposition framework. It introduces learnable Latent Action Queries to estimate both a vision-only prior and a language-conditioned posterior, optimizing the policy to maximize the conditional PMI between actions and instructions. This approach improves generalization, achieving an 11.3% improvement on the OOD SimplerEnv benchmark.

研究解决了Vision-Language-Action模型中信息坍缩的问题，即语言指令从视觉输入中变得高度可预测，导致泛化能力差。BayesianVLA提出了一种贝叶斯分解框架，使用可学习的潜在动作查询来估计视觉和语言条件下的策略。该方法通过最大化动作和指令之间的条件PMI来优化策略，惩罚视觉捷径并奖励能够明确解释语言命令的动作。实验结果显示，在泛化能力方面取得了显著提升，特别是在SimplerEnv基准测试中获得了11.3%的改进。

Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub

Authors: Ramtin Ehsani, Sakshi Pathak, Shriya Rawal, Abdullah Al Mujahid, Mia Mohammad Imran, Preetha Chatterjee

First: 2026-01-21T17:12:46+00:00 · Latest: 2026-01-21T17:12:46+00:00

Comments: Accepted at International Mining Software Repositories Conference (MSR 2026)

Abstract

AI coding agents are now submitting pull requests (PRs) to software projects, acting not just as assistants but as autonomous contributors. As these agentic contributions are rapidly increasing across real repositories, little is known about how they behave in practice and why many of them fail to be merged. In this paper, we conduct a large-scale study of 33k agent-authored PRs made by five coding agents across GitHub. (RQ1) We first quantitatively characterize merged and not-merged PRs along four broad dimensions: 1) merge outcomes across task types, 2) code changes, 3) CI build results, and 4) review dynamics. We observe that tasks related to documentation, CI, and build update achieve the highest merge success, whereas performance and bug-fix tasks perform the worst. Not-merged PRs tend to involve larger code changes, touch more files, and often do not pass the project's CI/CD pipeline validation. (RQ2) To further investigate why some agentic PRs are not merged, we qualitatively analyze 600 PRs to derive a hierarchical taxonomy of rejection patterns. This analysis complements the quantitative findings in RQ1 by uncovering rejection reasons not captured by quantitative metrics, including lack of meaningful reviewer engagement, duplicate PRs, unwanted feature implementations, and agent misalignment. Together, our findings highlight key socio-technical and human-AI collaboration factors that are critical to improving the success of future agentic workflows.

中文标题/摘要

标题：AI编程代理失败在哪里？GitHub上失败的代理拉取请求的实证研究

AI编程代理现在正在向软件项目提交拉取请求（PR），不仅作为助手，还作为自主贡献者。随着这些代理贡献在实际仓库中的迅速增加，人们对它们在实际中的行为以及为何许多PR未被合并知之甚少。在本文中，我们对GitHub上五个编程代理提交的33,000个代理撰写的PR进行了大规模研究。（RQ1）我们首先从四个广泛维度定量地表征了合并和未合并的PR：1）任务类型上的合并结果，2）代码更改，3）CI构建结果，4）审查动态。我们观察到，与文档、CI和构建更新相关的任务的合并成功率最高，而性能和bug修复任务表现最差。未合并的PR通常涉及更大的代码更改，触及更多的文件，并且往往未通过项目的CI/CD管道验证。（RQ2）为了进一步探讨为什么某些代理PR未被合并，我们对600个PR进行了定性分析，以推导出拒绝模式的层次分类。该分析补充了RQ1中的定量发现，揭示了定量指标未捕捉到的拒绝原因，包括缺乏有意义的审阅者参与、重复的PR、不必要的功能实现以及代理的不匹配。综上所述，我们的研究结果突出了关键的社会技术因素和人机协作因素，这些因素对于提高未来代理工作流程的成功至关重要。

Summary / 总结

The study investigates the behavior and failure modes of AI coding agents in GitHub by analyzing 33,000 pull requests from five agents. It quantitatively characterizes merge outcomes and reasons for failure, finding that documentation, CI, and build update tasks are most successful, while performance and bug-fix tasks fail more often. Qualitative analysis of 600 rejected PRs reveals additional reasons for failure, such as lack of reviewer engagement and misalignment with project goals. The research highlights socio-technical factors affecting agent success in software development workflows.

研究分析了GitHub上五个AI编码代理提交的33,000个拉取请求，探讨为何许多请求未被合并。研究通过任务类型、代码更改、CI构建结果和审查动态的定量分析，发现文档、CI和构建更新任务的合并成功率最高，而性能和错误修复任务则表现最差。对600个请求的定性分析揭示了缺乏审查者参与、重复请求和代理错位等拒绝原因，补充了定量分析的结果。

Benchmarking Large Language Models for ABAP Code Generation: An Empirical Study on Iterative Improvement by Compiler Feedback

Authors: Stephan Wallraven, Tim Köhne, Hartmut Westenberger, Andreas Moser

First: 2026-01-21T17:06:41+00:00 · Latest: 2026-01-21T17:06:41+00:00

Comments: 20 pages, 10 figures, Author: Hartmut Westenberger (ORCID: 0009-0009-9063-8318)

Abstract

This work investigates the performance of Large Language Models (LLMs) in generating ABAP code. Despite successful applications of generative AI in many programming languages, there are hardly any systematic analyses of ABAP code generation to date. The aim of the study is to empirically analyze to what extent various LLMs can generate syntactically correct and functional ABAP code, how effectively they use compiler feedback for iterative improvement, and which task types pose special challenges. For this purpose, a benchmark with 180 tasks is conducted, consisting of adapted HumanEval tasks and practical SAP scenarios. The results show significant performance differences between the models: more powerful LLMs achieve success rates of around 75% after several iterations and benefit greatly from compiler feedback, while smaller models perform significantly worse. Overall, the study highlights the high potential of powerful LLMs for ABAP development processes, especially in iterative error correction.
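
The iterative improvement described above can be pictured as a loop that feeds compiler errors back into the next generation attempt. The following is a minimal sketch under stated assumptions: `generate` and `compile_check` are hypothetical callables standing in for an LLM call and an ABAP syntax check, and the toy stubs below are not a real compiler or model.

```python
from typing import Callable, List, Optional, Tuple

def generate_with_feedback(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    compile_check: Callable[[str], List[str]],
    max_iterations: int = 5,
) -> Tuple[Optional[str], int]:
    """Request code, feed compiler errors back, stop on a clean compile."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iterations + 1):
        code = generate(task, feedback)
        errors = compile_check(code)
        if not errors:
            return code, attempt          # first attempt that compiles cleanly
        feedback = "\n".join(errors)      # errors become context for the next attempt
    return None, max_iterations           # budget exhausted without a clean compile

# Toy stubs (assumptions): the "model" forgets the closing period until told.
def fake_generate(task: str, feedback: Optional[str]) -> str:
    return "WRITE 'hi'." if feedback else "WRITE 'hi'"

def fake_compile(code: str) -> List[str]:
    return [] if code.endswith(".") else ["Statement is not terminated by a period."]

code, attempts = generate_with_feedback("print hi", fake_generate, fake_compile)
```

A clean compile only covers syntactic correctness; checking functional correctness would need a separate test step after the loop.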

中文标题/摘要

标题：大型语言模型在ABAP代码生成中的基准测试：编译器反馈驱动的迭代改进实证研究

本研究探讨了大型语言模型（LLMs）在生成ABAP代码方面的性能。尽管生成式AI在许多编程语言中的应用已经取得成功，但迄今为止对ABAP代码生成的系统性分析却很少。本研究旨在实证分析各种LLMs生成语法正确且功能完备的ABAP代码的程度，它们如何有效地利用编译器反馈进行迭代改进，以及哪些任务类型会带来特殊挑战。为此，进行了包含180个任务的基准测试，这些任务由改编的HumanEval任务和实际的SAP场景组成。结果显示，不同模型之间存在显著的性能差异：更强大的LLMs在经过多次迭代后成功率达到约75%，并且从编译器反馈中获益良多，而较小的模型则表现明显较差。总体而言，本研究突显了强大LLMs在ABAP开发流程中的高潜力，尤其是在迭代错误修正方面。

Summary / 总结

This study evaluates the performance of Large Language Models (LLMs) in generating ABAP code, focusing on their ability to generate syntactically correct and functional code through iterative improvement with compiler feedback. The research uses a benchmark of 180 tasks, including adapted HumanEval tasks and practical SAP scenarios, and finds that more powerful LLMs achieve success rates of around 75% after several iterations and benefit significantly from compiler feedback, whereas smaller models perform much worse. The study underscores the potential of powerful LLMs for ABAP development, particularly in iterative error correction.

该研究评估了大型语言模型（LLMs）在生成ABAP代码方面的性能，使用了包含改编自HumanEval任务和实际SAP场景的180个任务基准。研究发现，更强大的LLMs在经过多次迭代后可以达到约75%的成功率，并且显著受益于编译器反馈，而较小的模型则表现较弱。该研究强调了强大LLMs在ABAP开发中的潜力，特别是在迭代错误修正方面。

A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments

Authors: Md Jahangir Alam Khondkar, Ajan Ahmed, Stephanie Schuckers, Masudul Haider Imtiaz

First: 2025-06-17T22:12:40+00:00 · Latest: 2026-01-21T17:05:49+00:00

Abstract

Speech enhancement, particularly denoising, is vital in improving the intelligibility and quality of speech signals for real-world applications, especially in noisy environments. While prior research has introduced various deep learning models for this purpose, many struggle to balance noise suppression, perceptual quality, and speaker-specific feature preservation, leaving a critical research gap in their comparative performance evaluation. This study benchmarks three state-of-the-art models, Wave-U-Net, CMGAN, and U-Net, on diverse datasets such as SpEAR, VPQAD, and Clarkson. These models were chosen due to their relevance in the literature and code accessibility. The evaluation reveals that U-Net achieves high noise suppression with SNR improvements of +71.96% on SpEAR, +64.83% on VPQAD, and +364.2% on the Clarkson dataset. CMGAN outperforms in perceptual quality, attaining the highest PESQ scores of 4.04 on SpEAR and 1.46 on VPQAD, making it well-suited for applications prioritizing natural and intelligible speech. Wave-U-Net balances these attributes with improvements in speaker-specific feature retention, evidenced by VeriSpeak score gains of +10.84% on SpEAR and +27.38% on VPQAD. This research indicates how advanced methods can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition. The findings may contribute to advancing voice biometrics, forensic audio analysis, telecommunication, and speaker verification in challenging acoustic conditions.

中文标题/摘要

标题：真实噪声环境中语音增强深度学习模型的比较评估

语音增强，尤其是降噪，对于提高现实应用中语音信号的可懂度和质量至关重要，尤其是在噪声环境中。尽管先前的研究引入了各种深度学习模型，但许多模型在噪声抑制、感知质量和说话人特定特征保留之间难以平衡，留下了比较性能评估的关键研究空白。本研究在SpEAR、VPQAD和克拉克森数据集等多样数据集上对三种最先进的模型Wave-U-Net、CMGAN和U-Net进行了基准测试。这些模型因其文献相关性和代码可访问性而被选择。评估结果显示，U-Net在噪声抑制方面表现出色，SpEAR数据集上的信噪比提高71.96%，VPQAD数据集上的信噪比提高64.83%，克拉克森数据集上的信噪比提高364.2%。CMGAN在感知质量方面表现更优，SpEAR数据集上的PESQ得分为4.04，VPQAD数据集上的PESQ得分为1.46，使其适用于重视自然和可懂度的语音应用。Wave-U-Net在这些属性之间取得了平衡，通过VeriSpeak得分提高10.84%（SpEAR）和27.38%（VPQAD）来保留说话人特定特征。这项研究表明，先进的方法如何优化噪声抑制、感知质量和说话人识别之间的权衡。研究结果可能有助于推进语音生物特征识别、法医音频分析、电信和在挑战性声学条件下说话人验证。

Summary / 总结

This study evaluates three deep learning models, Wave-U-Net, CMGAN, and U-Net, for speech enhancement in noisy environments. The research finds that U-Net excels in noise suppression with significant SNR improvements, CMGAN performs best in perceptual quality, and Wave-U-Net balances noise suppression with speaker-specific feature retention. These findings suggest that different models can optimize trade-offs between noise suppression, perceptual quality, and speaker recognition, which is crucial for applications like voice biometrics and forensic audio analysis.

本研究评估了Wave-U-Net、CMGAN和U-Net三种最先进的深度学习模型在嘈杂环境中的语音增强性能。这些模型在SpEAR、VPQAD和Clarkson数据集上进行了基准测试。U-Net在噪声抑制方面表现出色，SNR提升显著，而CMGAN在感知质量方面表现最佳。Wave-U-Net在噪声抑制和说话人特定特征保留之间取得了平衡，如VeriSpeak得分提高。研究强调了噪声抑制、感知质量和说话人识别之间的权衡，对语音生物特征识别和法医音频分析具有重要意义。

Dynamic Management of a Deep Learning-Based Anomaly Detection System for 5G Networks

Authors: Lorenzo Fernández Maimó, Alberto Huertas Celdrán, Manuel Gil Pérez, Félix J. García Clemente, Gregorio Martínez Pérez

First: 2026-01-21T16:54:19+00:00 · Latest: 2026-01-21T16:54:19+00:00

Abstract

Fog and mobile edge computing (MEC) will play a key role in the upcoming fifth generation (5G) mobile networks, supporting decentralized applications and bringing data analytics and management into the network itself through a highly distributed compute model. Furthermore, increasing attention is being paid to user-centric cybersecurity solutions, which require collecting, processing, and analyzing very large volumes of data traffic and huge numbers of network connections in 5G networks. In this regard, this paper proposes a MEC-oriented solution for 5G mobile networks that detects network anomalies in real time and in an autonomic way. Our proposal uses deep learning techniques to analyze network flows and detect network anomalies. Moreover, it uses policies to provide efficient and dynamic management of the computing resources used in the anomaly detection process. The paper presents relevant aspects of the deployment of the proposal and experimental results that show its performance.

中文标题/摘要

标题：基于深度学习的5G网络异常检测系统的动态管理

雾计算和移动边缘计算（MEC）将在即将到来的第五代（5G）移动网络中发挥关键作用，以支持去中心化的应用程序，并通过高度分布式的计算模型将数据处理和管理直接引入网络。此外，对提供以用户为中心的网络安全解决方案的关注不断增加，这特别需要在5G网络中收集、处理和分析大量数据流量和大量的网络连接。为此，本文在5G移动网络中提出了一种面向MEC的解决方案，以实现实时和自主的网络异常检测。我们的提案使用深度学习技术来分析网络流并检测网络异常。此外，它还使用策略来提供在异常检测过程中使用的计算资源的高效和动态管理系统。本文介绍了提案的部署相关方面并展示了实验结果以证明其性能。

Summary / 总结

This paper addresses the need for real-time and autonomous anomaly detection in 5G networks using fog and mobile edge computing (MEC). It proposes a solution that employs deep learning techniques to analyze network flows and detect anomalies. The system also dynamically manages computing resources through policies. Experimental results demonstrate its performance in detecting anomalies efficiently.

本文针对5G网络中使用雾计算和移动边缘计算（MEC）实现实时和自主异常检测的需求。该方案利用深度学习技术分析网络流量并检测异常，同时通过策略动态管理计算资源。实验结果表明，该方法能够有效地检测网络异常。

Finding Kissing Numbers with Game-theoretic Reinforcement Learning

Authors: Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Yuan Cheng, Yuan Qi, Yaodong Yang

First: 2025-11-17T14:02:00+00:00 · Latest: 2026-01-21T16:46:46+00:00

Abstract

Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem represents the local analogue of Hilbert's 18th problem on sphere packing, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry and exponentially growing combinatorial complexity beyond 8 dimensions, which exceeds the complexity of the game of Go, limit the scalability of existing methods. Here we model this problem as a two-player matrix completion game that can be fully parallelized at large scale, and train the game-theoretic reinforcement learning system, PackingStar, to efficiently explore high-dimensional spaces. The matrix entries represent pairwise cosines of sphere center vectors; one player fills entries while another corrects suboptimal ones, jointly maximizing the matrix size, corresponding to the kissing number. These cooperative dynamics substantially improve sample quality, making the extremely large spaces tractable. PackingStar reproduces previous configurations and surpasses all human-known records from dimensions 25 to 31, with the configuration in 25 dimensions geometrically corresponding to the Leech lattice and suggesting possible optimality. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions, discovers over 6000 new structures in 14 and other dimensions, and establishes new records for generalized kissing configurations under various angular constraints. These results demonstrate AI's power to explore high-dimensional spaces beyond human intuition and open new pathways for the Kissing Number Problem and broader geometry problems.
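
The pairwise-cosine matrix mentioned above encodes a simple geometric condition: equal spheres touching a central sphere do not overlap exactly when every pair of center directions is at least 60 degrees apart, that is, when each off-diagonal cosine is at most 1/2. The sketch below checks that condition for explicit vectors. The helper names are illustrative assumptions, and the check deliberately omits the harder part of the matrix completion problem, namely that a filled-in matrix must also be realizable by vectors in the target dimension.

```python
import math
from itertools import combinations

def cosine_matrix(vectors):
    """Pairwise cosines between the given (nonzero) center vectors."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    units = [unit(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, w)) for w in units] for u in units]

def is_kissing_configuration(cosines, tol=1e-9):
    """True if every pair of distinct directions has cosine <= 1/2."""
    n = len(cosines)
    return all(cosines[i][j] <= 0.5 + tol for i, j in combinations(range(n), 2))

# Six directions at 60-degree steps: the classical kissing arrangement in the plane.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]
# Seven evenly spaced directions are too crowded (neighbors are about 51.4 degrees apart).
heptagon = [(math.cos(2 * k * math.pi / 7), math.sin(2 * k * math.pi / 7)) for k in range(7)]
```

In this picture, the matrix size that the two players jointly maximize is the number of rows that can be kept while every off-diagonal entry still satisfies the bound.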

中文标题/摘要

标题：使用博弈强化学习寻找接吻数

自艾萨克·牛顿在1694年首次研究接吻数问题以来，确定围绕中心球的最大非重叠球体数量一直是基本挑战。该问题代表了希尔伯特第18个问题中球体堆积的局部类比，连接了几何学、数论和信息论。尽管通过晶格和编码取得了显著进展，但高维几何中的不规则性以及8维以上（超过围棋的复杂性）的组合复杂性指数增长限制了现有方法的可扩展性。在这里，我们将该问题建模为一个可以大规模并行化的两玩家矩阵填充博弈，并训练博弈强化学习系统PackingStar高效探索高维空间。矩阵条目代表球心向量的成对余弦值；一个玩家填充条目，另一个纠正次优条目，共同最大化矩阵大小，对应于接吻数。这种合作动态显著提高了样本质量，使极其庞大的空间变得可处理。PackingStar复现了先前的配置，并从第25维到第31维超越了所有已知的人类记录，第25维的配置几何上对应于Leech晶格，暗示可能的最优性。它在第13维实现了自1971年以来的第一个突破，发现了超过6000个新结构在14维及其他维度，并在各种角度约束下的广义接吻配置中建立了新的记录。这些结果展示了AI在超越人类直觉探索高维空间方面的强大能力，并为接吻数问题和更广泛的几何问题开辟了新途径。

Summary / 总结

The research addresses the Kissing Number Problem, which seeks the maximum number of non-overlapping spheres around a central sphere. It models this problem as a two-player matrix completion game and trains a game-theoretic reinforcement learning system, PackingStar, to explore high-dimensional spaces. PackingStar reproduces known configurations and surpasses human records in dimensions 25 to 31, and discovers over 6000 new structures in 14 and other dimensions, establishing new records for generalized kissing configurations under various angular constraints.

研究旨在解决Kissing Number Problem，即围绕一个中心球体找到最多非重叠球体的数量。为此，研究将该问题建模为一个两玩家矩阵填充游戏，并训练了一个名为PackingStar的游戏理论强化学习系统。该方法能够高效探索高维空间，并在25至31维度上超越了先前的记录，包括25维度的一个配置与Leech晶格相对应。此外，PackingStar还在14维度及其他维度上发现了超过6000个新结构，并建立了各种角度约束下的新的接吻配置记录。

CRAFT: Continuous Reasoning and Agentic Feedback Tuning for Multimodal Text-to-Image Generation

Authors: V. Kovalev, A. Kuvshinov, A. Buzovkin, D. Pokidov, D. Timonin

First: 2025-12-23T13:44:41+00:00 · Latest: 2026-01-21T16:42:28+00:00

Comments: 37 pages, 42 figures

Abstract

Recent work has shown that inference-time reasoning and reflection can improve text-to-image generation without retraining. However, existing approaches often rely on implicit, holistic critiques or unconstrained prompt rewrites, making their behavior difficult to interpret, control, or stop reliably. In contrast, large language models have benefited from explicit, structured forms of **thinking** based on verification, targeted correction, and early stopping. We introduce CRAFT (Continuous Reasoning and Agentic Feedback Tuning), a training-free and model-agnostic framework for multimodal image generation. CRAFT transforms a user prompt into a set of explicit, dependency-structured visual constraints, verifies generated images using a vision-language model, and performs targeted prompt updates only when specific constraints are violated. This iterative process includes an explicit stopping criterion, resulting in an interpretable and controllable inference-time refinement loop. Across multiple model families and challenging benchmarks, CRAFT consistently improves compositional accuracy, text rendering, and preference-based evaluations, with particularly strong gains for lightweight generators. Importantly, these improvements incur only a negligible inference-time overhead, allowing smaller or cheaper models to approach the quality of substantially more expensive systems. Our results suggest that explicitly structured, constraint-driven inference-time reasoning is a key ingredient for improving the reliability of multimodal generative models.
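
The verify-then-update loop with an explicit stopping criterion can be sketched as follows. This is an illustrative outline only: `generate`, `verify`, and `update` are hypothetical callables standing in for an image generator, a vision-language verifier, and a prompt rewriter, and the sketch ignores the dependency structure among constraints that the abstract mentions.

```python
from typing import Callable, List, Tuple

def refine_prompt(
    prompt: str,
    constraints: List[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    update: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, str, int]:
    """Regenerate only while some constraint is violated and budget remains."""
    rounds = 0
    image = generate(prompt)
    while True:
        violated = [c for c in constraints if not verify(image, c)]
        # Explicit stopping criterion: all constraints hold, or the budget is spent.
        if not violated or rounds == max_rounds:
            return prompt, image, rounds
        prompt = update(prompt, violated)   # targeted update for the failed constraints only
        image = generate(prompt)
        rounds += 1

# Toy stand-ins (assumptions): the "image" is just the prompt text.
def toy_generate(prompt: str) -> str:
    return prompt

def toy_verify(image: str, constraint: str) -> bool:
    return constraint in image

def toy_update(prompt: str, violated: List[str]) -> str:
    return prompt + ", " + ", ".join(violated)

final_prompt, final_image, rounds_used = refine_prompt(
    "a cat", ["cat", "red hat"], toy_generate, toy_verify, toy_update
)
```

Because the loop exits as soon as verification passes, a prompt that already satisfies every constraint costs one generation and one verification pass.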

中文标题/摘要

标题：CRAFT：连续推理和自主反馈调优的多模态文本到图像生成

近期研究表明，在推理时进行推理和反思可以提高文本到图像生成的效果，而无需重新训练。然而，现有方法往往依赖于隐式的、整体的批评或不受限制的提示重写，这使得它们的行为难以解释、控制或可靠地停止。相比之下，大型语言模型得益于基于验证、目标纠正和早期停止的明确、结构化的**思考**形式。我们提出了CRAFT（连续推理和自主反馈调优），这是一种无需训练且模型无关的多模态图像生成框架。CRAFT 将用户提示转换为一组明确的、依赖结构化的视觉约束，使用视觉语言模型验证生成的图像，并仅在特定约束被违反时进行有针对性的提示更新。这一迭代过程包括一个明确的停止标准，从而形成一个可解释且可控的推理时细化循环。在多个模型家族和具有挑战性的基准测试中，CRAFT 一致地提高了组合准确性、文本呈现和基于偏好的评估，特别是在轻量级生成器方面取得了显著的改进。重要的是，这些改进仅带来了微不足道的推理时开销，使得较小或更便宜的模型能够接近更昂贵系统的质量。我们的结果表明，明确结构化的、基于约束的推理是提高多模态生成模型可靠性的关键成分。

Summary / 总结

CRAFT is a training-free and model-agnostic framework for multimodal text-to-image generation that transforms user prompts into explicit visual constraints, verifies generated images using a vision-language model, and updates prompts only when constraints are violated. This iterative process, with an explicit stopping criterion, leads to improved compositional accuracy, text rendering, and preference-based evaluations, especially for lightweight generators, with minimal inference-time overhead.

CRAFT 是一个无需训练且模型无关的框架，将用户提示转换为显式的视觉约束，使用视觉语言模型验证生成的图像，并仅在约束被违反时更新提示。这一迭代过程包含明确的停止标准，提高了组合准确性、文本渲染和基于偏好的评估，尤其是对于轻量级生成器，同时具有极小的推理时间开销。

The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Authors: Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

First: 2026-01-21T16:41:58+00:00 · Latest: 2026-01-21T16:41:58+00:00

Comments: Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

Abstract

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
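
For readers unfamiliar with GRPO, its core step is a group-relative advantage: several completions are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The snippet below is a minimal sketch of that normalization only. It is not the JustGRPO training code, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one math problem, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples receive positive advantages and incorrect ones negative. When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal.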

中文标题/摘要

标题：灵活性陷阱：任意顺序限制为何在扩散语言模型中限制了推理潜力

扩散大型语言模型（dLLMs）打破了传统LLMs的从左到右的刚性约束，使令牌生成可以以任意顺序进行。直观上，这种灵活性意味着一个严格包含固定自回归轨迹的解空间，理论上为数学和编程等通用任务解锁了更优的推理潜力。因此，许多研究工作利用强化学习（RL）来激发dLLMs的推理能力。在本文中，我们揭示了一个反直觉的现实：当前形式的任意顺序生成实际上缩小了而不是扩大了dLLMs的推理边界。我们发现dLLMs倾向于利用这种顺序灵活性绕过对探索至关重要的高不确定性令牌，导致解空间过早崩溃。这一观察挑战了现有dLLMs的RL方法的前提，其中许多复杂性，如处理组合轨迹和不可计算的似然性，往往致力于保持这种灵活性。我们证明，通过故意放弃任意顺序并应用标准的组相对策略优化（GRPO）可以更有效地激发有效的推理。我们的方法JustGRPO简洁而效果显著（例如，在GSM8K上的准确率为89.1%），同时完全保留了dLLMs的并行解码能力。项目页面：https://nzl-thu.github.io/the-flexibility-trap

Summary / 总结

This paper investigates the reasoning capabilities of diffusion large language models (dLLMs) and finds that arbitrary-order generation, in its current form, narrows rather than expands their reasoning boundary. The authors show that dLLMs tend to bypass high-uncertainty tokens that are crucial for exploration, which prematurely collapses the solution space. They propose JustGRPO, a minimalist approach that intentionally forgoes arbitrary order and applies standard Group Relative Policy Optimization (GRPO), achieving 89.1% accuracy on GSM8K while retaining parallel decoding ability. This work challenges existing RL approaches for dLLMs that devote considerable complexity to preserving order flexibility.

该论文研究了扩散大型语言模型（dLLMs）的推理能力，发现它们的任意顺序生成实际上限制了其推理潜力。作者揭示，dLLMs倾向于避开高不确定性标记，这缩小了解空间。他们提出了一种简约的方法JustGRPO，故意放弃任意顺序，实现了在GSM8K上的89.1%准确率，同时保持了dLLMs的并行解码能力。这挑战了现有依赖于保留顺序灵活性的强化学习方法。

V-CAGE: Context-Aware Generation and Verification for Scalable Long-Horizon Embodied Tasks

Authors: Yaru Liu, Ao-bo Wang, Nanyang Ye

First: 2026-01-21T16:41:51+00:00 · Latest: 2026-01-21T16:41:51+00:00

Abstract

Learning long-horizon embodied behaviors from synthetic data remains challenging because generated scenes are often physically implausible, language-driven programs frequently "succeed" without satisfying task semantics, and high-level instructions require grounding into executable action sequences. To address these limitations, we introduce V-CAGE, a closed-loop framework for generating robust, semantically aligned manipulation datasets at scale. First, we propose a context-aware instantiation mechanism that enforces geometric consistency during scene synthesis. By dynamically maintaining a map of prohibited spatial areas as objects are placed, our system prevents interpenetration and ensures reachable, conflict-free configurations in cluttered environments. Second, to bridge the gap between abstract intent and low-level control, we employ a hierarchical instruction decomposition module. This decomposes high-level goals (e.g., "get ready for work") into compositional action primitives, facilitating coherent long-horizon planning. Crucially, we enforce semantic correctness through a VLM-based verification loop. Acting as a visual critic, the VLM performs rigorous rejection sampling after each subtask, filtering out "silent failures" where code executes but fails to achieve the visual goal. Experiments demonstrate that V-CAGE yields datasets with superior physical and semantic fidelity, significantly boosting the success rate and generalization of downstream policies compared to non-verified baselines.
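
The idea of maintaining a map of prohibited areas while objects are placed can be illustrated on a 2D grid. The sketch below is a simplified assumption-laden stand-in: objects are axis-aligned rectangles, candidate positions are supplied by the caller, and the function name `place_objects` is hypothetical. The actual system works with 3D scenes and also reasons about reachability.

```python
def place_objects(grid_w, grid_h, footprints, candidates):
    """Place objects one by one, rejecting positions that hit prohibited cells.

    footprints: name -> (width, height) in grid cells
    candidates: name -> list of (x, y) lower-left positions to try, in order
    """
    prohibited = set()   # cells already claimed by earlier objects
    placements = {}
    for name, (w, h) in footprints.items():
        for (x, y) in candidates[name]:
            cells = {(x + dx, y + dy) for dx in range(w) for dy in range(h)}
            in_bounds = all(0 <= cx < grid_w and 0 <= cy < grid_h for cx, cy in cells)
            if in_bounds and not (cells & prohibited):
                placements[name] = (x, y)
                prohibited |= cells      # update the map so later objects avoid this area
                break
    return placements, prohibited

footprints = {"plate": (2, 2), "cup": (1, 1)}
candidates = {"plate": [(0, 0)], "cup": [(1, 1), (2, 0)]}
placements, prohibited = place_objects(4, 4, footprints, candidates)
```

Here the cup's first candidate overlaps the plate and is rejected, so the cup lands on its second candidate. An object with no valid candidate is simply left out of `placements`.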

中文标题/摘要

标题：V-CAGE：面向大规模长时域体态任务的上下文感知生成与验证

从合成数据中学习长时域体态行为仍然具有挑战性，因为生成的场景往往在物理上不可行，语言驱动的程序经常“成功”而不满足任务语义，高层指令需要转化为可执行的动作序列。为了解决这些限制，我们提出了V-CAGE，一种闭环框架，用于大规模生成鲁棒且语义对齐的操纵数据集。首先，我们提出了一种上下文感知的实例化机制，在场景合成过程中强制几何一致性。通过动态维护一个禁止空间区域的地图，随着物体的放置，我们的系统防止了穿透并确保了在拥挤环境中可达且无冲突的配置。其次，为了弥合抽象意图与低级控制之间的差距，我们采用了一种分层指令分解模块。该模块将高层目标（例如，“准备上班”）分解为组合动作原语，促进连贯的长时域规划。至关重要的是，我们通过基于VLM的验证循环强制执行语义正确性。作为视觉批评者，VLM 在每个子任务后执行严格的拒绝采样，过滤掉“静默失败”情况，即代码执行但未能实现视觉目标。实验表明，V-CAGE 生成的数据集在物理和语义保真度方面更优，显著提高了下游策略的成功率和泛化能力，相比未验证的基线有显著提升。

Summary / 总结

V-CAGE is a closed-loop framework designed to generate robust and semantically aligned manipulation datasets for long-horizon embodied tasks. It introduces a context-aware instantiation mechanism to enforce geometric consistency and prevent interpenetration in cluttered environments. Additionally, it uses a hierarchical instruction decomposition module to break down high-level goals into actionable primitives, and a VLM-based verification loop to ensure semantic correctness. Experiments show that V-CAGE improves the physical and semantic fidelity of datasets, leading to higher success rates and better generalization of downstream policies compared to non-verified methods.

V-CAGE 是一个用于生成和验证大规模操作数据集的框架，适用于长期的实体任务。它引入了一种上下文感知的实例化机制以确保几何一致性，并使用层次指令分解模块将抽象目标与低级动作联系起来。框架还包括一个使用 VLM 的验证循环来过滤失败情况，确保物理和语义的正确性。实验表明，与未验证的方法相比，V-CAGE 能显著提高下游策略的成功率和泛化能力。

Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning

Authors: Yuval Kansal, Niraj K. Jha

First: 2026-01-21T16:38:59+00:00 · Latest: 2026-01-21T16:38:59+00:00

Abstract

Large language models have achieved near-expert performance in structured reasoning domains like mathematics and programming, yet their ability to perform compositional multi-hop reasoning in specialized scientific fields remains limited. We propose a bottom-up learning paradigm in which models are grounded in axiomatic domain facts and compose them to solve complex, unseen tasks. To this end, we present a post-training pipeline, based on a combination of supervised fine-tuning and reinforcement learning (RL), in which knowledge graphs act as implicit reward models. By deriving novel reward signals from knowledge graph paths, we provide verifiable, scalable, and grounded supervision that encourages models to compose intermediate axioms rather than optimize only final answers during RL. We validate this approach in the medical domain, training a 14B model on short-hop reasoning paths (1-3 hops) and evaluating its zero-shot generalization to complex multi-hop queries (4-5 hops). Our experiments show that path-derived rewards act as a "compositional bridge", enabling our model to significantly outperform much larger models and frontier systems like GPT-5.2 and Gemini 3 Pro, on the most difficult reasoning tasks. Furthermore, we demonstrate the robustness of our approach to adversarial perturbations against option-shuffling stress tests. This work suggests that grounding the reasoning process in structured knowledge is a scalable and efficient path toward intelligent reasoning.
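
One way to picture a path-derived reward is to score a response both on its final answer and on how much of the supporting knowledge graph path its reasoning covers. The sketch below is a hypothetical illustration of that idea, not the paper's reward: the string-matching coverage test, the `path_weight` mixing parameter, and the placeholder entities are all assumptions.

```python
def path_reward(reasoning, answer, gold_path, gold_answer, path_weight=0.5):
    """Mix path coverage with final-answer correctness.

    gold_path: list of (head, relation, tail) triples along the supporting path.
    A triple counts as covered when both its head and tail appear in the reasoning.
    """
    text = reasoning.lower()
    covered = sum(
        1 for head, _relation, tail in gold_path
        if head.lower() in text and tail.lower() in text
    )
    coverage = covered / len(gold_path) if gold_path else 0.0
    correct = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return path_weight * coverage + (1.0 - path_weight) * correct

# Placeholder two-hop path (not real medical facts).
gold_path = [("drug A", "inhibits", "enzyme B"), ("enzyme B", "regulates", "pathway C")]
partial = path_reward("Drug A inhibits enzyme B.", "pathway C", gold_path, "pathway C")
full = path_reward("Drug A inhibits enzyme B, which regulates pathway C.",
                   "pathway C", gold_path, "pathway C")
```

A response that reaches the right answer while skipping an intermediate hop scores lower than one that composes the full path, which is the behavior the abstract describes as rewarding intermediate axioms rather than only final answers.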

中文标题/摘要

标题：知识图谱是隐式奖励模型：路径衍生信号促进组合推理

大型语言模型在数学和编程等结构化推理领域已接近专家水平，但在执行特定科学领域的组合多跳推理方面仍受到限制。我们提出了一种自底向上的学习范式，其中模型基于公理领域的事实并组合这些事实以解决复杂的未见任务。为此，我们提出了一种基于监督微调和强化学习（RL）结合的后训练管道，其中知识图谱作为隐式奖励模型。通过从知识图谱路径中推导出新的奖励信号，我们提供了可验证、可扩展且基于事实的监督，鼓励模型组合中间公理而不是仅在RL中优化最终答案。我们在医学领域验证了这种方法，对一个14B模型进行了短跳推理路径（1-3跳）的训练，并评估了其零样本泛化能力以解决复杂的多跳查询（4-5跳）。我们的实验表明，路径衍生的奖励作为“组合桥梁”，使我们的模型在最困难的推理任务上显著优于更大的模型和前沿系统如GPT-5.2和Gemini 3 Pro。此外，我们展示了我们的方法在选项打乱压力测试中的鲁棒性。这项工作表明，将推理过程与结构化知识相结合是一种可扩展且高效的通向智能推理的道路。

Summary / 总结

The research aims to enhance the compositional reasoning capabilities of large language models in specialized scientific fields. It proposes a bottom-up learning paradigm grounded in axiomatic domain facts, using a post-training pipeline combining supervised fine-tuning and reinforcement learning. The key finding is that path-derived rewards from knowledge graphs enable the model to significantly outperform larger models and frontier systems on complex reasoning tasks, demonstrating robustness against adversarial perturbations.

研究旨在增强大型语言模型在专业科学领域的组合推理能力。它提出了一种基于公理领域事实的自底向上的学习范式，使用结合监督微调和强化学习的后训练管道。关键发现是，来自知识图的路径衍生奖励使模型在复杂推理任务上显著优于更大规模的模型和前沿系统，并且在对抗性扰动下表现出鲁棒性。

Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data

Authors: Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe

Venue: NeurIPS 2025 Spotlight

First: 2025-05-29T04:09:19+00:00 · Latest: 2026-01-21T16:37:21+00:00

Abstract

Incorporating pre-collected offline data can substantially improve the sample efficiency of reinforcement learning (RL), but its benefits can break down when the transition dynamics in the offline dataset differ from those encountered online. Existing approaches typically mitigate this issue by penalizing or filtering offline transitions in regions with a large dynamics gap. However, their dynamics-gap estimators often rely on KL divergence or mutual information, which can be ill-defined when offline and online dynamics have mismatched support. To address this challenge, we propose CompFlow, a principled framework built on the theoretical connection between flow matching and optimal transport. Specifically, we model the online dynamics as a conditional flow built upon the output distribution of a pretrained offline flow, rather than learning it directly from a Gaussian prior. This composite structure provides two advantages: (1) improved generalization when learning online dynamics under limited interaction data, and (2) a well-defined and stable estimate of the dynamics gap via the Wasserstein distance between offline and online transitions. Building on this dynamics-gap estimator, we further develop an optimistic active data collection strategy that prioritizes exploration in high-gap regions, and show theoretically that it reduces the performance gap to the optimal policy. Empirically, CompFlow consistently outperforms strong baselines across a range of RL benchmarks with shifted-dynamics data.
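
The contrast between KL divergence and Wasserstein distance under mismatched support can be seen in a small one-dimensional example. The sketch below is only an illustration of that property: it uses the closed form for the 1-Wasserstein distance between equal-size empirical samples on the real line, whereas CompFlow estimates the gap through its composite flow. Function names and sample values are assumptions.

```python
import math

def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples on the line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as dicts; infinite if support mismatches."""
    total = 0.0
    for key, p_val in p.items():
        if p_val == 0:
            continue
        q_val = q.get(key, 0.0)
        if q_val == 0:
            return math.inf            # p puts mass where q has none
        total += p_val * math.log(p_val / q_val)
    return total

# Next-state samples for the same state-action pair under offline and online dynamics.
offline_next = [0.0, 0.1, 0.2]
online_next = [1.0, 1.1, 1.2]
gap = wasserstein_1d(offline_next, online_next)   # finite, grows smoothly with the shift
```

With disjoint supports the KL divergence is infinite regardless of how far apart the distributions are, while the Wasserstein distance stays finite and reflects the size of the shift, which is what makes it usable as a dynamics-gap signal.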

中文标题/摘要

标题：复合流匹配在具有偏移动力学数据的强化学习中的应用

将预先收集的离线数据纳入可以显著提高强化学习（RL）的样本效率，但当离线数据集中的转换动力学与在线遇到的动力学不同时，其优势可能会消失。现有方法通常通过惩罚或过滤具有大动力学差距的离线过渡来缓解这一问题。然而，它们的动力学差距估计器往往依赖于KL散度或互信息，当离线和在线动力学支持不匹配时，这些估计器可能不明确。为了解决这一挑战，我们提出CompFlow，这是一种基于流匹配与最优传输之间理论联系的原理性框架。具体而言，我们将在线动力学建模为基于预训练离线流输出分布的条件流，而不是直接从高斯先验中学习。这种复合结构提供了两个优势：（1）在有限交互数据下学习在线动力学时的改进泛化能力；（2）通过离线和在线过渡之间的 Wasserstein 距离获得明确且稳定的动力学差距估计。基于这一动力学差距估计器，我们进一步开发了一种乐观的主动数据收集策略，优先探索高差距区域，并从理论上证明它减少了与最优策略的性能差距。实验上，CompFlow 在一系列具有偏移动力学数据的RL基准测试中始终优于强大的基线。

Summary / 总结

The paper addresses the challenge of using pre-collected offline data in reinforcement learning when there is a mismatch between offline and online dynamics. It proposes CompFlow, a framework that models online dynamics as a conditional flow based on a pretrained offline flow, which improves generalization and provides a well-defined dynamics gap estimate. Empirically, CompFlow outperforms strong baselines in various RL benchmarks with shifted-dynamics data.

论文解决了在离线数据与在线动力学不匹配时使用离线数据在强化学习中的挑战。它提出了CompFlow框架，该框架使用复合流结构来建模在线动力学，从而提高样本效率并提供动力学差距的稳定估计。实验中，CompFlow在各种具有偏移动力学数据的RL基准测试中优于强基线。

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Authors: Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

First: 2026-01-21T16:36:19+00:00 · Latest: 2026-01-21T16:36:19+00:00

Comments: 80 pages, 4 figures

Abstract

Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive gradient descent to discover such systematic reasoning remains poorly understood. We address this by analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought (CoT) but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, gradient flow drives the model to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler instances, the model learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, gradient-based learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.
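
The setting can be pictured with a toy traversal task in which the intermediate vertices play the role of the chain of thought and the reward depends only on the final vertex. This is a hypothetical simplification for intuition: the graph, the function names, and the reward definition are assumptions, and the paper's synthetic task and its Transformer analysis are more specific than this.

```python
def traverse(successor, start, steps):
    """Follow successor pointers vertex by vertex, recording every visited vertex."""
    trace = [start]
    for _ in range(steps):
        trace.append(successor[trace[-1]])
    return trace

def outcome_reward(trace, target):
    """Sparse, outcome-based reward: only the final vertex is checked."""
    return 1.0 if trace[-1] == target else 0.0

# A short chain a -> b -> c -> d, with d pointing to itself.
chain = {"a": "b", "b": "c", "c": "d", "d": "d"}
short_trace = traverse(chain, "a", 1)   # a "simple example": one reasoning step
long_trace = traverse(chain, "a", 3)    # a longer chain needing three steps
```

The intermediate vertices in `long_trace` are never rewarded directly, yet emitting them is what makes the final vertex reachable, which mirrors the question the abstract raises about how sparse rewards can still shape step-by-step reasoning.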

中文标题/摘要

标题：基于结果的RL使Transformer自发具备推理能力，但仅在适当数据下有效

通过强化学习（RL）和基于结果的监督训练的Transformer可以自发发展生成中间推理步骤（链式思考）的能力。然而，稀疏奖励如何驱动梯度下降以发现系统性推理机制仍不甚明了。我们通过分析单层Transformer在合成图遍历任务中的梯度流动动力学来解决这一问题，该任务不借助链式思考（CoT）就无法解决，但存在简单的迭代解法。我们证明，尽管仅基于最终答案的正确性进行训练，梯度流动仍能驱动模型收敛到一个结构化、可解释的算法，该算法逐个顶点迭代遍历图。我们描述了这种涌现所需的分布特性，确定了“简单示例”的关键作用：需要较少推理步骤的实例。当训练分布对这些更简单的实例赋予足够的权重时，模型可以学习一种可泛化的遍历策略，适用于更长的链；当这种权重消失时，基于梯度的学习变得不可行。我们通过合成数据实验和现实世界的语言模型在数学推理任务上的实验，验证了我们的理论结果，证明了我们的理论发现适用于实际应用。

Summary / 总结

The study investigates how Transformers trained with outcome-based reinforcement learning develop the ability to generate intermediate reasoning steps (Chain-of-Thought). By analyzing the gradient flow dynamics of single-layer Transformers on a synthetic graph traversal task, the researchers prove that, despite training only on final-answer correctness, the model converges to a structured algorithm that traverses the graph vertex by vertex. The key finding is that the training distribution must place sufficient mass on simple examples, those requiring fewer reasoning steps, for the model to learn a traversal strategy that generalizes to longer chains. When this mass vanishes, gradient-based learning becomes infeasible. Experiments on synthetic data and with real-world language models on mathematical reasoning tasks support these theoretical results.

研究探讨了通过结果导向的强化学习训练的Transformer如何自发发展出生成中间推理步骤的能力。通过分析合成图遍历任务上的梯度流动动态，研究证明，尽管仅基于最终答案的正确性进行训练，模型仍会收敛到一个迭代遍历图的结构化算法。关键发现是，训练样本的分布，特别是那些需要较少推理步骤的样本，对使模型学习到可泛化的遍历策略至关重要。
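
To make the setting concrete, here is a minimal Python sketch of a chain-traversal task with outcome-only reward. It is an illustration under stated assumptions, not the paper's construction: the function names, the random-path instance generator, and the `short_mass` parameter (the probability of drawing a one-step "simple example") are all hypothetical.

```python
import random

def make_chain_instance(num_vertices, chain_length, rng):
    """Build a random directed path; return (edges, start, target).

    Assumption: an instance is a simple path over distinct vertices,
    so the traversal below always terminates.
    """
    path = rng.sample(range(num_vertices), chain_length + 1)
    edges = {path[i]: path[i + 1] for i in range(chain_length)}
    return edges, path[0], path[-1]

def iterative_traversal(edges, start):
    """Follow edges vertex by vertex, recording every intermediate step."""
    steps = [start]
    while steps[-1] in edges:
        steps.append(edges[steps[-1]])
    return steps

def outcome_reward(steps, target):
    """Sparse reward: only the final answer is scored, not the steps."""
    return 1.0 if steps and steps[-1] == target else 0.0

def sample_training_set(n, num_vertices, max_length, short_mass, rng):
    """Mix one-step 'simple' instances with longer chains."""
    data = []
    for _ in range(n):
        if rng.random() < short_mass:
            length = 1
        else:
            length = rng.randint(2, max_length)
        data.append(make_chain_instance(num_vertices, length, rng))
    return data
```

Setting `short_mass` close to zero mimics the regime the abstract describes as infeasible for gradient-based learning, while a larger value keeps simple instances in the mix. The sketch only generates data and scores answers; it does not train a model.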

From Charts to Code: A Hierarchical Benchmark for Multimodal Models

Authors: Jiahao Tang, Henry Hengyuan Zhao, Lijian Wu, Yifei Tao, Dongxing Mao, Yang Wan, Jingru Tan, Min Zeng, Min Li, Alex Jinpeng Wang

First: 2025-10-20T15:11:56+00:00 · Latest: 2026-01-21T16:16:29+00:00

Abstract

We introduce Chart2Code, a new benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models (LMMs). Chart2Code is explicitly designed from a user-driven perspective, capturing diverse real-world scenarios and progressively increasing task difficulty. It consists of three levels: Level 1 (Chart Reproduction) reproduces charts from a reference figure and user query; Level 2 (Chart Editing) involves complex modifications such as changing chart types or adding elements; and Level 3 (Long-Table to Chart Generation) requires models to transform long, information-dense tables into faithful charts following user instructions. To our knowledge, this is the first hierarchical benchmark that reflects practical chart2code usage while systematically scaling task complexity. In total, Chart2Code contains 2,023 tasks across 22 chart types, paired with multi-level evaluation metrics that assess both code correctness and the visual fidelity of rendered charts. We benchmark 25 state-of-the-art (SoTA) LMMs, including both proprietary and the latest open-source models such as GPT-5, Qwen2.5-VL, InternVL3/3.5, MiMo-VL, and Seed-1.6-VL. Experimental results demonstrate that even the SoTA model GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, underscoring the difficulty of Chart2Code. We anticipate this benchmark will drive advances in multimodal reasoning and foster the development of more robust and general-purpose LMMs. Our code and data are available on Chart2Code.

中文标题/摘要

标题：从图表到代码：多模态模型的层次基准

我们引入了Chart2Code，这是一个新的基准，用于评估大型多模态模型（LMMs）的图表理解和代码生成能力。Chart2Code从用户驱动的角度明确设计，捕捉多样化的现实场景，并逐步增加任务难度。它包括三个层次：第一层（图表复现）从参考图表和用户查询复现图表；第二层（图表编辑）涉及复杂的修改，如改变图表类型或添加元素；第三层（长表格到图表生成）要求模型根据用户指令将长、信息密集的表格转换为忠实的图表。据我们所知，这是第一个反映实际图表到代码使用情况并系统地按任务复杂度扩展的层次基准。总计，Chart2Code包含2,023个任务，涉及22种图表类型，并配有多层次评估指标，评估代码正确性和渲染图表的视觉保真度。我们对25个最先进的（SoTA）LMMs进行了基准测试，包括各种专有和最新的开源模型，如GPT-5、Qwen2.5-VL、InternVL3/3.5、MiMo-VL和Seed-1.6-VL。实验结果表明，即使是最先进的模型GPT-5在代码评估中的平均得分为0.57，在编辑任务的图表质量评估中的得分为0.22，突显了Chart2Code的难度。我们预计这个基准将推动多模态推理的发展，并促进更稳健和通用的LMMs的发展。我们的代码和数据可在Chart2Code上获得。

Summary / 总结

The paper introduces Chart2Code, a hierarchical benchmark for evaluating the chart understanding and code generation capabilities of large multimodal models. It consists of three levels: chart reproduction, complex chart editing, and long-table to chart generation. The benchmark includes 2,023 tasks across 22 chart types and evaluates both code correctness and the visual fidelity of rendered charts. Among 25 benchmarked models, even GPT-5 averages only 0.57 on code-based evaluation and 0.22 on chart-quality assessment across the editing tasks, highlighting the benchmark's difficulty.

Chart2Code 是一个新基准，用于评估大型多模态模型的图表理解和代码生成能力。它包含三个级别：图表复现、复杂修改和长表格到图表生成。基准包括 22 种图表类型的 2,023 个任务，并评估代码正确性和视觉保真度。最先进的模型如 GPT-5 表现不佳，代码评估平均分为 0.57，图表质量评估平均分为 0.22，突显了 Chart2Code 的挑战性。

CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning

Authors: Tianshi Xu, Yuteng Chen, Meng Li

First: 2026-01-21T16:14:30+00:00 · Latest: 2026-01-21T16:14:30+00:00

Abstract

Agentic Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to utilize tools like Python interpreters for complex problem-solving. However, for parameter-constrained models (e.g., 4B-7B), the exploration phase is often plagued by frequent execution failures, creating noisy trajectories that hinder policy optimization. Under standard outcome-based reward settings, this noise leads to a critical credit assignment issue, where erroneous actions are inadvertently reinforced alongside successful outcomes. Existing mitigations face a dilemma: dense rewards often trigger reward hacking, while supersampling incurs prohibitive computational costs. To address these challenges, we propose CLEANER. Distinct from external filtering methods, CLEANER exploits the model's intrinsic self-correction capabilities to eliminate error-contaminated context directly during data collection. At its core, the Similarity-Aware Adaptive Rollback (SAAR) mechanism autonomously constructs clean, purified trajectories by retrospectively replacing failures with successful self-corrections. Based on semantic similarity, SAAR adaptively regulates replacement granularity from shallow execution repairs to deep reasoning substitutions. By training on these self-purified paths, the model internalizes correct reasoning patterns rather than error-recovery loops. Empirical results on AIME24/25, GPQA, and LiveCodeBench show average accuracy gains of 6%, 3%, and 5% over baselines. Notably, CLEANER matches state-of-the-art performance using only one-third of the training steps, highlighting trajectory purification as a scalable solution for efficient agentic RL. Our models and code are available at GitHub.

中文标题/摘要

标题：CLEANER: 自净化轨迹提升代理强化学习

代理强化学习（RL）使大型语言模型（LLMs）能够利用如Python解释器等工具解决复杂问题。然而，对于参数受限模型（例如4B-7B），探索阶段经常受到频繁执行失败的困扰，产生嘈杂的轨迹，阻碍策略优化。在标准基于结果的奖励设置下，这种噪声导致关键的信用分配问题，错误行为与成功结果一同被无意中强化。现有缓解措施面临困境：密集奖励常引发奖励作弊，而超采样则带来高昂的计算成本。为解决这些挑战，我们提出了CLEANER。不同于外部过滤方法，CLEANER利用模型的内在自我纠正能力，在数据收集过程中直接消除错误污染的上下文。其核心是相似性感知自适应回滚（SAAR）机制，该机制通过回顾性地将失败替换为成功的自我纠正，自主构建清洁、净化的轨迹。基于语义相似性，SAAR自适应调节替换粒度，从浅层执行修复到深层推理替换。通过在这些自我净化的路径上训练，模型内化正确的推理模式而非错误恢复循环。在AIME24/25、GPQA和LiveCodeBench上的实验表明，CLEANER相比基线平均准确率分别提高了6%、3%和5%。值得注意的是，CLEANER仅使用三分之一的训练步骤就达到了最先进的性能，突显了轨迹净化作为高效代理RL的可扩展解决方案。我们的模型和代码可在GitHub获取。

Summary / 总结

CLEANER addresses the issue of noisy trajectories in agentic reinforcement learning for parameter-constrained models by proposing a Similarity-Aware Adaptive Rollback (SAAR) mechanism that purifies trajectories during data collection. This method autonomously replaces errors with successful self-corrections based on semantic similarity, avoiding the pitfalls of dense rewards and supersampling. Experiments on AIME24/25, GPQA, and LiveCodeBench demonstrate average accuracy gains of 6%, 3%, and 5%, respectively, and CLEANER achieves state-of-the-art performance with only one-third of the training steps, indicating the scalability of trajectory purification for efficient agentic RL.

CLEANER 通过提出一种相似性感知自适应回滚（SAAR）机制，在数据收集过程中净化轨迹，解决了参数受限模型中强化学习中的噪声轨迹问题。该方法基于语义相似性自动用成功自纠正替换错误，避免了密集奖励和超采样的问题。实验结果表明，在AIME24/25、GPQA和LiveCodeBench上的平均准确率分别提高了6%、3%和5%，CLEANER 使用仅三分之一的训练步骤就达到了最先进的性能。
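
The rollback idea can be pictured with a small Python sketch. This is an illustration under loud assumptions, not the authors' SAAR implementation: `Step`, `purify`, and the `threshold` value are hypothetical, lexical similarity from `difflib` stands in for semantic similarity, and the rule "high similarity means shallow repair, low similarity means deep substitution" is a guess at how granularity might be chosen.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Step:
    reasoning: str  # the model's reasoning text before the tool call
    code: str       # the code sent to the interpreter
    ok: bool        # whether execution succeeded

def similarity(a, b):
    """Lexical stand-in for semantic similarity (assumption)."""
    return SequenceMatcher(None, a, b).ratio()

def purify(trajectory, threshold=0.6):
    """Replace a failed step with the successful retry that follows it.

    Failed steps without an immediately following success are kept as-is.
    """
    clean = []
    i = 0
    while i < len(trajectory):
        step = trajectory[i]
        nxt = trajectory[i + 1] if i + 1 < len(trajectory) else None
        if not step.ok and nxt is not None and nxt.ok:
            if similarity(step.code, nxt.code) >= threshold:
                # Shallow repair: keep the original reasoning, swap in fixed code.
                clean.append(Step(step.reasoning, nxt.code, True))
            else:
                # Deep substitution: the retry changed approach, take it whole.
                clean.append(nxt)
            i += 2  # the failure and its error-recovery turn are both consumed
        else:
            clean.append(step)
            i += 1
    return clean
```

The purified list no longer contains the failed call or the recovery turn, which is the property the abstract attributes to training on self-purified paths.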

Graph Recognition via Subgraph Prediction

Authors: André Eberhard, Gerhard Neumann, Pascal Friederich

First: 2026-01-21T16:07:17+00:00 · Latest: 2026-01-21T16:07:17+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, Graph Recognition via Subgraph Prediction (GraSP), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.

中文标题/摘要

标题：基于子图预测的图识别

尽管在图像分类、目标检测和分割等任务上取得了巨大的进步，但视觉关系的识别，通常建模为从图像中提取一个图的过程，仍然是一个具有挑战性的任务。我们认为，这主要是因为目前没有一种标准的方法来处理视觉图识别任务。大多数现有的解决方案都是针对特定问题的，无法在不同上下文之间直接转移，尽管概念问题是一样的。为了广泛适用性和简洁性，我们在本文中开发了一种方法，称为基于子图预测的图识别（GraSP），用于在图像中识别图。我们通过几个合成基准和一个实际应用展示了我们的方法可以处理多种类型的图及其绘制，并且可以在任务之间进行转移而无需针对特定任务进行修改，为视觉图识别提供了一个更统一的框架。

Summary / 总结

This paper addresses the challenge of visual graph recognition, where graphs are extracted from images. The motivation is to develop a method that can be applied broadly and transferred between tasks without modification. The main method is GraSP, which predicts subgraphs to recognize graphs in images. Experiments on several synthetic benchmarks and one real-world application show that GraSP handles diverse types of graphs and their drawings and transfers between tasks without task-specific modifications.

论文针对视觉图识别任务：尽管图像分类、目标检测等任务已取得巨大进步，但由于缺乏标准方法，从图像中提取图仍然具有挑战性。作者提出了GraSP方法，通过预测子图来识别图像中的图。实验结果显示，GraSP能够处理不同类型的图，并且可以在不进行任务特定修改的情况下应用于不同的任务，表明了视觉图识别的更统一框架的可能性。

Vehicle Routing with Finite Time Horizon using Deep Reinforcement Learning with Improved Network Embedding

Authors: Ayan Maity, Sudeshna Sarkar

Venue: AAAI

First: 2026-01-21T16:05:04+00:00 · Latest: 2026-01-21T16:05:04+00:00

Comments: Accepted at AAAI-26 Workshop on AI for Urban Planning

Abstract

In this paper, we study the vehicle routing problem with a finite time horizon. In this routing problem, the objective is to maximize the number of customer requests served within a finite time horizon. We present a novel routing network embedding module which creates local node embedding vectors and a context-aware global graph representation. The proposed Markov decision process for the vehicle routing problem incorporates the node features, the network adjacency matrix and the edge features as components of the state space. We incorporate the remaining finite time horizon into the network embedding module to provide a proper routing context to the embedding module. We integrate our embedding module with a policy gradient-based deep Reinforcement Learning framework to solve the vehicle routing problem with finite time horizon. We trained and validated our proposed routing method on real-world routing networks, as well as synthetically generated Euclidean networks. Our experimental results show that our method achieves a higher customer service rate than the existing routing methods. Additionally, the solution time of our method is significantly lower than that of the existing methods.

中文标题/摘要

标题：使用改进网络嵌入的深度强化学习求解有限时间范围内的车辆路径规划

在本文中，我们研究了有限时间范围内的车辆路径规划问题。在该路径规划问题中，目标是在有限的时间范围内最大化服务的客户请求数量。我们提出了一种新颖的路径网络嵌入模块，该模块创建局部节点嵌入向量和上下文感知的全局图表示。我们为车辆路径规划问题提出的马尔可夫决策过程将节点特征、网络邻接矩阵和边特征作为状态空间的组成部分。我们将剩余的有限时间范围整合到网络嵌入模块中，为嵌入模块提供适当的路径上下文。我们将嵌入模块与基于策略梯度的深度强化学习框架结合，以解决有限时间范围内的车辆路径规划问题。我们在实际路径网络以及合成的欧几里得网络上训练并验证了我们提出的方法。实验结果表明，我们的方法在客户服务率方面优于现有方法。此外，我们方法的求解时间显著低于现有方法。

Summary / 总结

This paper addresses the vehicle routing problem with a finite time horizon, aiming to maximize the number of customer requests served within a specified time. It introduces a novel routing network embedding module that generates local node embeddings and a context-aware global graph representation. The method uses a Markov decision process that includes node features, network adjacency matrix, and edge features. By integrating the remaining finite time horizon into the network embedding, the method provides a proper routing context. Experimental results on real-world and synthetic networks show that the proposed method outperforms existing methods in terms of customer service rate and solution time.

本文研究了有限时间范围内的车辆路径规划问题，目标是在指定时间内最大化服务的客户请求数量。提出了一种新型的路由网络嵌入模块，生成局部节点嵌入向量以及上下文感知的全局图表示。方法使用了一个包含节点特征、网络邻接矩阵和边特征的马尔可夫决策过程。通过将剩余的有限时间范围整合到网络嵌入中，该方法为嵌入模块提供了适当的路由上下文。实验结果表明，该方法在真实世界和合成网络上的客户服务率和求解时间都优于现有方法。
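
The state described above (node features, adjacency matrix, edge features, and remaining time) can be sketched as a small Python data structure. This is a toy illustration, not the paper's model: `RoutingState`, `feasible_moves`, and `step` are hypothetical names, a single vehicle is assumed, the only node feature is an unserved-request flag, the only edge feature is travel time, and the embedding module and policy network are omitted.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RoutingState:
    demand: tuple          # node feature: 1 if the node has an unserved request
    adjacency: tuple       # adjacency[i][j] == 1 if edge i -> j exists
    travel_time: tuple     # edge feature: travel_time[i][j]
    current: int           # node where the vehicle currently is
    remaining_time: float  # time left in the finite horizon

def feasible_moves(s):
    """Neighbors reachable before the horizon runs out."""
    return [j for j, a in enumerate(s.adjacency[s.current])
            if a and s.travel_time[s.current][j] <= s.remaining_time]

def step(s, j):
    """Move to node j; the reward is 1 if a pending request is served there."""
    if j not in feasible_moves(s):
        raise ValueError("move is not feasible")
    reward = s.demand[j]
    demand = tuple(0 if k == j else d for k, d in enumerate(s.demand))
    nxt = replace(
        s,
        demand=demand,
        current=j,
        remaining_time=s.remaining_time - s.travel_time[s.current][j],
    )
    return nxt, reward
```

Summing the rewards over an episode gives the number of requests served within the horizon, which is the objective the abstract states.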

The Plausibility Trap: Using Probabilistic Engines for Deterministic Tasks

Authors: Ivan Carrera, Daniel Maldonado-Ruiz

First: 2026-01-21T16:05:01+00:00 · Latest: 2026-01-21T16:05:01+00:00

Abstract

The ubiquity of Large Language Models (LLMs) is driving a paradigm shift where user convenience supersedes computational efficiency. This article defines the "Plausibility Trap": a phenomenon where individuals with access to Artificial Intelligence (AI) models deploy expensive probabilistic engines for simple deterministic tasks, such as Optical Character Recognition (OCR) or basic verification, resulting in significant resource waste. Through micro-benchmarks and case studies on OCR and fact-checking, we quantify the "efficiency tax" (demonstrating a ~6.5x latency penalty) and the risks of algorithmic sycophancy. To counter this, we introduce Tool Selection Engineering and the Deterministic-Probabilistic Decision Matrix, a framework to help developers determine when to use Generative AI and, crucially, when to avoid it. We argue for a curriculum shift, emphasizing that true digital literacy relies not only on knowing how to use Generative AI, but also on knowing when not to use it.

中文标题/摘要

标题：合理性陷阱：使用概率引擎完成确定性任务

大型语言模型（LLMs）的普及正在推动一种范式转变，用户便利性超越了计算效率。本文定义了“合理性陷阱”现象：拥有AI模型访问权的个人使用昂贵的概率引擎来完成简单的确定性任务，如光学字符识别（OCR）或基本验证，导致了大量资源浪费。通过微基准测试和OCR及事实核查案例研究，我们量化了“效率税”（约6.5倍的延迟惩罚），并指出了算法阿谀奉承的风险。为应对这一问题，我们提出了工具选择工程和确定性-概率决策矩阵，这是一种帮助开发者确定何时使用生成式AI以及何时避免使用它的框架。我们主张课程改革，认为真正的数字素养不仅在于知道如何使用生成式AI，还在于知道何时不应使用它。
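
A latency penalty like the one quoted above is a ratio of two measured latencies. The Python sketch below shows one way such a micro-benchmark could be set up. It is an assumption-laden illustration, not the authors' harness: `median_latency` and `latency_penalty` are hypothetical names, and the two callables stand in for a deterministic tool and a generative-model call.

```python
import statistics
import time

def median_latency(fn, arg, repeats=5):
    """Median wall-clock time of fn(arg) over several runs, in seconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def latency_penalty(deterministic_fn, probabilistic_fn, arg, repeats=5):
    """How many times slower the probabilistic path is than the deterministic one."""
    base = median_latency(deterministic_fn, arg, repeats)
    if base <= 0.0:
        raise ValueError("deterministic latency too small to measure")
    return median_latency(probabilistic_fn, arg, repeats) / base
```

In practice `deterministic_fn` might wrap a conventional OCR library and `probabilistic_fn` a call to a hosted model. A ratio near 6.5 would correspond to the figure reported in the abstract, but the value depends entirely on the tools and hardware being compared.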

Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid

First: 2025-09-09T09:01:01+00:00 · Latest: 2026-01-21T16:03:15+00:00

Comments: Accepted at ASRU 2025

Abstract

Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored despite audio's centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data, less than 30K hours (5K unique), Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, scoring 64.14, on par with R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.

中文标题/摘要

标题：基于公共数据数据高效单阶段训练的竞争力音频-语言模型

大型语言模型（LLMs）已彻底改变NLP，但尽管音频对人类交流至关重要，其与音频的整合仍未得到充分探索。我们介绍了Falcon3-Audio，这是一个基于指令调优LLMs和Whisper编码器的音频-语言模型（ALMs）家族。使用不到30K小时（5K唯一）的公共音频数据，Falcon3-Audio-7B在MMAU基准测试中达到了开放权重模型中的最佳报告性能，得分为64.14，与R1-AQA相当，同时凭借更优的数据和参数效率、单阶段训练和透明性脱颖而出。值得注意的是，我们最小的1B模型仍能与参数量从2B到13B的更大开放模型竞争。通过广泛的消融实验，我们发现常见的复杂设计，如课程学习、多个音频编码器和复杂的交叉注意力连接器，并非强性能所必需，即使与在超过500K小时数据上训练的模型相比也是如此。

Summary / 总结

The research introduces Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using less than 30K hours of public audio data, Falcon3-Audio-7B scores 64.14 on the MMAU benchmark, matching the best reported performance among open-weight models (R1-AQA) while offering better data and parameter efficiency, single-stage training, and transparency. Notably, the smallest 1B model remains competitive with larger open models of 2B to 13B parameters. Ablations show that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are not required for strong performance, even compared to models trained on over 500K hours of data.

论文介绍了Falcon3-Audio，这是一种结合了指令调优的大语言模型和Whisper编码器的音频-语言模型家族。使用不到30K小时的公共音频数据，Falcon3-Audio-7B在MMAU基准测试中达到了与最佳开放权重模型相当的性能。该模型展示了数据和参数效率高、单阶段训练和透明度的优势。值得注意的是，即使是最小的1B模型也能与参数量为2B至13B的更大开放模型竞争；消融实验表明，课程学习、多个音频编码器等复杂设计并非强性能所必需。