arXiv Paper Digest

ARC Is a Vision Problem!

Authors: Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang, Yeyin Eva Zhu, Jacob Andreas, Kaiming He

First: 2025-11-18T18:59:49+00:00 · Latest: 2025-11-18T18:59:49+00:00

Comments: Technical Report. Project webpage: https://github.com/lillian039/VARC

Abstract

The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.

Summary

The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a key aspect of human intelligence. Previous approaches treated ARC as a language problem; this work instead formulates it within a vision paradigm, framing it as an image-to-image translation problem. The model, termed Vision ARC (VARC), uses a vanilla Vision Transformer, is trained from scratch solely on ARC data, and generalizes to unseen tasks through test-time training. It achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch, with results competitive with leading large language models and approaching average human performance.
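To make the "canvas" idea from the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes a fixed square canvas, a background value distinct from the ARC colors, and a caller-chosen offset; `place_on_canvas` and `BACKGROUND` are hypothetical names, and the paper may handle sizes, scaling, and offsets differently.

```python
from typing import List

Grid = List[List[int]]

BACKGROUND = -1  # hypothetical "empty canvas" value, distinct from ARC colors 0-9


def place_on_canvas(grid: Grid, canvas_size: int, top: int, left: int) -> Grid:
    """Copy a small ARC grid onto a fixed-size square canvas at (top, left)."""
    height, width = len(grid), len(grid[0])
    if top < 0 or left < 0 or top + height > canvas_size or left + width > canvas_size:
        raise ValueError("grid does not fit on the canvas at this offset")
    canvas = [[BACKGROUND] * canvas_size for _ in range(canvas_size)]
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            canvas[top + r][left + c] = color
    return canvas


# A 2x2 grid placed one cell in from the top-left corner of a 4x4 canvas.
canvas = place_on_canvas([[1, 2], [3, 4]], canvas_size=4, top=1, left=1)
```

Once every grid lives on a canvas of the same size, inputs and outputs can be fed to an image model in the same way as ordinary fixed-resolution images.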

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Authors: Rui Tian, Mingfei Gao, Haiming Gang, Jiasen Lu, Zhe Gan, Yinfei Yang, Zuxuan Wu, Afshin Dehghan

First: 2025-11-18T18:59:30+00:00 · Latest: 2025-11-18T18:59:30+00:00

Abstract

We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. Building upon UniGen, we comprehensively enhance the model architecture and training pipeline to strengthen the image understanding and generation capabilities while unlocking strong image editing ability. Especially, we propose a unified Reinforcement Learning (RL) strategy that improves both image generation and image editing jointly via shared reward models. To further enhance image editing performance, we propose a light Edit Instruction Alignment stage that significantly improves the editing instruction comprehension that is essential for the success of the RL training. Experimental results show that UniGen-1.5 demonstrates competitive understanding and generation performance. Specifically, UniGen-1.5 achieves 0.89 and 4.31 overall scores on GenEval and ImgEdit that surpass the state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.

Summary

UniGen-1.5 is a unified multimodal large language model for image understanding, generation, and editing. It uses a unified RL strategy with shared reward models to improve generation and editing jointly, and adds a light Edit Instruction Alignment stage to improve comprehension of editing instructions. UniGen-1.5 scores 0.89 overall on GenEval and 4.31 on ImgEdit, surpassing state-of-the-art models such as BAGEL and reaching performance comparable to proprietary models such as GPT-Image-1.

$π^{*}_{0.6}$: a VLA That Learns From Experience

Authors: Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou

First: 2025-11-18T18:58:55+00:00 · Latest: 2025-11-18T18:58:55+00:00

Abstract

We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $π^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $π^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.

Summary

The study explores how vision-language-action (VLA) models can improve through real-world deployment via reinforcement learning (RL). The RECAP method trains VLAs with RL via advantage conditioning and incorporates heterogeneous data: demonstrations, on-policy data, and expert teleoperated interventions provided during autonomous execution. The generalist $π^{*}_{0.6}$ model is pre-trained with offline RL and then specialized to downstream tasks through on-robot data collection. Trained with the full RECAP method, it can fold laundry in real homes, reliably assemble boxes, and make espresso drinks; on some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the failure rate.
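As a rough illustration of what "advantage conditioning" can look like in data preparation, the sketch below labels each logged step with a binary indicator derived from its advantage, which a policy could then receive as an extra input. This is an assumption-laden toy, not the RECAP implementation: the `Step` fields, the threshold, and the text form of the condition are all hypothetical, and the returns and value estimates are assumed to be given.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    observation: str
    action: str
    ret: float    # empirical return from this step (assumed given)
    value: float  # value estimate for this step (assumed given)


def label_with_advantage(steps: List[Step], threshold: float = 0.0) -> List[dict]:
    """Attach a binary indicator that a policy could be conditioned on."""
    examples = []
    for step in steps:
        advantage = step.ret - step.value  # how much better than expected
        positive = advantage > threshold
        examples.append({
            "observation": step.observation,
            "condition": "advantage: positive" if positive else "advantage: negative",
            "target_action": step.action,
        })
    return examples


data = label_with_advantage([
    Step("towel on table", "grasp corner", ret=1.0, value=0.4),
    Step("towel on table", "push away", ret=0.1, value=0.4),
])
```

Under this scheme all data, good or bad, stays in the training set; at deployment the policy would be conditioned on the positive indicator to favor actions that looked better than expected.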

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Authors: Youpeng Li, Fuxun Yu, Xinda Wang

First: 2025-11-14T21:57:48+00:00 · Latest: 2025-11-18T18:53:42+00:00

Abstract

The widespread reliance on open-source software dramatically increases the risk of vulnerability exploitation, underscoring the need for effective and scalable vulnerability detection (VD). Existing VD techniques, whether traditional machine learning-based or LLM-based approaches like prompt engineering, supervised fine-tuning, or off-policy preference optimization, remain fundamentally limited in their ability to perform context-aware analysis: They depend on fixed inputs or static preference datasets, cannot adaptively explore repository-level dependencies, and are constrained by function-level benchmarks that overlook critical vulnerability context. This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware VD. To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with lightweight method to extract repository-level context information. We then design multi-dimensional reward structuring that jointly captures prediction correctness, vulnerability localization accuracy, and the semantic relevance of vulnerability analysis, thereby guiding the model toward comprehensive contextual reasoning. To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling, encouraging the model to explore challenging cases while maintaining balanced reward distribution. Extensive experiments demonstrate the superiority of our VULPO framework in context-aware VD: Our VULPO-4B substantially outperforms existing VD baselines based on prompt engineering and off-policy optimization, improving F1 by 85% over Qwen3-4B and achieving performance comparable to a 150x larger-scale model, DeepSeek-R1-0528.

Summary

VULPO is an on-policy LLM reinforcement learning framework for context-aware vulnerability detection (VD) in open-source software. It addresses the limitations of existing techniques with ContextVul, a new dataset that adds repository-level context to function-level samples, together with multi-dimensional reward structuring and difficulty-adaptive reward scaling. VULPO-4B outperforms existing VD baselines, improving F1 by 85% over Qwen3-4B and achieving performance comparable to DeepSeek-R1-0528, a model about 150 times larger.

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Authors: Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

First: 2025-11-18T18:52:22+00:00 · Latest: 2025-11-18T18:52:22+00:00

Abstract

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distilled a light-weight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and MapAnything, Co-Me achieves up to $11.3\times$ and $7.2\times$ speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.

Summary

The paper introduces Confidence-Guided Token Merging (Co-Me), a mechanism that accelerates visual geometric transformers without retraining or fine-tuning the base model. A distilled lightweight confidence predictor ranks tokens by uncertainty, and low-confidence tokens are selectively merged, which reduces computation while preserving spatial coverage. Because the confidence signal reliably indicates the regions the transformer emphasizes, Co-Me avoids the performance degradation associated with similarity-based merging or pruning. It applies to various multi-view and streaming visual geometric transformers, with speedups that scale with sequence length: up to 11.3x on VGGT and 7.2x on MapAnything, making these models practical for real-time 3D perception and reconstruction.
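The ranking-and-merging step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Co-Me algorithm itself: confidences are taken as given rather than predicted, low-confidence tokens are averaged in fixed-size groups in their original order, and `merge_low_confidence`, `keep_ratio`, and `group_size` are hypothetical names.

```python
from typing import List, Sequence

Token = List[float]


def merge_low_confidence(tokens: Sequence[Token], confidence: Sequence[float],
                         keep_ratio: float, group_size: int) -> List[Token]:
    """Keep the most confident tokens; average the rest in fixed-size groups."""
    order = sorted(range(len(tokens)), key=lambda i: confidence[i], reverse=True)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept = sorted(order[:n_keep])   # high-confidence tokens, original order
    low = sorted(order[n_keep:])    # low-confidence tokens, original order
    merged = [list(tokens[i]) for i in kept]
    for start in range(0, len(low), group_size):
        group = [tokens[i] for i in low[start:start + group_size]]
        # Merging (rather than dropping) keeps a representative for the region.
        merged.append([sum(vals) / len(group) for vals in zip(*group)])
    return merged


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
out = merge_low_confidence(tokens, confidence=[0.9, 0.8, 0.2, 0.1],
                           keep_ratio=0.5, group_size=2)
```

In this toy run, four tokens shrink to three: the two confident tokens pass through unchanged and the two uncertain ones are replaced by their mean. Since attention cost grows with the number of tokens, shortening the sequence is where the saving would come from.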

Vision Large Language Models Are Good Noise Handlers in Engagement Analysis

Authors: Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, Xiaobai Li

First: 2025-11-18T18:50:26+00:00 · Latest: 2025-11-18T18:50:26+00:00

Abstract

Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.

Summary

The research aims to improve engagement recognition in video datasets by addressing subjective and noisy labels. Vision Large Language Models (VLMs) are used to refine annotations and guide training: a questionnaire extracts behavioral cues and splits the data into high- and low-reliability subsets. Training combines curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. Classical computer vision models trained this way surpass the prior state of the art on EngageNet in three of six feature settings (maximum improvement of +1.21%) and on DREAMS and PAFE, with F1 gains of +0.22 and +0.06 respectively.

OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model

Authors: Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis

First: 2025-06-01T22:15:45+00:00 · Latest: 2025-11-18T18:49:00+00:00

Comments: 13 pages

Abstract

We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/

Summary

OG-VLA is a novel architecture that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. It maps natural language instructions and one or more RGBD observations to quasi-static robot actions. Input observations are unprojected into a point cloud and rendered from canonical orthographic views, which ensures input view invariance and consistency between input and output spaces. These views are then processed by a vision backbone, a Large Language Model, and an image diffusion model to generate images that encode the next position and orientation of the end-effector. On the Arnold and Colosseum benchmarks, OG-VLA achieves state-of-the-art generalization to unseen environments with over 40% relative improvement while maintaining robust performance in seen settings, and it adapts to real-world tasks from 3 to 5 demonstrations.
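The unproject-then-render step in the abstract rests on two standard geometric operations, sketched below for a single pixel. This is a minimal illustration, not the OG-VLA pipeline: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), stays in the camera frame (the camera-to-world transform is omitted), and the function names and `pixels_per_meter` parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point:
    """Pinhole back-projection of one pixel with depth into the camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def orthographic_pixels(points: List[Point], drop_axis: int,
                        pixels_per_meter: float) -> List[Tuple[int, int]]:
    """Project along one axis: coordinates are scaled, never divided by depth."""
    keep = [axis for axis in range(3) if axis != drop_axis]
    return [(round(p[keep[0]] * pixels_per_meter),
             round(p[keep[1]] * pixels_per_meter)) for p in points]


# Two pixels on the same image row, both observed at 2 m depth.
cloud = [unproject(320, 240, 2.0, fx=500, fy=500, cx=320, cy=240),
         unproject(420, 240, 2.0, fx=500, fy=500, cx=320, cy=240)]
front = orthographic_pixels(cloud, drop_axis=2, pixels_per_meter=100)
```

Because the orthographic image depends only on where the points are, not on where the camera stood, the same scene produces the same rendered views regardless of input viewpoint, which is the view invariance the abstract refers to.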

Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge

Authors: Antonia Ebner, Christoph Bartmann, Sonja Topf, Sohvi Luukkonen, Johannes Schimunek, Günter Klambauer

First: 2025-11-18T18:43:42+00:00 · Latest: 2025-11-18T18:43:42+00:00

Abstract

Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection - akin to vision's "ImageNet moment" - arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To this end, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner - the ensemble-based DeepTox method - and the descriptor-based self-normalizing neural networks introduced in 2017, continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.

Summary

The paper aims to measure progress in AI for drug discovery by introducing a reproducible leaderboard, hosted on Hugging Face, that uses the original Tox21 Challenge dataset. A set of baseline and representative methods is compared on this dataset. In the current version of the leaderboard, the original Tox21 winner, DeepTox, and the self-normalizing neural networks introduced in 2017 still rank among the top methods, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade.

Beyond Means: A Dynamic Framework for Predicting Customer Satisfaction

Authors: Christof Naumzik, Abdurahman Maarouf, Stefan Feuerriegel, Markus Weinmann

First: 2025-11-18T18:43:29+00:00 · Latest: 2025-11-18T18:43:29+00:00

Abstract

Online ratings influence customer decision-making, yet standard aggregation methods, such as the sample mean, fail to adapt to quality changes over time and ignore review heterogeneity (e.g., review sentiment, a review's helpfulness). To address these challenges, we demonstrate the value of using the Gaussian process (GP) framework for rating aggregation. Specifically, we present a tailored GP model that captures the dynamics of ratings over time while additionally accounting for review heterogeneity. Based on 121,123 ratings from Yelp, we compare the predictive power of different rating aggregation methods in predicting future ratings, thereby finding that the GP model is considerably more accurate and reduces the mean absolute error by 10.2% compared to the sample mean. Our findings have important implications for marketing practitioners and customers. By moving beyond means, designers of online reputation systems can display more informative and adaptive aggregated rating scores that are accurate signals of expected customer satisfaction.

Summary

The study aims to improve the prediction of customer satisfaction by addressing the limitations of standard rating aggregation methods such as the sample mean, which fail to adapt to quality changes over time and ignore review heterogeneity. The authors propose a tailored Gaussian process (GP) model that captures rating dynamics and accounts for review heterogeneity. On 121,123 Yelp ratings, the GP model predicts future ratings more accurately than the sample mean, reducing the mean absolute error by 10.2%, so aggregated scores become better signals of expected customer satisfaction.
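To show why a GP can track quality changes that a sample mean averages away, here is a small self-contained sketch. It is plain GP regression over time with an RBF kernel, not the paper's tailored model: review heterogeneity is ignored, and the toy ratings, `length_scale`, and `noise` values are assumptions chosen for illustration.

```python
import math
from typing import List


def rbf(t1: float, t2: float, length_scale: float) -> float:
    """Squared-exponential kernel: nearby times get similar ratings."""
    return math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def gp_predict(times, ratings, t_new, length_scale=2.0, noise=0.5):
    """GP posterior mean at t_new, using the sample mean as the prior mean."""
    mean = sum(ratings) / len(ratings)
    k = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(times)] for i, a in enumerate(times)]
    alpha = solve(k, [y - mean for y in ratings])
    return mean + sum(rbf(t_new, t, length_scale) * w for t, w in zip(times, alpha))


times = [0, 1, 2, 3]
ratings = [5, 5, 2, 2]  # toy data: quality drops halfway through
sample_mean = sum(ratings) / len(ratings)
gp_estimate = gp_predict(times, ratings, t_new=4)
```

On this toy series the sample mean sits midway between the old and new quality levels, while the GP estimate for the next period is pulled toward the recent, lower ratings.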

A Neural Field-Based Approach for View Computation & Data Exploration in 3D Urban Environments

Authors: Stefan Cobeli, Kazi Shahrukh Omar, Rodrigo Valença, Nivan Ferreira, Fabio Miranda

First: 2025-11-18T18:41:28+00:00 · Latest: 2025-11-18T18:41:28+00:00

Comments: Accepted at IEEE Transactions on Visualization and Computer Graphics. Code and data are publicly available at https://urbantk.org/neural-3d

Abstract

Despite the growing availability of 3D urban datasets, extracting insights remains challenging due to computational bottlenecks and the complexity of interacting with data. In fact, the intricate geometry of 3D urban environments results in high degrees of occlusion and requires extensive manual viewpoint adjustments that make large-scale exploration inefficient. To address this, we propose a view-based approach for 3D data exploration, where a vector field encodes views from the environment. To support this approach, we introduce a neural field-based method that constructs an efficient implicit representation of 3D environments. This representation enables both faster direct queries, which consist of the computation of view assessment indices, and inverse queries, which help avoid occlusion and facilitate the search for views that match desired data patterns. Our approach supports key urban analysis tasks such as visibility assessments, solar exposure evaluation, and assessing the visual impact of new developments. We validate our method through quantitative experiments, case studies informed by real-world urban challenges, and feedback from domain experts. Results show its effectiveness in finding desirable viewpoints, analyzing building facade visibility, and evaluating views from outdoor spaces. Code and data are publicly available at https://urbantk.org/neural-3d.

Summary

The paper addresses the challenge of efficiently exploring 3D urban datasets by proposing a view-based approach in which a vector field encodes views from the environment. A neural field-based method constructs an efficient implicit representation of the 3D environment, enabling faster direct queries (computing view assessment indices) and inverse queries (searching for views that avoid occlusion and match desired data patterns). The approach supports tasks such as visibility assessment, solar exposure evaluation, and assessing the visual impact of new developments. Experiments demonstrate its effectiveness in finding desirable viewpoints and analyzing building facade visibility.

Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Authors: Parya Dolatyabi, Mahdi Khodayar

First: 2025-11-18T18:23:35+00:00 · Latest: 2025-11-18T18:23:35+00:00

Comments: 6 pages, 4 figures, TPEC 2025 Conference

Abstract

Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.

Summary

This paper addresses the restoration of power distribution systems after large-scale outages using a Heterogeneous-Agent Reinforcement Learning framework instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO). Decentralized actor policies, one per microgrid, are trained with a centralized critic to handle the structural heterogeneity of interconnected microgrids. On the IEEE 123-bus and IEEE 8500-node systems, HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than the compared reinforcement learning baselines, while operational limits are enforced through penalty signals rather than invalid action masking.

Automated proving in planar geometry based on the complex number identity method and elimination

Authors: Zoltán Kovács, Xicheng Peng

First: 2025-11-18T18:20:17+00:00 · Latest: 2025-11-18T18:20:17+00:00

Comments: 15 pages, 4 figures

Abs · PDF · Code1 · Code2

Abstract

We improve the complex number identity proving method to a fully automated procedure, based on elimination ideals. By using declarative equations or rewriting each real-relational hypothesis $h_i$ to $h_i-r_i$, and the thesis $t$ to $t-r$, clearing the denominators and introducing an extra expression with a slack variable, we eliminate all free and relational point variables. From the obtained ideal $I$ in $\mathbb{Q}[r,r_1,r_2,\ldots]$ we can find a conclusive result. It plays an important role that if $r_1,r_2,\ldots$ are real, $r$ must also be real if there is a linear polynomial $p(r)\in I$, unless division by zero occurs when expressing $r$. Our results are presented in Mathematica, Maple and in a new version of the Giac computer algebra system. Finally, we present a prototype of the automated procedure in an experimental version of the dynamic geometry software GeoGebra.
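
The realness argument in the abstract can be written out explicitly. The coefficient names $a$ and $b$ below are introduced here for illustration only; they stand for the polynomials in $r_1, r_2, \ldots$ that multiply $r$ and form the constant term of the linear polynomial $p(r)$.

```latex
p(r) = a(r_1, r_2, \ldots)\, r + b(r_1, r_2, \ldots) \in I
\quad\Longrightarrow\quad
r = -\frac{b(r_1, r_2, \ldots)}{a(r_1, r_2, \ldots)} \in \mathbb{R}
\quad \text{whenever } r_1, r_2, \ldots \in \mathbb{R}
\text{ and } a(r_1, r_2, \ldots) \neq 0 .
```

Since $a$ and $b$ have rational coefficients, real values of $r_1, r_2, \ldots$ give real values of $a$ and $b$; the case $a = 0$ is the division-by-zero exception mentioned in the abstract.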

中文标题/摘要

标题：基于复数恒等式方法与消元法的平面几何自动证明

我们改进了复数恒等式证明方法，基于消元理想，实现了一个全自动的过程。通过使用声明性方程或将每个实关系假设 $h_i$ 转换为 $h_i-r_i$，将结论 $t$ 转换为 $t-r$，清除分母并引入一个带有松弛变量的额外表达式，我们消除了所有自由和关系点变量。从在 $\mathbb{Q}[r,r_1,r_2,\ldots]$ 中获得的理想 $I$ 中，我们可以找到一个结论。如果 $r_1,r_2,\ldots$ 是实数，除非在表示 $r$ 时出现除零错误，否则只要 $I$ 中存在线性多项式 $p(r)$，$r$ 也必须是实数。我们的结果在 Mathematica、Maple 和 Giac 计算机代数系统的最新版本中进行了展示。最后，我们在动态几何软件 GeoGebra 的实验版本中展示了自动证明过程的原型。

Summary / 总结

The research aims to automate the process of proving theorems in planar geometry using the complex number identity method enhanced with elimination ideals. The method involves transforming hypotheses and the thesis into equations, clearing denominators, and introducing a slack variable to eliminate all point variables. The obtained ideal in the polynomial ring helps determine the truth of the statement. The approach was implemented in Mathematica, Maple, and Giac, and a prototype was developed for GeoGebra.

研究旨在利用结合消元理想的复数恒等式方法，实现平面几何定理证明的自动化。该方法将假设和结论改写为方程，清除分母，并引入一个松弛变量，以消去所有点变量。关键结论是：若 $r_1,r_2,\ldots$ 均为实数，且所得理想中存在关于 $r$ 的线性多项式，则 $r$ 也必为实数，除非在表示 $r$ 时出现除零情况。该方法已在 Mathematica、Maple 和新版 Giac 计算机代数系统中实现，并在 GeoGebra 的实验版本中开发了原型。

Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising

Authors: Yifan Wang, Liya Ji, Zhanghan Ke, Harry Yang, Ser-Nam Lim, Qifeng Chen

First: 2025-11-18T18:06:29+00:00 · Latest: 2025-11-18T18:06:29+00:00

Comments: Project Page: https://wyf0824.github.io/Video_Realism_Enhancement/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We propose an approach to enhancing synthetic video realism, which can re-render synthetic videos from a simulator in photorealistic fashion. Our realism enhancement approach is a zero-shot framework that focuses on preserving the multi-level structures from synthetic videos into the enhanced one in both spatial and temporal domains, built upon a diffusion video foundational model without further fine-tuning. Specifically, we incorporate an effective modification to have the generation/denoising process conditioned on estimated structure-aware information from the synthetic video, such as depth maps, semantic maps, and edge maps, by an auxiliary model, rather than extracting the information from a simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is a simple yet general and powerful approach to enhancing synthetic video realism: we show that our approach outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality in our experiments.

中文标题/摘要

标题：基于结构感知去噪的零样本合成视频现实感增强

我们提出了一种增强合成视频现实感的方法，可以将合成视频从模拟器重新渲染为照片级现实风格。我们的现实感增强方法是一个零样本框架，专注于在空间和时间域中将合成视频的多级结构保留在增强视频中，基于一个无需进一步微调的扩散视频基础模型。具体来说，我们引入了一种有效的修改，使生成/去噪过程基于从合成视频估计的结构感知信息，如深度图、语义图和边缘图，由辅助模型提供，而不是从模拟器中提取信息。这种指导确保了增强视频在结构和语义层面与原始合成视频的一致性。我们的方法是一种简单而通用且强大的合成视频现实感增强方法：我们在实验中展示了我们的方法在结构一致性方面优于现有基线，同时保持了最先进的照片级现实感质量。

FLARE: Adaptive Multi-Dimensional Reputation for Robust Client Reliability in Federated Learning

Authors: Abolfazl Younesi, Leon Kiss, Zahra Najafabadi Samani, Juan Aznar Poveda, Thomas Fahringer

First: 2025-11-18T17:57:40+00:00 · Latest: 2025-11-18T17:57:40+00:00

Comments: Under Review

Abs · PDF · Code1 · Code2 · Code3

Abstract

Federated learning (FL) enables collaborative model training while preserving data privacy. However, it remains vulnerable to malicious clients who compromise model integrity through Byzantine attacks, data poisoning, or adaptive adversarial behaviors. Existing defense mechanisms rely on static thresholds and binary classification, failing to adapt to evolving client behaviors in real-world deployments. We propose FLARE, an adaptive reputation-based framework that transforms client reliability assessment from binary decisions to a continuous, multi-dimensional trust evaluation. FLARE integrates: (i) a multi-dimensional reputation score capturing performance consistency, statistical anomaly indicators, and temporal behavior, (ii) a self-calibrating adaptive threshold mechanism that adjusts security strictness based on model convergence and recent attack intensity, (iii) reputation-weighted aggregation with soft exclusion to proportionally limit suspicious contributions rather than eliminating clients outright, and (iv) a Local Differential Privacy (LDP) mechanism enabling reputation scoring on privatized client updates. We further introduce a highly evasive Statistical Mimicry (SM) attack, a benchmark adversary that blends honest gradients with synthetic perturbations and persistent drift to remain undetected by traditional filters. Extensive experiments with 100 clients on MNIST, CIFAR-10, and SVHN demonstrate that FLARE maintains high model accuracy and converges faster than state-of-the-art Byzantine-robust methods under diverse attack types, including label flipping, gradient scaling, adaptive attacks, ALIE, and SM. FLARE improves robustness by up to 16% and preserves model convergence within 30% of the non-attacked baseline, while achieving strong malicious-client detection performance with minimal computational overhead. https://github.com/Anonymous0-0paper/FLARE
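
Component (iii) of the abstract, reputation-weighted aggregation with soft exclusion, can be illustrated with a small sketch. The abstract does not give the weighting formula, so the sigmoid gate, the `sharpness` parameter, and the function names below are assumptions chosen for illustration, not FLARE's actual rule.

```python
import numpy as np

def soft_exclusion_weights(reputation, threshold, sharpness=10.0):
    """Map reputation scores in [0, 1] to normalized aggregation weights.

    Clients below the threshold are smoothly down-weighted by a sigmoid gate
    (an assumed form) instead of being dropped outright.
    """
    reputation = np.asarray(reputation, dtype=float)
    gate = 1.0 / (1.0 + np.exp(-sharpness * (reputation - threshold)))
    raw = reputation * gate
    total = raw.sum()
    if total == 0.0:
        # Fall back to uniform weights if every client has zero reputation.
        return np.full_like(reputation, 1.0 / reputation.size)
    return raw / total

def aggregate(updates, reputation, threshold):
    """Reputation-weighted average of client update vectors (rows of `updates`)."""
    weights = soft_exclusion_weights(reputation, threshold)
    return weights @ np.asarray(updates, dtype=float)
```

With this form a low-reputation client still contributes a small, nonzero share, which matches the abstract's contrast between limiting suspicious contributions and eliminating clients.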

中文标题/摘要

标题：FLARE：联邦学习中自适应多维信誉评估以增强客户端可靠性

联邦学习（FL）允许在保护数据隐私的同时进行协作模型训练。然而，它仍然容易受到通过拜占庭攻击、数据污染或适应性对抗行为破坏模型完整性的恶意客户端的攻击。现有的防御机制依赖于静态阈值和二元分类，无法适应实际部署中不断变化的客户端行为。我们提出了一种自适应信誉框架FLARE，将客户端可靠性评估从二元决策转变为连续的多维信任评估。FLARE 结合了：(i) 多维信誉评分，捕捉性能一致性、统计异常指标和时间行为；(ii) 自校准自适应阈值机制，根据模型收敛性和最近的攻击强度调整安全严格性；(iii) 基于信誉加权聚合和软排除的比例限制可疑贡献，而不是完全排除客户端；(iv) 局部差分隐私（LDP）机制，允许在私有化客户端更新上进行信誉评分。我们还引入了一种高度规避的统计模仿（SM）攻击，这是一种基准对手，将诚实梯度与合成扰动和持久漂移结合在一起，以避免传统过滤器的检测。在MNIST、CIFAR-10和SVHN上的100个客户端的广泛实验表明，FLARE 在各种攻击类型（包括标签翻转、梯度缩放、适应性攻击、ALIE 和 SM）下保持了高模型准确性和更快的收敛速度，比最先进的拜占庭鲁棒方法提高了16%的鲁棒性，并在30%的非攻击基线内保持了模型收敛，同时实现了强大的恶意客户端检测性能，且计算开销最小。https://github.com/Anonymous0-0paper/FLARE

Summary / 总结

FLARE is an adaptive reputation-based framework for federated learning that enhances client reliability assessment by moving from binary decisions to a continuous, multi-dimensional trust evaluation. It includes a multi-dimensional reputation score, a self-calibrating adaptive threshold mechanism, reputation-weighted aggregation, and a Local Differential Privacy mechanism. FLARE demonstrates superior performance in maintaining model accuracy and converging faster than existing Byzantine-robust methods under various attacks, improving robustness by up to 16% and preserving model convergence within 30% of the non-attacked baseline with minimal computational overhead.

FLARE 是一种适应性的声誉评估框架，用于联邦学习，通过多维度评估客户端可靠性并动态调整安全阈值，使用声誉加权聚合来限制可疑贡献。FLARE 在各种攻击下保持高模型准确性和更快的收敛速度，相比现有 Byzantine-robust 方法，其鲁棒性提高最多 16%，同时保持模型收敛在非攻击基线的 30% 内。它还引入了统计模仿攻击作为基准对手。

FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation

Authors: Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, Songhua Liu

First: 2025-11-18T17:56:04+00:00 · Latest: 2025-11-18T17:56:04+00:00

Comments: 13 pages, 8 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

The quadratic time and memory complexity of the attention mechanism in modern Transformer based video generators makes end-to-end training for ultra high resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token's training scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention, unfortunately, often leads to repetitive content and exhibits a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Codes are available at: https://github.com/WillWu111/FreeSwim
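
One way to read "inward sliding window attention" is that the local window is shifted inward near the borders so that every query keeps a receptive field of the same size as at training scale. The one-dimensional sketch below illustrates that reading only; the function name and the clamping rule are assumptions, and the paper's mechanism operates on 3D video tokens.

```python
def inward_window(index, length, window):
    """Return the [start, stop) key range for a query at `index`.

    The window is centered on the query where possible and shifted inward
    near the borders, so every query attends to exactly `window` keys.
    """
    if window > length:
        raise ValueError("window must not exceed sequence length")
    start = index - window // 2
    # Clamp so the window never extends past either end of the sequence.
    start = max(0, min(start, length - window))
    return start, start + window
```

For example, with `length=10` and `window=4`, the first and last queries get the ranges `(0, 4)` and `(6, 10)` rather than truncated windows.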

中文标题/摘要

标题：FreeSwim：重新审视滑动窗口注意力机制，实现免训练的超高分辨率视频生成

现代基于Transformer的视频生成器中的注意力机制的时间和空间复杂度呈二次增长，使得针对超高分辨率视频的端到端训练代价极其高昂。鉴于这一限制，我们提出了一种免训练的方法，利用在原生尺度下预训练的视频扩散Transformer来合成更高分辨率的视频，无需额外的训练或适配。我们方法的核心在于一种向内滑动窗口注意力机制，其源于一个关键观察：保持每个查询token在训练尺度下的感受野对于保持视觉保真度和细节至关重要。然而，朴素的局部窗口注意力往往导致内容重复，并且生成结果缺乏全局一致性。为克服这一挑战，我们设计了一种双路径流程，通过一种新颖的交叉注意力覆盖策略为窗口注意力提供支撑，使局部注意力生成的语义内容能够受到具有完整感受野的另一分支的引导，从而确保整体一致性。此外，为了提高效率，我们为该分支引入了一种交叉注意力缓存策略，以避免频繁计算完整的3D注意力。大量实验表明，我们的方法在免训练范式下以较高效率生成具有精细视觉细节的超高分辨率视频。同时，它在VBench上取得了优于基于训练的替代方案的性能，且效率相当或更高。代码可在 https://github.com/WillWu111/FreeSwim 获取。

Summary / 总结

The paper addresses the computational challenges of training Transformer-based video generators for ultra-high resolution videos. It proposes a training-free approach using video Diffusion Transformers and an inward sliding window attention mechanism to preserve visual fidelity. The method incorporates a dual-path pipeline with cross-attention override and caching to enhance global coherence and efficiency, resulting in ultra-high-resolution videos with fine details and high performance on VBench compared to training-based alternatives.

论文解决了基于Transformer的视频生成器在处理超高清视频时的计算难题。提出了一种无需训练的方法，利用视频扩散Transformer和内向滑动窗口注意力机制来保持视觉细节。该方法结合了双重路径管道和交叉注意力覆盖策略以及缓存策略，以增强全局一致性并提高效率，从而生成具有精细细节的超高清视频，并在VBench上优于基于训练的替代方案。

Towards a Unified Analysis of Neural Networks in Nonparametric Instrumental Variable Regression: Optimization and Generalization

Authors: Zonghao Chen, Atsushi Nitanda, Arthur Gretton, Taiji Suzuki

First: 2025-11-18T17:51:17+00:00 · Latest: 2025-11-18T17:51:17+00:00

Abs · PDF · Code1 · Code2

Abstract

We establish the first global convergence result of neural networks for the two-stage least squares (2SLS) approach in nonparametric instrumental variable regression (NPIV). This is achieved by adopting a lifted perspective through mean-field Langevin dynamics (MFLD). Unlike standard MFLD, however, our setting of 2SLS entails a bilevel optimization problem in the space of probability measures. To address this challenge, we leverage the penalty gradient approach recently developed for bilevel optimization, which formulates bilevel optimization as a Lagrangian problem. This leads to a novel fully first-order algorithm, termed F$^2$BMLD. Apart from the convergence bound, we further provide a generalization bound, revealing an inherent trade-off in the choice of the Lagrange multiplier between optimization and statistical guarantees. Finally, we empirically validate the effectiveness of the proposed method on an offline reinforcement learning benchmark.
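
The penalty reformulation mentioned in the abstract can be sketched for a generic bilevel problem. The symbols $F$, $G$, $x$, $y$, and $\lambda$ below are generic placeholders rather than the paper's notation; in the paper's setting the variables are probability measures, not finite-dimensional vectors.

```latex
\min_{x}\; F\bigl(x, y^{*}(x)\bigr)
\quad \text{s.t.} \quad
y^{*}(x) \in \arg\min_{y} G(x, y)
\qquad\leadsto\qquad
\min_{x,\,y}\; \mathcal{L}_{\lambda}(x, y)
= F(x, y) + \lambda \Bigl( G(x, y) - \min_{z} G(x, z) \Bigr).
```

The penalty term is nonnegative and vanishes exactly when $y$ solves the lower-level problem, so a larger multiplier $\lambda$ enforces the lower-level (first-stage) solution more strictly. This is the quantity whose choice the abstract links to a trade-off between optimization and statistical guarantees.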

中文标题/摘要

标题：向非参数工具变量回归中神经网络统一分析的迈进：优化与泛化

我们首次建立了神经网络在非参数工具变量回归（NPIV）中的两阶段最小二乘法（2SLS）方法的全局收敛结果。这是通过平均场朗之万动力学（MFLD）采用提升视角实现的；然而，与标准MFLD不同，我们的2SLS设置涉及概率测度空间中的双层优化问题。为应对这一挑战，我们利用最近为双层优化开发的惩罚梯度方法，将双层优化问题形式化为拉格朗日问题。由此得到一种新颖的全一阶算法，称为F$^2$BMLD。除了收敛界，我们还提供了泛化界，揭示了在拉格朗日乘子的选择上优化保证与统计保证之间固有的权衡。最后，我们在一个离线强化学习基准上实证验证了所提出方法的有效性。

Summary / 总结

The research aims to establish a global convergence result for neural networks in nonparametric instrumental variable regression using a two stage least squares (2SLS) approach. To achieve this, the authors adopt a lifted perspective through mean-field Langevin dynamics (MFLD) and develop a novel fully first-order algorithm called F$^2$BMLD, which addresses the bilevel optimization problem. The study provides both a convergence bound and a generalization bound, highlighting the trade-off in the choice of the Lagrange multiplier. Empirical validation on an offline reinforcement learning benchmark demonstrates the method's effectiveness.

该研究旨在针对非参数工具变量回归中的两阶段最小二乘法（2SLS），借助平均场朗之万动力学（MFLD）和惩罚梯度方法，为神经网络建立全局收敛结果。作者提出了一种名为F$^2$BMLD的新型全一阶算法。研究还给出了泛化界，揭示了优化保证与统计保证之间的权衡。在离线强化学习基准上的实证结果验证了该方法的有效性。

Optimizing Federated Learning by Entropy-Based Client Selection

Authors: Andreas Lutz, Gabriele Steidl, Karsten Müller, Wojciech Samek

First: 2024-11-02T13:31:36+00:00 · Latest: 2025-11-18T17:47:33+00:00

Comments: Accepted at the 3rd IEEE International Conference on Federated Learning Technologies and Applications (FLTA 2025), Dubrovnik, Croatia, October 14-17, 2025

Abs · PDF · Code1 · Code2

Abstract

Although deep learning has revolutionized domains such as natural language processing and computer vision, its dependence on centralized datasets raises serious privacy concerns. Federated learning addresses this issue by enabling multiple clients to collaboratively train a global deep learning model without compromising their data privacy. However, the performance of such a model degrades under label skew, where the label distribution differs between clients. To overcome this issue, a novel method called FedEntOpt is proposed. In each round, it selects clients to maximize the entropy of the aggregated label distribution, ensuring that the global model is exposed to data from all available classes. Extensive experiments on multiple benchmark datasets show that the proposed method outperforms several state-of-the-art algorithms by up to 6% in classification accuracy under standard settings regardless of the model size, while achieving gains of over 30% in scenarios with low participation rates and client dropout. In addition, FedEntOpt offers the flexibility to be combined with existing algorithms, enhancing their classification accuracy by more than 40%. Importantly, its performance remains unaffected even when differential privacy is applied.
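
The selection criterion described above (choose clients so that the aggregated label distribution has maximal entropy) can be sketched with a greedy procedure. The greedy strategy, the assumption that the server can see per-client label counts, and the function names are illustrative choices made here; the abstract states only the objective, not how it is optimized.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a label-count vector."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())

def select_clients(label_counts, num_select):
    """Greedily pick clients whose pooled label counts have maximal entropy.

    label_counts: array of shape (num_clients, num_classes), assumed to be
    available to the server for this sketch.
    """
    label_counts = np.asarray(label_counts, dtype=float)
    pooled = np.zeros(label_counts.shape[1])
    remaining = list(range(label_counts.shape[0]))
    chosen = []
    for _ in range(min(num_select, len(remaining))):
        # Add the client that makes the pooled distribution most uniform.
        best = max(remaining, key=lambda c: entropy(pooled + label_counts[c]))
        chosen.append(best)
        remaining.remove(best)
        pooled += label_counts[best]
    return chosen
```

Under label skew, this favors clients holding classes that are still missing from the pooled distribution over clients that duplicate classes already covered.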

中文标题/摘要

标题：基于熵的客户端选择优化联邦学习

尽管深度学习在自然语言处理和计算机视觉等领域取得了革命性进展，但其对集中式数据集的依赖引发了严重的隐私问题。联邦学习通过使多个客户端协作训练全局深度学习模型，而不泄露其数据隐私，解决了这一问题。然而，在标签偏差的情况下，即标签分布在不同客户端之间存在差异时，该模型的性能会下降。为了解决这一问题，提出了一种名为FedEntOpt的新方法。在每一轮中，它选择客户端以最大化聚合标签分布的熵，确保全局模型接触到所有可用类别的数据。在多个基准数据集上的广泛实验表明，该方法在标准设置下无论模型大小如何，分类准确率都比几种最先进的算法高出多达6%，而在低参与率和客户端退出的场景中，性能提升超过30%。此外，FedEntOpt 可以与现有算法结合使用，提高其分类准确率超过40%。重要的是，即使应用差分隐私，其性能也不会受到影响。

Summary / 总结

The paper proposes FedEntOpt, a method for optimizing federated learning by selecting clients based on the entropy of the aggregated label distribution. This approach aims to mitigate label skew and improve model performance. Experiments show that FedEntOpt outperforms state-of-the-art algorithms by up to 6% in classification accuracy under standard settings and by over 30% in low participation scenarios. Additionally, it enhances the accuracy of existing algorithms by more than 40% and maintains performance even with differential privacy applied.

该研究提出了一种名为FedEntOpt的方法，通过基于聚合标签分布的熵来选择客户端以解决标签偏差问题。实验表明，FedEntOpt在标准设置下的分类准确率最高可提高6%，在低参与度和客户端退出场景下提高超过30%，还能将现有算法的准确性提升超过40%，并且即使在应用差分隐私时也能保持性能。

GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci

Venue: MICCAI

First: 2025-08-02T09:59:39+00:00 · Latest: 2025-11-18T17:43:54+00:00

Comments: Accepted at MICCAI Workshop 2025

Abs · PDF · Code1 · Code2

Abstract

Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
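
Contribution (2), encoding a list of descriptions per class rather than a single prompt, can be illustrated as follows. The abstract does not say how the per-description signals are combined, so the mean cosine similarity used here, the function name, and the input layout are assumptions for illustration; the embeddings would come from a VLM text encoder in practice.

```python
import numpy as np

def classify_with_description_lists(image_feature, class_descriptions):
    """Score each class by its mean cosine similarity over several descriptions.

    class_descriptions: dict mapping class name -> array (num_desc, dim) of
    text embeddings (hypothetical inputs for this sketch).
    """
    v = np.asarray(image_feature, dtype=float)
    v = v / np.linalg.norm(v)
    scores = {}
    for name, emb in class_descriptions.items():
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        # Each row is one description; average their similarities.
        scores[name] = float(np.mean(emb @ v))
    return max(scores, key=scores.get), scores
```

Each description is encoded separately, so no single prompt has to fit every clinical detail within the text encoder's token limit.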

中文标题/摘要

标题：GMAT：基于视觉-语言多实例学习的临床描述生成框架

多实例学习（MIL）是全切片图像（WSI）分类的领先方法，能够高效分析十亿像素级病理切片。近期工作将视觉-语言模型（VLMs）引入MIL流程中，通过基于文本的类描述而非简单的类名来整合医学知识。然而，当这些方法依赖大型语言模型（LLMs）生成临床描述或使用固定长度的提示来表示复杂的病理概念时，VLMs有限的token容量往往限制了所编码类信息的表达力和丰富性。此外，仅由LLMs生成的描述可能缺乏领域依据和细粒度的医学特异性，导致与视觉特征的对齐不够理想。为解决这些挑战，我们提出了一种视觉-语言MIL框架，包含两个关键贡献：（1）一个有依据的多智能体描述生成系统，利用精选的病理学教科书和智能体分工（如形态学、空间上下文）生成准确且多样的临床描述；（2）一种使用描述列表而非单一提示的文本编码策略，捕捉细粒度且互补的临床信号，以更好地与视觉特征对齐。整合到VLM-MIL流程中后，我们的方法相较于单提示类基线取得了性能提升，并在肾癌和肺癌数据集上达到了与最先进模型相当的结果。

Summary / 总结

The research aims to enhance the performance of whole slide image classification in pathology by addressing the limitations of using large language models (LLMs) for generating clinical descriptions. The method introduces a grounded multi-agent clinical description generation system and a text encoding strategy using a list of descriptions. Experimental results show that this approach outperforms single-prompt class baselines and achieves results comparable to state-of-the-art models on renal and lung cancer datasets.

研究旨在通过解决使用大型语言模型生成临床描述的局限性，提升视觉语言模型在全切片图像分类中的性能。方法包括基于病理教科书和专业分工生成多样化且准确的描述的多智能体系统，以及使用描述列表的文本编码策略。关键实验发现表明，该方法优于单一提示基线，并在肾癌和肺癌数据集上达到了与最先进的模型相当的结果。

Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images

Authors: Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau, Shahnaz Jamil-Copley, Richard H. Clayton, Chen Chen

First: 2025-11-18T17:42:20+00:00 · Latest: 2025-11-18T17:42:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate segmentation of myocardial scar from late gadolinium enhanced (LGE) cardiac MRI is essential for evaluating tissue viability, yet remains challenging due to variable contrast and imaging artifacts. Electrocardiogram (ECG) signals provide complementary physiological information, as conduction abnormalities can help localize or suggest scarred myocardial regions. In this work, we propose a novel multimodal framework that integrates ECG-derived electrophysiological information with anatomical priors from the AHA-17 atlas for physiologically consistent LGE-based scar segmentation. As ECGs and LGE-MRIs are not acquired simultaneously, we introduce a Temporal Aware Feature Fusion (TAFF) mechanism that dynamically weights and fuses features based on their acquisition time difference. Our method was evaluated on a clinical dataset and achieved substantial gains over the state-of-the-art image-only baseline (nnU-Net), increasing the average Dice score for scars from 0.6149 to 0.8463 and achieving high performance in both precision (0.9115) and sensitivity (0.9043). These results show that integrating physiological and anatomical knowledge allows the model to "see beyond the image", setting a new direction for robust and physiologically grounded cardiac scar segmentation.

中文标题/摘要

标题：超越图像：基于ECG和解剖知识引导的心肌疤痕分割

从延迟钆增强（LGE）心脏MRI中准确分割心肌疤痕对于评估组织活力至关重要，但由于对比度变化和成像伪影，这一过程仍然具有挑战性。心电图（ECG）信号提供了补充的生理信息，因为传导异常可以帮助定位或建议疤痕心肌区域。在本工作中，我们提出了一种新颖的多模态框架，将ECG衍生的电生理信息与来自AHA-17图谱的解剖先验信息相结合，以实现生理上一致的LGE基疤痕分割。由于ECGs和LGE-MRIs不是同时获取的，我们引入了一种基于其获取时间差的动态加权和融合机制（TAFF机制）。我们的方法在临床数据集上进行了评估，并在最先进的基于图像的基线（nnU-Net）上实现了显著的改进，将疤痕的平均Dice分数从0.6149提高到0.8463，并在精确度（0.9115）和灵敏度（0.9043）方面实现了高性能。这些结果表明，整合生理和解剖知识使模型能够“超越图像”，为稳健和基于生理的心肌疤痕分割设定了新方向。

Summary / 总结

This study addresses the challenge of accurately segmenting myocardial scar from LGE cardiac MRI by integrating ECG-derived electrophysiological information and anatomical priors. The proposed multimodal framework, which includes a Temporal Aware Feature Fusion mechanism, significantly improves the Dice score from 0.6149 to 0.8463, with high precision and sensitivity. This demonstrates the potential of combining physiological and anatomical knowledge for more robust cardiac scar segmentation.

该研究旨在从延迟钆增强心脏MRI图像中准确分割心肌疤痕，这对于评估组织活力至关重要。所提出的框架结合了ECG衍生的电生理信息和来自AHA-17图谱的解剖先验知识，并采用了时间感知特征融合机制。临床数据集上的评估显示，与最先进的仅基于图像的方法相比，平均Dice分数从0.6149提高到0.8463，精确度和灵敏度分别为0.9115和0.9043，改进显著。

Real-Time Sign Language to text Translation using Deep Learning: A Comparative study of LSTM and 3D CNN

Authors: Madhumati Pol, Anvay Anturkar, Anushka Khot, Ayush Andure, Aniruddha Ghosh, Anvit Magadum, Anvay Bahadur

Venue: International Journal of Computer Applications, Vol. 187, No. 55, pp. 31-35 (2025)

First: 2025-10-15T04:26:33+00:00 · Latest: 2025-11-18T17:23:28+00:00

Abs · PDF · Code1 · Code2

Abstract

This study investigates the performance of 3D Convolutional Neural Networks (3D CNNs) and Long Short-Term Memory (LSTM) networks for real-time American Sign Language (ASL) recognition. While 3D CNNs are good at spatiotemporal feature extraction from video sequences, LSTMs are optimized for modeling temporal dependencies in sequential data. We evaluate both architectures on a dataset containing 1,200 ASL signs across 50 classes, comparing their accuracy, computational efficiency, and latency under similar training conditions. Experimental results demonstrate that 3D CNNs achieve 92.4% recognition accuracy but require 3.2% more processing time per frame compared to LSTMs, which maintain 86.7% accuracy with significantly lower resource consumption. The hybrid 3D CNN-LSTM model shows decent performance, which suggests that context-dependent architecture selection is crucial for practical implementation. This project provides professional benchmarks for developing assistive technologies, highlighting trade-offs between recognition precision and real-time operational requirements in edge computing environments.

中文标题/摘要

标题：基于深度学习的实时手语到文本翻译：LSTM与3D CNN的比较研究

本研究探讨了3D卷积神经网络（3D CNNs）和长短期记忆（LSTM）网络在实时美国手语（ASL）识别中的性能。尽管3D CNNs在从视频序列中提取时空特征方面表现出色，但LSTMs则更擅长建模序列数据中的时间依赖性。我们在包含50个类别共1,200个ASL手势的数据集上评估了这两种架构，比较了它们的准确率、计算效率和延迟。实验结果表明，3D CNNs的识别准确率为92.4%，但每帧处理时间比LSTMs多3.2%；LSTMs则保持86.7%的准确率，资源消耗显著降低。3D CNNLSTM混合模型表现出良好的性能，这表明在实际应用中选择上下文相关的架构至关重要。本项目为开发辅助技术提供了专业基准，突显了在边缘计算环境中识别精度与实时操作需求之间的权衡。

Summary / 总结

This study compares the performance of 3D CNNs and LSTMs for real-time ASL recognition, using a dataset of 1,200 ASL signs. 3D CNNs achieve 92.4% accuracy but require more processing time, while LSTMs maintain 86.7% accuracy with lower resource consumption. The hybrid 3D CNN-LSTM model shows promising results, indicating the importance of context-dependent architecture selection for practical applications.

该研究比较了3D CNN和LSTM在实时ASL识别中的性能，评估了准确率、计算效率和延迟。3D CNN的准确率为92.4%，但每帧处理时间比LSTM多3.2%；而LSTM保持86.7%的准确率，资源消耗更低。混合3D CNNLSTM模型显示出良好的性能，强调了在实际应用中根据上下文选择架构的重要性。

MOON: Generative MLLM-based Multimodal Representation Learning for E-commerce Product Understanding

Authors: Daoze Zhang, Zhanheng Nie, Jianyu Liu, Chenghan Fu, Wanxian Guan, Yuan Gao, Jun Song, Pengjie Wang, Jian Xu, Bo Zheng

Venue: WSDM 2026

First: 2025-08-16T09:59:25+00:00 · Latest: 2025-11-18T17:05:32+00:00

Comments: Accepted by WSDM 2026. 11 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.

中文标题/摘要

标题：MOON：基于生成性MLLM的多模态表示学习在电子商务产品理解中的应用

随着电子商务的快速发展，探索通用表示而非任务特定表示的研究引起了越来越多的关注。对于产品理解而言，尽管现有的判别性双流架构在这一领域取得了进展，但它们本质上难以建模多个产品图像和文本之间的多对一对齐。因此，我们认为生成性多模态大型语言模型（MLLMs）在提高产品表示学习方面具有巨大潜力。然而，由于几个关键挑战的存在，实现这一目标仍然具有挑战性：典型LLMs中缺乏多模态和方面感知建模模块；产品图像中普遍存在背景噪声；以及缺乏用于评估的标准基准。为了解决这些问题，我们提出了第一个基于生成性MLLM的产品表示学习模型MOON。我们的方法（1）采用引导混合专家（MoE）模块进行多模态和方面特定的产品内容的针对性建模；（2）有效检测产品图像中的核心语义区域，以减轻背景噪声的干扰和干扰；（3）引入专门的负样本策略，以增加负样本的难度和多样性。此外，我们还发布了大规模多模态基准MBE，用于各种产品理解任务。实验表明，我们的模型在我们的基准和公开数据集上均表现出竞争力的零样本性能，展示了其在各种下游任务中的强大泛化能力，包括跨模态检索、产品分类和属性预测。此外，案例研究和可视化展示了MOON在产品理解中的有效性。

Summary / 总结

MOON is a generative MLLM-based model designed to improve product representation learning in e-commerce. It addresses the limitations of existing discriminative dual-flow architectures by employing a guided Mixture-of-Experts module, detecting core semantic regions in product images, and using a specialized negative sampling strategy. Experimental results show that MOON achieves competitive zero-shot performance on various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction, demonstrating strong generalization capabilities.

研究旨在通过利用生成的多模态大型语言模型（MLLM）来提高电子商务中的产品表示学习。方法引入了指导下的Mixture-of-Experts模块进行多模态和方面特定的内容建模，检测产品图像中的核心语义区域以减少背景噪声，并使用专门的负样本策略增加负样本的难度和多样性。实验结果显示，在各种下游任务上具有竞争力的零样本性能，表明具有强大的泛化能力。

OptScale: Probabilistic Optimality for Inference-time Scaling

Authors: Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei

Venue: AAAI

First: 2025-06-27T16:44:11+00:00 · Latest: 2025-11-18T17:04:04+00:00

Comments: Accepted by AAAI-2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling the decision of the minimal number of samples needed that satisfy predefined performance thresholds and confidence levels. Extensive experiments on representative reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while remaining better or on par with state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
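
To give a feel for what a lower bound on the number of samples looks like under the i.i.d. assumption, here is a deliberately simplified toy model. It is not the bound derived in the paper, which concerns the estimated distribution of Best-of-N scores. The toy assumes each sample is independently correct with a known probability `p_success` and asks for the smallest N such that at least one of N samples is correct with probability at least `target`, i.e. 1 - (1 - p)^N >= target. Both parameter names are hypothetical.

```python
import math

def min_samples(p_success, target):
    """Smallest N with 1 - (1 - p_success)**N >= target (toy Bernoulli model)."""
    if not (0.0 < p_success <= 1.0):
        raise ValueError("p_success must be in (0, 1]")
    if not (0.0 <= target < 1.0):
        raise ValueError("target must be in [0, 1)")
    if p_success == 1.0 or target == 0.0:
        return 1  # a single sample already meets the target
    needed = math.log(1.0 - target) / math.log(1.0 - p_success)
    return max(1, math.ceil(needed))
```

Even in this toy setting the required N grows quickly as the per-sample success probability falls, which is why estimating that probability per question, rather than fixing N in advance, can save samples.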

中文标题/摘要

标题：OptScale：推理时概率最优缩放

推理时缩放已成为增强大型语言模型（LLMs）推理性能的强大技术。然而，现有方法通常依赖于启发式策略进行并行采样，缺乏一个原则性的基础。为解决这一问题，我们提出了一种概率框架，该框架在假设并行样本独立且同分布（i.i.d.）且最佳N次选择策略遵循可估计的概率分布的前提下，形式化了推理时缩放的最优性。在此框架内，我们推导出实现目标性能水平所需的样本数量的理论下限，从而提供了第一个计算高效的缩放的原理性指导。利用这一见解，我们开发了OptScale，一种实用算法，能够动态确定最优的采样响应数量。OptScale 使用基于语言模型的预测器估计概率先验参数，使决策满足预定义的性能阈值和置信水平所需的最小样本数量成为可能。在代表性的推理基准测试（包括MATH-500、GSM8K、AIME和AMC）上的广泛实验表明，OptScale 显著减少了采样开销，同时保持或优于最先进的推理性能。我们的工作为推理时缩放提供了理论基础和实用解决方案，解决了LLMs高效部署进行复杂推理的关键问题。源代码可在https://github.com/Albertwyk/OptScale公开获取。

Summary / 总结

OptScale proposes a probabilistic framework to optimize inference-time scaling for LLMs, providing a theoretical lower bound on the number of samples needed to achieve target performance. The method uses a language model-based predictor to dynamically determine the optimal number of samples, reducing sampling overhead while maintaining or improving reasoning performance. Experiments on various benchmarks show that OptScale outperforms or matches state-of-the-art methods in terms of efficiency and performance.

OptScale 提出了一种概率框架来优化大型语言模型（LLMs）的推理时扩展，提供了达到目标性能水平所需的样本数的理论下限。它利用语言模型预测器动态确定最优的采样响应数量，从而减少采样开销并保持或提高推理性能。实验表明，OptScale 在效率和性能方面优于现有的启发式方法。

NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards

Authors: Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria

First: 2025-11-18T16:55:48+00:00 · Latest: 2025-11-18T16:55:48+00:00

Comments: https://declare-lab.github.io/nora-1.5

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
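
The step from reward signals to a DPO preference dataset can be sketched as follows. The abstract says only that the two rewards are combined; the linear combination, the `weight` parameter, and the best-versus-worst pairing rule below are assumptions made for illustration, not the authors' recipe.

```python
def build_preference_pair(candidates, wm_scores, gt_deviations, weight=1.0):
    """Pick a (chosen, rejected) pair of actions from sampled candidates.

    wm_scores: higher means the world model judges the action closer to the goal.
    gt_deviations: distance from the ground-truth action; lower is better.
    Returns None when all candidates receive the same reward.
    """
    if not (len(candidates) == len(wm_scores) == len(gt_deviations)):
        raise ValueError("inputs must have equal length")
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    # Assumed combination: reward the WM score, penalize deviation.
    rewards = [s - weight * d for s, d in zip(wm_scores, gt_deviations)]
    best = max(range(len(rewards)), key=rewards.__getitem__)
    worst = min(range(len(rewards)), key=rewards.__getitem__)
    if rewards[best] == rewards[worst]:
        return None
    return candidates[best], candidates[worst]
```

Pairs produced this way would then be fed to a DPO trainer, which raises the likelihood of the chosen action relative to the rejected one.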

中文标题/摘要

标题：NORA-1.5：一种基于世界模型和动作偏好奖励训练的视觉-语言-行动模型

视觉-语言-行动（VLA）模型在各种体态任务上已经显示出有希望的表现，但在可靠性和泛化能力方面仍然存在不足，尤其是在部署到不同的体态或真实环境时。在本研究中，我们介绍了NORA-1.5，这是一种基于预训练NORA骨干网络构建的VLA模型，并在此基础上添加了一个基于流匹配的动作专家。这种架构上的改进本身就带来了显著的性能提升，使NORA-1.5在模拟和真实世界基准测试中均优于NORA和多个最先进的VLA模型。为了进一步提高鲁棒性和任务成功率，我们开发了一套用于后训练VLA策略的奖励模型。我们的奖励模型结合了（i）一种动作条件下的世界模型（WM），用于评估生成的动作是否朝向目标，以及（ii）一种偏离真实值的启发式方法，用于区分好的动作和差的动作。利用这些奖励信号，我们构建了偏好数据集，并通过直接偏好优化（DPO）将NORA-1.5适应到目标体态。广泛的评估表明，奖励驱动的后训练在模拟和真实机器人设置中均能一致地提高性能，通过简单的有效奖励模型显著提高了VLA模型的可靠性。我们的研究结果强调了NORA-1.5和奖励引导的后训练作为实现更可靠体态代理的可行路径，这些代理适合真实世界的部署。

Summary / 总结

NORA-1.5 is a vision-language-action model that enhances the pre-trained NORA backbone with an action expert, achieving better performance than NORA and other state-of-the-art models in both simulated and real-world benchmarks. To further improve its robustness, the model uses a set of reward models combining an action-conditioned world model and a deviation-from-ground-truth heuristic, which are used for direct preference optimization to adapt NORA-1.5 to different embodiments. The extensive evaluations show that these reward-driven post-training methods significantly improve the model's reliability in both simulation and real-robot settings.

NORA-1.5 是一个视觉-语言-行动模型，通过增强预训练的 NORA 基干并加入行动专家，使其在模拟和真实机器人环境中均优于 NORA 和其他最先进的模型。为了进一步提高鲁棒性，该模型使用结合了行动条件世界模型和偏离真实情况启发式的奖励模型，通过直接偏好优化来适应不同的实体。实验结果表明，奖励驱动的后训练显著提高了 VLA 模型在模拟和真实机器人环境中的可靠性。

Seeing and Knowing in the Wild: Open-domain Visual Entity Recognition with Large-scale Knowledge Graphs via Contrastive Learning

Authors: Hongkuan Zhou, Lavdim Halilaj, Sebastian Monka, Stefan Schmid, Yuqicheng Zhu, Jingcheng Wu, Nadeem Nazer, Steffen Staab

First: 2025-10-15T15:33:36+00:00 · Latest: 2025-11-18T16:50:10+00:00

Comments: Accepted by AAAI2026

Abstract

Open-domain visual entity recognition aims to identify and link entities depicted in images to a vast and evolving set of real-world concepts, such as those found in Wikidata. Unlike conventional classification tasks with fixed label sets, it operates under open-set conditions, where most target entities are unseen during training and exhibit long-tail distributions. This makes the task inherently challenging due to limited supervision, high visual ambiguity, and the need for semantic disambiguation. We propose a Knowledge-guided Contrastive Learning (KnowCoL) framework that combines both images and text descriptions into a shared semantic space grounded by structured information from Wikidata. By abstracting visual and textual inputs to a conceptual level, the model leverages entity descriptions, type hierarchies, and relational context to support zero-shot entity recognition. We evaluate our approach on the OVEN benchmark, a large-scale open-domain visual recognition dataset with Wikidata IDs as the label space. Our experiments show that using visual, textual, and structured knowledge greatly improves accuracy, especially for rare and unseen entities. Our smallest model improves the accuracy on unseen entities by 10.5% compared to the state-of-the-art, despite being 35 times smaller.

中文标题/摘要

标题：在野外观察与认知：通过对比学习大规模知识图谱进行开放领域视觉实体识别

开放领域视觉实体识别旨在将图像中描绘的实体与广泛且不断演化的现实世界概念联系起来，例如维基数据中的概念。与具有固定标签集的传统分类任务不同，它在开放集条件下运行，其中大多数目标实体在训练期间未见过，并且表现出长尾分布。这使得任务由于监督有限、视觉模糊度高以及需要语义消歧而变得固有地具有挑战性。我们提出了一种知识导向的对比学习（KnowCoL）框架，该框架将图像和文本描述结合到由维基数据结构化信息支撑的共享语义空间中。通过将视觉和文本输入抽象到概念级别，该模型利用实体描述、类型层次结构和关系上下文来支持零样本实体识别。我们在OVEN基准上评估了我们的方法，这是一个具有维基数据ID作为标签空间的大规模开放领域视觉识别数据集。我们的实验表明，使用视觉、文本和结构化知识极大地提高了准确性，尤其是对于稀有和未见过的实体。尽管我们的最小模型比最先进的模型小35倍，但其在未见过的实体上的准确率提高了10.5%。

Summary / 总结

The paper addresses the challenge of open-domain visual entity recognition by proposing a Knowledge-guided Contrastive Learning (KnowCoL) framework. This framework integrates images and text descriptions using structured information from Wikidata to support zero-shot entity recognition. The model improves accuracy, particularly for rare and unseen entities, with a 10.5% increase in accuracy on unseen entities compared to the state-of-the-art model, while being significantly smaller in size.

论文提出了一种知识引导的对比学习（KnowCoL）框架，将图像和文本描述结合使用Wikidata中的结构化信息，以支持零样本实体识别。该模型在罕见和未见过的实体上提高了准确性，与最先进的模型相比，未见过实体的准确性提高了10.5%，同时模型规模要小得多。

Improving segmentation of retinal arteries and veins using cardiac signal in doppler holograms

Authors: Marius Dubosc, Yann Fischer, Zacharie Auray, Nicolas Boutry, Edwin Carlinet, Michael Atlan, Thierry Geraud

First: 2025-11-18T16:49:20+00:00 · Latest: 2025-11-18T16:49:20+00:00

Comments: 5 pages, 3 figures, 1 table. Submitted to ISBI2026

Abstract

Doppler holography is an emerging retinal imaging technique that captures the dynamic behavior of blood flow with high temporal resolution, enabling quantitative assessment of retinal hemodynamics. This requires accurate segmentation of retinal arteries and veins, but traditional segmentation methods focus solely on spatial information and overlook the temporal richness of holographic data. In this work, we propose a simple yet effective approach for artery-vein segmentation in temporal Doppler holograms using standard segmentation architectures. By incorporating features derived from a dedicated pulse analysis pipeline, our method allows conventional U-Nets to exploit temporal dynamics and achieve performance comparable to more complex attention- or iteration-based models. These findings demonstrate that time-resolved preprocessing can unlock the full potential of deep learning for Doppler holography, opening new perspectives for quantitative exploration of retinal hemodynamics. The dataset is publicly available at https://huggingface.co/datasets/DigitalHolography/

中文标题/摘要

标题：利用心脏信号改进多普勒全息图中视网膜动脉和静脉的分割

多普勒全息图是一种新兴的视网膜成像技术，能够以高时间分辨率捕获血液流动的动态行为，从而实现对视网膜血流动力学的定量评估。这需要对视网膜动脉和静脉进行准确分割，但传统的分割方法仅关注空间信息，而忽视了全息数据的时间丰富性。在本研究中，我们提出了一种简单而有效的方法，利用标准分割架构在时间多普勒全息图中进行动脉-静脉分割。通过结合来自专用脉冲分析流水线的特征，我们的方法使传统的U-Nets能够利用时间动态性，并实现与更复杂的注意力-或迭代-基于模型相当的性能。这些发现表明，时间分辨预处理可以解锁深度学习在多普勒全息图中的全部潜力，为定量探索视网膜血流动力学开辟了新的视角。数据集可在https://huggingface.co/datasets/DigitalHolography/公开获取

RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT

Authors: John M. Oyer, Ali Namvar, Benjamin A. Hoff, Wassim W. Labaki, Ella A. Kazerooni, Charles R. Hatt, Fernando J. Martinez, MeiLan K. Han, Craig J. Galbán, Sundaresh Ram

First: 2025-11-18T16:41:44+00:00 · Latest: 2025-11-18T16:41:44+00:00

Comments: 4 pages, 3 figures, 1 table. Preprint submitted to SSIAI 2026 Conference on November 17, 2025

Abstract

Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM'22, comprising annotated CT scans from predominantly healthy subjects, and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.

中文标题/摘要

标题：RepAir：一种用于CT扫描气道分割和断续性修正的框架

从胸部计算机断层扫描（CT）图像中准确分割气道对于定量肺分析至关重要，但手动注释不切实际，而许多基于U-Net的自动化方法会产生断开的组件，妨碍可靠的生物标志物提取。我们提出了一种名为RepAir的三阶段框架，用于稳健的3D气道分割，该框架结合了基于nnU-Net的网络和解剖学导向的拓扑修正。分割网络生成初始气道掩码，之后基于骨架的算法识别潜在断续点并提出重新连接建议。然后，1D卷积分类器确定哪些候选连接对应于真实的解剖分支而非假象或阻塞路径。我们在两个不同的数据集上评估了RepAir：ATM'22，包含主要健康受试者的注释CT扫描，以及AeroPath，包含具有严重气道病理的注释扫描。在两个数据集上，RepAir在体素级和拓扑度量方面均优于现有的3D U-Net方法，如Bronchinet和NaviAirway，同时生成更完整且解剖上一致的气道树，保持高分割准确性。

Summary / 总结

RepAir is a three-stage framework for 3D airway segmentation in chest CT scans, combining an nnU-Net-based network with topology correction. It produces an initial airway mask, identifies potential discontinuities, and uses a 1D convolutional classifier to determine true anatomical branches. RepAir outperforms existing 3D U-Net-based methods on both voxel-level and topological metrics across two datasets, providing more complete and anatomically consistent airway trees.

RepAir 是一个三阶段框架，用于从胸部 CT 扫描中进行 robust 3D 气道分割，结合了基于 nnU-Net 的网络和拓扑修正。它首先生成初始气道掩码，然后识别潜在的断点，并使用 1D 卷积分类器来确定真实的解剖分支。RepAir 在两个数据集上的体素级和拓扑度量指标上均优于现有的 3D U-Net 方法，提供了更完整且解剖上更一致的气道树。

Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

Authors: Yanna Ding, Songtao Lu, Yingdong Lu, Tomasz Nowicki, Jianxi Gao

Venue: NeurIPS 2025

First: 2025-10-21T13:42:48+00:00 · Latest: 2025-11-18T16:28:27+00:00

Comments: NeurIPS 2025

Abstract

Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.

中文标题/摘要

标题：Transformer在学习马尔可夫动态函数中的最优性和NP难问题

Transformer架构可以通过输入-输出对在给定提示中进行上下文学习（ICL）来解决未见过的任务。现有的ICL理论研究主要集中在线性回归任务上，通常假设输入是独立同分布的。为了理解Transformer在建模由动力学驱动的函数时如何表达ICL，我们通过一个结构化的ICL设置研究了马尔可夫函数学习，揭示了损失景观以揭示潜在的优化行为。具体来说，我们（1）为单层线性自注意力（LSA）模型提供了全局最小值（在扩展的参数空间中）的闭式表达式；（2）证明了一般情况下恢复实现最优解的Transformer参数是NP难问题，揭示了一层LSA在表示结构化动力学函数方面的基本局限性；（3）提供了一种新颖的多层LSA执行预条件梯度下降以优化多个目标（超越平方损失）的解释。这些理论结果通过简化Transformer进行了数值验证。

Summary / 总结

The research aims to understand how transformers perform in learning Markovian dynamical functions using in-context learning. The study investigates a structured in-context learning setup and characterizes the loss landscape. Key findings include providing a closed-form expression for the global minimizer of a single-layer linear self-attention model, proving that recovering optimal transformer parameters is NP-hard in general, and interpreting multilayer LSA as preconditioned gradient descent for optimizing multiple objectives. Numerical validation using simplified transformers supports these theoretical results.

研究旨在通过探讨马尔可夫函数学习来理解变压器在上下文学习（ICL）中的表现。研究提供了单层线性自注意力模型全局最小值的闭式表达式，并证明了一般情况下恢复最优变压器参数是NP难问题，揭示了一层LSA在表示结构化动态函数方面的局限性。此外，研究还将多层LSA解释为对多个目标（不仅仅是平方损失）进行预条件梯度下降优化。这些发现通过使用简化变压器进行数值验证得到了支持。

SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction

Authors: Meiying Gu, Jiawei Zhang, Jiahe Li, Xiaohan Yu, Haonan Luo, Jin Zheng, Xiao Bai

Venue: AAAI 2026

First: 2025-11-18T16:24:37+00:00 · Latest: 2025-11-18T16:24:37+00:00

Comments: Accepted at AAAI 2026. Project page: https://miya-oi.github.io/SparseSurf-project

Abstract

Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves state-of-the-art performance.

中文标题/摘要

标题：SparseSurf: 稀疏视角3D 高斯点云表面重建

最近在优化高斯点云对于场景几何的建模方面取得的进展，使得从图像中高效重建详细的表面成为可能。然而，当输入视角稀疏时，这种优化容易导致过拟合，从而导致重建质量不佳。现有方法通过使用拉平的高斯原语更好地拟合表面几何，并结合深度正则化来缓解有限视角下的几何歧义，来应对这一挑战。然而，拉平的高斯原语固有的各向异性加剧了稀疏视角场景中的过拟合问题，妨碍了表面拟合的准确性并降低了新颖视角合成性能。在本文中，我们提出了一种方法，能够在保持高质量新颖视角渲染的同时，重建更准确和详细的表面。我们的关键见解是引入立体几何-纹理对齐，这将渲染质量和几何估计联系起来，从而同时增强表面重建和视图合成。此外，我们提出了伪特征增强几何一致性，通过结合训练和未见过的视角来强制执行多视角几何一致性，有效缓解了稀疏监督引起的过拟合。在DTU、BlendedMVS和Mip-NeRF360数据集上的大量实验表明，我们的方法达到了最先进的性能。

Summary / 总结

The research aims to improve surface reconstruction from sparse views by addressing overfitting issues. The method, SparseSurf, introduces Stereo Geometry-Texture Alignment to enhance both surface reconstruction and novel view rendering. It also includes Pseudo-Feature Enhanced Geometry Consistency to enforce multi-view geometric consistency, which mitigates overfitting. Experiments show that SparseSurf outperforms existing methods on DTU, BlendedMVS, and Mip-NeRF360 datasets.

该论文提出SparseSurf方法，利用高斯点绘制从稀疏视图重建详细表面。通过引入立体几何-纹理对齐来增强表面重建和新颖视图渲染，以及伪特征增强几何一致性来确保多视图几何一致性，从而解决过拟合问题。实验表明SparseSurf在DTU、BlendedMVS和Mip-NeRF360数据集上优于现有方法。

Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Authors: Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia

First: 2025-11-18T16:23:02+00:00 · Latest: 2025-11-18T16:23:02+00:00

Abstract

We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
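For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the pass@k estimator commonly used in code-generation evaluation, averaged over tasks. Whether the authors use this exact estimator is an assumption. The function names and the `(n_samples, n_correct)` input format are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_1(results):
    """Average pass@1 over tasks; results is a list of (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

With one attempt per task, pass@1 reduces to the fraction of tasks solved, so solving 7 of 10 tasks gives 0.7.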

中文标题/摘要

标题：利用视觉语言模型能力增强代理自主科学研究

我们展示了由视觉语言模型（VLMs）引导的多代理系统可以提高端到端的自主科学研究能力。通过将图表视为可验证的检查点，VLM作为裁判评估图表与动态生成的领域特定评分标准的符合情况，使代理能够纠正自己的错误并在实时中引导探索性数据分析。在宇宙学和天体化学案例研究中，展示了从错误推理路径中恢复并适应新数据集而无需人类干预的能力。在数据驱动发现的10任务基准测试中，增强VLM的系统达到0.7-0.8的通过率，而仅代码和代码与文本基线分别达到0.2-0.3和0.4-0.5，同时提供可审计的推理轨迹以提高可解释性。代码可在以下链接获取：https://github.com/CMBAgents/cmbagent

Summary / 总结

The research aims to enhance autonomous scientific discovery using multi-agent systems guided by vision-language models. These models act as judges, evaluating figures against domain-specific rubrics and enabling agents to correct errors in real-time. The study demonstrates improved performance on a 10-task benchmark, with pass rates of 0.7-0.8 for VLM-augmented systems compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines. Additionally, the systems provide auditable reasoning traces that enhance interpretability. Code for the system is available on GitHub.

研究旨在通过由视觉-语言模型引导的多智能体系统来增强自主科学研究。这些模型作为裁判，根据领域特定的评分标准评估图表，并使智能体能够实时纠正错误。研究在10项任务基准测试中展示了改进的表现，VLM增强系统的通过率为0.7-0.8，而代码仅系统为0.2-0.3，代码和文本基线为0.4-0.5。此外，这些系统还提供了可审计的推理痕迹，提高了可解释性。系统代码可在GitHub上获得。

Failure to Mix: Large language models struggle to answer according to desired probability distributions

Authors: Ivy Yuqian Yang, David Yu Zhang

First: 2025-11-18T16:22:26+00:00 · Latest: 2025-11-18T16:22:26+00:00

Comments: 13 pages, 6 figures. Code and reproducibility package: https://github.com/BiostateAIresearch/failure-to-mix

Abstract

Scientific idea generation and selection require exploration following a target probability distribution. In contrast, current AI benchmarks have objectively correct answers, and training large language models (LLMs) via reinforcement learning against these benchmarks discourages probabilistic exploration. Here, we conducted systematic experiments requesting LLMs to produce outputs following simple probabilistic distributions, and found that all modern LLMs tested grossly fail to follow the distributions. For example, requesting a binary output of "1" 49% of the time produces an answer of "0" nearly 100% of the time. This step-function-like behavior of near-exclusively generating the output with the marginally highest probability overrules even strong in-built LLM biases.
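The contrast described above can be made concrete with a small simulation. This is a sketch under stated assumptions: `calibrated_responder` stands in for a responder that follows the requested distribution, and `step_responder` is an idealized version of the reported step-function behavior. Neither is an actual LLM call or the authors' code, and all names are hypothetical.

```python
import random

def calibrated_responder(p, rng):
    """Emit 1 with probability p, as the prompt requests."""
    return 1 if rng.random() < p else 0

def step_responder(p, rng):
    """Idealized step function: always emit the marginally more likely output.

    The tie at p == 0.5 is broken toward 0 arbitrarily.
    """
    return 1 if p > 0.5 else 0

def empirical_rate(responder, p, n_trials=10000, seed=0):
    """Fraction of trials in which the responder emits 1."""
    rng = random.Random(seed)
    return sum(responder(p, rng) for _ in range(n_trials)) / n_trials
```

Calling `empirical_rate(step_responder, 0.49)` yields a rate of exactly 0, while the calibrated responder stays close to 0.49. Comparing a model's measured rate against both references shows which regime it is in.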

中文标题/摘要

标题：混合失败：大型语言模型难以按照期望的概率分布作答

科学创意生成和选择需要遵循目标概率分布进行探索。相比之下，当前的AI基准有客观的正确答案，通过强化学习训练大型语言模型（LLMs）会抑制概率探索。在此，我们系统地要求LLMs生成遵循简单概率分布的输出，并发现所有测试的现代LLMs严重未能遵循这些分布。例如，要求二元输出为“1”49%的时间，实际生成的答案几乎100%为“0”。这种近似于阶跃函数的行为，即使在强内置LLM偏差面前，也优先生成概率略高的输出。

Summary / 总结

The study investigates how large language models (LLMs) handle probabilistic distributions in scientific idea generation and selection, which requires exploration following a target distribution. Despite training LLMs via reinforcement learning on benchmarks with objectively correct answers, the models fail to produce outputs according to the desired distributions. For instance, when asked to generate a binary output of '1' 49% of the time, the models nearly always produce '0'. This indicates a lack of probabilistic exploration and a tendency to generate the most probable output exclusively.

研究探讨了大型语言模型（LLMs）在生成输出时如何处理概率分布，这对于科学创意生成和选择至关重要。尽管通过强化学习在包含正确答案的基准上训练LLMs，但模型未能遵循所需的概率分布。例如，当要求以49%的概率输出'1'时，它们几乎总是输出'0'。这种行为在各种LLMs中是一致的，并且甚至会覆盖它们固有的偏差。

A More Realistic Evaluation of Cross-Frequency Transfer Learning and Foundation Forecasting Models

Authors: Kin G. Olivares, Malcolm Wolff, Tatiana Konstantinova, Shankar Ramasubramanian, Boris Oreshkin, Andrew Gordon Wilson, Andres Potapczynski, Willa Potosnak, Michael W. Mahoney, Mengfei Cao, Dmitry Efimov

Venue: NeurIPS 2025

First: 2025-09-23T18:19:50+00:00 · Latest: 2025-11-18T16:20:48+00:00

Comments: NeurIPS 2025 Workshop on Recent Advances in Time Series Foundation Models (BERT2S)

Abstract

Cross-frequency transfer learning (CFTL) has emerged as a popular framework for curating large-scale time series datasets to pre-train foundation forecasting models (FFMs). Although CFTL has shown promise, current benchmarking practices fall short of accurately assessing its performance. This shortcoming stems from many factors: an over-reliance on small-scale evaluation datasets; inadequate treatment of sample size when computing summary statistics; reporting of suboptimal statistical models; and failing to account for non-negligible risks of overlap between pre-training and test datasets. To address these limitations, we introduce a unified reimplementation of widely-adopted neural forecasting networks, adapting them for the CFTL setup; we pre-train only on proprietary and synthetic data, being careful to prevent test leakage; and we evaluate on 15 large, diverse public forecast competition datasets. Our empirical analysis reveals that statistical models' accuracy is frequently underreported. Notably, we confirm that statistical models and their ensembles consistently outperform existing FFMs by more than 8.2% in sCRPS, and by more than 20% in MASE, across datasets. However, we also find that synthetic dataset pre-training does improve the accuracy of an FFM by 7%.
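As a reminder of what the MASE figure above measures, here is a minimal sketch of the usual definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a (seasonal) naive forecast. The function name and arguments are hypothetical, and the authors' exact aggregation across series and datasets is not specified here.

```python
def mase(y_train, y_true, y_pred, season=1):
    """Mean absolute scaled error of y_pred against y_true.

    The scale is the in-sample MAE of the naive forecast that repeats the
    value observed `season` steps earlier in the training series.
    """
    n = len(y_train)
    scale = sum(abs(y_train[t] - y_train[t - season])
                for t in range(season, n)) / (n - season)
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    return mae / scale
```

Values below 1 mean the forecast beats the in-sample naive baseline on average. The sketch assumes a non-constant training series, so that the scale is nonzero.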

中文标题/摘要

标题：更现实的跨频域迁移学习和基础预测模型评估

跨频域迁移学习（CFTL）已成为一种流行的框架，用于编排大规模时间序列数据集以预训练基础预测模型（FFMs）。尽管CFTL显示出潜力，但当前的基准测试实践未能准确评估其性能。这一不足源于多个因素：过度依赖小型评估数据集；在计算汇总统计时未充分处理样本量；报告次优统计模型；以及未能考虑预训练和测试数据集之间非可忽略的风险重叠。为解决这些限制，我们引入了广泛采用的神经预测网络的统一重实现，将其适应CFTL设置；仅在专有和合成数据上进行预训练，以防止测试泄漏；并在15个大型多样的公共预测竞赛数据集上进行评估。我们的实证分析表明，统计模型的准确性经常被低估。值得注意的是，我们确认统计模型及其集成在所有数据集中始终比现有FFMs高出8.2%以上的sCRPS和20%以上的MASE。然而，我们还发现，合成数据集预训练确实可以提高FFM的准确性7%。

Summary / 总结

The research aims to provide a more realistic evaluation of cross-frequency transfer learning (CFTL) and foundation forecasting models (FFMs) by addressing common benchmarking issues. The authors re-implemented neural forecasting networks for CFTL, ensuring no test data leakage, and evaluated on 15 large, diverse public datasets. Key findings include that statistical models and their ensembles outperform existing FFMs by more than 8.2% in sCRPS and over 20% in MASE, and that pre-training on synthetic data improves FFM accuracy by 7%.

研究旨在通过解决当前基准测试实践中的局限性，提供对交叉频率转移学习（CFTL）和基础预测模型（FFMs）更现实的评估。研究人员引入了一种统一的神经预测网络实现方式，仅在专有和合成数据上进行预训练以避免测试泄露，并在15个大型多样的公共预测竞赛数据集上进行评估。结果显示，统计模型及其集成在sCRPS和MASE指标上分别比现有FFMs高出超过8.2%和20%，并且合成数据集预训练可以提高FFM的准确性约7%。

Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning

Authors: Ruoyu Qin, Weiran He, Weixiao Huang, Yangkun Zhang, Yikai Zhao, Bo Pang, Xinran Xu, Yingdi Shan, Yongwei Wu, Mingxing Zhang

First: 2025-11-18T16:12:21+00:00 · Latest: 2025-11-18T16:12:21+00:00

Comments: 16 pages, 12 figures, 6 tables

Abstract

Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.

中文标题/摘要

标题：Seer：在线上下文学习以实现快速同步LLM强化学习

强化学习（RL）已成为推动现代大型语言模型（LLMs）进步的关键，但现有的同步RL系统面临严重的性能瓶颈。回放阶段，它主导了端到端迭代时间，由于固有的工作负载不平衡，导致了显著的长尾延迟和资源利用率低下。我们提出了Seer，这是一种新颖的在线上下文学习系统，通过利用相同提示下请求之间被忽视的输出长度和生成模式的相似性来解决这些挑战。Seer引入了三种关键技术：动态负载均衡的分段回放、上下文感知调度和自适应分组推测解码。这些机制共同显著减少了长尾延迟并提高了回放期间的资源效率。在生产级RL工作负载上的评估表明，与最先进的同步RL系统相比，Seer将端到端回放吞吐量提高了74%到97%，并将长尾延迟降低了75%到93%，显著加速了RL训练迭代。

Summary / 总结

Seer is a system designed to enhance the performance of synchronous reinforcement learning for Large Language Models (LLMs) by addressing the long-tail latency and resource inefficiency issues in the rollout phase. It uses online context learning to identify and exploit similarities in request patterns, implementing techniques such as divided rollout, context-aware scheduling, and adaptive grouped speculative decoding. These methods significantly reduce long-tail latency and improve resource utilization, with evaluations showing a 74% to 97% increase in end-to-end rollout throughput and a 75% to 93% reduction in long-tail latency compared to existing systems.

Seer 是一个系统，旨在通过解决同步强化学习 (RL) 中的大语言模型 (LLMs) 的瓶颈问题来提升其性能。它引入了动态负载均衡的分段回放、上下文感知调度和自适应分组推测解码等技术，以减少长尾延迟并提高资源效率。评估结果显示，Seer 将端到端回放吞吐量提高了 74% 至 97%，并将长尾延迟降低了 75% 至 93%，显著加快了 RL 训练迭代。

3D-Guided Scalable Flow Matching for Generating Volumetric Tissue Spatial Transcriptomics from Serial Histology

Authors: Mohammad Vali Sanian, Arshia Hemmat, Amirhossein Vahidi, Jonas Maaskola, Jimmy Tsz Hang Lee, Stanislaw Makarchuk, Yeliz Demirci, Nana-Jane Chipampe, Omer Bayraktar, Lassi Paavolainen, Mohammad Lotfollahi

First: 2025-11-18T16:08:24+00:00 · Latest: 2025-11-18T16:08:24+00:00

Comments: 11 pages

Abstract

A scalable and robust 3D tissue transcriptomics profile can enable a holistic understanding of tissue organization and provide deeper insights into human biology and disease. Most predictive algorithms that infer ST directly from histology treat each section independently and ignore 3D structure, while existing 3D-aware approaches are not generative and do not scale well. We present Holographic Tissue Expression Inpainting and Analysis (HoloTea), a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E while explicitly using information from adjacent sections. Our key idea is to retrieve morphologically corresponding spots on neighboring slides in a shared feature space and fuse this cross section context into a lightweight ControlNet, allowing conditioning to follow anatomical continuity. To better capture the count nature of the data, we introduce a 3D-consistent prior for flow matching that combines a learned zero-inflated negative binomial (ZINB) prior with a spatial-empirical prior constructed from neighboring sections. A global attention block introduces 3D H&E scaling linearly with the number of spots in the slide, enabling training and inference on large 3D ST datasets. Across three spatial transcriptomics datasets spanning different tissue types and resolutions, HoloTea consistently improves 3D expression accuracy and generalization compared to 2D and 3D baselines. We envision HoloTea advancing the creation of accurate 3D virtual tissues, ultimately accelerating biomarker discovery and deepening our understanding of disease.
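The zero-inflated negative binomial (ZINB) distribution mentioned above can be written out as a short sketch. It uses the common mean/dispersion parameterization (`mu`, `theta`) with a zero-inflation probability `pi`. How HoloTea learns these parameters and combines them with the spatial-empirical prior is not shown, and the function names are hypothetical.

```python
import math

def nb_log_pmf(k, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))

def zinb_log_pmf(k, mu, theta, pi):
    """Log-pmf of a ZINB: with probability pi the count is a structural zero."""
    nb = nb_log_pmf(k, mu, theta)
    if k == 0:
        # A zero is either a structural zero or a negative binomial zero.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb
```

The extra mass at zero is what makes this family a natural fit for sparse count data such as spot-level gene expression. The sketch assumes 0 <= pi < 1.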

中文标题/摘要

标题：基于3D引导的可扩展流匹配方法用于从连续组织切片生成体积组织空间转录组学

一个可扩展且稳健的3D组织转录组学谱型可以促进对组织组织的理解，并提供对人类生物学和疾病更深入的见解。大多数从组织切片推断ST的预测算法将每个切片独立处理并忽略3D结构，而现有的3D感知方法不是生成的，也不易于扩展。我们提出了Holographic Tissue Expression Inpainting and Analysis (HoloTea)，这是一种3D感知的流匹配框架，可以从HE染色中推断出斑点级基因表达，并明确利用相邻切片的信息。我们的核心思想是在共享特征空间中检索邻近幻灯片上的形态对应斑点，并将这种跨切片上下文融合到一个轻量级的ControlNet中，使条件遵循解剖连续性。为了更好地捕捉数据的计数性质，我们引入了一种3D一致的先验用于流匹配，结合了学习的零膨胀负二项式（ZINB）先验和从相邻切片构建的空间经验先验。全局注意力块引入了3D HE缩放，其线性依赖于幻灯片中的斑点数量，从而在大型3D ST数据集上实现训练和推理。在三个不同组织类型和分辨率的空间转录组学数据集中，HoloTea在3D表达准确性和泛化能力上始终优于2D和3D基线。我们设想HoloTea将促进准确的3D虚拟组织的创建，最终加速生物标志物的发现并加深我们对疾病的理解。

Summary / 总结

HoloTea is a 3D-aware flow-matching framework that imputes spot-level gene expression from H&E sections by leveraging information from adjacent sections. It uses a shared feature space to retrieve morphologically corresponding spots and a 3D-consistent prior for flow matching. Across three spatial transcriptomics datasets, HoloTea improves 3D expression accuracy and generalization compared to 2D and 3D baselines.

HoloTea 是一个 3D 意识的流匹配框架，通过利用相邻切片的信息来从 HE 切片中推断斑点级基因表达。它使用共享特征空间来检索形态学上对应的斑点，并将此上下文融合到一个轻量级的 ControlNet 中。HoloTea 引入了 3D 一致的先验用于流匹配，并且包含了一个全局注意力模块以处理大规模数据集。在三个不同组织类型和分辨率的空间转录组学数据集中，HoloTea 在 3D 表达准确性和泛化能力上优于 2D 和 3D 基线方法。

Dimension vs. Precision: A Comparative Analysis of Autoencoders and Quantization for Efficient Vector Retrieval on BEIR SciFact

Authors: Satyanarayan Pati

First: 2025-11-17T07:02:11+00:00 · Latest: 2025-11-18T16:07:31+00:00

Comments: 16 pages, 9 figures, 1 table

Abstract

Dense retrieval models have become a standard for state-of-the-art information retrieval. However, their high-dimensional, high-precision (float32) vector embeddings create significant storage and memory challenges for real-world deployment. To address this, we conduct a rigorous empirical study on the BEIR SciFact benchmark, evaluating the trade-offs between two primary compression strategies: (1) Dimensionality Reduction via deep Autoencoders (AE), compressing the original 384-dim vectors into latent spaces ranging from 384 down to 12 dimensions, and (2) Precision Reduction via Quantization (float16, int8, and binary). We systematically compare each method by measuring the "performance loss" (or gain) relative to a float32 baseline across a full suite of retrieval metrics (NDCG, MAP, MRR, Recall, Precision) at various k cutoffs. Our results show that int8 scalar quantization provides the most effective "sweet spot," achieving a 4x compression with a negligible (~1-2%) drop in nDCG@10. In contrast, Autoencoders show a graceful degradation but suffer a more significant performance loss at equivalent 4x compression ratios (AE-96). Binary quantization was found to be unsuitable for this task due to catastrophic performance drops. This work provides a practical guide for deploying efficient, high-performance retrieval systems.
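To show where the 4x figure for int8 comes from, here is a minimal sketch of symmetric per-vector scalar quantization. The scaling scheme, the function names, and the toy embedding are assumptions for illustration. The paper's actual quantizer (for example, how it calibrates ranges) may differ.

```python
from array import array

def quantize_int8(vec):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [int(round(x / scale)) for x in vec]
    return codes, scale

def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

# Toy 4-dim embedding (hypothetical values; real vectors here are 384-dim).
embedding = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(embedding)
restored = dequantize_int8(codes, scale)

# Storage: 4 bytes per float32 component versus 1 byte per int8 code.
float_bytes = len(array("f", embedding).tobytes())
int8_bytes = len(array("b", codes).tobytes())
```

Each component shrinks from 4 bytes to 1, giving the 4x ratio (ignoring the single stored scale). The reconstruction error per component is bounded by half the scale.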

中文标题/摘要

标题：维度 vs. 精度：BEIR SciFact 上自动编码器和量化高效向量检索的比较分析

密集检索模型已成为最先进的信息检索的标准。然而，它们的高维、高精度（float32）向量嵌入在实际部署中造成了显著的存储和内存挑战。为了解决这一问题，我们在BEIR SciFact基准上进行了严格的实证研究，评估了两种主要压缩策略之间的权衡：（1）通过深度自动编码器（AE）进行维度降低，将原始384维向量减少到12维的潜在空间；（2）通过量化（float16、int8和二进制）进行精度降低。我们通过测量相对于float32基线的“性能损失”（或增益）来系统地比较每种方法，使用一系列检索指标（NDCG、MAP、MRR、召回率、精度）在不同k截断点进行评估。结果显示，int8标量量化提供了最有效的“甜蜜点”，实现了4倍压缩，nDCG@10的下降几乎可以忽略不计（~1-2%）。相比之下，自动编码器在等效4倍压缩比下表现出优雅的退化，但性能损失更大。二进制量化由于性能急剧下降而被认为不适合此任务。本研究为部署高效、高性能的检索系统提供了实用指南。

Summary / 总结

This study investigates the trade-offs between dimensionality reduction via Autoencoders and precision reduction via quantization for efficient vector retrieval on the BEIR SciFact benchmark. The research evaluates float16, int8, and binary quantization, as well as Autoencoders that reduce the original 384-dimensional vectors to 12 dimensions. Key findings show that int8 scalar quantization achieves a 4x compression with minimal performance loss, while Autoencoders suffer more significant degradation at the same compression ratio. Binary quantization was found to be unsuitable due to poor performance.

研究在BEIR SciFact基准上比较了通过自编码器进行维度降低和通过量化进行精度降低之间的权衡，以实现高效的向量检索。评估了float16、int8和二值量化，以及将维度从384降低到12的自编码器，并在各种检索指标（NDCG、MAP、MRR、召回率、精度）中测量性能损失。结果表明，int8量化提供了最佳的平衡，实现了4倍压缩并仅轻微[~1-2%]降低nDCG@10，而自编码器在相同压缩比下显示出更大的性能损失。

Fine-Grained Representation for Lane Topology Reasoning

Authors: Guoqing Xu, Yiheng Li, Yang Yang

Venue: AAAI 2026

First: 2025-11-16T13:24:30+00:00 · Latest: 2025-11-18T16:06:07+00:00

Comments: Accepted by AAAI 2026

Abstract

Precise modeling of lane topology is essential for autonomous driving, as it directly impacts navigation and control decisions. Existing methods typically represent each lane with a single query and infer topological connectivity based on the similarity between lane queries. However, this kind of design struggles to accurately model complex lane structures, leading to unreliable topology prediction. In this view, we propose a Fine-Grained lane topology reasoning framework (TopoFG). It divides the procedure from bird's-eye-view (BEV) features to topology prediction via fine-grained queries into three phases, i.e., Hierarchical Prior Extractor (HPE), Region-Focused Decoder (RFD), and Robust Boundary-Point Topology Reasoning (RBTR). Specifically, HPE extracts global spatial priors from the BEV mask and local sequential priors from in-lane keypoint sequences to guide subsequent fine-grained query modeling. RFD constructs fine-grained queries by integrating the spatial and sequential priors. It then samples reference points in RoI regions of the mask and applies cross-attention with BEV features to refine the query representations of each lane. RBTR models lane connectivity based on boundary-point query features and further employs a topological denoising strategy to reduce matching ambiguity. By integrating spatial and sequential priors into fine-grained queries and applying a denoising strategy to boundary-point topology reasoning, our method precisely models complex lane structures and delivers trustworthy topology predictions. Extensive experiments on the OpenLane-V2 benchmark demonstrate that TopoFG achieves new state-of-the-art performance, with an OLS of 48.0 on subsetA and 45.4 on subsetB.

中文标题/摘要

标题：面向车道拓扑推理的细粒度表示

精确建模车道拓扑对于自动驾驶至关重要，因为它直接影响导航和控制决策。现有方法通常用单个查询表示每条车道，并基于车道查询之间的相似性推断拓扑连接性。然而，这种设计难以准确建模复杂的车道结构，导致拓扑预测不可靠。基于此，我们提出了一种精细粒度的车道拓扑推理框架（TopoFG）。该框架将从鸟瞰图（BEV）特征到拓扑预测的过程细分为三个阶段，即层次先验提取器（HPE）、区域聚焦解码器（RFD）和鲁棒边界点拓扑推理（RBTR）。具体而言，HPE 从BEV掩码中提取全局空间先验，并从车道内的关键点序列中提取局部顺序先验，以指导后续的精细粒度查询建模。RFD 通过整合空间和顺序先验构建精细粒度查询。然后，它在掩码的RoI区域中采样参考点，并应用交叉注意力机制与BEV特征相结合，以细化每条车道的查询表示。RBTR 基于边界点查询特征建模车道连接性，并进一步采用拓扑去噪策略以减少匹配不确定性。通过将空间和顺序先验整合到精细粒度查询中，并在边界点拓扑推理中应用去噪策略，我们的方法能够精确建模复杂的车道结构并提供可靠的拓扑预测。在OpenLane-V2基准上的广泛实验表明，TopoFG 达到了新的最佳性能，subsetA上的OLS为48.0，subsetB上的OLS为45.4。

Summary / 总结

The research aims to improve lane topology modeling for autonomous driving by addressing the limitations of existing methods. It introduces TopoFG, a framework that divides the process into three phases: Hierarchical Prior Extractor, Region-Focused Decoder, and Robust Boundary-Point Topology Reasoning. This approach integrates spatial and sequential priors into fine-grained queries and applies a denoising strategy to enhance the accuracy of lane connectivity prediction. Experiments show that TopoFG outperforms previous methods, achieving new state-of-the-art performance on the OpenLane-V2 benchmark with OLS scores of 48.0 and 45.4 on subsetA and subsetB respectively.

研究旨在通过改进车道拓扑建模来提升自动驾驶能力，解决现有方法的局限性。提出了TopoFG框架，将流程分为层次先验提取器（HPE）、区域聚焦解码器（RFD）和鲁棒边界点拓扑推理（RBTR）三个阶段。该方法将空间和序列先验整合到细粒度查询中，并应用去噪策略以提高车道连接性的预测准确性。在OpenLane-V2基准测试上的实验表明，TopoFG在subsetA和subsetB上分别取得了48.0和45.4的OLS，达到新的最佳水平。

Logos as a Well-Tempered Pre-train for Sign Language Recognition

Authors: Ilya Ovodov, Petr Surovtsev, Karina Kvanchiani, Alexander Kapitanov, Alexander Nagaev

Venue: In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24351-24364 (2025)

First: 2025-05-15T16:31:49+00:00 · Latest: 2025-11-18T15:59:03+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper examines two aspects of the isolated sign language recognition (ISLR) task. First, although a certain number of datasets is available, the data for individual sign languages is limited. It poses the challenge of cross-language ISLR model training, including transfer learning. Second, similar signs can have different semantic meanings. It leads to ambiguity in dataset labeling and raises the question of the best policy for annotating such signs. To address these issues, this study presents Logos, a novel Russian Sign Language (RSL) dataset, the most extensive available ISLR dataset by the number of signers, one of the most extensive datasets in size and vocabulary, and the largest RSL dataset. It is shown that a model pre-trained on the Logos dataset can be used as a universal encoder for other language SLR tasks, including few-shot learning. We explore cross-language transfer learning approaches and find that joint training using multiple classification heads benefits accuracy for the target low-resource datasets the most. The key feature of the Logos dataset is explicitly annotated visually similar sign groups. We show that explicitly labeling visually similar signs improves trained model quality as a visual encoder for downstream tasks. Based on the proposed contributions, we outperform current state-of-the-art results for the WLASL dataset and get competitive results for the AUTSL dataset, with a single-stream model processing solely RGB video. The source code, dataset, and pre-trained models are publicly available.

中文标题/摘要

标题：Logos：一种面向手语识别的均衡（Well-Tempered）预训练

本文探讨了孤立手语识别(ISLR)任务的两个方面。首先，尽管已有一定数量的数据集，但每种手语的数据量有限，这给包括迁移学习在内的跨语言ISLR模型训练带来了挑战。其次，相似的手语可能具有不同的语义，这导致数据集标注存在歧义，并引出了如何标注此类手语的最佳策略问题。为了解决这些问题，本文提出了Logos，一个新的俄罗斯手语(RSL)数据集：它是目前可用的ISLR数据集中手语者数量最多的数据集，是规模和词汇量最大的数据集之一，也是最大的RSL数据集。研究表明，在Logos数据集上预训练的模型可以作为其他语言SLR任务（包括少样本学习）的通用编码器。我们探索了跨语言迁移学习方法，并发现使用多个分类头的联合训练对目标低资源数据集的准确率提升最大。Logos数据集的关键特征是显式标注了视觉上相似的手语组。我们证明，显式标注视觉上相似的手语可以提高训练所得模型作为下游任务视觉编码器的质量。基于上述贡献，我们仅使用处理RGB视频的单流模型，就在WLASL数据集上超过了当前最先进的结果，并在AUTSL数据集上取得了有竞争力的结果。源代码、数据集和预训练模型均已公开。

Summary / 总结

This paper addresses the challenges of isolated sign language recognition, particularly the limited data for individual sign languages and the ambiguity in labeling similar signs. It introduces Logos, a large Russian Sign Language dataset, and demonstrates that a model pre-trained on Logos can serve as a universal encoder for other sign languages. The study finds that joint training with multiple classification heads improves accuracy for low-resource datasets. Explicitly annotating visually similar sign groups in the Logos dataset enhances the model's quality as a visual encoder. The proposed approach outperforms current state-of-the-art results for the WLASL dataset and achieves competitive results for the AUTSL dataset with a single-stream model using RGB video.

本文探讨了孤立手语识别中的挑战，特别是单一手语的数据有限，以及相似手势含义不同所导致的标注歧义问题。研究引入了Logos，一个大规模的俄罗斯手语数据集，并展示了在Logos上预训练的模型可以作为其他手语识别任务的通用编码器。研究发现，使用多个分类头进行联合训练可以提高低资源数据集的准确率。显式标注Logos中视觉上相似的手势组提高了模型作为视觉编码器的质量。该方法仅使用处理RGB视频的单流模型，就在WLASL数据集上超越了当前最先进的结果，并在AUTSL数据集上取得了有竞争力的结果。

XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation

Authors: Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

First: 2025-11-18T15:53:42+00:00 · Latest: 2025-11-18T15:53:42+00:00

Comments: 11 figures, 10 tables, 38 pages. Submitted to Artificial Intelligence in Medicine (currently with editor)

Abs · PDF · Code1 · Code2

Abstract

Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It utilizes a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features for cross-modal mutual reinforcement. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on the data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Experimental results show that the integration of multimodal data via cross-attention outperforms naive feature concatenation without cross-attention, reducing MSE by 16.7%, MAE by 6.03%, and increasing the R2 score by 16.4%, highlighting the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.

中文标题/摘要

标题：XAttn-BMD：跨注意力多模态深度学习用于股骨颈骨矿密度估计

骨健康不良是公共卫生的重要问题，低骨矿密度（BMD）会增加骨折风险，这是骨质疏松症的关键特征。我们提出了XAttn-BMD（跨注意力BMD），这是一种多模态深度学习框架，可以从髋部X光片和结构化临床元数据中预测股骨颈BMD。它利用了一种新颖的双向跨注意力机制，动态地整合图像和元数据特征，实现跨模态相互强化。针对BMD不平衡问题，我们定制了一种加权平滑L1损失函数，以优先处理临床显著病例。在赫特福德郡队列研究的数据上进行的大量实验表明，我们的模型在回归泛化和鲁棒性方面优于基线模型。消融研究证实了跨注意力融合和定制损失函数的有效性。实验结果表明，通过跨注意力融合多模态数据优于不使用跨注意力的简单特征拼接，MSE降低了16.7%，MAE降低了6.03%，R2分数提高了16.4%，突显了该方法在股骨颈BMD估计中的有效性。此外，使用临床相关股骨颈BMD阈值进行二元分类评估了筛查性能，展示了该模型在实际场景中的潜力。

Summary / 总结

XAttn-BMD is a multimodal deep learning framework that predicts femoral neck bone mineral density (BMD) from hip X-ray images and structured clinical metadata. It employs a bidirectional cross-attention mechanism to integrate image and metadata features, and a Weighted Smooth L1 loss to handle BMD imbalance. Experiments on Hertfordshire Cohort Study data show that XAttn-BMD outperforms baseline models in regression generalization and robustness. Compared with naive feature concatenation, cross-attention fusion reduces MSE by 16.7% and MAE by 6.03% and increases the R2 score by 16.4%. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function.

XAttn-BMD 是一个利用髋部 X 光图像和临床元数据预测股骨颈骨矿密度 (BMD) 的多模态深度学习框架。它使用双向交叉注意力机制来整合图像和元数据特征，并使用针对 BMD 不平衡定制的加权平滑 L1 损失。在赫特福德郡队列研究数据上的实验表明，XAttn-BMD 在回归和鲁棒性方面优于基线模型，MSE 减少了 16.7%，MAE 减少了 6.03%，R2 分数提高了 16.4%。消融研究证实了交叉注意力和定制损失函数的有效性。该模型在临床相关 BMD 阈值下的筛查性能表明其在实际应用中的潜力。
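
To make the idea of a Weighted Smooth L1 loss concrete, the sketch below combines the standard Smooth L1 (Huber-style) penalty with per-sample weights. The weighting rule shown here (`low_bmd_weight`, with its `threshold` and `boost` values) is purely hypothetical; the abstract does not specify how the authors derive their weights, so this should be read as one possible instance of the general technique rather than the paper's loss.

```python
def smooth_l1(error, beta=1.0):
    """Standard Smooth L1: quadratic near zero, linear for large errors."""
    abs_err = abs(error)
    if abs_err < beta:
        return 0.5 * abs_err * abs_err / beta
    return abs_err - 0.5 * beta


def weighted_smooth_l1(preds, targets, weights, beta=1.0):
    """Weighted average of per-sample Smooth L1 losses."""
    total = sum(w * smooth_l1(p - t, beta) for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)


def low_bmd_weight(target, threshold=0.7, boost=3.0):
    """Hypothetical rule: upweight samples whose target falls below a threshold."""
    return boost if target < threshold else 1.0


# Illustrative values only; not data from the study.
targets = [0.60, 0.85, 1.00]
preds = [0.70, 0.80, 1.00]

weights = [low_bmd_weight(t) for t in targets]
weighted = weighted_smooth_l1(preds, targets, weights)
unweighted = weighted_smooth_l1(preds, targets, [1.0] * len(targets))

print(f"weighted loss:   {weighted:.5f}")
print(f"unweighted loss: {unweighted:.5f}")
```

In this toy example the largest error falls on the low-target sample, so upweighting it raises the loss relative to the unweighted average, which is the intended effect of prioritizing under-represented cases.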

A Method for Characterizing Disease Progression from Acute Kidney Injury to Chronic Kidney Disease

Authors: Yilu Fang, Jordan G. Nestor, Casey N. Ta, Jerard Z. Kneifati-Hayek, Chunhua Weng

First: 2025-11-18T15:53:31+00:00 · Latest: 2025-11-18T15:53:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Patients with acute kidney injury (AKI) are at high risk of developing chronic kidney disease (CKD), but identifying those at greatest risk remains challenging. We used electronic health record (EHR) data to dynamically track AKI patients' clinical evolution and characterize AKI-to-CKD progression. Post-AKI clinical states were identified by clustering patient vectors derived from longitudinal medical codes and creatinine measurements. Transition probabilities between states and progression to CKD were estimated using multi-state modeling. After identifying common post-AKI trajectories, CKD risk factors in AKI subpopulations were identified through survival analysis. Of 20,699 patients with AKI at admission, 3,491 (17%) developed CKD. We identified fifteen distinct post-AKI states, each with different probabilities of CKD development. Most patients (75%, n=15,607) remained in a single state or made only one transition during the study period. Both established (e.g., AKI severity, diabetes, hypertension, heart failure, liver disease) and novel CKD risk factors were identified, with their impact varying across these clinical states. This study demonstrates a data-driven approach for identifying high-risk AKI patients, supporting the development of decision-support tools for early CKD detection and intervention.

中文标题/摘要

标题：一种从急性肾损伤到慢性肾病疾病进展的表征方法

急性肾损伤(AKI)患者发展为慢性肾病(CKD)的风险很高，但识别出风险最大的患者仍然具有挑战性。我们利用电子健康记录(EHR)数据动态跟踪AKI患者的临床演变，并表征AKI到CKD的进展。通过对由纵向医疗代码和肌酐测量值生成的患者向量进行聚类，识别AKI后的临床状态。使用多状态模型估计状态之间的转换概率和进展到CKD的概率。在识别出常见的AKI后轨迹后，通过生存分析确定AKI亚群中的CKD风险因素。在20,699名入院时患有AKI的患者中，3,491名(17%)发展为CKD。我们确定了十五种不同的AKI后状态，每种状态都有不同的CKD发展概率。大多数患者(75%，n=15,607)在整个研究期间仅停留在一个状态或仅发生一次状态转换。研究识别出了既有的(如AKI严重程度、糖尿病、高血压、心力衰竭、肝病)和新的CKD风险因素，其影响在不同临床状态之间有所差异。本研究展示了一种通过数据驱动识别高风险AKI患者的方法，支持早期CKD检测和干预决策支持工具的发展。

Summary / 总结

The study aimed to identify high-risk patients with acute kidney injury (AKI) who are likely to develop chronic kidney disease (CKD). By using electronic health record data, the researchers dynamically tracked the clinical evolution of AKI patients and characterized their progression to CKD using multi-state modeling and survival analysis. They identified 15 distinct post-AKI states and found that most patients remained in a single state or made only one transition during the study period. The study revealed both established and novel CKD risk factors that varied across these clinical states, highlighting the importance of a data-driven approach for early CKD detection and intervention.

该研究旨在通过使用电子健康记录（EHR）数据动态跟踪急性肾损伤（AKI）患者的临床演变，识别出可能发展为慢性肾病（CKD）的高风险患者。研究人员使用多状态模型估计了临床状态之间的转换概率，并识别出了十五种不同的AKI后临床状态，每种状态都有不同的CKD发展概率。研究发现，20,699名AKI患者中有17%发展为CKD，并且识别出了既有的和新的CKD风险因素，这些因素对不同临床状态的影响不同。该研究支持了早期CKD检测和干预决策支持工具的发展。
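
As a rough illustration of what "transition probabilities between states" means, the sketch below estimates them by counting observed state-to-state moves in per-patient sequences. This is a simplified discrete-time, count-based estimate; the study's multi-state modeling is likely more involved (for example, handling time and censoring), and the state labels and sequences here are invented for illustration.

```python
from collections import Counter, defaultdict


def estimate_transition_probs(sequences):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each consecutive (current, next) pair in the sequence.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


# Hypothetical post-AKI state sequences; "CKD" marks progression to CKD.
patients = [
    ["S1", "S1", "CKD"],
    ["S1", "S2", "S2"],
    ["S2", "CKD"],
    ["S1", "S1", "S1"],
]

probs = estimate_transition_probs(patients)
for state in sorted(probs):
    for nxt in sorted(probs[state]):
        print(f"P({nxt} | {state}) = {probs[state][nxt]:.2f}")
```

Each row of the resulting table sums to 1, so states can be compared directly by their estimated probability of moving to CKD.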

StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving

Authors: Ruiyang Hao, Bowen Jing, Haibao Yu, Zaiqing Nie

First: 2025-06-30T15:48:38+00:00 · Latest: 2025-11-18T15:45:36+00:00

Comments: 25 pages, 7 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Personalization, while extensively studied in conventional autonomous driving pipelines, has been largely overlooked in the context of end-to-end autonomous driving (E2EAD), despite its critical role in fostering user trust, safety perception, and real-world adoption. A primary bottleneck is the absence of large-scale real-world datasets that systematically capture driving preferences, severely limiting the development and evaluation of personalized E2EAD models. In this work, we introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD, integrating comprehensive scene topology with rich dynamic context derived from agent dynamics and semantics inferred via a fine-tuned vision-language model (VLM). We propose a hybrid annotation pipeline that combines behavioral analysis, rule-and-distribution-based heuristics, and subjective semantic modeling guided by VLM reasoning, with final refinement through human-in-the-loop verification. Building upon this dataset, we introduce the first standardized benchmark for systematically evaluating personalized E2EAD models. Empirical evaluations on state-of-the-art architectures demonstrate that incorporating personalized driving preferences significantly improves behavioral alignment with human demonstrations.

中文标题/摘要

标题：StyleDrive：面向端到端自动驾驶的驾驶风格感知基准测试

个性化，在传统自动驾驶管道中得到了广泛研究，但在端到端自动驾驶（E2EAD）的背景下却很少被关注，尽管它在培养用户信任、安全感知和实际应用中的作用至关重要。主要瓶颈在于缺乏大规模的现实世界数据集，这些数据集能够系统地捕捉驾驶偏好，严重限制了个性化E2EAD模型的开发和评估。在本文中，我们介绍了第一个专门用于个性化E2EAD的大规模现实世界数据集，该数据集结合了全面的场景拓扑和从代理动力学和通过微调的视觉语言模型（VLM）推断的语义中提取的丰富动态上下文。我们提出了一种混合注释流水线，该流水线结合了行为分析、基于规则和分布的启发式方法以及由VLM推理引导的主观语义建模，并通过人工在环验证进行最终细化。基于此数据集，我们介绍了第一个标准化基准，用于系统地评估个性化E2EAD模型。对最先进的架构的实证评估表明，纳入个性化的驾驶偏好可以显著提高行为与人类示范的一致性。

Summary / 总结

This paper addresses the lack of personalization in end-to-end autonomous driving (E2EAD) by introducing a new large-scale real-world dataset that captures driving preferences. The dataset integrates scene topology and dynamic context, and a hybrid annotation pipeline is proposed to annotate it. The authors also introduce a benchmark to evaluate personalized E2EAD models, showing that incorporating personalized driving preferences enhances behavioral alignment with human demonstrations.

本文通过引入一个新的大规模真实世界数据集来解决端到端自动驾驶(E2EAD)中缺乏个性化的问题，该数据集能够捕捉驾驶偏好。该数据集整合了场景拓扑和动态上下文，并提出了一种混合注释流水线来标注数据。作者还引入了一个基准来评估个性化E2EAD模型，实验证明，将个性化驾驶偏好纳入模型可以提高行为与人类演示的对齐程度。

MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer's Disease Cohorts

Authors: Nathaniel Putera, Daniel Vilet Rodríguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi

First: 2025-11-18T15:45:01+00:00 · Latest: 2025-11-18T15:45:01+00:00

Comments: Accepted at SPIE - Medical Imaging Conference 2026

Abs · PDF · Code1 · Code2

Abstract

Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.

中文标题/摘要

标题：MRI嵌入补充临床预测因子以建模阿尔茨海默病队列中的认知衰退

准确建模阿尔茨海默病中的认知衰退对于早期分层和个性化管理至关重要。虽然表格预测因子提供了稳健的全局风险标志，但它们捕捉细微脑部变化的能力仍然有限。在本研究中，我们评估了表格和基于成像的表示形式的预测贡献，重点关注基于Transformer的磁共振成像（MRI）嵌入。我们引入了一种基于动态时间规整聚类的轨迹感知标签策略，以捕捉认知变化的异质模式，并通过无监督重建在协调和增强的MRI数据上训练3D视觉Transformer（ViT），以获得保留解剖结构的嵌入，而无需进展标签。预训练的编码器嵌入随后使用传统机器学习分类器和深度学习头部进行评估，并与表格表示和卷积网络基线进行比较。结果强调了不同模态的互补优势。临床和容积特征在预测轻度和重度进展方面取得了最高的AUC值，约为0.70，突显了它们在捕捉全局衰退轨迹方面的效用。相比之下，ViT模型的MRI嵌入在区分认知稳定个体方面最有效，AUC值为0.71。然而，所有方法在异质的中度组中表现不佳。这些发现表明，临床特征在识别高风险极端方面表现出色，而基于Transformer的MRI嵌入对细微的稳定性标志更为敏感，这促使采用多模态融合策略进行AD进展建模。

Summary / 总结

This study aims to improve the modeling of cognitive decline in Alzheimer's disease by comparing clinical and imaging-based representations. It trains a 3D Vision Transformer via unsupervised reconstruction to obtain MRI embeddings and uses a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture patterns of cognitive change. Clinical and volumetric features achieve the highest AUCs (around 0.70) for predicting mild and severe progression, while MRI embeddings from the Vision Transformer are most effective at identifying cognitively stable individuals (AUC of 0.71). All methods struggle with the heterogeneous moderate group, motivating multimodal fusion strategies.

该研究旨在通过整合临床和影像数据来提高阿尔茨海默病认知衰退的建模。研究引入了一种轨迹感知的标签方法，并使用3D视觉变换器生成不依赖于进展标签的MRI嵌入。研究发现，临床和体积特征最适合预测严重认知衰退，而MRI嵌入对于识别认知稳定个体更为有效。这些发现表明，结合这些模态可以提高阿尔茨海默病进展建模的准确性。
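
The trajectory-aware labeling mentioned above relies on Dynamic Time Warping (DTW), which compares sequences that follow a similar shape at different speeds or with different lengths. The sketch below shows the classic dynamic-programming DTW distance only; the clustering step built on top of it is not shown, and the score trajectories are made-up values rather than cohort data.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matches the previous alignment of b[j-1]
                cost[i][j - 1],      # b[j-1] also matches the previous alignment of a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# Hypothetical cognitive-score trajectories over successive visits.
stable = [29, 29, 28, 29]
slow_decline = [29, 28, 27, 26, 25]
delayed_decline = [29, 29, 28, 27, 26, 25]  # same decline, starting one visit later

print("slow vs delayed:", dtw_distance(slow_decline, delayed_decline))
print("stable vs slow: ", dtw_distance(stable, slow_decline))
```

Because DTW can align the repeated first visit in `delayed_decline` with the single first visit in `slow_decline`, the two declining trajectories are treated as the same pattern, while the stable trajectory stays at a positive distance from both.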