arXiv Paper Digest

Snapshot: 20260210_0409

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Authors: Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak

First: 2026-02-06T18:59:59+00:00 · Latest: 2026-02-06T18:59:59+00:00

Comments: 21 pages, 6 figures and 4 tables

Abstract

Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at https://genmilab.github.io/MedMO-Page

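Title/Abstract (translated from Chinese)

Title: MedMO: Medical Image Understanding Based on a Multimodal Large Language Model

Multimodal large language models (MLLMs) have developed rapidly, but their use in medicine is limited by gaps in domain coverage, modality alignment, and grounded reasoning. This paper introduces MedMO, a medical foundation model built on a general MLLM architecture and trained only on large-scale, domain-specific data. MedMO uses a multi-stage training recipe: (i) cross-modal pretraining that aligns heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning under multi-task supervision covering image captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning that combines factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO improves average accuracy by 13.7% over the baseline and comes within 1.9% of the SOTA Fleming-VL. For text-based QA, it gains 6.9% over the baseline and 14.5% over Fleming-VL. In medical report generation, MedMO makes significant gains in both semantic and clinical accuracy. It also shows strong grounding capability, with an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, highlighting its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO's cross-modality generalization. Two versions of MedMO are released: 4B and 8B. Project details are available at https://genmilab.github.io/MedMO-Page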
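The MedMO abstract mentions a box-level GIoU reward without giving details. As a rough illustration only, the sketch below computes the standard Generalized IoU between two axis-aligned boxes in `(x1, y1, x2, y2)` form. The function name `giou`, the box format, and the handling of degenerate boxes are assumptions made here; how MedMO turns this score into a reward is not stated in the abstract.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def giou(a: Box, b: Box) -> float:
    """Standard Generalized IoU in [-1, 1]; 1 means identical boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull_w = max(ax2, bx2) - min(ax1, bx1)
    hull_h = max(ay2, by2) - min(ay1, by1)
    hull = hull_w * hull_h

    if union <= 0.0 or hull <= 0.0:
        return 0.0  # assumption: treat degenerate (zero-area) input as neutral

    iou = inter / union
    # Penalize the part of the enclosing box not covered by the union.
    return iou - (hull - union) / hull
```

Unlike plain IoU, the score stays informative when the boxes do not overlap: it becomes more negative as the predicted box moves further from the target, which is one common reason to prefer it as a training signal.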

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Authors: Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen

First: 2026-02-06T18:59:27+00:00 · Latest: 2026-02-06T18:59:27+00:00

Comments: Project Page: https://zju-real.github.io/InftyThink-Plus Code: https://github.com/ZJU-REAL/InftyThink-Plus

Abstract

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.

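Title/Abstract (translated from Chinese)

Title: InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and reasoning that degrades through lost-in-the-middle effects. Iterative reasoning mitigates these problems by periodically summarizing intermediate thoughts, but existing methods rely on supervised learning or fixed heuristics and cannot optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ uses a two-stage training scheme, a supervised cold start followed by trajectory-level reinforcement learning, so that the model learns strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves AIME24 accuracy by 21% and clearly outperforms conventional long chain-of-thought reinforcement learning, while also generalizing better to out-of-distribution benchmarks. In addition, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, showing better reasoning efficiency alongside stronger performance.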

Summary

InftyThink+ is an end-to-end reinforcement learning framework that optimizes the full iterative reasoning trajectory: when to summarize, what to preserve, and how to resume. It trains in two stages, a supervised cold start followed by trajectory-level reinforcement learning. On DeepSeek-R1-Distill-Qwen-1.5B, it improves AIME24 accuracy by 21% and outperforms conventional long chain-of-thought reinforcement learning, while also reducing inference latency and accelerating reinforcement learning training.

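InftyThink+ is an end-to-end reinforcement learning framework that optimizes the iterative reasoning process, addressing the limitations of large reasoning models by periodically summarizing intermediate thoughts. It improves accuracy on AIME24 by 21% and outperforms conventional long chain-of-thought reinforcement learning, while reducing inference latency and accelerating training. Experiments show that InftyThink+ generalizes better to out-of-distribution benchmarks.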
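The iterative reasoning loop described in the InftyThink+ abstract (reason, summarize, resume from the summary) might be sketched as below. This is an illustration of the control flow only, not the authors' implementation: `StepFn`, `iterative_reason`, `toy_step`, and the `max_iters` cap are all hypothetical names introduced here, and the learned model is replaced by a toy stand-in.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: given the question and the summary carried over
# from the previous round, return (updated_summary, final_answer_or_None).
StepFn = Callable[[str, str], Tuple[str, Optional[str]]]


def iterative_reason(
    question: str, step: StepFn, max_iters: int = 8
) -> Tuple[Optional[str], int]:
    """Run rounds of reasoning, carrying only a summary between rounds."""
    summary = ""
    for round_idx in range(1, max_iters + 1):
        # Each round sees the question plus the summary, not the full
        # history, so the per-round context stays bounded.
        summary, answer = step(question, summary)
        if answer is not None:
            return answer, round_idx
    return None, max_iters  # no answer within the iteration budget


def toy_step(question: str, summary: str) -> Tuple[str, Optional[str]]:
    """Toy stand-in for a model: answers once two notes have accumulated."""
    notes = len(summary.split()) if summary else 0
    if notes >= 2:
        return summary, "42"
    return (summary + " note").strip(), None
```

In this sketch the stopping decision and the summary content both come from `step`, which mirrors the abstract's point that the model, not a fixed heuristic, controls iteration boundaries.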

CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Authors: Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang, Zinan Lin, Xuefei Ning, Jiwen Yu, Pengfei Wan, Yu Wang, Xihui Liu

First: 2026-02-06T18:59:24+00:00 · Latest: 2026-02-06T18:59:24+00:00

Comments: Project website: https://karine-huang.github.io/CineScene/

Abstract

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring a dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features implicitly: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model through additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, and their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.

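Title/Abstract (translated from Chinese)

Title: CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly because physical sets must be built. To address this, we propose the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos containing a dynamic subject while preserving scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that uses an implicit 3D-aware scene representation for cinematic video generation. Our main innovation is a novel context conditioning mechanism that injects 3D-aware features implicitly: by encoding scene images into visual representations with VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model through context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further improve the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we build a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, and the corresponding camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handles large camera movements, and generalizes well across diverse environments.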

Summary

CineScene addresses the challenge of generating high-quality cinematic videos with dynamic subjects while preserving scene consistency and following user-specified camera movements. It uses an implicit 3D-aware scene representation and a novel context conditioning mechanism that injects spatial priors into a pretrained text-to-video model. The framework also includes a random-shuffling strategy for training robustness. Experiments demonstrate that CineScene outperforms existing methods in scene-consistent video generation, handling large camera movements and generalizing across various environments.

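CineScene addresses the high cost of live-action shooting by introducing a scene-decoupled cinematic video generation framework. The framework uses an implicit 3D-aware scene representation and a novel context conditioning mechanism to achieve camera-controlled video synthesis with consistent scenes and dynamic subjects. A simple random-shuffling strategy applied during training improves the model's robustness. Experiments show that CineScene outperforms existing methods in scene-consistent cinematic video generation, performing well even under large camera movements and across diverse environments.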
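The random-shuffling strategy for input scene images is described only at a high level in the CineScene abstract. One plausible reading, sketched here as an assumption, is that the order of the scene views is permuted for each training sample so the model cannot rely on a fixed view order. The helper name `shuffle_scene_views` and the use of file names as stand-ins for images are hypothetical.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shuffle_scene_views(views: Sequence[T], rng: random.Random) -> List[T]:
    """Return a randomly ordered copy of the scene views for one sample."""
    shuffled = list(views)  # copy so the caller's sequence is left untouched
    rng.shuffle(shuffled)   # in-place Fisher-Yates shuffle on the copy
    return shuffled


# Hypothetical usage: file names stand in for the encoded scene images.
scene_views = ["view_0.png", "view_1.png", "view_2.png", "view_3.png"]
sample_order = shuffle_scene_views(scene_views, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the permutation reproducible per seed without touching the global random state.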

Improving Credit Card Fraud Detection with an Optimized Explainable Boosting Machine

Authors: Reza E. Fazel, Arash Bakhtiary, Siavash A. Bigdeli

First: 2026-02-06T18:56:17+00:00 · Latest: 2026-02-06T18:56:17+00:00

Comments: 22 pages, 5 figures, 5 tables