arXiv Paper Digest

2025-12-01 03:24
Snapshot: 20251201_0324
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Authors: Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang
First: 2025-11-26T18:59:56+00:00 · Latest: 2025-11-26T18:59:56+00:00
Comments: 24 pages; webpage: https://snap-research.github.io/canvas-to-image/
Abstract
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
中文标题/摘要
标题:画布到图像:多模态控制下的组合图像生成
虽然现代扩散模型在生成高质量和多样化图像方面表现出色,但在高保真组合和多模态控制方面仍然存在困难,尤其是在用户同时指定文本提示、主题参考、空间布局、姿态约束和布局注释时。我们引入了画布到图像,这是一种统一框架,将这些异构控制合并到一个画布界面中,使用户能够生成忠实反映其意图的图像。我们的主要想法是将各种控制信号编码到单个复合画布图像中,该图像可以直接被模型解释以进行综合的空间视觉推理。我们进一步整理了一组多任务数据集,并提出了一种多任务画布训练策略,该策略优化了扩散模型,使其能够在统一的学习框架内同时理解和整合异构控制到文本到图像生成中。这种联合训练使画布到图像能够在多个控制模态之间进行推理,而不是依赖于特定任务的启发式方法,并且在推理时能够很好地泛化到多控制场景。广泛的实验表明,画布到图像在多个具有挑战性的基准测试中,在身份保留和控制一致性方面显著优于最先进的方法,包括多人组合、姿态控制组合、布局约束生成和多控制生成。
Summary / 总结
Canvas-to-Image is a unified framework that integrates various multimodal controls into a single canvas interface for generating high-fidelity images. It encodes diverse control signals into a composite canvas image that the model can interpret for integrated visual-spatial reasoning. Experiments show that Canvas-to-Image outperforms existing methods in identity preservation and control adherence across multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation scenarios.
Canvas-to-Image 解决了现代扩散模型在处理高保真度的组合和多模态控制方面的局限性。它引入了一个统一框架,将各种控制信号整合到单个画布界面中,使用户能够生成反映其意图的图像。该模型通过多任务画布训练策略进行训练,优化其理解和整合多种控制模态的能力。实验表明,Canvas-to-Image 在多个人物组合、姿态控制组合以及布局约束生成等各种基准测试中,在身份保留和控制一致性方面显著优于最先进的方法。
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Authors: Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang
First: 2025-11-26T18:59:55+00:00 · Latest: 2025-11-26T18:59:55+00:00
Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
中文标题/摘要
标题:TraceGen: 3D 轨迹空间中的世界建模使跨体态视频学习成为可能
仅从少量演示中在新平台上学习新机器人任务仍然具有挑战性。虽然存在其他体态(人类和不同机器人)的视频,但由于体态、摄像机和环境的差异,它们难以直接使用。我们通过引入一种统一的符号表示——场景级轨迹的紧凑3D“轨迹空间”来解决小数据问题,从而能够从跨体态、跨环境和跨任务视频中学习。我们提出了TraceGen,这是一种世界模型,它在轨迹空间而非像素空间中预测未来的运动,抽象掉了外观,但保留了用于操作所需的几何结构。为了大规模训练TraceGen,我们开发了TraceForge,一种数据管道,将异构的人类和机器人视频转换为一致的3D轨迹,生成了包含123,000个视频和1,800,000个观察-轨迹-语言三元组的语料库。在该语料库上进行预训练产生了一种可转移的3D运动先验,能够高效适应:仅用五个目标机器人视频,TraceGen在四个任务上达到了80%的成功率,同时比最先进的基于视频的世界模型快50-600倍的推理速度。在更具有挑战性的场景中,仅用五个手持手机拍摄的未校准的人类演示视频,它在真实机器人上仍能达到67.5%的成功率,突显了TraceGen在不依赖物体检测器或大量像素空间生成的情况下跨体态适应的能力。
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Authors: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2025-11-26T18:59:46+00:00 · Latest: 2025-11-26T18:59:46+00:00
Comments: 21 pages, 6 figures
Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
中文标题/摘要
标题:ToolOrchestra:通过高效模型和工具编排提升智能
大型语言模型是强大的通用型工具,但解决人类最后考试(HLE)等深层次和复杂问题仍然既概念上具有挑战性,又在计算上昂贵。我们展示了通过小型编排器管理其他模型和各种工具,既能推动智能的上限,又能提高解决复杂代理任务的效率。我们介绍了ToolOrchestra,一种用于训练小型编排器的方法,这些编排器协调智能工具。ToolOrchestra 明确使用了带有结果意识、效率意识和用户偏好意识的强化学习奖励。使用 ToolOrchestra,我们生成了 Orchestrator,这是一种8B模型,其在较低成本下实现了比之前工具使用代理更高的准确性,同时在给定查询时使用哪种工具方面与用户偏好保持一致。在HLE上,Orchestrator 达到了37.1%的得分,优于GPT-5(35.1%),且效率提高了2.5倍。在tau2-Bench和FRAMES上,Orchestrator 仅使用约30%的成本就大幅超过了GPT-5。广泛的分析表明,Orchestrator 在多个指标下实现了性能和成本的最佳权衡,并且能够稳健地泛化到未见过的工具。这些结果表明,使用轻量级编排模型组合多种工具比现有方法更高效且更有效,为实用和可扩展的工具增强推理系统铺平了道路。
Summary / 总结
ToolOrchestra is a method for training small orchestrators that coordinate other models and tools to solve complex agentic tasks more accurately and at lower cost. Using reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards, it produces Orchestrator, an 8B model that scores 37.1% on Humanity's Last Exam, ahead of GPT-5 (35.1%), while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator also surpasses GPT-5 at about 30% of the cost, and it generalizes robustly to unseen tools.
ToolOrchestra是一种训练小型协调器的方法,用于管理其他模型和工具,以提高解决复杂任务的效率和性能。通过特定奖励的强化学习,ToolOrchestra生成了Orchestrator,一个8B模型,它在人类最后考试中比GPT-5表现更好,具有更高的准确性和更低的成本。在其他基准测试中,Orchestrator也显示出显著的改进,同时使用更少的资源,表明在性能和成本之间有更好的权衡,并且能够稳健地适应新工具。
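The abstract names three reward ingredients (outcome, efficiency, user preference) but does not give a formula. The sketch below is one hypothetical way to fold them into a single scalar for RL; the function name, the weights, and the linear form are all assumptions made for illustration, not the authors' reward.

```python
def orchestration_reward(correct, cost, used_tools, preferred_tools,
                         cost_weight=0.1, preference_weight=0.2):
    """Hypothetical scalar reward: outcome term + efficiency term + preference term."""
    # Outcome: 1 if the final answer is right, 0 otherwise.
    outcome = 1.0 if correct else 0.0
    # Efficiency: penalize total cost (assumed to be in arbitrary cost units).
    efficiency = -cost_weight * cost
    # Preference: fraction of tool calls that hit the user's preferred tools.
    if used_tools:
        hits = sum(tool in preferred_tools for tool in used_tools)
        preference = preference_weight * hits / len(used_tools)
    else:
        preference = 0.0
    return outcome + efficiency + preference


# Two correct trajectories: a cheap one using the preferred tool,
# and an expensive one using a tool the user did not ask for.
cheap = orchestration_reward(True, 1.0, ["search"], {"search"})
costly = orchestration_reward(True, 5.0, ["big_llm"], {"search"})
print(cheap > costly)
```

Under these assumed weights the cheap, preference-aligned trajectory earns the higher reward even though both answers are correct, which is the kind of trade-off the abstract describes.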
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Authors: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
First: 2025-11-26T18:59:39+00:00 · Latest: 2025-11-26T18:59:39+00:00
Comments: Code is released at https://github.com/InternRobotics/G2VLM
Abstract
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
中文标题/摘要
标题:G$^2$VLM: 以几何为基础的视觉语言模型,结合统一的3D重建和空间推理
视觉-语言模型(VLMs)在空间智能方面仍然缺乏稳健性,表现出在空间理解和推理任务上的较差性能。我们将其差距归因于缺乏一种能够从2D图像重建3D空间的视觉几何学习过程。我们提出了G$^2$VLM,这是一种以几何为基础的视觉语言模型,连接了空间智能的两个基本方面:空间3D重建和空间理解。G$^2$VLM 本源地利用学习到的3D视觉几何特征,直接预测3D属性,并通过上下文学习和交错推理增强空间推理任务。我们统一的设计对空间理解具有高度可扩展性:它在丰富的多视角图像和视频数据上进行训练,同时利用通常仅从难以收集的注释中提取的3D视觉先验的好处。实验结果表明,G$^2$VLM 在两个任务上都表现出色,其3D重建结果与最先进的前馈3D重建模型相当,并在空间理解和推理任务上取得了更好的或具有竞争力的结果。通过将语义强大的VLM与低级3D视觉任务统一起来,我们希望G$^2$VLM 能够为社区提供一个强大的基线,并解锁更多未来的应用,如3D场景编辑。
Summary / 总结
The research aims to improve the spatial intelligence of Vision-Language Models (VLMs) by addressing their poor performance in spatial understanding and reasoning. G$^2$VLM is introduced as a geometry-grounded vision-language model that integrates 3D reconstruction and spatial reasoning. It leverages learned 3D visual geometry features to predict 3D attributes and enhance spatial reasoning through in-context learning and interleaved reasoning. Experimental results show that G$^2$VLM performs comparably to state-of-the-art 3D reconstruction models and outperforms or matches other models in spatial understanding and reasoning tasks.
研究旨在通过解决视觉语言模型(VLMs)在空间理解与推理方面的不足,提升其空间智能。G$^2$VLM 是一种结合 3D 重建和空间推理的几何导向模型,通过上下文学习和交错推理利用 3D 视觉几何特征预测 3D 属性。实验结果显示,G$^2$VLM 在 3D 重建和空间理解任务上表现良好,优于或与最先进的模型在各种空间推理任务上相当。
Seeing without Pixels: Perception from Camera Trajectories
Authors: Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
First: 2025-11-26T18:57:01+00:00 · Latest: 2025-11-26T18:57:01+00:00
Comments: Project website: https://sites.google.com/view/seeing-without-pixels
Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
中文标题/摘要
标题:无需像素:从相机轨迹感知视频内容
是否可以在不看到像素的情况下感知视频的内容,仅通过相机轨迹——它在空间中划过的路径?本文首次系统地探讨了这一看似不可能的问题。为此,我们提出了一种对比学习框架,用于训练CamFormer,这是一种专门的编码器,将相机姿态轨迹投影到联合嵌入空间中,并与自然语言对齐。我们发现,与它的显而易见的简单性相反,相机轨迹是一个非常有信息量的信号,可以揭示视频内容。换句话说,“你如何移动”确实可以揭示“你在做什么”(第一人称视角)或“你在观察什么”(第三人称视角)。我们展示了我们学习到的CamFormer嵌入在一系列下游任务上的通用性,从跨模态对齐到分类和时间分析。重要的是,我们的表示在各种相机姿态估计方法中具有鲁棒性,包括高保真的多传感器和标准的RGB-only估计器。我们的研究结果确立了相机轨迹作为一种轻量级、鲁棒且通用的模态,用于感知视频内容。
Summary / 总结
This paper explores the feasibility of inferring video content from camera trajectories alone, without relying on pixel data. It introduces a contrastive learning framework to train CamFormer, which projects camera pose trajectories into a joint embedding space aligned with natural language. The study reveals that camera trajectories carry substantial information about video content, enabling tasks such as cross-modal alignment, classification, and temporal analysis. The learned representations are robust across various camera pose estimation methods, demonstrating the modality's versatility and reliability.
该研究探讨了仅通过相机轨迹而非像素数据来推断视频内容的可能性。它提出了一种对比学习框架来训练CamFormer,将相机姿态轨迹投影到与自然语言对齐的联合嵌入空间中。研究发现,相机轨迹携带着关于视频内容的重要信息,能够完成跨模态对齐、分类和时间分析等任务。所学的表示在各种相机姿态估计方法下表现出鲁棒性,展示了该模态的多样性和可靠性。
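The abstract says CamFormer is trained contrastively to align trajectory embeddings with natural language, without stating the exact loss. A common choice for this kind of alignment is a symmetric InfoNCE objective, sketched below in plain Python. The choice of InfoNCE, the temperature value, and the toy 2-D embeddings are assumptions; a real encoder would produce the embeddings.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows i, i.e. row i should match column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(traj_emb, text_emb, temperature=0.07):
    """Average of trajectory-to-text and text-to-trajectory InfoNCE losses."""
    traj = [normalize(v) for v in traj_emb]
    text = [normalize(v) for v in text_emb]
    logits = [[dot(a, b) / temperature for b in text] for a in traj]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: pair i of trajectories matches pair i of captions when aligned.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each trajectory embedding points the same way as its own caption and large when the pairs are swapped, which is the signal that pulls matching pairs together in the joint space.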
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Authors: Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
First: 2025-11-26T18:55:08+00:00 · Latest: 2025-11-26T18:55:08+00:00
Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
中文标题/摘要
标题:具有生长与精炼多模态语义记忆的能动学习者
MLLMs 在独立查询上表现出强大的推理能力,但它们独立工作——每次解决问题时都独立进行,经常重复同样的错误。现有的记忆增强代理主要存储过去的轨迹以便重用。然而,基于轨迹的记忆会受到简短偏差的影响,逐渐失去重要的领域知识。更严重的是,即使在真正的多模态问题解决环境中,它也只能记录过去行为的单一模态痕迹,未能保存视觉注意力和逻辑推理如何共同贡献于解决方案。这与人类认知从根本上不一致:语义记忆既是多模态的又是整合的,通过协调但独立的表示流保存视觉和抽象知识。因此,我们引入了ViLoMem,这是一种双流记忆框架,构建紧凑的基于模式的记忆。它分别编码视觉干扰模式和逻辑推理错误,使MLLMs能够从成功和失败的经验中学习。遵循生长与精炼的原则,系统逐步积累和更新多模态语义知识——保持稳定、可泛化的策略,同时避免灾难性遗忘。在六个多模态基准测试中,ViLoMem 一致地提高了 pass@1 准确率,并显著减少了重复的视觉和逻辑错误。消融实验确认了双流记忆的必要性,特别是明确的干扰-幻觉分离,展示了错误感知的多模态记忆对于终身和跨域能动学习的价值。我们的项目页面将在 https://weihao-bo.github.io/ViLoMeo-page/ 上提供。
Summary / 总结
This paper addresses the limitations of existing memory-augmented agents in multimodal problem-solving by introducing ViLoMem, a dual-stream memory framework. ViLoMem separately encodes visual distraction patterns and logical reasoning errors, allowing MLLMs to learn from both successful and failed experiences. The system incrementally accumulates and updates multimodal semantic knowledge, improving pass@1 accuracy and reducing repeated errors across six benchmarks. Ablation studies confirm the necessity of dual-stream memory for error-aware multimodal learning.
论文针对现有记忆增强代理独立解决问题且易受简短偏差影响,逐渐失去关键领域知识的问题。提出了ViLoMem,这是一种双流记忆框架,分别编码视觉干扰模式和逻辑推理错误,使MLLMs能够从成功和失败的经验中学习。在六个跨模态基准测试中,实验显示一致提高了pass@1准确率,并显著减少了重复的视觉和逻辑错误。
Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models
Authors: Pandiyaraju V, Sreya Mynampati, Abishek Karthik, Poovarasan L, D. Saraswathi
First: 2025-11-26T18:51:46+00:00 · Latest: 2025-11-26T18:51:46+00:00
Abstract
Gliomas are brain tumors with a high mortality rate, so early and accurate diagnosis is important for therapeutic intervention. To address this difficulty, the proposed research develops a hybrid deep learning model that integrates U-Net-based segmentation with a hybrid DenseNet-VGG classification network equipped with multi-head attention and spatial-channel attention. The segmentation model demarcates tumors in a 3D volume of MRI data, guided by spatial and contextual information. The classification network, which combines a DenseNet branch and a VGG branch, takes the demarcated tumor as input and uses attention mechanisms to focus on clinically relevant features. High-dimensional 3D MRI data is prepared for the model through preprocessing steps: normalization, resampling, and data augmentation. The framework is evaluated with several measures: Dice coefficient and Mean Intersection over Union (IoU) for segmentation, and accuracy, precision, recall, and F1-score for classification. The proposed hybrid framework has demonstrated through testing that it can obtain a Dice coefficient of 98% in tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods. The multi-head attention mechanisms prioritize clinically significant aspects of the tumor, improving interpretability and accuracy. The results suggest that the framework has great potential to help clinicians diagnose and grade gliomas in a timely and reliable way, allowing for better planning of patient treatment.
中文标题/摘要
标题:利用3D MRI引导的混合深度学习模型革新胶质瘤分割与分级
胶质瘤是具有高死亡率的脑肿瘤类型,早期和准确的诊断对于肿瘤的治疗干预至关重要。为解决这一难题,本研究将开发一种混合深度学习模型,该模型结合了基于U-Net的分割和具有多头注意力和空间-通道注意力能力的混合DenseNet-VGG分类网络。分割模型将在3D MRI数据的体积中,根据空间和上下文信息精确勾勒出肿瘤。结合DenseNet和VGG的分类网络将关注已勾勒出的肿瘤上的特征,并通过注意力机制聚焦于临床相关特征。通过预处理步骤,如归一化、重采样和数据增强,高维3D MRI数据可以成功地用于模型中。通过多种措施评估框架:分割性能的度量包括Dice系数和平均交并比(IoU),分类性能的度量包括准确率、精确率、召回率和F1分数。所提出的混合框架通过物理测试表明,其在肿瘤分割中的Dice系数可达98%,分类准确率可达99%,优于传统CNN模型和无注意力机制的方法。多头注意力机制增强了在肿瘤临床显著方面的重要性的概念,提高了可解释性和准确性。结果表明,该框架在协助临床医生进行胶质瘤的及时和可靠诊断与分级方面具有巨大潜力,有助于更好地规划患者的治疗。
Summary / 总结
This research aims to improve the early and accurate diagnosis of gliomas, which have a high mortality rate. It proposes a hybrid deep learning model combining U-Net for segmentation and a DenseNet-VGG network with attention mechanisms for classification. The model processes 3D MRI data through normalization, resampling, and augmentation. The framework achieves a Dice coefficient of 98% in tumor segmentation and 99% in classification accuracy, outperforming traditional CNN models and attention-free methods. Multi-head attention mechanisms enhance the model's interpretability and accuracy, making it a promising tool for clinicians to diagnose and grade gliomas more reliably.
研究旨在通过提高胶质瘤的早期和准确诊断来改善高死亡率的胶质瘤治疗。提出了一种结合U-Net进行分割和DenseNet-VGG网络进行分类的混合深度学习模型,该模型包含注意力机制。模型在肿瘤分割中的Dice系数达到98%,分类准确率达到99%,优于传统CNN模型和无注意力机制的方法。使用多头注意力机制增强了模型的可解释性和准确性,使其在胶质瘤的及时和可靠诊断和分级方面具有很大的潜力。
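For readers unfamiliar with the two segmentation metrics reported above, here is a minimal sketch of how the Dice coefficient and IoU are computed for one binary mask. The masks are made-up toy data, and treating two empty masks as perfect agreement is a convention assumed here. "Mean IoU" in the abstract would average the IoU over classes or cases; this sketch covers a single class.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equal-length binary masks (flattened voxels)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_size = sum(1 for p in pred if p)
    target_size = sum(1 for t in target if t)
    union = pred_size + target_size - intersection
    if union == 0:
        # Both masks are empty: count this as perfect agreement (assumed convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_size + target_size)
    iou = intersection / union
    return dice, iou


# Toy example: 3 predicted tumor voxels, 3 true tumor voxels, 2 in common.
pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

The two metrics are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU for the same pair of masks.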
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Authors: Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang
First: 2025-11-26T18:47:25+00:00 · Latest: 2025-11-26T18:47:25+00:00
Abstract
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
中文标题/摘要
标题:DSD:边缘-云敏捷大模型服务的分布式推测解码解决方案
大型语言模型(LLM)推理经常遭受高解码延迟和跨异构边缘-云环境的有限扩展性。现有推测解码(SD)技术加速了标记生成,但仍然局限于单节点执行。我们提出DSD,这是一种分布式推测解码框架,通过协调草案目标执行将SD扩展到多设备部署。鉴于缺乏对这种范式的模拟研究,我们首先引入了DSD-Sim,这是一种离散事件模拟器,能够捕捉网络、批处理和调度动态。基于DSD-Sim的见解,我们进一步设计了自适应窗口控制(AWC)策略,动态调整推测窗口大小以优化吞吐量。在不同工作负载下的实验表明,DSD在现有SD基线上的速度提高了1.1倍,吞吐量提高了9.7%,从而实现了边缘和云上的敏捷和可扩展的LLM服务。
Summary / 总结
The research aims to address the high decoding latency and scalability issues of large language model (LLM) inference in edge-cloud environments. DSD, a distributed speculative decoding framework, is proposed to extend speculative decoding techniques to multi-device deployments. By using a discrete-event simulator, DSD-Sim, and an Adaptive Window Control (AWC) policy, DSD achieves up to 1.1x speedup and 9.7% higher throughput compared to existing speculative decoding baselines, enabling agile and scalable LLM serving across edge and cloud environments.
研究旨在解决大型语言模型(LLM)推理在边缘-云环境中的高解码延迟和可扩展性问题。提出了分布式推测性解码框架DSD,将推测性解码技术扩展到多设备部署。通过使用离散事件模拟器DSD-Sim和自适应窗口控制(AWC)策略,DSD实现了比现有推测性解码基线高达1.1倍的加速和9.7%更高的吞吐量,从而实现边缘和云环境中的敏捷和可扩展的LLM服务。
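The abstract describes Adaptive Window Control only as a policy that dynamically adjusts the speculation window size to optimize throughput. The sketch below shows one simple rule of that general kind, driven by how many drafted tokens the target model accepted in the previous round. The thresholds, bounds, and function names are hypothetical and are not the paper's AWC policy.

```python
def next_window(window, accepted, min_window=1, max_window=16):
    """Hypothetical rule: choose the next speculation window from the last acceptance rate."""
    if not 0 <= accepted <= window:
        raise ValueError("accepted must be between 0 and window")
    rate = accepted / window
    if rate >= 0.8:
        # The draft model is mostly right: speculate one token further ahead.
        return min(max_window, window + 1)
    if rate < 0.5:
        # The draft model is often wrong: halve the window to waste less verification work.
        return max(min_window, window // 2)
    return window


def trace_windows(start, accepted_counts):
    """Apply next_window round by round and return every window size used."""
    windows = [start]
    for accepted in accepted_counts:
        windows.append(next_window(windows[-1], accepted))
    return windows


# Two fully accepted rounds, then two rounds with many rejections.
windows = trace_windows(4, [4, 5, 2, 1])
print(windows)
```

In an edge-cloud setting the same idea could also take network delay into account, since each verification round costs a round trip; that extension is left out of this sketch.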
Escaping the Verifier: Learning to Reason via Demonstrations
Authors: Locke Cai, Ivan Provilkov
First: 2025-11-26T18:42:52+00:00 · Latest: 2025-11-26T18:42:52+00:00
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
中文标题/摘要
标题:超越验证者:通过示范学习推理
训练大型语言模型(LLMs)进行推理通常依赖于特定任务的强化学习(RL)和验证器。然而,许多实际的推理密集型任务缺乏验证器,尽管这些任务提供了大量的专家示范,这些示范在推理导向的训练中尚未充分利用。我们引入了RARO(相对对抗推理优化),该方法通过逆向强化学习仅从专家示范中学习强大的推理能力。我们的方法设置了一个对抗交互,其中策略(生成器)和相对批评者(判别器)之间进行对抗:策略学习模仿专家答案,而批评者学习比较和区分策略和专家答案。我们的方法通过RL联合和连续训练策略和批评者,并确定了实现稳健学习所需的关键稳定化技术。实验结果表明,RARO在我们的所有评估任务——Countdown、DeepMath和诗歌创作——中显著优于强大的无验证器基线,并且在验证任务上具有相同的稳健扩展趋势。这些结果表明,我们的方法能够仅从专家示范中有效激发强大的推理性能,即使在特定任务验证器不可用时也能实现稳健的推理学习。
Summary / 总结
The research addresses the challenge of training large language models to reason when task-specific verifiers are unavailable, as is common for real-world tasks. The method, RARO, uses Inverse Reinforcement Learning to learn from expert demonstrations alone. It sets up an adversarial interaction in which a policy (generator) learns to mimic expert answers while a relativistic critic (discriminator) learns to compare and distinguish policy answers from expert answers; both are trained jointly via RL. Experiments show that RARO significantly outperforms strong verifier-free baselines on all three evaluation tasks (Countdown, DeepMath, and Poetry Writing) and shows the same robust scaling trends as RL on verifiable tasks.
研究旨在通过逆强化学习增强大型语言模型(LLMs)的推理能力,而不依赖于特定任务的验证器,这些验证器在许多实际任务中往往不可用。研究引入了RARO方法,通过模仿专家演示来训练模型。该方法涉及政策(生成器)和相对主义批评者(鉴别器)之间的对抗交互,两者通过强化学习联合训练。关键发现表明,RARO在所有评估任务中显著优于无验证器基线,并且表现出与验证任务中RL相同的稳健扩展趋势,表明仅从专家演示中可以有效获得强大的推理性能。
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
Authors: Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang
First: 2025-11-26T18:37:54+00:00 · Latest: 2025-11-26T18:37:54+00:00
Abstract
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
中文标题/摘要
标题:注意力引导的视觉-语言-动作模型分块稀疏对抗攻击
近年来,嵌入式智能中的视觉-语言-动作(VLA)模型发展迅速。然而,现有的对抗攻击方法需要昂贵的端到端训练,并且通常会产生明显的扰动块。为了解决这些限制,我们提出了ADVLA框架,该框架直接在视觉编码器投影到文本特征空间的特征上应用对抗扰动。ADVLA在低振幅约束下有效地破坏了下游动作预测,并且注意力引导使得扰动既集中又稀疏。我们引入了三种策略以增强敏感性、强制稀疏性和集中扰动。实验表明,在$L_{\infty}=4/255$约束下,ADVLA结合Top-K掩码修改的块少于10%,而攻击成功率接近100%。扰动集中在关键区域,几乎不会在整体图像中被察觉,单步迭代仅需约0.06秒,显著优于传统的块基攻击。总之,ADVLA在低振幅和局部稀疏条件下有效地削弱了VLA模型的下游动作预测,避免了传统块攻击的高训练成本和明显扰动,并且在攻击VLA特征空间方面展示了独特的有效性和实用性。
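The two constraints quoted in the abstract, an $L_{\infty}$ budget of 4/255 and Top-K masking that touches only a small share of patches, can be illustrated with a generic sign-gradient step restricted to the highest-scoring patches. This is not ADVLA itself: the attention scores, gradients, and patch values below are made-up inputs, and in the paper the loss is defined on features projected into the textual feature space, which this sketch does not model.

```python
EPS = 4 / 255  # L-infinity budget quoted in the abstract


def top_k_mask(scores, k):
    """Boolean mask selecting the k patches with the highest (attention) scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def sign(x):
    return (x > 0) - (x < 0)


def sparse_sign_step(patches, grads, scores, k, eps=EPS):
    """Apply one eps-sized sign-gradient step to the top-k patches only."""
    mask = top_k_mask(scores, k)
    out = []
    for patch, grad, selected in zip(patches, grads, mask):
        if selected:
            # Shift each pixel by at most eps, then clip to the valid range [0, 1].
            out.append([min(1.0, max(0.0, p + eps * sign(g)))
                        for p, g in zip(patch, grad)])
        else:
            out.append(list(patch))  # unselected patches stay untouched
    return out


# Four toy patches of two pixels each; scores pick patches 1 and 2.
patches = [[0.5, 0.5], [0.2, 0.9], [0.6, 0.4], [0.3, 0.3]]
grads = [[1, -1], [1, 1], [1, -1], [-1, 1]]
scores = [0.1, 0.7, 0.9, 0.2]
adv = sparse_sign_step(patches, grads, scores, k=2)
changed = sum(a != p for a, p in zip(adv, patches))
print(changed)
```

Because every modified pixel moves by at most `EPS` and only `k` patches are modified, the perturbation stays within the $L_{\infty}$ budget and is sparse by construction.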
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Authors: Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
First: 2025-11-26T18:35:17+00:00 · Latest: 2025-11-26T18:35:17+00:00
Abstract
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
中文标题/摘要
标题:多标准:在多元标准遵循能力上的多模态评判员基准测试
大型多模态模型(LMMs)因其强大的指令遵循能力和与人类偏好的一致性,越来越多地被用作多模态评估系统中的评判员。然而,它们在遵循多样且细致的评估标准方面的能力仍然未被充分探索。我们开发了多标准(Multi-Crit),这是一个用于评估多模态评判员在遵循多元标准和产生可靠标准级别判断方面能力的基准测试。该基准测试涵盖了开放生成和可验证推理任务,通过严格的数据整理管道构建,收集了具有多标准人类注释的具有挑战性的响应对。此外,它还引入了三个新的度量标准,以系统地评估多元标准的遵守情况、标准切换的灵活性以及识别标准级别偏好冲突的能力。对25个LMMs的全面分析揭示了以下几点:1)专有模型在开放生成评估中仍然难以保持一致的多元标准遵守;2)开源模型在灵活遵循多种标准方面落后更多;3)使用整体判断信号进行批评微调可以增强视觉定位,但无法推广到多元标准级别判断。对推理微调、测试时缩放以及开源和专有模型之间边界一致性进行的额外分析进一步探索了当前多模态评判员的局限性。作为一项开创性研究,多标准为构建可靠和可控的多模态AI评估奠定了基础。
Summary / 总结
The study aims to evaluate the ability of large multimodal models (LMMs) to follow diverse evaluation criteria, developing Multi-Crit as a benchmark. It includes both open-ended generation and verifiable reasoning tasks, with human annotations for multiple criteria. The research finds that proprietary models struggle with consistent adherence to pluralistic criteria, especially in open-ended tasks, while open-source models perform even worse in flexibly following diverse criteria. Critic fine-tuning with holistic judgment signals improves visual grounding but does not generalize to pluralistic criterion-level judgments. The study also explores reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models, highlighting the current limitations of multimodal judges.
研究开发了Multi-Crit基准,以评估多模态评判者遵循多种评价标准的能力。使用严格的数据收集流程评估了25个大型多模态模型,并引入了新的评估指标。主要发现包括:专有模型在开放性任务中难以保持对多元标准的一致性遵从,而开源模型在遵循多种标准方面表现更差。批评微调可以改善视觉定位,但无法推广到多元标准级别的判断。研究强调了构建更可靠和可控的多模态AI评估系统的需求。
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
First: 2025-11-20T17:48:21+00:00 · Latest: 2025-11-26T18:30:04+00:00
Comments: Project page: https://xuboshen.github.io/TimeViper; Code: https://github.com/xiaomi-research/timeviper
Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
中文标题/摘要
标题:TimeViper:一种混合Mamba-Transformer视觉-语言模型,用于高效理解长视频
我们介绍了TimeViper,一种混合视觉-语言模型,旨在解决长视频理解的挑战。处理长视频需要高效的模型架构和有效的机制来处理扩展的时间上下文。为此,TimeViper采用了一种混合Mamba-Transformer骨干,结合了状态空间模型的高效性和注意力机制的表达能力。通过这种混合设计,我们揭示了视觉到文本信息聚合的现象,其中信息随着LLM深度增加,逐渐从视觉标记流向文本标记,导致视觉标记冗余严重。受此观察的启发,我们提出了TransV,一种标记信息传输模块,将视觉标记转换并压缩为指令标记,同时保持多模态理解能力。这种设计使TimeViper能够处理超过10,000帧的长达一小时的视频。在多个基准上的广泛实验表明,TimeViper在与最先进的模型竞争的同时,扩展了帧数。我们进一步分析了Mamba和Transformer层的注意力行为,提供了关于混合模型可解释性的新见解。这项工作代表了开发、解释和压缩混合Mamba-Transformer架构的初步步骤。
Summary / 总结
TimeViper is a hybrid Mamba-Transformer model designed for efficient long video understanding. It combines the efficiency of state-space models with the expressivity of attention mechanisms. The authors observe a vision-to-text information aggregation phenomenon, in which information flows from vision tokens to text tokens as LLM depth increases, leaving the vision tokens largely redundant. To exploit this, TimeViper introduces TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding. Experiments show that TimeViper can process hour-long videos exceeding 10,000 frames and is competitive with state-of-the-art models on multiple benchmarks while extending the number of input frames.
TimeViper 是一种结合了状态空间模型效率和注意力机制表达性的混合 Mamba-Transformer 模型,旨在高效处理长视频理解任务。该模型揭示了视觉到文本信息聚合的现象,并提出了一种 Token 信息转移模块 TransV,该模块将视觉 Token 压缩为指令 Token 同时保持多模态理解能力。实验表明,TimeViper 可以处理超过 10,000 帧的小时级长视频,并在多个基准测试中与最先进的模型具有竞争力。
EvilGenie: A Reward Hacking Benchmark
Authors: Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
First: 2025-11-26T18:27:17+00:00 · Latest: 2025-11-26T18:27:17+00:00
Abstract
We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held-out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held-out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
中文标题/摘要
标题:EvilGenie:奖励劫持基准
我们介绍了EvilGenie,一个用于编程环境中的奖励劫持基准。我们从LiveCodeBench获取问题,并创建了一个环境,使代理可以轻松地进行奖励劫持,例如通过硬编码测试案例或编辑测试文件。我们通过三种方式衡量奖励劫持:保留的单元测试、LLM裁判和测试文件编辑检测。我们通过人工审查以及方法之间的相互比对验证了这些方法。我们发现LLM裁判在明确的案例中能非常有效地检测奖励劫持,并观察到保留的测试案例的使用仅带来微小的改进。除了使用Inspect的basic_agent框架测试许多模型外,我们还测量了三种流行的专有编码代理的奖励劫持率:OpenAI的Codex、Anthropic的Claude Code和Google的Gemini CLI,分别使用GPT-5、Claude Sonnet 4和Gemini 2.5 Pro。我们观察到Codex和Claude Code存在明确的奖励劫持行为,而所有三个代理都表现出未对齐的行为。我们的代码库可以在https://github.com/JonathanGabor/EvilGenie找到。
CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
Authors: Ruisheng Han, Kanglei Zhou, Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum
First: 2025-11-26T18:25:41+00:00 · Latest: 2025-11-26T18:25:41+00:00
Abstract
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow
中文标题/摘要
标题:CaFlow:通过因果反事实流增强长期动作质量评估
动作质量评估(AQA)从动作视频中预测细粒度执行评分,并广泛应用于体育、康复和技能评估。长期AQA,如花样滑冰或艺术体操,尤其具有挑战性,因为它需要建模长期时间动态,同时保持对上下文混杂因素的鲁棒性。现有方法要么依赖昂贵的注释,要么依赖单向时间建模,使其容易受到虚假相关性和长期表示不稳定的影响。为此,我们提出CaFlow,这是一种统一框架,将反事实去混杂与双向时间条件流相结合。因果反事实正则化(CCR)模块以自监督方式分离因果和混杂特征,并通过反事实干预增强因果稳健性,而BiT-Flow模块在循环一致性约束下建模前向和后向动力学,以生成更平滑和更连贯的表示。在多个长期AQA基准上的广泛实验表明,CaFlow达到了最先进的性能。代码可在https://github.com/Harrison21/CaFlow 获取
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Authors: Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu
First: 2025-11-26T18:12:16+00:00 · Latest: 2025-11-26T18:12:16+00:00
Comments: 12 pages, 2 figures
Abstract
Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
中文标题/摘要
标题:使用迭代PPO使大语言模型朝多轮对话目标对齐
优化大型语言模型(LLMs)以实现多轮对话结果仍然是一个重大挑战,尤其是在像AI营销或通过消息平台促进交易的销售代理这样的目标导向环境中。这一挑战源于稀疏的、长期的奖励以及响应级规划与标记级生成之间的不一致。在本技术说明中,我们提出了一种将多轮强化学习问题形式化地简化为一系列单轮RLHF风格问题的方法。这通过设置一个学习到的多轮Q函数作为单轮问题的奖励模型来实现。我们展示了并证明了一个关键见解:使用标准标记级PPO解决这个单轮RL问题等同于多轮问题中的策略改进步骤。这一见解自然地引出了迭代PPO,这是一种批在线策略迭代算法,交替进行从记录的对话轨迹中拟合Q函数和改进策略。一个主要的实际优势是,迭代PPO直接利用了稳定的单轮RLHF工具,使其易于实现。我们的方法介于完全在线和完全离线方法之间,保留了在线更新的适应性,同时获得了离线训练的稳定性优势。
Summary / 总结
This paper addresses the challenge of optimizing large language models for multi-turn conversational outcomes, particularly in goal-oriented settings. It proposes Iterative PPO, a method that reduces the multi-turn RL problem into a series of single-turn RLHF-style problems. By using a learned multi-turn Q-function as the reward model, Iterative PPO alternates between fitting Q-functions from logged conversation trajectories and improving the policy. Key findings include the equivalence of solving the single-turn RL problem with standard token-level PPO to a policy improvement step in the multi-turn problem, and the practical advantage of directly leveraging stable, off-the-shelf single-turn RLHF tools, making implementation straightforward.
本文解决了在目标导向设置中优化大型语言模型进行多轮对话结果的挑战。提出了一种Iterative PPO方法,将多轮RL问题简化为一系列单轮RLHF问题。通过使用学习到的多轮Q函数作为奖励模型,Iterative PPO在拟合Q函数和改进策略之间交替进行,实现了稳定且适应性强的更新。关键发现是,这种方法等同于多轮问题中的策略改进步骤,使得使用现有单轮RLHF工具实现起来更为简便。
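The alternation described in this entry (fit a multi-turn Q-function from logged conversations, then solve a single-turn problem that uses Q as its reward) can be illustrated with a small tabular toy. This is only a sketch under loud assumptions, not the authors' implementation: the three-turn environment, the names `collect`, `fit_q`, and `improve`, and the value of `beta` are all hypothetical, and the single-turn step uses the closed-form solution of a KL-regularized objective in place of token-level PPO.

```python
import numpy as np
from collections import defaultdict

H = 3  # turns per conversation (toy assumption)
STATES = [(t, flag) for t in range(H) for flag in (True, False)]

def collect(policy, n_episodes, rng):
    """Log (trajectory, final_return) pairs under the current policy."""
    logs = []
    for _ in range(n_episodes):
        on_track, traj = True, []
        for t in range(H):
            state = (t, on_track)
            action = int(rng.choice(2, p=policy[state]))
            traj.append((state, action))
            on_track = on_track and action == 1
        # Sparse reward: 1 only if every turn chose action 1.
        logs.append((traj, 1.0 if on_track else 0.0))
    return logs

def fit_q(logs):
    """Monte Carlo estimate of the multi-turn Q-function from logged data."""
    total = defaultdict(float)
    count = defaultdict(int)
    for traj, ret in logs:
        for state, action in traj:
            total[(state, action)] += ret
            count[(state, action)] += 1
    q = {s: np.zeros(2) for s in STATES}  # unseen pairs default to 0
    for (state, action), c in count.items():
        q[state][action] = total[(state, action)] / c
    return q

def improve(policy, q, beta=0.2):
    """Single-turn KL-regularized step with Q as the reward.

    Closed form (assumption: stands in for PPO): new ∝ old * exp(Q / beta).
    """
    new_policy = {}
    for s in STATES:
        weights = policy[s] * np.exp(q[s] / beta)
        new_policy[s] = weights / weights.sum()
    return new_policy

def iterative_policy_improvement(n_iters=5, n_episodes=2000, seed=0):
    """Alternate between logging data, fitting Q, and improving the policy."""
    rng = np.random.default_rng(seed)
    policy = {s: np.array([0.5, 0.5]) for s in STATES}
    for _ in range(n_iters):
        logs = collect(policy, n_episodes, rng)
        policy = improve(policy, fit_q(logs))
    return policy
```

In this toy, the policy at the first turn shifts toward the action that keeps the conversation on track, even though the reward only arrives at the end of the dialogue.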
Bridging the Unavoidable A Priori: A Framework for Comparative Causal Modeling
Authors: Peter S. Hovmand, Kari O'Donnell, Callie Ogland-Hand, Brian Biroscak, Douglas D. Gunzler
First: 2025-11-26T18:08:20+00:00 · Latest: 2025-11-26T18:08:20+00:00
Comments: Presented at 43rd Conference of the International System Dynamics Society in Boston, United States
Abstract
AI/ML models have rapidly gained prominence both as innovations for solving previously unsolved problems and for their unintended consequences in amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadows' "the unavoidable a priori"). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.
中文标题/摘要
标题:跨越不可避免的先验:比较因果建模的框架
AI/ML模型迅速受到关注,既因为它们是解决先前未解决问题的创新方法,也因为它们放大人类偏见所带来的意外后果。负责任的AI/ML倡导者寻求利用系统动力学中更丰富的因果模型来更好地指导负责任的AI/ML开发。然而,推进这项工作的主要障碍是难以将基于不同基本假设的方法(即Dana Meadows的“不可避免的先验”)结合起来。本文将系统动力学和结构方程建模结合到一个共同的数学框架中,可用于从分布生成系统、开发方法并比较结果,以指导系统动力学的数据科学和AI/ML应用的基本认识论。
The Impossibility of Inverse Permutation Learning in Transformer Models
Authors: Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah
First: 2025-09-28T23:48:11+00:00 · Latest: 2025-11-26T18:02:39+00:00
Abstract
In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original ("canonical") string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with "scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate "thinking" tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).
中文标题/摘要
标题:变换器模型中逆排列学习的不可能性
在本技术注记中,我们研究了仅解码器变换器中的逆排列学习问题。给定一个排列和该排列已应用于的字符串,模型的任务是生成原始(“标准”)字符串。我们认为,此任务模拟了各种推理任务中的一种自然鲁棒性属性,包括长上下文检索、多项选择问答和上下文学习。我们的主要贡献是一个不可能性结果:我们证明了任意深度的仅解码器变换器无法学习此任务。此结果关注仅解码器变换器模型的表达能力,与训练动力学或样本复杂性无关。我们给出了两种替代构造,在这两种构造下逆排列学习是可行的。第一个构造突显了因果注意力掩码的基本作用,并揭示了编码器-解码器变换器与更流行的仅解码器架构在表达能力上的差距。后一个结果更为令人惊讶:我们证明,简单地用“草稿标记”填充输入,即可得到一种使逆排列学习可行的构造。我们推测,这可能揭示了一种替代机制:思维链提示,或更广泛地说中间“思考”标记,即使不编码任何有意义的语义信息(例如中间计算的结果),也能在大型语言模型中促成推理。
Summary / 总结
The paper investigates the inverse permutation learning problem in decoder-only transformers, where the model is given a permutation and a string to which that permutation has been applied, and must recover the original string. The authors show that a decoder-only transformer of arbitrary depth cannot learn this task, a limit on expressive capacity that is independent of training dynamics or sample complexity. They also provide two alternative constructions under which the task becomes feasible: one highlights the role of the causal attention mask, and the other shows that padding the input with "scratch tokens" is sufficient, suggesting a possible mechanism by which intermediate tokens can enable reasoning in large language models even when they carry no meaningful semantic information.
本文研究了仅解码器Transformer中的逆排列学习问题:给定一个排列以及应用该排列后的字符串,模型需要恢复原始字符串。作者证明了任意深度的仅解码器Transformer无法学习这一任务,突显了其表达能力的局限性。他们还提出了两种使逆排列学习成为可能的替代构造,一种强调因果注意力掩码的作用,另一种用“草稿标记”填充输入,这可能表明中间标记(即使不编码有意义的语义信息)可以为大型语言模型提供推理能力的一种新机制。
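To make the task in this entry concrete, here is a minimal plain-Python sketch of what "inverse permutation" means as an input/output specification. The function names and the convention that position i of the permuted string holds character `perm[i]` of the original are assumptions chosen for illustration; the paper's own formalization may differ, and nothing here models a transformer.

```python
def apply_permutation(s, perm):
    """Return the string whose i-th character is s[perm[i]]."""
    return "".join(s[p] for p in perm)

def invert_permutation(perm):
    """Return inv such that inv[perm[i]] == i for every i."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv

def recover_canonical(permuted, perm):
    """Target output of the task: undo `perm` to get the original string."""
    return apply_permutation(permuted, invert_permutation(perm))

# Example: the model would see ([2, 0, 1], "cab") and should output "abc".
permuted = apply_permutation("abc", [2, 0, 1])
original = recover_canonical(permuted, [2, 0, 1])
```

The computation is trivial for a conventional program, which is what makes the impossibility result for decoder-only transformers notable.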
Qwen3-VL Technical Report
Authors: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
First: 2025-11-26T17:59:08+00:00 · Latest: 2025-11-26T17:59:08+00:00
Comments: 42 pages
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
中文标题/摘要
标题:Qwen3-VL 技术报告
我们介绍了Qwen系列中最强大的视觉-语言模型Qwen3-VL,该模型在多种跨模态基准测试中表现出色。它原生支持多达256K个令牌的交错上下文,无缝集成文本、图像和视频。该模型系列包括密集型(2B/4B/8B/32B)和专家混合型(30B-A3B/235B-A22B)变体,以适应不同的延迟-质量权衡。Qwen3-VL 提供三个核心支柱:(i) 显著增强的纯文本理解,在某些情况下超越了可比的纯文本骨干模型;(ii) 强大的长上下文理解,具有原生的256K令牌窗口,适用于文本和交错的多模态输入,能够忠实保留、检索和跨长文档和视频进行交叉引用;(iii) 跨单图像、多图像和视频任务的高级多模态推理,展示了在MMMU和视觉数学基准测试(如MathVista和MathVision)中的领先性能。从架构上讲,我们引入了三个关键升级:(i) 增强的交错-MRoPE,以增强图像和视频中的空间-时间建模;(ii) DeepStack集成,有效地利用多级ViT特征来加强视觉-语言对齐;(iii) 视频中的文本基础时间对齐,从T-RoPE发展为显式的文本时间戳对齐,以实现更精确的时间定位。在可比的令牌预算和延迟约束下,Qwen3-VL 在密集型和专家混合型(MoE)架构中均表现出色。我们设想Qwen3-VL 将成为图像驱动推理、自主决策和多模态代码智能在实际工作流程中的基础引擎。
Summary / 总结
Qwen3-VL is the most advanced vision-language model in the Qwen series, enhancing pure-text understanding, long-context comprehension, and multimodal reasoning. It supports up to 256K tokens for interleaved contexts and includes dense and mixture-of-experts variants. Key upgrades include enhanced interleaved-MRoPE, DeepStack integration, and text-based time alignment for video. Qwen3-VL outperforms comparable models in various benchmarks and architectures under similar resource constraints, making it suitable for real-world applications like image-grounded reasoning and multimodal code intelligence.
Qwen3-VL 是 Qwen 系列中最先进的视觉语言模型,增强了纯文本理解、长上下文理解和多模态推理。它支持多达 256K 个令牌,并包括密集型和混合专家型变体。关键架构升级包括增强的 interleaved-MRoPE、DeepStack 集成以及视频中的基于文本的时间对齐。Qwen3-VL 在各种基准测试中表现出色,并旨在用于图像接地推理、自主决策和多模态代码智能等实际应用场景。
Scale-Agnostic Kolmogorov-Arnold Geometry in Neural Networks
Authors: Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams
First: 2025-11-26T17:52:05+00:00 · Latest: 2025-11-26T17:52:05+00:00
Abstract
Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits. We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.
中文标题/摘要
标题:无尺度依赖的柯尔莫哥洛夫-阿诺尔德几何学在神经网络中的应用
弗里德曼和穆利根近期的工作表明,浅层多层感知机在处理合成三维任务时,训练过程中会自发形成柯尔莫哥洛夫-阿诺尔德几何(KAG)结构。然而,尚不清楚这种现象是否在现实的高维环境中持续存在,以及这种几何结构具有哪些空间特性。 我们使用2层MLP对MNIST手写数字分类任务(784维度)进行了KAG分析,并进行了系统的多尺度空间分析。我们发现,KAG在训练过程中出现,并且在从局部7像素邻域到整个28x28图像的多个尺度上保持一致。这种无尺度依赖的特性在不同的训练方法下都成立:标准训练和带有空间增强的训练都产生了相同的定性模式。这些发现揭示了神经网络在学习现实高维数据时自发形成了组织化且尺度不变的几何结构。
Summary / 总结
The study extends the analysis of Kolmogorov-Arnold geometric (KAG) structure from synthetic three-dimensional tasks to the MNIST digit classification task, which involves 784 dimensions. Using two-layer multilayer perceptrons, the researchers found that KAG structure emerges during training and persists across different spatial scales, from small neighborhoods to the full image. This scale-agnostic property was consistent across various training methods, indicating that neural networks develop organized, scale-invariant geometric structures during learning on high-dimensional data.
研究探讨了浅层多层感知机在合成任务中观察到的Kolmogorov-Arnold几何(KAG)结构是否能在高维设置如MNIST手写数字分类中保持。使用2层MLP,研究发现KAG在训练过程中出现,并且在不同空间尺度和训练方法下保持一致。这一无尺度特性表明,神经网络在学习真实高维数据时会自发地发展出有组织的、尺度不变的几何结构。
Active Learning for GCN-based Action Recognition
Authors: Hichem Sahbi
First: 2025-11-26T17:51:59+00:00 · Latest: 2025-11-26T17:51:59+00:00
Abstract
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.
中文标题/摘要
标题:基于GCN的动作识别主动学习
尽管图卷积网络(GCNs)在基于骨架的动作识别方面取得了显著的成功,但其性能往往依赖于大量标记数据,而在实际应用中这类数据往往稀缺。为解决这一限制,我们提出了一种新型的标签高效GCN模型。我们的工作主要做出了两项贡献。首先,我们开发了一种新的获取函数,利用对抗策略来识别一组具有信息性的示例用于标记。这一选择过程平衡了代表性、多样性和不确定性。其次,我们引入了双向和稳定的GCN架构。这些增强的网络有助于更有效地在环境和潜在数据空间之间进行映射,从而更好地理解学习到的示例分布。在两个具有挑战性的基于骨架的动作识别基准上的广泛评估表明,我们的标签高效GCNs相较于先前的工作取得了显著的改进。
Summary / 总结
This paper addresses the challenge of limited labeled data in skeleton-based action recognition using graph convolutional networks (GCNs). It proposes a novel label-efficient GCN model that includes a new acquisition function based on an adversarial strategy to select informative exemplars for labeling. The model also introduces bidirectional and stable GCN architectures to better map between ambient and latent data spaces. Evaluations on two benchmarks show significant performance improvements over previous methods.
研究旨在提高图卷积网络(GCN)在基于骨架的动作识别中的标签数据使用效率。作者提出了一种新的标签高效GCN模型,其中包括一种使用对抗策略的新获取函数,用于选择具有信息性的示例进行标注。该函数平衡了代表性、多样性和不确定性。此外,他们引入了双向和稳定的GCN架构,以更好地在环境和潜在数据空间之间进行映射。在两个基准上的评估显示,与先前的方法相比,性能有了显著提高。
TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding
Authors: Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das
First: 2025-11-24T20:57:31+00:00 · Latest: 2025-11-26T17:43:31+00:00
Abstract
Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.
中文标题/摘要
标题:TREASURE:一种基于Transformer的基础模型,用于大规模交易理解
支付网络构成了现代商业的骨干,每天产生大量的交易记录。正确地建模这些数据可以实现异常行为检测和消费者级别的洞察,从而改善人们的生活。在本文中,我们介绍了TREASURE,一种用于交易数据的多功能基于Transformer的基础模型,TRansformer Engine As Scalable Universal transaction Representation Encoder。该模型同时捕捉消费者行为和支付网络信号(如响应代码和系统标志),为准确推荐系统和异常行为检测等应用提供全面信息。通过行业标准数据集验证,TREASURE具有三个关键能力:1)具有针对静态和动态属性的专用子模块的输入模块,实现更高效的训练和推理;2)高效的训练范式,用于预测高基数分类属性;3)作为独立模型,其异常行为检测性能比生产系统提高了111%,作为嵌入提供者,其推荐模型性能提高了104%。我们通过广泛的消融研究、与生产模型的基准测试和案例研究,展示了开发TREASURE获得的重要见解。
Summary / 总结
TREASURE is a transformer-based foundation model designed for high-volume transaction understanding, targeting applications such as abnormal behavior detection and personalized experiences. It captures both consumer behavior and payment network signals. Its design includes an input module with dedicated sub-modules for static and dynamic attributes and an efficient training paradigm for predicting high-cardinality categorical attributes. The abstract reports a 111% improvement in abnormal behavior detection over production systems when used as a standalone model, and a 104% improvement to recommendation models when used as an embedding provider.
TREASURE 是一个针对大规模交易数据的基于 Transformer 的基础模型,能够支持异常行为检测和个性化推荐等应用。它同时捕捉消费者行为和支付网络信号,包含专门用于静态和动态属性的输入模块,并采用高效的训练范式处理高基数分类属性。作为独立模型,TREASURE 将异常行为检测性能比生产系统提高了111%;作为嵌入提供者,它将推荐模型提升了104%。广泛的消融研究和基准测试展示了其有效性。
Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
Authors: Jungyeon Koh, Hyun Jong Yang
First: 2025-11-03T16:04:44+00:00 · Latest: 2025-11-26T17:29:51+00:00
Abstract
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
中文标题/摘要
标题:资源感知并行推测解码的协作大型语言模型推理
设备上大型语言模型(LLM)推理需求的增长突显了在资源受限环境中高效移动边缘计算(MEC)解决方案的必要性。推测解码通过在移动设备上的轻量级草稿模型和边缘服务器上的强大目标模型之间分割标记生成,提供了一种有前景的解决方案,但存在通信开销和异步延迟的问题。本文首次提出了一种统一框架,联合优化用户关联和资源分配(UARA),以支持高效的并行推测解码。我们使用多智能体深度强化学习算法解决UARA问题。为了在现实条件下评估我们的方法,我们使用Sionna仿真器进行了实验。结果表明,我们的方法在不牺牲推理准确性的前提下,端到端延迟最多可减少28.0%,平均减少23.7%,使MEC系统中的大型语言模型服务具有可扩展性和低延迟。
Summary / 总结
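This paper addresses the need for efficient large language model inference in resource-constrained mobile edge computing settings. Speculative decoding splits token generation between a lightweight draft model on the mobile device and a more powerful target model on the edge server, but it suffers from communication overhead and asynchronous delays. The paper proposes a unified framework that jointly optimizes user association and resource allocation (UARA) for parallel speculative decoding, and solves the UARA problem with a multi-agent deep reinforcement learning algorithm. In experiments with the Sionna simulator, the method reduces end-to-end latency by up to 28.0% and by 23.7% on average without affecting inference accuracy.
本文针对资源受限的移动边缘计算环境中高效的大语言模型推理问题,提出了一种统一框架,联合优化用户关联和资源分配(UARA),以支持并行推测解码。推测解码将token生成分配给移动设备上的轻量级草稿模型和边缘服务器上的目标模型;本文使用多智能体深度强化学习算法求解UARA问题。在Sionna仿真器上的实验表明,该方法将端到端延迟最多减少28.0%,平均减少23.7%,同时不牺牲推理准确性。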
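For readers unfamiliar with the draft/target split that this entry builds on, the following toy shows the basic greedy draft-then-verify loop. It is a sketch under stated assumptions: `toy_target` and `toy_draft` are made-up deterministic stand-ins for real models, verification is written as a sequential loop although a real target model would check all drafted positions in one pass, and none of the paper's contributions (parallel decoding, user association, resource allocation) are modeled.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy draft-then-verify decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # Draft model (on the device) proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # Target model (on the edge server) verifies the proposal position by position.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
    return out

def toy_target(seq):
    """Hypothetical target model: next token is a fixed function of the prefix."""
    return (sum(seq) + 1) % 5

def toy_draft(seq):
    """Hypothetical draft model: agrees with the target except on some prefixes."""
    return (sum(seq) + 1) % 5 if len(seq) % 3 else 0

tokens = speculative_decode(toy_target, toy_draft, [1, 2], n_new=10)
```

Because every kept token is the target's own choice, the draft model affects only how many tokens are confirmed per verification round, never the output itself.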
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Authors: M. Naseer Subhani
First: 2025-11-26T17:26:00+00:00 · Latest: 2025-11-26T17:26:00+00:00
Abstract
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
中文标题/摘要
标题:ReSAM:细化、重查询和重强化:用于遥感图像的自我提示点监督分割
交互式分割模型如Segment Anything Model (SAM) 在自然图像上表现出色,但在遥感图像(RSI)上表现不佳,这主要是由于严重的领域偏移和密集注释的稀缺性。为了解决这个问题,我们提出了一种自我提示、点监督框架,仅使用稀疏点注释将SAM适应RSI。该方法采用一个细化-重查询-重强化循环,从初始点生成粗略的伪掩码(细化),通过自我构建的框提示改进(重查询),并在迭代中对嵌入进行对齐以减少确认偏差(重强化)。不依赖于全掩码监督,我们的方法通过自我引导的提示适应逐步提升SAM的分割质量和领域鲁棒性。我们在WHU、HRSID和NWPU VHR-10三个RSI基准数据集上评估了我们提出的方法,结果显示我们的方法在预训练SAM和最近的点监督分割方法上表现更优。我们的结果表明,自我提示和语义对齐为基础分割模型在遥感应用中的点级适应提供了一条高效途径。
Summary / 总结
The research aims to improve the performance of interactive segmentation models like SAM on remote sensing images (RSIs) by addressing domain shift and sparse annotations. The proposed method, ReSAM, uses a Refine-Requery-Reinforce loop to adapt SAM using only sparse point annotations. It generates coarse pseudo-masks, improves them with self-constructed box prompts, and aligns embeddings to reduce confirmation bias. Experiments on WHU, HRSID, and NWPU VHR-10 datasets show that ReSAM consistently outperforms pretrained SAM and other point-supervised segmentation methods, demonstrating the effectiveness of self-prompting and semantic alignment for scalable adaptation of foundation models in remote sensing.
研究针对交互式分割模型如SAM在遥感图像上的表现不佳问题,由于领域偏移和稀疏标注。提出了一种Refine-Requery-Reinforce框架,利用稀疏点标注来适应SAM用于遥感图像。该方法通过迭代生成和细化伪掩码并对齐嵌入来逐步提升分割质量和领域鲁棒性。实验结果表明,该方法在WHU、HRSID和NWPU VHR-10数据集上优于预训练的SAM和其他点监督分割方法。
TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data
Authors: Yizhou Zhao, Xiang Li, Peter Song, Qi Long, Weijie Su
First: 2025-11-26T17:16:14+00:00 · Latest: 2025-11-26T17:16:14+00:00
Abstract
The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising solution to address these concerns by ensuring the traceability of synthetic data, but existing methods face many limitations: they are computationally expensive due to reliance on large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to post-modifications. To address them, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for generative tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against common post-processing attacks, while preserving high data fidelity and fully supporting mixed-type features.
中文标题/摘要
标题:TAB-DRW:基于DFT的稳健生成表格数据水印
生成式AI的兴起使得在医疗保健、金融和公共政策等领域生成高保真合成表格数据成为可能,这引发了关于数据来源和滥用的日益增长的担忧。水印提供了一种有希望的解决方案,通过确保合成数据的可追溯性来解决这些问题,但现有方法存在许多局限性:它们由于依赖于大型扩散模型而计算成本高昂,难以处理混合离散-连续数据,或者缺乏对后修改的鲁棒性。为了解决这些问题,我们提出了TAB-DRW,这是一种高效的针对生成表格数据的稳健后编辑水印方案。TAB-DRW将水印信号嵌入到频域中:它通过Yeo-Johnson变换和标准化对异质特征进行归一化,应用离散傅里叶变换(DFT),并根据预先计算的伪随机位调整适应性选择条目的虚部。为了进一步增强鲁棒性和效率,我们引入了一种新颖的基于排名的伪随机位生成方法,该方法可以在不增加存储开销的情况下按行检索。在五个基准表格数据集上的实验表明,TAB-DRW在检测能力和抵抗常见后处理攻击的鲁棒性方面表现出色,同时保持了高数据保真度并完全支持混合类型特征。
Summary / 总结
TAB-DRW is a DFT-based watermarking method designed to address the limitations of existing techniques in ensuring the traceability of synthetic tabular data. It normalizes features, applies DFT, and adjusts the imaginary parts of selected entries with pseudorandom bits. Experiments demonstrate that TAB-DRW is highly detectable and robust against post-processing attacks, while maintaining data fidelity and supporting mixed-type features.
TAB-DRW 是一种基于 DFT 的水印方法,旨在解决现有技术在确保合成表格数据可追溯性方面的局限性。它通过 Yeo-Johnson 变换和离散傅里叶变换在频域中嵌入水印,并通过基于排名的伪随机位生成方法增强鲁棒性。在五个基准数据集上的实验表明,TAB-DRW 保持了高数据保真度和对后处理攻击的鲁棒性,同时支持混合类型特征。
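The frequency-domain embedding described in the TAB-DRW entry above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding rule (forcing the sign of the imaginary part of low-frequency DFT coefficients), the `delta` margin, the row-wise DFT, and the seeded `pseudorandom_bits` helper are all assumptions made for illustration. The paper's Yeo-Johnson step, adaptive entry selection, and rank-based bit generation are not reproduced; the input row is assumed to be already normalized.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def pseudorandom_bits(seed, n_bits):
    """Hypothetical stand-in for the paper's rank-based bit generator."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n_bits)]


def embed(row, bits, delta=0.5):
    """Encode each bit as the sign of Im(X[k]) for k = 1, 2, ... (assumed rule)."""
    n = len(row)
    X = dft(row)
    for k, bit in zip(range(1, (n - 1) // 2 + 1), bits):
        sign = 1.0 if bit else -1.0
        # Keep at least a margin of delta so the sign survives small perturbations.
        X[k] = complex(X[k].real, sign * max(abs(X[k].imag), delta))
        # Mirror the change so the inverse transform stays real-valued.
        X[n - k] = X[k].conjugate()
    return [value.real for value in idft(X)]


def extract(row, n_bits):
    """Read the bits back from the signs of the imaginary parts."""
    X = dft(row)
    return [1 if X[k].imag > 0 else 0 for k in range(1, n_bits + 1)]


row = [0.3, -1.2, 0.8, 0.1, -0.5, 1.4, -0.9, 0.2]  # assumed already normalized
bits = pseudorandom_bits(seed=42, n_bits=3)
marked = embed(row, bits)
recovered = extract(marked, 3)
```

Detection in this sketch simply compares `recovered` with the expected bits; a real detector would presumably aggregate agreement over many rows into a statistical test.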
MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Authors: Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen
First: 2025-11-26T17:09:03+00:00 · Latest: 2025-11-26T17:09:03+00:00
Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
中文标题/摘要
标题:MoGAN:通过少量步骤运动对抗后训练提高视频扩散中的运动质量
视频扩散模型在帧级保真度方面表现出色,但在运动连贯性、动态性和真实感方面仍然存在困难,经常产生抖动、鬼影或不合理的动态。一个关键限制是,标准的去噪均方误差目标无法直接监督时间一致性,使得模型在生成不良运动的同时仍能实现低损失。我们提出MoGAN,这是一种以运动为中心的后训练框架,可以在无需奖励模型或人类偏好数据的情况下提高运动真实感。该框架基于一个三步蒸馏的视频扩散模型,我们训练一个基于DiT的光流判别器来区分真实运动和生成运动,并结合分布匹配正则化器以保持视觉保真度。在Wan2.1-T2V-1.3B上的实验表明,MoGAN在基准测试中显著提高了运动质量。在VBench上,MoGAN将运动得分分别提高了7.3%(相对于50步教师模型)和13.3%(相对于3步DMD模型)。在VideoJAM-Bench上,MoGAN将运动得分分别提高了7.4%(相对于教师模型)和8.8%(相对于DMD),同时保持了可比或更好的美学和图像质量得分。人类研究进一步证实,MoGAN在运动质量方面更受青睐(与教师模型相比:52% 对 38%;与DMD相比:56% 对 29%)。总体而言,MoGAN在不牺牲视觉保真度或效率的情况下提供了明显更真实的运动,为快速、高质量视频生成提供了一条实用的道路。项目网页为:https://xavihart.github.io/mogan
Summary / 总结
MoGAN is a motion-centric post-training framework that enhances motion quality in video diffusion models by training a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. Experiments on Wan2.1-T2V-1.3B show that MoGAN significantly improves motion quality, with a +7.3% boost in motion score on VBench and +7.4% on VideoJAM-Bench compared to the teacher model, while maintaining comparable aesthetic and image-quality scores. A human study also confirms that MoGAN is preferred for motion quality.
MoGAN 是一种以运动为中心的后训练框架,通过训练基于 DiT 的光流鉴别器来区分真实和生成的运动,并结合分布匹配正则化器来保持视觉保真度。实验结果显示,MoGAN 在 Wan2.1-T2V-1.3B 上显著提高了运动质量,VBench 上的运动得分提高了 7.3%,VideoJAM-Bench 上提高了 7.4%,同时保持了可比或更好的美学和图像质量得分。人类研究还证实,MoGAN 在运动质量方面更受欢迎。
BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali
Authors: Abdullah Al Sefat
First: 2025-11-25T15:26:47+00:00 · Latest: 2025-11-26T17:08:26+00:00
Abstract
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
中文标题/摘要
标题:BengaliFig:面向孟加拉语比喻性与文化根植推理的低资源挑战
大型语言模型在广泛的多语言基准测试中表现出色,但在比喻性和文化根植推理方面仍有待充分评估,尤其是在低资源环境中。我们提出了 BengaliFig,这是一个紧凑但注释丰富的挑战集,旨在填补孟加拉语中的这一空白,孟加拉语是一种广泛使用的低资源语言。数据集包含 435 个独特的谜语,源自孟加拉语口头和文学传统。每个条目都从推理类型、陷阱类型、文化深度、答案类别和难度五个正交维度进行注释,并通过一种约束感知的、AI 辅助的流水线自动转换为多项选择格式。我们在零样本和少样本思维链提示下评估了来自主要提供商的八种前沿 LLM,揭示了它们在隐喻和文化特定推理方面的一致弱点。因此,BengaliFig 既提供了一个用于评估 LLM 在低资源文化环境中鲁棒性的诊断探针,也是迈向包容性且具有遗产意识的 NLP 评估的一步。
Summary / 总结
The research aims to evaluate large language models' performance in figurative and culturally grounded reasoning, particularly in low-resource languages like Bengali. The study introduces BengaliFig, a dataset of 435 riddles annotated with five dimensions, and evaluates eight leading LLMs under zero-shot and few-shot prompting, highlighting their consistent difficulties in metaphorical and culturally specific reasoning. This dataset serves as a diagnostic tool for assessing LLM robustness in low-resource cultural contexts and promotes inclusive NLP evaluation.
研究旨在评估大型语言模型在隐喻和文化背景推理方面的能力,特别是在低资源语言如孟加拉语中。研究引入了包含435个谜语的BengaliFig数据集,并评估了八种领先的LLM。结果显示,在隐喻和文化特定推理方面存在一致的困难,突显了在低资源背景下进行更好评估的需求。
On the Limits of Innate Planning in Large Language Models
Authors: Charles Schepanowski, Charles Ling
First: 2025-11-26T17:08:13+00:00 · Latest: 2025-11-26T17:08:13+00:00
Comments: 33 pages, 7 figures
Abstract
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.
中文标题/摘要
标题:关于大型语言模型内置规划能力的局限性
大型语言模型(LLMs)在许多基准测试中取得了令人印象深刻的成果,但它们的规划能力和状态依赖性推理能力仍然不清楚。我们直接研究了这些能力,不使用代码执行或其他工具,使用8-拼图:一个经典的任务,需要状态跟踪和目标导向的规划,同时允许精确的、逐步的评估。四个模型在常见的提示条件下(零样本、思维链、思维算法)和分层纠正反馈下进行测试。反馈在某些模型提示组合中提高了成功率,但许多成功的运行是长的、计算密集型的并且间接的。然后我们使用外部移动验证器来检查这些模型,该验证器仅提供有效移动。即使在这种程度的帮助下,这些模型在该设置中也没有解决任何拼图。定性分析揭示了所有模型中的两个主要缺陷:(1)脆弱的内部状态表示,导致频繁出现无效移动,(2)弱的启发式规划,模型进入循环或选择不减少与目标状态距离的动作。这些发现表明,在没有外部工具如代码解释器的情况下,当前的LLMs在规划方面存在重大局限性,进一步的进步可能需要维护显式状态和执行结构化搜索的机制。
Summary / 总结
This study investigates the planning and stateful reasoning capabilities of large language models (LLMs) using the 8-puzzle task, without code execution or other tools. Four models were tested under Zero-Shot, Chain-of-Thought, and Algorithm-of-Thought prompting and with tiered corrective feedback. Feedback improved success rates for some model-prompt combinations, but many successful runs were long, computationally expensive, and indirect. With an external move validator providing only valid moves, none of the models solved any puzzles. The analysis highlighted two main issues: brittle internal state representations leading to frequent invalid moves, and weak heuristic planning causing loops or actions that do not reduce the distance to the goal state. This suggests that current LLMs have substantial limitations in planning without external tools, and that further progress may require mechanisms for explicit state maintenance and structured search.
研究使用8-拼图任务来考察大型语言模型(LLMs)的规划和状态推理能力。四种模型在不同的提示条件下进行了测试,并且提供了不同程度的纠正反馈。虽然反馈提高了成功率,但许多成功的尝试仍然非常耗时且间接。即使外部移动验证器仅提供有效移动,这些模型也无法解决任何拼图。分析发现两个主要问题:脆弱的内部状态表示导致无效移动,以及弱的启发式规划导致循环或不减少到目标状态距离的动作。这表明,当前的LLMs在没有外部工具的情况下,在规划方面存在显著局限性,进一步的进展可能需要维护显式状态和执行结构化搜索的机制。
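The two deficits named in the entry above (invalid moves and moves that do not reduce the distance to the goal) can be made concrete with a small sketch of the 8-puzzle. The state encoding (a 9-tuple with 0 as the blank), the move names, and the choice of Manhattan distance as the heuristic are assumptions for illustration; the paper's own validator and distance measure are not specified in the abstract.

```python
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 marks the blank square

# Index offset applied to the blank for each move direction.
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}


def valid_moves(state):
    """Return the directions in which the blank can legally slide."""
    row, col = divmod(state.index(0), 3)
    moves = []
    if row > 0:
        moves.append("up")
    if row < 2:
        moves.append("down")
    if col > 0:
        moves.append("left")
    if col < 2:
        moves.append("right")
    return moves


def apply_move(state, move):
    """Slide the blank in the given direction, rejecting invalid moves."""
    if move not in valid_moves(state):
        raise ValueError(f"invalid move {move!r} for state {state}")
    blank = state.index(0)
    target = blank + MOVES[move]
    tiles = list(state)
    tiles[blank], tiles[target] = tiles[target], tiles[blank]
    return tuple(tiles)


def manhattan(state, goal=GOAL):
    """Sum of row and column offsets of every tile from its goal position."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue
        goal_index = goal.index(tile)
        total += abs(index // 3 - goal_index // 3) + abs(index % 3 - goal_index % 3)
    return total


start = (1, 2, 3, 4, 5, 6, 7, 0, 8)
options = valid_moves(start)          # moves a validator would offer
solved = apply_move(start, "right")   # this move reaches the goal
```

A planner that only ever picked moves lowering `manhattan` would solve this one-move instance immediately; the abstract reports that the tested models often failed to make such distance-reducing choices.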
Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
Authors: Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
Venue: NeurIPS 2025
First: 2025-11-26T17:01:41+00:00 · Latest: 2025-11-26T17:01:41+00:00
Comments: Published at NeurIPS 2025: https://openreview.net/forum?id=4OLbpaTKJe
Abstract
End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
中文标题/摘要
标题:基于模型的策略适应以实现闭环端到端自动驾驶
端到端(E2E)的自动驾驶模型在开环评估中表现出强大的性能,但在闭环环境中往往遭受级联错误和较差的泛化能力。为解决这一差距,我们提出了基于模型的策略适应(MPA),这是一种通用框架,能够在部署过程中增强预训练E2E驾驶代理的鲁棒性和安全性。MPA 首先使用几何一致的模拟引擎生成多样化的反事实轨迹,使代理暴露于超出原始数据集的场景。基于生成的数据,MPA 训练一个基于扩散的策略适配器以细化基础策略的预测,并训练一个多步Q值模型以评估长期结果。在推理时,适配器提出多个轨迹候选,Q值模型选择具有最高预期效用的轨迹。在nuScenes基准上使用逼真的闭环模拟器进行的实验表明,MPA 显著提高了在领域内、领域外和安全关键场景中的性能。我们进一步研究了反事实数据的规模和推理时的指导策略如何影响整体效果。
Summary / 总结
The research aims to enhance the robustness and safety of end-to-end autonomous driving models in closed-loop settings by proposing Model-based Policy Adaptation (MPA). MPA uses a simulation engine to generate diverse counterfactual trajectories, which are then used to train a policy adapter and a multi-step Q value model. At inference time, the adapter suggests multiple trajectory options, and the Q value model selects the most beneficial one. Experiments on the nuScenes benchmark show that MPA improves performance in various scenarios, including in-domain, out-of-domain, and safety-critical situations.
研究针对端到端自动驾驶模型在闭环环境中的局限性,提出了基于模型的策略适应(MPA)框架。MPA 使用仿真引擎生成多样化的反事实轨迹,并训练策略适配器和多步 Q 值模型来改进和评估基础策略。实验表明,MPA 在各种场景中,包括领域内、领域外和安全关键场景中,均提高了性能。
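The inference-time procedure in the MPA entry above (the adapter proposes several candidates, the Q value model picks the best one) follows a propose-then-select pattern that can be sketched in a few lines. Everything concrete here is a hypothetical stand-in: `propose_candidate` adds Gaussian noise instead of running a diffusion-based adapter, and `toy_q_value` scores lateral deviation instead of a learned multi-step Q value.

```python
import random


def propose_candidate(base_plan, rng, noise=0.5):
    """Stand-in for the policy adapter: perturb the lateral offset of each waypoint."""
    return [(x, y + rng.gauss(0.0, noise)) for x, y in base_plan]


def toy_q_value(plan):
    """Stand-in for the Q value model: prefer plans that stay near the lane centre."""
    return -sum(abs(y) for _, y in plan)


def select_trajectory(base_plan, num_candidates=8, seed=0):
    """Sample candidates around the base plan and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [propose_candidate(base_plan, rng) for _ in range(num_candidates)]
    best = max(candidates, key=toy_q_value)
    return best, candidates


base_plan = [(0.0, 0.3), (1.0, 0.4), (2.0, 0.5)]  # (longitudinal, lateral) waypoints
best, candidates = select_trajectory(base_plan)
```

The design point is the separation of roles: the generator only needs to cover good options, and the scorer only needs to rank them.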
LCB-CV-UNet: Enhanced Detector for High Dynamic Range Radar Signals
Authors: Yanbin Wang, Xingyu Chen, Yumiao Wang, Xiang Wang, Chuanfei Zang, Guolong Cui, Jiahuan Liu
Venue: Proc. IEEE Int. Geosci. Remote Sens. Symp. (2025) 6050-6054
First: 2025-05-29T14:00:59+00:00 · Latest: 2025-11-26T16:58:54+00:00
Comments: 5 pages, 4 figures. Accepted to IEEE IGARSS 2025
Abstract
We propose the LCB-CV-UNet to tackle performance degradation caused by High Dynamic Range (HDR) radar signals. First, a hardware-efficient, plug-and-play module named Logarithmic Connect Block (LCB) is proposed as a phase-coherence-preserving solution to the inherent challenges of handling HDR features. Then, we propose the Dual Hybrid Dataset Construction method to generate a semi-synthetic dataset that approximates typical HDR signal scenarios with adjustable target distributions. Simulation results show an improvement of about 1% in total detection probability, with less than 0.9% added computational complexity compared with the baseline. Furthermore, the model outperforms the baseline by 5% in the 11-13 dB signal-to-noise ratio range typical of urban targets. Finally, a real-world experiment validates the practicality of our model.
中文标题/摘要
标题:LCB-CV-UNet:高动态范围雷达信号增强检测器
我们提出LCB-CV-UNet以应对高动态范围(HDR)雷达信号引起的性能退化问题。首先,提出了一种硬件高效、即插即用的模块——对数连接块(LCB),作为相位相干性保持的解决方案,以应对处理HDR特征的固有挑战。然后,我们提出了双混合数据集构建方法,生成一个半合成数据集,通过可调节的目标分布近似典型HDR信号场景。仿真结果表明,与基线相比,总检测概率提高了约1%,计算复杂度增加了不到0.9%。此外,在11-13 dB信噪比的典型城市目标场景下,其性能优于基线5%。最后,实际实验验证了我们模型的实用性。
Summary / 总结
The LCB-CV-UNet is proposed to improve detection performance on High Dynamic Range radar signals. It includes a Logarithmic Connect Block (LCB) module to preserve phase coherence and a Dual Hybrid Dataset Construction method for generating a semi-synthetic dataset. In simulation, the model improves total detection probability by about 1% with less than 0.9% added computational complexity, and outperforms the baseline by 5% in the 11-13 dB signal-to-noise ratio range typical of urban targets. A real-world experiment confirms its practicality.
提出了LCB-CV-UNet来提高高动态范围雷达信号检测性能。该模型包含一个Logarithmic Connect Block (LCB)模块以保持相位相干性,并采用双混合数据集构建方法生成半合成数据集。实验结果显示,该模型在总检测概率上提高了1%,且增加了不到1%的计算复杂度,在特定信噪比条件下比基线模型高出5%。
Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning
Authors: Alex Ning, Yen-Ling Kuo, Gabe Gomes
First: 2025-11-26T16:54:06+00:00 · Latest: 2025-11-26T16:54:06+00:00
Comments: 13 pages, 6 figures
Abstract
Latent reasoning represents a new development in Transformer language models that has shown potential in compressing reasoning lengths compared to chain-of-thought reasoning. By directly passing the information-rich previous final latent state into the next sequence, latent reasoning removes the restriction to human language tokens as the medium for reasoning. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology to optimize latent reasoning length by minimizing reasoning length while maintaining accuracy. This, in turn, further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a 52% drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze relationships between training coefficients, experiment with architecture variations, and continue our knowledge distillation for latent reasoning SFT efforts. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.
中文标题/摘要
标题:学习何时停止:通过强化学习实现自适应潜在推理
潜在推理是Transformer语言模型中的一种新发展,与思维链推理相比,它在压缩推理长度方面显示出潜力。通过直接将信息丰富的先前最终潜在状态传递到下一个序列,潜在推理消除了以人类语言标记作为推理媒介的限制。我们开发了自适应长度的潜在推理模型,并引入了一种后SFT强化学习方法,通过最小化推理长度同时保持准确性来优化潜在推理长度。这反过来进一步减少了计算使用量,并提高了潜在推理模型压缩能力的标准。在Llama 3.2 1B模型和GSM8K-Aug数据集上的实验表明,在不牺牲准确性的前提下,总推理长度减少了52%。在未来的工作中,我们计划扩展到其他模型和数据集,分析训练系数之间的关系,尝试架构变化,并继续我们的潜在推理SFT知识蒸馏工作。我们的代码和预训练权重发布于 https://github.com/apning/adaptive-latent-reasoning 。
Summary / 总结
This paper introduces adaptive-length latent reasoning models that optimize the length of reasoning processes by minimizing reasoning steps while maintaining accuracy, using a post-SFT reinforcement-learning methodology. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset demonstrate a 52% reduction in reasoning length without affecting accuracy.
该论文提出了自适应长度的潜在推理模型,通过最小化推理步骤同时保持准确性来优化推理过程的长度,使用了后SFT强化学习方法。实验在Llama 3.2 1B模型和GSM8K-Aug数据集上显示,推理长度减少了52%,且未影响准确性。
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
First: 2025-11-26T16:53:05+00:00 · Latest: 2025-11-26T16:53:05+00:00
Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
中文标题/摘要
标题:和谐:通过跨任务协同实现音频和视频生成的同步
同步音频-视觉内容的合成是生成式AI中的一个关键挑战,开源模型在稳健的音频-视频对齐方面面临挑战。我们的分析表明,这一问题源于联合扩散过程中的三个基本挑战:(1)对应关系漂移,同时演化的噪声潜在变量阻碍了对齐的稳定学习;(2)低效的全局注意力机制,无法捕捉细粒度的时间线索;(3)传统无分类器引导(CFG)的模态内偏差,它增强了条件性但未提高跨模态同步。为克服这些挑战,我们引入了Harmony,一种从机制上强制实现音频-视觉同步的新框架。我们首先提出了一种跨任务协同训练范式,通过利用音频驱动视频和视频驱动音频生成任务中的强监督信号来减轻漂移。然后,我们设计了一种全局-局部解耦交互模块,以实现高效和精确的时间-风格对齐。最后,我们提出了一种新的同步增强CFG(SyncCFG),在推理过程中显式地分离并放大对齐信号。广泛的实验表明,Harmony达到了新的最先进水平,在生成保真度以及尤为关键的细粒度音频-视觉同步方面均显著优于现有方法。
Summary / 总结
The paper addresses the challenge of robust audio-video alignment in generative AI by introducing Harmony, a framework that tackles three key issues: correspondence drift, inefficient global attention, and intra-modal bias. Harmony uses a Cross-Task Synergy training paradigm, a Global-Local Decoupled Interaction Module, and a Synchronization-Enhanced Classifier-Free Guidance (SyncCFG) to enforce audio-visual synchronization. Experiments show that Harmony outperforms existing methods in both generation fidelity and fine-grained audio-visual synchronization.
Harmony通过引入一个新颖的框架来解决生成式AI中稳健的音频-视频对齐问题,该框架针对三个关键问题:对应关系漂移、低效的全局注意力以及模态内偏差。它采用跨任务协同训练范式、全局-局部解耦交互模块和同步增强的无分类器引导(SyncCFG)来强制实现音频-视觉同步。实验表明,Harmony在生成保真度和细粒度的音频-视频同步方面显著优于现有方法。
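For readers unfamiliar with the conventional Classifier-Free Guidance that the Harmony entry above contrasts with SyncCFG, the standard combination rule can be sketched as follows. This shows only the widely used baseline form, in which the guided prediction is the unconditional prediction plus a scaled difference between conditional and unconditional predictions. How SyncCFG isolates the alignment signal is not described in the abstract and is not reproduced here; the function and argument names are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 0.0]   # toy unconditional noise prediction
eps_cond = [1.0, 2.0]     # toy conditional noise prediction
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
```

With `scale=1.0` the rule returns the conditional prediction unchanged, and larger scales push further along the conditional direction, which is the sense in which the abstract says conventional CFG enhances conditionality.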
Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss
Authors: Chou Mo, Yehyun Suh, J. Ryan Martin, Daniel Moyer
First: 2025-11-26T16:50:06+00:00 · Latest: 2025-11-26T16:50:06+00:00
Comments: 9 pages, 3 figures, 1 table
Abstract
Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.
中文标题/摘要
标题:使用2D/3D配准损失增强骨盆透视中的关键点检测模型
自动关键点检测为医疗专业人员提供了一种有效的方法,通过术中成像理解患者解剖结构和定位。虽然当前骨盆透视的关键点检测方法显示出有希望的准确性,但大多数方法假设骨盆的固定前-后视图。然而,方向往往偏离这种标准视图,可能是由于成像单元或目标结构本身的重新定位。为了解决这一限制,我们提出了一种新的框架,将2D/3D关键点配准纳入U-Net关键点预测模型的训练中。我们通过在患者姿态可变的现实术中条件下比较基于基线U-Net、使用姿态估计损失训练的U-Net以及使用姿态估计损失微调的U-Net的关键点检测准确性来分析性能差异。
Summary / 总结
The research aims to improve landmark detection in pelvic fluoroscopy when the view deviates from the standard Antero-Posterior assumption. The method incorporates 2D/3D landmark registration, in the form of a Pose Estimation Loss, into the training of a U-Net landmark prediction model. The evaluation compares three variants under realistic intra-operative conditions with variable patient pose: the baseline U-Net, a U-Net trained with Pose Estimation Loss, and a U-Net fine-tuned with Pose Estimation Loss, in order to measure the resulting difference in landmark detection accuracy.
该研究旨在解决术中患者姿态变化带来的问题,以改进骨盆透视中的关键点检测。作者提出了一种新的框架,将2D/3D关键点配准纳入U-Net模型的训练。他们在患者姿态可变的术中条件下比较了三种模型的关键点检测准确性:基线U-Net、使用姿态估计损失训练的U-Net以及使用姿态估计损失微调的U-Net。
Multimodal Robust Prompt Distillation for 3D Point Cloud Models
Authors: Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang
First: 2025-11-26T16:49:38+00:00 · Latest: 2025-11-26T16:49:38+00:00
Abstract
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD), for distilling a robust 3D point cloud model. It learns lightweight prompts by aligning the student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism that dynamically balances the contribution of all input modalities. Notably, since the distillation takes place entirely during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
中文标题/摘要
标题:多模态鲁棒提示蒸馏用于3D点云模型
对抗攻击对基于学习的3D点云模型构成了重大威胁,严重削弱了它们在安全敏感应用中的可靠性。现有防御方法通常存在(1)高计算开销和(2)对不同攻击类型的泛化能力差的问题。为了解决这些问题,我们提出了一种新颖且高效的教师-学生框架,即多模态鲁棒提示蒸馏(MRPD),用于提炼鲁棒的3D点云模型。该方法通过将学生点云模型的特征与来自三个不同教师的鲁棒嵌入对齐来学习轻量级提示:一个处理深度投影的视觉模型、一个高性能的3D模型和一个文本编码器。为了确保可靠的知识转移,蒸馏过程由一个置信度门控机制引导,该机制动态平衡所有输入模态的贡献。值得注意的是,由于蒸馏过程仅在训练阶段进行,因此在推理阶段没有额外的计算开销。广泛的实验表明,MRPD在多种白盒和黑盒攻击下显著优于最先进的防御方法,甚至在干净数据上也表现出更好的性能。我们的工作提出了一种新的、实用的范式,通过高效利用多模态知识来构建鲁棒的3D视觉系统。
Summary / 总结
The paper addresses the vulnerability of 3D point cloud models to adversarial attacks by proposing a novel teacher-student framework called Multimodal Robust Prompt Distillation (MRPD). MRPD learns lightweight prompts by aligning student model features with robust embeddings from three distinct teachers: a vision model, a high-performance 3D model, and a text encoder. The distillation is guided by a confidence-gated mechanism to ensure reliable knowledge transfer. Experiments show that MRPD outperforms existing methods against various attacks and even improves performance on clean data, with no additional computational cost at inference.
研究旨在通过提出一种名为Multimodal Robust Prompt Distillation (MRPD)的新颖教师-学生框架来解决3D点云模型对对抗攻击的脆弱性。MRPD通过将学生模型特征与来自三种不同教师(视觉模型、高性能3D模型和文本编码器)的鲁棒嵌入对齐来学习轻量级提示。知识转移过程由置信度门控机制引导以确保可靠的知识传递。实验表明,MRPD在各种攻击下显著优于现有方法,并且在干净数据上甚至表现出更好的性能,而无需额外的推理成本。
BAMAS: Structuring Budget-Aware Multi-Agent Systems
Authors: Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen
Venue: AAAI 2026 oral
First: 2025-11-26T16:48:18+00:00 · Latest: 2025-11-26T16:48:18+00:00
Comments: Accepted by AAAI 2026 (oral paper)
Abstract
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
中文标题/摘要
标题:BAMAS:构建预算意识多智能体系统
基于大型语言模型(LLM)的多智能体系统已成为使自主智能体解决复杂任务的强大范式。随着这些系统的复杂性增加,成本成为实际部署中的一个重要考虑因素。然而,现有工作很少解决如何在明确的预算约束下构建多智能体系统的问题。在本文中,我们提出了一种名为BAMAS的新方法,用于构建具有预算意识的多智能体系统。BAMAS首先通过求解一个整数线性规划问题来选择最优的LLM集合,该问题平衡了性能和成本。然后,通过利用基于强化学习的方法来确定这些LLM之间的协作拓扑。最后,系统基于所选的智能体及其协作拓扑进行实例化和执行。我们在三个代表性任务上评估了BAMAS,并将其与最先进的智能体构建方法进行了比较。结果表明,BAMAS在保持性能的同时,成本最多可降低86%。
Summary / 总结
The paper addresses the cost of deploying large language model-based multi-agent systems under explicit budget constraints. BAMAS first selects an optimal set of LLMs by solving an Integer Linear Programming problem that balances performance and cost, then selects the interaction topology with a reinforcement learning-based method. The system is then instantiated and executed based on these selections. Across three representative tasks, BAMAS achieves performance comparable to state-of-the-art agent construction methods while reducing cost by up to 86%.
论文提出了BAMAS方法,用于使用LLM构建具有预算意识的多智能体系统。首先通过求解整数线性规划问题选择最优的LLM集合,以平衡性能和成本。然后使用强化学习确定这些LLM之间的交互拓扑。实验结果显示,BAMAS可以实现与现有最佳方法相当的性能,同时将成本降低高达86%。
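The first BAMAS step above, choosing a set of LLMs that balances performance and cost under a budget, has the shape of a 0/1 selection problem. The sketch below is only an illustration of that shape: it assumes an additive utility per model and a single budget constraint, and it uses exhaustive search as a stand-in for an ILP solver. The model names, costs, and scores are made up, and the paper's actual ILP formulation is not given in the abstract.

```python
from itertools import combinations


def select_models(models, budget):
    """Pick the subset with the highest total score whose total cost fits the budget.

    `models` maps a name to a (cost, score) pair. Exhaustive search replaces an
    ILP solver here, which is only practical for a handful of candidates.
    """
    best_names, best_score = (), 0.0
    names = sorted(models)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(models[name][0] for name in subset)
            score = sum(models[name][1] for name in subset)
            if cost <= budget and score > best_score:
                best_names, best_score = subset, score
    return best_names, best_score


# Hypothetical candidates: name -> (cost, score)
candidates = {"small": (1, 3), "medium": (3, 5), "large": (6, 9)}
tight_choice, tight_score = select_models(candidates, budget=4)
loose_choice, loose_score = select_models(candidates, budget=7)
```

Raising the budget from 4 to 7 changes the selection from the two cheaper models to the small and large ones, which shows how the budget constraint shapes the chosen set.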
Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners
Authors: Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca
Venue: AAAI 2026
First: 2025-11-13T12:06:12+00:00 · Latest: 2025-11-26T16:39:11+00:00
Comments: AAAI 2026 Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR). Version 2 fixes typos in author name and Figure 1
Abstract
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
中文标题/摘要
标题:迷失在序列化中:LLM图推理器的不变性和泛化能力
虽然前景广阔,但基于大型语言模型(LLMs)的图推理器缺乏对图表示中对称性的内置不变性。在基于序列图序列化操作时,LLMs 在节点重新索引、边重新排序或格式更改下会产生不同的输出,这引发了鲁棒性方面的担忧。我们系统地分析了这些影响,研究了微调如何影响编码敏感性以及在未见任务上的泛化能力。我们提出了一种原理性的图序列化分解,将图序列化分解为节点标签、边编码和语法,并在全面的基准测试套件上评估LLMs对每个因素变化的鲁棒性。我们还贡献了一组新的谱任务,以进一步评估微调推理器的泛化能力。结果显示,较大的(未微调)模型更具鲁棒性。微调可以减少对节点重新标记的敏感性,但可能会增加对结构和格式变化的敏感性,而微调并不一致地提高在未见任务上的性能。
Summary / 总结
This study addresses the robustness issues of graph reasoners based on Large Language Models (LLMs) by analyzing their sensitivity to symmetries in graph representations. The research decomposes graph serializations into node labeling, edge encoding, and syntax, and evaluates LLMs' robustness to variations in these factors. The findings indicate that larger models are more robust, while fine-tuning can reduce sensitivity to node relabeling but may increase sensitivity to structural and formatting changes, without consistently improving performance on unseen tasks.
研究探讨了基于大型语言模型(LLMs)的图推理器在图序列化变化下的鲁棒性。研究指出,LLMs 对图表示中的对称性缺乏不变性,在节点重新索引、边重新排序或格式变化下会产生不同的输出。作者提出了一种图序列化的分解方法,并在全面的基准测试套件上评估了LLMs的鲁棒性。研究发现,较大的模型更具鲁棒性,而微调可以减少对节点重新标记的敏感性,但可能会增加对结构和格式变化的敏感性,且在未见过的任务上并不总是能提高性能。
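The symmetry problem in the entry above is easy to see on a toy graph: reindexing nodes or reordering edges changes the text an LLM receives, while the underlying graph and any invariant of it stay the same. The edge-list format, the helper names, and the use of the degree sequence as the invariant are illustrative choices, not the paper's benchmark format.

```python
def serialize(edges):
    """Render an edge list as a flat string, one possible prompt encoding."""
    return ", ".join(f"{u}-{v}" for u, v in edges)


def relabel(edges, permutation):
    """Rename every node according to a permutation of the node ids."""
    return [(permutation[u], permutation[v]) for u, v in edges]


def degree_sequence(edges):
    """Sorted node degrees: unchanged by relabeling or edge reordering."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values())


path = [(0, 1), (1, 2), (2, 3)]                       # a 4-node path graph
reindexed = relabel(path, {0: 2, 1: 0, 2: 3, 3: 1})   # same graph, new node ids
reordered = list(reversed(path))                      # same graph, new edge order

texts = {serialize(path), serialize(reindexed), serialize(reordered)}
invariants = [degree_sequence(g) for g in (path, reindexed, reordered)]
```

The three serializations are distinct strings, yet all three graphs share one degree sequence, so a reasoner that is invariant to these symmetries should answer structural questions identically for each.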
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Authors: Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, Sheng Huang, Zeyu Wang
First: 2025-11-26T16:38:29+00:00 · Latest: 2025-11-26T16:38:29+00:00
Comments: 10 pages, 6 figures
Abstract
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.
中文标题/摘要
标题:UAVLight:无人机场景中光照鲁棒的3D重建基准
光照不一致是多视图3D重建中的基本挑战。太阳光方向、云层覆盖和阴影的变化打破了经典多视图立体视觉(MVS)和结构从运动(SfM)管道以及最近的神经渲染方法所依赖的恒定光照假设,导致几何漂移、颜色不一致和阴影印记。这一问题在基于无人机的重建中尤为关键,因为长时间飞行和户外环境使得光照变化不可避免。然而,现有的数据集要么限制捕获时间窗口较短,缺乏有意义的光照多样性,要么跨越数月和季节,几何和语义变化使得光照鲁棒性研究变得复杂。我们引入了UAVLight,一个控制下的真实基准,用于光照鲁棒的3D重建。每个场景沿可重复的、地理参考的飞行路径在一天中的多个固定时间点捕获,产生在一致几何、校准和视角下的自然光照变化。在不同光照条件下标准化的评估协议下,UAVLight为开发和基准测试在真实户外环境中一致、忠实且可重新照明的重建方法提供了可靠的基础。
Summary / 总结
The paper addresses illumination inconsistency in 3D reconstruction, particularly in UAV scenes, where lighting changes are unavoidable. It introduces UAVLight, a benchmark that captures each scene along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. The paper argues that existing datasets either lack meaningful illumination diversity or include confounding geometric and semantic changes, whereas UAVLight provides standardized evaluation protocols for assessing reconstruction methods under varying lighting conditions.
论文介绍了UAVLight,这是一个用于UAV场景3D重建的基准,旨在解决光照不一致的问题。该方法沿固定的飞行路径在不同的时间段捕捉场景,以创建自然的光照变化,同时保持一致的几何结构和视角。主要发现表明,现有数据集要么缺乏有意义的光照多样性,要么受到几何和语义变化的干扰,而UAVLight为在真实户外环境中开发和基准测试一致、忠实且可重新照明的3D重建方法提供了可靠的框架。
ENMA: Tokenwise Autoregression for Generative Neural PDE Operators
Authors: Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari
First: 2025-06-06T15:25:14+00:00 · Latest: 2025-11-26T16:36:22+00:00
Abstract
Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete, as is often the case, a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with a flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.
中文标题/摘要
标题:ENMA:面向生成式神经PDE算子的逐令牌自回归
求解时间依赖的参数偏微分方程(PDEs)仍然是神经求解器的基本挑战,尤其是在广泛物理参数和动力学范围内进行泛化时。当数据不确定或不完整时(这种情况很常见),自然的方法是转向生成模型。我们提出了ENMA,一种用于建模来自物理现象的时空动力学的生成式神经算子。ENMA使用通过流匹配损失训练的生成式掩码自回归Transformer,在压缩的潜在空间中预测未来动力学,实现逐令牌生成。不规则采样的空间观测通过注意力机制编码为均匀的潜在表示,并通过时空卷积编码器进一步压缩。这使得ENMA在推理时能够以目标轨迹的过去状态或具有相似动力学的辅助上下文轨迹为条件进行上下文学习。结果是一个稳健且适应性强的框架,能够泛化到新的PDE情形,并支持时间依赖参数PDE的单样本(one-shot)代理建模。
Summary / 总结
ENMA is designed to solve time-dependent parametric PDEs by using a generative masked autoregressive transformer to predict future dynamics in a compressed latent space. It encodes irregularly sampled spatial observations into uniform latent representations and further compresses them through a spatio-temporal convolutional encoder. ENMA can condition on past states or auxiliary context trajectories to perform in-context learning at inference time, making it robust and adaptable for new PDE regimes and one-shot surrogate modeling of time-dependent parametric PDEs.
ENMA 使用生成式掩码自回归 Transformer 在压缩的潜在空间中预测未来动态,将不规则采样的空间观测编码为均匀的潜在表示,并通过时空卷积编码器进一步压缩。ENMA 可以在推理时以过去状态或辅助上下文轨迹为条件进行上下文学习,能够很好地泛化到新的 PDE 情形,并支持时间依赖参数 PDE 的单样本(one-shot)代理建模。
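The abstract states only that ENMA is trained with a flow matching loss. As a rough illustration of what such a loss looks like, the sketch below uses the common straight-line interpolation between a noise sample and a data sample; the path choice, the function name `flow_matching_loss`, and the toy arrays are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Monte-Carlo flow-matching loss on a straight-line path (an assumed path choice).

    x0: noise samples, shape (batch, dim)
    x1: data samples,  shape (batch, dim)
    t:  times in [0, 1], shape (batch,)
    """
    t = t[:, None]                        # broadcast time over the feature dimension
    x_t = (1.0 - t) * x0 + t * x1         # point on the interpolation path at time t
    target_v = x1 - x0                    # velocity of the straight-line path
    pred_v = velocity_fn(x_t, t)          # model's predicted velocity at (x_t, t)
    return float(np.mean((pred_v - target_v) ** 2))

# Toy data (hypothetical): 8 samples with 3 features each.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))
x1 = rng.standard_normal((8, 3))
t = rng.uniform(size=8)

# A predictor that always outputs zero velocity incurs a positive loss,
# while an oracle that outputs the true path velocity incurs none.
zero_loss = flow_matching_loss(lambda x, tt: np.zeros_like(x), x0, x1, t)
oracle_loss = flow_matching_loss(lambda x, tt: x1 - x0, x0, x1, t)
```

In a real system `velocity_fn` would be a trained network operating on latent tokens; here it is a plain callable so the loss itself can be checked in isolation.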
Machine Learning Approaches to Clinical Risk Prediction: Multi-Scale Temporal Alignment in Electronic Health Records
Authors: Wei-Chen Chang, Lu Dai, Ting Xu
First: 2025-11-26T16:33:59+00:00 · Latest: 2025-11-26T16:33:59+00:00
Comments: 5 pages, 3 figures
Abstract
This study proposes a risk prediction method based on a Multi-Scale Temporal Alignment Network (MSTAN) to address the challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR). The method focuses on temporal feature modeling by introducing a learnable temporal alignment mechanism and a multi-scale convolutional feature extraction structure to jointly model long-term trends and short-term fluctuations in EHR sequences. At the input level, the model maps multi-source clinical features into a unified high-dimensional semantic space and employs temporal embedding and alignment modules to dynamically weight irregularly sampled data, reducing the impact of temporal distribution differences on model performance. The multi-scale feature extraction module then captures key patterns across different temporal granularities through multi-layer convolution and hierarchical fusion, achieving a fine-grained representation of patient states. Finally, an attention-based aggregation mechanism integrates global temporal dependencies to generate individual-level risk representations for disease risk prediction and health status assessment. Experiments conducted on publicly available EHR datasets show that the proposed model outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, demonstrating the effectiveness and robustness of multi-scale temporal alignment in complex medical time-series analysis. This study provides a new solution for intelligent representation of high-dimensional asynchronous medical sequences and offers important technical support for EHR-driven clinical risk prediction.
中文标题/摘要
标题:临床风险预测的机器学习方法:电子健康记录中的多尺度时间对齐
本研究提出了一种基于多尺度时间对齐网络(MSTAN)的风险预测方法,以解决电子健康记录(EHR)中时间不规则性、采样间隔差异和多尺度动态依赖性的挑战。该方法通过引入可学习的时间对齐机制和多尺度卷积特征提取结构,专注于时间特征建模,以共同建模EHR序列中的长期趋势和短期波动。在输入层面,模型将多源临床特征映射到统一的高维语义空间,并使用时间嵌入和对齐模块动态加权不规则采样数据,减少时间分布差异对模型性能的影响。多尺度特征提取模块通过多层卷积和分层融合捕获不同时间粒度下的关键模式,实现患者状态的精细表示。最后,基于注意力的聚合机制整合全局时间依赖性,生成个体级别的风险表示,用于疾病风险预测和健康状况评估。在公开的EHR数据集上进行的实验表明,所提出的模型在准确率、召回率、精确率和F1分数方面优于主流基线,证明了在复杂医学时间序列分析中多尺度时间对齐的有效性和鲁棒性。本研究为智能表示高维异步医疗序列提供了新的解决方案,并为基于EHR的临床风险预测提供了重要的技术支持。
Summary / 总结
This study introduces a Multi-Scale Temporal Alignment Network (MSTAN) to address temporal irregularities in Electronic Health Records (EHR). MSTAN uses a learnable temporal alignment mechanism and multi-scale convolution to model long-term trends and short-term fluctuations, and an attention-based aggregation mechanism to integrate global temporal dependencies. Experiments show that MSTAN outperforms existing methods in accuracy, recall, precision, and F1-Score, highlighting the effectiveness of multi-scale temporal alignment in EHR analysis.
该研究提出了一种多尺度时间对齐网络(MSTAN),以解决电子健康记录(EHR)中的时间不规律性和多尺度依赖性问题。该方法通过可学习的时间对齐机制和多尺度卷积来建模长期趋势和短期波动。实验结果显示,MSTAN在准确率、召回率、精确率和F1分数上均优于现有方法,突出了多尺度时间对齐在EHR分析中的有效性。
VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation
Authors: Hui Zhou, Siyuan Huang, Minxing Li, Hao Zhang, Lue Fan, Shaoshuai Shi
First: 2025-11-26T16:29:24+00:00 · Latest: 2025-11-26T16:29:24+00:00
Comments: 8 pages
Abstract
Vision-Language-Action (VLA) models have significantly advanced general-purpose robotic manipulation by harnessing large-scale pretrained vision and language representations. Most current VLA systems employ parallel two-finger grippers as their default end effectors. However, such grippers face inherent limitations in handling certain real-world tasks, such as wiping glass surfaces or opening drawers without handles, due to insufficient contact area or lack of adhesion. To overcome these challenges, we present a low-cost, integrated hardware design that combines a mechanical two-finger gripper with a vacuum suction unit, enabling dual-mode manipulation within a single end effector. Our system supports flexible switching or synergistic use of both modalities, expanding the range of feasible tasks. We validate the efficiency and practicality of our design within two state-of-the-art VLA frameworks: DexVLA and Pi0. Experimental results demonstrate that with the proposed hybrid end effector, robots can successfully perform multiple complex tasks that are infeasible for conventional two-finger grippers alone. All hardware designs and controlling systems will be released.
中文标题/摘要
标题:VacuumVLA:通过统一吸盘和夹持工具提升VLA能力以实现复杂机器人操作
视觉-语言-动作(VLA)模型通过利用大规模预训练的视觉和语言表示,显著提升了通用机器人操作的能力。大多数当前的VLA系统默认使用平行双指夹爪作为末端执行器。然而,这种夹爪在处理某些实际任务时存在固有局限,例如擦拭玻璃表面或打开无把手的抽屉,因为接触面积不足或缺乏附着力。为克服这些挑战,我们提出了一种低成本的集成硬件设计,结合了机械双指夹爪和真空吸附单元,使单一末端执行器内即可实现双模式操作。我们的系统支持灵活切换或协同使用这两种模式,从而扩大了可行任务的范围。我们在两个最先进的VLA框架DexVLA和Pi0中验证了该设计的效率和实用性。实验结果表明,使用提出的混合末端执行器,机器人可以成功完成仅靠传统双指夹爪无法完成的多个复杂任务。所有硬件设计和控制系统都将公开发布。
Summary / 总结
The research aims to enhance robotic manipulation capabilities by integrating a vacuum suction unit with a mechanical two-finger gripper, addressing limitations of traditional grippers in handling tasks like wiping glass surfaces or opening drawers. The study validates this hybrid end effector within DexVLA and Pi0 frameworks, showing improved performance on complex tasks that conventional grippers cannot handle. The design is low-cost and allows flexible switching between modes, expanding the range of feasible tasks.
研究旨在通过解决两指夹爪在擦拭玻璃表面或打开无把手抽屉等任务中的局限性,来提升机器人操作能力。研究引入了一种低成本的集成硬件设计,结合了机械两指夹爪和真空吸盘单元,使得单一末端执行器内可以实现两种模式的操作。在DexVLA和Pi0框架下的实验表明,这种混合末端执行器能够使机器人成功完成多个复杂任务,而这些任务仅靠传统夹爪是无法实现的。
Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization
Authors: Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte, Juergen Gall
First: 2025-11-23T02:58:10+00:00 · Latest: 2025-11-26T16:28:59+00:00
Abstract
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this setting, the models continuously observe new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
中文标题/摘要
标题:连续视频流中基于扩散噪声优化的序列自适应视频预测
在本研究中,我们探讨了基于扩散的视频预测模型,这些模型可以预测未来的视频帧,适用于连续视频流。在这种情况下,模型会不断观察新的训练样本,我们旨在利用这一点来改进其预测。因此,我们提出了一种方法,使预训练的扩散模型能够连续适应视频流。由于对大型扩散模型进行微调成本过高,我们在推理过程中优化扩散噪声,同时保持模型参数不变,使模型能够自适应地确定合适的采样噪声。我们称这种方法为序列自适应视频预测与扩散噪声优化(SAVi-DNO)。为了验证我们的方法,我们在Ego4D数据集上引入了一个新的评估设置,重点关注长连续视频的同时适应和评估。实验证明,SAVi-DNO在Ego4D、OpenDV-YouTube以及UCF-101和SkyTimelapse的长视频上,基于FVD、SSIM和PSNR指标的性能有所提升,展示了其有效性。
Summary / 总结
This work explores diffusion-based video prediction models for continuous video streams, proposing an approach called SAVi-DNO that continuously adapts a pre-trained diffusion model to new video samples. By optimizing the diffusion noise during inference, the model can adaptively determine suitable sampling noise without fine-tuning the parameters, leading to improved performance on long videos as measured by FVD, SSIM, and PSNR metrics compared to static models.
该研究探讨了基于扩散的视频预测模型在连续视频流中的应用,提出了一种称为SAVi-DNO的方法,该方法能够持续将预训练的扩散模型适应新的视频样本。通过在推理过程中优化扩散噪声,模型可以自适应地确定合适的采样噪声,而无需微调参数,从而在长视频上通过FVD、SSIM和PSNR指标展示了比静态模型更好的性能。
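The core idea described above, updating the sampling noise while the model parameters stay frozen, can be illustrated with a deliberately tiny stand-in. In the sketch below a fixed linear map `W` plays the role of the frozen generator and plain gradient descent adjusts only the noise vector `z`; the linear model, the squared-error objective, and all names are hypothetical simplifications and do not reproduce the actual SAVi-DNO procedure.

```python
import numpy as np

def optimize_noise(W, target, steps=200, lr=0.05, seed=0):
    """Toy noise optimization: the 'model' W is frozen; only the noise z is updated."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(W.shape[1])   # initial sampling noise
    losses = []
    for _ in range(steps):
        residual = W @ z - target         # prediction error of the frozen model
        losses.append(float(residual @ residual))
        grad = 2.0 * W.T @ residual       # gradient of the squared error w.r.t. z
        z = z - lr * grad                 # update the noise, never W
    return z, losses

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))
W = W / np.linalg.norm(W, 2)              # scale to spectral norm 1 so the step size is stable
W_before = W.copy()                       # kept to confirm the parameters are untouched
target = rng.standard_normal(4)           # hypothetical observed future frame (flattened)

z_opt, losses = optimize_noise(W, target)
```

After the loop, `losses` shows the prediction error shrinking as the noise adapts, and `W` is identical to `W_before`, which mirrors the frozen-parameter constraint in the abstract.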