arXiv Paper Digest (arXiv 论文速递)

2025-11-28 03:24
Snapshot: 20251128_0324
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Authors: Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang
First: 2025-11-26T18:59:56+00:00 · Latest: 2025-11-26T18:59:56+00:00
Comments: 24 pages; webpage: https://snap-research.github.io/canvas-to-image/
Abstract
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
中文标题/摘要
标题:画布到图像:多模态控制下的组合图像生成
虽然现代扩散模型在生成高质量和多样化图像方面表现出色,但在高保真组合和多模态控制方面仍然存在困难,尤其是在用户同时指定文本提示、主题参考、空间布局、姿态约束和布局注释时。我们引入了画布到图像,这是一种统一框架,将这些异构控制合并到一个画布界面中,使用户能够生成忠实反映其意图的图像。我们的主要想法是将各种控制信号编码到单个复合画布图像中,该图像可以直接被模型解释以进行综合的空间视觉推理。我们进一步整理了一组多任务数据集,并提出了一种多任务画布训练策略,该策略优化了扩散模型,使其能够在统一的学习框架内同时理解和整合异构控制到文本到图像生成中。这种联合训练使画布到图像能够在多个控制模态之间进行推理,而不是依赖于特定任务的启发式方法,并且在推理过程中能够很好地泛化到多控制场景。广泛的实验表明,画布到图像在多个具有挑战性的基准测试中,在身份保留和控制一致性方面显著优于最先进的方法,包括多人组合、姿态控制组合、布局约束生成和多控制生成。
Summary / 总结
Canvas-to-Image is a unified framework that integrates various multimodal controls into a single canvas interface for generating high-fidelity images. It encodes diverse control signals into a composite canvas image, enabling the model to perform integrated visual-spatial reasoning. Experiments demonstrate that Canvas-to-Image outperforms existing methods in identity preservation and control adherence across multiple benchmarks, including multi-person composition and layout-constrained generation.
研究旨在通过引入Canvas-to-Image,一种将各种控制信号整合到单一画布界面的统一框架,来提高图像生成中的组成和多模态控制。该方法包括将多样化的控制信号编码到复合画布图像中供模型解释,并提出了一种多任务画布训练策略来优化扩散模型。实验表明,Canvas-to-Image 在多个人物组成、布局受限生成等多种基准测试中,在身份保留和控制一致性方面优于现有方法。
TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Authors: Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang
First: 2025-11-26T18:59:55+00:00 · Latest: 2025-11-26T18:59:55+00:00
Abstract
Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments - humans and different robots - are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data problem by introducing a unifying, symbolic representation - a compact 3D "trace-space" of scene-level trajectories - that enables learning from cross-embodiment, cross-environment, and cross-task videos. We present TraceGen, a world model that predicts future motion in trace-space rather than pixel space, abstracting away appearance while retaining the geometric structure needed for manipulation. To train TraceGen at scale, we develop TraceForge, a data pipeline that transforms heterogeneous human and robot videos into consistent 3D traces, yielding a corpus of 123K videos and 1.8M observation-trace-language triplets. Pretraining on this corpus produces a transferable 3D motion prior that adapts efficiently: with just five target robot videos, TraceGen attains 80% success across four tasks while offering 50-600x faster inference than state-of-the-art video-based world models. In the more challenging case where only five uncalibrated human demonstration videos captured on a handheld phone are available, it still reaches 67.5% success on a real robot, highlighting TraceGen's ability to adapt across embodiments without relying on object detectors or heavy pixel-space generation.
中文标题/摘要
标题:TraceGen: 3D 轨迹空间中的世界建模使跨体态视频学习成为可能
仅从少量演示中在新平台上学习新机器人任务仍然具有挑战性。虽然存在其他体态(人类和不同机器人)的视频,但由于体态、摄像机和环境的差异,它们难以直接使用。我们通过引入一种统一的符号表示——场景级轨迹的紧凑3D“轨迹空间”来解决小数据问题,从而能够从跨体态、跨环境和跨任务视频中学习。我们提出了TraceGen,这是一种世界模型,它在轨迹空间而非像素空间中预测未来的运动,抽象掉了外观,但保留了用于操作所需的几何结构。为了大规模训练TraceGen,我们开发了TraceForge,一种数据管道,将异构的人类和机器人视频转换为一致的3D轨迹,生成了包含123,000个视频和1,800,000个观察-轨迹-语言三元组的语料库。在该语料库上进行预训练产生了一种可转移的3D运动先验,能够高效适应:仅用五个目标机器人视频,TraceGen在四个任务上的成功率达到了80%,同时比最先进的基于视频的世界模型快50-600倍的推理速度。在更具有挑战性的场景中,仅用五个手持手机拍摄的未校准的人类演示视频,它在真实机器人上仍能达到67.5%的成功率,突显了TraceGen在不依赖物体检测器或大量像素空间生成的情况下跨体态适应的能力。
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Authors: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
First: 2025-11-26T18:59:46+00:00 · Latest: 2025-11-26T18:59:46+00:00
Comments: 21 pages, 6 figures
Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.
中文标题/摘要
标题:ToolOrchestra:通过高效模型和工具编排提升智能
大型语言模型是强大的通用型模型,但解决诸如人类最后考试(HLE)这类深刻而复杂的问题仍然既在概念上具有挑战性,又在计算上昂贵。我们展示了通过小规模的编排器管理其他模型和各种工具,既可以推动智能的上限,又可以提高解决困难代理任务的效率。我们介绍了ToolOrchestra,一种用于训练小规模编排器的方法,这些编排器协调智能工具。ToolOrchestra 明确使用了带有结果意识、效率意识和用户偏好意识的强化学习奖励。使用 ToolOrchestra,我们生成了 Orchestrator,这是一种8B模型,其在较低成本下实现了更高的准确性,同时在给定查询时使用哪种工具方面与用户偏好保持一致。在HLE上,Orchestrator 达到了37.1%的得分,优于GPT-5(35.1%),同时效率提高了2.5倍。在tau2-Bench和FRAMES上,Orchestrator 仅使用约30%的成本就大幅超过了GPT-5。广泛的分析表明,Orchestrator 在多个指标下实现了性能和成本的最佳权衡,并且能够稳健地泛化到未见过的工具。这些结果表明,使用轻量级编排模型组合多种工具比现有方法更高效且更有效,为实用且可扩展的工具增强推理系统铺平了道路。
Summary / 总结
ToolOrchestra is a method for training small orchestrators that manage other models and tools to solve complex tasks efficiently. It uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards to improve accuracy and cost-effectiveness. Orchestrator, an 8B model trained with ToolOrchestra, scores 37.1% on Humanity's Last Exam versus 35.1% for GPT-5 (a gain of 2.0 percentage points) while being 2.5x more efficient. It also surpasses GPT-5 on tau2-Bench and FRAMES at only about 30% of the cost.
ToolOrchestra 是一种训练小型协调器的方法,这些协调器可以管理其他模型和工具以高效地解决复杂问题。它使用特定奖励的强化学习来提高准确性和成本效益。Orchestrator 模型在 Humanity's Last Exam 中得分更高且成本更低,同时在 tau2-Bench 和 FRAMES 上的表现也优于 GPT-5,且使用资源较少。这表明轻量级协调可以比现有方法更高效且更有效,为工具增强推理系统的实用性和可扩展性铺平了道路。
G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Authors: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
First: 2025-11-26T18:59:39+00:00 · Latest: 2025-11-26T18:59:39+00:00
Comments: Code is released at https://github.com/InternRobotics/G2VLM
Abstract
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
中文标题/摘要
标题:G$^2$VLM: 基于几何的视觉语言模型,统一三维重建与空间推理
视觉-语言模型(VLMs)在空间智能方面仍然缺乏稳健性,表现出在空间理解和推理任务上的较差性能。我们将其差距归因于缺乏一种能够从二维图像重建三维空间的视觉几何学习过程。我们提出了G$^2$VLM,这是一种基于几何的视觉语言模型,它连接了空间智能的两个基本方面:三维重建和空间理解。G$^2$VLM 原生地利用学习到的三维视觉几何特征,直接预测三维属性,并通过上下文学习和交替推理增强空间推理任务。我们的统一设计在空间理解方面具有高度可扩展性:它在丰富的多视角图像和视频数据上进行训练,同时利用通常仅从难以收集的注释中获得的三维视觉先验的好处。实验结果表明,G$^2$VLM 在两个任务上都表现出色,其三维重建性能与最先进的前馈三维重建模型相当,并且在空间理解和推理任务上取得了更好的或可竞争的结果。通过将语义强大的VLM与低级三维视觉任务统一起来,我们希望G$^2$VLM 能够为社区提供一个强大的基线,并解锁更多未来的应用,如三维场景编辑。
Summary / 总结
G$^2$VLM is designed to enhance spatial intelligence in Vision-Language Models by unifying 3D reconstruction and spatial reasoning. It uses learned 3D visual geometry features to predict 3D attributes and to improve spatial reasoning through in-context learning and interleaved reasoning. Experiments show results comparable to state-of-the-art feed-forward 3D reconstruction models and better or competitive results on spatial understanding and reasoning tasks. The authors hope G$^2$VLM can serve as a strong baseline for the community and unlock future applications such as 3D scene editing.
研究旨在通过解决视觉语言模型在空间理解与推理方面的不足,增强其空间智能。提出了G$^2$VLM,这是一种结合了3D重建和空间推理的几何导向模型。该模型利用3D视觉几何特征预测3D属性,并通过上下文学习和交替推理提高空间推理能力。实验表明,G$^2$VLM在3D重建和空间理解任务中表现出色,优于或匹配了最先进的模型。通过将语义强的视觉语言模型与低级3D视觉任务统一,G$^2$VLM有望成为未来应用如3D场景编辑的强基准。
Seeing without Pixels: Perception from Camera Trajectories
Authors: Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
First: 2025-11-26T18:57:01+00:00 · Latest: 2025-11-26T18:57:01+00:00
Comments: Project website: https://sites.google.com/view/seeing-without-pixels
Abstract
Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
中文标题/摘要
标题:无需像素:从相机轨迹感知视频内容
是否可以在不看到像素的情况下感知视频的内容,仅通过相机轨迹——它在空间中划过的路径?本文首次系统地探讨了这一看似不可能的问题。为此,我们提出了一种对比学习框架,用于训练CamFormer,这是一种专门的编码器,将相机姿态轨迹投影到联合嵌入空间中,并与自然语言对齐。我们发现,尽管看似简单,相机轨迹是一个非常有信息量的信号,可以揭示视频内容。换句话说,“你如何移动”确实可以揭示“你在做什么”(第一人称视角)或“你在观察什么”(第三人称视角)。我们展示了我们学习到的CamFormer嵌入在一系列下游任务上的通用性,从跨模态对齐到分类和时间分析。重要的是,我们的表示在各种相机姿态估计方法中具有鲁棒性,包括高保真的多传感器和标准RGB-only估计器。我们的研究结果确立了相机轨迹作为一种轻量级、鲁棒且通用的模态,用于感知视频内容。
Summary / 总结
This paper explores whether camera trajectories can be used to perceive video content without examining the pixel data. It introduces a contrastive learning framework to train CamFormer, which projects camera pose trajectories into a joint embedding space aligned with natural language. The study finds that camera trajectories are surprisingly informative for uncovering video content, suggesting that 'how you move' can reveal 'what you are doing' or 'observing.' The learned embeddings are versatile and robust across different camera pose estimation methods, demonstrating their effectiveness in various downstream tasks such as cross-modal alignment, classification, and temporal analysis.
本文探讨了是否可以通过相机轨迹来感知视频内容,而无需查看像素数据。研究引入了一种对比学习框架来训练CamFormer,将相机姿态轨迹投影到与自然语言对齐的联合嵌入空间。研究发现,相机轨迹对于揭示视频内容来说是出乎意料地有信息量,表明‘如何移动’可以揭示‘正在做什么’或‘正在观察什么’。学习到的嵌入在不同的相机姿态估计方法下具有鲁棒性,并在各种下游任务中表现出色,如跨模态对齐、分类和时间分析。
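The abstract above says only that CamFormer is trained with "a contrastive learning framework" that aligns trajectory embeddings with natural language. As a rough illustration of what such an objective can look like, here is a minimal sketch of a symmetric, CLIP-style contrastive loss in plain Python. The loss form, the temperature value, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against a single target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(traj_embs, text_embs, temperature=0.07):
    """Hypothetical loss: pair i of trajectories matches pair i of captions.

    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    n = len(traj_embs)
    logits = [
        [cosine_similarity(t, s) / temperature for s in text_embs]
        for t in traj_embs
    ]
    # Trajectory -> text direction: each row should peak on the diagonal.
    traj_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text -> trajectory direction: each column should peak on the diagonal.
    text_to_traj = sum(
        cross_entropy_row([logits[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (traj_to_text + text_to_traj)
```

With correctly paired embeddings the loss is close to zero, and it grows when captions are shuffled relative to their trajectories, which is the property a joint embedding space relies on.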
Agentic Learner with Grow-and-Refine Multimodal Semantic Memory
Authors: Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li
First: 2025-11-26T18:55:08+00:00 · Latest: 2025-11-26T18:55:08+00:00
Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
中文标题/摘要
标题:具有生长与精炼多模态语义记忆的代理学习者
MLLMs 在独立查询上表现出强大的推理能力,但它们是独立工作的——每次解决问题时都独立进行,经常重复同样的错误。现有的记忆增强代理主要存储过去的轨迹以便重用。然而,基于轨迹的记忆会受到简短偏差的影响,逐渐失去重要的领域知识。更严重的是,即使在真正的多模态问题解决环境中,它也只能记录过去行为的单一模态痕迹,未能保存视觉注意力和逻辑推理如何共同促成解决方案。这与人类认知从根本上不一致:语义记忆既是多模态的又是整合的,通过协调但独立的表示流保存视觉和抽象知识。因此,我们引入了ViLoMem,这是一种双流记忆框架,构建紧凑的基于模式的记忆。它分别编码视觉干扰模式和逻辑推理错误,使MLLMs能够从成功和失败的经验中学习。遵循生长与精炼的原则,系统逐步积累和更新多模态语义知识——保持稳定、可泛化的策略,同时避免灾难性遗忘。在六个多模态基准测试中,ViLoMem 一致地提高了 pass@1 准确率,并显著减少了重复的视觉和逻辑错误。消融实验确认了双流记忆的必要性,特别是明确的干扰-幻觉分离,展示了错误感知的多模态记忆对于终身和跨域代理学习的价值。我们的项目页面将在 https://weihao-bo.github.io/ViLoMeo-page/ 上提供。
Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models
Authors: Pandiyaraju V, Sreya Mynampati, Abishek Karthik, Poovarasan L, D. Saraswathi
First: 2025-11-26T18:51:46+00:00 · Latest: 2025-11-26T18:51:46+00:00
Abstract
Gliomas are brain tumors with a high mortality rate, so early and accurate diagnosis is important for therapeutic intervention. To address this difficulty, the proposed research develops a hybrid deep learning model that integrates U-Net based segmentation with a hybrid DenseNet-VGG classification network equipped with multi-head attention and spatial-channel attention. The segmentation model precisely demarcates tumors in a 3D volume of MRI data, guided by spatial and contextual information. The classification network, which combines a DenseNet branch and a VGG branch, takes the demarcated tumor as input and uses attention mechanisms to focus on clinically relevant features. High-dimensional 3D MRI data are prepared for the model through preprocessing steps: normalization, resampling, and data augmentation. The framework is evaluated with several measures: Dice coefficient and Mean Intersection over Union (IoU) for segmentation, and accuracy, precision, recall, and F1-score for classification. In testing, the proposed hybrid framework obtained a Dice coefficient of 98% in tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods. Multi-head attention prioritizes the clinically significant aspects of the tumor, improving interpretability and accuracy. The results suggest that the framework has great potential to help clinicians diagnose and grade glioma in a timely and reliable manner, allowing for better planning of patient treatment.
中文标题/摘要
标题:利用3D MRI引导的混合深度学习模型革新胶质瘤分割与分级
胶质瘤是具有高死亡率的脑肿瘤类型,早期和准确的诊断对于肿瘤的治疗干预至关重要。为解决这一难题,本研究将开发一种混合深度学习模型,该模型结合了基于U-Net的分割和具有多头注意力和空间-通道注意力能力的混合DenseNet-VGG分类网络。分割模型将在3D MRI数据的体积中,通过空间和上下文信息精确地标记肿瘤。结合DenseNet和VGG的分类网络将关注已标记的肿瘤上的特征,并通过注意力机制聚焦于临床相关特征。通过预处理步骤,如归一化、重采样和数据增强,高维3D MRI数据可以成功地用于模型中。通过多种措施评估框架:分割性能的度量包括Dice系数和平均交并比(IoU),分类性能的度量包括准确率、精确率、召回率和F1分数。所提出的混合框架通过物理测试表明,其在肿瘤分割中的Dice系数可达98%,分类准确率可达99%,优于传统CNN模型和无注意力机制的方法。多头注意力机制增强了在肿瘤临床显著方面的重要性的概念,提高了可解释性和准确性。结果表明,该框架在协助临床医生进行胶质瘤的及时和可靠诊断与分级方面具有巨大潜力,有助于更好地规划患者的治疗。
Summary / 总结
This research aims to improve the early and accurate diagnosis of gliomas, which have a high mortality rate, by developing a hybrid deep learning model. The model integrates U-Net for precise 3D MRI tumor segmentation and a hybrid DenseNet-VGG network with attention mechanisms for classification. The model achieves a Dice coefficient of 98% in segmentation and 99% in classification accuracy, outperforming traditional CNN models and attention-free methods. The use of multi-head attention enhances interpretability and accuracy, suggesting potential for timely and reliable glioma diagnosis and grading by clinicians.
该研究旨在提高对具有高死亡率的胶质瘤的早期和准确诊断。它提出了一种结合U-Net进行分割和具有注意力机制的DenseNet-VGG网络进行分类的混合深度学习模型。该模型在肿瘤分割中的Dice系数达到了98%,分类准确率达到了99%,优于传统CNN模型。多头注意力机制通过聚焦临床相关特征增强了模型的可解释性和准确性。
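The segmentation metrics named in the abstract have standard definitions: Dice is 2|A∩B| / (|A| + |B|) and IoU is |A∩B| / |A∪B| for a predicted mask A and a ground-truth mask B. The sketch below computes both for flattened binary masks. The function names and the toy masks are illustrative only, and the convention of returning 1.0 when both masks are empty is an assumption; the paper's own evaluation code is not shown in the source.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def iou(pred, target):
    """IoU = |A ∩ B| / |A ∪ B| for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    if union == 0:
        return 1.0  # assumed convention, matching dice_coefficient
    return intersection / union


# Toy example: six voxels, two of which are labelled tumor in both masks.
predicted_mask = [1, 1, 1, 0, 0, 0]
ground_truth_mask = [0, 1, 1, 1, 0, 0]
dice_score = dice_coefficient(predicted_mask, ground_truth_mask)  # 4/6
iou_score = iou(predicted_mask, ground_truth_mask)                # 2/4
```

For a single mask pair the two are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. A "mean IoU" is obtained by averaging the per-class or per-case IoU values.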
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Authors: Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang
First: 2025-11-26T18:47:25+00:00 · Latest: 2025-11-26T18:47:25+00:00
Abstract
Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.
中文标题/摘要
标题:DSD:边缘-云敏捷大型模型服务的分布式推测性解码解决方案
大型语言模型(LLM)推理经常遭受高解码延迟和跨异构边缘-云环境的有限扩展性。现有推测性解码(SD)技术加速了标记生成,但仍然局限于单节点执行。我们提出DSD,这是一种分布式推测性解码框架,通过协调草稿目标执行将SD扩展到多设备部署。鉴于缺乏模拟此范式的先例工作,我们首先引入DSD-Sim,这是一种离散事件模拟器,可以捕捉网络、批处理和调度动态。基于DSD-Sim的见解,我们进一步设计了一种自适应窗口控制(AWC)策略,动态调整推测窗口大小以优化吞吐量。跨不同工作负载的实验表明,DSD在现有SD基线上的速度提升高达1.1倍,吞吐量提高9.7%,从而在边缘和云中实现敏捷和可扩展的LLM服务。
Summary / 总结
The research aims to address the high latency and scalability issues of large language model (LLM) inference in edge-cloud environments. DSD is a distributed speculative decoding framework that extends speculative decoding techniques to multi-device deployments through coordinated draft-target execution. The study introduces DSD-Sim, a discrete-event simulator, to simulate the new paradigm and designs an Adaptive Window Control (AWC) policy to optimize throughput. Experiments demonstrate that DSD achieves up to 1.1x speedup and 9.7% higher throughput compared to existing speculative decoding baselines, enabling agile and scalable LLM serving across edge and cloud environments.
研究旨在解决大型语言模型(LLM)推理在边缘-云环境中的高解码延迟和有限扩展性问题。作者提出了DSD,这是一种分布式推测性解码框架,将推测性解码技术扩展到多设备部署。DSD 包括一个模拟器 DSD-Sim 和一个自适应窗口控制(AWC)策略以优化吞吐量。实验表明,DSD 达到了高达 1.1 倍的加速和 9.7% 的更高吞吐量,增强了边缘和云环境中 LLM 服务的灵活性和扩展性。
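The abstract states that the Adaptive Window Control (AWC) policy "dynamically adjusts speculation window size to optimize throughput" but does not give the rule. To make the idea concrete, here is a minimal sketch of one generic heuristic: widen the window when the target model accepts most drafted tokens, and narrow it when it rejects most of them. The thresholds, bounds, and function name are hypothetical and should not be read as the paper's AWC policy.

```python
def adjust_window(window, accepted, proposed,
                  min_window=1, max_window=16,
                  grow_at=0.8, shrink_at=0.4):
    """Return the next speculation window size.

    Hypothetical heuristic (not the paper's AWC): all thresholds and
    bounds are illustrative defaults.
    """
    if proposed <= 0:
        return window  # nothing was drafted, so there is no signal
    acceptance_rate = accepted / proposed
    if acceptance_rate >= grow_at:
        window += 1  # drafts are mostly accepted: speculate further ahead
    elif acceptance_rate <= shrink_at:
        window -= 1  # drafts are mostly rejected: waste less draft compute
    return max(min_window, min(max_window, window))
```

In an edge-cloud setting a real policy would presumably also weigh network latency and batching, which the abstract says DSD-Sim models; this sketch deliberately ignores them.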
Escaping the Verifier: Learning to Reason via Demonstrations
Authors: Locke Cai, Ivan Provilkov
First: 2025-11-26T18:42:52+00:00 · Latest: 2025-11-26T18:42:52+00:00
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
中文标题/摘要
标题:超越验证者:通过示范学习推理
训练大型语言模型(LLMs)进行推理通常依赖于特定任务的强化学习(RL)和验证器。然而,许多实际推理密集型任务缺乏验证器,尽管提供了大量未充分利用的专家示范。我们引入了RARO(相对对抗推理优化),通过逆向强化学习仅从专家示范中学习强大的推理能力。我们的方法设置了一种对抗交互,其中策略(生成器)和相对批评家(判别器)之间进行对抗:策略学习模仿专家答案,而批评家学习比较和区分策略和专家答案。我们的方法通过RL联合和连续训练策略和批评家,并确定了实现稳健学习所需的关键稳定技术。实验中,RARO在我们的所有评估任务——Countdown、DeepMath和诗歌创作——中显著优于强大的无验证器基线,并且享受与验证任务上RL相同的稳健扩展趋势。这些结果表明,我们的方法能够仅从专家示范中有效引发强大的推理性能,即使在特定任务验证器不可用时也能实现稳健的推理学习。
Summary / 总结
The paper introduces RARO (Relativistic Adversarial Reasoning Optimization), a method that uses Inverse Reinforcement Learning to train Large Language Models to reason effectively from expert demonstrations, especially in tasks lacking specific verifiers. RARO sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator) to learn reasoning capabilities. The method significantly outperforms verifier-free baselines on tasks like Countdown, DeepMath, and Poetry Writing, showing robust scaling trends similar to RL on verifiable tasks.
论文提出了RARO(相对对抗推理优化)方法,通过逆强化学习利用专家演示来训练大型语言模型进行有效的推理,特别是在缺乏特定验证器的任务中。RARO 设置了一个对抗交互,其中策略(生成器)和相对批评者(判别器)对抗学习推理能力。该方法在 Countdown、DeepMath 和诗歌写作等任务上显著优于无验证器基线,显示出与验证任务中 RL 相似的稳健扩展趋势。
Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models
Authors: Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang
First: 2025-11-26T18:37:54+00:00 · Latest: 2025-11-26T18:37:54+00:00
Abstract
In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.
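Two ingredients named in the abstract, Top-K masking and an $L_{\infty}$ bound of 4/255, can be illustrated independently of the rest of the attack. The sketch below selects the K patches with the highest attention scores and adds a perturbation, clipped to the bound, only to those patches. How ADVLA computes attention scores or the perturbation itself is not described in the source, so both are plain inputs here; all names are hypothetical.

```python
def top_k_patch_mask(attention_scores, k):
    """Return a 0/1 mask selecting the k patches with the highest scores."""
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(attention_scores))]


def apply_sparse_perturbation(patches, deltas, mask, epsilon=4 / 255):
    """Add an L-infinity-bounded delta to masked patches only.

    `patches` and `deltas` are lists of equal-length value lists.
    Unmasked patches are returned unchanged.
    """
    perturbed = []
    for patch, delta, selected in zip(patches, deltas, mask):
        if selected:
            perturbed.append([
                x + max(-epsilon, min(epsilon, d))  # clip each delta to the bound
                for x, d in zip(patch, delta)
            ])
        else:
            perturbed.append(list(patch))
    return perturbed
```

With K set to under 10% of the patch count, the mask reproduces the sparsity level the abstract reports, while the clipping keeps every modified value within 4/255 of its original.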
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Authors: Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
First: 2025-11-26T18:35:17+00:00 · Latest: 2025-11-26T18:35:17+00:00
Abstract
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria--especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
中文标题/摘要
标题:多标准:在多元评价标准下评估多模态评判者
大型多模态模型(LMMs)因其强大的指令遵循能力和与人类偏好的一致性,越来越多地被用作多模态评估系统中的评判者。然而,它们在遵循多样且细致的评价标准方面的能力仍被忽视。我们开发了Multi-Crit,这是一个用于评估多模态评判者在遵循多元评价标准和产生可靠标准级判断方面能力的基准。它涵盖了开放生成和可验证推理任务,通过严格的数据整理管道收集具有多标准人类注释的具有挑战性的响应对。此外,它还引入了三个新的度量标准,以系统地评估多元一致性、标准切换灵活性以及识别标准级偏好冲突的能力。对25个LMMs的全面分析表明:1)专有模型在开放评价中仍然难以保持一致的多元一致性;2)开源模型在灵活遵循多种标准方面落后更多;3)使用整体判断信号进行批评微调可以增强视觉定位,但无法推广到多元标准级判断。对推理微调、测试时缩放以及开源和专有模型之间边界一致性进一步的分析进一步探索了当前多模态评判者的极限。作为一项开创性研究,Multi-Crit为构建可靠和可控的多模态AI评估奠定了基础。
Summary / 总结
The study aims to evaluate the ability of large multimodal models (LMMs) to follow diverse evaluation criteria in multimodal systems. It introduces Multi-Crit, a benchmark that assesses LMMs on pluralistic criteria and introduces new metrics for criterion-level judgment. The analysis of 25 LMMs shows that proprietary models struggle with consistent adherence to criteria, especially in open-ended tasks, while open-source models are less flexible in following diverse criteria. Critic fine-tuning improves visual grounding but does not generalize to pluralistic judgments. The study provides insights into the limitations of current multimodal models and sets a foundation for more reliable and steerable AI evaluation systems.
研究旨在评估大型多模态模型(LMMs)在多模态系统中遵循多样评价标准的能力。研究引入了Multi-Crit基准,评估LMMs在多元标准下的表现,并提出了新的评价多元标准的指标。对25个LMMs的分析显示,专有模型在开放性任务中难以保持对标准的一致性遵守,而开源模型在遵循多样标准方面灵活性较低。批评微调虽然改善了视觉定位,但无法泛化到多元标准判断。研究提供了当前多模态模型的局限性见解,并为更可靠和可控的AI评估系统奠定了基础。
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Authors: Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin
First: 2025-11-20T17:48:21+00:00 · Latest: 2025-11-26T18:30:04+00:00
Comments: Project page: https://xuboshen.github.io/TimeViper; Code: https://github.com/xiaomi-research/timeviper
Abstract
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
中文标题/摘要
标题:TimeViper:一种混合Mamba-Transformer视觉语言模型,用于高效理解长视频
我们介绍了TimeViper,一种混合视觉语言模型,旨在解决长视频理解的挑战。处理长视频需要高效的模型架构和有效的机制来处理扩展的时间上下文。为此,TimeViper采用了一种混合Mamba-Transformer骨干,结合了状态空间模型的效率和注意力机制的表达能力。通过这种混合设计,我们揭示了视觉到文本信息聚合的现象,其中信息随着LLM深度增加,逐步从视觉标记流向文本标记,导致视觉标记冗余严重。受此观察的启发,我们提出了TransV,一种标记信息传输模块,将视觉标记转换并压缩为指令标记,同时保持多模态理解能力。这种设计使TimeViper能够处理超过10,000帧的长达一小时的视频。在多个基准上的广泛实验表明,TimeViper在与最先进的模型竞争的同时,扩展了帧数。我们进一步分析了Mamba和Transformer层的注意力行为,提供了关于混合模型可解释性的新见解。这项工作代表了开发、解释和压缩混合Mamba-Transformer架构的初步步骤。
Summary / 总结
TimeViper is a hybrid Mamba-Transformer model designed for efficient long video understanding. It combines the efficiency of state-space models with the expressivity of attention mechanisms. The model reveals a vision-to-text information aggregation phenomenon and proposes TransV, a token information transfer module, to compress vision tokens into instruction tokens while maintaining multimodal understanding. Experiments show that TimeViper can process hour-long videos and competes with state-of-the-art models. The work provides insights into hybrid model interpretability and represents an initial step towards compressing hybrid Mamba-Transformer architectures.
TimeViper 是一种结合了状态空间模型效率和注意力机制表达性的混合 Mamba-Transformer 模型,旨在高效理解长视频。该模型揭示了视觉到文本信息聚合的现象,并提出 TransV,一种 token 信息传输模块,将视觉 token 压缩为指令 token 同时保持多模态理解能力。实验表明,TimeViper 可以处理超过 10,000 帧的小时级视频,并与最先进的模型竞争。该工作还分析了注意力行为,提供了对混合模型可解释性的新见解。
EvilGenie: A Reward Hacking Benchmark
Authors: Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
First: 2025-11-26T18:27:17+00:00 · Latest: 2025-11-26T18:27:17+00:00
Abstract
We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI, using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.
中文标题/摘要
标题:EvilGenie:奖励劫持基准
我们介绍了EvilGenie,一个用于编程环境中的奖励劫持基准。我们从LiveCodeBench获取问题,并创建了一个环境,使代理可以轻松地进行奖励劫持,例如通过硬编码测试案例或编辑测试文件。我们通过三种方式衡量奖励劫持:保留的单元测试、LLM裁判和测试文件编辑检测。我们验证了这些方法与人工审查和其他方法的一致性。我们发现LLM裁判在明确的情况下非常有效于检测奖励劫持,并观察到保留的测试案例的使用仅带来微小的改进。除了使用Inspect的基本代理框架测试许多模型外,我们还测量了三种流行的专有编码代理的奖励劫持率:OpenAI的Codex、Anthropic的Claude Code和Google的Gemini CLI,分别使用GPT-5、Claude Sonnet 4和Gemini 2.5 Pro。我们观察到Codex和Claude Code存在明确的奖励劫持行为,而所有三个代理都表现出不一致的行为。我们的代码库可以在https://github.com/JonathanGabor/EvilGenie/找到。
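Of the three measurement methods named in the abstract, test file edit detection is the easiest to illustrate. The following is a minimal sketch, assuming detection is done by hashing the test directory before and after the agent runs; the function names `snapshot` and `detect_test_edits` are hypothetical, and nothing here is taken from the EvilGenie codebase.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(test_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to test_dir) to the SHA-256 of its bytes."""
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("*"))
        if p.is_file()
    }


def detect_test_edits(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots and report which test files changed."""
    return {
        # Present in both snapshots but with different content hashes.
        "modified": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
        # Present before the agent ran, missing afterwards.
        "deleted": sorted(before.keys() - after.keys()),
        # Created by the agent inside the test directory.
        "added": sorted(after.keys() - before.keys()),
    }
```

A harness under these assumptions would call `snapshot` once before handing control to the agent and once after, then flag the run if any of the three lists is non-empty. Hashing only catches edits to files; hardcoded test cases in the solution itself would still need the held out tests or an LLM judge.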
CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow
Authors: Ruisheng Han, Kanglei Zhou, Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum
First: 2025-11-26T18:25:41+00:00 · Latest: 2025-11-26T18:25:41+00:00
Abstract
Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow
中文标题/摘要
标题:CaFlow:通过因果反事实流增强长期动作质量评估
动作质量评估(AQA)从动作视频中预测细粒度执行评分,并广泛应用于体育、康复和技能评估。长期AQA,如花样滑冰或体操,尤其具有挑战性,因为它需要建模长期时间动态,同时保持对上下文混杂因素的鲁棒性。现有方法要么依赖昂贵的注释,要么依赖单向时间建模,使其容易受到虚假相关性和长期表示不稳定的威胁。为此,我们提出CaFlow,这是一种统一框架,将反事实去混杂与双向时间条件流相结合。因果反事实正则化(CCR)模块以自监督方式分离因果和混杂特征,并通过反事实干预增强因果稳健性,而BiT-Flow模块在循环一致性约束下建模前向和后向动力学,以生成更平滑和更连贯的表示。在多个长期AQA基准上的广泛实验表明,CaFlow达到了最先进的性能。代码可在https://github.com/Harrison21/CaFlow 获取
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Authors: Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu
First: 2025-11-26T18:12:16+00:00 · Latest: 2025-11-26T18:12:16+00:00
Comments: 12 pages, 2 figures
Abstract
Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
中文标题/摘要
标题:使用迭代PPO使大语言模型朝多轮对话结果靠拢
优化大型语言模型(LLMs)以实现多轮对话结果仍然是一个重大挑战,尤其是在像AI营销或通过消息平台促进交易的销售代理这样的目标导向环境中。这一挑战源于稀疏的、长期的奖励以及响应级规划与标记级生成之间的不一致。在本技术说明中,我们提出了一种将多轮强化学习问题形式化为一系列单轮RLHF风格问题的方法。这通过将学习到的多轮Q函数作为单轮问题的奖励模型来实现。我们展示了并证明了一个关键见解:使用标准标记级PPO解决这个单轮RL问题等同于多轮问题中的策略改进步骤。这一见解自然地引出了迭代PPO,这是一种批在线策略迭代算法,交替进行从记录的对话轨迹拟合Q函数和改进策略。一个主要的实际优势是,迭代PPO直接利用了稳定的单轮RLHF工具,使其易于实现。我们的方法介于完全在线和完全离线方法之间,保留了在线更新的适应性,同时获得了离线训练的稳定性优势。
Summary / 总结
This paper addresses the challenge of optimizing large language models for multi-turn conversational outcomes, particularly in goal-oriented settings. It proposes Iterative PPO, an algorithm that reduces the multi-turn RL problem into a sequence of single-turn RLHF-style problems. The key insight is that solving these single-turn problems with standard token-level PPO is equivalent to a policy improvement step in the multi-turn problem. The Iterative PPO algorithm alternates between fitting Q-functions from logged conversation trajectories and improving the policy, leveraging stable, off-the-shelf single-turn RLHF tools for implementation. This method offers a balance between online and offline approaches, providing adaptability and stability benefits.
研究旨在优化大型语言模型(LLMs)以实现目标导向的多轮对话结果,解决稀疏奖励和响应级规划与标记级生成不匹配的挑战。方法是将多轮RL问题简化为一系列单轮RLHF问题,使用学习到的多轮Q函数作为奖励模型。关键发现是,迭代PPO,一种在线策略迭代算法,通过交替拟合Q函数和改进策略来有效提升策略,利用稳定的单轮RLHF工具进行实施。
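The reduction described in the abstract can be written out in a few lines. This is a sketch in assumed notation, not the paper's own: $s_t$ is the conversation history before turn $t$, $a_t$ is the agent's full response at that turn, $\pi_k$ is the current policy, and $\gamma$ is a discount factor. It also assumes the idealized case in which the single-turn problem is solved exactly, ignoring any KL regularization that a practical PPO setup may add.

```latex
% Assumed notation: s_t = history before turn t, a_t = full response at turn t,
% \pi_k = current policy, Q^{\pi_k} = its multi-turn action-value function.
\begin{align}
Q^{\pi_k}(s_t, a_t) &= \mathbb{E}_{\pi_k}\!\left[\sum_{i \ge t} \gamma^{\,i-t} r_i \,\middle|\, s_t, a_t\right]
  && \text{(fit from logged trajectories)} \\
\pi_{k+1}(\cdot \mid s) &\in \arg\max_{\pi(\cdot \mid s)} \;
  \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s, a) \right]
  && \text{(single-turn problem with reward } Q^{\pi_k}\text{)} \\
V^{\pi_{k+1}}(s) &\ge V^{\pi_k}(s) \quad \text{for all } s
  && \text{(policy improvement)}
\end{align}
```

Read this way, the second line is the step that single-turn RLHF tooling handles, with the learned Q-function standing in for the reward model, and alternating the first two lines gives the batch policy iteration loop the abstract calls Iterative PPO.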
Bridging the Unavoidable A Priori: A Framework for Comparative Causal Modeling
Authors: Peter S. Hovmand, Kari O'Donnell, Callie Ogland-Hand, Brian Biroscak, Douglas D. Gunzler
First: 2025-11-26T18:08:20+00:00 · Latest: 2025-11-26T18:08:20+00:00
Comments: Presented at 43rd Conference of the International System Dynamics Society in Boston, United States
Abstract
AI/ML models have rapidly gained prominence, both as innovations for solving previously unsolved problems and for their unintended consequences from amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadows' "the unavoidable a priori"). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.
中文标题/摘要
标题:跨越不可避免的先验:比较因果建模的框架
AI/ML模型迅速成为解决先前未解决的问题及其意外后果的创新方法,这些后果包括放大人类偏见。负责任的AI/ML倡导者寻求利用系统动力学中更丰富的因果模型来更好地指导负责任的AI/ML开发。然而,推进这项工作的主要障碍是难以将基于不同基本假设的方法(即Dana Meadow的“不可避免的先验”)结合起来。本文将系统动力学和结构方程建模结合到一个共同的数学框架中,可以用于生成系统、开发方法并比较结果,以指导系统动力学的数据科学和AI/ML应用的基本认识论。
Summary / 总结
The paper addresses the challenge of integrating different causal modeling methods, particularly system dynamics and structural equation modeling, to better understand and mitigate biases in AI/ML systems. It proposes a unified framework that allows for the comparison of causal models and the development of more responsible AI/ML. Key findings include the ability to generate systems from distributions and compare results, which informs the epistemology of system dynamics for data science and AI/ML applications.
论文旨在解决将系统动力学和结构方程建模等不同因果建模方法整合在一起的难题,以更好地理解和减轻AI/ML系统中的偏见。它提出了一种统一框架,可以生成系统并比较结果,从而为数据科学和AI/ML应用中的系统动力学提供知识论基础。主要发现包括能够从分布生成系统并比较结果。
The Impossibility of Inverse Permutation Learning in Transformer Models
Authors: Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah
First: 2025-09-28T23:48:11+00:00 · Latest: 2025-11-26T18:02:39+00:00
Abstract
In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original ("canonical") string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with "scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate "thinking" tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).
中文标题/摘要
标题:变换器模型中逆排列学习的不可能性
在本技术注记中,我们研究了仅解码器变换器中的逆排列学习问题。给定一个排列和该排列已应用于的字符串,模型的任务是生成原始(“标准”)字符串。我们提出,该任务可以模拟各种推理任务中的自然鲁棒性,包括长上下文检索、多项选择问答和上下文学习。我们的主要贡献是一个不可能性结果:我们证明,任意深度的仅解码器变换器无法学习此任务。该结果关注仅解码器变换器模型的表达能力,与训练动力学或样本复杂性无关。我们给出了两种替代构造,其中逆排列学习是可行的。第一个构造突显了因果注意力掩码的基本作用,并揭示了编码器-解码器变换器与更流行的仅解码器架构在表达能力上的差距。后者结果更为令人惊讶:我们证明,简单地用“草稿标记”填充输入可以产生一种逆排列学习可行的构造。我们推测,这可能表明链式思考提示或更广泛地说,中间“思考”标记如何在大型语言模型中启用推理的一种替代机制,即使这些标记编码了无意义的语义信息(例如,中间计算的结果)。
Summary / 总结
This technical note investigates the inverse permutation learning problem in decoder-only transformers, where the model is given a permutation and the string to which it has been applied and must recover the original string. The authors show that arbitrary-depth decoder-only transformers cannot learn this task, highlighting the expressive limitations of such models. They also provide two alternative constructions: one emphasizing the role of the causal attention mask and another showing that padding the input with "scratch tokens" can enable inverse permutation learning, suggesting a potential mechanism for reasoning in large language models even when such tokens carry no meaningful semantic information.
该技术笔记研究了仅解码器变换器中的逆排列学习问题,目标是在给定一个排列及经该排列作用后的字符串的情况下恢复原始字符串。作者证明了任意深度的仅解码器变换器无法学习此任务,突显了其表达能力的局限性。他们还提供了两种替代构造,其中一种强调因果注意力掩码的作用,另一种使用“草稿标记”(scratch tokens)填充输入,这可能表明通过链式思考提示或更广泛的中间“思考”标记,即使这些标记不包含有意义的语义信息(例如中间计算的结果),也可以在大型语言模型中实现推理的潜在机制。
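To make the task itself concrete, here is a minimal sketch of what inverse permutation learning asks the model to compute. It assumes the convention that position `i` of the permuted string holds character `perm[i]` of the original. The paper's exact input encoding is not given in the abstract, so that convention and the function names are illustrative only.

```python
from typing import List


def apply_permutation(s: str, perm: List[int]) -> str:
    """Return the string whose i-th character is s[perm[i]]."""
    return "".join(s[j] for j in perm)


def invert_permutation(perm: List[int]) -> List[int]:
    """Return inv such that inv[perm[i]] == i for every i."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return inv


def recover_original(permuted: str, perm: List[int]) -> str:
    """Undo the permutation: the target output of the task."""
    return apply_permutation(permuted, invert_permutation(perm))


original = "abcd"
perm = [2, 0, 3, 1]
permuted = apply_permutation(original, perm)    # "cadb"
restored = recover_original(permuted, perm)     # "abcd"
```

The computation is trivial for a program with random access to its input. The impossibility result concerns whether a decoder-only transformer can express it, not whether it is hard in general.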
Qwen3-VL Technical Report
Authors: Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu
First: 2025-11-26T17:59:08+00:00 · Latest: 2025-11-26T17:59:08+00:00
Comments: 42 pages
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
中文标题/摘要
标题:Qwen3-VL 技术报告
我们介绍了Qwen系列中最强大的视觉-语言模型Qwen3-VL,它在多种跨模态基准测试中表现出色。该模型原生支持多达256K个令牌的交错上下文,无缝集成文本、图像和视频。模型系列包括密集型(2B/4B/8B/32B)和专家混合型(30B-A3B/235B-A22B)变体,以适应不同的延迟-质量权衡。Qwen3-VL 提供三个核心支柱:(i)显著增强的纯文本理解,在某些情况下超越了可比的纯文本骨干模型;(ii)强大的长上下文理解,具有原生256K个令牌窗口,适用于文本和交错多模态输入,能够忠实保留、检索和跨长文档和视频进行交叉引用;(iii)跨单图像、多图像和视频任务的高级多模态推理,展示了在MMMU和视觉数学基准测试(如MathVista和MathVision)中的领先性能。从架构上看,我们引入了三个关键升级:(i)增强的交错-MRoPE,以增强图像和视频中的空间-时间建模;(ii)DeepStack集成,有效利用多级ViT特征以加强视觉-语言对齐;(iii)基于文本的时间对齐,从T-RoPE发展为显式的文本时间戳对齐,以实现更精确的时间定位。在可比的令牌预算和延迟约束下,Qwen3-VL 在密集型和专家混合型(MoE)架构中均表现出色。我们设想Qwen3-VL 将成为图像驱动推理、自主决策和多模态代码智能在实际工作流程中的基础引擎。
Summary / 总结
Qwen3-VL is the most advanced vision-language model in the Qwen series, enhancing pure-text understanding, long-context comprehension, and multimodal reasoning. It supports up to 256K tokens and includes dense and mixture-of-experts variants. Key upgrades include enhanced interleaved-MRoPE, DeepStack integration, and text-based time alignment for improved performance in benchmarks like MMMU and visual-math tasks.
Qwen3-VL 是 Qwen 系列中最先进的视觉语言模型,增强了纯文本理解、长上下文理解和多模态推理能力。它支持最多 256K 令牌,并包括密集型和混合专家型变体。关键架构升级包括增强的交织-MRoPE、DeepStack 集成和基于文本的时间对齐以更好地对齐视频。Qwen3-VL 在各种基准测试中表现出色,并旨在用于图像基础推理、自主决策和多模态代码智能等实际应用场景。
Scale-Agnostic Kolmogorov-Arnold Geometry in Neural Networks
Authors: Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams
First: 2025-11-26T17:52:05+00:00 · Latest: 2025-11-26T17:52:05+00:00
Abstract
Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits. We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.
中文标题/摘要
标题:无尺度依赖的柯莫哥洛夫-阿诺尔德几何学在神经网络中的应用
弗里德曼和穆利根近期的工作表明,浅层多层感知机在对合成三维任务进行训练时会自发地发展出柯莫哥洛夫-阿诺尔德几何(KAG)结构。然而,尚不清楚这种现象是否在现实中的高维环境中持续存在,以及这种几何结构表现出什么样的空间特性。 我们使用2层MLP对MNIST手写数字分类任务(784维)进行了KAG分析,并进行了系统的多尺度空间分析。我们发现,KAG在训练过程中出现,并且在从局部7像素邻域到整个28x28图像的多个尺度上都表现出一致性。这种无尺度依赖的特性在不同的训练方法下保持不变:无论是标准训练还是带有空间增强的训练,都会产生相同的定性模式。这些发现揭示了神经网络在学习现实中的高维数据时会自发地发展出有组织的、尺度不变的几何结构。
Summary / 总结
The study investigates whether the Kolmogorov-Arnold geometric (KAG) structure observed in shallow multilayer perceptrons on synthetic tasks also appears in high-dimensional real-world data. Using 2-layer MLPs on the MNIST dataset, the research finds that KAG emerges during training and is consistent across various spatial scales and training methods. This indicates that neural networks develop organized, scale-invariant geometric structures during learning on complex data.
该研究将Kolmogorov-Arnold几何(KAG)结构的分析从合成任务扩展到MNIST手写数字分类问题,这是一个高维设置。使用2层多层感知机,研究显示KAG结构在训练过程中出现,并且在不同的空间尺度和训练方法下保持一致。研究结果表明,神经网络在学习真实高维数据时会自发地发展出有组织的、尺度不变的几何模式。
Active Learning for GCN-based Action Recognition
Authors: Hichem Sahbi
First: 2025-11-26T17:51:59+00:00 · Latest: 2025-11-26T17:51:59+00:00
Abstract
Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.
中文标题/摘要
标题:基于GCN的动作识别主动学习
尽管图卷积网络(GCNs)在基于骨架的动作识别方面取得了显著的成功,但其性能往往依赖于大量标记数据,而在实际应用中这类数据往往稀缺。为解决这一限制,我们提出了一种新型的标签高效GCN模型。我们的工作主要做出了两项贡献。首先,我们开发了一种新的获取函数,采用对抗策略来识别一组具有信息性的示例用于标记。这一选择过程平衡了代表性、多样性和不确定性。其次,我们引入了双向和稳定的GCN架构。这些增强的网络有助于更有效地在环境数据空间和潜在数据空间之间进行映射,从而更好地理解学习到的示例分布。在两个具有挑战性的基于骨架的动作识别基准上的广泛评估表明,我们的标签高效GCNs相较于先前的工作取得了显著的改进。
Summary / 总结
The research aims to improve the efficiency of labeled data usage in graph convolutional networks (GCNs) for skeleton-based action recognition. It introduces a novel acquisition function using an adversarial strategy to select informative exemplars for labeling, balancing representativeness, diversity, and uncertainty. The study also proposes bidirectional and stable GCN architectures to enhance the mapping between ambient and latent data spaces. Experimental results on two benchmarks show significant performance improvements over previous methods.
该论文针对基于骨架的动作识别中图卷积网络(GCNs)因缺乏标注数据而表现不佳的问题,提出了一种新的标签高效GCN模型。该模型通过对抗策略选择具有信息性的示例,并平衡了代表性、多样性和不确定性。此外,模型还采用了双向和稳定的GCN架构,以更好地在环境数据空间和潜在数据空间之间进行映射。实验结果表明,该模型在两个基准测试中显著优于先前的方法。
TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding
Authors: Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das
First: 2025-11-24T20:57:31+00:00 · Latest: 2025-11-26T17:43:31+00:00
Abstract
Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.
中文标题/摘要
标题:TREASURE:一种基于Transformer的基础模型,用于海量交易理解
支付网络构成了现代商业的骨干,每天产生大量的交易记录。正确建模这些数据可以实现异常行为检测和消费者级别的洞察,从而改善人们的生活。在本文中,我们介绍了TREASURE,一种用于交易数据的多功能基于Transformer的基础模型,TRansformer Engine As Scalable Universal transaction Representation Encoder。该模型同时捕捉消费者行为和支付网络信号(如响应代码和系统标志),为准确推荐系统和异常行为检测等应用提供全面信息。通过工业级数据集验证,TREASURE具有三个关键能力:1)具有针对静态和动态属性的专用子模块的输入模块,实现更高效的训练和推理;2)高效的训练范式,用于预测高基数分类属性;3)作为独立模型,异常行为检测性能提高111%,作为嵌入提供者,推荐模型性能提高104%。我们通过广泛的消融研究、与生产模型的基准测试和案例研究,展示了开发TREASURE获得的重要见解。
Summary / 总结
TREASURE is a transformer-based foundation model designed for high-volume transaction data, aiming to improve applications like abnormal behavior detection and personalized experiences. It captures both consumer behavior and payment network signals through a dedicated input module and efficient training paradigm, achieving a 111% improvement in abnormal behavior detection and a 104% enhancement in recommendation models. Key features include an input module for static and dynamic attributes, and an effective training method for high-cardinality categorical attributes.
TREASURE 是一种基于变换器的基础模型,旨在理解和处理大量交易数据,以改进异常行为检测和个人化体验等应用。它同时捕捉消费者行为和支付网络信号,并包含一个具有静态和动态属性专用子模块的输入模块、一种高效的高基数分类属性预测训练范式,且在异常行为检测性能上比生产系统提高了111%,在推荐模型上提高了104%。
Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding
Authors: Jungyeon Koh, Hyun Jong Yang
First: 2025-11-03T16:04:44+00:00 · Latest: 2025-11-26T17:29:51+00:00
Abstract
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
中文标题/摘要
标题:资源感知并行推测解码的协作大型语言模型推理
设备上大型语言模型(LLM)推理需求的增长突显了在资源受限环境中高效移动边缘计算(MEC)解决方案的必要性。推测解码通过在移动设备上的轻量级草稿模型和边缘服务器上的强大目标模型之间分割词元生成,提供了一种有前景的解决方案,但存在通信开销和异步延迟的问题。本文首次提出了一种统一框架,联合优化用户关联和资源分配(UARA),以支持高效的并行推测解码。我们使用多智能体深度强化学习算法解决UARA问题。为了在现实条件下评估我们的方法,我们使用Sionna仿真器进行了实验。结果表明,我们的方法在不牺牲推理准确性的前提下,端到端延迟最多可减少28.0%,平均减少23.7%,使MEC系统中的可扩展和低延迟LLM服务成为可能。
Summary / 总结
This paper addresses the challenge of efficient on-device inference for large language models (LLMs) by proposing a unified framework that optimizes user association and resource allocation (UARA) for parallel speculative decoding. The method uses a multi-agent deep reinforcement learning algorithm to partition token generation between mobile devices and edge servers, reducing end-to-end latency by up to 28.0% and on average by 23.7% without affecting inference accuracy.
本文旨在解决资源受限的移动边缘计算环境中大型语言模型推理的高效性问题。它提出了一种统一框架,用于联合优化用户关联和资源分配以支持并行推测解码,并使用多智能体深度强化学习算法求解。通过Sionna仿真器进行的实验表明,该方法可以将端到端延迟最多降低28.0%,平均降低23.7%,同时不牺牲推理准确性。
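The draft/target split that the abstract builds on can be sketched as follows. This is a toy greedy-verification variant of speculative decoding, not the paper's parallel scheme and not the probabilistic acceptance rule used in practice. The models are hypothetical callables that map a token prefix to the next token, and the target's per-position checks would be a single batched forward pass in a real system.

```python
from typing import Callable, List

# A "model" here is any function from a token prefix to the next token (assumption).
Model = Callable[[List[int]], int]


def speculative_decode(draft: Model, target: Model, prompt: List[int],
                       num_new: int, k: int = 4) -> List[int]:
    """Generate num_new tokens; the output always equals greedy decoding with target."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1) The lightweight draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target model verifies each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # Keep the accepted prefix, then the target's own correction.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the target's choice.
            out.extend(proposal)
    return out
```

Each round emits at least one token, and a round with a fully accepted proposal emits `k`. In the mobile edge setting the abstract describes, the proposal would travel from device to server and the verdict back, which is where the communication overhead it mentions comes from.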
ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images
Authors: M. Naseer Subhani
First: 2025-11-26T17:26:00+00:00 · Latest: 2025-11-26T17:26:00+00:00
Abstract
Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM's segmentation quality and domain robustness through self-guided prompt adaptation. We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.
中文标题/摘要
标题:ReSAM: 精炼、重询和强化:用于遥感图像的自我提示点监督分割
交互式分割模型如Segment Anything Model (SAM) 在自然图像上表现出色,但在遥感图像(RSI)上表现不佳,这主要是由于严重的领域偏移和密集注释的稀缺性。为了解决这个问题,我们提出了一种自我提示、点监督框架,仅使用稀疏点注释将SAM适应RSI。该方法采用一个精炼-重询-强化循环,从初始点生成粗略的伪掩码(精炼),通过自我构建的框提示改进(重询),并在迭代中对嵌入进行对齐以减少确认偏差(强化)。不依赖于全掩码监督,我们的方法通过自我引导的提示适应逐步提升SAM的分割质量和领域鲁棒性。我们在WHU、HRSID和NWPU VHR-10三个RSI基准数据集上评估了我们提出的方法,结果显示我们的方法在预训练SAM和最近的点监督分割方法上表现更优。我们的结果表明,自我提示和语义对齐为大规模、点级适应基础分割模型提供了有效途径,适用于遥感应用。
Summary / 总结
The research aims to improve the performance of interactive segmentation models like SAM on remote sensing images (RSI) by addressing domain shift and sparse annotations. The proposed method, ReSAM, uses a Refine-Requery-Reinforce loop to adapt SAM to RSI using only sparse point annotations. It generates coarse pseudo-masks, improves them with self-constructed box prompts, and aligns embeddings across iterations to reduce confirmation bias. Experiments on three RSI datasets show that ReSAM consistently outperforms pretrained SAM and other point-supervised methods, demonstrating the effectiveness of self-prompting and semantic alignment for adapting foundation models to remote sensing tasks.
研究旨在通过解决领域偏移和稀疏标注问题,改进交互式分割模型如SAM在遥感图像(RSI)上的性能。提出的ReSAM方法使用Refine-Requery-Reinforce循环,仅利用稀疏点标注来适应RSI。该方法生成粗略的伪掩码,通过自构建的框提示改进它们,并在迭代中对嵌入进行对齐以减少确认偏差。在WHU、HRSID和NWPU VHR-10数据集上的实验表明,ReSAM在所有测试中都优于预训练的SAM和其他点监督分割方法,证明了自我提示和语义对齐在遥感应用中对基础模型可扩展适应的有效性。
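The Requery step relies on "self-constructed box prompts" built from the coarse pseudo-masks. One plausible way to build such a prompt, assumed here rather than taken from the paper, is the tight bounding box of the mask's foreground pixels with optional padding. The function name `mask_to_box` and the `pad` parameter are illustrative.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_box(mask: List[List[int]], pad: int = 0) -> Optional[Box]:
    """Return the padded bounding box of nonzero pixels, or None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, v in enumerate(row) if v]
    height, width = len(mask), len(mask[0])
    # Clamp the padded box to the image bounds.
    x_min = max(min(cols) - pad, 0)
    y_min = max(min(rows) - pad, 0)
    x_max = min(max(cols) + pad, width - 1)
    y_max = min(max(rows) + pad, height - 1)
    return (x_min, y_min, x_max, y_max)
```

A box derived this way could then be passed back to the segmentation model as a prompt for the next iteration. A small `pad` gives the model room to recover object parts the coarse mask missed.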
TAB-DRW: A DFT-based Robust Watermark for Generative Tabular Data
Authors: Yizhou Zhao, Xiang Li, Peter Song, Qi Long, Weijie Su
First: 2025-11-26T17:16:14+00:00 · Latest: 2025-11-26T17:16:14+00:00
Abstract
The rise of generative AI has enabled the production of high-fidelity synthetic tabular data across fields such as healthcare, finance, and public policy, raising growing concerns about data provenance and misuse. Watermarking offers a promising solution to address these concerns by ensuring the traceability of synthetic data, but existing methods face many limitations: they are computationally expensive due to reliance on large diffusion models, struggle with mixed discrete-continuous data, or lack robustness to post-modifications. To address them, we propose TAB-DRW, an efficient and robust post-editing watermarking scheme for generative tabular data. TAB-DRW embeds watermark signals in the frequency domain: it normalizes heterogeneous features via the Yeo-Johnson transformation and standardization, applies the discrete Fourier transform (DFT), and adjusts the imaginary parts of adaptively selected entries according to precomputed pseudorandom bits. To further enhance robustness and efficiency, we introduce a novel rank-based pseudorandom bit generation method that enables row-wise retrieval without incurring storage overhead. Experiments on five benchmark tabular datasets show that TAB-DRW achieves strong detectability and robustness against common post-processing attacks, while preserving high data fidelity and fully supporting mixed-type features.
中文标题/摘要
标题:TAB-DRW:基于DFT的稳健生成表格数据水印
生成式AI的发展使得在医疗保健、金融和公共政策等领域生成高保真合成表格数据成为可能,这引发了关于数据来源和滥用的日益增长的担忧。水印提供了一种有希望的解决方案,通过确保合成数据的可追溯性来应对这些担忧,但现有方法面临许多限制:它们由于依赖于大型扩散模型而计算成本高昂,难以处理混合离散-连续数据,或者缺乏对后修改的鲁棒性。为了解决这些问题,我们提出了TAB-DRW,这是一种高效的针对生成表格数据的稳健后编辑水印方案。TAB-DRW将水印信号嵌入到频域中:它通过Yeo-Johnson变换和标准化对异构特征进行归一化,应用离散傅里叶变换(DFT),并根据预先计算的伪随机位调整适应性选择条目的虚部。为了进一步增强鲁棒性和效率,我们引入了一种新颖的基于排名的伪随机位生成方法,该方法可以在不增加存储开销的情况下实现行级检索。在五个基准表格数据集上的实验表明,TAB-DRW在检测能力和抵抗常见后处理攻击的鲁棒性方面表现出色,同时保持了高数据保真度并完全支持混合类型特征。
Summary / 总结
The paper addresses the challenge of ensuring data provenance and preventing misuse of synthetic tabular data generated by AI. It introduces TAB-DRW, a DFT-based watermarking method that normalizes features, applies the discrete Fourier transform, and adjusts imaginary parts based on pseudorandom bits. Experiments demonstrate that TAB-DRW is highly detectable and robust against post-processing attacks, while maintaining data fidelity and supporting mixed-type features.
TAB-DRW 是一种基于 DFT 的水印方案,旨在解决现有方法在合成表格数据中的局限性。它通过归一化特征、应用 DFT 并根据伪随机位调整虚部来嵌入水印。在五个基准数据集上的实验表明,TAB-DRW 在保持数据保真度和支持混合类型特征的同时,具有高度的可检测性和对后处理攻击的鲁棒性。
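The frequency-domain embedding step can be illustrated with a toy example. This sketch assumes an already standardized numeric row, fixed frequency indices, and a simple sign rule for the imaginary part. The Yeo-Johnson step, the adaptive entry selection, and the rank-based bit generation described in the abstract are all omitted, and the sign rule itself is an assumption, not the paper's adjustment rule.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X: List[complex]) -> List[float]:
    """Inverse DFT, keeping the real part (X is expected to be conjugate-symmetric)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def embed(row: List[float], bits: List[int], indices: List[int],
          strength: float = 0.5) -> List[float]:
    """Encode each bit as the sign of Im(X[k]); requires 0 < k < len(row) / 2."""
    n = len(row)
    X = dft(row)
    for k, bit in zip(indices, bits):
        sign = 1.0 if bit else -1.0
        # Keep the magnitude at least `strength` so the sign survives small noise.
        X[k] = complex(X[k].real, sign * max(abs(X[k].imag), strength))
        # Mirror the change so the inverse transform stays real-valued.
        X[n - k] = X[k].conjugate()
    return idft_real(X)


def extract(row: List[float], indices: List[int]) -> List[int]:
    """Read the bits back from the signs of the imaginary parts."""
    X = dft(row)
    return [1 if X[k].imag > 0 else 0 for k in indices]
```

Under these assumptions, `extract(embed(row, bits, idx), idx)` returns `bits` for any real-valued `row`. The `strength` parameter trades data fidelity against how much perturbation the sign can survive.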
MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Authors: Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen
First: 2025-11-26T17:09:03+00:00 · Latest: 2025-11-26T17:09:03+00:00
Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.
中文标题/摘要
标题:MoGAN:通过少量步骤运动对抗后训练提高视频扩散中的运动质量
视频扩散模型在帧级保真度方面表现出色,但在运动连贯性、动态性和真实感方面仍然存在困难,经常产生抖动、鬼影或不合理的动态。一个关键限制是,标准的去噪均方误差目标无法直接监督时间一致性,使得模型在生成不良运动的同时仍能实现低损失。我们提出MoGAN,这是一种以运动为中心的后训练框架,可以在无需奖励模型或人类偏好数据的情况下提高运动真实感。该框架基于一个三步蒸馏的视频扩散模型,我们训练一个基于DiT的光流鉴别器来区分真实运动和生成运动,并结合分布匹配正则化器以保持视觉保真度。在Wan2.1-T2V-1.3B上的实验表明,MoGAN在基准测试中显著提高了运动质量。在VBench上,MoGAN将运动得分分别提高了7.3%(相对于50步教师模型)和13.3%(相对于3步DMD模型)。在VideoJAM-Bench上,MoGAN将运动得分分别提高了7.4%(相对于教师模型)和8.8%(相对于DMD),同时保持了可比或更好的美学和图像质量得分。进一步的人类研究证实,MoGAN在运动质量方面更受偏好(与教师模型相比为52%对38%;与DMD相比为56%对29%)。总体而言,MoGAN在不牺牲视觉保真度或效率的情况下提供了明显更真实的运动,为快速、高质量视频生成提供了一条实用的道路。项目网页为:https://xavihart.github.io/mogan/
Summary / 总结
MoGAN is a motion-centric post-training framework that enhances motion quality in video diffusion models without relying on reward models or human preference data. Built on a 3-step distilled video diffusion model, MoGAN trains a DiT-based optical-flow discriminator to distinguish real from generated motion, combined with a distribution-matching regularizer to maintain visual fidelity. Experiments show that MoGAN significantly improves motion quality on benchmarks like VBench and VideoJAM-Bench, with substantial gains in motion scores over both a 50-step teacher model and a 3-step DMD model, while maintaining comparable aesthetic and image-quality scores. A human study also confirms that MoGAN is preferred for motion quality. Overall, MoGAN improves motion realism without sacrificing visual fidelity or efficiency, offering a practical solution for video generation.
MoGAN 是一种以运动为中心的后训练框架,通过训练基于 DiT 的光流鉴别器和使用分布匹配正则化器来提升视频扩散模型中的运动质量,而不依赖于奖励模型或人类偏好数据。MoGAN 在 VBench(分别比 50 步教师模型高 7.3%,比 3 步 DMD 模型高 13.3%)和 VideoJAM-Bench(分别比教师模型高 7.4%,比 DMD 高 8.8%)中显著提高了运动得分,同时保持或提高了美学和图像质量得分。人类研究还证实了 MoGAN 在运动质量方面的优越性,相比教师模型和 DMD 模型,MoGAN 分别获得了 52% 和 56% 的偏好率,而教师模型和 DMD 模型分别为 38% 和 29%。
BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali
Authors: Abdullah Al Sefat
First: 2025-11-25T15:26:47+00:00 · Latest: 2025-11-26T17:08:26+00:00
Abstract
Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
中文标题/摘要
标题:BengaliFig:面向孟加拉语比喻性与文化根植推理的低资源挑战
大型语言模型在多语言基准测试中表现出色,但在比喻性和文化根植推理方面仍有待广泛评估,尤其是在低资源环境中。我们提出了 BengaliFig,这是一个紧凑但标注丰富的挑战集,旨在填补孟加拉语中的这一空白,孟加拉语是一种使用广泛的低资源语言。数据集包含 435 个独特的谜语,源自孟加拉语口头和文学传统。每个条目都从推理类型、陷阱类型、文化深度、答案类别和难度五个正交维度进行标注,并通过一个约束感知的 AI 辅助流程自动转换为多项选择格式。我们在零样本和少样本思维链提示下评估了来自主要提供商的八种前沿 LLM,揭示了它们在隐喻性和文化特定推理方面的一贯弱点。因此,BengaliFig 既提供了一个用于评估 LLM 在低资源文化环境中鲁棒性的诊断探针,也朝着包容且重视文化遗产的 NLP 评估迈出了一步。
Summary / 总结
The motivation for BengaliFig is to evaluate large language models' ability in figurative and culturally grounded reasoning, particularly in low-resource languages like Bengali. The dataset consists of 435 riddles annotated along five dimensions and converted to a multiple-choice format. Key findings show that leading LLMs perform poorly in metaphorical and culturally specific reasoning tasks, highlighting the need for better evaluation in low-resource contexts.
研究的动机是评估大型语言模型在隐喻和文化背景推理方面的能力,特别是在低资源语言如孟加拉语中。研究人员开发了包含435个谜语的数据集,并标注了五个维度。他们评估了八种领先的LLM,主要发现这些模型在隐喻性和文化特定性推理任务中表现一致不佳。
On the Limits of Innate Planning in Large Language Models
Authors: Charles Schepanowski, Charles Ling
First: 2025-11-26T17:08:13+00:00 · Latest: 2025-11-26T17:08:13+00:00
Comments: 33 pages, 7 figures
Abstract
Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.
中文标题/摘要
标题:关于大型语言模型内置规划能力的局限性
大型语言模型(LLMs)在许多基准测试中取得了令人印象深刻的成果,但它们的规划能力和状态依赖性推理能力仍然不清楚。我们直接研究了这些能力,不使用代码执行或其他工具,使用8-拼图:一个经典的任务,需要状态跟踪和目标导向的规划,同时允许精确的、逐步的评估。四个模型在常见的提示条件下(零样本、思维链、思维算法)和分层纠正反馈下进行测试。反馈在某些模型提示组合中提高了成功率,但许多成功的运行是长的、计算密集型的并且间接的。然后我们使用外部移动验证器来检查这些模型,该验证器仅提供有效移动。即使在这种程度的帮助下,这些模型在该设置中也没有解决任何拼图。定性分析揭示了所有模型中的两个主要缺陷:(1)脆弱的内部状态表示,导致频繁出现无效移动,(2)弱的启发式规划,模型进入循环或选择不减少与目标状态距离的动作。这些发现表明,在没有外部工具如代码解释器的情况下,当前的LLMs在规划方面存在重大局限性,进一步的进步可能需要维护显式状态和执行结构化搜索的机制。
Summary / 总结
This study investigates the planning and stateful reasoning capabilities of large language models (LLMs) using the 8-puzzle, without code execution or other tools. Four models were tested under Zero-Shot, Chain-of-Thought, and Algorithm-of-Thought prompting and with tiered corrective feedback. Feedback improved success rates for some model-prompt combinations, but many successful runs were long, computationally expensive, and indirect. Even with an external move validator that supplied only valid moves, none of the models solved any puzzle. Qualitative analysis identified two dominant deficits: brittle internal state representations, which lead to frequent invalid moves, and weak heuristic planning, with models entering loops or choosing actions that do not reduce the distance to the goal state. These findings suggest that, without external tools such as code interpreters, current LLMs have substantial planning limitations and that further progress may require mechanisms for maintaining explicit state and performing structured search.
该研究在不使用代码执行或其他工具的情况下,使用8-拼图任务考察大型语言模型(LLMs)的规划和状态推理能力。四种模型在零样本、思维链和思维算法提示条件下接受测试,并辅以分层纠正反馈。反馈提高了部分模型-提示组合的成功率,但许多成功的运行步数多、计算成本高且路径间接。即使外部移动验证器只提供有效移动,也没有任何模型解出拼图。定性分析发现两个主要缺陷:内部状态表示脆弱,导致频繁出现无效移动;启发式规划薄弱,模型陷入循环或选择无法缩短与目标状态距离的动作。这些发现表明,在没有代码解释器等外部工具的情况下,当前LLMs的规划能力存在显著局限,进一步的进展可能需要显式状态维护和结构化搜索机制。
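The two ingredients the 8-puzzle entry above relies on, a check for which moves are valid and a measure of distance to the goal state, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the state encoding (a 9-tuple in row-major order with 0 as the blank), the goal layout, the function names, and the choice of Manhattan distance as the distance measure. None of it is taken from the paper's own validator or evaluation code.

```python
from typing import List, Tuple

# Assumed encoding: 9 entries in row-major order, 0 marks the blank.
State = Tuple[int, ...]
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # assumed goal layout

# Direction in which the blank moves -> (row delta, column delta)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def valid_moves(state: State) -> List[str]:
    """Return the directions in which the blank can legally move."""
    row, col = divmod(state.index(0), 3)
    return [name for name, (dr, dc) in MOVES.items()
            if 0 <= row + dr < 3 and 0 <= col + dc < 3]


def apply_move(state: State, move: str) -> State:
    """Slide the blank in the given direction; raise ValueError if invalid."""
    if move not in valid_moves(state):
        raise ValueError(f"invalid move {move!r}")
    blank = state.index(0)
    dr, dc = MOVES[move]
    target = blank + 3 * dr + dc
    tiles = list(state)
    tiles[blank], tiles[target] = tiles[target], tiles[blank]
    return tuple(tiles)


def manhattan(state: State, goal: State = GOAL) -> int:
    """Sum over tiles (blank excluded) of the grid distance to the goal cell."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = goal.index(tile)
        total += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return total
```

With the state (1, 2, 3, 4, 5, 6, 7, 0, 8), `valid_moves` returns up, left, and right, and moving the blank right reaches the assumed goal. A move that leaves `manhattan` unchanged or larger is one simple way to flag actions that "do not reduce the distance to the goal state".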
Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
Authors: Haohong Lin, Yunzhi Zhang, Wenhao Ding, Jiajun Wu, Ding Zhao
Venue: NeurIPS 2025
First: 2025-11-26T17:01:41+00:00 · Latest: 2025-11-26T17:01:41+00:00
Comments: Published at NeurIPS 2025: https://openreview.net/forum?id=4OLbpaTKJe
Abstract
End-to-end (E2E) autonomous driving models have demonstrated strong performance in open-loop evaluations but often suffer from cascading errors and poor generalization in closed-loop settings. To address this gap, we propose Model-based Policy Adaptation (MPA), a general framework that enhances the robustness and safety of pretrained E2E driving agents during deployment. MPA first generates diverse counterfactual trajectories using a geometry-consistent simulation engine, exposing the agent to scenarios beyond the original dataset. Based on this generated data, MPA trains a diffusion-based policy adapter to refine the base policy's predictions and a multi-step Q value model to evaluate long-term outcomes. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the one with the highest expected utility. Experiments on the nuScenes benchmark using a photorealistic closed-loop simulator demonstrate that MPA significantly improves performance across in-domain, out-of-domain, and safety-critical scenarios. We further investigate how the scale of counterfactual data and inference-time guidance strategies affect overall effectiveness.
中文标题/摘要
标题:基于模型的策略适应以实现闭环端到端自动驾驶
端到端(E2E)自动驾驶模型在开环评估中表现出强大的性能,但在闭环环境中往往遭受级联错误和较差的泛化能力。为解决这一差距,我们提出了一种基于模型的策略适应(MPA)框架,该框架在部署过程中增强了预训练E2E驾驶代理的鲁棒性和安全性。MPA 首先使用几何一致的模拟引擎生成多样化的反事实轨迹,使代理接触到超出原始数据集的场景。基于生成的数据,MPA 训练了一个基于扩散的策略适配器来细化基础策略的预测,并训练一个多步Q值模型来评估长期结果。在推理时,适配器提出多个轨迹候选,Q值模型选择具有最高预期效用的轨迹。在nuScenes基准上使用逼真的闭环模拟器进行的实验表明,MPA 显著提高了在领域内、领域外和安全关键场景中的性能。我们进一步研究了反事实数据的规模和推理时的指导策略如何影响整体效果。
Summary / 总结
The research aims to enhance the robustness and safety of end-to-end autonomous driving models in closed-loop settings by proposing Model-based Policy Adaptation (MPA). MPA uses a simulation engine to generate diverse counterfactual trajectories, which are then used to train a policy adapter and a multi-step Q value model. At inference time, the adapter proposes multiple trajectory candidates, and the Q value model selects the most beneficial one. Experiments show that MPA improves performance in various scenarios, including in-domain, out-of-domain, and safety-critical situations.
研究旨在通过提出基于模型的策略适应(MPA)来增强端到端自动驾驶模型在闭环环境中的鲁棒性和安全性。MPA 使用几何一致的模拟引擎生成多样化的反事实轨迹,然后用于训练基于扩散的策略适配器和多步 Q 值模型。在推理时,适配器提出多个轨迹候选,Q 值模型选择最优的一个。实验表明,MPA 在各种场景中,包括领域内、领域外和安全关键场景中,都提高了性能。
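The inference-time step described in the MPA entry above (the adapter proposes several trajectory candidates and the Q value model keeps the one with the highest expected utility) amounts to a propose-then-select loop. The Python sketch below shows only that selection pattern. The `propose` and `q_value` callables, the waypoint representation, and the toy stand-ins are all hypothetical; the actual diffusion adapter and multi-step Q value model are not specified in the abstract.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed representation: a trajectory is a list of (x, y) waypoints.
Trajectory = List[Tuple[float, float]]


def select_trajectory(
    propose: Callable[[int], Sequence[Trajectory]],
    q_value: Callable[[Trajectory], float],
    num_candidates: int = 8,
) -> Trajectory:
    """Ask the adapter for candidates and keep the one with the highest Q estimate."""
    candidates = propose(num_candidates)
    if not candidates:
        raise ValueError("the adapter returned no candidates")
    return max(candidates, key=q_value)


def toy_propose(k: int) -> List[Trajectory]:
    """Hypothetical adapter: straight paths with different lateral offsets."""
    offsets = [i - k // 2 for i in range(k)]
    return [[(float(t), 0.5 * off) for t in range(1, 4)] for off in offsets]


def toy_q(traj: Trajectory) -> float:
    """Hypothetical utility: reward forward progress, penalise lateral drift."""
    return traj[-1][0] - sum(abs(y) for _, y in traj)


best = select_trajectory(toy_propose, toy_q, num_candidates=5)
```

In this toy setup the selected candidate is the one with zero lateral offset, since every other candidate pays a drift penalty for the same forward progress.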
LCB-CV-UNet: Enhanced Detector for High Dynamic Range Radar Signals
Authors: Yanbin Wang, Xingyu Chen, Yumiao Wang, Xiang Wang, Chuanfei Zang, Guolong Cui, Jiahuan Liu
Venue: Proc. IEEE Int. Geosci. Remote Sens. Symp. (2025) 6050-6054
First: 2025-05-29T14:00:59+00:00 · Latest: 2025-11-26T16:58:54+00:00
Comments: 5 pages, 4 figures. Accepted to IEEE IGARSS 2025
Abstract
We propose the LCB-CV-UNet to tackle performance degradation caused by High Dynamic Range (HDR) radar signals. Initially, a hardware-efficient, plug-and-play module named Logarithmic Connect Block (LCB) is proposed as a phase coherence preserving solution to address the inherent challenges in handling HDR features. Then, we propose the Dual Hybrid Dataset Construction method to generate a semi-synthetic dataset, approximating typical HDR signal scenarios with adjustable target distributions. Simulation results show about 1% total detection probability improvement with under 0.9% computational complexity added compared with the baseline. Furthermore, it excels 5% over the baseline at the range in 11-13 dB signal-to-noise ratio typical for urban targets. Finally, the real experiment validates the practicality of our model.
中文标题/摘要
标题:LCB-CV-UNet:高动态范围雷达信号增强检测器
我们提出LCB-CV-UNet以应对高动态范围(HDR)雷达信号引起的性能退化问题。首先,提出了一种硬件高效、即插即用的模块——对数连接块(LCB),作为相位相干性保持的解决方案,以应对处理HDR特征的固有挑战。然后,我们提出了双混合数据集构建方法,生成一个半合成数据集,通过可调节的目标分布近似典型HDR信号场景。仿真结果表明,与基线相比,总检测概率提高了约1%,计算复杂度增加了不到0.9%。此外,在11-13 dB信噪比的典型城市目标场景下,其性能优于基线5%。最后,实际实验验证了我们模型的实用性。
Summary / 总结
The LCB-CV-UNet is proposed to improve the performance of detecting High Dynamic Range radar signals. It includes a Logarithmic Connect Block (LCB) module to preserve phase coherence and a Dual Hybrid Dataset Construction method for generating a semi-synthetic dataset. The model shows a 1% improvement in total detection probability with minimal added computational complexity and outperforms the baseline by 5% in urban target scenarios with 11-13 dB signal-to-noise ratio.
提出LCB-CV-UNet以提升HDR雷达信号检测性能。引入了Logarithmic Connect Block (LCB)模块以保持相位相干性,并采用Dual Hybrid Dataset Construction方法生成半合成数据集。该模型在11-13 dB信噪比下比基线高出约5%的检测性能,并且在总检测概率上提高了约1%,同时增加了不到0.9%的计算复杂度。实际实验验证了其实用性。
Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning
Authors: Alex Ning, Yen-Ling Kuo, Gabe Gomes
First: 2025-11-26T16:54:06+00:00 · Latest: 2025-11-26T16:54:06+00:00
Comments: 13 pages, 6 figures
Abstract
Latent reasoning represents a new development in Transformer language models that has shown potential in compressing reasoning lengths compared to chain-of-thought reasoning. By directly passing the information-rich previous final latent state into the next sequence, latent reasoning removes the restriction to human language tokens as the medium for reasoning. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology to optimize latent reasoning length by minimizing reasoning length while maintaining accuracy. This, in turn, further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a 52% drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze relationships between training coefficients, experiment with architecture variations, and continue our knowledge distillation for latent reasoning SFT efforts. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.
中文标题/摘要
标题:学习何时停止:通过强化学习实现自适应潜在推理
潜在推理是Transformer语言模型中的一种新发展,与链式思考推理相比,它在压缩推理长度方面显示出潜力。通过直接将信息丰富的先前最终潜在状态传递到下一个序列,潜在推理消除了将推理限制为人类语言标记的媒介。我们开发了自适应长度的潜在推理模型,并引入了一种后SFT强化学习方法,通过最小化推理长度同时保持准确性来优化潜在推理长度。这反过来进一步减少了计算使用量,并提高了潜在推理模型压缩能力的标准。在Llama 3.2 1B模型和GSM8K-Aug数据集上的实验显示,总推理长度减少了52%,而准确性没有下降。在未来的工作中,我们计划扩展到其他模型和数据集,分析训练系数之间的关系,尝试架构变化,并继续我们的潜在推理SFT知识精炼工作。我们将在https://github.com/apning/adaptive-latent-reasoning/提供我们的代码和预训练权重。
Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Authors: Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
First: 2025-11-26T16:53:05+00:00 · Latest: 2025-11-26T16:53:05+00:00
Abstract
The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
中文标题/摘要
标题:Harmony:通过跨任务协同协调音频与视频生成
同步音频-视觉内容的合成是生成式AI中的一个关键挑战,开源模型在鲁棒的音频-视频对齐方面面临困难。我们的分析表明,这一问题源于联合扩散过程中的三个基本挑战:(1)对应关系漂移,即同时演化的噪声潜变量阻碍了对齐的稳定学习;(2)低效的全局注意力机制,无法捕捉细粒度的时间线索;(3)传统无分类器引导(CFG)的模态内偏差,它增强了条件性但未提高跨模态同步。为克服这些挑战,我们引入了 Harmony,一种从机制上强制实现音频-视觉同步的新框架。我们首先提出了一种跨任务协同训练范式,通过利用音频驱动视频和视频驱动音频生成任务中的强监督信号来减轻漂移。然后,我们设计了一种全局-局部解耦交互模块,以实现高效且精确的时序-风格对齐。最后,我们提出了一种新的同步增强CFG(SyncCFG),在推理过程中显式地分离并放大对齐信号。大量实验表明,Harmony 达到了新的最先进水平,在生成保真度方面显著优于现有方法,更关键的是在实现细粒度音频-视觉同步方面同样如此。
Summary / 总结
The paper addresses the challenge of robust audio-video alignment in generative AI by introducing Harmony, a framework that tackles three key issues: correspondence drift, inefficient global attention, and intra-modal bias in Classifier-Free Guidance. Harmony uses a Cross-Task Synergy training approach, a Global-Local Decoupled Interaction Module, and a Synchronization-Enhanced Classifier-Free Guidance to enforce audio-visual synchronization. Experiments show that Harmony outperforms existing methods in both generation fidelity and fine-grained audio-visual synchronization.
论文通过引入 Harmony 框架应对生成式AI中鲁棒的音频-视频对齐问题。该框架利用跨任务协同训练范式缓解对应关系漂移,并通过全局-局部解耦交互模块实现高效且精确的对齐。此外,还提出了同步增强的无分类器引导(SyncCFG),以在推理过程中显式分离并放大对齐信号。实验表明,Harmony 在生成保真度和细粒度音频-视频同步方面均优于现有方法。
Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss
Authors: Chou Mo, Yehyun Suh, J. Ryan Martin, Daniel Moyer
First: 2025-11-26T16:50:06+00:00 · Latest: 2025-11-26T16:50:06+00:00
Comments: 9 pages, 3 figures, 1 table
Abstract
Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.
中文标题/摘要
标题:使用2D/3D配准损失增强骨盆透视中的关键点检测模型
自动关键点检测为医疗专业人员提供了一种有效的方法,通过术中成像理解患者解剖结构和定位。虽然当前骨盆透视的关键点检测方法显示出有希望的准确性,但大多数方法假设骨盆的固定前-后视图。然而,方向往往偏离这种标准视图,可能是由于成像单元或目标结构本身的重新定位。为了解决这一限制,我们提出了一种新的框架,将2D/3D关键点配准纳入U-Net关键点预测模型的训练中。我们通过在患者姿态可变的现实术中条件下比较基于基线U-Net、使用姿态估计损失训练的U-Net以及使用姿态估计损失微调的U-Net的关键点检测准确性来分析性能差异。
Summary / 总结
The research aims to improve landmark detection in pelvic fluoroscopy when the view deviates from the standard antero-posterior orientation that most existing methods assume. The method incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. The evaluation compares landmark detection accuracy among a baseline U-Net, a U-Net trained with Pose Estimation Loss, and a U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions with variable patient pose; the abstract does not state which variant performs best.
研究旨在提高骨盆透视中关键点检测在视角偏离标准前后位视图时的准确性。方法是将2D/3D关键点配准纳入U-Net关键点预测模型的训练。评估在患者姿态可变的真实术中条件下,比较基线U-Net、使用姿态估计损失训练的U-Net以及使用姿态估计损失微调的U-Net的关键点检测准确性;摘要未说明哪种变体表现最佳。
Multimodal Robust Prompt Distillation for 3D Point Cloud Models
Authors: Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang
First: 2025-11-26T16:49:38+00:00 · Latest: 2025-11-26T16:49:38+00:00
Abstract
Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD) for distilling robust 3D point cloud model. It learns lightweight prompts by aligning student point cloud model's features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation is all during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.
中文标题/摘要
标题:多模态鲁棒提示蒸馏用于3D点云模型
对抗攻击对基于学习的3D点云模型构成了重大威胁,严重削弱了它们在安全敏感应用中的可靠性。现有防御方法通常存在(1)高计算开销和(2)对不同攻击类型的泛化能力差的问题。为了解决这些问题,我们提出了一种新颖且高效的教师-学生框架,即多模态鲁棒提示蒸馏(MRPD),用于提炼鲁棒的3D点云模型。它通过将学生点云模型的特征与来自三个不同教师的鲁棒嵌入对齐来学习轻量级提示:一个处理深度投影的视觉模型、一个高性能的3D模型和一个文本编码器。为了确保可靠的知识转移,蒸馏过程由一个信心门控机制引导,该机制动态平衡所有输入模态的贡献。值得注意的是,由于蒸馏过程仅在训练阶段进行,因此在推理时没有额外的计算成本。广泛的实验表明,MRPD在多种白盒和黑盒攻击下显著优于最先进的防御方法,甚至在干净数据上也表现出更好的性能。我们的工作通过高效利用多模态知识,为构建鲁棒的3D视觉系统提供了一个新的、实用的范式。
Summary / 总结
The paper proposes Multimodal Robust Prompt Distillation (MRPD) to defend against adversarial attacks on 3D point cloud models. MRPD uses a teacher-student framework with three distinct teachers: a vision model, a 3D model, and a text encoder. It aligns the student model's features with robust embeddings from these teachers and employs a confidence-gated mechanism to balance the contributions of different modalities. Experiments show that MRPD effectively defends against various attacks and even improves performance on clean data, with no additional inference cost.
论文提出了一种名为 Multimodal Robust Prompt Distillation (MRPD) 的教师-学生框架,以应对3D点云模型在对抗攻击下的脆弱性。MRPD 通过将学生点云模型的特征与三个不同教师(处理深度投影的视觉模型、高性能 3D 模型和文本编码器)的鲁棒嵌入对齐来学习轻量级提示,并由置信度门控机制动态平衡各模态的贡献,以确保可靠的知识转移。实验表明,MRPD 在多种白盒和黑盒攻击下优于现有防御方法,在干净数据上的性能也有所提升,且推理阶段没有额外的计算成本。
BAMAS: Structuring Budget-Aware Multi-Agent Systems
Authors: Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen
Venue: AAAI 2026 oral
First: 2025-11-26T16:48:18+00:00 · Latest: 2025-11-26T16:48:18+00:00
Comments: Accepted by AAAI 2026 (oral paper)
Abstract
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
中文标题/摘要
标题:BAMAS:构建预算意识多智能体系统
基于大型语言模型(LLM)的多智能体系统已成为使自主智能体解决复杂任务的强大范式。随着这些系统的复杂性增加,成本成为实际部署中的一个重要考虑因素。然而,现有工作很少解决如何在明确的预算约束下构建多智能体系统的问题。在本文中,我们提出了一种名为BAMAS的新方法,用于构建具有预算意识的多智能体系统。BAMAS首先通过求解一个整数线性规划问题来选择最优的LLM集合,该问题平衡了性能和成本。然后,它通过利用基于强化学习的方法来确定这些LLM之间的协作拓扑。最后,系统基于所选的智能体及其协作拓扑进行实例化和执行。我们在三个代表性任务上评估了BAMAS,并将其与最先进的智能体构建方法进行了比较。结果表明,BAMAS在降低成本高达86%的同时,实现了相当的性能。
Summary / 总结
The paper introduces BAMAS, a method for building budget-aware multi-agent systems using LLMs. It first optimally selects LLMs by solving an Integer Linear Programming problem that balances performance and cost. Then, it determines the interaction topology using reinforcement learning. Experimental results show that BAMAS achieves similar performance to state-of-the-art methods while reducing costs by up to 86%.
BAMAS 是一种通过整数线性规划问题选择最优的大语言模型集,并使用强化学习确定它们的交互拓扑的新方法。然后根据选定的代理及其协作拓扑来实例化和执行系统。实验结果表明,BAMAS 可以实现与现有最佳方法相当的性能,同时将成本降低高达 86%。
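The first BAMAS step above, choosing a set of LLMs that balances performance and cost under a budget, can be pictured with a small Python sketch. Several assumptions are made loudly here: the catalogue of models with its performance and cost numbers is invented, the objective (one model per role, maximise summed performance subject to total cost not exceeding the budget) is an illustrative guess rather than the paper's formulation, and exhaustive enumeration stands in for an Integer Linear Programming solver.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Hypothetical catalogue: model name -> (estimated performance, cost per call).
CATALOGUE: Dict[str, Tuple[float, float]] = {
    "small": (0.60, 1.0),
    "medium": (0.75, 3.0),
    "large": (0.90, 10.0),
}


def select_models(roles: List[str], budget: float) -> Optional[Dict[str, str]]:
    """Assign one model per role, maximising summed performance within the budget.

    Brute-force enumeration is used as a stand-in for an ILP solver; it is
    only practical for a handful of roles and models.
    """
    best_score = float("-inf")
    best: Optional[Dict[str, str]] = None
    for choice in product(CATALOGUE, repeat=len(roles)):
        cost = sum(CATALOGUE[model][1] for model in choice)
        if cost > budget:
            continue  # violates the budget constraint
        score = sum(CATALOGUE[model][0] for model in choice)
        if score > best_score:
            best_score, best = score, dict(zip(roles, choice))
    return best  # None when no assignment fits the budget
```

For two roles and a budget of 4.0, the sketch pairs one "small" with one "medium" model; with a budget below 2.0 it returns `None`. The topology-selection step that follows in BAMAS (reinforcement learning over interaction topologies) is not covered by this sketch.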
Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners
Authors: Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca
Venue: AAAI 2026
First: 2025-11-13T12:06:12+00:00 · Latest: 2025-11-26T16:39:11+00:00
Comments: AAAI 2026 Workshop on Graphs and more Complex Structures For Learning and Reasoning (GCLR). Version 2 fixes typos in author name and Figure 1
Abstract
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well as generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.
中文标题/摘要
标题:迷失在序列化中:LLM图推理器的不变性和泛化能力
虽然前景广阔,但基于大型语言模型(LLMs)的图推理器缺乏对图表示中对称性的内置不变性。在基于序列图序列化操作时,LLMs 在节点重新索引、边重新排序或格式更改下会产生不同的输出,这引发了鲁棒性方面的担忧。我们系统地分析了这些影响,研究了微调如何影响编码敏感性以及在未见任务上的泛化能力。我们提出了一种原理性的图序列化分解,将图序列化分解为节点标签、边编码和语法,并在全面的基准测试套件上评估LLMs对每个因素变化的鲁棒性。我们还贡献了一组新的谱任务,以进一步评估微调推理器的泛化能力。结果显示,较大的(未微调)模型更具鲁棒性。微调可以减少对节点重新标记的敏感性,但可能会增加对结构和格式变化的敏感性,而微调并不一致地提高在未见任务上的性能。
Summary / 总结
The research aims to address the robustness issues of graph reasoners based on Large Language Models (LLMs) due to their lack of invariance to graph symmetries. The study evaluates how fine-tuning affects the encoding sensitivity and generalization of LLMs on graph tasks. Key findings indicate that larger models are more robust, while fine-tuning can reduce sensitivity to node relabeling but may increase sensitivity to structural and formatting changes, and does not uniformly improve performance on new tasks.
研究探讨了基于大型语言模型(LLMs)的图推理器在图序列化变化下的鲁棒性。研究指出,LLMs 对图表示中的对称性缺乏不变性,导致在节点重新索引或边重新排序时输出不同。通过将图序列化分解为节点标签、边编码和语法,研究评估了LLMs 对这些因素的敏感性,并发现较大的模型更具鲁棒性。微调可以减少对节点重新标记的敏感性,但可能会增加对结构和格式变化的敏感性,且并不能一致地提高在未见任务上的性能。
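The symmetry issue in the entry above is easy to see on a tiny example: the same graph yields different text under node reindexing or edge reordering, while graph properties stay fixed. The Python sketch below is illustrative only. The edge-list text format, the function names, and the use of the sorted degree sequence as an example invariant are assumptions, not the paper's benchmark or serialization scheme.

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def serialize(edges: List[Edge]) -> str:
    """Render an edge list as text, in the spirit of a prompt given to an LLM."""
    return ", ".join(f"{u}-{v}" for u, v in edges)


def relabel(edges: List[Edge], mapping: Dict[int, int]) -> List[Edge]:
    """Rename every node according to a permutation of node ids."""
    return [(mapping[u], mapping[v]) for u, v in edges]


def shuffle_edges(edges: List[Edge], seed: int) -> List[Edge]:
    """Return the same edges in a (possibly) different order."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def degree_sequence(edges: List[Edge]) -> List[int]:
    """Sorted node degrees, which do not depend on labels or edge order."""
    degree: Dict[int, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sorted(degree.values())


path = [(0, 1), (1, 2), (2, 3)]
relabeled = relabel(path, {0: 2, 1: 0, 2: 3, 3: 1})
```

Here `serialize(path)` and `serialize(relabeled)` are different strings, yet both describe a four-node path with the same degree sequence. A reasoner that is invariant to serialization should answer structural questions identically for both inputs.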
UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes
Authors: Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, Sheng Huang, Zeyu Wang
First: 2025-11-26T16:38:29+00:00 · Latest: 2025-11-26T16:38:29+00:00
Comments: 10 pages, 6 figures
Abstract
Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.
中文标题/摘要
标题:UAVLight:无人机场景中光照鲁棒的三维重建基准
光照不一致是多视图三维重建中的基本挑战。太阳光方向、云层覆盖和阴影的变化打破了经典多视图立体视觉(MVS)和结构从运动(SfM)管道以及最近的神经渲染方法所依赖的恒定光照假设,导致几何漂移、颜色不一致和阴影印记。这一问题在基于无人机的重建中尤为关键,因为长时间飞行和户外环境使得光照变化不可避免。然而,现有的数据集要么仅在短时间内捕获,缺乏有意义的光照多样性,要么跨越数月和季节,几何和语义变化使得光照鲁棒性研究变得复杂。我们引入了UAVLight,一个控制下的真实光照鲁棒三维重建基准。每个场景沿可重复的、地理参考的飞行路径在一天中的多个固定时间点捕获,产生在一致几何、校准和视角下自然的光照变化。在不同光照条件下标准化的评估协议下,UAVLight为开发和基准测试在真实户外环境中一致、忠实且可重新照明的重建方法提供了可靠的基础。
Summary / 总结
The paper introduces UAVLight, a benchmark for illumination-robust 3D reconstruction in UAV scenes. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, which keeps geometry, calibration, and viewpoints consistent while the lighting varies naturally. The abstract motivates the benchmark by noting that lighting changes cause geometry drift, color inconsistency, and shadow imprinting in classical and neural reconstruction pipelines; it reports no method comparisons. With standardized evaluation protocols across lighting conditions, UAVLight is intended as a reliable basis for developing and benchmarking more robust reconstruction techniques.
论文介绍了UAVLight,一个用于无人机场景下光照鲁棒3D重建的基准。每个场景沿可重复的、带地理参考的飞行路径在一天中的多个固定时间采集,在几何、标定和视角保持一致的同时引入自然的光照变化。摘要指出,光照变化会使经典和神经重建流程出现几何漂移、颜色不一致和阴影印记,但未报告具体的方法对比结果。借助跨光照条件的标准化评估协议,UAVLight为开发和评测在真实户外环境中一致、忠实且可重新照明的重建方法提供了可靠的基础。
ENMA: Tokenwise Autoregression for Generative Neural PDE Operators
Authors: Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari
First: 2025-06-06T15:25:14+00:00 · Latest: 2025-11-26T16:36:22+00:00
Abstract
Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete, as is often the case, a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.
中文标题/摘要
标题:ENMA:面向生成式神经PDE算子的逐令牌自回归
求解时间依赖的参数化偏微分方程(PDEs)仍然是神经求解器面临的基本挑战,尤其是在需要跨越广泛的物理参数和动力学进行泛化时。当数据不确定或不完整时(这种情况很常见),一种自然的做法是转向生成模型。我们提出了ENMA,一种生成式神经算子,旨在建模物理现象产生的时空动态。ENMA使用以流匹配损失训练的生成式掩码自回归Transformer,在压缩的潜在空间中预测未来动态,从而实现逐令牌生成。不规则采样的空间观测通过注意力机制编码为均匀的潜在表示,并通过时空卷积编码器进一步压缩。这使得ENMA能够在推理时以目标轨迹的过去状态或具有相似动力学的辅助上下文轨迹为条件,进行上下文学习。由此得到一个稳健且适应性强的框架,能够泛化到新的PDE情形,并支持时间依赖参数化PDE的单样本代理建模。
Summary / 总结
ENMA is designed to solve time-dependent parametric PDEs by predicting future dynamics in a compressed latent space using a generative masked autoregressive transformer. It encodes irregularly sampled spatial observations into uniform latent representations and further compresses them through a spatio-temporal convolutional encoder. Key findings include ENMA's ability to perform in-context learning at inference time and its robustness in generalizing to new PDE regimes and supporting one-shot surrogate modeling of time-dependent parametric PDEs.
ENMA 是一种用于建模物理现象时空动态的生成式神经算子。它使用掩码自回归 Transformer 在压缩的潜在空间中预测未来动态,并通过注意力机制和时空卷积编码器将不规则采样的空间观测编码为均匀的潜在表示。关键发现包括 ENMA 在推理时能够进行上下文学习,并且能够稳健地泛化到新的 PDE 情形,支持时间依赖参数化 PDE 的单样本代理建模。
Machine Learning Approaches to Clinical Risk Prediction: Multi-Scale Temporal Alignment in Electronic Health Records
Authors: Wei-Chen Chang, Lu Dai, Ting Xu
First: 2025-11-26T16:33:59+00:00 · Latest: 2025-11-26T16:33:59+00:00
Comments: 5 pages, 3 figures
Abstract
This study proposes a risk prediction method based on a Multi-Scale Temporal Alignment Network (MSTAN) to address the challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR). The method focuses on temporal feature modeling by introducing a learnable temporal alignment mechanism and a multi-scale convolutional feature extraction structure to jointly model long-term trends and short-term fluctuations in EHR sequences. At the input level, the model maps multi-source clinical features into a unified high-dimensional semantic space and employs temporal embedding and alignment modules to dynamically weight irregularly sampled data, reducing the impact of temporal distribution differences on model performance. The multi-scale feature extraction module then captures key patterns across different temporal granularities through multi-layer convolution and hierarchical fusion, achieving a fine-grained representation of patient states. Finally, an attention-based aggregation mechanism integrates global temporal dependencies to generate individual-level risk representations for disease risk prediction and health status assessment. Experiments conducted on publicly available EHR datasets show that the proposed model outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, demonstrating the effectiveness and robustness of multi-scale temporal alignment in complex medical time-series analysis. This study provides a new solution for intelligent representation of high-dimensional asynchronous medical sequences and offers important technical support for EHR-driven clinical risk prediction.
中文标题/摘要
标题:临床风险预测的机器学习方法:电子健康记录中的多尺度时间对齐
本研究提出了一种基于多尺度时间对齐网络(MSTAN)的风险预测方法,以解决电子健康记录(EHR)中的时间不规则性、采样间隔差异和多尺度动态依赖性挑战。该方法通过引入可学习的时间对齐机制和多尺度卷积特征提取结构,专注于时间特征建模,以共同建模EHR序列中的长期趋势和短期波动。在输入层面,模型将多源临床特征映射到统一的高维语义空间,并使用时间嵌入和对齐模块动态加权不规则采样数据,减少时间分布差异对模型性能的影响。多尺度特征提取模块通过多层卷积和分层融合捕获不同时间粒度下的关键模式,实现患者状态的精细表示。最后,基于注意力的聚合机制整合全局时间依赖性,生成个体级别的风险表示,用于疾病风险预测和健康状况评估。在公开的EHR数据集上进行的实验表明,所提出模型在准确率、召回率、精确率和F1分数方面优于主流基线,证明了在复杂医疗时间序列分析中多尺度时间对齐的有效性和鲁棒性。本研究为智能表示高维异步医疗序列提供了新的解决方案,并为基于EHR的临床风险预测提供了重要的技术支持。
Summary / 总结
This study introduces the Multi-Scale Temporal Alignment Network (MSTAN), which predicts clinical risk from Electronic Health Records (EHR) while handling temporal irregularity and multi-scale dependencies. The method combines a learnable temporal alignment mechanism with multi-scale convolution to model both long-term trends and short-term fluctuations. On publicly available EHR datasets, MSTAN outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, indicating that multi-scale temporal alignment is effective for EHR analysis.
该研究提出了一种多尺度时间对齐网络(MSTAN),通过解决时间不规则性和多尺度依赖性问题来预测电子健康记录(EHR)中的临床风险。该方法使用可学习的时间对齐机制和多尺度卷积来建模长期趋势和短期波动。实验结果显示,MSTAN在准确率、召回率、精确率和F1分数等方面优于现有方法,突显了其在复杂医疗时间序列分析中的有效性。
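As a rough illustration of the pipeline summarized above (weighting of irregularly sampled data, multi-scale convolution, attention-based aggregation), here is a minimal pure-Python sketch. It is not the authors' MSTAN implementation: the exponential gap-decay weighting, the fixed averaging kernels, the all-ones attention query, the sample data, and every function name are assumptions made for illustration, whereas the paper describes learned components.

```python
import math

def time_decay_weights(timestamps, rate=0.1):
    """Assumed weighting: exp(-rate * gap to the previous observation)."""
    weights = [1.0]  # the first observation has no predecessor
    for prev, cur in zip(timestamps, timestamps[1:]):
        weights.append(math.exp(-rate * (cur - prev)))
    return weights

def conv1d_same(values, kernel):
    """1-D convolution with zero padding; output length equals input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(values):
                acc += w * values[j]
        out.append(acc)
    return out

def multi_scale_features(values, kernel_sizes=(1, 3, 5)):
    """One fixed averaging kernel per scale (a stand-in for learned kernels)."""
    branches = [conv1d_same(values, [1.0 / k] * k) for k in kernel_sizes]
    # Transpose so that each time step gets one feature per scale.
    return [list(step) for step in zip(*branches)]

def attention_pool(features, query):
    """Softmax attention over time steps; returns pooled vector and weights."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(a * feat[d] for a, feat in zip(attn, features))
              for d in range(dim)]
    return pooled, attn

# Hypothetical irregularly sampled measurements (time in hours, value).
timestamps = [0.0, 1.0, 1.5, 6.0, 6.2]
values = [1.0, 1.2, 1.1, 3.0, 3.1]

weights = time_decay_weights(timestamps)
weighted = [w * v for w, v in zip(weights, values)]
features = multi_scale_features(weighted)
pooled, attn = attention_pool(features, query=[1.0, 1.0, 1.0])
```

In a trained model the decay rate, the kernels, and the query would be learned parameters; they are fixed here only to keep the data flow visible.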
VacuumVLA: Boosting VLA Capabilities via a Unified Suction and Gripping Tool for Complex Robotic Manipulation
Authors: Hui Zhou, Siyuan Huang, Minxing Li, Hao Zhang, Lue Fan, Shaoshuai Shi
First: 2025-11-26T16:29:24+00:00 · Latest: 2025-11-26T16:29:24+00:00
Comments: 8 pages
Abstract
Vision-Language-Action (VLA) models have significantly advanced general-purpose robotic manipulation by harnessing large-scale pretrained vision and language representations. Most current VLA systems employ parallel two-finger grippers as their default end effectors. However, such grippers face inherent limitations in certain real-world tasks, such as wiping glass surfaces or opening drawers without handles, due to insufficient contact area or lack of adhesion. To overcome these challenges, we present a low-cost, integrated hardware design that combines a mechanical two-finger gripper with a vacuum suction unit, enabling dual-mode manipulation within a single end effector. Our system supports flexible switching or synergistic use of both modalities, expanding the range of feasible tasks. We validate the efficiency and practicality of our design within two state-of-the-art VLA frameworks: DexVLA and Pi0. Experimental results demonstrate that with the proposed hybrid end effector, robots can successfully perform multiple complex tasks that are infeasible for conventional two-finger grippers alone. All hardware designs and control systems will be released.
中文标题/摘要
标题:VacuumVLA:通过统一吸盘和夹持工具提升VLA能力以实现复杂机器人操作
视觉语言行动模型通过利用大规模预训练的视觉和语言表示,显著提升了通用机器人操作的能力。在现有方法中,大多数当前的VLA系统默认使用平行双指夹具作为末端执行器。然而,这种夹具在处理某些实际任务时存在局限性,例如擦拭玻璃表面或打开无把手的抽屉,因为它们的接触面积不足或缺乏附着力。为克服这些挑战,我们提出了一种低成本的集成硬件设计,结合了机械双指夹具和真空吸盘单元,使单一末端执行器能够在两种模式之间进行切换或协同操作,从而扩大可行任务的范围。我们通过两个最先进的VLA框架:DexVLA和Pi0验证了该设计的效率和实用性。实验结果表明,使用提出的混合末端执行器,机器人可以成功执行多种复杂的任务,而这些任务仅靠传统的双指夹具是无法完成的。所有硬件设计和控制系统将被发布。
Summary / 总结
The research aims to enhance robotic manipulation by addressing the limitations of parallel two-finger grippers in tasks such as wiping glass surfaces or opening drawers without handles. The study introduces a hybrid end effector that combines a mechanical two-finger gripper with a vacuum suction unit, allowing dual-mode manipulation. Experiments within the DexVLA and Pi0 frameworks show that this design enables robots to complete complex tasks that are infeasible for conventional two-finger grippers alone.
研究旨在通过将真空吸盘单元与机械两指夹爪集成到单个末端执行器中,来提升机器人的操作能力,解决传统夹爪在擦拭玻璃表面或打开无把手抽屉等任务上的局限性。研究在DexVLA和Pi0框架下验证了这一方法,展示了在复杂任务上的改进表现,这些任务单靠传统夹爪无法完成。
Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization
Authors: Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte, Juergen Gall
First: 2025-11-23T02:58:10+00:00 · Latest: 2025-11-26T16:28:59+00:00
Abstract
In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models continuously observe new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
中文标题/摘要
标题:连续视频流中基于扩散噪声优化的序列自适应视频预测
在本研究中,我们探讨了基于扩散的视频预测模型,这些模型可以预测未来的视频帧,适用于连续视频流。在这种情况下,模型会不断观察新的训练样本,我们旨在利用这一点来提高其预测能力。因此,我们提出了一种方法,使预训练的扩散模型能够连续适应视频流。由于微调大型扩散模型的参数成本过高,我们在推理过程中优化扩散噪声,同时保持模型参数不变,使模型能够自适应地确定合适的采样噪声。我们称这种方法为序列自适应视频预测与扩散噪声优化(SAVi-DNO)。为了验证我们的方法,我们在Ego4D数据集上引入了一个新的评估设置,重点关注长连续视频的同时适应和评估。实验证明,SAVi-DNO在Ego4D和OpenDV-YouTube的长视频以及UCF-101和SkyTimelapse的视频上,基于FVD、SSIM和PSNR指标表现出更好的性能,展示了其有效性。
Summary / 总结
This work explores diffusion-based video prediction models for continuous video streams, aiming to improve their predictions by continuously adapting to new training samples. The proposed approach, Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO), refines the diffusion noise during inference while keeping the model parameters frozen. Experimental results show improved performance on long videos from Ego4D and OpenDV-YouTube, as well as on videos from UCF-101 and SkyTimelapse, as measured by FVD, SSIM, and PSNR metrics.
本文探讨了基于扩散的视频预测模型在连续视频流中的应用,提出了一种称为SAVi-DNO的方法,使预训练的扩散模型能够持续适应视频流中不断到来的新样本。该方法在推理过程中优化扩散噪声,同时保持模型参数冻结,从而让模型自适应地确定合适的采样噪声,而无需对参数进行昂贵的微调。实验结果表明,SAVi-DNO在Ego4D和OpenDV-YouTube的长视频以及UCF-101和SkyTimelapse的视频上,在FVD、SSIM和PSNR指标上均有提升,证明了其有效性。
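The core idea summarized above, adapting the sampling noise on a stream while the model parameters stay frozen, can be shown with a deliberately tiny toy. This is not SAVi-DNO itself: the scalar linear "model", its constants `A` and `B`, the squared-error objective, the learning rate, and the synthetic stream are all assumptions chosen so the example runs without a deep learning library. A real system would backpropagate through a diffusion sampler instead.

```python
# Frozen "model parameters" (hypothetical values); they are never updated.
A, B = 0.5, 2.0

def predict(context, noise):
    """Stand-in for a frozen predictor: next frame from context and noise."""
    return A * context + B * noise

def adapt_noise(noise, context, target, lr=0.05):
    """One gradient step on the noise only, minimising squared error."""
    error = predict(context, noise) - target
    grad = 2.0 * error * B  # derivative of error**2 with respect to noise
    return noise - lr * grad

def run_stream(frames, noise=0.0):
    """Predict each next frame first, then adapt the noise on the observed frame."""
    errors = []
    for context, target in zip(frames, frames[1:]):
        errors.append(abs(predict(context, noise) - target))
        noise = adapt_noise(noise, context, target)  # carried to the next step
    return noise, errors

# Synthetic stream in which the best noise value is 0.5 (since B * 0.5 == 1.0).
frames = [0.0]
for _ in range(20):
    frames.append(0.5 * frames[-1] + 1.0)

final_noise, errors = run_stream(frames)
```

Each prediction is scored before the noise is updated on that frame, which mirrors the simultaneous adaptation and evaluation setting described in the abstract. The prediction error shrinks along the stream even though `A` and `B` never change.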