Scaling Spatial Intelligence with Multimodal Foundation Models
Authors: Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
First: 2025-11-17T18:59:33+00:00 · Latest: 2025-11-17T18:59:33+00:00
Comments: Model: https://huggingface.co/collections/sensenova/sensenova-si; Code: https://github.com/OpenSenseNova/SenseNova-SI
Abstract
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M: eight million diverse data samples under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube, 54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training, analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously. All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
Summary
This work aims to strengthen the spatial intelligence of multimodal foundation models through the SenseNova-SI family, which builds on Qwen3-VL, InternVL3, and Bagel. The authors systematically curate SenseNova-SI-8M, eight million diverse samples organized under a rigorous taxonomy of spatial capabilities, and train on it. SenseNova-SI reports 68.7% on VSI-Bench and 85.6% on MindCube, among other spatial intelligence benchmarks, while maintaining strong general multimodal understanding (84.9% on MMBench-En). The study also examines the impact of data scaling, early signs of emergent generalization, and potential downstream applications.
Segment Anything Across Shots: A Method and Benchmark
Authors: Hengrui Hu, Kaining Ying, Henghui Ding
Venue: AAAI 2026
First: 2025-11-17T18:58:40+00:00 · Latest: 2025-11-17T18:58:40+00:00
Comments: AAAI 2026, Project Page: https://henghuiding.com/SAAS/
Abstract
This work focuses on multi-shot semi-supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single-shot videos and struggle with shot discontinuities, thereby limiting their real-world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross-shot generalization with single-shot data to alleviate the severe annotated multi-shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut-VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on YouMVOS and Cut-VOS demonstrate that the proposed SAAS achieves state-of-the-art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.
Summary
This work addresses multi-shot semi-supervised video object segmentation (MVOS) by proposing a transition mimicking data augmentation strategy (TMA) and the Segment Anything Across Shots (SAAS) model. SAAS detects and handles shot transitions effectively, and experiments on YouMVOS and Cut-VOS show that it achieves state-of-the-art performance. The new Cut-VOS benchmark, with dense mask annotations, diverse object categories, and high-frequency transitions, supports evaluation and future research in MVOS. The code and datasets are publicly released.
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
Authors: Junwei Yu, Trevor Darrell, XuDong Wang
First: 2025-11-17T18:58:34+00:00 · Latest: 2025-11-17T18:58:34+00:00
Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
Summary
UnSAMv2 addresses the Segment Anything Model (SAM) family's limited control over segmentation granularity with a self-supervised approach that requires no human annotations. Using only 6K unlabeled images and 0.02% additional parameters, it enhances SAM-2 with precise, continuous control over segmentation scale across interactive, whole-image, and video segmentation tasks. UnSAMv2 improves NoC_90 (5.69 → 4.75), 1-IoU (58.0 → 73.1), and AR_1000 (49.6 → 68.3).
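The granularity ambiguity described in the abstract can be made concrete with a small mask-overlap sketch. It is not taken from UnSAMv2; the two toy masks and the `mask_iou` helper are hypothetical, and the only assumption is that IoU-style metrics compare a predicted mask against a target mask pixel by pixel.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks given as nested lists."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


# Hypothetical masks for the same prompt: the whole object vs. one of its parts.
coarse = [[1, 1, 1],
          [1, 1, 1],
          [0, 0, 0]]
fine = [[1, 1, 0],
        [0, 0, 0],
        [0, 0, 0]]

print(mask_iou(coarse, coarse))  # identical masks
print(mask_iou(coarse, fine))    # same prompt, different granularity
```

Both masks are plausible answers to a single click, yet they overlap poorly, which is why a model without explicit granularity control can score badly even when its output is reasonable.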
Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine
Authors: Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao
Venue: AAAI 2026
First: 2025-11-17T18:57:39+00:00 · Latest: 2025-11-17T18:57:39+00:00
Comments: AAAI 2026, Project Page: https://henghuiding.com/FFSE/
Abstract
Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.
Summary
FFSE is a 3D-aware autoregressive framework for intuitive, physically consistent object manipulation on real-world images. Unlike previous methods, which either operate in image space or rely on slow, error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations. This allows arbitrary manipulations such as translation, scaling, and rotation while preserving realistic background effects (e.g., shadows, reflections) and global scene consistency across editing rounds. Experiments show that FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing.
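To illustrate what "editing as a sequence of 3D transformations" means geometrically, the sketch below composes translation, rotation, and scaling as 4x4 homogeneous matrices over several editing rounds. It is only the textbook rigid/affine composition, not FFSE's learned model; all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def rotation_y(degrees):
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]


def compose_rounds(edits):
    """Accumulate per-round edits; each new edit acts on the already-edited object."""
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for edit in edits:
        pose = matmul(edit, pose)
    return pose


def apply(pose, point):
    x, y, z = point
    v = [x, y, z, 1.0]
    out = [sum(pose[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])


# Three editing rounds: move right, turn 90 degrees about the y axis, double the size.
rounds = [translation(1, 0, 0), rotation_y(90), scaling(2)]
pose = compose_rounds(rounds)
moved_origin = apply(pose, (0.0, 0.0, 0.0))
print(moved_origin)
```

Because matrix multiplication is not commutative, the order of rounds matters, which is one reason multi-round editing is harder than applying a single transformation.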
From Power to Precision: Learning Fine-grained Dexterity for Multi-fingered Robotic Hands
Authors: Jianglong Ye, Lai Wei, Guangqi Jiang, Changwei Jing, Xueyan Zou, Xiaolong Wang
First: 2025-11-17T18:56:50+00:00 · Latest: 2025-11-17T18:56:50+00:00
Comments: Project page: https://jianglongye.com/power-to-precision
Abstract
Human grasps can be roughly categorized into two types: power grasps and precision grasps. Precision grasping enables tool use and is believed to have influenced human evolution. Today's multi-fingered robotic hands are effective in power grasps, but for tasks requiring precision, parallel grippers are still more widely adopted. This contrast highlights a key limitation in current robotic hand design: the difficulty of achieving both stable power grasps and precise, fine-grained manipulation within a single, versatile system. In this work, we bridge this gap by jointly optimizing the control and hardware design of a multi-fingered dexterous hand, enabling both power and precision manipulation. Rather than redesigning the entire hand, we introduce a lightweight fingertip geometry modification, represent it as a contact plane, and jointly optimize its parameters along with the corresponding control. Our control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions, which proves robust for sim-to-real transfer. On the design side, we leverage large-scale simulation to optimize the fingertip geometry using a differentiable neural-physics surrogate model. We validate our approach through extensive experiments in both sim-to-real and real-to-real settings. Our method achieves an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping, and a 93.3% success rate in challenging real-world tasks involving bread pinching. These results demonstrate that our co-design framework can significantly enhance the fine-grained manipulation ability of multi-fingered hands without reducing their ability for power grasps. Our project page is at https://jianglongye.com/power-to-precision
Summary
This research improves the fine-grained manipulation ability of multi-fingered robotic hands by jointly optimizing control and hardware design. Rather than redesigning the entire hand, the method introduces a lightweight fingertip geometry modification, represented as a contact plane, whose parameters are optimized together with the corresponding control. The control strategy dynamically switches between power and precision manipulation and simplifies precision control into parallel thumb-index motions. Experiments show an 82.5% zero-shot success rate on unseen objects in sim-to-real precision grasping and a 93.3% success rate in real-world bread-pinching tasks, indicating enhanced fine-grained manipulation without compromising power grasps.
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation
Authors: Zeyu Wang, Zilong Chen, Chenhui Gou, Feng Li, Chaorui Deng, Deyao Zhu, Kunchang Li, Weihao Yu, Haoqin Tu, Haoqi Fan, Cihang Xie
First: 2025-10-27T02:59:57+00:00 · Latest: 2025-11-17T18:55:30+00:00
Comments: Preprint. Work in progress
Abstract
Unified multimodal models have recently shown remarkable gains in both capability and versatility, yet most leading systems are still trained from scratch and require substantial computational resources. In this paper, we show that competitive performance can be obtained far more efficiently by strategically fusing publicly available models specialized for either generation or understanding. Our key design is to retain the original blocks while additionally interleaving multimodal self-attention blocks throughout the networks. This double fusion mechanism (1) effectively enables rich multi-modal fusion while largely preserving the original strengths of the base models, and (2) catalyzes synergistic fusion of high-level semantic representations from the understanding encoder with low-level spatial signals from the generation encoder. By training with only ~35B tokens, this approach achieves strong results across multiple benchmarks: 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, 6.06 on GEditBench, and 3.77 on ImgEdit-Bench for image editing. By fully releasing the entire suite of code, model weights, and datasets, we hope to support future research on unified multimodal modeling.
Summary
The research develops a lightweight framework for unified multimodal understanding and generation by fusing publicly available models specialized for either task instead of training from scratch. The method retains the original blocks of the base models and interleaves multimodal self-attention blocks throughout the networks, enabling rich multimodal fusion while largely preserving the base models' strengths. Trained with only ~35B tokens, it scores 0.91 on GenEval for compositional text-to-image generation, 82.16 on DPG-Bench for complex text-to-image generation, and 6.06 on GEditBench and 3.77 on ImgEdit-Bench for image editing.
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Authors: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
First: 2025-11-17T18:52:44+00:00 · Latest: 2025-11-17T18:52:44+00:00
Comments: Project: https://haroldchen19.github.io/TiViBench-Page/
Abstract
The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
Summary
TiViBench is a hierarchical benchmark for evaluating the reasoning capabilities of image-to-video (I2V) generation models, addressing the limitation that existing benchmarks focus on visual fidelity and temporal coherence. It assesses reasoning across four dimensions (structural, spatial and visual pattern, symbolic and logical, and action planning) over 24 task scenarios at 3 difficulty levels. Commercial models such as Sora 2 and Veo 3.1 show stronger reasoning potential, while open-source models have untapped potential that is held back by limited training scale and data diversity. The study also introduces VideoTPO, a test-time strategy inspired by preference optimization that uses LLM self-analysis of generated candidates to improve reasoning performance without additional training, data, or reward models.
Generalist Foundation Models Are Not Clinical Enough for Hospital Operations
Authors: Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann
First: 2025-11-17T18:52:22+00:00 · Latest: 2025-11-17T18:52:22+00:00
Abstract
Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
Summary
The study examines whether generalist foundation models have the specialized knowledge needed for hospital operational decisions. It introduces Lang1, a family of models (100M-7B parameters) pretrained on a corpus that blends 80B clinical tokens from NYU Langone Health's EHRs with 627B tokens from the internet, and evaluates it on ReMedE, a benchmark covering five tasks including 30-day readmission and 30-day mortality prediction. In zero-shot settings, both general-purpose and specialized models underperform on four of the five tasks. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, underscoring the importance of explicit supervised finetuning and in-domain pretraining for healthcare operations.
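The results above are reported as AUROC. As a reminder of what that number measures, here is a minimal sketch that computes it as the probability that a randomly chosen positive case is scored above a randomly chosen negative one (ties count half). The labels and scores are made up for illustration and have nothing to do with ReMedE data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical readmission labels (1 = readmitted) and model risk scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
```

An AUROC of 0.5 corresponds to random ranking, so the 36.6%-71.7% zero-shot range quoted in the abstract spans from worse than chance to moderately useful.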
Learning stochasticity: a nonparametric framework for intrinsic noise estimation
Authors: Gianluigi Pillonetto, Alberto Giaretta, Mauro Bisiacco
First: 2025-11-17T18:52:05+00:00 · Latest: 2025-11-17T18:52:05+00:00
Abstract
Understanding the principles that govern dynamical systems is a central challenge across many scientific domains, including biology and ecology. Incomplete knowledge of nonlinear interactions and stochastic effects often renders bottom-up modeling approaches ineffective, motivating the development of methods that can discover governing equations directly from data. In such contexts, parametric models often struggle without strong prior knowledge, especially when estimating intrinsic noise. Nonetheless, incorporating stochastic effects is often essential for understanding the dynamic behavior of complex systems such as gene regulatory networks and signaling pathways. To address these challenges, we introduce Trine (Three-phase Regression for INtrinsic noisE), a nonparametric, kernel-based framework that infers state-dependent intrinsic noise from time-series data. Trine features a three-stage algorithm that combines analytically solvable subproblems with a structured kernel architecture that captures both abrupt noise-driven fluctuations and smooth, state-dependent changes in variance. We validate Trine on biological and ecological systems, demonstrating its ability to uncover hidden dynamics without relying on predefined parametric assumptions. Across several benchmark problems, Trine achieves performance comparable to that of an oracle. Biologically, this oracle can be viewed as an idealized observer capable of directly tracking the random fluctuations in molecular concentrations or reaction events within a cell. The Trine framework thus opens new avenues for understanding how intrinsic noise affects the behavior of complex systems.
Summary
The paper introduces Trine, a nonparametric, kernel-based framework that infers state-dependent intrinsic noise in dynamical systems from time-series data. It targets a weakness of parametric models, which struggle to capture stochastic effects without strong prior knowledge. Trine uses a three-stage algorithm with a structured kernel architecture and uncovers hidden dynamics in biological and ecological systems without predefined parametric assumptions. Across several benchmark problems, it performs comparably to an oracle, an idealized observer that can directly track random fluctuations in molecular concentrations or reaction events within a cell.
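The idea of inferring state-dependent noise from time series can be illustrated with a much simpler estimator than Trine's three-stage algorithm. The sketch below is an assumption-laden stand-in: it simulates a toy stochastic differential equation with a made-up noise amplitude `sigma(x)`, then recovers the local variance with plain Nadaraya-Watson kernel smoothing of squared increments. None of the names or parameter values come from the paper.

```python
import math
import random


def sigma(x):
    """Hypothetical ground-truth noise amplitude: noise grows with |x|."""
    return 0.1 + 0.5 * abs(x)


def simulate(n_traj=200, n_steps=20, dt=0.01, seed=0):
    """Euler-Maruyama simulation of dx = -x dt + sigma(x) dW from random starts."""
    rng = random.Random(seed)
    states, increments = [], []
    for _ in range(n_traj):
        x = rng.uniform(-2.0, 2.0)
        for _ in range(n_steps):
            dx = -x * dt + sigma(x) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            states.append(x)
            increments.append(dx)
            x += dx
    return states, increments


def kernel_variance(x_query, states, increments, dt, bandwidth=0.2):
    """Gaussian-kernel weighted average of dx^2 / dt around x_query."""
    num = den = 0.0
    for x, dx in zip(states, increments):
        w = math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2)
        num += w * dx * dx / dt
        den += w
    return num / den


dt = 0.01
states, increments = simulate(dt=dt)
var_low = kernel_variance(0.0, states, increments, dt)
var_high = kernel_variance(1.0, states, increments, dt)
print(var_low, var_high)  # compare with sigma(0)**2 = 0.01 and sigma(1)**2 = 0.36
```

The squared increment divided by `dt` estimates the local variance up to a small drift term of order `dt`, so the smoothed curve should rise with `|x|` here. A fixed-bandwidth smoother like this blurs abrupt changes, which is the kind of limitation a structured kernel design is meant to address.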
Protein Secondary Structure Prediction Using 3D Graphs and Relation-Aware Message Passing Transformers
Authors: Disha Varshney, Samarth Garg, Sarthak Tyagi, Deeksha Varshney, Nayan Deep, Asif Ekbal
First: 2025-11-17T18:39:13+00:00 · Latest: 2025-11-17T18:39:13+00:00
Comments: 40 pages
Abstract
In this study, we tackle the challenging task of predicting secondary structures from protein primary sequences, a pivotal initial stride towards predicting tertiary structures, while yielding crucial insights into protein activity, relationships, and functions. Existing methods often utilize extensive sets of unlabeled amino acid sequences. However, these approaches neither explicitly capture nor harness the accessible protein 3D structural data, which is recognized as a decisive factor in dictating protein functions. To address this, we utilize protein residue graphs and introduce various forms of sequential or structural connections to capture enhanced spatial information. We adeptly combine Graph Neural Networks (GNNs) and Language Models (LMs), specifically utilizing a pre-trained transformer-based protein language model to encode amino acid sequences and employing message-passing mechanisms like GCN and R-GCN to capture geometric characteristics of protein structures. Employing convolution within a specific node's nearby region, including relations, we stack multiple convolutional layers to efficiently learn combined insights from the protein's spatial graph, revealing intricate interconnections and dependencies in its structural arrangement. To assess our model's performance, we employed the training dataset provided by NetSurfP-2.0, which annotates secondary structure in 3 and 8 states. Extensive experiments show that our proposed model, SSRGNet, surpasses the baseline on F1 scores.
Summary
This study predicts protein secondary structures from primary sequences, an important step toward understanding protein function. The authors introduce SSRGNet, which combines Graph Neural Networks and a pre-trained protein language model, using protein residue graphs and relation-aware message passing. On the NetSurfP-2.0 dataset, SSRGNet surpasses the baseline on F1 scores for 3-state and 8-state secondary structure prediction.
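A residue graph with "sequential or structural connections" can be sketched as follows. This is a generic construction, not SSRGNet's exact recipe: the two relation names, the 8.0 distance cutoff, and the toy coordinates are assumptions chosen for illustration. Each edge carries a relation label, which is the kind of input a relation-aware layer such as R-GCN consumes.

```python
import math


def build_residue_graph(coords, cutoff=8.0):
    """Return (i, j, relation) edges over residues given one 3D point per residue.

    Neighbours along the chain get a "sequential" edge; other pairs closer than
    `cutoff` get a "spatial" edge. Both the labels and the cutoff are assumptions.
    """
    edges = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if j - i == 1:
                edges.append((i, j, "sequential"))
            elif math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j, "spatial"))
    return edges


# Hypothetical coordinates of five residues forming a hairpin-like turn.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]
edges = build_residue_graph(coords)
for edge in edges:
    print(edge)
```

Residues 1 and 4 are far apart in sequence but close in space, so they are linked only by a spatial edge; this is the structural signal that a sequence-only model cannot see directly.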
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Authors: Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
First: 2025-11-17T18:37:41+00:00 · Latest: 2025-11-17T18:37:41+00:00
Comments: Submitting for Neurocomputing
Abstract
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
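Title and abstract (translated from Chinese)
Title: Training-Free Multi-View Extension of IC-Light for Text-Guided Scene Relighting
We introduce GS-Light, an efficient, text-guided relighting pipeline for scenes represented with Gaussian Splatting (3DGS). GS-Light is a training-free extension of a single-input diffusion model that handles multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we use a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf geometry and semantic estimators (depth, surface normals, and semantic segmentation), we fuse these priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These carefully derived initial latents guide the diffusion model toward relit outputs that better match user expectations, especially in lighting direction. Feeding the multi-view renderings and the initial latents into our multi-view relighting model yields high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on indoor and outdoor scenes against state-of-the-art baselines, including per-view relighting, video relighting, and scene editing methods. On quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and in qualitative assessment (user studies), GS-Light improves consistently over the baselines. Code and assets will be released upon publication.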
Summary / 总结
GS-Light is a training-free pipeline for text-guided relighting of 3D scenes using Gaussian Splatting. It parses user prompts with a large vision-language model to derive lighting priors, which are then fused with geometric and semantic estimations to generate initial latent codes. These codes guide a diffusion model to produce relit images that better match user expectations. GS-Light shows consistent improvements over baselines in terms of multi-view consistency, imaging quality, and aesthetic score. Evaluations were conducted on both indoor and outdoor scenes, and user studies confirmed its effectiveness.
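GS-Light is a training-free pipeline for text-guided relighting of 3D scenes represented with Gaussian Splatting. It parses the user prompt with a large vision-language model to extract lighting priors, then fuses these priors with geometric and semantic information to generate initial latent codes for each view. These codes guide a diffusion model toward relit images that better match user expectations. The multi-view images and initial latents are fed to a multi-view relighting model to produce high-quality relit images. Across quantitative metrics such as multi-view consistency, imaging quality, and aesthetic score, as well as in user studies, GS-Light outperforms state-of-the-art baselines.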
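For the GS-Light entry above, the step that fuses a lighting-direction prior with estimated surface normals into a per-view illumination map can be pictured with plain Lambertian shading. This is only a sketch under stated assumptions: the abstract does not specify the illumination model GS-Light uses, and `illumination_map`, `light_dir`, and `ambient` are hypothetical names introduced here.

```python
import numpy as np

def illumination_map(normals, light_dir, ambient=0.1):
    """Lambertian shading: ambient + (1 - ambient) * max(0, n . l).

    normals:   (H, W, 3) array of unit surface normals in camera space.
    light_dir: length-3 vector pointing from the surface toward the light
               (assumed to come from the parsed prompt; hypothetical input).
    """
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)                  # normalise the light direction
    cos_term = np.clip(normals @ l, 0.0, 1.0)  # back-facing pixels get 0
    return ambient + (1.0 - ambient) * cos_term

# Toy 1x2 "image": one pixel faces the light, the other faces away.
normals = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]])
shading = illumination_map(normals, light_dir=[0.0, 0.0, 2.0])
```

In this toy case the lit pixel receives full intensity and the back-facing pixel only the ambient term, which is the kind of directional cue an initial latent could encode.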
QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
Authors: Hyunwoo Oh, Hanning Chen, Sanggeon Yun, Yang Ni, Wenjun Huang, Tamoghno Das, Suyeon Jang, Mohsen Imani
First: 2025-11-17T18:34:04+00:00 · Latest: 2025-11-17T18:34:04+00:00
Comments: Accepted to DATE 2026
Abstract
Deformable transformers deliver state-of-the-art detection but map poorly to hardware due to irregular memory access and low arithmetic intensity. We introduce QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; the look-ahead drives a region prefetch into an alternate buffer--forming a schedule-aware prefetch loop that overlaps memory and compute. A fused MSDeformAttn engine executes interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors are kept on-chip and surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy tracks FP32 within <=0.9 AP across Deformable and Sparse DETR variants. By converting sparsity into locality--and locality into utilization--QUILL delivers consistent, end-to-end speedups.
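Title and abstract (translated from Chinese)
Title: QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention
Deformable transformers excel at detection but map poorly to hardware because of irregular memory access and low arithmetic intensity. We propose QUILL, a schedule-aware accelerator that turns deformable attention into cache-friendly, single-pass work. At its core, Distance-based Out-of-Order Querying (DOOQ) orders queries by spatial proximity; a look-ahead mechanism drives region prefetch into an alternate buffer, forming a schedule-aware prefetch loop that overlaps memory access with compute. A fused MSDeformAttn engine performs interpolation, Softmax, aggregation, and the final projection (W''m) in one pass without spilling intermediates, while small tensors stay on-chip and the surrounding dense layers run on integrated GEMMs. Implemented as RTL and evaluated end-to-end, QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy stays within 0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality, and locality into utilization, QUILL delivers consistent end-to-end speedups.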
Summary / 总结
QUILL is a schedule-aware accelerator that improves the efficiency of deformable transformers by addressing their poor mapping to hardware. Its core is the Distance-based Out-of-Order Querying (DOOQ) mechanism, which orders queries by spatial proximity and uses look-ahead to prefetch data, overlapping memory access with compute. QUILL achieves up to 7.29x higher throughput and 47.3x better energy efficiency than an RTX 4090, and surpasses prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy stays within 0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality and locality into utilization, QUILL delivers consistent end-to-end speedups.
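QUILL is a schedule-aware accelerator that raises the efficiency of deformable transformers by resolving their hardware-mapping problems. Its core is the Distance-based Out-of-Order Querying (DOOQ) mechanism, which orders queries by spatial proximity and prefetches data so that memory access and compute overlap. QUILL reaches 7.29x the throughput and 47.3x the energy efficiency of an RTX 4090, and exceeds prior accelerators by 3.26-9.82x in throughput and 2.01-6.07x in energy efficiency. With mixed-precision quantization, accuracy stays within 0.9 AP of FP32 across Deformable and Sparse DETR variants. By converting sparsity into locality and locality into utilization, QUILL improves end-to-end performance consistently.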
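To make the idea of ordering queries by spatial proximity concrete, the sketch below reorders 2D reference points with a greedy nearest-neighbour walk, so that consecutive queries touch nearby regions. This is an illustration only: the abstract does not describe how DOOQ computes its order, the greedy walk is a stand-in technique, and `order_queries_by_proximity` and `path_length` are hypothetical helper names.

```python
import numpy as np

def order_queries_by_proximity(ref_points):
    """Greedy nearest-neighbour ordering of 2D reference points.

    Returns a permutation of query indices in which each query is the
    closest not-yet-scheduled one to the previously scheduled query.
    """
    pts = np.asarray(ref_points, dtype=np.float64)
    n = len(pts)
    if n == 0:
        return []
    visited = np.zeros(n, dtype=bool)
    order = [0]                      # start from query 0 (arbitrary choice)
    visited[0] = True
    for _ in range(n - 1):
        d = np.linalg.norm(pts - pts[order[-1]], axis=1)
        d[visited] = np.inf          # never revisit a scheduled query
        nxt = int(np.argmin(d))
        order.append(nxt)
        visited[nxt] = True
    return order

def path_length(pts, order):
    """Total distance travelled between consecutive queries in `order`."""
    p = np.asarray(pts, dtype=np.float64)[order]
    return float(np.linalg.norm(np.diff(p, axis=0), axis=1).sum())

rng = np.random.default_rng(0)
points = rng.random((64, 2))
ordered = order_queries_by_proximity(points)
```

A shorter path between consecutive reference points is a rough proxy for better cache locality; a hardware scheduler would need a far cheaper ordering rule than this quadratic-time walk.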
T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization
Authors: Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, Hanning Chen, Tamoghno Das, Sanggeon Yun, Suyeon Jang, Andrew Ding, Nikil Dutt, Mohsen Imani
First: 2025-11-17T18:32:03+00:00 · Latest: 2025-11-17T18:32:03+00:00
Comments: Accepted to DATE 2026
Abstract
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
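Title and abstract (translated from Chinese)
Title: T-SAR: A Full-Stack Co-Design for CPU-Only Ternary LLM Inference via SIMD ALU Reorganization
Recent advances in LLMs have outpaced the compute and memory capacity of edge platforms that rely mainly on CPUs, making efficient and scalable deployment difficult. Although ternary quantization saves substantial resources, existing CPU solutions depend heavily on memory-based lookup tables (LUTs), which limits scalability, while FPGA or GPU accelerators remain impractical at the edge. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with only minor hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, improving GEMM latency by 5.6-24.5x and GEMV throughput by 1.1-86.2x, with only 3.2% power and 1.4% area overhead in the SIMD units. T-SAR reaches 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach to efficient LLM inference on edge platforms.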
Summary / 总结
T-SAR is a framework that enables scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. It eliminates memory bottlenecks and increases data-level parallelism, improving GEMM latency by 5.6-24.5x and GEMV throughput by 1.1-86.2x at only 3.2% power and 1.4% area overhead in the SIMD units. T-SAR achieves 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, making it a practical solution for efficient LLM inference on edge platforms.
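T-SAR is a framework that achieves scalable ternary LLM inference on CPUs by reusing the SIMD register file for dynamic, in-register LUT generation. It eliminates memory bottlenecks and increases data-level parallelism, improving GEMM latency by up to 24.5x and GEMV throughput by up to 86.2x with minimal hardware modifications. T-SAR delivers 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, making it a practical solution for efficient LLM inference on edge platforms.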
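The arithmetic behind LUT-based ternary inference can be sketched in a few lines: for each small group of activations, precompute the partial sum for every ternary weight pattern, then replace multiply-accumulates with table lookups. This NumPy sketch shows only that arithmetic, not T-SAR's in-register SIMD mechanism; the group size `G` and the helpers `pack_weights` and `lut_gemv` are assumptions made for illustration.

```python
import itertools
import numpy as np

G = 2  # activations per lookup group (hypothetical choice)

# All 3**G ternary patterns; row j holds the base-3 digits of j, shifted to {-1, 0, 1}.
PATTERNS = np.array(list(itertools.product((-1, 0, 1), repeat=G)))

def pack_weights(W):
    """Map each group of G ternary weights to its row index in PATTERNS."""
    rows, K = W.shape                          # K must be divisible by G
    digits = (W + 1).reshape(rows, K // G, G)  # values in {0, 1, 2}
    place = 3 ** np.arange(G - 1, -1, -1)      # most significant digit first
    return digits @ place                      # shape (rows, K // G)

def lut_gemv(packed, x):
    """Compute W @ x via one small LUT per activation group.

    The LUT is built once per activation vector and shared by every row.
    """
    groups = x.reshape(-1, G)                  # (K // G, G)
    lut = groups @ PATTERNS.T                  # (K // G, 3**G) partial sums
    return lut[np.arange(lut.shape[0]), packed].sum(axis=1)

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(5, 8))           # ternary weights in {-1, 0, 1}
x = rng.standard_normal(8)
y = lut_gemv(pack_weights(W), x)
```

The weights are packed once offline, while the table depends on the activations and must be rebuilt for each input vector, which is why generating it close to the compute units matters.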
iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos
Authors: Weikun Peng, Jun Lv, Cewu Lu, Manolis Savva
First: 2025-06-10T01:41:46+00:00 · Latest: 2025-11-17T18:31:53+00:00
Comments: 3DV 2026 camera-ready version. Project website can be found at https://3dlg-hcvc.github.io/video2articulation/
Abstract
Articulated objects are prevalent in daily life. Interactable digital twins of such objects have numerous applications in embodied AI and robotics. Unfortunately, current methods to digitize articulated real-world objects require carefully captured data, preventing practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones. However, this setting is challenging due to simultaneous object and camera motion and significant occlusions as the person interacts with the object. To tackle these challenges, we introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a dataset of 784 videos containing 284 objects across 11 categories that is 20x larger than available in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated object digital twin methods on both synthetic and real casually captured RGBD videos.
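Title and abstract (translated from Chinese)
Title: iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos
Articulated objects are common in daily life, and interactable digital twins of such objects have broad applications in embodied AI and robotics. Unfortunately, current methods for digitizing real-world articulated objects require carefully captured data, which prevents practical, scalable, and generalizable acquisition. We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera. Such videos of a person interacting with an articulated object are easy to collect at scale with smartphones. The setting is challenging, however, because the object and camera move simultaneously and the person causes significant occlusions while interacting with the object. To address these challenges, we introduce iTACO, a coarse-to-fine framework that infers joint parameters and segments movable parts from a dynamic RGBD video. To evaluate our method in this new setting, we build a dataset of 784 videos covering 284 objects across 11 categories, 20x larger than in prior work. We then compare our approach with existing methods that also take video as input. Our experiments show that iTACO outperforms existing articulated-object digital twin methods on both synthetic and real casually captured RGBD videos.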
Summary / 总结
The work builds interactable digital twins of articulated objects from casually captured RGBD videos, targeting applications in embodied AI and robotics. The method, iTACO, is a coarse-to-fine framework that infers joint parameters and segments movable parts from dynamic RGBD video, coping with simultaneous object and camera motion and with occlusions. For evaluation, the authors build a dataset of 784 videos covering 284 objects across 11 categories. Experiments show that iTACO outperforms existing methods on both synthetic and real casually captured RGBD videos.
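The work aims to create interactable digital twins of articulated objects from casually captured RGBD videos, in support of applications in embodied AI and robotics. The method, iTACO, uses a coarse-to-fine framework to infer joint parameters and segment movable parts from dynamic RGBD videos, handling object and camera motion as well as occlusions during interaction. Experiments show that iTACO outperforms existing methods on both synthetic and real casually captured RGBD videos, evaluated on a dataset of 784 videos and 284 objects spanning 11 categories.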
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Authors: Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Joshua Hansen, Andrew Howe, Patrick Alan Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema
First: 2025-11-17T18:06:26+00:00 · Latest: 2025-11-17T18:06:26+00:00
Abstract
Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.
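Title and abstract (translated from Chinese)
Title: OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
Earth observation data poses a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth, a multimodal, spatio-temporal foundation model that uses a novel self-supervised learning formulation, masking strategy, and loss function, all designed for the Earth observation domain. Compared with 12 other foundation models, OlmoEarth achieves state-of-the-art performance across a range of research benchmarks and real-world tasks from external partners. When evaluating embeddings, OlmoEarth is best on 15 of 24 tasks, and with full fine-tuning it is best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform covering data collection, labeling, training, and inference for Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.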
Summary / 总结
OlmoEarth is a multimodal, spatio-temporal foundation model designed for Earth observation data, which is spatial, sequential, and highly multimodal at once. It uses a self-supervised learning formulation with a masking strategy and loss function tailored to this domain. Compared against 12 other foundation models on research benchmarks and real-world partner tasks, OlmoEarth is best on 15 of 24 tasks when evaluating embeddings and on 19 of 29 tasks with full fine-tuning. It serves as the backbone of an end-to-end platform for data collection, labeling, training, and inference aimed at non-profits and NGOs. The source code, training data, and pre-trained weights are publicly available.
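OlmoEarth is a multimodal, spatio-temporal foundation model built for Earth observation data and its particular challenges. It adopts a self-supervised learning approach together with a new masking strategy and loss function designed for this domain. Across benchmarks and real-world tasks, OlmoEarth outperforms 12 other foundation models, ranking best on 15 of 24 tasks with embeddings and on 19 of 29 tasks with full fine-tuning. It is the backbone of an end-to-end platform for developing and deploying Earth observation models, supporting non-profits and NGOs. The model, its source code, and its pre-trained weights have been publicly released.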
Physically Interpretable World Models via Weakly Supervised Representation Learning
Authors: Zhenjiang Mao, Mrinall Eashaan Umasudhan, Ivan Ruchkin
First: 2024-12-17T12:51:24+00:00 · Latest: 2025-11-17T18:05:18+00:00
Abstract
Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability. This limits their reliability, generalizability, and applicability to safety-critical tasks. We introduce Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without requiring ground-truth physical annotations, PIWM employs weak distribution-based supervision that captures state uncertainty naturally arising from real-world sensing pipelines. The architecture integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. Across three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers true system parameters, and significantly improves physical grounding over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.
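Title and abstract (translated from Chinese)
Title: Physically Interpretable World Models via Weakly Supervised Representation Learning
Learning predictive models from high-dimensional sensory observations is fundamental for cyber-physical systems, yet the latent representations learned by standard world models lack physical interpretability, which limits their reliability, generalizability, and use in safety-critical tasks. We propose Physically Interpretable World Models (PIWM), a framework that aligns latent representations with real-world physical quantities and constrains their evolution through partially known physical dynamics. Physical interpretability in PIWM is defined by two complementary properties: (i) the learned latent state corresponds to meaningful physical variables, and (ii) its temporal evolution follows physically consistent dynamics. To achieve this without ground-truth physical annotations, PIWM uses weak distribution-based supervision, which naturally captures the state uncertainty inherent in real-world sensing pipelines. The architecture combines a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. In three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction, recovers the true system parameters, and improves physical grounding significantly over purely data-driven models. These results demonstrate the feasibility and advantages of learning physically interpretable world models directly from images under weak supervision.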
Summary / 总结
The research aims to enhance the reliability and applicability of world models in cyber-physical systems by making their latent representations physically interpretable. PIWM achieves this by aligning latent states with real-world physical variables and constraining their evolution with known physical dynamics, using weak distribution-based supervision. The method integrates a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model. Experimental results show that PIWM can achieve accurate long-horizon predictions and recover true system parameters, outperforming purely data-driven models in three case studies: Cart Pole, Lunar Lander, and Donkey Car.
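The work aims to develop physically interpretable world models by aligning latent representations with real-world physical quantities, thereby improving the reliability of cyber-physical systems. It proposes a framework called Physically Interpretable World Models (PIWM), which uses weak distribution-based supervision and combines a VQ-based visual encoder, a transformer-based physical encoder, and a learnable dynamics model grounded in known physical equations. In three case studies (Cart Pole, Lunar Lander, and Donkey Car), PIWM achieves accurate long-horizon prediction and significantly improves physical grounding.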
The Third Pillar of Causal Analysis? A Measurement Perspective on Causal Representations
Authors: Dingling Yao, Shimeng Huang, Riccardo Cadei, Kun Zhang, Francesco Locatello
First: 2025-05-23T10:25:17+00:00 · Latest: 2025-11-17T18:00:52+00:00
Comments: Camera-ready version for NeurIPS2025
Abstract
Causal reasoning and discovery, two fundamental tasks of causal analysis, often face challenges in applications due to the complexity, noisiness, and high-dimensionality of real-world data. Despite recent progress in identifying latent causal structures using causal representation learning (CRL), what makes learned representations useful for causal downstream tasks and how to evaluate them are still not well understood. In this paper, we reinterpret CRL using a measurement model framework, where the learned representations are viewed as proxy measurements of the latent causal variables. Our approach clarifies the conditions under which learned representations support downstream causal reasoning and provides a principled basis for quantitatively assessing the quality of representations using a new Test-based Measurement EXclusivity (T-MEX) score. We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, demonstrating that the proposed framework and corresponding score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.
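Title and abstract (translated from Chinese)
Title: The Third Pillar of Causal Analysis? A Measurement Perspective on Causal Representations
Causal reasoning and discovery, the two fundamental tasks of causal analysis, often face difficulties in applications because real-world data is complex, noisy, and high-dimensional. Despite recent progress in identifying latent causal structures with causal representation learning (CRL), it remains unclear what makes learned representations useful for causal downstream tasks and how to evaluate them. In this paper, we reinterpret CRL through a measurement model framework in which learned representations are treated as proxy measurements of the latent causal variables. Our approach clarifies the conditions under which learned representations support downstream causal reasoning, and it provides a principled basis for quantitatively assessing representation quality with a new Test-based Measurement EXclusivity (T-MEX) score. We validate T-MEX across diverse causal inference scenarios, including numerical simulations and real-world ecological video analysis, showing that the proposed framework and score effectively assess the identification of learned representations and their usefulness for causal downstream tasks.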
Summary / 总结
This paper addresses the challenges in causal reasoning and discovery by reinterpreting causal representation learning (CRL) through a measurement model framework. The learned representations are treated as proxy measurements of latent causal variables, which helps in understanding their utility for causal downstream tasks. The authors introduce a new T-MEX score to quantitatively assess the quality of these representations. The T-MEX score is validated across various scenarios, showing its effectiveness in assessing the identification and usefulness of learned representations for causal tasks.
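This paper addresses the challenges of causal reasoning and discovery on complex real-world data by reinterpreting causal representation learning (CRL) within a measurement model framework. The authors propose a new Test-based Measurement EXclusivity (T-MEX) score to assess the quality of learned representations and their usefulness for causal downstream tasks. Experiments show that the proposed framework and score are effective at assessing the identification of learned representations and their usefulness for causal downstream tasks.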
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
Authors: Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu
First: 2025-11-14T12:44:02+00:00 · Latest: 2025-11-17T18:00:42+00:00
Abstract
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256x256) with Flow Matching, Arcee reduces FID (lower is better) from 82.81 to 15.33 (5.4x lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
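Title and abstract (translated from Chinese)
Title: Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
State-space models (SSMs), Mamba in particular, are increasingly used for long-context sequence modeling, providing linear-time aggregation through an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants mainly explore multiple scan orders to relax strict causality for non-sequential signals such as images. Rather than preserving cross-block memory, the conventional selective-scan formulation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) of the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal SSR as the initial condition for the next block. The handoff across blocks is built as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. For practicality, Arcee is compatible with all prior "vision-mamba" variants, is parameter-free, and incurs a constant, negligible cost. From a modeling perspective, we view the terminal SSR as a mild directional prior induced by a causal pass over the input, rather than as an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256x256) with Flow Matching, Arcee reduces the FID of a single scan-order Zigzag Mamba baseline from 82.81 to 15.33 (5.4x lower). Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.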
Summary / 总结
Arcee is a novel approach that extends Mamba state-space models for generative vision tasks by reusing the terminal state-space representation (SSR) from one block as the initial condition for the next block, enabling a cross-block recurrent state chain. This method allows for efficient gradient flow across blocks via a differentiable boundary map. On CelebA-HQ unconditional generation, Arcee significantly reduces the FID score from 82.81 to 15.33, demonstrating its effectiveness.
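Arcee extends Mamba state-space models by using one block's terminal state-space representation (SSR) as the initial condition for the next block, forming a cross-block recurrent state chain. A differentiable boundary map lets gradients flow efficiently between blocks. On unconditional CelebA-HQ generation, Arcee lowers FID from 82.81 to 15.33, demonstrating its effectiveness.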
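The difference between zero-initialised blocks and a cross-block state chain can be illustrated with a toy diagonal linear recurrence. This is a simplified sketch under stated assumptions: real Mamba blocks use input-dependent (selective) parameters, whereas `ssm_block` below uses fixed ones, and the identity hand-off in `run_stack` is merely the simplest differentiable boundary map, not necessarily the one Arcee uses. Both function names are hypothetical.

```python
import numpy as np

def ssm_block(x, a, b, c, h0):
    """Scan h_t = a * h_{t-1} + b * x_t, y_t = c . h_t over a 1D sequence.

    Returns the output sequence and the terminal state h_T.
    """
    h = h0
    ys = []
    for x_t in x:
        h = a * h + b * x_t          # diagonal linear recurrence
        ys.append(c @ h)             # scalar read-out per token
    return np.array(ys), h

def run_stack(x, params, chain):
    """Run blocks in depth order; optionally chain terminal states across blocks."""
    h0 = np.zeros_like(params[0][0])
    for a, b, c in params:
        x, h_T = ssm_block(x, a, b, c, h0)
        if chain:
            h0 = h_T                 # identity hand-off to the next block
        # when not chaining, h0 stays at zero (the conventional formulation)
    return x

rng = np.random.default_rng(0)
N = 4                                # state dimension (hypothetical)
params = [(rng.uniform(0.5, 0.9, N), rng.standard_normal(N), rng.standard_normal(N))
          for _ in range(2)]
x = rng.standard_normal(6)
y_zero_init = run_stack(x, params, chain=False)
y_chained = run_stack(x, params, chain=True)
```

With a single block the two modes coincide, since there is no earlier terminal state to inherit; the outputs diverge only from the second block onward.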
Distribution Matching Distillation Meets Reinforcement Learning
Authors: Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Xin Jin, David Liu, Zhen Li, Mengmeng Wang, Peng Gao, Harry Yang
First: 2025-11-17T17:59:54+00:00 · Latest: 2025-11-17T17:59:54+00:00
Comments: The synergy of reinforcement learning and distribution matching distillation. See more: https://github.com/vvvvvjdy/dmdr
Abstract
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model to a few-step one to improve inference efficiency. However, the performance of the latter is often capped by the former. To circumvent this dilemma, we propose DMDR, a novel framework that combines Reinforcement Learning (RL) techniques into the distillation process. We show that for the RL of the few-step generator, the DMD loss itself is a more effective regularization compared to the traditional ones. In turn, RL can help to guide the mode coverage process in DMD more effectively. These allow us to unlock the capacity of the few-step generator by conducting distillation and RL simultaneously. Meanwhile, we design the dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. The experiments demonstrate that DMDR can achieve leading visual quality, prompt coherence among few-step methods, and even exhibit performance that exceeds the multi-step teacher.
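Title and abstract (translated from Chinese)
Title: Distribution Matching Distillation Meets Reinforcement Learning
Distribution Matching Distillation (DMD) distills a pre-trained multi-step diffusion model into a few-step model to improve inference efficiency. However, the performance of the latter is often capped by the former. To overcome this, we propose DMDR, a new framework that brings Reinforcement Learning (RL) techniques into the distillation process. We show that, for RL on the few-step generator, the DMD loss itself is a more effective regularizer than traditional ones. In turn, RL helps guide the mode coverage process in DMD more effectively. Together, these let us unlock the capacity of the few-step generator by running distillation and RL simultaneously. We also design dynamic distribution guidance and dynamic renoise sampling training strategies to improve the initial distillation process. Experiments show that DMDR achieves leading visual quality and prompt coherence among few-step methods, and even exceeds the performance of the multi-step teacher.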
Summary / 总结
The paper proposes DMDR, which combines Reinforcement Learning (RL) with Distribution Matching Distillation (DMD) to lift the performance ceiling that the multi-step teacher imposes on few-step generators. The DMD loss serves as the regularizer for RL, while RL in turn guides mode coverage in DMD, so distillation and RL run simultaneously. Dynamic distribution guidance and dynamic renoise sampling further improve the initial distillation stage. Experiments show that DMDR achieves leading visual quality and prompt coherence among few-step methods and can even exceed the multi-step teacher.
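The work proposes the DMDR framework, which combines Reinforcement Learning (RL) with Distribution Matching Distillation (DMD) to improve few-step generators. Using the DMD loss as the regularizer for RL, DMDR strengthens the mode coverage process and leads few-step methods in visual quality and prompt coherence. Dynamic distribution guidance and renoise sampling strategies further refine the initial distillation process. Experiments show that DMDR can even surpass the multi-step teacher model.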
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
Authors: Ziang Cao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu
First: 2025-11-17T17:59:53+00:00 · Latest: 2025-11-17T17:59:53+00:00
Comments: Project page: https://physx-anything.github.io/
Abstract
3D modeling is shifting from static visual representations toward physical, articulated assets that can be directly used in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, thereby limiting their utility in embodied AI. To bridge this gap, we introduce PhysX-Anything, the first simulation-ready physical 3D generative framework that, given a single in-the-wild image, produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing any special tokens during fine-tuning and significantly improving generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we construct a new dataset, PhysX-Mobility, which expands the object categories in prior physical 3D datasets by over 2x and includes more than 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images demonstrate that PhysX-Anything delivers strong generative performance and robust generalization. Furthermore, simulation-based experiments in a MuJoCo-style environment validate that our sim-ready assets can be directly used for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.
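Title and abstract (translated from Chinese)
Title: PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
3D modeling is shifting from static visual representations toward physical, articulated assets that can be used directly in simulation and interaction. However, most existing 3D generation methods overlook key physical and articulation properties, which limits their use in embodied AI. To close this gap, we propose PhysX-Anything, the first simulation-ready physical 3D generative framework: given a single in-the-wild image, it produces high-quality sim-ready 3D assets with explicit geometry, articulation, and physical attributes. Specifically, we propose the first VLM-based physical 3D generative model, together with a new 3D representation that tokenizes geometry efficiently. It reduces the number of tokens by 193x, enabling explicit geometry learning within standard VLM token budgets without introducing special tokens during fine-tuning, and it significantly improves generative quality. In addition, to overcome the limited diversity of existing physical 3D datasets, we build a new dataset, PhysX-Mobility, which expands the object categories of prior physical 3D datasets by more than 2x and contains over 2K common real-world objects with rich physical annotations. Extensive experiments on PhysX-Mobility and in-the-wild images show that PhysX-Anything delivers strong generative performance and robust generalization. Simulation experiments in a MuJoCo-style environment further validate that our sim-ready assets can be used directly for contact-rich robotic policy learning. We believe PhysX-Anything can substantially empower a broad range of downstream applications, especially in embodied AI and physics-based simulation.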
Summary / 总结
The research aims to generate high-quality 3D assets with physical and articulation properties suitable for simulation and interaction, addressing limitations in existing methods. The PhysX-Anything framework uses a VLM-based generative model and a new 3D representation to produce sim-ready assets from a single image, reducing token usage by 193x and improving generative quality. Experiments show strong performance and robust generalization, and simulation-based tests validate the assets' usability in robotic policy learning. The PhysX-Mobility dataset, with over 2K real-world objects, further enhances the diversity of physical 3D assets.
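The work aims to generate physical 3D assets for simulation and interaction from a single image, addressing the limitations of existing 3D generation methods. The PhysX-Anything framework uses a vision-language model (VLM) to produce high-quality 3D assets with explicit geometry, articulation, and physical attributes. It introduces an efficient 3D representation that cuts token usage by 193x and achieves better generative quality without special tokens. The authors also build a new dataset, PhysX-Mobility, which broadens the object categories and provides rich physical annotations. Experiments show strong generative performance and robust generalization, and simulation-based tests validate the assets for robotic policy learning.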
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Authors: Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
First: 2025-11-17T17:59:52+00:00 · Latest: 2025-11-17T17:59:52+00:00
Abstract
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
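Title and abstract (translated from Chinese)
Title: Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence that encodes part-level bounding boxes, semantic descriptions, and edit commands. This structured output acts as a general interface for driving downstream geometry-aware modules for part-based generation and editing. By decoupling symbolic planning from geometric synthesis, our approach lets any compatible geometry engine be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to separate structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments show that our model excels at producing high-quality, structured plans, achieving state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/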
Summary / 总结
Part-X-MLLM is a 3D multimodal large language model that formulates 3D tasks as structured programs. Given an RGB point cloud and a natural language prompt, the model generates a coherent token sequence that includes part-level bounding boxes, semantic descriptions, and edit commands. This structured output is used to drive geometry-aware modules for part-based generation and editing. The model is pre-trained to disentangle structure from semantics and fine-tuned on a large-scale, part-centric dataset. Experiments show that Part-X-MLLM performs well in grounded Q&A, compositional generation, and localized editing, achieving state-of-the-art results through a unified interface.
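Part-X-MLLM is a 3D multimodal large language model that folds diverse 3D tasks into a structured, executable grammar. Given an RGB point cloud and a natural language prompt, the model generates a coherent token sequence containing part-level bounding boxes, semantic descriptions, and edit commands. This structured output lets downstream geometry-aware modules perform part-based generation and editing. The model is pre-trained to separate structure from semantics and instruction-tuned on a large-scale, part-centric dataset. Experiments show that Part-X-MLLM performs strongly on grounded Q&A, compositional generation, and localized editing, achieving state-of-the-art results through a unified interface.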
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Authors: Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
First: 2025-11-17T17:58:18+00:00 · Latest: 2025-11-17T17:58:18+00:00
Abstract
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
中文标题/摘要
标题:Live-SWE-agent:软件工程代理能否在运行时自我演化?
大型语言模型(LLMs)正在重塑几乎所有行业,包括软件工程。近年来,已经提出了许多LLM代理来解决实际的软件问题。这些软件代理通常配备了一套编程工具,并能自主决定下一步行动,以形成完整的解决端到端软件任务的轨迹。虽然前景广阔,但它们通常需要专门设计,且可能仍不理想,因为彻底探索整个代理架构设计空间极其困难且成本高昂。认识到软件代理本质上也是软件,可以进一步细化/修改,研究人员最近提出了许多自我改进的软件代理,包括达尔文-哥德尔机(DGM)。同时,这些自我改进的代理需要在特定基准上进行昂贵的离线训练,可能在不同LLM或基准之间泛化能力不强。在本文中,我们提出了Live-SWE-agent,这是第一个在解决实际软件问题时能够自主且连续在运行时自我演化的软件代理。具体而言,Live-SWE-agent 从最基本的代理架构开始,仅具有bash工具(例如mini-SWE-agent)的访问权限,并在解决实际软件问题时自主演化其自身的架构实现。我们在广泛研究的SWE-bench Verified基准上的评估显示,Live-SWE-agent 在无需测试时缩放的情况下,实现了令人印象深刻的75.4%的解决率,超越了所有现有的开源软件代理,并接近最佳专有解决方案的性能。此外,Live-SWE-agent 在最近的SWE-Bench Pro基准上超越了最先进的手工构建的软件代理,实现了最佳已知的45.8%的解决率。
Summary / 总结
Live-SWE-agent is a self-evolving software agent that can autonomously improve itself during runtime to solve real-world software engineering problems. Starting with a basic agent scaffold, it evolves its own implementation while solving tasks. On the SWE-bench Verified benchmark, Live-SWE-agent achieved a solve rate of 75.4% without test-time scaling, outperforming existing open-source agents and approaching the performance of the best proprietary solution. It also outperformed state-of-the-art manually crafted agents on the SWE-Bench Pro benchmark with a solve rate of 45.8%.
Live-SWE-agent 是一种可以在运行时自主进化以解决实际软件工程问题的软件代理。它从一个基本的代理框架开始,边解决任务边进化自己的实现。在 SWE-bench Verified 基准上,Live-SWE-agent 达到了 75.4% 的解决率,超过了现有的开源代理,并接近最佳专有解决方案的表现。此外,它还在 SWE-Bench Pro 基准上超越了最先进的手工构建的代理,解决了 45.8% 的任务。
FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs
Authors: Aleksandar Stanković
First: 2025-11-17T17:57:18+00:00 · Latest: 2025-11-17T17:57:18+00:00
Comments: 15 pages. Code and reproducibility scripts: https://github.com/SV25-22/FuseSampleAgg
Abstract
We present FuseSampleAgg, a CUDA operator that fuses neighbor sampling and mean aggregation into a single pass for one and two hop GraphSAGE. By eliminating block materialization and extra kernel launches, FuseSampleAgg reduces memory traffic and overhead while preserving GraphSAGE mean semantics via saved index replay. Across the Reddit, ogbn-arxiv, and ogbn-products benchmarks (batch size 1024, automatic mixed precision enabled), we observe step time speedups up to 51x on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv at larger fanouts, with peak GPU memory reductions up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic, integrates with standard PyTorch optimizers, and ships with scripts that reproduce all tables and figures from CSV logs. Code and scripts are available at https://github.com/SV25-22/FuseSampleAgg.
中文标题/摘要
标题:FuseSampleAgg:融合邻居采样和聚合的CUDA操作符用于Mini-batch GNNs
我们提出了FuseSampleAgg,这是一种CUDA操作符,将GraphSAGE的一跳和二跳邻居采样和均值聚合融合为单次通过。通过消除块材料化和额外内核启动,FuseSampleAgg减少了内存流量和开销,同时通过保存索引重放保留了GraphSAGE的均值语义。在Reddit、ogbn-arxiv和ogbn-products基准测试中(批量大小1024,启用自动混合精度),我们观察到在ogbn-products上的步长时间加速高达51倍,在Reddit上的加速约为4倍(fanouts 10-10和15-10),在ogbn-arxiv上的加速约为3.3倍(较大fanouts),峰值GPU内存减少分别高达100倍、36倍和约3.5倍。该操作符是确定性的,与标准PyTorch优化器兼容,并附带可重现所有表格和图表的脚本。代码和脚本可在https://github.com/SV25-22/FuseSampleAgg获取。
Summary / 总结
The research introduces FuseSampleAgg, a CUDA operator that combines neighbor sampling and mean aggregation for GraphSAGE in one or two hops, reducing memory traffic and overhead while maintaining GraphSAGE semantics. It achieves up to 51x step time speedup on ogbn-products, about 4x on Reddit with fanouts 10-10 and 15-10, and about 3.3x on ogbn-arxiv with larger fanouts, along with peak GPU memory reductions of up to 100x, 36x, and about 3.5x, respectively. The operator is deterministic and compatible with PyTorch optimizers, with scripts available for reproducing results.
论文介绍了FuseSampleAgg,这是一种CUDA操作符,将GraphSAGE的一跳或多跳中的邻居采样和均值聚合合并为一步,减少了内存流量和开销,同时保持了GraphSAGE的均值语义。在ogbn-products上实现了高达51倍的步时间加速,在Reddit上实现了约4倍的加速(10-10和15-10的fanouts),在ogbn-arxiv上实现了约3.3倍的加速(更大的fanouts)。它还将峰值GPU内存使用量分别减少至100倍、36倍和约3.5倍。该操作符是确定性的,并且与PyTorch优化器兼容,提供了用于重现结果的脚本。
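The fused sample-and-aggregate idea in this entry can be illustrated on the CPU. The following is a minimal sketch, not the CUDA operator from the paper: the function name `fused_sample_mean`, the adjacency-dictionary layout, and the list-based features are all assumptions made for illustration. It samples neighbors and accumulates their mean in one loop, without building an intermediate sampled block, and returns the sampled indices so that a backward pass could replay them.

```python
import random


def fused_sample_mean(adj, feats, seeds, fanout, rng):
    """One-hop neighbor sampling and mean aggregation in a single pass.

    adj:    dict mapping node id -> list of neighbor ids (hypothetical layout)
    feats:  list of feature vectors, indexed by node id
    seeds:  node ids in the mini-batch
    fanout: maximum number of neighbors sampled per seed
    Returns (aggregated features, saved neighbor indices).
    """
    dim = len(feats[0])
    out, saved = [], []
    for v in seeds:
        nbrs = adj[v]
        # Sample without replacement only when the degree exceeds the fanout.
        picked = list(nbrs) if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
        acc = [0.0] * dim
        for u in picked:  # accumulate directly, no sampled block is stored
            for k in range(dim):
                acc[k] += feats[u][k]
        n = max(len(picked), 1)  # isolated nodes aggregate to zeros
        out.append([a / n for a in acc])
        saved.append(picked)  # kept so gradients can replay the same sample
    return out, saved


adj = {0: [1, 2], 1: [0], 2: []}
feats = [[1.0, 1.0], [2.0, 4.0], [4.0, 0.0]]
agg, idx = fused_sample_mean(adj, feats, [0, 1, 2], fanout=5, rng=random.Random(0))
```

The saved index lists stand in for the "saved index replay" the abstract mentions; how the real operator stores and replays them is not described in this digest.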
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
Authors: Shrenik Patel, Daivik Patel
First: 2025-11-17T17:56:14+00:00 · Latest: 2025-11-17T17:56:14+00:00
Abstract
Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines while processing up to 87% fewer tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
中文标题/摘要
标题:CacheFlow:压缩流式内存以实现高效长视频理解
长视频问答(VQA)使当前的视觉-语言模型(VLMs)不堪重负,因为注意力和键值(KV)缓存会随着运行时间增长,迫使它们要么进行昂贵的推理,要么使用近视的滑动窗口。我们引入了CacheFlow,这是一种无需训练的流水线,它将动态令牌删除(DTD)与压缩的长期记忆相结合。DTD通过余弦相似度在线删除每块的令牌,存活的令牌被压缩到固定大小的块中。这种基于每帧的在线处理使我们的方法从根本上适合于实时流式VQA。随着块的处理,每个块的键将由小型循环编码器总结形成检索索引,而块的完整KV对则被卸载并在稍后重新激活以进行生成,从而保持答案的准确性。在推理时,基于共识的检索机制仅检索最相关的Top-K块,并在检索到的上下文和局部上下文之间进行注意力处理,以实现精确的长距离推理。CacheFlow是即插即用的,架构无关的,并且不需要微调。在离线和流式VQA基准测试中,CacheFlow不仅优于当前的强基线,而且处理的令牌量最多可减少87%。我们的双管齐下方法使VLMs既高效又具有上下文意识,为实用的长视频理解铺平了道路。
Summary / 总结
CacheFlow addresses the challenge of long-form video question answering by introducing a training-free pipeline that combines Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes tokens online based on cosine similarity, and surviving tokens are packed into fixed-size blocks. Each block's keys are summarized by a tiny recurrent encoder, and the full KV pairs are offloaded for later rehydration, ensuring answer fidelity. Experiments show that CacheFlow outperforms strong baselines while processing up to 87% fewer tokens, making VLMs more efficient and context-aware for long-form video understanding.
CacheFlow通过结合动态令牌丢弃(DTD)与压缩型长期记忆,提出了一种无需训练的管道,以解决长视频问答的问题。DTD基于余弦相似性在线修剪令牌,并将存活的令牌打包成固定大小的块。每个块的键由一个小型循环编码器进行总结,而完整的KV对则被卸载并在后续重新激活,以保持答案的准确性。实验表明,CacheFlow在处理多达87%更少的令牌的同时,优于强大的基线模型,使视觉语言模型更加高效且具有上下文意识,为实用的长视频理解铺平了道路。
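The token-dropping step described in this entry can be sketched as follows. This is a minimal illustration, assuming a fixed similarity threshold of 0.9 and plain Python lists for patch tokens; the names `drop_redundant_tokens` and `pack_blocks`, and the threshold value, are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def drop_redundant_tokens(prev_frame, cur_frame, threshold=0.9):
    """Return indices of patches whose token differs enough from the
    same patch in the previous frame (similarity below the threshold)."""
    return [
        i
        for i, (p, c) in enumerate(zip(prev_frame, cur_frame))
        if cosine(p, c) < threshold
    ]


def pack_blocks(tokens, block_size):
    """Group surviving tokens into blocks of block_size.
    The final block may be shorter in this simplified sketch."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.1]]
kept = drop_redundant_tokens(prev, cur)      # only patch 1 changed
blocks = pack_blocks([cur[i] for i in kept], block_size=2)
```

Patches 0 and 2 are nearly identical to their predecessors and are dropped; patch 1 is orthogonal to its predecessor and survives. The retrieval index and KV offloading stages are not modeled here.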
Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures
Authors: Haohui Wang, Jingyuan Qi, Jianpeng Chen, Jun Wu, Lifu Huang, Lecheng Zheng, Kevin Choi, Balaji Veeramani, Edward Bowen, Alison Hu, Tyler Cody, Dawei Zhou
First: 2025-11-17T17:53:12+00:00 · Latest: 2025-11-17T17:53:12+00:00
Abstract
The rapid progress of large language models (LLMs) is fueled by the growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies, particularly underrepresenting long-tail knowledge due to truncation effects from data generation mechanisms like top-p sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real-synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior across learning head and tail knowledge. We further derive an LLM generalization bound designed for real and synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective yet efficient data valuation method that scales to large-scale datasets. Comprehensive experiments across four tasks, including image classification, sentiment classification, instruction following, and complex reasoning, demonstrate that our method surpasses state-of-the-art baselines in data valuation with significantly low computational cost.
中文标题/摘要
标题:数据价值在扩展时代的理解:混合真实-合成数据下的大规模语言模型扩展动力学
大规模语言模型(LLMs)的迅速进步得益于对融合真实和合成数据的数据集的日益依赖。虽然合成数据提供了可扩展性和成本效益,但它通常引入了系统性的分布差异,特别是在数据生成机制如top-p采样、温度缩放和有限采样导致的尾部知识欠代表方面。这些差异对混合真实-合成数据集的特性和评估构成了根本性的挑战。在本文中,我们识别出一种三阶段的扩展行为,由两个转折点反映模型在学习头部和尾部知识时的行为转变。我们进一步推导出一个适用于真实和合成混合数据的LLM泛化界,揭示了其泛化性能的关键因素。基于我们的理论发现,我们提出了一种有效且高效的数据估值方法,可扩展到大规模数据集。在包括图像分类、情感分类、指令跟随和复杂推理在内的四项任务中,全面的实验表明,我们的方法在数据估值方面超越了最先进的基线,且具有显著较低的计算成本。
Summary / 总结
This paper investigates the impact of real and synthetic data mixtures on large language models (LLMs) and identifies a three-phase scaling behavior with two breakpoints. The authors derive a generalization bound for mixed datasets and propose an efficient data valuation method. Experiments across four tasks show that their method outperforms existing baselines with low computational cost.
本文研究了真实数据和合成数据混合对大型语言模型(LLMs)的影响,并识别出三阶段扩展行为,包含两个转折点。作者为混合数据集推导出一个泛化界限,并提出了一种有效且高效的数据估值方法。在各种任务上的实验结果表明,他们的方法在较低的计算成本下优于现有方法。
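The truncation effect that the abstract attributes to top-p sampling can be seen in a toy example. The sketch below implements standard nucleus (top-p) truncation over a hypothetical four-token distribution; the function name `top_p_support` and the probabilities are illustrative assumptions, not data from the paper.

```python
def top_p_support(probs, p):
    """Return the tokens kept by nucleus (top-p) truncation: the smallest
    set of highest-probability tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept


# Hypothetical next-token distribution with a long tail.
probs = {"a": 0.6, "b": 0.3, "c": 0.08, "d": 0.02}
support = top_p_support(probs, 0.85)
tail = [t for t in probs if t not in support]
```

With `p = 0.85`, only the head tokens `a` and `b` can ever be generated, so the tail tokens `c` and `d` never appear in synthetic data produced this way, however many samples are drawn.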
Graph Neural Network-Based Reinforcement Learning for Controlling Biological Networks - the GATTACA Framework
Authors: Andrzej Mizera, Jakub Zarzycki
First: 2025-05-05T15:07:20+00:00 · Latest: 2025-11-17T17:35:01+00:00
Abstract
Cellular reprogramming, the artificial transformation of one cell type into another, has been attracting increasing research attention due to its therapeutic potential for complex diseases. However, identifying effective reprogramming strategies through classical wet-lab experiments is hindered by lengthy time commitments and high costs.
In this study, we explore the use of deep reinforcement learning (DRL) to control Boolean network models of complex biological systems, such as gene regulatory and signalling pathway networks. We formulate a novel control problem for Boolean network models under the asynchronous update mode, specifically in the context of cellular reprogramming. To solve it, we devise GATTACA, a scalable computational framework.
To facilitate scalability of our framework, we build on the previously introduced concept of a pseudo-attractor and improve the procedure for effective identification of pseudo-attractor states. We then incorporate graph neural networks with graph convolution operations into the artificial neural network approximator of the DRL agent's action-value function. This allows us to leverage the available knowledge on the structure of a biological system and to indirectly, yet effectively, encode the system's modelled dynamics into a latent representation.
Experiments on several large-scale, real-world biological networks from the literature demonstrate the scalability and effectiveness of our approach.
中文标题/摘要
标题:基于图神经网络的强化学习在控制生物网络中的应用——GATTACA框架
细胞重编程,即通过人工手段将一种细胞类型转化为另一种,由于其在治疗复杂疾病方面的潜力,正逐渐受到研究关注。然而,通过传统的湿实验方法识别有效的重编程策略受到时间长和成本高的限制。
在这项研究中,我们探索了使用深度强化学习(DRL)来控制复杂生物系统的布尔网络模型,如基因调控和信号通路网络。我们针对异步更新模式下布尔网络模型提出了一个新颖的控制问题,特别是在细胞重编程的背景下。为了解决这个问题,我们设计了GATTACA,一个可扩展的计算框架。
为了使我们的框架更具可扩展性,我们考虑了之前引入的伪吸引子概念,并改进了伪吸引子状态的有效识别程序。然后,我们将图神经网络与图卷积操作结合到DRL代理的动作-价值函数的人工神经网络逼近器中。这使我们能够利用生物系统结构的知识,并间接但有效地将系统建模的动力学编码到潜在表示中。
实验表明,我们的方法在多个大型真实生物网络上具有可扩展性和有效性。
Summary / 总结
This study aims to use deep reinforcement learning to identify effective cellular reprogramming strategies by controlling Boolean network models of biological systems. The authors propose GATTACA, a scalable framework that incorporates graph neural networks to encode biological system dynamics into a latent representation. Experiments on real-world biological networks show the scalability and effectiveness of this approach.
该研究旨在利用深度强化学习(DRL)通过控制布尔网络模型来识别有效的细胞重编程策略。作者提出了一种可扩展的GATTACA框架,该框架结合了图神经网络,以将系统动力学编码到潜在表示中。在大规模生物网络上的实验表明,该方法在控制复杂生物系统方面的可扩展性和有效性。
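The asynchronous update mode mentioned in this entry can be illustrated with a toy Boolean network. This is a minimal sketch of the dynamics only, under the common convention that one randomly chosen node is updated per step; the two-gene toggle switch and the name `async_step` are assumptions for illustration and do not come from the GATTACA framework.

```python
import random


def async_step(state, rules, rng):
    """Asynchronous update: pick one node at random and recompute it from
    the current state; all other nodes keep their values."""
    node = rng.choice(sorted(rules))
    new_state = dict(state)
    new_state[node] = rules[node](state)
    return new_state


# Toy toggle switch: each gene is active only if the other is inactive.
rules = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
}

rng = random.Random(0)
state = {"A": True, "B": True}
for _ in range(10):
    state = async_step(state, rules, rng)
```

Starting from the unstable state where both genes are on, the first update switches one gene off and the network settles in one of its two fixed points (A on, B off, or the reverse). Which one is reached depends on the random update order, which is what distinguishes asynchronous from synchronous dynamics.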
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Authors: Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong, Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen
Venue: AAAI poster
First: 2025-11-17T17:34:05+00:00 · Latest: 2025-11-17T17:34:05+00:00
Comments: 13 pages, 3 figures. The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026). Paper has been accepted for a poster presentation
Abstract
Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
中文标题/摘要
标题:CreBench:从创意到过程再到产品的符合人类判断的创造力评估基准
人类定义的创造力非常抽象,给多模态大型语言模型(MLLMs)理解和评估与人类判断一致的创造力带来了挑战。缺乏现有的基准进一步加剧了这一困境。为此,我们提出了CreBench,它包括两个关键组成部分:1)涵盖从创意到过程再到产品的多个维度的评估基准;2)CreMIT(创造力多模态指令调优数据集),一个包含2200个多元来源的多模态数据、79200条人类反馈和470万条多类型指令的多模态创造力评估数据集。具体来说,为了确保MLLMs能够处理各种与创造力相关的查询,我们提示GPT对这些人类反馈进行润色,以激活更强的创造力评估能力。CreBench 为构建理解符合人类判断的创造力的MLLMs奠定了基础。基于CreBench,我们对开源通用MLLMs进行微调,产生了CreExpert,一个多模态创造力评估专家模型。广泛的实验表明,提出的CreExpert模型在与人类创造力评估的一致性方面显著优于最先进的MLLMs,包括最先进的GPT-4V和Gemini-Pro-Vision。
Summary / 总结
CreBench is a benchmark for evaluating human-aligned creativity, consisting of an evaluation benchmark covering creative idea, process, and product dimensions, and a multimodal creativity evaluation dataset, CreMIT, with 2.2K diverse multimodal data, 79.2K human feedbacks, and 4.7M instructions. By fine-tuning open-source general MLLMs, CreBench leads to the creation of CreExpert, a multimodal creativity evaluation expert model. Experiments show that CreExpert outperforms state-of-the-art MLLMs, including GPT-4V and Gemini-Pro-Vision, in aligning with human creativity evaluation.
CreBench 是一个评估人类对齐创造力的基准,涵盖了创意想法、过程和产品。它包括一个包含2.2K多样数据、79.2K人类反馈和4.7M指令的多模态数据集 CreMIT。通过改进人类反馈,作者增强了创造力评估能力。使用 CreBench,他们微调了开源 MLLMs,创建了 CreExpert 模型,该模型在与人类创造力评估的对齐方面显著优于最先进的 MLLMs,如 GPT-4V 和 Gemini-Pro-Vision。
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Authors: Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, Chengchun Shi
First: 2025-04-03T16:16:35+00:00 · Latest: 2025-11-17T17:33:15+00:00
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecifications. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
中文标题/摘要
标题:大规模语言模型微调中鲁棒的人类反馈强化学习
人类反馈强化学习(RLHF)已成为使大规模语言模型(LLMs)与人类偏好保持一致的关键技术。为了学习奖励函数,大多数现有RLHF算法使用布雷德利-特里模型,该模型依赖于关于人类偏好的假设,这些假设可能无法反映现实世界判断的复杂性和变化性。在本文中,我们提出了一种鲁棒算法,以在这样的奖励模型误设情况下增强现有方法的性能。理论上,我们的算法减少了奖励和策略估计量的方差,从而提高了遗憾界。在大规模语言模型基准数据集上的实证评估表明,所提出的算法在Anthropic有益和无害数据集上始终优于现有方法,有77-81%的响应被优先考虑。代码可在https://github.com/VRPO/VRPO/ 获取。
Summary / 总结
This paper addresses the limitations of existing reinforcement learning from human feedback (RLHF) algorithms by proposing a robust algorithm that reduces the variance of reward and policy estimators, leading to improved regret bounds. Theoretical analysis shows that this approach enhances performance under reward model misspecifications. Empirical evaluations on LLM benchmark datasets show that the proposed algorithm outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset.
本文通过提出一种稳健算法来解决现有RLHF算法的局限性,该算法减少了奖励和策略估计器的方差,从而提高了在奖励模型错配情况下的性能。理论分析表明,这种方法在基准数据集上的表现更好。实证评估显示,在Anthropic有益和无害数据集上,该算法有77-81%的响应优于基线方法。
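For context, the Bradley-Terry model that this entry refers to can be written in a few lines. The sketch below shows the standard model and its negative log-likelihood for reward learning; it is not the robust algorithm proposed in the paper, and the function names are illustrative.

```python
import math


def bt_preference_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred:
    sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_loss(reward_pairs):
    """Mean negative log-likelihood over (chosen, rejected) reward pairs,
    the usual objective for fitting a reward model under this assumption."""
    total = sum(-math.log(bt_preference_prob(rc, rr)) for rc, rr in reward_pairs)
    return total / len(reward_pairs)


p_equal = bt_preference_prob(1.0, 1.0)   # equal rewards: no preference
p_better = bt_preference_prob(2.0, 0.0)  # higher reward: preferred more often
loss = bt_loss([(2.0, 0.0), (0.5, 1.0)])
```

The model assumes preferences depend only on the difference of two scalar rewards. When real annotators do not behave this way, the reward model is misspecified, which is the setting the paper targets.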
Alpha Divergence Losses for Biometric Verification
Authors: Dimitrios Koutsianos, Ladislav Mosner, Yannis Panagakis, Themos Stafylakis
First: 2025-11-17T17:27:28+00:00 · Latest: 2025-11-17T17:27:28+00:00
Abstract
Performance in face and speaker verification is largely driven by margin-based softmax losses like CosFace and ArcFace. Recently introduced $α$-divergence loss functions offer a compelling alternative, particularly for their ability to induce sparse solutions (when $α>1$). However, integrating an angular margin, which is crucial for verification tasks, is not straightforward. We find this integration can be achieved in at least two distinct ways: via the reference measure (prior probabilities) or via the logits (unnormalized log-likelihoods). In this paper, we explore both pathways, deriving two novel margin-based $α$-divergence losses: Q-Margin (margin in the reference measure) and A3M (margin in the logits). We identify a critical training instability in A3M, caused by the interplay of penalized logits and sparsity, and address it with a simple yet effective prototype re-initialization strategy. Our methods achieve significant performance gains on the challenging IJB-B and IJB-C face verification benchmarks. We demonstrate similarly strong performance in speaker verification on VoxCeleb. Crucially, our models significantly outperform strong baselines at low false acceptance rates (FAR). This capability is crucial for practical high-security applications, such as banking authentication, where minimizing false authentications is paramount.
中文标题/摘要
标题:Alpha 发散损失函数在生物特征识别中的应用
面部和语音识别的性能主要由基于边距的 softmax 损失如 CosFace 和 ArcFace 决定。最近引入的 $α$-发散损失函数提供了一种有吸引力的替代方案,特别是在当 $α>1$ 时能够诱导稀疏解的能力。然而,将关键的角边距(对于验证任务至关重要)整合进去并不直接。我们发现这种整合可以通过参考测度(先验概率)或通过逻辑值(未归一化的对数似然)至少以两种不同的方式实现。在本文中,我们探讨了这两种途径,推导出了两种新的基于边距的 $α$-发散损失函数:Q-边距(边距在参考测度中)和 A3M(边距在逻辑值中)。我们识别并解决了 A3M 中由惩罚逻辑值和稀疏性相互作用引起的关键训练不稳定性问题,通过一个简单而有效的原型重新初始化策略。我们的方法在具有挑战性的 IJB-B 和 IJB-C 面部识别基准测试中实现了显著的性能提升。我们在 VoxCeleb 上展示了类似的强性能。最关键的是,我们的模型在低误接受率(FAR)下显著优于强大的基线模型。这种能力对于实际的高安全应用至关重要,例如银行认证,当减少误认证是首要任务时。
Summary / 总结
This paper explores the integration of angular margins into $α$-divergence losses for face and speaker verification, introducing Q-Margin and A3M as novel margin-based $α$-divergence losses. The authors address a training instability in A3M through prototype re-initialization and achieve significant performance gains, especially at low false acceptance rates, on IJB-B, IJB-C, and VoxCeleb benchmarks.
本文探讨了将$α$-散度损失与角度边际结合用于生物特征识别任务,特别是面部和语音识别。提出了两种新的基于边际的$α$-散度损失:Q-Margin和A3M。A3M由于惩罚性对数似然和稀疏性的交互而面临训练不稳定问题,通过简单的原型重新初始化策略得以解决。该方法在具有挑战性的基准测试中取得了显著的性能提升,特别是在低误接受率下,使其适用于高安全性的实际应用,如银行认证。
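As background for the margin-based softmax baselines named in this entry, the sketch below shows how CosFace and ArcFace modify the target-class logit. It covers only these two baselines, not the Q-Margin or A3M losses from the paper; the scale and margin values in the example are illustrative, and the ArcFace version omits the special handling that practical implementations apply when the shifted angle exceeds π.

```python
import math


def cosface_logits(cosines, target, s, m):
    """CosFace: subtract an additive margin m from the target cosine,
    then scale every logit by s."""
    return [s * (c - m) if i == target else s * c for i, c in enumerate(cosines)]


def arcface_logits(cosines, target, s, m):
    """ArcFace: add an angular margin m to the target angle, then scale.
    Simplified: no correction when theta + m goes beyond pi."""
    out = []
    for i, c in enumerate(cosines):
        if i == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for safety
            out.append(s * math.cos(theta + m))
        else:
            out.append(s * c)
    return out


# Cosine similarities between one embedding and two class prototypes.
cosines = [0.5, 0.25]
cos_logits = cosface_logits(cosines, target=0, s=2.0, m=0.25)
arc_logits = arcface_logits(cosines, target=0, s=2.0, m=0.1)
```

In both cases the margin lowers the target logit, so the embedding must be closer to its class prototype to reach the same loss. That is the property the paper seeks to carry over to $α$-divergence losses.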
Tissue Aware Nuclei Detection and Classification Model for Histopathology Images
Authors: Kesi Xu, Eleni Chiou, Ali Varamesh, Laura Acqualagna, Nasir Rajpoot
First: 2025-11-17T17:21:05+00:00 · Latest: 2025-11-17T17:21:05+00:00
Comments: 5 pages, 3 figures. Under review
Abstract
Accurate nuclei detection and classification are fundamental to computational pathology, yet existing approaches are hindered by reliance on detailed expert annotations and insufficient use of tissue context. We present Tissue-Aware Nuclei Detection (TAND), a novel framework achieving joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND couples a ConvNeXt-based encoder-decoder with a frozen Virchow-2 tissue segmentation branch, where semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). On the PUMA benchmark, TAND achieves state-of-the-art performance, surpassing both tissue-agnostic baselines and mask-supervised methods. Notably, our approach demonstrates remarkable improvements in tissue-dependent cell types such as epithelium, endothelium, and stroma. To the best of our knowledge, this is the first method to condition per-cell classification on learned tissue masks, offering a practical pathway to reduce annotation burden.
中文标题/摘要
标题:组织感知的核检测与分类模型用于组织病理学图像
准确的核检测和分类是计算病理学的基础,但现有方法受限于对详细专家注解的依赖和对组织上下文的不足利用。我们提出了一种新的框架——组织感知核检测(TAND),该框架通过点级监督和组织掩码条件增强实现联合核检测和分类。TAND 结合了一个基于 ConvNeXt 的编码器-解码器和一个冻结的 Virchow-2 组织分割分支,其中语义组织概率通过一种新颖的多尺度空间特征线性调制(Spatial-FiLM)选择性地调节分类流。在 PUMA 基准上,TAND 达到了最先进的性能,超越了组织无关的基线和掩码监督方法。值得注意的是,我们的方法在上皮细胞、内皮细胞和间质等依赖组织的细胞类型上表现出显著的改进。据我们所知,这是第一个通过学习组织掩码对单个细胞分类进行条件处理的方法,提供了一条减少注解负担的实用途径。
Summary / 总结
The research aims to improve nuclei detection and classification in histopathology images by incorporating tissue context. Tissue-Aware Nuclei Detection (TAND) uses a ConvNeXt-based encoder-decoder with a tissue segmentation branch to modulate the classification stream. On the PUMA benchmark, TAND outperforms existing methods, especially for tissue-dependent cell types like epithelium, endothelium, and stroma, by conditioning per-cell classification on learned tissue masks. This approach reduces the need for detailed expert annotations.
研究旨在通过引入组织上下文来提高组织病理学图像中的细胞核检测和分类。Tissue-Aware Nuclei Detection (TAND) 使用基于 ConvNeXt 的编码器-解码器和组织分割分支来调节分类流。在 PUMA 基准上,TAND 在如上皮、内皮和间质等组织依赖性细胞类型方面超越了现有方法,通过在学习到的组织掩码上条件化单个细胞分类来减少专家注释的需求。
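The spatial feature-wise modulation described in this entry can be sketched generically. FiLM applies a per-channel scale and shift, `gamma * x + beta`; a spatial variant lets both vary per pixel. In the sketch below they are computed from per-pixel tissue probabilities through two small weight matrices. The linear mapping, the residual form `1 + ...` for the scale, and all names are assumptions for illustration; the digest does not specify how TAND's Spatial-FiLM computes its parameters.

```python
def spatial_film(feat_map, tissue_probs, w_gamma, w_beta):
    """Per-pixel FiLM: out = gamma * feature + beta, with gamma and beta
    derived from that pixel's tissue class probabilities.

    feat_map[y][x]     -> list of C channel values
    tissue_probs[y][x] -> list of T tissue class probabilities
    w_gamma, w_beta    -> C x T weight matrices (hypothetical parameters)
    """
    out = []
    for feat_row, prob_row in zip(feat_map, tissue_probs):
        out_row = []
        for feat, probs in zip(feat_row, prob_row):
            pixel = []
            for c, value in enumerate(feat):
                # Scale starts at 1 so zero weights leave features unchanged.
                gamma = 1.0 + sum(w * p for w, p in zip(w_gamma[c], probs))
                beta = sum(w * p for w, p in zip(w_beta[c], probs))
                pixel.append(gamma * value + beta)
            out_row.append(pixel)
        out.append(out_row)
    return out


feat_map = [[[2.0, 3.0]]]      # 1 x 1 map with two channels
tissue_probs = [[[1.0, 0.0]]]  # pixel confidently in tissue class 0
w_gamma = [[0.5, 0.0], [0.0, 0.0]]
w_beta = [[0.0, 0.0], [1.0, 0.0]]
modulated = spatial_film(feat_map, tissue_probs, w_gamma, w_beta)
```

Here tissue class 0 amplifies channel 0 (scale 1.5) and shifts channel 1 by 1.0, so the same nucleus feature is read differently depending on the surrounding tissue.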
P1: Mastering Physics Olympiads with Reinforcement Learning
Authors: Jiacheng Chen, Qianjia Cheng, Fangchen Yu, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Yun Luo, Yufeng Zhao, Futing Wang, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Wenxauan Zeng, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
First: 2025-11-17T17:18:13+00:00 · Latest: 2025-11-17T17:18:13+00:00
Abstract
Recent progress in large language models (LLMs) has moved the frontier from puzzle-solving to science-grade reasoning, the kind needed to tackle problems whose answers must stand against nature, not merely fit a rubric. Physics is the sharpest test of this shift: it binds symbols to reality in a fundamental way and serves as the cornerstone of most modern technologies. In this work, we advance physics research by developing large language models with exceptional physics reasoning capabilities that excel at solving Olympiad-level physics problems. We introduce P1, a family of open-source physics reasoning models trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model with Gold-medal performance at the latest International Physics Olympiad (IPhO 2025), and wins 12 gold medals out of 13 international/regional physics competitions in 2024/2025. P1-30B-A3B also surpasses almost all other open-source models on IPhO 2025, getting a silver medal. Further equipped with an agentic framework PhysicsMinions, P1-235B-A22B+PhysicsMinions achieves overall No.1 on IPhO 2025, and obtains the highest average score over the 13 physics competitions. Besides physics, P1 models also perform strongly on other reasoning tasks like math and coding, showing the strong generalizability of the P1 series.
中文标题/摘要
标题:P1:利用强化学习掌握物理奥林匹克竞赛
大型语言模型(LLMs)的最新进展将前沿从解谜扩展到了科学级别的推理,这种推理能力是解决那些答案必须经得起自然检验的问题所需要的,而不仅仅是符合某种标准。物理是这种转变最严格的考验,因为它以基本的方式将符号与现实联系起来,成为大多数现代技术的基础。在这项工作中,我们通过开发具有出色物理推理能力的大规模语言模型,推进了物理研究,特别是在解决奥林匹克级别物理问题方面表现出色。我们引入了P1,这是一个完全通过强化学习(RL)训练的开源物理推理模型系列。其中,P1-235B-A22B是首个在最新国际物理奥林匹克竞赛(IPhO 2025)中获得金牌的开源模型,并在2024/2025年的13场国际/区域物理竞赛中赢得了12枚金牌。P1-30B-A3B在IPhO 2025中也超越了几乎所有其他开源模型,获得了一枚银牌。进一步配备了物理小助手PhysicsMinions框架后,P1-235B-A22B+PhysicsMinions在IPhO 2025中整体排名第一,并在13场物理竞赛中获得了最高的平均分。除了物理,P1模型在数学和编程等其他推理任务中也表现出色,展示了P1系列的强大通用性。
Summary / 总结
This research aims to enhance the ability of large language models to perform science-grade reasoning, particularly in physics, by using reinforcement learning. The study introduces P1, a family of open-source physics reasoning models, with P1-235B-A22B achieving Gold-medal performance at the International Physics Olympiad (IPhO 2025) and winning gold medals in 12 of 13 international/regional physics competitions in 2024/2025. Additionally, P1-30B-A3B secured a silver medal on IPhO 2025. The models also excel in other reasoning tasks such as math and coding, demonstrating their broad applicability.
研究旨在通过强化学习提升大型语言模型在科学推理,尤其是物理方面的能力。研究介绍了P1这一系列开源物理推理模型,其中P1-235B-A22B在国际物理奥林匹克竞赛(IPhO 2025)中获得金牌,并在2024/2025年的13项国际/区域物理竞赛中赢得了12枚金牌。此外,P1-30B-A3B在IPhO 2025中获得银牌。这些模型在数学和编程等其他推理任务上也表现出色,展示了P1系列模型的广泛适用性。
AtlasMorph: Learning conditional deformable templates for brain MRI
Authors: Marianne Rakic, Andrew Hoopes, S. Mazdak Abulnaga, Mert R. Sabuncu, John V. Guttag, Adrian V. Dalca
First: 2025-11-17T17:13:58+00:00 · Latest: 2025-11-17T17:13:58+00:00
Abstract
Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other template construction methods.
中文标题/摘要
标题:AtlasMorph:学习条件变形模板以用于脑MRI
变形模板或图谱是代表人群典型解剖结构的图像,并常与概率解剖标签图结合使用。它们在医学图像分析中用于人群研究和计算解剖学任务,如配准和分割。由于开发模板是一个计算密集型过程,因此可用的模板相对较少。因此,分析通常使用次优模板,这些模板不能真正代表研究人群,尤其是在该人群内部存在较大差异时。我们提出了一种机器学习框架,使用卷积配准神经网络高效地学习一个函数,该函数根据个体属性(如年龄和性别)输出模板。我们还利用可用的分割图生成结果模板的解剖分割图。学习到的网络还可以用于将受试者图像配准到模板上。我们在一系列3D脑MRI数据集中演示了该方法,并表明它可以学习高质量且代表性的模板。我们发现带有注释的条件模板比未标记的无条件模板能更好地进行配准,并优于其他模板构建方法。
A Gentle Introduction to Conformal Time Series Forecasting
Authors: M. Stocker, W. Małgorzewicz, M. Fontana, S. Ben Taieb
First: 2025-11-17T17:12:51+00:00 · Latest: 2025-11-17T17:12:51+00:00
Abstract
Conformal prediction is a powerful post-hoc framework for uncertainty quantification that provides distribution-free coverage guarantees. However, these guarantees crucially rely on the assumption of exchangeability. This assumption is fundamentally violated in time series data, where temporal dependence and distributional shifts are pervasive. As a result, classical split-conformal methods may yield prediction intervals that fail to maintain nominal validity. This review unifies recent advances in conformal forecasting methods specifically designed to address nonexchangeable data. We first present a theoretical foundation, deriving finite-sample guarantees for split-conformal prediction under mild weak-dependence conditions. We then survey and classify state-of-the-art approaches that mitigate serial dependence by reweighting calibration data, dynamically updating residual distributions, or adaptively tuning target coverage levels in real time. Finally, we present a comprehensive simulation study that compares these techniques in terms of empirical coverage, interval width, and computational cost, highlighting practical trade-offs and open research directions.
中文标题/摘要
标题:温和介绍符合性时间序列预测
符合性预测是一种强大的事后框架,用于不确定性量化,提供无分布的覆盖保证。然而,这些保证的关键前提是可交换性的假设。在时间序列数据中,这种假设由于时间依赖性和分布变化的普遍存在而被根本违反。因此,经典的分割符合性方法可能会导致预测区间无法保持名义有效性。本文综述了最近为解决非可交换数据而设计的符合性预测方法的最新进展。我们首先提供理论基础,在温和的弱依赖条件下推导出分割符合性预测的有限样本保证。然后,我们概述并分类了通过重新加权校准数据、动态更新残差分布或实时自适应调整目标覆盖水平来减轻序列依赖性的最新方法。最后,我们进行了一项全面的模拟研究,比较了这些技术在经验覆盖、区间宽度和计算成本方面的表现,突出了实际权衡和开放的研究方向。
Summary / 总结
The paper addresses the challenge of applying conformal prediction to time series data, where classical methods may fail due to temporal dependence and distribution shifts. It provides a theoretical foundation for split-conformal prediction under weak-dependence conditions and surveys state-of-the-art techniques that mitigate serial dependence. The study includes a comprehensive simulation comparing these methods in terms of empirical coverage, interval width, and computational cost, identifying practical trade-offs and open research areas.
本文解决了将形似预测应用于时间序列数据时遇到的挑战,因为经典方法在这种情况下可能失效,由于存在时间依赖性和分布变化。它为在弱依赖条件下进行拆分形似预测提供了理论基础,并概述了最近用于缓解序列依赖性的技术。研究比较了这些方法在实际覆盖率、区间宽度和计算成本方面的表现,提供了实用权衡和未来研究方向的见解。
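The classical split-conformal construction that this review starts from, and one adaptive adjustment of the kind it surveys, can be sketched briefly. The first function uses the standard finite-sample quantile of absolute calibration residuals. The second applies the adaptive conformal inference update rule, shown here as one known instance of "adaptively tuning target coverage levels"; whether the review uses this exact rule is an assumption, and the function names and step size are illustrative.

```python
import math


def split_conformal_halfwidth(residuals, alpha):
    """Half-width q such that [pred - q, pred + q] has coverage at least
    1 - alpha under exchangeability: the ceil((n + 1)(1 - alpha))-th
    smallest absolute calibration residual."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(residuals)[k - 1]


def aci_update(alpha_t, alpha_target, missed, gamma):
    """Adaptive update of the working miscoverage level: lower it (wider
    intervals) after a miss, raise it slightly after a hit."""
    err = 1.0 if missed else 0.0
    return alpha_t + gamma * (alpha_target - err)


calibration = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0, 5.5]  # |y - prediction|
q = split_conformal_halfwidth(calibration, alpha=0.5)
interval = (10.0 - q, 10.0 + q)  # interval around a new point forecast of 10.0
next_alpha = aci_update(0.1, 0.1, missed=True, gamma=0.05)
```

The first function's guarantee relies on exchangeability between calibration and test points, which temporal dependence breaks. The update rule is one way to restore long-run coverage by reacting to observed misses.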
Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning
Authors: Avik Kar, Rahul Singh
First: 2024-05-29T06:18:09+00:00 · Latest: 2025-11-17T17:09:46+00:00
Comments: 38 pages, 3 figures
Abstract
We study the infinite-horizon average-reward reinforcement learning (RL) for continuous space Lipschitz MDPs in which an agent can play policies from a given set $Φ$. The proposed algorithms efficiently explore the policy space by "zooming" into the "promising regions" of $Φ$, thereby achieving adaptivity gains in the performance. We upper bound their regret as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = d^Φ_z+2$ for model-free algorithm $\textit{PZRL-MF}$ and $d_{\text{eff.}} = 2d_\mathcal{S} + d^Φ_z + 3$ for model-based algorithm $\textit{PZRL-MB}$. Here, $d_\mathcal{S}$ is the dimension of the state space, and $d^Φ_z$ is the zooming dimension given a set of policies $Φ$. $d^Φ_z$ is an alternative measure of the complexity of the problem, and it depends on the underlying MDP as well as on $Φ$. Hence, the proposed algorithms exhibit low regret in case the problem instance is benign and/or the agent competes against a low-complexity $Φ$ (that has a small $d^Φ_z$). When specialized to the case of finite-dimensional policy space, we obtain that $d_{\text{eff.}}$ scales as the dimension of this space under mild technical conditions; and also obtain $d_{\text{eff.}} = 2$, or equivalently $\tilde{\mathcal{O}}(\sqrt{T})$ regret for $\textit{PZRL-MF}$, under a curvature condition on the average reward function that is commonly used in the multi-armed bandit (MAB) literature.
中文标题/摘要
标题:策略缩放:基于自适应离散化的无限时域平均回报强化学习
我们研究了连续空间Lipschitz MDP中的无限时域平均回报强化学习(RL),其中智能体可以从给定集合$Φ$中选择策略。所提出的算法通过“缩放”到$Φ$的“有希望区域”来高效地探索策略空间,从而实现性能上的自适应增益。我们将其遗憾上界表示为$\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$,其中对于无模型算法$\textit{PZRL-MF}$,$d_{\text{eff.}} = d^Φ_z+2$;对于基于模型算法$\textit{PZRL-MB}$,$d_{\text{eff.}} = 2d_\mathcal{S} + d^Φ_z + 3$。这里,$d_\mathcal{S}$是状态空间的维数,$d^Φ_z$是给定策略集合$Φ$的缩放维数。$d^Φ_z$是问题复杂性的另一种度量,它取决于基础MDP以及$Φ$。因此,所提出的算法在问题实例简单且/或智能体与低复杂度$Φ$竞争时表现出低遗憾。当专门应用于有限维策略空间时,在轻微的技术条件下,$d_{\text{eff.}}$与该空间的维数成比例;并且在平均回报函数满足多臂老虎机(MAB)文献中常用的曲率条件下,$d_{\text{eff.}} = 2$,或等价地,$\tilde{\mathcal{O}}(\sqrt{T})$遗憾对于$\textit{PZRL-MF}$。
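To make the regret exponent concrete, the bound $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$ can be evaluated for a few cases. The first line restates the $d_{\text{eff.}} = 2$ case from the abstract; the other two use illustrative values $d^Φ_z = 1$ and $d_\mathcal{S} = 1$, which are assumptions chosen only to show the arithmetic.

```latex
\begin{align*}
  % Case stated in the abstract: d_eff = 2
  d_{\text{eff.}} = 2
    &\;\Rightarrow\; T^{1 - 1/2} = T^{1/2} = \sqrt{T}, \\
  % Illustrative: PZRL-MF with zooming dimension d_z^\Phi = 1
  d_{\text{eff.}} = d^{\Phi}_z + 2 = 3
    &\;\Rightarrow\; T^{1 - 1/3} = T^{2/3}, \\
  % Illustrative: PZRL-MB with d_S = 1 and d_z^\Phi = 1
  d_{\text{eff.}} = 2 d_{\mathcal{S}} + d^{\Phi}_z + 3 = 6
    &\;\Rightarrow\; T^{1 - 1/6} = T^{5/6}.
\end{align*}
```

A smaller effective dimension gives a smaller exponent, so the bound for the model-free algorithm is tighter than the one for the model-based algorithm at the same zooming dimension.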
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
Authors: Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha
First: 2025-11-14T11:21:48+00:00 · Latest: 2025-11-17T17:08:31+00:00
Abstract
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, which degrades performance on fine-grained reasoning tasks and limits their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
Summary / 总结
Viper-F1 is designed to improve the efficiency and accuracy of multimodal understanding in resource-constrained scenarios. It uses Liquid State-Space Dynamics instead of Transformer-based cross-attention to reduce computational cost. Additionally, it introduces a Token-Grid Correlation Module to enhance visual grounding by computing lightweight correlations and modulating state-space dynamics. Experimental results show that Viper-F1 achieves fine-grained understanding with significantly improved efficiency compared to existing methods.
Viper-F1 是一种混合状态空间的视觉语言模型,使用高效的液态状态空间动力学代替基于 Transformer 的交叉注意力来降低计算成本。它还包含一个 Token-Grid 相关模块以增强视觉定位。实验结果表明,Viper-F1 提供了准确的、细粒度的理解,并且效率更高,优于现有方法。
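The idea of correlating text tokens with image patches and then applying FiLM conditioning can be illustrated with a minimal sketch. This is not the Viper-F1 implementation: the function names, the mean dot-product score, and the hand-set `gamma`/`beta` are all assumptions made for illustration (in a real FiLM layer, `gamma` and `beta` would be produced by a learned network).

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_grid_correlation(text_tokens, image_patches):
    """Return one relevance weight per patch (weights sum to 1).

    Hypothetical scoring rule: mean dot-product similarity between
    a patch embedding and every text-token embedding.
    """
    scores = [
        sum(dot(t, p) for t in text_tokens) / len(text_tokens)
        for p in image_patches
    ]
    return softmax(scores)


def film(features, gamma, beta):
    """FiLM: feature-wise affine modulation, gamma * x + beta."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def modulate_patches(text_tokens, image_patches):
    """Scale each patch by its text relevance using FiLM with a zero shift."""
    weights = token_grid_correlation(text_tokens, image_patches)
    n = len(image_patches)
    modulated = []
    for w, patch in zip(weights, image_patches):
        # Assumed choice: n * w equals 1 for a uniform weight, so relevant
        # patches are amplified and irrelevant ones are attenuated.
        gamma = [n * w] * len(patch)
        beta = [0.0] * len(patch)
        modulated.append(film(patch, gamma, beta))
    return weights, modulated


if __name__ == "__main__":
    text = [[1.0, 0.0]]
    patches = [[1.0, 0.0], [0.0, 1.0]]
    weights, modulated = modulate_patches(text, patches)
    print(weights)
    print(modulated)
```

For a fixed prompt length, the correlation step touches each patch once, so its cost grows linearly with the number of patches.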
HALO: Hardware-aware quantization with low critical-path-delay weights for LLM acceleration
Authors: Rohan Juneja, Shivam Aggarwal, Safeen Huda, Tulika Mitra, Li-Shiuan Peh
First: 2025-02-27T01:08:33+00:00 · Latest: 2025-11-17T16:57:58+00:00
Abstract
Quantization is critical for efficiently deploying large language models (LLMs). Yet conventional methods remain hardware-agnostic, limited to bit-width constraints, and do not account for intrinsic circuit characteristics such as the timing behaviors and energy profiles of Multiply-Accumulate (MAC) units. This disconnect from circuit-level behavior limits the ability to exploit available timing margins and energy-saving opportunities, reducing the overall efficiency of deployment on modern accelerators.
To address these limitations, we propose HALO, a versatile framework for Hardware-Aware Post-Training Quantization (PTQ). Unlike traditional methods, HALO explicitly incorporates detailed hardware characteristics, including critical-path timing and power consumption, into its quantization approach. HALO strategically selects weights with low critical-path delays, enabling higher operational frequencies and dynamic frequency scaling without disrupting the architecture's dataflow. Remarkably, HALO achieves these improvements with only a few dynamic voltage and frequency scaling (DVFS) adjustments, ensuring simplicity and practicality in deployment. Additionally, by reducing switching activity within the MAC units, HALO effectively lowers energy consumption. Evaluations on accelerators such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) demonstrate that HALO significantly enhances inference efficiency, achieving average performance improvements of 270% and energy savings of 51% over baseline quantization methods, all with minimal impact on accuracy.
中文标题/摘要
标题:HALO:硬件感知量化,具有低关键路径延迟权重的LLM加速
量化对于高效部署大型语言模型(LLMs)至关重要。然而,传统的量化方法仍然保持硬件无关性,仅限于位宽约束,并未考虑到乘加(MAC)单元的固有电路特性,如时序行为和能量剖面。这种与电路级行为的脱节限制了利用可用的时间裕度和节能机会的能力,从而降低了在现代加速器上部署的整体效率。
为了解决这些限制,我们提出了HALO,一种硬件感知后训练量化(PTQ)的通用框架。与传统方法不同,HALO 明确地将详细的硬件特性,包括关键路径时序和功耗,纳入其量化方法中。HALO 通过选择具有低关键路径延迟的权重,实现更高的操作频率和动态频率缩放,而不破坏架构的数据流。令人惊讶的是,HALO 仅通过几次动态电压和频率缩放(DVFS)调整,就能实现这些改进,确保部署的简单性和实用性。此外,通过减少MAC单元内的切换活动,HALO 有效降低了能耗。在诸如张量处理单元(TPUs)和图形处理单元(GPUs)等加速器上的评估表明,HALO 显著提高了推理效率,相对于基线量化方法,平均性能提高了270%,能耗降低了51%,且对准确性的影响最小。
Summary / 总结
HALO is a hardware-aware quantization framework that incorporates critical-path timing and power consumption into post-training quantization for large language models. Unlike conventional methods, HALO selects weights with low critical-path-delays to enable higher operational frequencies and dynamic frequency scaling, achieving 270% performance improvements and 51% energy savings on TPUs and GPUs with minimal accuracy loss.
HALO 是一种硬件感知的量化框架,将关键路径时序和功率消耗纳入其方法中,从而实现更高的操作频率和节能效果。与传统方法不同,HALO 选择具有低关键路径延迟的权重,允许动态频率调整而不破坏数据流。在 TPUs 和 GPUs 上的评估表明,与基线方法相比,HALO 将推理效率提高了 270%,能耗降低了 51%,且对准确率的影响很小。
Physics-Informed Neural Networks for Nonlinear Output Regulation
Authors: Sebastiano Mengozzi, Giovanni B. Esposito, Michelangelo Bin, Andrea Acquaviva, Andrea Bartolini, Lorenzo Marconi
First: 2025-11-17T16:55:42+00:00 · Latest: 2025-11-17T16:55:42+00:00
Abstract
This work addresses the full-information output regulation problem for nonlinear systems, assuming the states of both the plant and the exosystem are known. In this setting, perfect tracking or rejection is achieved by constructing a zero-regulation-error manifold π(w) and a feedforward input c(w) that render this manifold invariant. The pair (π(w), c(w)) is characterized by the regulator equations, i.e., a system of PDEs with an algebraic constraint. We focus on accurately solving the regulator equations by introducing a physics-informed neural network (PINN) approach that directly approximates π(w) and c(w) by minimizing the residuals under boundary and feasibility conditions, without requiring precomputed trajectories or labeled data. The learned operator maps exosystem states to steady-state plant states and inputs, enables real-time inference and, critically, generalizes across families of the exosystem with varying initial conditions and parameters. The framework is validated on a regulation task that synchronizes a helicopter's vertical dynamics with a harmonically oscillating platform. The resulting PINN-based solver reconstructs the zero-error manifold with high fidelity and sustains regulation performance under exosystem variations, highlighting the potential of learning-enabled solvers for nonlinear output regulation. The proposed approach is broadly applicable to nonlinear systems that admit a solution to the output regulation problem.
中文标题/摘要
标题:基于物理信息的神经网络在非线性输出调节中的应用
本文解决了非线性系统的全信息输出调节问题,假设已知工厂和外系统状态。在这种情况下,通过构造零调节误差流形π(w)和前馈输入c(w),实现了完美的跟踪或拒绝。这对(π(w), c(w))由调节方程定义,即一组带有代数约束的偏微分方程。我们通过引入基于物理信息的神经网络(PINN)方法,直接近似π(w)和c(w),在边界和可行性条件下最小化残差,无需预先计算轨迹或标记数据。学习到的操作映射外系统状态到稳态工厂状态和输入,实现实时推理,并且关键地,能够跨具有不同初始条件和参数的外系统家族进行泛化。该框架在同步直升机垂直动态与谐振平台的任务中得到验证。基于PINN的求解器以高保真度重建零误差流形,并在外部系统变化下维持调节性能,突显了学习驱动求解器在非线性输出调节中的潜力。所提出的方法广泛适用于允许输出调节问题解的非线性系统。
Summary / 总结
This work aims to solve the full-information output regulation problem for nonlinear systems by constructing a zero-regulation-error manifold and a feedforward input using regulator equations. A physics-informed neural network (PINN) approach is introduced to directly approximate these components by minimizing residuals under boundary and feasibility conditions, without needing precomputed data. The method successfully reconstructs the zero-error manifold with high fidelity and maintains regulation performance under varying exosystem conditions, demonstrating the potential of PINN for nonlinear output regulation in real-time applications.
该研究解决了非线性系统中的全信息输出调节问题,通过构造零调节误差流形和前馈输入来实现。引入了物理感知神经网络(PINN)方法,直接通过在边界和可行性条件下最小化残差来近似这些元素,无需预先计算轨迹或标记数据。该方法能够跨不同初始条件和参数的外系统进行泛化。该方法在直升机同步任务中得到验证,展示了在不同外系统条件下高保真度重建零误差流形和稳健的调节性能。
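For readers unfamiliar with the regulator equations mentioned above, the following block writes them in their standard textbook form, followed by one possible PINN-style residual loss. The plant, exosystem, and error maps $f$, $s$, $h$, the networks $\pi_\theta$, $c_\theta$, and the weight $\lambda$ are notation assumed here for illustration. The abstract does not give the paper's exact formulation, and its boundary and feasibility terms are omitted.

```latex
% Assumed setup (notation not from the abstract):
%   plant      \dot{x} = f(x, u, w)
%   exosystem  \dot{w} = s(w)
%   error      e = h(x, w)
%
% Regulator equations: a PDE system with an algebraic constraint.
\begin{aligned}
\frac{\partial \pi}{\partial w}(w)\, s(w) &= f\big(\pi(w),\, c(w),\, w\big), \\
0 &= h\big(\pi(w),\, w\big).
\end{aligned}

% One possible residual loss over sampled exosystem states w_1, ..., w_N,
% with a hypothetical weight \lambda > 0 on the algebraic constraint:
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \Big[
  \Big\| \frac{\partial \pi_\theta}{\partial w}(w_i)\, s(w_i)
        - f\big(\pi_\theta(w_i),\, c_\theta(w_i),\, w_i\big) \Big\|^2
  + \lambda\, \big\| h\big(\pi_\theta(w_i),\, w_i\big) \big\|^2
\Big]
```

Because the loss is evaluated only at sampled exosystem states, no precomputed trajectories or labels are needed, which matches the data-free training described in the abstract.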
Beyond SELECT: A Comprehensive Taxonomy-Guided Benchmark for Real-World Text-to-SQL Translation
Authors: Hao Wang, Yuanfeng Song, Xiaoming Yin, Xing Chen
First: 2025-11-17T16:52:19+00:00 · Latest: 2025-11-17T16:52:19+00:00
Abstract
Text-to-SQL datasets are essential for training and evaluating text-to-SQL models, but existing datasets often suffer from limited coverage and fail to capture the diversity of real-world applications. To address this, we propose a novel taxonomy for text-to-SQL classification based on dimensions including core intents, statement types, syntax structures, and key actions. Using this taxonomy, we evaluate widely used public text-to-SQL datasets (e.g., Spider and Bird) and reveal limitations in their coverage and diversity. We then introduce a taxonomy-guided dataset synthesis pipeline, yielding a new dataset named SQL-Synth. This approach combines the taxonomy with Large Language Models (LLMs) to ensure the dataset reflects the breadth and complexity of real-world text-to-SQL applications. Extensive analysis and experimental results validate the effectiveness of our taxonomy, as SQL-Synth exhibits greater diversity and coverage compared to existing benchmarks. Moreover, we uncover that existing LLMs typically fall short in adequately capturing the full range of scenarios, resulting in limited performance on SQL-Synth. However, fine-tuning can substantially improve their performance in these scenarios. The proposed taxonomy has significant potential impact, as it not only enables comprehensive analysis of datasets and the performance of different LLMs, but also guides the construction of training data for LLMs.
中文标题/摘要
标题:超越SELECT:基于分类学指导的全面基准测试,用于实际文本到SQL转换
文本到SQL数据集对于训练和评估文本到SQL模型至关重要,但现有数据集往往覆盖有限且未能捕捉到实际应用的多样性。为解决这一问题,我们提出了一种基于核心意图、语句类型、语法结构和关键操作等维度的新型文本到SQL分类分类学。利用这种分类学,我们评估了广泛使用的公开文本到SQL数据集(例如Spider和Bird),并揭示了它们在覆盖范围和多样性方面的局限性。然后,我们引入了一种基于分类学的数据集合成管道,生成了一个名为SQL-Synth的新数据集。该方法结合了分类学和大型语言模型(LLMs),以确保数据集反映实际文本到SQL应用的广度和复杂性。广泛的分析和实验结果验证了我们分类学的有效性,因为SQL-Synth在多样性与覆盖范围方面优于现有基准。此外,我们发现现有LLMs通常未能充分捕捉到各种场景,导致在SQL-Synth上的性能有限。然而,微调可以显著提高它们在这些场景中的性能。所提出的分类学具有重大影响,因为它不仅能够对数据集和不同LLMs的性能进行全面分析,还能够指导LLMs训练数据的构建。
Summary / 总结
The research aims to address the limitations of existing text-to-SQL datasets by proposing a taxonomy for text-to-SQL classification that includes core intents, statement types, syntax structures, and key actions. This taxonomy is used to evaluate and reveal the limitations of popular datasets like Spider and Bird. A new dataset, SQL-Synth, is created using a taxonomy-guided synthesis pipeline, which combines the taxonomy with Large Language Models to better reflect real-world applications. Experimental results show that SQL-Synth has greater diversity and coverage, and that fine-tuning LLMs can significantly improve their performance on this dataset.
本文提出了一种基于核心意图、语句类型、语法结构和关键动作的文本到SQL分类 taxonomy,以解决现有文本到SQL数据集的局限性。作者使用该 taxonomy 评估了流行的 Spider 和 Bird 数据集,并揭示了它们的局限性。然后,他们引入了使用 taxonomy 引导的管道和大型语言模型合成的新数据集 SQL-Synth,该数据集展示了更大的多样性和覆盖范围。研究发现,现有的 LLM 在 SQL-Synth 上表现不佳,但微调可以显著提高其性能。该 taxonomy 对于分析数据集和指导 LLM 训练数据的构建具有重要价值。
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
Authors: Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang
First: 2025-11-17T16:50:58+00:00 · Latest: 2025-11-17T16:50:58+00:00
Abstract
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by the interchangeability of visual tokens, we explore, for the first time, verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, VVS, to accelerate visual AR generation via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamical truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
中文标题/摘要
标题:VVS:通过部分验证跳过加速视觉自回归生成
视觉自回归(AR)生成模型在图像生成方面展现了强大的潜力,但其下一个标记预测范式引入了显著的推理延迟。尽管推测性解码(SD)已被证明对加速视觉AR模型有效,但其“先草拟一步,再验证一步”的范式阻止了直接减少前向传递次数,从而限制了加速潜力。受视觉标记可互换性的启发,我们首次在视觉AR模型生成的SD过程中探索验证跳过,以明确减少目标模型前向传递次数,从而降低推理延迟。基于对草拟阶段特性的分析,我们观察到验证冗余和过时特征的可重用性是保留生成质量和加速无验证步骤的关键因素。受这两个观察的启发,我们提出了一种新的SD框架VVS,通过部分验证跳过加速视觉AR生成,该框架整合了三个互补模块:(1)一种无验证的标记选择器,具有动态截断,(2)标记级别的特征缓存和重用,以及(3)细粒度的跳过步骤调度。因此,VVS相比传统的AR解码将目标模型前向传递次数减少了2.8倍,同时保持了竞争力的生成质量,提供了优于传统SD框架的更优速度-质量权衡,并揭示了重塑SD范式的强大潜力。
Summary / 总结
The research aims to reduce inference latency in visual autoregressive (AR) generation models by exploring verification skipping in speculative decoding (SD). The proposed VVS framework reduces the number of target model forward passes by 2.8 times through three modules: a verification-free token selector, token-level feature caching, and fine-grained skipped step scheduling, while maintaining competitive generation quality and offering a better speed-quality trade-off compared to conventional SD frameworks.
论文通过提出VVS框架,该框架在视觉自回归(AR)生成模型的推测解码过程中跳过某些步骤的验证,以解决高推理延迟问题。VVS相比传统的AR解码将目标模型前向传递的数量减少了2.8倍,同时保持了相当的生成质量。它通过无验证的token选择器、token级别的特征缓存和细粒度的跳过步骤调度来实现这一点。
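The effect of skipping verification on the number of target-model forward passes can be seen in a toy draft-then-verify loop. This sketch is not the VVS algorithm: it uses a fixed skip schedule in place of the paper's token selector, omits feature caching entirely, and the two "models" are stand-in functions invented for illustration.

```python
def decode(draft_next, target_next, length, skip_verify):
    """Toy greedy draft-then-verify loop.

    draft_next / target_next: functions mapping the token prefix to a token.
    skip_verify: function mapping the step index to True when the drafted
    token should be accepted without calling the target model.
    Returns the generated tokens and the number of target forward passes.
    """
    tokens = []
    target_calls = 0
    for step in range(length):
        proposal = draft_next(tokens)
        if skip_verify(step):
            # Verification-free step: trust the draft, no target call.
            tokens.append(proposal)
            continue
        # Verified step: one target forward pass; keep the draft only if
        # it agrees with the target, otherwise use the target's token.
        target_calls += 1
        verified = target_next(tokens)
        tokens.append(proposal if proposal == verified else verified)
    return tokens, target_calls


def toy_target(prefix):
    """Stand-in target model (hypothetical)."""
    return (len(prefix) * 3) % 7


def toy_draft(prefix):
    """Stand-in draft model that is wrong at every fourth position."""
    if len(prefix) % 4 == 0:
        return 0
    return toy_target(prefix)


if __name__ == "__main__":
    full = decode(toy_draft, toy_target, 8, lambda step: False)
    skipped = decode(toy_draft, toy_target, 8, lambda step: step % 2 == 1)
    print(full)
    print(skipped)
```

In this toy run the skip schedule only skips steps where the draft happens to be right, so the output is unchanged while the target is called half as often. Choosing which steps are safe to skip is the hard part that the paper's selector and scheduler address.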
Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images
Authors: Yinuo Xu, Yan Cui, Mingyao Li, Zhi Huang
First: 2025-11-17T16:49:59+00:00 · Latest: 2025-11-17T16:49:59+00:00
Abstract
Identifying cell types and subtypes from routine histopathology images is essential for improving the computational understanding of human disease. Existing tile-based models can capture detailed nuclear morphology but often fail to incorporate the broader tissue context that influences a cell's function and identity. In addition, available human annotations are typically coarse-grained and unevenly distributed across studies, making fine-grained subtype-level supervision difficult to obtain.
To address these limitations, we introduce NuClass, a pathologist-workflow-inspired framework for cell-wise multi-scale integration of nuclear morphology and microenvironmental context. NuClass includes two main components: Path local, which focuses on nuclear morphology from 224-by-224 pixel crops, and Path global, which models the surrounding 1024-by-1024 pixel neighborhood. A learnable gating module adaptively balances local detail and contextual cues. To encourage complementary learning, we incorporate an uncertainty-guided objective that directs the global path to prioritize regions where the local path is uncertain. We also provide calibrated confidence estimates and Grad-CAM visualizations to enhance interpretability.
To overcome the lack of high-quality annotations, we construct a marker-guided dataset from Xenium spatial transcriptomics assays, yielding single-cell resolution labels for more than two million cells across eight organs and 16 classes. Evaluated on three fully held-out cohorts, NuClass achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines. Our results show that multi-scale, uncertainty-aware fusion can bridge the gap between slide-level pathological foundation models and reliable, cell-level phenotype prediction.
中文标题/摘要
标题:自适应多尺度集成解锁组织病理学图像中稳健的细胞注释
从常规组织病理学图像中识别细胞类型和亚型对于提高对人类疾病的计算理解至关重要。现有的基于瓦片的模型可以捕捉详细的核形态,但往往无法整合影响细胞功能和身份的更广泛的组织上下文。此外,可用的人工标注通常较为粗糙且在研究之间分布不均,使得获得细粒度的亚型级监督变得困难。
为解决这些限制,我们引入了NuClass,这是一种受病理学家工作流程启发的细胞级多尺度核形态和微环境上下文集成框架。NuClass 包含两个主要组件:局部路径,专注于224×224像素的核形态,和全局路径,建模周围1024×1024像素的邻域。一个可学习的门控模块自适应地平衡局部细节和上下文线索。为了促进互补学习,我们引入了一个基于不确定性目标,引导全局路径优先关注局部路径不确定的区域。我们还提供了校准的置信度估计和Grad-CAM可视化,以增强可解释性。
为克服高质量标注的缺乏,我们从Xenium空间转录组学分析中构建了一个标记引导的数据集,为八个器官和16个类别中的超过两百万个细胞提供了单细胞分辨率的标签。在三个完全独立的队列上评估,NuClass 的最佳类别达到了96%的F1分数,优于强大的基线。我们的结果表明,多尺度、不确定性意识融合可以弥合切片级病理基础模型与可靠、细胞级表型预测之间的差距。
Summary / 总结
The research aims to improve cell annotation in histopathology images by integrating nuclear morphology and microenvironmental context. NuClass, a pathologist-inspired framework, includes Path local for detailed nuclear morphology and Path global for broader tissue context. An adaptive gating module balances local detail and contextual cues, while an uncertainty-guided objective enhances complementary learning. NuClass, evaluated on three cohorts, achieves up to 96 percent F1 for its best-performing class, outperforming strong baselines and demonstrating the effectiveness of multi-scale, uncertainty-aware fusion in cell annotation.
该论文通过引入NuClass框架,解决了在组织病理学图像中准确标注细胞类型和亚型的挑战。NuClass包括两个组件:Path local用于详细核形态学,Path global用于更广泛的组织环境,其中有一个可学习的门控模块来平衡这些。作者使用空间转录组学构建了一个标记引导的数据集进行训练和评估,其最佳性能类别的F1值达到96%,超过了现有模型。
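The gating and uncertainty-guided ideas described above can be sketched in a few lines. This is an illustrative assumption rather than the NuClass implementation: the scalar sigmoid gate, the probability-level blending, and the normalized-entropy weight are all hypothetical choices, and in practice the gate would be predicted by a learned module.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(local_logits, global_logits, gate_logit):
    """Blend the two paths' class probabilities.

    g = sigmoid(gate_logit) is the weight on the local path and
    (1 - g) is the weight on the global path.
    """
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    p_local = softmax(local_logits)
    p_global = softmax(global_logits)
    return [g * a + (1.0 - g) * b for a, b in zip(p_local, p_global)]


def uncertainty_weight(local_logits):
    """Normalized entropy of the local prediction, in [0, 1].

    Hypothetical use: multiply the global path's per-cell loss by this
    value so the global path focuses on cells the local path finds hard.
    """
    probs = softmax(local_logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


if __name__ == "__main__":
    fused = gated_fusion([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], gate_logit=1.0)
    print(fused)
    print(uncertainty_weight([0.0, 0.0, 0.0]))
    print(uncertainty_weight([10.0, 0.0, 0.0]))
```

Blending probabilities rather than logits keeps the fused output a valid distribution for any gate value, which is convenient when calibrated confidence estimates are needed.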