Private Frequency Estimation Via Residue Number Systems
Authors: Héber H. Arcolezi
Venue: AAAI 2026
First: 2025-11-14T18:58:41+00:00 · Latest: 2025-11-14T18:58:41+00:00
Comments: AAAI 2026
Abstract
We present \textsf{ModularSubsetSelection} (MSS), a new algorithm for locally differentially private (LDP) frequency estimation. Given a universe of size $k$ and $n$ users, our $\varepsilon$-LDP mechanism encodes each input via a Residue Number System (RNS) over $\ell$ pairwise-coprime moduli $m_0, \ldots, m_{\ell-1}$, and reports a randomly chosen index $j \in [\ell]$ along with the perturbed residue using the statistically optimal \textsf{SubsetSelection}~(SS) (Wang et al. 2016). This design reduces the user communication cost from $\Theta\bigl(\omega\log_2(k/\omega)\bigr)$ bits required by standard SS (with $\omega \approx k/(e^\varepsilon+1)$) down to $\lceil \log_2 \ell \rceil + \lceil \log_2 m_j \rceil$ bits, where $m_j < k$. Server-side decoding runs in $\Theta(n + r k \ell)$ time, where $r$ is the number of LSMR (Fong and Saunders 2011) iterations. In practice, with well-conditioned moduli (\textit{i.e.}, constant $r$ and $\ell = \Theta(\log k)$), this becomes $\Theta(n + k \log k)$. We prove that MSS achieves worst-case MSE within a constant factor of state-of-the-art protocols such as SS and \textsf{ProjectiveGeometryResponse} (PGR) (Feldman et al. 2022), while avoiding the algebraic prerequisites and dynamic-programming decoder required by PGR. Empirically, MSS matches the estimation accuracy of SS, PGR, and \textsf{RAPPOR} (Erlingsson, Pihur, and Korolova 2014) across realistic $(k, \varepsilon)$ settings, while offering faster decoding than PGR and shorter user messages than SS. Lastly, by sampling from multiple moduli and reporting only a single perturbed residue, MSS achieves the lowest reconstruction-attack success rate among all evaluated LDP protocols.
中文标题/摘要
标题:基于同余数系统的一种私有频率估计方法
我们提出了一个新的局部差分隐私(LDP)频率估计算法\textsf{ModularSubsetSelection}(MSS)。给定大小为$k$的宇宙和$n$个用户,我们的$\varepsilon$-LDP机制通过一对互素模数$m_0, \ldots, m_{\ell-1}$上的同余数系统(RNS)对每个输入进行编码,并报告一个随机选择的索引$j \in [\ell]$以及经过扰动的同余数,使用统计上最优的\textsf{SubsetSelection}(SS)(Wang et al. 2016)。这种设计将用户通信成本从标准SS所需的$Θ\bigl(ω\log_2(k/ω)\bigr)$比特降低到$\lceil \log_2 \ell \rceil + \lceil \log_2 m_j \rceil$比特,其中$m_j < k$。服务器端解码运行时间为$Θ(n + r k \ell)$,其中$r$是LSMR(Fong and Saunders 2011)迭代次数。在实际应用中,当模数条件良好(即$r$和$\ell = Θ(\log k)$为常数)时,这变为$Θ(n + k \log k)$。我们证明MSS在最坏情况下的均方误差(MSE)与SS和\textsf{ProjectiveGeometryResponse}(PGR)(Feldman et al. 2022)等最先进的协议相比,仅在常数因子内,同时避免了PGR所需的代数前提和动态规划解码器。实验上,MSS在实际的$(k, \varepsilon)$设置中与SS、PGR和\textsf{RAPPOR}(Erlingsson, Pihur, and Korolova 2014)的估计精度相当,同时PGR的解码速度更快,SS的用户消息更短。最后,通过从多个模数中采样并仅报告一个扰动的同余数,MSS在所有评估的LDP协议中实现了最低的重建攻击成功率。
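To make the encoding step in the MSS abstract above concrete, here is a minimal Python sketch of a Residue Number System over pairwise-coprime moduli and of the message-size formula $\lceil \log_2 \ell \rceil + \lceil \log_2 m_j \rceil$. It is illustrative only: the function names and the moduli (7, 11, 13) are assumptions, the SubsetSelection perturbation of the residue is omitted entirely, and the Chinese-remainder reconstruction is included just to show that the residues identify the input. According to the abstract, the actual server-side decoder works on aggregated reports with LSMR, not per-user CRT.

```python
from math import prod

def rns_encode(x, moduli):
    """Residues of x under each modulus (noise-free; no LDP perturbation here)."""
    return [x % m for m in moduli]

def ceil_log2(n):
    """Exact ceil(log2(n)) for integers n >= 1, avoiding floating point."""
    return (n - 1).bit_length()

def message_bits(num_moduli, m_j):
    """Bits for the index j plus one residue: ceil(log2 l) + ceil(log2 m_j)."""
    return ceil_log2(num_moduli) + ceil_log2(m_j)

def crt_decode(residues, moduli):
    """Chinese Remainder Theorem: recover x modulo the product of the moduli."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(a, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total

# Hypothetical parameters: universe size k <= 7 * 11 * 13 = 1001.
moduli = [7, 11, 13]
residues = rns_encode(123, moduli)
```

With these assumed moduli, a user holding the value 123 would send an index in {0, 1, 2} plus one (perturbed) residue, which costs `message_bits(3, 13)` bits in the largest case.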
LARM: A Large Articulated-Object Reconstruction Model
Authors: Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su, Minghua Liu
First: 2025-11-14T18:55:27+00:00 · Latest: 2025-11-14T18:55:27+00:00
Comments: project page: https://sylviayuan-sy.github.io/larm-site/
Abstract
Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images.
中文标题/摘要
标题:LARM:一种大型 articulated-Object 重建模型
对具有现实几何形状、纹理和运动学的3D articulated对象进行建模对于广泛的应用至关重要。然而,现有的基于优化的重建方法通常需要密集的多视角输入和昂贵的单实例优化,限制了它们的可扩展性。最近的前馈方法提供了更快的替代方案,但经常产生粗糙的几何形状,缺乏纹理重建,并依赖于脆弱且复杂的多阶段管道。我们引入了LARM,这是一种统一的前馈框架,可以从稀疏视角图像中重建3D articulated对象,同时联合恢复详细的几何形状、现实的纹理和准确的关节结构。LARM 将最近的静态3D对象新颖视图合成(NVS)方法LVSM 扩展到 articulated 设置,通过使用基于变压器的架构联合推理相机姿态和articulation 变化,实现可扩展且准确的新视角合成。此外,LARM 生成辅助输出,如深度图和部分掩码,以促进显式的3D网格提取和关节估计。我们的管道消除了密集监督的需要,并支持在多种对象类别中进行高保真重建。广泛的实验表明,LARM 在新颖视图和状态合成以及3D articulated对象重建方面均优于现有最先进的方法,生成高质量的网格,紧密符合输入图像。项目页面:https://sylviayuan-sy.github.io/larm-site/
Summary / 总结
LARM is a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. It extends LVSM, a recent NVS approach, to handle articulated objects using a transformer-based architecture. LARM outperforms existing methods in novel view and state synthesis, and 3D articulated object reconstruction, producing high-quality meshes that closely match input images.
LARM 是一个统一的前馈框架,可以从稀疏视角图像中重建3D articulated对象,同时联合恢复详细的几何结构、逼真的纹理和准确的关节结构。它将最近的新型视图合成方法 LVSM 扩展到处理 articulated 对象,使用基于变压器的架构。LARM 在新型视图和状态合成方面优于最先进的方法,生成与输入图像高度一致的高质量网格,适用于各种对象类别。
The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent
Authors: Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala
Venue: NeurIPS 2025 Spotlight
First: 2025-02-19T18:58:28+00:00 · Latest: 2025-11-14T18:52:37+00:00
Abstract
Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically fewer samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms.
中文标题/摘要
标题:深度的优势:梯度下降学习高维分层函数
理解梯度下降(GD)训练的深层神经网络相较于浅层模型的优势仍然是一个开放的理论挑战。在本文中,我们引入了一类目标函数(单指数和多指数高斯分层目标),这些目标函数包含了一级潜在子空间维度的分层结构。该框架使我们能够在高维极限下,从理论上研究深层网络与浅层网络的学习动态和泛化性能。具体而言,我们的主要定理表明,梯度下降特征学习逐步降低有效维度,将高维问题转化为一系列低维问题。这使得使用梯度下降学习目标函数所需的样本数量远少于浅层网络。虽然结果是在受控训练环境中证明的,我们还讨论了更常见的训练过程,并认为它们通过相同机制进行学习。
Summary / 总结
The paper investigates the computational advantages of deep neural networks over shallow ones in learning hierarchical functions using gradient descent. It introduces Gaussian hierarchical targets to study the learning dynamics and generalization performance. The main result is that deep networks can reduce the effective dimensionality, allowing them to learn with fewer samples compared to shallow networks. This is achieved through successive feature learning that transforms high-dimensional problems into a sequence of lower-dimensional ones.
本文研究了深度神经网络在使用梯度下降学习层次函数时相较于浅层网络的计算优势。通过引入高斯层次目标,作者分析了高维条件下的学习动态和泛化性能。主要发现是,深度网络通过特征学习逐步降低有效维度,使其能够用更少的样本学习,相比浅层网络。虽然理论结果是在受控条件下得出的,但论文也讨论了这些机制如何适用于更常见的训练过程。
Multistability of Self-Attention Dynamics in Transformers
Authors: Claudio Altafini
First: 2025-11-14T18:45:22+00:00 · Latest: 2025-11-14T18:45:22+00:00
Comments: 8 pages, 3 figures
Abstract
In machine learning, a self-attention dynamics is a continuous-time multiagent-like model of the attention mechanisms of transformers. In this paper we show that such dynamics is related to a multiagent version of the Oja flow, a dynamical system that computes the principal eigenvector of a matrix, which for transformers corresponds to the value matrix. We classify the equilibria of the ``single-head'' self-attention system into four classes: consensus, bipartite consensus, clustering and polygonal equilibria. Multiple asymptotically stable equilibria from the first three classes often coexist in the self-attention dynamics. Interestingly, equilibria from the first two classes are always aligned with the eigenvectors of the value matrix, often but not exclusively with the principal eigenvector.
中文标题/摘要
标题:变换器中自我注意力动力学的多稳态
在机器学习中,自我注意力动力学是一种连续时间的类多智能体模型,用于描述变换器的注意力机制。本文表明,这种动力学与多智能体版本的奥贾流有关,这是一种计算矩阵主特征向量的动力系统,对于变换器而言,该矩阵对应于值矩阵。我们将“单头”自我注意力系统的平衡点分为四类:共识、二分共识、聚类和多边形平衡点。在自我注意力动力学中,前三类的多个渐近稳定平衡点通常共存。有趣的是,前两类的平衡点总是与值矩阵的特征向量对齐,通常但不总是与主特征向量对齐。
Summary / 总结
This paper investigates the self-attention dynamics in transformers, showing that these dynamics are related to a multiagent version of the Oja flow, a dynamical system that computes the principal eigenvector of a matrix. The study classifies the equilibria of the single-head self-attention system into four classes: consensus, bipartite consensus, clustering, and polygonal equilibria. The key finding is that multiple asymptotically stable equilibria from the first three classes often coexist, and that equilibria from the first two classes are always aligned with the eigenvectors of the value matrix, often but not exclusively the principal eigenvector.
本文研究了变压器中的自注意力动态,将其与用于计算特征向量的Oja流联系起来。作者将单头自注意力系统的平衡点分为四类:共识、双部分共识、聚类和多边形平衡点。他们发现,多种稳定的平衡点通常共存,前两类平衡点通常与值矩阵的特征向量对齐,经常包括主特征向量。
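For readers unfamiliar with the Oja flow mentioned in the self-attention entry above, the following Python sketch integrates the classical single-agent Oja flow $\dot{x} = Ax - (x^\top A x)\,x$ with forward Euler steps. This is not the multiagent self-attention system studied in the paper; the matrix, step size, and function names are assumptions chosen only to show the flow settling on the principal eigenvector of a symmetric matrix.

```python
import math

def matvec(a, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def oja_flow(a, x0, dt=0.01, steps=5000):
    """Forward-Euler integration of dx/dt = A x - (x^T A x) x."""
    norm = math.sqrt(sum(v * v for v in x0))
    x = [v / norm for v in x0]  # start on the unit sphere
    for _ in range(steps):
        ax = matvec(a, x)
        rayleigh = sum(xi * axi for xi, axi in zip(x, ax))
        x = [xi + dt * (axi - rayleigh * xi) for xi, axi in zip(x, ax)]
    return x

# Assumed example: eigenvalues 2 and 1, principal eigenvector along the first axis.
A = [[2.0, 0.0], [0.0, 1.0]]
x_final = oja_flow(A, [0.6, 0.8])
```

In this toy case the component along the second axis decays while the state converges to a unit vector along the first axis, which is the eigenvector with the largest eigenvalue.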
DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
First: 2025-11-14T18:42:18+00:00 · Latest: 2025-11-14T18:42:18+00:00
Abstract
Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively ``zooms in'' on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
中文标题/摘要
标题:DocLens : 一种工具增强的多智能体框架,用于长视觉文档理解
理解长视觉文档,其中信息分布在大量的文本和视觉元素页面上,是现代视觉-语言模型(VLMs)面临的一个关键但具有挑战性的任务。现有方法在根本挑战上失败:证据定位。它们难以检索相关页面并忽略视觉元素中的细粒度细节,导致性能有限和模型幻觉。为了解决这个问题,我们提出了DocLens,一种工具增强的多智能体框架,能够有效地“聚焦”在证据上,就像镜头一样。它首先从整个文档导航到相关页面上的特定视觉元素,然后采用采样-裁定机制生成一个可靠的答案。与Gemini-2.5-Pro结合使用时,DocLens在MMLongBench-Doc和FinRAGBench-V上达到了最先进的性能,甚至超过了人类专家。该框架在视觉中心和无法回答的查询方面表现出色,展示了其增强定位能力的强大之处。
Summary / 总结
DocLens is a tool-augmented multi-agent framework designed to improve the understanding of long visual documents by addressing the challenge of evidence localization. It navigates from the full document to specific visual elements on relevant pages and uses a sampling-adjudication mechanism to generate a reliable answer. DocLens, paired with Gemini-2.5-Pro, outperforms existing models and even human experts on MMLongBench-Doc and FinRAGBench-V, especially on vision-centric and unanswerable queries, showcasing its enhanced localization capabilities.
DocLens 是一种工具增强的多智能体框架,旨在通过解决证据定位的挑战来提高对长视觉文档的理解。它通过导航到特定页面上的具体视觉元素,并使用抽样-裁定机制生成可靠的答案。当与 Gemini-2.5-Pro 结合使用时,DocLens 在 MMLongBench-Doc 和 FinRAGBench-V 上超越了现有模型和人类专家,特别是在视觉中心和无法回答的问题上表现出色。
Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping
Authors: Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat
Venue: AAAI 2026
First: 2025-11-14T18:42:18+00:00 · Latest: 2025-11-14T18:42:18+00:00
Comments: Accepted to AAAI 2026 AI Alignment Track
Abstract
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining the alignment. For the pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
中文标题/摘要
标题:使马基雅维利主义代理一致:测试时行为引导策略塑造
在复杂动态环境中部署决策AI代理,保持与人类价值观或指导方针的一致性是一项关键挑战。仅为了实现目标而训练的代理可能会采取有害行为,这揭示了最大化奖励函数与保持一致之间的关键权衡。对于预训练的代理,确保一致性尤其具有挑战性,因为重新训练是一个成本高且耗时的过程。此外,代表一致性的伦理价值观多样且可能相互冲突,进一步增加了挑战。为了解决这些挑战,我们提出了一种基于模型引导的策略塑造的测试时一致性技术。该方法允许对个体行为属性进行精确控制,适用于多种强化学习(RL)环境,并在不需重新训练代理的情况下,促进伦理一致性和奖励最大化之间的原则性权衡。我们使用MACHIAVELLI基准进行评估,该基准包括134个基于文本的游戏环境和数千个涉及伦理决策的标注场景。首先,RL代理被训练以最大化其各自游戏中的奖励。在测试时,我们通过场景-动作属性分类器应用策略塑造,以确保决策与伦理属性的一致性。我们将我们的方法与先前的训练时方法和通用代理进行比较,并研究了几种类型的伦理违规和权力追求行为。我们的结果表明,测试时策略塑造为在多种环境和一致性属性中缓解不道德行为提供了一种有效且可扩展的解决方案。
Summary / 总结
The paper addresses the challenge of aligning AI agents with human values in complex environments, proposing a test-time policy shaping technique. This method allows for precise control over individual behavioral attributes, generalizes across different reinforcement learning environments, and avoids the need for retraining. Evaluations using the MACHIAVELLI benchmark show that this approach effectively mitigates unethical behavior across various scenarios and attributes.
论文针对在复杂环境中使AI代理与人类价值观保持一致的挑战,提出了一种基于测试时策略塑造的方法,适用于预训练代理。该方法使用场景-动作属性分类器来引导代理行为,无需重新训练即可实现对伦理属性的精确控制。在MACHIAVELLI基准测试中,该方法有效地缓解了各种环境和属性下的不道德行为。
Adaptive LiDAR Scanning: Harnessing Temporal Cues for Efficient 3D Object Detection via Multi-Modal Fusion
Authors: Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim
Venue: AAAI
First: 2025-08-03T03:20:36+00:00 · Latest: 2025-11-14T18:31:11+00:00
Comments: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026
Abstract
Multi-sensor fusion using LiDAR and RGB cameras significantly enhances the 3D object detection task. However, conventional LiDAR sensors perform dense, stateless scans, ignoring the strong temporal continuity in real-world scenes. This leads to substantial sensing redundancy and excessive power consumption, limiting their practicality on resource-constrained platforms. To address this inefficiency, we propose a predictive, history-aware adaptive scanning framework that anticipates informative regions of interest (ROI) based on past observations. Our approach introduces a lightweight predictor network that distills historical spatial and temporal contexts into refined query embeddings. These embeddings guide a differentiable Mask Generator network, which leverages Gumbel-Softmax sampling to produce binary masks identifying critical ROIs for the upcoming frame. Our method significantly reduces unnecessary data acquisition by concentrating dense LiDAR scanning only within these ROIs and sparsely sampling elsewhere. Experiments on nuScenes and Lyft benchmarks demonstrate that our adaptive scanning strategy reduces LiDAR energy consumption by over 65% while maintaining competitive or even superior 3D object detection performance compared to traditional LiDAR-camera fusion methods with dense LiDAR scanning.
中文标题/摘要
标题:自适应LiDAR扫描:利用时间线索通过多模态融合进行高效3D物体检测
使用LiDAR和RGB相机的多传感器融合显著增强了3D物体检测任务。然而,传统的LiDAR传感器执行密集、无状态的扫描,忽略了现实场景中的强烈时间连续性。这导致了大量不必要的感测冗余和过度的能耗,限制了它们在资源受限平台上的实用性。为了解决这种低效率,我们提出了一种预测性、历史感知的自适应扫描框架,该框架基于过去的观察来预测具有信息性的感兴趣区域(ROI)。我们的方法引入了一个轻量级的预测网络,该网络将历史空间和时间上下文提炼为精炼的查询嵌入。这些嵌入指导一个可微分的掩码生成网络,该网络利用Gumbel-Softmax采样生成二进制掩码,以识别即将出现帧中的关键ROI。我们的方法通过仅在这些ROI内进行密集的LiDAR扫描,而在其他地方稀疏采样,显著减少了不必要的数据采集。在nuScenes和Lyft基准测试中,我们的自适应扫描策略将LiDAR能耗降低了超过65%,同时保持了与传统密集LiDAR扫描的LiDAR-相机融合方法相当甚至更优的3D物体检测性能。
Summary / 总结
The paper proposes an adaptive LiDAR scanning framework that uses historical data to predict and focus on informative regions of interest, reducing unnecessary data acquisition and energy consumption. Experiments on nuScenes and Lyft show a reduction of over 65% in LiDAR energy use while maintaining or improving 3D object detection performance compared to traditional fusion methods with dense LiDAR scanning.
论文提出了一种基于历史时空线索预测感兴趣区域的自适应LiDAR扫描框架,通过减少不必要的数据采集,将能量消耗降低超过65%,同时保持与传统密集LiDAR扫描方法相当甚至更好的3D物体检测性能。该方法包括一个轻量级预测网络和一个可微分的掩码生成网络来生成二值掩码进行稀疏LiDAR扫描。
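The Gumbel-Softmax sampling step named in the adaptive LiDAR entry above can be sketched in plain Python as follows. This only illustrates how a binary keep/skip mask can be drawn from per-region logits; the function names, the two-class layout, and the temperature are assumptions, and the paper's Mask Generator network and its gradient path (which would need an autodiff framework) are not reproduced here.

```python
import math
import random

def gumbel_noise(rng):
    """Sample Gumbel(0, 1) noise as -log(-log(u)) with u in (0, 1)."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))

def gumbel_softmax_mask(logits, tau=1.0, rng=None):
    """Return (soft keep-probabilities, hard 0/1 mask) for (keep, skip) logit pairs."""
    rng = rng or random.Random()
    soft, hard = [], []
    for keep_logit, skip_logit in logits:
        a = (keep_logit + gumbel_noise(rng)) / tau
        b = (skip_logit + gumbel_noise(rng)) / tau
        top = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - top), math.exp(b - top)
        p_keep = ea / (ea + eb)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return soft, hard

# Hypothetical logits for three regions: strongly keep, strongly skip, undecided.
soft, hard = gumbel_softmax_mask(
    [(50.0, -50.0), (-50.0, 50.0), (0.0, 0.0)], tau=0.5, rng=random.Random(0)
)
```

Lower temperatures push the soft probabilities toward 0 or 1, so the soft output approaches the hard mask that would gate dense versus sparse scanning.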
LDC: Learning to Generate Research Idea with Dynamic Control
Authors: Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du
First: 2024-12-19T08:28:18+00:00 · Latest: 2025-11-14T18:17:40+00:00
Abstract
Recent advancements in large language models (LLMs) have demonstrated their potential in automating scientific research ideation. Existing approaches primarily focus on prompting techniques, often producing ideas misaligned with the expert standards of novelty, feasibility, and effectiveness, which are widely recognized by the research community as the three key subdimensions of high-quality ideas. Also, balancing these dimensions remains challenging due to their inherent trade-offs. To address these limitations, we propose the first framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) for the task. In the SFT stage, the model learns foundational patterns from pairs of research papers and their corresponding follow-up ideas. In the RL stage, multi-dimensional reward models guided by fine-grained feedback evaluate and optimize the model across key dimensions. During inference, dimensional controllers coordinated by a sentence-level decoder enable dynamic context-aware steering of the idea generation process. Our framework provides a balanced approach to research idea generation, achieving high-quality outcomes in the experiment by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.
中文标题/摘要
标题:LDC:利用动态控制学习生成研究想法
大型语言模型(LLMs)的最新进展展示了其在自动化科学研究构想方面的潜力。现有方法主要集中在提示技术上,经常产生与专家标准——新颖性、可行性和有效性——不一致的想法,这些标准在研究社区中被广泛认为是高质量想法的三个关键子维度。此外,由于这些维度之间的固有权衡,平衡这些维度仍然具有挑战性。为了解决这些局限性,我们提出了第一个结合监督微调(SFT)和可控强化学习(RL)的框架。在SFT阶段,模型从研究论文及其相应后续想法的配对中学习基础模式。在RL阶段,由精细反馈引导的多维度奖励模型评估和优化模型在关键维度上的表现。在推理过程中,由句子级解码器协调的维度控制器使想法生成过程能够动态地根据上下文进行调整。我们的框架提供了一种平衡的研究想法生成方法,在实验中通过动态导航新颖性、可行性和有效性之间的权衡,实现了高质量的结果。
Summary / 总结
This paper addresses the limitations of existing approaches in generating high-quality research ideas by proposing a two-stage framework combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). The SFT stage teaches the model from pairs of research papers and their follow-up ideas, while the RL stage evaluates and optimizes the model using multi-dimensional reward models. During inference, a sentence-level decoder dynamically steers the idea generation process to balance novelty, feasibility, and effectiveness. Experiments show that this framework generates high-quality research ideas by navigating the trade-offs among these dimensions.
本文提出了一种结合监督微调(SFT)和可控强化学习(RL)的两阶段框架,以解决现有方法在生成高质量研究想法方面的局限性。SFT阶段通过研究论文及其后续想法的配对来教导模型,而RL阶段使用多维度奖励模型来评估和优化关键维度。在推理过程中,通过句子级解码器动态引导想法生成过程,实现了在新颖性、可行性和有效性方面的平衡结果。
Volumetric Ergodic Control
Authors: Jueun Kwon, Max M. Sun, Todd Murphey
First: 2025-11-14T18:10:40+00:00 · Latest: 2025-11-14T18:10:40+00:00
Comments: 8 pages, 8 figures
Abstract
Ergodic control synthesizes optimal coverage behaviors over spatial distributions for nonlinear systems. However, existing formulations model the robot as a non-volumetric point, whereas in practice a robot interacts with the environment through its body and sensors, which have physical volume. In this work, we introduce a new ergodic control formulation that optimizes spatial coverage using a volumetric state representation. Our method preserves the asymptotic coverage guarantees of ergodic control, adds minimal computational overhead for real-time control, and supports arbitrary sample-based volumetric models. We evaluate our method across search and manipulation tasks -- with multiple robot dynamics and end-effector geometries or sensor models -- and show that it improves coverage efficiency by more than a factor of two while maintaining a 100% task completion rate across all experiments, outperforming the standard ergodic control method. Finally, we demonstrate the effectiveness of our method on a robot arm performing mechanical erasing tasks.
中文标题/摘要
标题:体素遍历控制
遍历控制综合了非线性系统在空间分布上的最优覆盖行为。然而,现有的公式化模型将机器人视为非体素点,但在实践中,机器人通过其身体和传感器与环境相互作用,具有物理体积。在本文中,我们引入了一种新的遍历控制公式,使用体素状态表示来优化空间覆盖。我们的方法保留了遍历控制的渐近覆盖保证,增加了最少的实时控制计算开销,并支持任意基于样本的体素模型。我们在搜索和操作任务中评估了我们的方法——涉及多种机器人动力学和末端执行器几何形状或传感器模型——并展示了它在所有实验中将覆盖效率提高了超过两倍,同时保持100%的任务完成率,优于标准的遍历控制方法。最后,我们在执行机械擦除任务的机器人臂上展示了我们方法的有效性。
Summary / 总结
This work addresses the limitation of existing ergodic control methods that model robots as non-volumetric points, instead proposing a new volumetric ergodic control formulation. The method uses a volumetric state representation to optimize spatial coverage and maintains asymptotic coverage guarantees while adding minimal computational overhead. Experiments across various tasks show that this approach improves coverage efficiency by more than a factor of two and achieves a 100% task completion rate, outperforming the standard ergodic control method.
本文通过引入体积状态表示解决了现有ergodic控制方法的局限性,优化了非线性系统的空间覆盖。该方法保持了渐近覆盖的保证,并支持任意体积模型,同时具有最小的计算开销。实验结果表明,它将覆盖效率提高了超过两倍,并在各种任务和机器人动力学下保持了100%的任务完成率,优于标准的ergodic控制方法。该方法在执行机械擦除任务的机器人臂上进行了演示。
DiAReL: Reinforcement Learning with Disturbance Awareness for Robust Sim2Real Policy Transfer in Robot Control
Authors: Mohammadhossein Malmir, Josip Josifovski, Noah Klarmann, Alois Knoll
First: 2023-06-15T10:11:38+00:00 · Latest: 2025-11-14T17:57:58+00:00
Comments: Accepted for publication in IEEE Transactions on Control Systems Technology (TCST)
Abstract
Delayed Markov decision processes (DMDPs) fulfill the Markov property by augmenting the state space of agents with a finite time window of recently committed actions. In reliance on these state augmentations, delay-resolved reinforcement learning algorithms train policies to learn optimal interactions with environments featuring observation or action delays. Although such methods can be directly trained on the real robots, due to sample inefficiency, limited resources, or safety constraints, a common approach is to transfer models trained in simulation to the physical robot. However, robotic simulations rely on approximated models of the physical systems, which hinders the sim2real transfer. In this work, we consider various uncertainties in modeling the robot or environment dynamics as unknown intrinsic disturbances applied to the system input. We introduce the disturbance-augmented Markov decision process (DAMDP) in delayed settings as a novel representation to incorporate disturbance estimation in training on-policy reinforcement learning algorithms. The proposed method is validated across several metrics on learning robotic reaching and pushing tasks and compared with disturbance-unaware baselines. The results show that the disturbance-augmented models can achieve higher stabilization and robustness in the control response, which in turn improves the prospects of successful sim2real transfer.
中文标题/摘要
标题:DiAReL:具有干扰意识的强化学习在机器人控制中鲁棒的Sim2Real策略转移
延迟马尔可夫决策过程(DMDPs)通过扩展代理的状态空间,加入最近执行的动作的时间窗口,来满足马尔可夫性质。基于这些状态扩展,延迟解决的强化学习算法训练策略以学习与具有观察或动作延迟的环境进行最优交互。尽管这些方法可以直接在真实机器人上进行训练,但由于样本效率低下、资源有限或安全限制,一种常见的方法是将模拟中训练的模型转移到物理机器人上。然而,机器人模拟依赖于物理系统的近似模型,这阻碍了Sim2Real的转移。在本文中,我们将建模机器人或环境动力学的各种不确定性视为应用于系统输入的未知内在干扰。我们引入了延迟环境中干扰增强的马尔可夫决策过程(DAMDP)作为一种新的表示方法,以在训练在线强化学习算法时结合干扰估计。所提出的方法在学习机器人抓取和推动物体任务方面通过多个指标进行了验证,并与无干扰基线进行了比较。结果表明,干扰增强的模型在控制响应中实现了更高的稳定性和鲁棒性,从而提高了成功Sim2Real转移的前景。
Summary / 总结
This paper introduces DiAReL, a reinforcement learning approach that incorporates disturbance awareness in delayed Markov decision processes to enhance the robustness of sim2real policy transfer in robot control. By using disturbance-augmented Markov decision processes (DAMDPs), the method improves stabilization and robustness in control responses, leading to better sim2real transfer performance. Experiments on robotic reaching and pushing tasks demonstrate the superiority of the proposed method over disturbance-unaware baselines.
研究旨在解决从仿真到真实机器人转移强化学习策略时面临的延迟和不确定性挑战。方法引入了扰动增强的延迟马尔可夫决策过程(DAMDP),以在训练在线强化学习算法时纳入扰动估计。研究显示,这种方法提高了控制响应的稳定性和鲁棒性,从而在机器人抓取和推物任务中实现了更好的仿真到现实世界的转移性能,优于未考虑扰动的基线方法。
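The state augmentation described in the DiAReL entry above (a finite window of recently committed actions appended to the observation) might look like the following Python sketch. The class and parameter names are hypothetical, and the optional `disturbance_estimate` slot is only a placeholder: the abstract does not say how the disturbance is estimated or exactly where it enters the augmented state.

```python
from collections import deque

class DelayAugmentedState:
    """Keeps the last `delay_steps` actions so the augmented state is Markov."""

    def __init__(self, delay_steps, null_action=0.0):
        # Pre-fill with a null action so the window always has a fixed length.
        self.actions = deque([null_action] * delay_steps, maxlen=delay_steps)

    def push_action(self, action):
        """Record a newly committed action; the oldest one is dropped."""
        self.actions.append(action)

    def augment(self, observation, disturbance_estimate=None):
        """Concatenate observation, action window, and (assumed) disturbance estimate."""
        state = list(observation) + list(self.actions)
        if disturbance_estimate is not None:
            state += list(disturbance_estimate)
        return state

# Assumed example: a two-step delay and a two-dimensional observation.
window = DelayAugmentedState(delay_steps=2)
for action in (0.5, -0.2, 0.9):
    window.push_action(action)
augmented = window.augment([1.0, 2.0])
```

A policy would then take `augmented` as its input in place of the raw observation, so that actions still in flight are visible to it.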
Bridging Hidden States in Vision-Language Models
Authors: Benjamin Fein-Ashley, Jacob Fein-Ashley
First: 2025-11-14T17:55:25+00:00 · Latest: 2025-11-14T17:55:25+00:00
Abstract
Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities "think". We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at https://github.com/jfeinashley/BRIDGE.
中文标题/摘要
标题:视觉-语言模型中隐藏状态的连接
视觉-语言模型(VLMs)是一类新的模型,能够将图像内容与自然语言对齐。现有方法通常在编码器内部通过混合标记/特征(早期融合)或通过比较聚合表示(晚期融合)来进行融合。许多方法还将融合与自回归解码器联系起来。然而,两种模态的隐藏状态已经携带了丰富的、模态特定的结构(视觉中的空间布局;文本中的句法和语义),因此直接对齐这些状态是匹配这两种模态“思考”的自然方式。我们提出了一种轻量级的融合模块:在两个编码器的顶部附近放置几层仅跨模态的双向注意力层。每一层将视觉和文本编码器的隐藏状态序列投影到共享空间,跨模态进行注意,并通过简单的稳定器发送门控残差更新,从而改善对齐。编码器保持非因果性,强于理解,而生成则通过可选的解码器保持清晰地分离。在标准检索、VQA和视觉推理基准测试中,BRIDGE在保持对比模型的双编码器效率的同时,优于可比的VLMs。我们将在https://github.com/jfeinashley/BRIDGE上公开我们的代码。
Summary / 总结
The research aims to improve the alignment of visual and textual information in Vision-Language Models (VLMs) by directly aligning the hidden states of both modalities. The proposed method, BRIDGE, introduces a few cross-modal, bidirectional attention layers near the top of both encoders, which project and align the hidden states from vision and text into a shared space. This approach outperforms existing methods on standard benchmarks while maintaining the efficiency of contrastive models and keeping generation decoupled from the encoders.
研究旨在通过直接对齐视觉和文本模态的隐藏状态来改进Vision-Language模型(VLMs)中的图像和文本表示。所提出的方法BRIDGE在两个编码器的顶部引入了几层轻量级的跨模态注意力层,将视觉和文本的隐藏状态投影并映射到共享空间中。实验结果显示,BRIDGE在标准的检索、VQA和视觉推理基准测试中优于同类VLMs,同时保持对比模型的效率。编码器保持非因果性和强大的理解能力,生成则通过可选的解码器保持清晰分离。代码已公开发布在https://github.com/jfeinashley/BRIDGE。
Interpolation Conditions for Data Consistency and Prediction in Noisy Linear Systems
Authors: Martina Vanelli, Nima Monshizadeh, Julien M. Hendrickx
First: 2025-04-11T12:19:51+00:00 · Latest: 2025-11-14T17:48:35+00:00
Comments: 8 pages, 3 figures
Abstract
We develop an interpolation-based framework for noisy linear systems with unknown system matrix with bounded norm (implying bounded growth or non-increasing energy), and bounded process noise energy. The proposed approach characterizes all trajectories consistent with the measured data and these prior bounds in a purely data-driven manner. This characterization enables data-consistency verification, inference, and one-step ahead prediction, which can be leveraged for safety verification and cost minimization. Ultimately, this work represents a preliminary step toward exploiting interpolation conditions in data-driven control, offering a systematic way to characterize trajectories consistent with a dynamical system within a given class and enabling their use in control design.
中文标题/摘要
标题:噪声线性系统中数据一致性和预测的插值条件
我们开发了一种基于插值的框架,用于具有未知系统矩阵(意味着有界增长或能量非递增)和有界过程噪声能量的噪声线性系统。所提出的方法以纯数据驱动的方式表征所有与测量数据和这些先验界一致的轨迹。这种表征能够实现数据一致性验证、推理和一步预测,这些可以用于安全验证和成本最小化。最终,这项工作代表了在数据驱动控制中利用插值条件的一个初步步骤,提供了一种系统的方法来表征给定类内与动力学系统一致的轨迹,并使它们能够用于控制设计。
Experience-Guided Adaptation of Inference-Time Reasoning Strategies
Authors: Adam Stein, Matthew Trager, Benjamin Bowman, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto
First: 2025-11-14T17:45:28+00:00 · Latest: 2025-11-14T17:45:28+00:00
Comments: 29 pages, 5 figures
Abstract
Enabling agentic AI systems to adapt their problem-solving approaches based on post-training interactions remains a fundamental challenge. While systems that update and maintain a memory at inference time have been proposed, existing designs only steer the system by modifying textual input to a language model or agent, which means that they cannot change sampling parameters, remove tools, modify system prompts, or switch between agentic and workflow paradigms. On the other hand, systems that adapt more flexibly require offline optimization and remain static once deployed. We present Experience-Guided Reasoner (EGuR), which generates tailored strategies -- complete computational procedures involving LLM calls, tools, sampling parameters, and control logic -- dynamically at inference time based on accumulated experience. We achieve this using an LLM-based meta-strategy -- a strategy that outputs strategies -- enabling adaptation of all strategy components (prompts, sampling parameters, tool configurations, and control logic). EGuR operates through two components: a Guide generates multiple candidate strategies conditioned on the current problem and structured memory of past experiences, while a Consolidator integrates execution feedback to improve future strategy generation. This produces complete, ready-to-run strategies optimized for each problem, which can be cached, retrieved, and executed as needed without wasting resources. Across five challenging benchmarks (AIME 2025, 3-SAT, and three Big Bench Extra Hard tasks), EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x, with both metrics improving as the system gains experience.
中文标题/摘要
标题:基于经验的推理策略适应
使自主AI系统根据后训练交互调整其问题解决方法仍然是一个基本挑战。虽然已经提出了在推理时更新和维护记忆的系统,但现有设计只能通过修改语言模型或代理的文本输入来引导系统,这意味着它们不能改变采样参数、移除工具、修改系统提示或在代理范式和工作流范式之间切换。另一方面,能够更灵活地适应的系统需要离线优化,并且部署后保持静态。我们提出了经验引导的推理器(EGuR),它可以在推理时根据积累的经验动态生成定制策略——包括涉及LLM调用、工具、采样参数和控制逻辑的完整计算程序。我们使用基于LLM的元策略——输出策略的策略——来适应所有策略组件(提示、采样参数、工具配置和控制逻辑)。EGuR 通过两个组件运行:引导器根据当前问题和过去经验的结构化记忆生成多个候选策略,而整合器整合执行反馈以改进未来的策略生成。这产生了针对每个问题优化的完整、可运行策略,可以在需要时进行缓存、检索和执行,而不会浪费资源。在五个具有挑战性的基准测试(AIME 2025、3-SAT 和三个 Big Bench 额外困难任务)中,EGuR 在最强基线之上实现了高达 14% 的准确率改进,同时将计算成本降低了高达 111 倍,两个指标随着系统经验的增加而提高。
Summary / 总结
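The paper addresses the challenge of enabling agentic AI systems to adapt their problem-solving strategies based on post-training interactions. It introduces EGuR, which uses an LLM-based meta-strategy to generate tailored strategies at inference time. EGuR consists of a Guide, which proposes candidate strategies conditioned on the current problem and a structured memory of past experiences, and a Consolidator, which integrates execution feedback to improve future strategy generation; each strategy is a complete procedure involving LLM calls, tools, sampling parameters, and control logic. Across five benchmarks (AIME 2025, 3-SAT, and three Big Bench Extra Hard tasks), EGuR improves accuracy by up to 14% over the strongest baselines and reduces computational costs by up to 111x, with both metrics improving as the system gains experience.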
该论文介绍了EGuR系统,该系统能够在推理时根据积累的经验动态生成定制策略,适应包括提示、采样参数、工具配置和控制逻辑在内的所有策略组件。EGuR在五个基准测试中优于强基线,实现了高达14%的准确率提升和高达111倍的计算成本降低,这些指标随着系统的经验积累而提高。
OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning
Authors: Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi, Felice Rivellese, Myles J. Lewis, Shaogang Gong, Gregory Slabaugh
First: 2025-11-14T17:31:18+00:00 · Latest: 2025-11-14T17:31:18+00:00
Abstract
Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher's attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.
中文标题/摘要
标题:OpenUS:一种基于自适应掩蔽对比学习的全开源超声图像分析基础模型
超声成像(US)是使用最广泛的医学影像技术之一,得益于其低成本、便携性、实时反馈和无电离辐射的特点。然而,US图像解释仍然高度依赖操作者,并且在不同解剖区域、采集协议和设备类型之间存在显著差异。这些差异,加上诸如斑点、低对比度和有限的标准化注释等独特挑战,阻碍了通用化、标签高效超声AI模型的发展。在本文中,我们提出了OpenUS,这是首个基于大量公开数据构建的可复现、开源的超声基础模型。OpenUS 使用一种视觉Mamba骨干网络,能够捕捉图像中的局部和全局长程依赖关系。为了在预训练期间提取丰富的特征,我们引入了一种新颖的自适应掩蔽框架,结合了对比学习和掩蔽图像建模。该策略将教师的注意力图与学生重建损失相结合,自适应地细化临床相关的掩蔽,以增强预训练效果。OpenUS 还应用了动态学习计划,逐步调整预训练过程的难度。为了开发基础模型,我们编译了迄今为止最大的公开可用的超声数据集,包含来自42个公开数据集的超过308,000张图像,涵盖了多种解剖区域、医疗机构、成像设备和疾病类型。我们预训练的OpenUS模型可以通过作为标签高效微调的骨干网络轻松适应特定下游任务。代码可在https://github.com/XZheng0427/OpenUS获取。
Summary / 总结
OpenUS is an open-source foundation model for ultrasound image analysis that addresses the challenges of operator dependency and data variability. It uses a self-adaptive masking framework combining contrastive learning with masked image modeling to enhance feature extraction during pre-training. The model also employs a dynamic learning schedule to adjust pre-training difficulty. OpenUS is trained on a large dataset of over 308K images from 42 public sources, covering various anatomical regions and imaging conditions. This model can be fine-tuned efficiently for specific tasks with minimal labeled data. Code is available at https://github.com/XZheng0427/OpenUS.
论文提出了OpenUS,这是一种使用自适应掩蔽对比学习的开源超声图像分析基础模型,旨在解决超声成像中的操作者依赖性和数据变异性问题。OpenUS 使用了视觉 Mamba 主干网络,并引入了一种新颖的自适应掩蔽框架,以增强预训练过程中的特征提取。该模型在来自42个公共来源的超过308K张图像的大数据集上进行训练,涵盖了多种解剖区域和成像条件。关键发现包括通过标签高效微调提高预训练效果和下游任务适应性。代码可在 https://github.com/XZheng0427/OpenUS 获取。
On the Necessity of Output Distribution Reweighting for Effective Class Unlearning
Authors: Ali Ebrahimpour-Boroojeny, Yian Wang, Hari Sundaram
First: 2025-06-25T23:53:56+00:00 · Latest: 2025-11-14T17:27:58+00:00
Abstract
In this paper, we reveal a significant shortcoming in class unlearning evaluations: overlooking the underlying class geometry can cause privacy leakage. We further propose a simple yet effective solution to mitigate this issue. We introduce a membership-inference attack via nearest neighbors (MIA-NN) that uses the probabilities the model assigns to neighboring classes to detect unlearned samples. Our experiments show that existing unlearning methods are vulnerable to MIA-NN across multiple datasets. We then propose a new fine-tuning objective that mitigates this privacy leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired distribution during fine-tuning. We also show that across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior unlearning metrics. More specifically, on CIFAR-10, it reduces the gap with retrained models by 19% and 46% for U-LiRA and MIA-NN scores, respectively, compared to the SOTA method for each category.
中文标题/摘要
标题:关于有效类别遗忘中输出分布重权的必要性
在本文中,我们揭示了类别遗忘评估中的一个重要缺陷:忽视底层类别的几何结构会导致隐私泄露。我们进一步提出了一种简单而有效的解决方案来缓解这一问题。我们通过最近邻(MIA-NN)引入了一种成员推断攻击,利用模型分配给相邻类别的概率来检测未遗忘样本。我们的实验表明,现有的遗忘方法在多个数据集上都容易受到MIA-NN的攻击。然后,我们提出了一种新的微调目标,通过近似遗忘类输入下从头训练模型将产生的剩余类别的分布来缓解这种隐私泄露。为了构建这种近似,我们估计了类间相似性,并相应地倾斜目标模型的分布。由此产生的倾斜重权(TRW)分布用于微调过程中的目标分布。我们还展示了在多个基准测试中,TRW在先前的遗忘指标上与现有遗忘方法相当或超越它们。具体来说,在CIFAR-10上,它分别将U-LiRA和MIA-NN得分与最新方法的差距缩小了19%和46%。
Summary / 总结
This paper shows that class unlearning evaluations can leak private information when they overlook the underlying class geometry. It introduces MIA-NN, a nearest-neighbor membership-inference attack that uses the probabilities assigned to neighboring classes to detect unlearned samples, and proposes Tilted ReWeighting (TRW), a fine-tuning objective that approximates, for forget-class inputs, the distribution over the remaining classes that a retrained-from-scratch model would produce. Experiments show that TRW matches or surpasses existing methods on prior unlearning metrics; on CIFAR-10, it reduces the gap with retrained models by 19% for U-LiRA and 46% for MIA-NN scores, each relative to the SOTA method for that category.
本文探讨了在类遗忘评估中忽略底层类几何结构可能导致隐私泄露的问题,引入了基于最近邻的成员推断攻击(MIA-NN),并提出了一种新的微调目标——倾斜重加权(TRW),以减轻这一问题。TRW方法通过近似重新训练模型会产生的剩余类别的分布,从而减少隐私泄露。实验表明,TRW在多个基准测试中优于现有方法,特别是在CIFAR-10上,它分别将U-LiRA和MIA-NN得分与重新训练模型的差距减少了19%和46%。
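The tilting step described in the TRW abstract above can be illustrated with a minimal sketch. The abstract only says that inter-class similarity is estimated and the target model's distribution is tilted accordingly; the exponential form of the tilt, the `beta` strength parameter, and the function name `tilted_reweighting` are assumptions made here for illustration and are not taken from the paper.

```python
import math


def tilted_reweighting(probs, forget_class, similarity, beta=1.0):
    """Build a target distribution over the remaining classes.

    probs:        model probabilities for one forget-class input.
    forget_class: index of the class being unlearned.
    similarity:   similarity[j] is an assumed similarity score between
                  class j and the forget class (hypothetical input).
    beta:         hypothetical tilt strength; beta=0 simply renormalizes.
    """
    weights = []
    for j, p in enumerate(probs):
        if j == forget_class:
            weights.append(0.0)  # the forget class receives no mass
        else:
            # Assumed exponential tilt: similar classes gain mass.
            weights.append(p * math.exp(beta * similarity[j]))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no probability mass on the remaining classes")
    return [w / total for w in weights]


# Class 1 is far more similar to the forget class (0) than class 2 is.
target = tilted_reweighting([0.7, 0.2, 0.1], 0, [1.0, 0.9, 0.1])
print(target)
```

In this toy call the more similar class ends up with a larger share than plain renormalization (0.2 / 0.3) would give it, which is the qualitative behavior the abstract attributes to a retrained-from-scratch model.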
Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
Authors: Mohamad Amin Mohamadi, Tianhao Wang, Zhiyuan Li
First: 2025-11-14T17:20:45+00:00 · Latest: 2025-11-14T17:20:45+00:00
Abstract
Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) to use ternary rewards (+1 correct, 0 abstention, -$λ$ error) instead of binary. Controlled experiments on logic puzzles reveal that varying $λ$ produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk regime: low penalties produce aggressive answerers, high penalties conservative abstainers. We then introduce two inference strategies that exploit trained abstention as a coordination signal: cascading routes queries through models with decreasing risk tolerance, while self-cascading re-queries the same model on abstention. Both outperform majority voting with lower computational cost. These results establish abstention as a first-class training objective that transforms ``I don't know'' from failure into a coordination signal, enabling models to earn trust through calibrated honesty about their limits.
中文标题/摘要
标题:诚实胜于精确:通过强化犹豫构建可信赖的语言模型
现代语言模型未能满足可信赖智能的基本要求:知道何时不应作答。尽管在基准测试中取得了令人印象深刻的准确率,这些模型仍会产生自信的幻觉,即使错误的答案可能导致灾难性后果。我们在GSM8K、MedQA和GPQA上的评估显示,前沿模型几乎从不拒绝作答,即使有明确警告严重惩罚也是如此,这表明提示无法克服训练中任何答案优于不作答的奖励。为解决这一问题,我们提出了一种强化犹豫(RH):对可验证奖励的强化学习(RLVR)进行修改,使用三元奖励(+1正确,0拒绝,-$λ$错误)代替二元奖励。对逻辑谜题的受控实验表明,改变$λ$会产生不同的模型,沿着帕累托前沿,每个训练惩罚都会产生对应风险制度下的最优模型:低惩罚产生积极的回答者,高惩罚产生保守的拒绝者。然后我们引入了两种推理策略,利用训练中的拒绝作为协调信号:逐级将查询传递给风险容忍度递减的模型,而自我级联重新查询同一模型以拒绝。两者都比多数投票具有更低的计算成本且表现更优。这些结果确立了拒绝作为首要训练目标的地位,将“我不知道”从失败转变为协调信号,使模型能够通过对其极限的校准诚实来赢得信任。
Summary / 总结
The paper addresses the issue of language models providing confident but potentially incorrect answers, which undermines their trustworthiness. It proposes Reinforced Hesitation (RH), a modification to RLVR that uses ternary rewards to encourage models to abstain when unsure. Experiments on logic puzzles show that varying the penalty for errors leads to different risk profiles in models, with higher penalties producing more conservative abstainers. The authors also introduce inference strategies that leverage trained abstention as a coordination signal, improving performance over majority voting with lower computational cost.
论文针对语言模型提供自信但可能错误的答案,这损害了它们的可信度。提出了一种名为Reinforced Hesitation (RH) 的方法,通过使用三元奖励来鼓励模型在不确定时选择不回答。实验表明,通过调整惩罚参数 $λ$ 可以生成在不同风险环境下更为激进或保守的模型。作者还引入了两种利用训练中的不回答作为协调信号的推理策略,这些策略在较低的计算成本下优于多数投票。这项工作表明,不回答可以作为重要的训练目标,增强模型的诚实性和可信度。
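The ternary reward and the cascading idea from the Reinforced Hesitation abstract above can be sketched as follows. The reward values (+1, 0, -λ) come from the abstract; the break-even confidence λ/(1+λ) follows from setting the expected reward of answering, p - (1 - p)λ, equal to the abstention reward of 0. The function names and the convention that a model returns `None` to abstain are hypothetical.

```python
def ternary_reward(outcome, lam):
    """Ternary reward: +1 for correct, 0 for abstention, -lam for an error."""
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "wrong":
        return -lam
    raise ValueError(f"unknown outcome: {outcome!r}")


def answer_threshold(lam):
    """Confidence above which answering beats abstaining in expectation.

    Answering with success probability p earns p - (1 - p) * lam,
    which exceeds 0 exactly when p > lam / (1 + lam).
    """
    return lam / (1.0 + lam)


def cascade(query, models):
    """Try each model in the given order; return the first non-abstention.

    models: callables that return an answer, or None to abstain
            (a hypothetical interface for this sketch).
    """
    for model in models:
        answer = model(query)
        if answer is not None:
            return answer
    return None  # every model abstained


for lam in (0.0, 1.0, 9.0):
    print(lam, answer_threshold(lam))
```

Under this reading, a low penalty makes answering worthwhile even at low confidence, while a high penalty pushes the break-even confidence toward 1, matching the abstract's contrast between aggressive answerers and conservative abstainers.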
Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks
Authors: Muthukumar Pandaram, Jakob Hollenstein, David Drexel, Samuele Tosatto, Antonio Rodríguez-Sánchez, Justus Piater
First: 2025-11-11T10:43:26+00:00 · Latest: 2025-11-14T17:15:33+00:00
Abstract
The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias.
In this work, we critically examine these assumptions by analyzing ground-truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks.
We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state-dependent, and (iii) whether local system dynamics change sparsely.
Our results indicate that global sparsity is rare, but instead the tasks show local, state-dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state-dependent sparsity structure of real-world dynamics.
中文标题/摘要
标题:动态稀疏性:挑战机器人强化学习基准中世界模型学习的常见稀疏性假设
学习动力学模型,即世界模型,可以提高强化学习的样本效率。近期研究表明,此类动力学模型的潜在因果图是稀疏连接的,每个未来状态变量仅依赖于当前状态变量的一个小子集,因此学习可能受益于稀疏先验。同样,时间稀疏性,即局部动力学稀疏且突然变化,也被提议作为有用的归纳偏置。
在这项工作中,我们通过分析MuJoCo Playground基准套件中一组机器人强化学习环境的真实动力学,批判性地检查了这些假设,旨在确定提出的状态和时间稀疏性概念是否在典型的强化学习任务中确实普遍存在。
我们研究了(i) 环境动力学的因果图是否稀疏,(ii) 这种稀疏性是否依赖于状态,以及(iii) 局部系统动力学是否稀疏变化。
我们的结果表明,全局稀疏性很少见,但任务在动力学中表现出局部、状态依赖的稀疏性,并且这种稀疏性表现出不同的结构,出现在时间局部化的簇中(例如,在接触事件期间),并影响特定的状态维度子集。这些发现挑战了动力学学习中常见的稀疏性先验假设,强调了需要反映真实世界动力学状态依赖稀疏性结构的基于事实的归纳偏置的重要性。
Summary / 总结
This study investigates the validity of sparsity assumptions in learning dynamics models for robotic reinforcement learning. By analyzing MuJoCo Playground environments, the research finds that global sparsity is uncommon, but local, state-dependent sparsity exists, often appearing during contact events and affecting specific state dimensions. This challenges common sparsity priors and suggests the need for more grounded inductive biases in dynamics learning.
该研究通过分析MuJoCo Playground基准环境,挑战了机器人强化学习中世界模型学习中关于稀疏性的常见假设。研究考察了因果图的稀疏性、状态依赖的稀疏性以及动态的局部变化。结果表明,全局稀疏性很少见,但存在局部、状态依赖的稀疏性,这种稀疏性在接触事件等时间局部区域中出现,并影响特定的状态维度。这表明传统的稀疏性先验可能不适用,强调了需要更符合现实动态稀疏结构的归纳偏置的必要性。
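To make the notion of state-dependent sparsity in the entry above concrete, the sketch below counts non-zero entries in the Jacobian of a toy one-step dynamics function at two different states. Everything here is an illustrative assumption: the toy bouncing-mass dynamics, the finite-difference Jacobian, and the tolerance are not taken from the paper, which analyzes ground-truth dynamics of MuJoCo Playground environments.

```python
def step(state, dt=0.01, g=9.81, k=500.0):
    """Toy dynamics: a mass at height z with velocity v, plus an
    unrelated decaying variable x. A spring force acts only while
    the mass penetrates the ground (z < 0), mimicking contact."""
    z, v, x = state
    contact_force = -k * z if z < 0 else 0.0
    return [z + dt * v, v + dt * (-g + contact_force), 0.9 * x]


def jacobian(f, state, eps=1e-6):
    """Central finite-difference Jacobian; J[i][j] = d f_i / d state_j."""
    n = len(state)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus = list(state)
        minus = list(state)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(n):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def nonzero_count(J, tol=1e-6):
    """Number of Jacobian entries with magnitude above tol."""
    return sum(1 for row in J for entry in row if abs(entry) > tol)


free_flight = nonzero_count(jacobian(step, [0.5, 0.0, 1.0]))
in_contact = nonzero_count(jacobian(step, [-0.01, 0.0, 1.0]))
print(free_flight, in_contact)
```

The dependency of the next velocity on the current height appears only during contact, so the local causal structure differs between the two states even though the function itself never changes.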
BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
Authors: Jonathan Schmidt, Simon Giebenhain, Matthias Niessner
Venue: NeurIPS 2025
First: 2025-06-06T17:53:58+00:00 · Latest: 2025-11-14T17:10:57+00:00
Comments: NeurIPS 2025, Project Page: see https://jonathsch.github.io/becominglit/ , YouTube Video: see https://youtu.be/xPyeIqKdszA
Abstract
We introduce BecomingLit, a novel method for reconstructing relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. To this end, we propose a new low-cost light stage capture setup, tailored specifically towards capturing faces. Using this setup, we collect a novel dataset consisting of diverse multi-view sequences of numerous subjects under varying illumination conditions and facial expressions. By leveraging our new dataset, we introduce a new relightable avatar representation based on 3D Gaussian primitives that we animate with a parametric head model and an expression-dependent dynamics module. We propose a new hybrid neural shading approach, combining a neural diffuse BRDF with an analytical specular term. Our method reconstructs disentangled materials from our dynamic light stage recordings and enables all-frequency relighting of our avatars with both point lights and environment maps. In addition, our avatars can easily be animated and controlled from monocular videos. We validate our approach in extensive experiments on our dataset, where we consistently outperform existing state-of-the-art methods in relighting and reenactment by a significant margin.
中文标题/摘要
标题:BecomingLit:可重新照明的高分辨率混合神经光照头像
我们提出了BecomingLit,一种用于重建可重新照明、高分辨率头像的新方法,可以从新颖视角以交互速率进行渲染。为此,我们提出了一种新的低成本光场捕捉装置,专门针对面部捕捉。使用此装置,我们收集了一个新的数据集,包含在不同照明条件和面部表情下多种主题的多视角序列。利用新数据集,我们引入了一种基于3D高斯原语的新可重新照明头像表示,通过参数头部模型和表情依赖的动力学模块进行动画处理。我们提出了一种新的混合神经光照方法,结合了神经漫反射BRDF和分析性镜面项。我们的方法从动态光场记录中分离出材料,并使我们的头像能够使用点光源和环境贴图进行全频重新照明。此外,我们的头像可以从单目视频轻松地进行动画处理和控制。我们在数据集上进行了广泛的实验,验证了我们的方法,在重新照明和再现方面始终显著优于现有最先进的方法。
Summary / 总结
BecomingLit is a method for creating relightable, high-resolution head avatars that can be rendered from novel viewpoints at interactive rates. It uses a new low-cost light stage capture setup and a novel dataset of multi-view sequences under varying conditions. The method introduces a relightable avatar representation using 3D Gaussian primitives and a hybrid neural shading approach. Experiments show that BecomingLit outperforms existing methods in relighting and reenactment.
BecomingLit 提出了一种方法,用于重建可在交互速率下从新颖视角渲染的高分辨率头部avatar,并能重新照明。该方法使用一种新的低成本光场捕捉装置和多视角序列数据集,该数据集在不同条件下捕捉了多种面部表情。它采用 3D 高斯模型表示avatar,使用参数化头部模型和表情依赖的动力学模块。提出了一种新的混合神经着色方法,结合了神经漫反射BRDF和分析性镜面项。该方法能够使用点光源和环境贴图进行全频率重新照明,并可以从单目视频轻松地进行动画处理。实验表明,该方法在重新照明和再现方面始终优于现有方法。
FNOPE: Simulation-based inference on function spaces with Fourier Neural Operators
Authors: Guy Moss, Leah Sophie Muhle, Reinhard Drews, Jakob H. Macke, Cornelius Schröder
First: 2025-05-28T16:46:56+00:00 · Latest: 2025-11-14T17:01:20+00:00
Abstract
Simulation-based inference (SBI) is an established approach for performing Bayesian inference on scientific simulators. SBI so far works best on low-dimensional parametric models. However, it is difficult to infer function-valued parameters, which frequently occur in disciplines that model spatiotemporal processes such as the climate and earth sciences. Here, we introduce an approach for efficient posterior estimation, using a Fourier Neural Operator (FNO) architecture with a flow matching objective. We show that our approach, FNOPE, can perform inference of function-valued parameters at a fraction of the simulation budget of state of the art methods. In addition, FNOPE supports posterior evaluation at arbitrary discretizations of the domain, as well as simultaneous estimation of vector-valued parameters. We demonstrate the effectiveness of our approach on several benchmark tasks and a challenging spatial inference task from glaciology. FNOPE extends the applicability of SBI methods to new scientific domains by enabling the inference of function-valued parameters.
中文标题/摘要
标题:FNOPE:基于Fourier神经算子的空间函数模拟推理
基于模拟的推理(SBI)是一种成熟的科学模拟器上进行贝叶斯推理的方法。迄今为止,SBI 在低维参数模型上表现最佳。然而,在涉及时空过程(如气候和地球科学)的学科中,函数值参数的推理非常困难。在这里,我们介绍了一种使用Fourier神经算子(FNO)架构和流匹配目标的有效后验估计方法。我们证明,我们的方法FNOPE可以在比现有方法更小的模拟预算下进行函数值参数的推理。此外,FNOPE支持在域的任意离散化下进行后验评估,并同时估计向量值参数。我们在几个基准任务和来自冰川学的具有挑战性的空间推理任务上展示了我们方法的有效性。FNOPE通过使SBI方法适用于新的科学领域,扩展了SBI方法的应用范围,使其能够推理函数值参数。
Summary / 总结
The paper targets the inference of function-valued parameters, which are common in disciplines that model spatiotemporal processes, such as the climate and earth sciences, and which simulation-based inference has so far handled poorly. FNOPE combines a Fourier Neural Operator (FNO) architecture with a flow matching objective for efficient posterior estimation. Experiments on several benchmark tasks and a spatial inference task from glaciology show that FNOPE infers function-valued parameters at a fraction of the simulation budget of state-of-the-art methods, while supporting posterior evaluation at arbitrary discretizations of the domain and simultaneous estimation of vector-valued parameters.
研究动机是改进基于模拟的推断(SBI)以处理时空过程中的函数值参数。主要方法是使用傅里叶神经算子(FNO)架构和流匹配目标。关键实验发现表明,FNOPE相比现有最佳方法更高效,支持在任意离散化下进行后验评估和同时估计向量值参数。该方法在多个基准任务和一个来自冰川学的挑战性空间推理任务中得到验证,扩展了SBI方法的应用范围到新的科学领域。
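For orientation, the flow matching objective mentioned in the FNOPE entry above is often written in the following linear-path form. This is the generic formulation, given here as an assumption; the abstract does not state FNOPE's exact loss, and the symbols below (parameter $\theta$, simulator output $x$, velocity network $v_\phi$, noise sample $\theta_0$) are chosen for illustration only.

$$
\theta_t = (1 - t)\,\theta_0 + t\,\theta_1, \qquad t \sim \mathcal{U}[0, 1], \quad \theta_0 \sim p_0, \quad (\theta_1, x) \sim p(\theta, x)
$$

$$
\mathcal{L}(\phi) = \mathbb{E}_{t,\,\theta_0,\,(\theta_1, x)} \left\| v_\phi(\theta_t, t, x) - (\theta_1 - \theta_0) \right\|^2
$$

Here $p_0$ is a base noise distribution and $v_\phi$ would be the network being trained, which in FNOPE's case is built on an FNO architecture. Posterior samples would then be obtained by integrating $\mathrm{d}\theta_t / \mathrm{d}t = v_\phi(\theta_t, t, x)$ from $t = 0$ to $t = 1$ for an observed $x$.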
ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
Authors: Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang
First: 2025-11-14T17:00:29+00:00 · Latest: 2025-11-14T17:00:29+00:00
Comments: 12 pages, 5 tables, 6 figures
Abstract
Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.
中文标题/摘要
标题:ImAgent:一种用于测试时可扩展图像生成的统一多模态代理框架
近期的文本到图像(T2I)模型在生成视觉真实且语义一致的图像方面取得了显著进展。然而,它们仍然存在随机性和与给定提示不一致的问题,特别是在文本描述模糊或不明确时更为明显。现有的方法,如提示重写、最佳N采样和自我完善,可以缓解这些问题,但通常需要额外的模块并独立运行,阻碍了测试时可扩展性的效率并增加了计算开销。在本文中,我们引入了ImAgent,这是一种无需训练的统一多模态代理,将推理、生成和自我评估整合到一个框架中,以实现高效的测试时可扩展性。在策略控制器的引导下,多个生成动作动态交互和自我组织,以提高图像保真度和语义对齐,而不依赖于外部模型。在图像生成和编辑任务上的大量实验表明,ImAgent在基线模型上始终表现出改进,并且在基线模型失败的情况下甚至超越了其他强基线,突显了统一多模态代理在测试时可扩展性下的自适应和高效图像生成的潜力。
Summary / 总结
The paper targets the randomness and prompt inconsistency of text-to-image models, which are most pronounced when textual descriptions are vague or underspecified. It introduces ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework, with a policy controller guiding how multiple generation actions interact. Experiments on image generation and editing tasks show that ImAgent consistently improves over the backbone model and even surpasses other strong baselines in cases where the backbone fails, demonstrating its potential for adaptive and efficient image generation under test-time scaling.
研究动机是解决文本到图像生成模型中存在的随机性和不一致性问题,尤其是在文本描述模糊时。主要方法是引入ImAgent,这是一种将推理、生成和自我评估整合到单一框架中的统一多模态代理框架,由策略控制器引导。关键实验结果表明,ImAgent在图像生成和编辑任务中始终优于基础模型,并且在其他强大基线模型失败的情况下超越它们,展示了统一多模态代理在测试时间缩放下进行自适应和高效图像生成的潜力。
Inferring response times of perceptual decisions with Poisson variational autoencoders
Authors: Hayden R. Johnson, Anastasia N. Krouglova, Hadi Vafaii, Jacob L. Yates, Pedro J. Gonçalves
Venue: NeurIPS 2025
First: 2025-11-14T16:58:04+00:00 · Latest: 2025-11-14T16:58:04+00:00
Comments: To appear at the NeurIPS 2025 Workshop on Data on the Mind and Brain
Abstract
Many properties of perceptual decision making are well-modeled by deep neural networks. However, such architectures typically treat decisions as instantaneous readouts, overlooking the temporal dynamics of the decision process. We present an image-computable model of perceptual decision making in which choices and response times arise from efficient sensory encoding and Bayesian decoding of neural spiking activity. We use a Poisson variational autoencoder to learn unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. A task-optimized decoder then continually infers an approximate posterior over actions conditioned on incoming spiking activity. Combining these components with an entropy-based stopping rule yields a principled and image-computable model of perceptual decisions capable of generating trial-by-trial patterns of choices and response times. Applied to MNIST digit classification, the model reproduces key empirical signatures of perceptual decision making, including stochastic variability, right-skewed response time distributions, logarithmic scaling of response times with the number of alternatives (Hick's law), and speed-accuracy trade-offs.
中文标题/摘要
标题:使用泊松变分自编码器推断知觉决策的反应时间
知觉决策的许多特性可以用深度神经网络很好地建模。然而,这类架构通常将决策视为即时输出,忽略了决策过程中的时间动态。我们提出了一种图像可计算的知觉决策模型,在该模型中,选择和反应时间源自高效的感官编码和贝叶斯解码神经放电活动。我们使用泊松变分自编码器在一组率编码神经元中无监督学习视觉刺激的表示,这些神经元被建模为独立的同质泊松过程。然后,一个任务优化的解码器不断根据传入的放电活动推断动作的近似后验。将这些组件与基于熵的停止规则结合,产生了一个原理上正确且图像可计算的知觉决策模型,能够生成每次试验的选择和反应时间模式。应用于MNIST数字分类,该模型再现了知觉决策的关键经验特征,包括随机变异性、右偏的反应时间分布、反应时间与选项数量的对数缩放(赫克定律)以及速度-准确性权衡。
Summary / 总结
The research aims to model the temporal dynamics of perceptual decision-making by treating decisions as a process rather than an instantaneous readout. The method involves using a Poisson variational autoencoder to learn representations of visual stimuli and a task-optimized decoder to infer actions based on incoming spiking activity. Key experimental findings include the model's ability to generate trial-by-trial patterns of choices and response times, reproducing empirical signatures such as stochastic variability, right-skewed response time distributions, and speed-accuracy trade-offs in MNIST digit classification.
研究旨在通过将决策视为一个过程而非瞬时读出,来建模感知决策的时间动态。方法是使用Poisson变分自编码器学习视觉刺激的表示,并使用任务优化的解码器根据传入的神经放电活动推断动作。关键实验发现包括模型能够生成每次试验的选择和反应时间模式,并在MNIST数字分类中重现了诸如随机变异性、右偏反应时间分布和速度-准确性权衡等实证特征。
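The decision process in the entry above (Poisson spiking, Bayesian decoding, entropy-based stopping) can be sketched in a few lines. This is a simplified stand-in and not the authors' model: the firing rates are supplied by hand instead of being learned by a Poisson variational autoencoder, and the posterior is computed exactly from known rates instead of by a task-optimized decoder. All names, rates, and thresholds are hypothetical.

```python
import math
import random


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def poisson_sample(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def decide(rates, true_class, dt=0.01, entropy_threshold=0.1,
           max_steps=1000, seed=0):
    """Accumulate spikes until the posterior entropy drops below a threshold.

    rates[c][i] is the (strictly positive) firing rate in Hz of neuron i
    when the stimulus belongs to class c. Returns (choice, response_time).
    """
    rng = random.Random(seed)
    n_classes = len(rates)
    log_post = [math.log(1.0 / n_classes)] * n_classes  # uniform prior
    for step in range(1, max_steps + 1):
        # Spikes are generated by the true class in a window of length dt.
        counts = [poisson_sample(r * dt, rng) for r in rates[true_class]]
        for c in range(n_classes):
            # Poisson log-likelihood, dropping terms shared by all classes.
            log_post[c] += sum(k * math.log(r * dt) - r * dt
                               for k, r in zip(counts, rates[c]))
        top = max(log_post)
        unnorm = [math.exp(lp - top) for lp in log_post]
        total = sum(unnorm)
        post = [u / total for u in unnorm]
        if entropy(post) < entropy_threshold:
            break  # confident enough: commit to a choice
    choice = max(range(n_classes), key=lambda c: post[c])
    return choice, step * dt


choice, response_time = decide([[20.0, 5.0], [5.0, 20.0]], true_class=0)
print(choice, response_time)
```

Running `decide` over many seeds yields a distribution of response times, and lowering `entropy_threshold` trades speed for accuracy, which is the kind of trial-by-trial behavior the abstract describes.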
Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models
Authors: Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen
Venue: AAAI 2026
First: 2025-06-12T12:19:28+00:00 · Latest: 2025-11-14T16:56:05+00:00
Comments: AAAI 2026
Abstract
Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks.
中文标题/摘要
标题:对称流匹配:基于评分生成模型的统一图像生成、分割和分类
流匹配已成为学习分布之间连续变换的强大框架,使高保真生成建模成为可能。本文引入了对称流匹配(SymmFlow),这是一种新的形式,将语义分割、分类和图像生成统一在一个模型中。通过对称学习目标,SymmFlow 联合建模前向和反向变换,确保双向一致性,同时保留足够的熵以保持生成多样性。引入了一种新的训练目标,以明确保留流中的语义信息,实现高效采样同时保留语义结构,允许一步完成分割和分类而无需迭代细化。与之前需要严格一对一映射掩码和图像的方法不同,SymmFlow 能够泛化到灵活的条件设置,支持像素级和图像级类标签。在各种基准上的实验结果表明,SymmFlow 在语义图像合成上达到了最先进的性能,在CelebAMask-HQ 上获得了 FID 分数 11.9,在COCO-Stuff 上获得了 7.0,仅需 25 次推理步骤。此外,它在语义分割上也取得了竞争力的结果,并展示了在分类任务中的潜力。
Summary / 总结
Symmetrical Flow Matching (SymmFlow) unifies semantic segmentation, classification, and image generation in a single model by jointly learning forward and reverse transformations, ensuring bi-directional consistency and preserving generative diversity. It introduces a new training objective that explicitly retains semantic information across flows, enabling one-step segmentation and classification without iterative refinement. SymmFlow achieves state-of-the-art performance on semantic image synthesis, with FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff using only 25 inference steps, delivers competitive results on semantic segmentation, and shows promising capabilities in classification tasks.
Symmetrical Flow Matching (SymmFlow) 将语义分割、分类和图像生成统一在一个模型中,通过联合学习前向和反向变换来确保双向一致性并保留足够的生成多样性。该方法引入了一种新的训练目标,以跨流保留语义信息,实现高效采样和一步分割/分类。实验结果显示,SymmFlow 在语义图像合成上的 FID 分数分别为 CelebAMask-HQ 上的 11.9 和 COCO-Stuff 上的 7.0,同时在语义分割和分类任务上取得了竞争力的结果。
Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective
Authors: Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le
Venue: AAAI 2026
First: 2025-11-14T16:56:01+00:00 · Latest: 2025-11-14T16:56:01+00:00
Comments: Accepted at AAAI 2026
Abstract
As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM's baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
中文标题/摘要
标题:从物体中心视角重新思考机器人操作中记忆状态的演变
随着嵌入式代理在日益复杂的环境中操作,感知、跟踪和随时间推移对个体物体实例进行推理的能力变得至关重要,尤其是在需要与视觉上相似的物体进行顺序交互的任务中。在这些非马尔可夫环境中,关键决策线索往往隐藏在物体特定的历史记录中,而不是当前场景中。如果没有持续的记忆(之前交互过什么,它在哪里,或者它如何变化),视知觉运动策略可能会失败,重复过去的动作,或者忽略已完成的动作。为了揭示这一挑战,我们引入了LIBERO-Mem,这是一种非马尔可夫任务套件,用于在物体级别部分可观测性下对机器人操作进行压力测试。它结合了短期和长期的物体跟踪以及时间序列子目标,要求超越当前帧的推理。然而,视觉-语言-动作(VLA)模型在这些环境中往往难以应对,即使任务仅跨越几百帧,标记缩放也会迅速变得不可行。我们提出了一种基于槽的VLA框架Embodied-SlotSSM,该框架旨在实现时间上的可扩展性。它保持时空一致的槽身份,并通过两种机制利用它们:(1)槽状态空间建模以重构短期历史,(2)关系编码器将输入标记与动作解码对齐。这些组件共同使基于时间的、上下文相关的动作预测成为可能。实验表明,Embodied-SlotSSM在LIBERO-Mem和通用任务上的基线性能,提供了一种在物体中心的机器人策略中进行非马尔可夫推理的可扩展解决方案。
Summary / 总结
This paper addresses robotic manipulation in non-Markovian settings, where key decision cues lie in object-specific histories rather than in the current scene. It introduces LIBERO-Mem, a task suite that stress-tests robotic manipulation under object-level partial observability, and proposes Embodied-SlotSSM, a slot-centric vision-language-action framework that maintains spatio-temporally consistent slot identities and uses slot-state-space modeling and a relational encoder for temporally scalable reasoning. Experiments report Embodied-SlotSSM's baseline performance on both LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
论文针对复杂环境中物体交互记忆持续性的重要性,引入了LIBERO-Mem任务套件来测试部分可观测条件下的机器人操作能力,并提出了一种基于槽的视觉-语言-动作框架Embodied-SlotSSM,该框架通过保持时空一致的槽身份来实现时间上下文感知的动作预测。实验表明,Embodied-SlotSSM在LIBERO-Mem和通用任务上的表现良好,为物体中心的机器人策略提供了一种可扩展的非马尔可夫推理解决方案。
Context-aware Adaptive Visualizations for Critical Decision Making
Authors: Angela Lopez-Cardona, Mireia Masias Bruns, Nuwan T. Attygalle, Sebastian Idesis, Matteo Salvatori, Konstantinos Raftopoulos, Konstantinos Oikonomou, Saravanakumar Duraisamy, Parvin Emami, Nacera Latreche, Alaa Eddine Anis Sahraoui, Michalis Vakallelis, Jean Vanderdonckt, Ioannis Arapakis, Luis A. Leiva
Venue: ISBN 978-1-64368-631-8 (2025)
First: 2025-11-14T16:53:15+00:00 · Latest: 2025-11-14T16:53:15+00:00
Abstract
Effective decision-making often relies on timely insights from complex visual data. While Information Visualization (InfoVis) dashboards can support this process, they rarely adapt to users' cognitive state, and less so in real time. We present Symbiotik, an intelligent, context-aware adaptive visualization system that leverages neurophysiological signals to estimate mental workload (MWL) and dynamically adapt visual dashboards using reinforcement learning (RL). Through a user study with 120 participants and three visualization types, we demonstrate that our approach improves task performance and engagement. Symbiotik offers a scalable, real-time adaptation architecture, and a validated methodology for neuroadaptive user interfaces.
中文标题/摘要
标题:基于上下文的自适应可视化技术在关键决策中的应用
有效的决策往往依赖于复杂视觉数据的及时洞察。虽然信息可视化(InfoVis)仪表板可以支持这一过程,但它们很少适应用户的认知状态,更不用说实时适应了。我们提出了Symbiotik,一种智能的、基于上下文的自适应可视化系统,该系统利用神经生理信号估计认知负荷(MWL),并使用强化学习(RL)动态调整可视化仪表板。通过一项包含120名参与者的用户研究和三种可视化类型,我们证明了我们的方法可以提高任务性能和参与度。Symbiotik提供了一种可扩展的、实时适应的架构,以及一种验证的神经适应用户界面方法。
Summary / 总结
The research aims to enhance decision-making by adapting visual dashboards in real time based on users' cognitive state. Symbiotik uses neurophysiological signals to estimate mental workload and applies reinforcement learning to dynamically adjust visualizations. A user study with 120 participants and three visualization types shows that this approach improves task performance and engagement.
研究旨在通过实时适应用户认知状态来提升决策效果。Symbiotik利用神经生理信号估计认知负荷,并使用强化学习动态调整可视化界面。研究涉及120名参与者和三种可视化类型,结果显示任务表现和参与度有所提升。
Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery
Authors: Yijie Kang, Xinliang Wang, Zhenyu Wu, Yifeng Shi, Hailong Zhu
First: 2025-11-14T16:42:03+00:00 · Latest: 2025-11-14T16:42:03+00:00
Abstract
Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.
中文标题/摘要
标题:Sat2RealCity:基于几何感知和外观可控的从卫星影像生成3D城市
生成模型的最新进展极大地提升了3D城市生成的能力,使其在数字孪生、虚拟城市和大规模模拟等领域得到了广泛应用。然而,现有方法面临两个关键挑战:(1) 需要大规模的3D城市资产进行监督训练,这些资产获取困难且成本高昂;(2) 依赖于语义或高度图,这些图仅用于虚拟世界中的建筑生成,与现实世界的外观缺乏联系,限制了生成城市的现实感和泛化能力。为了解决这些限制,我们提出了Sat2RealCity,这是一种基于真实卫星影像的几何感知和外观可控的3D城市生成框架。与之前的基于城市级别的生成方法不同,Sat2RealCity以单个建筑实体为基础进行生成,能够利用丰富的先验知识和3D物体生成的预训练知识,大幅减少对大规模3D城市资产的依赖。具体而言,(1) 我们引入了基于OSM的空间先验策略,实现了从空间拓扑到建筑实例的可解释几何生成;(2) 我们设计了外观引导的可控建模机制,以实现细粒度的外观真实感和风格控制;(3) 我们构建了基于MLLM的语义引导生成流水线,实现了语义解释和几何重建的连接。大量的定量和定性实验表明,Sat2RealCity在结构一致性和外观真实感方面显著优于现有基线,为现实世界对齐的3D城市内容创作奠定了坚实基础。代码将在不久后发布。
Summary / 总结
Sat2RealCity is a framework for generating 3D urban environments from satellite imagery, addressing the challenges of requiring large-scale 3D city assets and reliance on semantic or height maps. It introduces OSM-based spatial priors for interpretable geometric generation, an appearance-guided controllable modeling mechanism for fine-grained realism, and an MLLM-powered semantic-guided generation pipeline. Experiments show that Sat2RealCity outperforms existing methods in structural consistency and appearance realism, providing a robust foundation for real-world 3D urban content creation.
Sat2RealCity 是一种从卫星影像生成 3D 城市的框架,解决了需要大量 3D 城市资产和依赖语义或高度图的挑战。它引入了基于 OSM 的空间先验策略以实现可解释的几何生成,设计了细粒度真实感和风格控制的外观引导可控建模机制,并构建了基于 MLLM 的语义引导生成管道。实验表明,Sat2RealCity 在结构一致性和外观真实感方面显著优于现有方法,为现实世界对齐的 3D 城市内容创建奠定了坚实基础。
Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
Authors: Davide Napolitano, Luca Cagliero, Fabrizio Battiloro
First: 2025-11-14T16:41:10+00:00 · Latest: 2025-11-14T16:41:10+00:00
Abstract
The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) The VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
中文标题/摘要
标题:视觉大型语言模型在丰富视觉文档上对无法回答问题的鲁棒性基准测试
视觉大型语言模型(VLLMs)的发展已经彻底改变了对丰富视觉文档(VRDs)的自动理解,这些文档包含文本和视觉元素。尽管VLLMs在多页VRDs的视觉问答(VQA)上表现出色,但它们检测无法回答的问题的能力仍然是一个开放的研究问题。我们的研究探讨了VLLMs对合理但无法回答的问题的鲁棒性,即看似合理但实际上由于相关概念或合理问题表述之间的交换导致的细微篡改而无法回答的问题。篡改通过用其他类型但属于不同文档元素和不同布局位置或相关文档不同页面的自然语言实体替换原始自然语言实体来生成。为此,我们提出了VRD-UQA(丰富视觉文档无法回答问题问答),这是一个用于评估VLLMs在多个维度上对合理但无法回答的问题的鲁棒性的基准。它自动修改现有VQA数据集中的多页VRDs的问题,使用VLLM作为裁判验证其无法回答性,然后彻底评估VLLMs的性能。在12个模型上进行的实验分析了:(1)VLLMs在页面和文档级别检测无法回答问题的准确性;(2)不同类型的篡改(NLP实体、文档元素、布局)的影响;(3)基于上下文学习的不同知识注入策略(OCR、多页选择或无法回答的可能性)的有效性。我们的研究结果揭示了VLLMs的局限性,并表明VRD-UQA可以作为开发鲁棒文档VQA系统的评估框架。
Summary / 总结
This research evaluates the resilience of Visual Large Language Models (VLLMs) to unanswerable questions on Visually Rich Documents (VRDs). The study introduces VRD-UQA, a benchmark that generates plausible yet unanswerable questions by swapping natural language entities with others of the same type drawn from different document elements, layout positions, or pages, and verifies unanswerability with a VLLM-as-a-judge approach. Experiments on 12 models measure detection accuracy at the page and document levels, the effect of each corruption type (NLP entity, document element, layout), and the effectiveness of knowledge injection strategies based on in-context learning. The findings reveal limitations of current VLLMs in detecting unanswerable questions.
研究旨在评估视觉大型语言模型(VLLMs)在处理视觉丰富文档(VRDs)上的不可回答问题时的鲁棒性。研究引入了VRD-UQA基准,通过修改现有VQA数据集生成可能但不可回答的问题。实验表明,VLLMs在页面和文档级别检测不可回答问题的性能,不同类型的篡改对性能的影响,以及不同知识注入策略的有效性。研究结果揭示了VLLMs的局限性,并建议VRD-UQA作为改进文档VQA系统的评估框架。
Adaptive Intrusion Detection for Evolving RPL IoT Attacks Using Incremental Learning
Authors: Sumeyye Bas, Kiymet Kaya, Elif Ak, Sule Gunduz Oguducu
First: 2025-11-14T16:35:48+00:00 · Latest: 2025-11-14T16:35:48+00:00
Abstract
The routing protocol for low-power and lossy networks (RPL) has become the de facto routing standard for resource-constrained IoT systems, but its lightweight design exposes critical vulnerabilities to a wide range of routing-layer attacks such as hello flood, decreased rank, and version number manipulation. Traditional countermeasures, including protocol-level modifications and machine learning classifiers, can achieve high accuracy against known threats, yet they fail when confronted with novel or zero-day attacks unless fully retrained, an approach that is impractical for dynamic IoT environments. In this paper, we investigate incremental learning as a practical and adaptive strategy for intrusion detection in RPL-based networks. We systematically evaluate five model families, including ensemble models and deep learning models. Our analysis highlights that incremental learning not only restores detection performance on new attack classes but also mitigates catastrophic forgetting of previously learned threats, all while reducing training time compared to full retraining. By combining five diverse models with attack-specific analysis, forgetting behavior, and time efficiency, this study provides systematic evidence that incremental learning offers a scalable pathway to maintain resilient intrusion detection in evolving RPL-based IoT networks.
中文标题/摘要
标题:基于增量学习的RPL物联网攻击自适应入侵检测
低功耗和丢包网络路由协议(RPL)已成为资源受限物联网系统中的事实上的路由标准,但其轻量级设计使其在路由层面临广泛的攻击,如hello洪泛、降低排名和版本号操纵等关键漏洞。传统的应对措施,包括协议级别的修改和机器学习分类器,可以对已知威胁实现高精度,但在面对新型或零日攻击时却无法有效,除非重新训练,这在动态的物联网环境中是不切实际的。本文研究增量学习作为一种实用且自适应的入侵检测策略在基于RPL的网络中的应用。我们系统地评估了五类模型,包括集成模型和深度学习模型。我们的分析表明,增量学习不仅恢复了对新攻击类别的检测性能,还减轻了对之前学习威胁的灾难性遗忘,同时相比完全重新训练,减少了训练时间。通过结合五种不同的模型、针对特定攻击的分析、遗忘行为和时间效率,本研究提供了增量学习在演化的RPL物联网网络中提供可扩展的、保持弹性入侵检测的系统性证据。
Summary / 总结
This paper investigates the use of incremental learning for adaptive intrusion detection in RPL-based IoT networks, addressing the vulnerability to routing-layer attacks. Five model families, including ensemble and deep learning models, are evaluated. The study demonstrates that incremental learning can restore detection performance on new attack classes, mitigate catastrophic forgetting, and reduce training time compared to full retraining, making it a practical solution for dynamic IoT environments.
本文探讨了增量学习在RPL基于的物联网网络中适应性入侵检测的应用,以应对路由层攻击的脆弱性。评估了五种模型家族,包括集成模型和深度学习模型。研究表明,增量学习不仅可以恢复对新攻击类别的检测性能,还能减轻灾难性遗忘,同时相比完全重新训练减少训练时间,使其成为动态物联网环境中的实用解决方案。
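As a rough illustration of the incremental-learning idea in the RPL intrusion-detection entry above, the sketch below updates a nearest-centroid classifier one sample at a time, so a new attack class can be added without retraining on earlier data. This is not one of the five model families evaluated in the paper; the class name `IncrementalCentroidClassifier`, the feature values, and the labels are hypothetical.

```python
from collections import defaultdict


class IncrementalCentroidClassifier:
    """Nearest-centroid classifier whose per-class means are updated online."""

    def __init__(self):
        self.sums = {}                   # class label -> running feature sums
        self.counts = defaultdict(int)   # class label -> number of samples seen

    def partial_fit(self, features, label):
        # Update only the statistics of the given class; other classes are untouched.
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
        self.sums[label] = [s + x for s, x in zip(self.sums[label], features)]
        self.counts[label] += 1

    def predict(self, features):
        def sq_dist(label):
            n = self.counts[label]
            return sum((x - s / n) ** 2 for x, s in zip(features, self.sums[label]))
        return min(self.sums, key=sq_dist)


clf = IncrementalCentroidClassifier()
for x in ([0.1, 0.2], [0.0, 0.1]):          # hypothetical benign traffic features
    clf.partial_fit(x, "benign")
for x in ([5.0, 5.1], [4.9, 5.3]):          # hypothetical hello-flood features
    clf.partial_fit(x, "hello_flood")

# A new attack class arrives later; only its own statistics are added.
for x in ([-4.0, 6.0], [-4.2, 5.8]):
    clf.partial_fit(x, "version_number")
```

Because each class keeps its own running mean, this toy model cannot forget earlier classes. Deep and ensemble models share parameters across classes, which is why the paper measures forgetting explicitly.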
MoCap2Radar: A Spatiotemporal Transformer for Synthesizing Micro-Doppler Radar Signatures from Motion Capture
Authors: Kevin Chen, Kenneth W. Parker, Anish Arora
First: 2025-11-14T16:35:14+00:00 · Latest: 2025-11-14T16:35:14+00:00
Abstract
We present a pure machine learning process for synthesizing radar spectrograms from Motion-Capture (MoCap) data. We formulate MoCap-to-spectrogram translation as a windowed sequence-to-sequence task using a transformer-based model that jointly captures spatial relations among MoCap markers and temporal dynamics across frames. Real-world experiments show that the proposed approach produces visually and quantitatively plausible Doppler radar spectrograms and achieves good generalizability. Ablation experiments show that the learned model includes both the ability to convert multi-part motion into Doppler signatures and an understanding of the spatial relations between different parts of the human body.
The result is an interesting example of using transformers for time-series signal processing. It is especially applicable to edge computing and Internet of Things (IoT) radars. It also suggests the ability to augment scarce radar datasets using more abundant MoCap data for training higher-level applications. Finally, it requires far less computation than physics-based methods for generating radar data.
中文标题/摘要
标题:MoCap2Radar:基于时空变换器的运动捕捉数据合成微多普勒雷达签名
我们提出了一种纯机器学习过程,用于从运动捕捉(MoCap)数据合成雷达频谱图。我们将MoCap到频谱图的转换形式化为一个带有窗口的序列到序列任务,使用基于变换器的模型同时捕捉MoCap标记之间的空间关系以及帧间的时序动态。实验证明,所提出的方法生成了视觉上和定量上合理的多普勒雷达频谱图,并具有良好的泛化能力。消融实验表明,学习到的模型既具备将多部分运动转换为多普勒签名的能力,也理解了人体不同部分之间的空间关系。结果展示了使用变换器进行时间序列信号处理的有趣示例,特别适用于边缘计算和物联网(IoT)雷达。它还表明,可以使用更丰富的MoCap数据来扩充稀缺的雷达数据集,以训练高级应用。最后,与基于物理的方法相比,它生成雷达数据所需的计算量要少得多。
Towards Generalizable AI-Generated Image Detection via Image-Adaptive Prompt Learning
Authors: Yiheng Li, Zichang Tan, Zhen Lei, Xu Zhou, Yang Yang
First: 2025-08-03T05:41:24+00:00 · Latest: 2025-11-14T16:33:09+00:00
Comments: under review, codes: https://github.com/liyih/IAPL
Abstract
In AI-generated image detection, current cutting-edge methods typically adapt pre-trained foundation models through partial-parameter fine-tuning. However, these approaches often struggle to generalize to forgeries from unseen generators, as the fine-tuned models capture only limited patterns from training data and fail to reflect the evolving traits of new ones. To overcome this limitation, we propose Image-Adaptive Prompt Learning (IAPL), a novel paradigm that dynamically adjusts the prompts fed into the encoder according to each testing image, rather than fixing them after training. This design significantly enhances robustness and adaptability to diverse forged images. The dynamic prompts integrate conditional information with test-time adaptive tokens through a lightweight learnable scaling factor. The conditional information is produced by a Conditional Information Learner, which leverages CNN-based feature extractors to model both forgery-specific and general conditions. The test-time adaptive tokens are optimized during inference on a single sample by enforcing prediction consistency across multiple views, ensuring that the parameters align with the current image. For the final decision, the optimal input with the highest prediction confidence is selected. Extensive experiments show that IAPL achieves state-of-the-art performance, with mean accuracies of 95.61% and 96.7% on the widely used UniversalFakeDetect and GenImage datasets, respectively. Codes and weights will be released on https://github.com/liyih/IAPL.
中文标题/摘要
标题:通过图像自适应提示学习实现通用可迁移的AI生成图像检测
在AI生成图像检测中,当前最先进的方法通常通过部分参数微调预训练的基础模型。然而,这些方法往往难以将伪造图像从未见过的生成器中泛化出来,因为微调模型只能捕捉到有限的训练数据模式,而无法反映新生成器的演变特征。为克服这一局限,我们提出了一种新的图像自适应提示学习(IAPL)范式,该范式根据每张测试图像动态调整输入编码器的提示,而不是在训练后固定提示。这种设计显著增强了对各种伪造图像的鲁棒性和适应性。动态提示通过轻量级可学习缩放因子将条件信息与测试时自适应令牌集成在一起。条件信息由条件信息学习器生成,该学习器利用基于CNN的特征提取器来建模伪造特定和一般条件。测试时自适应令牌在单个样本的推理过程中通过确保预测在多个视图中的一致性进行优化,以确保参数与当前图像对齐。最终决策选择具有最高预测置信度的最佳输入。大量实验表明,IAPL在广泛使用的UniversalFakeDetect和GenImage数据集上分别实现了95.61%和96.7%的平均准确率,达到了最先进的性能。代码和权重将在https://github.com/liyih/IAPL上发布。
Summary / 总结
The research aims to improve the generalizability of AI-generated image detection by addressing the limitations of current methods that rely on partial-parameter fine-tuning. The proposed Image-Adaptive Prompt Learning (IAPL) dynamically adjusts prompts based on each testing image, enhancing robustness and adaptability. Experiments show that IAPL outperforms existing methods with mean accuracies of 95.61% and 96.7% on UniversalFakeDetect and GenImage datasets, respectively.
论文提出了一种名为Image-Adaptive Prompt Learning (IAPL)的方法,该方法根据每个测试图像动态调整提示,以解决AI生成图像的检测难题。该方法通过在UniversalFakeDetect和GenImage数据集上分别达到95.61%和96.7%的准确率,超越了现有方法。关键创新在于使用Conditional Information Learner和测试时自适应令牌来增强对多样伪造图像的鲁棒性和适应性。
Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification
Authors: Qinghao Gao, Jianhai Qu, Yunsong Li, Weiqiang Dong
First: 2025-11-14T16:31:37+00:00 · Latest: 2025-11-14T16:31:37+00:00
Comments: 11 pages, 4 figures
Abstract
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.
中文标题/摘要
标题:重新思考遥感模态缺失分类中的高效专家混合模型
遥感多模态分类常常因环境干扰、传感器故障或大气效应导致模态缺失,严重影响分类性能。现有两阶段适应方法计算成本高,并假设训练时具有完整多模态数据,限制了其在现实世界不完整情况下的泛化能力。为克服这些问题,我们提出了一种模态缺失感知的Loras混合模型(MaMOL)框架,将模态缺失重新表述为多任务学习问题。MaMOL引入了双重路由机制:一个任务导向的动态路由器,能够根据不同缺失模式自适应激活专家;一个模态特定共享的静态路由器,保持跨模态知识的稳定共享。与先前方法为每种缺失配置训练独立网络不同,MaMOL通过轻量级专家更新和共享专家重用实现了参数高效的适应。在多个遥感基准测试上的实验表明,MaMOL在不同缺失率下具有优越的鲁棒性和泛化能力,且计算开销最小。此外,自然图像数据集上的迁移实验验证了其可扩展性和跨域适用性,突显了MaMOL作为不完整多模态学习的通用高效解决方案的优势。
Summary / 总结
The paper addresses the challenge of multimodal classification in remote sensing where missing modalities can severely degrade performance. It proposes a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL uses a dual-routing mechanism to adaptively activate experts for different missing patterns and maintain stable cross-modal knowledge sharing. Experiments show that MaMOL achieves superior robustness and generalization under varying missing rates with minimal computational overhead, and transfer experiments on natural image datasets validate its scalability and cross-domain applicability.
论文针对遥感中由于环境干扰、传感器故障或大气效应导致的模态缺失问题,提出了一个名为MaMOL的框架,将模态缺失重新定义为一个多任务学习问题。MaMOL使用双重路由机制,根据不同缺失模式适配性激活专家,并保持跨模态知识的稳定共享。实验表明,MaMOL在不同缺失率下表现出更强的鲁棒性和泛化能力,并且计算开销较小。
VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
Authors: Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
First: 2025-11-14T16:20:07+00:00 · Latest: 2025-11-14T16:20:07+00:00
Abstract
We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: https://www.github.com/MIC-DKFZ/VoxTell
中文标题/摘要
标题:VoxTell:可文本提示的通用3D医学图像分割
我们介绍了VoxTell,一种用于文本提示的体积医学图像分割的视觉语言模型。它将从单个单词到完整的临床句子的自由形式描述映射到3D掩码。VoxTell基于超过62,000个CT、MRI和PET体积,涵盖1,000多个解剖和病理类,通过解码器层的多阶段视觉语言融合,在多个尺度上对齐文本和视觉特征。它在未见过的数据集上实现了跨模态的零样本最佳性能,对熟悉的概念表现出色,同时能够泛化到相关的未见过的类别。大量实验进一步证明了其跨模态的强转移能力、对语言变化和临床语言的鲁棒性,以及对真实世界文本的准确实例特定分割。代码可在:https://www.github.com/MIC-DKFZ/VoxTell 获取
Summary / 总结
VoxTell is a vision-language model designed for text-prompted volumetric medical image segmentation. It is trained on a large dataset of CT, MRI, and PET volumes and uses multi-stage vision-language fusion to align textual and visual features. VoxTell demonstrates state-of-the-art zero-shot performance across different imaging modalities and shows strong cross-modality transfer, robustness to linguistic variations, and accurate instance-specific segmentation from real-world text descriptions.
VoxTell 是一种用于文本提示的体积医学图像分割的视觉语言模型,它基于 CT、MRI 和 PET 体积的大规模数据集进行训练,并使用多阶段的视觉语言融合来对齐文本和视觉特征。VoxTell 在未见过的数据集上实现了跨模态的最先进的零样本性能,并展示了强大的跨模态转移能力和对语言变异和临床语言的鲁棒性。它还能够从实际文本描述中提供准确的实例特定分割。
Efficient Bayer-Domain Video Computer Vision with Fast Motion Estimation and Learned Perception Residual
Authors: Haichao Wang, Jiangtao Wen, Yuxing Han
First: 2025-08-08T03:55:19+00:00 · Latest: 2025-11-14T16:16:52+00:00
Abstract
Video computer vision systems face substantial computational burdens arising from two fundamental challenges: eliminating unnecessary processing and reducing temporal redundancy in back-end inference while maintaining accuracy with minimal extra computation. To address these issues, we propose an efficient video computer vision framework that jointly optimizes both the front end and back end of the pipeline. On the front end, we remove the traditional image signal processor (ISP) and feed Bayer raw measurements directly into Bayer-domain vision models, avoiding costly human-oriented ISP operations. On the back end, we introduce a fast and highly parallel motion estimation algorithm that extracts inter-frame temporal correspondence to avoid redundant computation. To mitigate artifacts caused by motion inaccuracies, we further employ lightweight perception residual networks that directly learn perception-level residuals and refine the propagated features. Experiments across multiple models and tasks demonstrate that our system achieves substantial acceleration with only minor performance degradation.
中文标题/摘要
标题:高效 Bayer 域视频计算机视觉:快速运动估计与学习感知残差
视频计算机视觉系统面临巨大的计算负担,源于两个基本挑战:消除不必要的处理和在后端推理中减少时间冗余,同时保持准确性并减少额外计算。为了解决这些问题,我们提出了一种高效的视频计算机视觉框架,同时优化管道的前端和后端。在前端,我们移除了传统的图像信号处理器(ISP),直接将 Bayer 原始测量值输入到 Bayer 域视觉模型中,避免了昂贵的人工导向的 ISP 操作。在后端,我们引入了一种快速且高度并行的运动估计算法,提取帧间的时间对应关系,以避免冗余计算。为了减轻由运动不准确引起的伪影,我们进一步采用了轻量级的感知残差网络,直接学习感知级残差并细化传播的特征。在多个模型和任务上的实验表明,我们的系统在仅轻微性能下降的情况下实现了显著加速。
Summary / 总结
The research aims to address the computational challenges in video computer vision by optimizing both the front and back ends of the pipeline. The method involves removing the traditional ISP and feeding Bayer raw measurements directly into vision models, and introducing a fast motion estimation algorithm to reduce redundant computation. The system also uses lightweight perception residual networks to refine propagated features and mitigate motion inaccuracies. Experimental results show that the proposed system achieves significant acceleration with minimal performance loss.
研究旨在通过联合优化管道的前端和后端来解决视频计算机视觉中的计算挑战。方法包括移除传统的ISP并直接将Bayer原始测量值输入视觉模型,以及引入快速运动估计算法以减少冗余计算。系统还使用轻量级的感知残差网络来细化传播特征并减轻运动伪影。实验表明,所提出系统可以实现显著加速,并且性能损失很小。
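The Bayer-domain entry above relies on motion estimation to find inter-frame correspondence. The abstract does not describe the algorithm, so the sketch below shows classic exhaustive block matching with a sum-of-absolute-differences cost, purely as a generic illustration. The function names, block size, and search radius are assumptions, and the frames are toy data.

```python
def sad(frame_a, frame_b, ay, ax, by, bx, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + i][ax + j] - frame_b[by + i][bx + j])
        for i in range(size)
        for j in range(size)
    )


def block_motion(prev, curr, y, x, size=2, radius=2):
    """Return the (dy, dx) offset into `prev` that best matches the block of `curr` at (y, x)."""
    height, width = len(prev), len(prev[0])
    best_cost, best_offset = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            py, px = y + dy, x + dx
            # Skip candidates that fall outside the previous frame.
            if py < 0 or px < 0 or py + size > height or px + size > width:
                continue
            cost = sad(curr, prev, y, x, py, px, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_offset = cost, (dy, dx)
    return best_offset


# Toy 6x6 frames: a bright 2x2 patch moves one pixel to the right.
prev = [[0] * 6 for _ in range(6)]
curr = [[0] * 6 for _ in range(6)]
for i in (2, 3):
    for j in (1, 2):
        prev[i][j] = 9
    for j in (2, 3):
        curr[i][j] = 9

offset = block_motion(prev, curr, y=2, x=2)
```

Each block's search is independent of the others, which is what makes this family of methods easy to parallelize. Features computed for the previous frame could then be reused at the matched location instead of being recomputed.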
DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference
Authors: Farhana Amin, Sabiha Afroz, Kanchon Gharami, Mona Moghadampanah, Dimitrios S. Nikolopoulos
First: 2025-11-14T16:14:58+00:00 · Latest: 2025-11-14T16:14:58+00:00
Abstract
Diffusion models produce high-quality images, but inference is costly due to many denoising steps and heavy matrix operations. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments, DiffPro achieves up to 6.25x model compression, fifty percent fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single budgeted deployable plan for real-time energy-aware diffusion inference.
中文标题/摘要
标题:DiffPro: 联合时间步和层级精度优化以提高高效的扩散推断
扩散模型生成高质量图像,但由于去噪步骤众多和繁重的矩阵运算,推断成本高昂。我们提出了DiffPro,一种后训练、硬件忠实的框架,与部署中使用的精确整数内核兼容,并在扩散变换器(DiTs)中联合调优时间步和每层精度,以减少延迟和内存消耗,无需任何训练。DiffPro 结合了三个部分:流形感知灵敏度度量以分配权重位数、动态激活量化以在时间步之间稳定激活值,以及由教师-学生漂移引导的预算时间步选择器。实验表明,DiffPro 可实现高达6.25倍的模型压缩、50%更少的时间步以及2.8倍更快的推断速度,同时Delta FID <= 10,证明了其实用的效率提升。DiffPro 将步骤减少和精度规划统一为一个预算可部署的计划,以实现实时能源感知的扩散推断。
Summary / 总结
DiffPro is a post-training framework that jointly optimizes timesteps and per-layer precision in Diffusion Transformers to reduce inference latency and memory usage without retraining. It uses a manifold-aware sensitivity metric, dynamic activation quantization, and a budgeted timestep selector. Experiments show up to 6.25x model compression, 50% fewer timesteps, and 2.8x faster inference with a Delta FID of <=10 on standard benchmarks, demonstrating practical efficiency gains.
DiffPro 是一个后训练框架,联合优化 Diffusion Transformers 中的时间步和层精度,以减少推理延迟和内存使用。它使用流形感知的灵敏度度量进行权重位分配,动态激活量化以稳定激活,以及一个基于教师-学生漂移的预算时间步选择器。实验显示最高可达 6.25 倍的模型压缩,时间步减少 50%,以及 2.8 倍的更快推理,Delta FID <=10,证明了在标准基准上的实际效率提升,适用于实时节能扩散推理。
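To make the "dynamic activation quantization" component of the DiffPro entry above more concrete, here is a minimal sketch of symmetric per-tensor quantization in which the scale is recomputed from the activations at hand. It is a textbook scheme, not DiffPro's actual kernel; the function names and the activation values are hypothetical.

```python
def quantize_dynamic(values, bits=8):
    """Symmetric per-tensor quantization with a scale derived from the current values."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


# Activations at two hypothetical timesteps with very different ranges.
early = [-3.2, 0.5, 2.9, 1.1]
late = [-0.04, 0.01, 0.03, -0.02]

errors = []
for acts in (early, late):
    codes, scale = quantize_dynamic(acts)
    restored = dequantize(codes, scale)
    max_err = max(abs(a - r) for a, r in zip(acts, restored))
    errors.append((max_err, scale))
```

Since the scale follows the current range, the rounding error stays within half a quantization step at both timesteps. A single fixed scale tuned for the early range would leave the small late activations with only a handful of integer levels.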
Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Authors: Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
First: 2025-11-13T11:50:54+00:00 · Latest: 2025-11-14T16:14:03+00:00
Abstract
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but it has also exposed new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embed harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which implies unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits a 66% attack success rate on the SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.
中文标题/摘要
标题:语音-音频合成攻击对多模态LLM的影响及其通过SALMONN-Guard的缓解
大型语言模型(LLM)在理解和处理语音及非语音音频方面取得了进展,但复杂的音频输入暴露了现有安全措施无法充分应对的新安全风险。我们引入SACRED-Bench(语音-音频合成用于RED团队训练)来评估LLM在复杂音频攻击下的鲁棒性。不同于依赖噪声优化或白盒访问的现有扰动方法,SACRED-Bench 利用语音-音频合成机制。SACRED-Bench 采用三种机制:(a)语音重叠和多说话人对话,将有害提示嵌入在或与无害语音并列;(b)语音-音频混合,通过非语音音频与无害语音或音频一起暗示不安全意图;(c)多样化的口头指令格式(开放式问答、是/否),以规避仅文本过滤器。实验表明,即使是目前最先进的专有LLM Gemini 2.5 Pro,在SACRED-Bench 测试集中仍表现出66%的攻击成功率,暴露了跨模态、语音-音频合成攻击下的漏洞。为弥补这一差距,我们提出了SALMONN-Guard,这是一种安全防护LLM,联合检查语音、音频和文本以进行安全判断,将攻击成功率降低至20%。我们的结果强调了为多模态LLM的安全性提供音频意识防御的必要性。基准和SALMONN-Guard 检查点可在https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench 获取。警告:本文包含可能具有冒犯性或危害性的示例。
Summary / 总结
The research aims to evaluate the robustness of large language models (LLMs) against complex audio-based attacks, which are not adequately handled by current safeguards. SACRED-Bench, a new benchmark, uses speech-audio composition mechanisms to test LLMs, including speech overlap, speech-audio mixture, and diverse spoken instruction formats. Experiments show that even state-of-the-art LLMs like Gemini 2.5 Pro have a high attack success rate of 66%. To mitigate these attacks, the study proposes SALMONN-Guard, which reduces the attack success rate to 20% by jointly inspecting speech, audio, and text. The findings underscore the necessity of audio-aware defenses for multimodal LLMs.
研究旨在评估大型语言模型(LLMs)在面对复杂的基于音频的攻击时的鲁棒性,这些攻击目前的安全防护措施未能充分应对。SACRED-Bench 是一个新的基准,使用语音-音频合成机制来测试 LLMs,包括语音重叠、语音-音频混合以及多样化的语音指令格式。实验表明,即使是最先进的 LLMs 如 Gemini 2.5 Pro,在 SACRED-Bench 测试集中的攻击成功率仍高达 66%。为了缓解这些攻击,研究提出了 SALMONN-Guard,通过联合检查语音、音频和文本来降低攻击成功率至 20%。研究结果强调了为多模态 LLMs 提供音频意识防御的必要性。
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Authors: Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
First: 2025-11-14T16:07:18+00:00 · Latest: 2025-11-14T16:07:18+00:00
Abstract
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following ad-hoc collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Second, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
中文标题/摘要
标题:从合成场景到真实表现:增强VLM的空间推理能力
对视觉-语言模型(VLMs)进行微调是一种常见的策略,以提高性能,通常是在收集和标注真实场景数据后进行。然而,这一过程往往容易出现偏差、错误和分布不平衡,导致过拟合和性能不平衡。尽管有一些研究尝试通过生成合成数据来解决这个问题,但它们缺乏对分布偏差和标注质量的控制。为了解决这些挑战,我们以两种方式重新设计了微调过程。首先,我们控制数据及其标注的生成,确保其无偏差、无分布不平衡和无标注错误。我们通过全面采样场景中对象的属性(包括颜色、形状、大小和位置)自动构建数据集。其次,使用此标注数据集,我们微调最先进的VLMs,并在绝对位置任务上评估其性能转移性。我们在合成和真实世界基准上进行了详尽的评估。我们的实验揭示了两个关键发现:1)在平衡的合成数据上进行微调可以在视觉场景中获得一致的性能并减轻常见偏差;2)在合成刺激上进行微调显著提高了在真实世界数据(COCO)上的性能,超过了在匹配设置中进行微调的模型。
Summary / 总结
The study aims to enhance the spatial reasoning capabilities of Vision-Language Models (VLMs) by fine-tuning them on synthetic data to mitigate biases and distribution imbalances. The method involves automatically constructing a dataset with controlled attributes and annotations, ensuring it is free from bias and errors. The experiments show that fine-tuning on balanced synthetic data improves performance across the visual scene and enhances real-world performance on the COCO dataset compared to models fine-tuned on real-world data.
研究旨在通过解决细调数据集中的偏差和分布不平衡问题,增强视觉-语言模型(VLM)的空间推理能力。方法是生成具有可控属性和注释的合成数据,并在该数据集上对VLM进行细调。关键发现包括在视觉场景中的一致性能以及与在真实世界数据上进行细调相比,在COCO数据集上的性能提升。
Retrofit: Continual Learning with Bounded Forgetting for Security Applications
Authors: Yiling He, Junchi Lei, Hongyu She, Shuo Shao, Xinran Zheng, Yiping Liu, Zhan Qin, Lorenzo Cavallaro
First: 2025-11-14T16:07:03+00:00 · Latest: 2025-11-14T16:07:03+00:00
Abstract
Modern security analytics are increasingly powered by deep learning models, but their performance often degrades as threat landscapes evolve and data representations shift. While continual learning (CL) offers a promising paradigm to maintain model effectiveness, many approaches rely on full retraining or data replay, which are infeasible in data-sensitive environments. Moreover, existing methods remain inadequate for security-critical scenarios, facing two coupled challenges in knowledge transfer: preserving prior knowledge without old data and integrating new knowledge with minimal interference.
We propose RETROFIT, a data retrospective-free continual learning method that achieves bounded forgetting for effective knowledge transfer. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of old and new knowledge, through parameter-level merging that eliminates the need for historical data. To mitigate interference, we apply low-rank and sparse updates that confine parameter changes to independent subspaces, while a knowledge arbitration dynamically balances the teacher contributions guided by model confidence. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves around twice the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization.
中文标题/摘要
标题:Retrofit:面向安全应用的有界遗忘持续学习
现代安全分析越来越多地依赖深度学习模型,但随着威胁环境的变化和数据表示的转变,其性能往往会下降。虽然持续学习(CL)提供了一种有希望的范式来保持模型的有效性,但许多方法依赖于完全重新训练或数据回放,这在敏感数据环境中是不可行的。此外,现有方法在安全关键场景中仍然不足,面临着知识转移的两个耦合挑战:在没有旧数据的情况下保留先前知识,以及在最小干扰下整合新知识。
我们提出了Retrofit,一种无需回顾历史数据的持续学习方法,通过有界遗忘实现有效的知识转移。我们的核心思想是通过参数级合并将先前训练的模型和新微调的模型(分别作为旧知识和新知识的教师)整合起来,从而消除对历史数据的需求。为了减轻干扰,我们应用了低秩和稀疏更新,将参数变化限制在相互独立的子空间中,同时知识仲裁机制根据模型置信度动态平衡两位教师的贡献。我们在两个代表性应用上的评估表明,Retrofit在保持适应性的同时持续减轻遗忘。在时间漂移下的恶意软件检测中,它将保留分数从持续学习基线的20.2%大幅提高到38.6%,并在新数据上超过了oracle上界。在跨反编译级别的二进制代码摘要任务中(其中分析去符号的二进制文件尤其具有挑战性),Retrofit的BLEU分数约为先前工作所用迁移学习的两倍,并且在跨表示泛化方面超过了所有基线。
Summary / 总结
The paper addresses the challenge of maintaining deep learning model performance in dynamic security environments where data distributions change over time. It introduces RETROFIT, a continual learning method that avoids the need for historical data by merging the parameters of previously trained and newly fine-tuned models. The approach reduces interference between old and new knowledge through low-rank and sparse updates, and dynamically balances the teacher contributions according to model confidence. In malware detection under temporal drift, RETROFIT raises the retention score from 20.2% to 38.6% over continual learning baselines; in binary summarization, it achieves around twice the BLEU score of the transfer learning used in prior work.
Retrofit 是一种持续学习方法,旨在动态安全环境中维护模型性能而不依赖历史数据。它通过参数级合并整合旧模型和新模型,并使用低秩和稀疏更新来最小化知识之间的干扰。在恶意软件检测和二进制摘要任务中,Retrofit 显著提高了保留分数和 BLEU 分数,超过了基线持续学习方法。
VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models
Authors: Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei
Venue: AAAI 2026
First: 2025-11-14T16:06:25+00:00 · Latest: 2025-11-14T16:06:25+00:00
Comments: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Abstract
Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use "visual prompts" (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs' capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models' ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
中文标题/摘要
标题:VP-Bench:多模态大型语言模型视觉提示综合基准
多模态大型语言模型(MLLMs)已使一系列高级视觉-语言应用成为可能,包括细粒度的目标识别和上下文理解。当查询图像中的特定区域或对象时,人类用户自然会使用“视觉提示”(VPs),如边界框,来提供参考。然而,目前没有基准能够系统地评估MLLMs理解VPs的能力。这一空白使得不清楚当前的MLLMs是否能够有效识别VPs,这是一种直观的人类提示方法,并利用它们解决问题。为解决这一局限,我们引入了VP-Bench,一个评估MLLMs在VP感知和利用方面能力的基准。VP-Bench采用两阶段评估框架:第一阶段考察模型在自然场景中感知VPs的能力,使用30,000个可视化提示,涵盖八种形状和355种属性组合。第二阶段研究VPs对下游任务的影响,测量其在现实世界问题解决场景中的有效性。使用VP-Bench,我们评估了28个MLLMs,包括专有系统(如GPT-4o)和开源模型(如InternVL3和Qwen2.5-VL),并提供了影响VP理解的因素的全面分析,如VP属性的变化、问题排列和模型规模。VP-Bench为研究MLLMs如何理解和解决基于参照的问题建立了新的参考框架。
Summary / 总结
VP-Bench is a benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to interpret and utilize visual prompts (VPs) in natural scenes and real-world tasks. It consists of two stages: the first evaluates models' VP perception with 30k prompts, and the second assesses their impact on downstream tasks. The study tests 28 MLLMs, including proprietary and open-source models, and analyzes factors affecting VP understanding, such as VP attributes and model scale. This benchmark fills a gap in the evaluation of MLLMs' VP capabilities and provides a new reference framework for research in this area.
VP-Bench 是一个用于评估多模态大型语言模型(MLLMs)理解和利用视觉提示(VPs)能力的基准。该基准采用两阶段框架:第一阶段使用30k个可视化提示评估模型对VPs的感知能力,第二阶段评估VPs对下游任务的影响。研究评估了28个MLLMs(包括专有模型和开源模型),并分析了VP属性、问题排列和模型规模等影响VP理解的因素。VP-Bench 为研究MLLMs如何理解和解决基于指代的问题提供了新的参考框架。
The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
Authors: Maria-Teresa De Rosa Palmini, Eva Cetinic
First: 2025-11-14T16:03:10+00:00 · Latest: 2025-11-14T16:03:10+00:00
Abstract
Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.
中文标题/摘要
标题:文化记忆的持久性:在扩散模型中探究多模态象征性
我们的工作解决了文本到图像扩散模型中泛化与记忆之间的模糊性,重点关注我们称为多模态象征性的特定案例。这指的是图像和文本唤起的文化共享关联,例如标题唤起熟悉的艺术作品或电影场景。尽管关于记忆和遗忘的先前研究强调遗忘,但我们研究的是被记住的内容及其方式,关注的是识别文化参考与再现之间的平衡。我们引入了一种评估框架,将识别(模型是否识别参考)与实现(通过复制或重新解释如何再现)分开,并通过捕捉两个维度的度量来量化。通过评估五个扩散模型在767个维基数据衍生的文化参考中的表现,跨越静态和动态图像,我们表明我们的框架比现有的基于相似性的方法更有效地区分复制与转变。为了评估语言敏感性,我们使用同义词替换和字面图像描述进行了提示扰动实验,发现即使文本提示被改变,模型也经常再现标志性视觉结构。最后,我们的分析表明,文化对齐不仅与训练数据频率有关,还与文本的独特性、参考的流行度和创作日期有关。我们的工作揭示了扩散模型的价值不仅在于它们再现的内容,还在于它们如何转变和重新语境化文化知识,推动评估从简单的图文匹配向更丰富的语境理解发展。
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
First: 2025-11-14T16:02:38+00:00 · Latest: 2025-11-14T16:02:38+00:00
Abstract
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images. It features a hybrid VLM judger evaluation framework, based on both the reference image and the combination of the original image with editing instructions, that assesses models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for the multimodal community.
中文标题/摘要
标题:WEAVE:释放和基准测试上下文交织的跨模态理解和生成
近期统一多模态模型(UMMs)的发展在视觉理解和生成方面取得了显著进展。然而,现有的数据集和基准主要集中在单轮交互上,未能捕捉到现实世界图像创作和编辑中的多轮、上下文依赖性特征。为解决这一差距,我们提出了WEAVE,这是首个用于上下文交织跨模态理解和生成的工具套件。我们的套件包含两个互补部分。WEAVE-100k 是一个包含 100,000 个交织样本的大规模数据集,跨越超过 370,000 个对话回合和 500,000 张图像,涵盖了需要对历史上下文进行推理的理解、编辑和生成任务。WEAVEBench 是一个基于 480 张图像、包含 100 项任务的人工注释基准,采用混合 VLM 评判框架,基于参考图像以及原始图像与编辑指令的组合,评估模型在多轮生成、视觉记忆和跨领域世界知识推理方面的能力。实验表明,使用 WEAVE-100k 训练能够带来视觉理解、图像编辑和理解-生成协作的能力。此外,它帮助统一多模态模型发展出涌现的视觉记忆能力,而广泛的 WEAVEBench 评估揭示了当前方法在多轮、上下文感知图像生成和编辑方面的持续局限性和挑战。我们相信 WEAVE 为多模态社区研究上下文交织的理解和生成提供了视角和基础。
Summary / 总结
The research aims to address the limitations of existing datasets and benchmarks that focus on single-turn interactions, by introducing WEAVE, a suite for in-context interleaved cross-modality comprehension and generation. The method includes WEAVE-100k, a large-scale dataset with 100K interleaved samples, and WEAVEBench, a benchmark with 100 tasks based on 480 images. Key findings show that training on WEAVE-100k enhances vision comprehension, image editing, and collaboration capabilities, and also helps develop emergent visual-memory capabilities. However, evaluations on WEAVEBench reveal persistent limitations in multi-turn, context-aware image generation and editing for current approaches.
研究旨在通过引入WEAVE套件,解决现有数据集和基准主要关注单轮交互的局限性,该套件包括一个包含100K交错样本的大规模数据集WEAVE-100k和一个基于480张图像的100项任务基准WEAVEBench。实验发现表明,通过WEAVE-100k训练可以增强视觉理解、图像编辑和协作生成能力,并帮助UMMs发展出新兴的视觉记忆能力。然而,WEAVEBench上的评估揭示了当前方法在多轮、上下文感知图像生成和编辑方面的持续局限性和挑战。该研究为多模态社区进一步研究上下文交错交叉模态理解和生成提供了基础。
BubbleOKAN: A Physics-Informed Interpretable Neural Operator for High-Frequency Bubble Dynamics
Authors: Yunhao Zhang, Sidharth S. Menon, Lin Cheng, Aswin Gnanaskandan, Ameya D. Jagtap
First: 2025-08-05T23:05:20+00:00 · Latest: 2025-11-14T15:59:13+00:00
Comments: 36 pages, 21 figures
Abstract
In this work, we employ physics-informed neural operators to map pressure profiles from an input function space to the corresponding bubble radius responses. Our approach employs a two-step DeepONet architecture. To address the intrinsic spectral bias of deep learning models, our model incorporates the Rowdy adaptive activation function, enhancing the representation of high-frequency features. Moreover, we introduce the Kolmogorov-Arnold network (KAN) based two-step DeepOKAN model, which enhances interpretability (often lacking in conventional multilayer perceptron architectures) while efficiently capturing high-frequency bubble dynamics without explicit utilization of activation functions in any form. We particularly investigate the use of spline basis functions in combination with radial basis functions (RBF) within our architecture, as they demonstrate superior performance in constructing a universal basis for approximating high-frequency bubble dynamics compared to alternative formulations. Furthermore, we highlight the performance bottleneck of RBF when learning high-frequency bubble dynamics and showcase the advantage of using spline basis functions for the trunk network in overcoming this inherent spectral bias. The model is systematically evaluated across three representative scenarios: (1) bubble dynamics governed by the Rayleigh-Plesset equation with a single initial radius, (2) bubble dynamics governed by the Keller-Miksis equation with a single initial radius, and (3) Keller-Miksis dynamics with multiple initial radii. We also compare our results with state-of-the-art neural operators, including Fourier Neural Operators, Wavelet Neural Operators, OFormer, and Convolutional Neural Operators. Our findings demonstrate that the two-step DeepOKAN accurately captures both low- and high-frequency behaviors, and offers a promising alternative to conventional numerical solvers.
中文标题/摘要
标题:BubbleOKAN:一种用于高频气泡动力学的物理知情可解释神经算子
在本文中,我们采用物理知情神经算子将输入函数空间的压力分布映射到相应的气泡半径响应。我们的方法采用两步DeepONet架构。为了解决深度学习模型固有的频谱偏差问题,我们的模型结合了Rowdy自适应激活函数,增强了高频特征的表示能力。此外,我们引入了基于Kolmogorov-Arnold网络(KAN)的两步DeepOKAN模型,该模型增强了可解释性(在传统的多层感知器架构中通常缺乏),同时高效地捕捉高频气泡动力学,而无需以任何形式使用激活函数。我们特别研究了在架构中结合使用样条基函数和径向基函数(RBF)的效果,因为它们在构建近似高频气泡动力学的通用基方面表现出色,优于其他形式的表述。此外,我们强调了在学习高频气泡动力学时RBF的性能瓶颈,并展示了使用样条基函数作为主干网络以克服这种固有频谱偏差的优势。该模型在三个代表性场景中进行了系统评估:(1)由Rayleigh-Plesset方程支配的气泡动力学,初始半径单一;(2)由Keller-Miksis方程支配的气泡动力学,初始半径单一;(3)Keller-Miksis动力学,初始半径多值。我们还将我们的结果与最先进的神经算子进行了比较,包括Fourier神经算子、小波神经算子、OFormer和卷积神经算子。我们的研究结果表明,两步DeepOKAN能够准确捕捉低频和高频行为,并为传统的数值求解器提供了一种有前景的替代方案。
Summary / 总结
This work introduces BubbleOKAN, a physics-informed neural operator that uses a two-step DeepONet architecture with Rowdy adaptive activation functions to enhance the representation of high-frequency features in bubble dynamics. The model, particularly the two-step DeepOKAN, improves interpretability and efficiently captures high-frequency dynamics. The study evaluates the model across three scenarios and compares it with other neural operators, showing that BubbleOKAN accurately captures both low- and high-frequency behaviors.
该研究使用物理知情的神经算子将压力分布映射到气泡半径响应,采用两步DeepONet架构并结合Rowdy自适应激活函数以增强高频特征表示。两步DeepOKAN模型结合了样条基函数和Kolmogorov-Arnold网络,以提高可解释性并有效捕捉高频气泡动力学。该模型在三个场景中进行了评估,并与其他最先进的神经算子进行了比较,结果表明DeepOKAN能够准确捕捉低频和高频行为。
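For readers unfamiliar with the first evaluation scenario, the following is a minimal sketch of the map a neural operator of this kind is trained to approximate: a far-field pressure profile goes in, a bubble radius response comes out. It integrates the standard Rayleigh-Plesset equation, R R'' + (3/2) R'^2 = (p_gas - p_inf(t) - 2 sigma / R - 4 mu R' / R) / rho, with a classical RK4 scheme. Everything here is an assumption made for illustration and is not taken from the paper: the polytropic gas model, the water-like parameter values, the sinusoidal driving pressure, and the function names `rayleigh_plesset_rhs`, `sinusoidal_pressure`, and `integrate`.

```python
import math


def rayleigh_plesset_rhs(t, R, Rdot, p_inf, R0=1e-5, p0=101325.0,
                         rho=998.0, sigma=0.0725, mu=1.0e-3, kappa=1.4):
    """Return R'' from the Rayleigh-Plesset equation (illustrative parameters).

    Assumes a polytropic gas bubble that is in equilibrium at radius R0
    under ambient pressure p0. All default values are assumptions.
    """
    p_gas = (p0 + 2.0 * sigma / R0) * (R0 / R) ** (3.0 * kappa)
    p_net = p_gas - p_inf(t) - 2.0 * sigma / R - 4.0 * mu * Rdot / R
    return (p_net / rho - 1.5 * Rdot ** 2) / R


def sinusoidal_pressure(p0, amplitude, freq):
    """Hypothetical input function: far-field pressure p0 + A sin(2 pi f t)."""
    return lambda t: p0 + amplitude * math.sin(2.0 * math.pi * freq * t)


def integrate(p_inf, R0=1e-5, t_end=1e-5, steps=20000):
    """Integrate (R, R') with classical RK4; return the list of radii."""
    h = t_end / steps
    R, V, t = R0, 0.0, 0.0  # start at rest at the equilibrium radius
    radii = [R]

    def f(t, R, V):
        # First-order system: d/dt (R, V) = (V, R'')
        return V, rayleigh_plesset_rhs(t, R, V, p_inf, R0=R0)

    for _ in range(steps):
        k1 = f(t, R, V)
        k2 = f(t + h / 2, R + h / 2 * k1[0], V + h / 2 * k1[1])
        k3 = f(t + h / 2, R + h / 2 * k2[0], V + h / 2 * k2[1])
        k4 = f(t + h, R + h * k3[0], V + h * k3[1])
        R += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        V += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        t += h
        radii.append(R)
    return radii
```

Calling `integrate(sinusoidal_pressure(101325.0, 10132.5, 1e5))` returns one radius trajectory for one pressure profile. A conventional solver repeats this integration for every new profile, which is the cost a learned operator aims to avoid.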