arXiv 论文速递

ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning

Authors: Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo

First: 2025-12-10T18:57:09+00:00 · Latest: 2025-12-10T18:57:09+00:00

Abstract

Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) there is an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.

中文标题/摘要

标题：ReViSE：统一模型中基于推理的视频编辑方法

视频统一模型在理解和生成方面表现出强大的能力，但在配备了强大内部视觉-语言模型（VLM）的情况下，它们在基于推理的视觉编辑方面仍然存在困难。我们将其差距归因于两个因素：1）现有数据集不适用于训练和评估推理感知的视频编辑，2）模型推理能力和编辑能力之间的固有脱节，这阻碍了丰富的理解有效地指导编辑过程。弥合这一差距需要一个将推理与视觉转换集成的框架。为了解决这一差距，我们提出了基于推理的视频编辑（RVE）任务，该任务要求在编辑过程中考虑物理合理性和因果动态。为了支持系统的评估，我们构建了RVE-Bench，这是一个全面的基准，包含两个互补子集：基于推理的视频编辑和上下文视频生成。这些子集涵盖了多种推理维度和现实世界的编辑场景。在此基础上，我们提出了ReViSE，一种自我反思推理（SRF）框架，将生成和评估统一在一个架构中。模型内部的VLM通过评估编辑后的视频是否逻辑上满足给定的指令，提供内在反馈。差异反馈在训练过程中细化生成器的推理行为。在RVE-Bench上的广泛实验表明，ReViSE 显著提高了编辑准确性和视觉保真度，在基于推理的视频编辑子集上相对于最先进的方法实现了32%的整体分数提升。

Summary / 总结

The paper addresses the challenge of reason-informed video editing by introducing the RVE task and RVE-Bench, which evaluate the ability to reason about physical plausibility and causal dynamics during editing. The proposed ReViSE framework uses a Self-Reflective Reasoning (SRF) approach to unify generation and evaluation in a single architecture: the model's internal VLM judges whether an edited video satisfies the instruction, and that feedback refines the generator during training. Experiments show that ReViSE improves editing accuracy and visual fidelity, with a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.

论文针对基于推理的视频编辑难题，引入了RVE任务和RVE-Bench，用于评估模型对物理合理性和因果动态的推理能力。提出的ReViSE框架采用自反思推理（SRF）方法，将生成与评估统一在同一架构中，并利用内在反馈改进推理行为。实验表明，ReViSE提升了编辑准确性和视觉保真度，在基于推理的视频编辑子集上的Overall得分比最先进的方法提高了32%。

Splatent: Splatting Diffusion Latents for Novel View Synthesis

Authors: Or Hirschorn, Omer Sela, Inbar Huberman-Spiegelglas, Netalee Efrat, Eli Alshan, Ianir Ideses, Frederic Devernay, Yochai Zvik, Lior Fritz

First: 2025-12-10T18:57:04+00:00 · Latest: 2025-12-10T18:57:04+00:00

Abstract

Radiance field representations have recently been explored in the latent space of VAEs that are commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pre-trained diffusion models to recover fine-grained details, at the risk of some hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state of the art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction.

中文标题/摘要

标题：Splatent：面向新视角合成的扩散潜变量泼溅

辐射场表示最近在VAE的潜在空间中被探索，该空间通常被扩散模型所使用。这一方向提供了高效的渲染，并能与基于扩散的流程无缝集成。然而，这些方法面临一个根本性的限制：VAE潜在空间缺乏多视图一致性，导致在3D重建过程中出现模糊的纹理和缺失的细节。现有方法试图通过微调VAE来解决这一问题，但这会牺牲重建质量；或者依赖预训练的扩散模型来恢复细粒度的细节，但存在产生幻觉的风险。我们提出了Splatent，这是一种基于扩散的增强框架，旨在运行于VAE潜在空间中的3D高斯泼溅（3DGS）之上。我们的关键见解与传统的以3D为中心的观点不同：我们不是在3D空间中重建细粒度的细节，而是通过多视图注意力机制在2D中从输入视图恢复它们。这种方法保留了预训练VAE的重建质量，同时实现了忠实的细节恢复。在多个基准测试中，Splatent为VAE潜在辐射场重建确立了新的最先进水平。我们进一步证明，将我们的方法与现有的前馈框架结合使用，可以一致地提高细节保留能力，为高质量稀疏视图3D重建开辟新的可能性。

Summary / 总结

Splatent is a diffusion-based enhancement framework that operates on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Rather than reconstructing fine-grained details in 3D, it recovers them in 2D from the input views through multi-view attention. This preserves the reconstruction quality of the pretrained VAE, which fine-tuning would sacrifice, while achieving faithful detail recovery, and it establishes a new state of the art in VAE latent radiance field reconstruction. Integrating Splatent with existing feed-forward frameworks consistently improves detail preservation in sparse-view 3D reconstruction.

Splatent 是一种基于扩散的增强框架，构建在 VAE 潜在空间中的 3D 高斯泼溅（3DGS）之上，通过多视图注意力机制在 2D 中从输入视图恢复细粒度细节。这种方法保留了预训练 VAE 的重建质量，同时实现了忠实的细节恢复，在 VAE 潜在辐射场重建上达到了新的最先进水平；与现有前馈框架结合时，还能一致地提升细节保留能力。

LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating

Authors: Junting Chen, Yunchuan Li, Panfeng Jiang, Jiacheng Du, Zixuan Chen, Chenrui Tie, Jiajun Deng, Lin Shao

First: 2025-12-10T18:54:30+00:00 · Latest: 2025-12-10T18:54:30+00:00

Comments: 8 pages

Abstract

Towards human-robot coexistence, socially aware navigation is significant for mobile robots. Yet existing studies in this area focus mainly on path efficiency and pedestrian collision avoidance, which are essential but represent only a fraction of social navigation. Beyond these basics, robots must also comply with user instructions, aligning their actions to task goals and social norms expressed by humans. In this work, we present LISN-Bench, the first simulation-based benchmark for language-instructed social navigation. Built on Rosnav-Arena 3.0, it is the first standardized social navigation benchmark to incorporate instruction following and scene understanding across diverse contexts. To address this task, we further propose Social-Nav-Modulator, a fast-slow hierarchical system where a VLM agent modulates costmaps and controller parameters. Decoupling low-level action generation from the slower VLM loop reduces reliance on high-frequency VLM inference while improving dynamic avoidance and perception adaptability. Our method achieves an average success rate of 91.3%, which is more than 63% higher than that of the most competitive baseline, with most of the improvements observed in challenging tasks such as following a person in a crowd and navigating while strictly avoiding instruction-forbidden regions. The project website is at: https://social-nav.github.io/LISN-project/

中文标题/摘要

标题：LISN：基于VLM控制器调节的语言指导社会导航

为了实现人机共存，社会意识导航对于移动机器人至关重要。然而，该领域的现有研究主要集中在路径效率和行人碰撞避免上，这些固然必不可少，但只是社会导航的一小部分。除此之外，机器人还必须遵循用户指令，使其行为与人类表达的任务目标和社会规范相一致。在本文中，我们提出了LISN-Bench，这是首个基于仿真的语言指导社会导航基准。它基于Rosnav-Arena 3.0构建，是首个在多种情境下纳入指令遵循和场景理解的标准化社会导航基准。为了解决这一任务，我们进一步提出了Social-Nav-Modulator，这是一种快慢分层系统，其中VLM代理调节代价地图和控制器参数。通过将低级动作生成与较慢的VLM循环解耦，减少了对高频VLM推理的依赖，同时提高了动态避障和感知适应性。我们的方法的平均成功率达到91.3%，比最具竞争力的基线高出63%以上，其中大部分提升出现在具有挑战性的任务中，例如在人群中跟随某个人，以及在导航时严格避开指令禁止的区域。项目网站为：https://social-nav.github.io/LISN-project/

Summary / 总结

The research aims to enhance socially aware navigation for mobile robots by having them follow user instructions in addition to avoiding pedestrians. It introduces LISN-Bench, a simulation-based benchmark built on Rosnav-Arena 3.0, and Social-Nav-Modulator, a fast-slow hierarchical system in which a Vision-Language Model (VLM) agent modulates costmaps and controller parameters. This approach improves dynamic avoidance and perception adaptability, achieving an average success rate of 91.3%, more than 63% higher than the most competitive baseline, with most of the gains in challenging tasks such as following a person in a crowd and avoiding instruction-forbidden regions.

该研究旨在通过引入LISN-Bench，一个基于仿真的语言指导社会导航基准，解决人类与机器人共存中的社会意识导航需求。该基准涵盖了各种情境下的指令遵循和场景理解。提出的Social-Nav-Modulator系统使用VLM代理来调节代价地图和控制器参数，将低级动作生成与较慢的VLM循环解耦，从而减少对高频VLM推理的依赖。该方法的平均成功率达到91.3%，比最具竞争力的基线高出63%以上，改进主要体现在人群中跟随行人、严格避开指令禁止区域等具有挑战性的任务上。
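
The fast-slow split described in the LISN abstract can be illustrated with a small sketch. This is not the authors' implementation: the VLM is replaced by a keyword stub, and every name here (`ControllerParams`, `slow_vlm_stub`, `fast_control_step`, `slow_period`) is a hypothetical stand-in chosen for illustration. The point is only that the slow loop rewrites parameters occasionally while the fast loop acts on every tick.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControllerParams:
    """Parameters the slow loop may modulate (names are hypothetical)."""
    max_speed: float = 1.0          # metres per second
    pedestrian_margin: float = 0.5  # extra clearance around people, in metres


def slow_vlm_stub(instruction: str) -> ControllerParams:
    """Stand-in for a VLM call: maps an instruction to controller parameters."""
    if "careful" in instruction.lower():
        return ControllerParams(max_speed=0.4, pedestrian_margin=1.2)
    return ControllerParams()


def fast_control_step(distance_to_goal: float, nearest_pedestrian: float,
                      params: ControllerParams) -> float:
    """Fast loop: choose a forward speed using only the current parameters."""
    if nearest_pedestrian < params.pedestrian_margin:
        return 0.0  # stop while a pedestrian is inside the margin
    return min(params.max_speed, distance_to_goal)


def run(instruction: str, pedestrian_distances: List[float],
        slow_period: int = 5) -> Tuple[List[float], int]:
    """Run the fast loop every tick and the slow loop every `slow_period` ticks."""
    params = ControllerParams()
    speeds: List[float] = []
    vlm_calls = 0
    for tick, pedestrian in enumerate(pedestrian_distances):
        if tick % slow_period == 0:
            params = slow_vlm_stub(instruction)  # infrequent, expensive call
            vlm_calls += 1
        speeds.append(fast_control_step(10.0, pedestrian, params))
    return speeds, vlm_calls
```

With six ticks and `slow_period=5`, the stub is consulted twice while the controller produces a command on every tick, which is the sense in which the fast loop does not depend on high-frequency VLM inference.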

NordFKB: a fine-grained benchmark dataset for geospatial AI in Norway

Authors: Sander Riisøen Jyhne, Aditya Gupta, Ben Worsley, Marianne Andersen, Ivar Oveland, Alexander Salveson Nossum

First: 2025-12-10T18:47:25+00:00 · Latest: 2025-12-10T18:47:25+00:00

Comments: 8 pages, 2 figures, 2 tables

Abstract

We present NordFKB, a fine-grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high-resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per-class binary segmentation masks in GeoTIFF format and COCO-style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensure high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.

中文标题/摘要

标题：NordFKB：挪威地理空间AI的细粒度基准数据集

我们介绍了NordFKB，这是一个源自权威且高度准确的挪威国家Felles KartdataBase (FKB) 的细粒度基准数据集，用于挪威的地理空间AI。该数据集包含高分辨率正射影像，并附有36个语义类的详细注释，包括GeoTIFF格式的每个类别的二元分割掩码和COCO风格的边界框注释。数据来自七个地理上不同的区域，确保了气候、地形和城市化的多样性。仅包含包含至少一个注释对象的瓦片，并通过跨区域的随机采样创建训练/验证分割，以确保代表性类和上下文分布。通过人工专家审查和质量控制，确保了高注释准确性。除了数据集，我们还发布了基准测试仓库，包含标准化的评估协议和工具，用于语义分割和对象检测，使研究具有可重复性和可比性。NordFKB 为推进制图、土地管理及空间规划中的AI方法提供了坚实的基础，并为未来在覆盖范围、时间范围和数据模态方面的扩展铺平了道路。

Summary / 总结

NordFKB is a fine-grained benchmark dataset for geospatial AI in Norway, derived from the national Felles KartdataBase (FKB). It pairs high-resolution orthophotos with detailed annotations for 36 semantic classes, collected from seven geographically diverse areas and checked by human expert review. Annotations are provided as per-class binary segmentation masks in GeoTIFF format and as COCO-style bounding boxes, and the training/validation splits are created by random sampling across areas. An accompanying benchmarking repository provides standardized evaluation protocols and tools for reproducible research in semantic segmentation and object detection.

NordFKB 是一个源自权威 FKB 数据库的细粒度基准数据集，用于挪威的地理空间 AI。它包含高分辨率正射影像，并对 36 个语义类别进行了详细标注；数据采自七个地理条件各异的区域，并经人工专家审查以保证标注准确性。标注以每类二元分割掩码（GeoTIFF 格式）和 COCO 风格边界框两种形式提供，训练/验证划分通过跨区域随机采样得到。该数据集为制图、土地管理和空间规划中的 AI 方法提供了基础，并为未来在覆盖范围、时间范围和数据模态上的扩展铺平了道路。
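
To make the two annotation formats in the NordFKB entry concrete, here is a minimal sketch of how a binary mask relates to a COCO-style box, which stores `[x_min, y_min, width, height]` in pixels. The function name `mask_to_coco_bbox` is hypothetical and the mask is a plain nested list rather than a GeoTIFF raster; how the dataset's own boxes were produced is not stated in the abstract. The `None` return for an empty mask mirrors the idea of skipping tiles with no annotated object.

```python
from typing import List, Optional


def mask_to_coco_bbox(mask: List[List[int]]) -> Optional[List[int]]:
    """Return [x_min, y_min, width, height] for the nonzero pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None  # empty mask: nothing to annotate
    x_min, x_max = min(cols), max(cols)
    y_min, y_max = rows[0], rows[-1]
    return [x_min, y_min, x_max - x_min + 1, y_max - y_min + 1]


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
example_bbox = mask_to_coco_bbox(example_mask)
```

This sketch returns one box per mask, so a mask containing several separate objects would need connected-component labelling first.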

Supervised learning pays attention

Authors: Erin Craig, Robert Tibshirani

First: 2025-12-10T18:43:46+00:00 · Latest: 2025-12-10T18:43:46+00:00

Abstract

In-context learning with attention enables large neural networks to make context-specific predictions by selectively focusing on relevant examples. Here, we adapt this idea to supervised learning procedures such as lasso regression and gradient boosting, for tabular data. Our goals are to (1) flexibly fit personalized models for each prediction point and (2) retain model simplicity and interpretability. Our method fits a local model for each test observation by weighting the training data according to attention, a supervised similarity measure that emphasizes features and interactions that are predictive of the outcome. Attention weighting allows the method to adapt to heterogeneous data in a data-driven way, without requiring cluster or similarity pre-specification. Further, our approach is uniquely interpretable: for each test observation, we identify which features are most predictive and which training observations are most relevant. We then show how to use attention weighting for time series and spatial data, and we present a method for adapting pretrained tree-based models to distributional shift using attention-weighted residual corrections. Across real and simulated datasets, attention weighting improves predictive performance while preserving interpretability, and theory shows that attention-weighting linear models attain lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.

中文标题/摘要

标题：监督学习引入注意力机制

上下文学习与注意力机制使大型神经网络能够通过选择性地关注相关示例来做出上下文特定的预测。在这里，我们将这一理念应用于监督学习程序，如套索回归和梯度提升，用于表格数据。我们的目标是（1）为每个预测点灵活拟合个性化模型，并（2）保持模型的简洁性和可解释性。我们的方法通过根据注意力加权训练数据来为每个测试观测拟合局部模型，注意力是一种监督相似性度量，强调那些对结果具有预测性的特征和交互。注意力加权允许该方法以数据驱动的方式适应异质数据，而无需预先指定聚类或相似性。此外，我们的方法具有独特的可解释性：对于每个测试观测，我们确定哪些特征最具预测性，哪些训练观测最具相关性。然后，我们展示了如何使用注意力加权来处理时间序列和空间数据，并提出了一种使用注意力加权残差修正来适应预训练树模型的方法，以应对分布变化。在实际和模拟数据集上，注意力加权提高了预测性能并保持了可解释性，理论表明，在具有已知子组结构的混合模型数据生成过程中，注意力加权线性模型的均方误差低于标准线性模型。

Summary / 总结

The research aims to enhance supervised learning methods for tabular data, such as lasso regression and gradient boosting, by using attention to fit a personalized model for each prediction point while maintaining model simplicity and interpretability. The method weights the training data with a supervised similarity measure that emphasizes the features and interactions predictive of the outcome, then fits a local model for each test observation. Across real and simulated datasets, this improves predictive performance while preserving interpretability, and theoretical analysis shows that attention-weighted linear models achieve lower mean squared error than the standard linear model under mixture-of-models data-generating processes with known subgroup structure.

研究旨在通过引入注意力机制增强监督学习模型的灵活性和可解释性。该方法基于监督相似性度量对训练数据进行加权，以为每个预测点拟合个性化模型。实验结果表明，注意力加权可以提高预测性能同时保持可解释性，特别是在异质数据条件下。理论分析支持在特定数据生成过程中，注意力加权的线性模型比标准线性模型具有更低的均方误差。
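
The idea of weighting training rows by attention and then fitting a local model can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the per-feature `importance` values are supplied by hand here (the paper describes a learned, supervised similarity), the local model is a one-variable weighted least-squares line rather than a lasso, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attention_weights(X_train: Sequence[Sequence[float]], x_test: Sequence[float],
                      importance: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over negative importance-scaled squared distances to x_test."""
    scores = [
        -sum(w * (a - b) ** 2 for w, a, b in zip(importance, row, x_test)) / temperature
        for row in X_train
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_line_fit(x: Sequence[float], y: Sequence[float],
                      w: Sequence[float]) -> Tuple[float, float]:
    """Weighted least squares for y ~ a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Two subgroups (first column) with opposite slopes in the second column.
X = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)]
y = [0, 2, 4, 6, 0, -1, -2, -3]
x_only = [row[1] for row in X]

# A single global line averages the two subgroups.
_, global_slope = weighted_line_fit(x_only, y, [1.0] * len(X))

# A local fit for a test point in subgroup 1, attending to the group feature.
test_point = (1, 1.5)
weights = attention_weights(X, test_point, importance=[50.0, 0.0])
local_intercept, local_slope = weighted_line_fit(x_only, y, weights)
local_prediction = local_intercept + local_slope * test_point[1]
```

The global fit blends the two subgroups, while the attention-weighted fit for a point in subgroup 1 recovers that subgroup's own slope, and the weights themselves show which training rows drove the prediction.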

STACHE: Local Black-Box Explanations for Reinforcement Learning Policies

Authors: Andrew Elashkin, Orna Grumberg

First: 2025-12-10T18:37:28+00:00 · Latest: 2025-12-10T18:37:28+00:00

Abstract

Reinforcement learning agents often behave unexpectedly in sparse-reward or safety-critical environments, creating a strong need for reliable debugging and verification tools. In this paper, we propose STACHE, a comprehensive framework for generating local, black-box explanations for an agent's specific action within discrete Markov games. Our method produces a Composite Explanation consisting of two complementary components: (1) a Robustness Region, the connected neighborhood of states where the agent's action remains invariant, and (2) Minimal Counterfactuals, the smallest state perturbations required to alter that decision. By exploiting the structure of factored state spaces, we introduce an exact, search-based algorithm that circumvents the fidelity gaps of surrogate models. Empirical validation on Gymnasium environments demonstrates that our framework not only explains policy actions, but also effectively captures the evolution of policy logic during training - from erratic, unstable behavior to optimized, robust strategies - providing actionable insights into agent sensitivity and decision boundaries.

中文标题/摘要

标题：STACHE：强化学习策略的局部黑盒解释

强化学习代理在稀疏奖励或安全关键环境中常常表现出意外行为，这迫切需要可靠的调试和验证工具。本文提出STACHE，一种生成离散马尔可夫游戏中代理特定行为的局部黑盒解释的综合框架。该方法生成一个综合解释，包括两个互补的组成部分：（1）稳健性区域，即代理行为保持不变的连接状态邻域；（2）最小反事实，即改变该决策所需的最小状态扰动。通过利用状态空间的因子结构，我们引入了一种精确的基于搜索的算法，绕过了代理模型的精度差距。在Gymnasium环境上的实验证明，我们的框架不仅解释了策略行为，还有效地捕捉了训练过程中策略逻辑的演变——从不稳定行为到优化、稳健策略，提供了有关代理敏感性和决策边界的可操作见解。

Summary / 总结

The paper addresses the need for reliable debugging tools for reinforcement learning agents in sparse-reward or safety-critical environments. It introduces STACHE, a framework that generates local, black-box explanations for an agent's specific actions in discrete Markov games. The method includes a Robustness Region and Minimal Counterfactuals to explain the agent's decision-making process. Experimental results show that STACHE effectively captures the policy's evolution from unstable behavior to optimized strategies, providing actionable insights into the agent's sensitivity and decision boundaries.

研究旨在为强化学习代理提供可靠的调试和验证工具，特别是在稀疏奖励或安全关键环境中。提出的STACHE框架为离散马尔可夫游戏中的代理特定行为生成局部、黑盒解释。它由稳健区域和最小反事实组成。实证验证表明，STACHE不仅解释了策略行为，还捕捉了策略从不稳定到优化策略的演变过程，提供了关于代理敏感性和决策边界的见解。
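
The two components of the Composite Explanation can be sketched with a breadth-first search over a factored state space. This is a toy illustration, not the STACHE algorithm itself: states are integer tuples, a "perturbation" is assumed to be a unit step in one factor, and the function names and the example policy are hypothetical. The policy is only ever queried, never inspected, which is what makes the explanation black-box.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Sequence, Set, Tuple

State = Tuple[int, ...]


def neighbors(state: State, sizes: Sequence[int]) -> List[State]:
    """States that differ from `state` by one step in exactly one factor."""
    result = []
    for i, size in enumerate(sizes):
        for delta in (-1, 1):
            value = state[i] + delta
            if 0 <= value < size:
                result.append(state[:i] + (value,) + state[i + 1:])
    return result


def explain(policy: Callable[[State], Hashable], start: State,
            sizes: Sequence[int]) -> Tuple[Set[State], List[State]]:
    """Return (robustness region, minimal counterfactuals) for policy(start).

    The region is the connected set of states sharing the start action.
    The counterfactuals are the action-changing states reached in the
    fewest unit steps from the start state.
    """
    action = policy(start)
    depth: Dict[State, int] = {start: 0}
    region: Set[State] = {start}
    flipped: Dict[State, int] = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in neighbors(state, sizes):
            if nxt in depth:
                continue
            depth[nxt] = depth[state] + 1
            if policy(nxt) == action:
                region.add(nxt)
                queue.append(nxt)  # keep expanding inside the region
            else:
                flipped[nxt] = depth[nxt]  # boundary state: action changes
    if not flipped:
        return region, []
    best = min(flipped.values())
    return region, sorted(s for s, d in flipped.items() if d == best)


def toy_policy(state: State) -> str:
    """Hypothetical policy on a 5x5 grid: move right until column 3."""
    return "right" if state[0] < 3 else "stop"


toy_region, toy_counterfactuals = explain(toy_policy, (1, 2), sizes=(5, 5))
```

Because the search is exhaustive over the reachable region, the result is exact for the given policy rather than an approximation from a surrogate model, at the cost of one policy query per visited state.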

VisualActBench: Can VLMs See and Act like a Human?

Authors: Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo

First: 2025-12-10T18:36:18+00:00 · Latest: 2025-12-10T18:36:18+00:00

Abstract

Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.

中文标题/摘要

标题：VisualActBench：VLMs能否像人类一样观察和行动？

视觉-语言模型（VLMs）在感知和描述视觉环境方面取得了显著进展。然而，它们仅凭视觉输入进行主动推理和行动的能力，而无需明确的文本提示，仍处于探索阶段。我们引入了一个新的任务——视觉行动推理，并提出了一个包含1,074个视频和3,733个人类标注动作的大规模基准VisualActBench，覆盖四个真实场景。每个动作都标注了行动优先级水平（APL）和主动-反应类型，以评估模型的人类对齐推理和价值敏感性。我们在VisualActBench上评估了29个VLMs，并发现尽管前沿模型如GPT4o表现出相对较强的能力，但与人类水平的推理相比，特别是在生成主动、高优先级动作方面，仍存在显著差距。我们的结果突显了当前VLMs在解释复杂背景、预测结果和与人类决策框架对齐方面的局限性。VisualActBench为评估和提高主动视觉中心AI代理的现实世界准备性奠定了全面的基础。

Summary / 总结

The study introduces the Visual Action Reasoning task and VisualActBench, a benchmark for evaluating VLMs' ability to reason and act proactively based solely on visual inputs, without explicit textual prompts. It includes 1,074 videos and 3,733 human-annotated actions across four real-world scenarios, each labeled with an Action Prioritization Level (APL) and a proactive-reactive type. Evaluating 29 VLMs, the research finds that while frontier models like GPT4o perform relatively well, they still fall short of human-level reasoning, especially in generating proactive, high-priority actions. This highlights current VLMs' limitations in interpreting complex context, anticipating outcomes, and aligning with human decision-making.

研究引入了视觉行动推理（Visual Action Reasoning）任务和VisualActBench基准，用于评估VLMs仅凭视觉输入进行主动推理和行动的能力。该基准包括1,074个视频和3,733个人工标注的动作，覆盖四个真实场景。评估29个VLMs后，研究发现虽然像GPT4o这样的前沿模型表现相对较强，但与人类水平的推理相比仍有显著差距，尤其是在生成主动、高优先级动作方面。这表明当前VLMs在理解复杂情境和预测结果方面仍存在不足。

YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

Authors: Ryan Meegan, Adam D'Souza, Bryan Bo Cao, Shubham Jain, Kristin Dana

First: 2025-12-10T18:32:38+00:00 · Latest: 2025-12-10T18:32:38+00:00

Abstract

Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.

中文标题/摘要

标题：YOPO-Nav：使用一次过视频生成的3DGS图进行视觉导航

视觉导航已成为依赖详细建图和路径规划的传统机器人导航流程的实用替代方案。然而，构建和维护3D地图通常计算成本高且占用大量内存。我们研究在已有大型环境探索视频的情况下的视觉导航问题。这些视频作为视觉参考，使机器人能够重走已探索的轨迹，而无需依赖度量地图。我们提出的方法YOPO-Nav（You Only Pass Once）将环境编码为由相互连接的局部3D高斯泼溅（3DGS）模型组成的紧凑空间表示。在导航过程中，框架将机器人当前的视觉观察与该表示进行对齐，并预测引导其返回演示轨迹的动作。YOPO-Nav采用分层设计：视觉位置识别（VPR）模块提供粗略定位，而局部3DGS模型细化目标和中间位姿以生成控制动作。为了评估我们的方法，我们引入了YOPO-Campus数据集，包含4小时的第一人称视频和机器人控制器输入，来自超过6公里的人工遥操作机器人轨迹。我们使用Clearpath Jackal机器人，在YOPO-Campus的轨迹上对近期的视觉导航方法进行了基准测试。实验结果表明，YOPO-Nav在物理机器人上的真实场景图像目标导航中表现出色。该数据集和代码将公开发布，供视觉导航和场景表示研究使用。

Summary / 总结

YOPO-Nav addresses visual navigation from exploration videos without relying on metric maps. It encodes the environment into a compact graph of interconnected local 3DGS models and uses a hierarchical design: VPR provides coarse localization, and the local 3DGS models refine the goal and intermediate poses to generate control actions. Experiments on trajectories from the YOPO-Campus dataset, run on a Clearpath Jackal robot, show excellent performance in image-goal navigation in real-world scenes.

YOPO-Nav通过使用探索视频解决无需详细3D地图的视觉导航问题，将环境编码为紧凑的3DGS图，并采用层次结构的方法，使用视觉位置识别（VPR）进行粗略定位，3DGS模型进行姿态细化。在YOPO-Campus数据集上的实验表明，YOPO-Nav在物理机器人上的图像目标导航中表现出色，展示了其在真实场景中的有效性。
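
The coarse stage of the hierarchy can be pictured as descriptor retrieval over the nodes recorded along the one-pass video. The sketch below is an assumption-laden illustration, not YOPO-Nav's code: image descriptors are short hand-written vectors, matching uses cosine similarity, the graph is a simple chain of nodes in recording order, and the function names are hypothetical. The fine stage (pose refinement with the local 3DGS models) is not modelled.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coarse_localize(query: Sequence[float],
                    node_descriptors: List[Sequence[float]]) -> int:
    """Index of the graph node whose descriptor best matches the query."""
    return max(range(len(node_descriptors)),
               key=lambda i: cosine(query, node_descriptors[i]))


def next_node(current: int, goal: int) -> int:
    """Step one node along the recorded trajectory toward the goal node."""
    if current == goal:
        return current
    return current + 1 if goal > current else current - 1


# Three nodes recorded in order along a single pass through the environment.
descriptors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
current_node = coarse_localize([0.1, 0.9, 0.2], descriptors)
target_node = next_node(current_node, goal=2)
```

In this picture the retrieval step only decides which local model to consult; the metric detail needed for control would come from that local model, not from a global map.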

Visual Heading Prediction for Autonomous Aerial Vehicles

Authors: Reza Ahmari, Ahmad Mohammadi, Vahid Hemmati, Mohammed Mynuddin, Parham Kebria, Mahmoud Nabil Mahmoud, Xiaohong Yuan, Abdollah Homaifar

First: 2025-12-10T18:27:37+00:00 · Latest: 2025-12-10T18:27:37+00:00

Abstract

The integration of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) is increasingly central to the development of intelligent autonomous systems for applications such as search and rescue, environmental monitoring, and logistics. However, precise coordination between these platforms in real-time scenarios presents major challenges, particularly when external localization infrastructure such as GPS or GNSS is unavailable or degraded [1]. This paper proposes a vision-based, data-driven framework for real-time UAV-UGV integration, with a focus on robust UGV detection and heading angle prediction for navigation and coordination. The system employs a fine-tuned YOLOv5 model to detect UGVs and extract bounding box features, which are then used by a lightweight artificial neural network (ANN) to estimate the UAV's required heading angle. A VICON motion capture system was used to generate ground-truth data during training, resulting in a dataset of over 13,000 annotated images collected in a controlled lab environment. The trained ANN achieves a mean absolute error of 0.1506° and a root mean squared error of 0.1957°, offering accurate heading angle predictions using only monocular camera inputs. Experimental evaluations achieve 95% accuracy in UGV detection. This work contributes a vision-based, infrastructure-independent solution that demonstrates strong potential for deployment in GPS/GNSS-denied environments, supporting reliable multi-agent coordination under realistic dynamic conditions. A demonstration video showcasing the system's real-time performance, including UGV detection, heading angle prediction, and UAV alignment under dynamic conditions, is available at: https://github.com/Kooroshraf/UAV-UGV-Integration

中文标题/摘要

标题：自主空中车辆的视觉航向预测

无人机（UAV）和地面车辆（UGV）的集成日益成为开发用于搜索和救援、环境监测和物流等应用的智能自主系统的中心环节。然而，在实时场景中，这些平台之间的精确协调面临重大挑战，尤其是在GPS或GNSS等外部定位基础设施不可用或性能下降的情况下[1]。本文提出了一种基于视觉的数据驱动框架，用于实时UAV-UGV集成，重点是鲁棒的UGV检测和航向角预测，以实现导航和协调。该系统采用微调的YOLOv5模型检测UGV并提取边界框特征，然后由轻量级人工神经网络（ANN）估计UAV所需的航向角。在训练过程中使用VICON运动捕捉系统生成真实数据，得到超过13,000张标注图像的训练集，这些图像在受控实验室环境中收集。训练后的ANN实现了平均绝对误差0.1506°和均方根误差0.1957°，仅使用单目相机输入即可提供准确的航向角预测。实验评估在UGV检测方面达到了95%的准确性。这项工作贡献了一种基于视觉、无需基础设施的解决方案，展示了在GPS/GNSS受限环境中部署的强大潜力，支持在现实动态条件下可靠的多智能体协调。展示该系统实时性能的演示视频，包括UGV检测、航向角预测和动态条件下UAV对齐，可在以下链接获取：https://github.com/Kooroshraf/UAV-UGV-Integration

Summary / 总结

This paper addresses the challenge of real-time coordination between Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) in GPS-denied environments. It proposes a vision-based framework using a fine-tuned YOLOv5 model for UGV detection and a lightweight ANN to predict the UAV's heading angle. The system, trained on a dataset of over 13,000 images, achieves a mean absolute error of 0.1506° and 95% accuracy in UGV detection, demonstrating potential for reliable multi-agent coordination.

该论文针对GPS受限环境中无人机（UAV）和地面无人车（UGV）的实时协调挑战，提出了一种基于视觉的框架，使用细调后的YOLOv5模型进行UGV检测，并使用轻量级人工神经网络（ANN）预测无人机的航向角。系统实现了0.1506°的平均绝对误差和95%的UGV检测准确率，展示了在动态条件下实现可靠多智能体协调的潜力。
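
The two heading-error metrics quoted in this entry, mean absolute error and root mean squared error, can be computed as in the sketch below. The wrap-around handling in `angular_error` is an assumption added here so that 359° and 1° count as 2° apart; the abstract does not say whether the authors wrap angles, and the sample values are made up for illustration.

```python
import math
from typing import Sequence, Tuple


def angular_error(pred_deg: float, true_deg: float) -> float:
    """Signed difference pred - true, wrapped into [-180, 180) degrees."""
    return (pred_deg - true_deg + 180.0) % 360.0 - 180.0


def mae_rmse(preds: Sequence[float], truths: Sequence[float]) -> Tuple[float, float]:
    """Mean absolute error and root mean squared error, both in degrees."""
    errors = [angular_error(p, t) for p, t in zip(preds, truths)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Illustrative values only, not data from the paper.
sample_mae, sample_rmse = mae_rmse([359.0, 10.0], [1.0, 20.0])
```

RMSE is never smaller than MAE on the same errors, which is consistent with the reported pair of 0.1506° and 0.1957°.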

Human-in-the-Loop and AI: Crowdsourcing Metadata Vocabulary for Materials Science

Authors: Jane Greenberg, Scott McClellan, Addy Ireland, Robert Sammarco, Colton Gerber, Christopher B. Rauch, Mat Kelly, John Kunze, Yuan An, Eric Toberer

First: 2025-12-10T18:22:57+00:00 · Latest: 2025-12-10T18:22:57+00:00

Comments: Metadata and Semantics Research Conference 2025, 14 pages, 7 figures

Abstract

Metadata vocabularies are essential for advancing FAIR and FARR data principles, but their development is constrained by limited human resources and inconsistent standardization practices. This paper introduces MatSci-YAMZ, a platform that integrates artificial intelligence (AI) and human-in-the-loop (HILT), including crowdsourcing, to support metadata vocabulary development. The paper reports on a proof-of-concept use case evaluating the AI-HILT model in materials science, a highly interdisciplinary domain. Six (6) participants affiliated with the NSF Institute for Data-Driven Dynamical Design (ID4) engaged with the MatSci-YAMZ platform over several weeks, contributing term definitions and providing examples to prompt the refinement of AI-generated definitions. Nineteen (19) AI-generated definitions were successfully created, with iterative feedback loops demonstrating the feasibility of AI-HILT refinement. Findings confirm the feasibility of the AI-HILT model, highlighting 1) a successful proof of concept, 2) alignment with FAIR and open-science principles, 3) a research protocol to guide future studies, and 4) the potential for scalability across domains. Overall, MatSci-YAMZ's underlying model has the capacity to enhance semantic transparency and reduce the time required for consensus building and metadata vocabulary development.

中文标题/摘要

标题：人工介入与AI：材料科学领域的元数据词汇众包

元数据词汇对于推进FAIR和FARR数据原则至关重要，但其开发受限于有限的人力资源和不一致的标准化实践。本文介绍了MatSci-YAMZ平台，该平台结合了人工智能（AI）和人在回路（HILT），包括众包，以支持元数据词汇的开发。文章报告了一个在材料科学这一高度跨学科领域中评估AI-HILT模型的概念验证用例。六名（6名）隶属于NSF数据驱动动态设计研究所（ID4）的参与者在数周内使用MatSci-YAMZ平台，贡献术语定义并提供示例，以推动AI生成定义的改进。平台成功创建了十九（19）个AI生成的定义，迭代反馈循环证明了AI-HILT改进的可行性。研究结果证实了AI-HILT模型的可行性，突出了：1）成功的概念验证，2）与FAIR和开放科学原则的契合，3）指导未来研究的研究方案，以及4）跨领域扩展的潜力。总体而言，MatSci-YAMZ的基础模型有能力增强语义透明度，并减少达成共识和开发元数据词汇所需的时间。

Summary / 总结

This paper introduces MatSci-YAMZ, a platform that combines AI and human-in-the-loop (HILT) to develop metadata vocabularies for materials science. Six participants contributed term definitions and examples, which were used to refine AI-generated definitions. The study demonstrated the feasibility of the AI-HILT model, aligning with FAIR and open-science principles, and providing a research protocol for future studies. The findings highlight the potential for scalability across domains and enhanced semantic transparency in metadata development.

该论文介绍了结合AI和人机协作（HILT）的MatSci-YAMZ平台，用于材料科学领域的元数据词汇表开发。六名参与者贡献了术语定义和示例，通过迭代反馈循环来改进AI生成的定义。研究证实了AI-HILT模型的可行性，符合FAIR和开放科学原则，并展示了跨领域的可扩展性。关键发现包括成功概念验证、符合FAIR原则、研究协议以及增强的语义透明度。

Analysis of Dirichlet Energies as Over-smoothing Measures

Authors: Anna Bison, Alessandro Sperduti

First: 2025-12-10T18:17:33+00:00 · Latest: 2025-12-10T18:17:33+00:00

Abstract

We analyze the distinctions between two functionals often used as over-smoothing measures: the Dirichlet energies induced by the unnormalized graph Laplacian and the normalized graph Laplacian. We demonstrate that the latter fails to satisfy the axiomatic definition of a node-similarity measure proposed by Rusch et al. By formalizing fundamental spectral properties of these two definitions, we highlight critical distinctions necessary to select the metric that is spectrally compatible with the GNN architecture, thereby resolving ambiguities in monitoring the dynamics.

中文标题/摘要

标题：Dirichlet 能量作为过度平滑度量的分析

我们分析了两种常被用作过度平滑度量的泛函之间的区别：由未归一化图拉普拉斯算子诱导的 Dirichlet 能量和由归一化图拉普拉斯算子诱导的 Dirichlet 能量。我们证明了后者不满足 Rusch 等人提出的节点相似性度量的公理化定义。通过形式化这两种定义的基本谱性质，我们突出了选择与 GNN 架构谱兼容的度量所需的关键区别，从而消除了监控动态过程时的歧义。

Summary / 总结

This study analyzes the differences between two functionals used as over-smoothing measures: the Dirichlet energies induced by the unnormalized and the normalized graph Laplacians. It shows that the energy induced by the normalized graph Laplacian fails to meet the axiomatic definition of a node-similarity measure proposed by Rusch et al. By formalizing the spectral properties of these two definitions, the research highlights the distinctions needed to choose the metric that is spectrally compatible with the GNN architecture, thus clarifying ambiguities in monitoring GNN dynamics.

研究分析了两种常被用作过平滑度量的泛函之间的区别：由未归一化和归一化图拉普拉斯算子诱导的狄利克雷能量。研究显示，归一化图拉普拉斯算子诱导的狄利克雷能量不满足Rusch等人提出的节点相似性度量的公理化定义。通过形式化这两种定义的基本谱性质，研究强调了选择与GNN架构谱兼容的度量所需的关键区别，从而澄清了监测GNN动态时的歧义。
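
A small numerical example may help show why the two energies behave differently. The sketch assumes scalar node features, an undirected edge list, the symmetric normalization by square-root degree, and no self-loops or 1/2 factor; conventions vary across papers, so this is one common form rather than necessarily the exact definitions used by the authors. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def degrees(num_nodes: int, edges: Sequence[Edge]) -> List[int]:
    """Degree of each node in an undirected graph given as an edge list."""
    deg = [0] * num_nodes
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


def dirichlet_unnormalized(x: Sequence[float], edges: Sequence[Edge]) -> float:
    """Sum over edges of (x_i - x_j)^2."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)


def dirichlet_normalized(x: Sequence[float], edges: Sequence[Edge]) -> float:
    """Sum over edges of (x_i / sqrt(d_i) - x_j / sqrt(d_j))^2."""
    deg = degrees(len(x), edges)
    return sum((x[i] / math.sqrt(deg[i]) - x[j] / math.sqrt(deg[j])) ** 2
               for i, j in edges)


# Path graph 0 - 1 - 2, whose degrees are 1, 2, 1.
path_edges = [(0, 1), (1, 2)]
constant = [1.0, 1.0, 1.0]               # every node carries the same feature
sqrt_degree = [1.0, math.sqrt(2.0), 1.0]  # features proportional to sqrt(degree)

unnormalized_on_constant = dirichlet_unnormalized(constant, path_edges)
normalized_on_constant = dirichlet_normalized(constant, path_edges)
normalized_on_sqrt_degree = dirichlet_normalized(sqrt_degree, path_edges)
```

On this irregular graph the unnormalized energy vanishes for identical node features, whereas the normalized energy is positive there and vanishes instead for features proportional to the square root of the degree. That mismatch illustrates how a normalized energy can fail a requirement that a similarity measure be zero exactly when all node features coincide.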

HPM-KD: Hierarchical Progressive Multi-Teacher Framework for Knowledge Distillation and Efficient Model Compression

Authors: Gustavo Coelho Haase, Paulo Henrique Dourado da Silva

First: 2025-12-10T18:15:15+00:00 · Latest: 2025-12-10T18:15:15+00:00

Comments: 9 pages

Abstract

Knowledge Distillation (KD) has emerged as a promising technique for model compression but faces critical limitations: (1) sensitivity to hyperparameters requiring extensive manual tuning, (2) capacity gap when distilling from very large teachers to small students, (3) suboptimal coordination in multi-teacher scenarios, and (4) inefficient use of computational resources. We present HPM-KD, a framework that integrates six synergistic components: (i) Adaptive Configuration Manager via meta-learning that eliminates manual hyperparameter tuning, (ii) Progressive Distillation Chain with automatically determined intermediate models, (iii) Attention-Weighted Multi-Teacher Ensemble that learns dynamic per-sample weights, (iv) Meta-Learned Temperature Scheduler that adapts temperature throughout training, (v) Parallel Processing Pipeline with intelligent load balancing, and (vi) Shared Optimization Memory for cross-experiment reuse. Experiments on CIFAR-10, CIFAR-100, and tabular datasets demonstrate that HPM-KD achieves 10x-15x compression while maintaining 85% accuracy retention, eliminates the need for manual tuning, and reduces training time by 30-40% via parallelization. Ablation studies confirm independent contribution of each component (0.10-0.98 pp). HPM-KD is available as part of the open-source DeepBridge library.

中文标题/摘要

标题：HPM-KD：分层渐进多师框架用于知识蒸馏和高效模型压缩

知识蒸馏（KD）已成为模型压缩的一种有前途的技术，但面临关键限制：（1）对超参数的敏感性，需要大量手动调整，（2）从非常大的教师模型到小型学生模型时的容量差距，（3）多教师场景下的协调不足，以及（4）计算资源的低效利用。我们提出了**HPM-KD**，一个集成六个协同组件的框架：（i）通过元学习的自适应配置管理器，消除手动超参数调整，（ii）渐进蒸馏链，自动确定中间模型，（iii）注意力加权多师集成，学习每样本动态权重，（iv）元学习的温度调度器，适应训练过程中的温度，（v）并行处理管道，智能负载均衡，以及（vi）共享优化内存，用于跨实验重用。在CIFAR-10、CIFAR-100和表格数据集上的实验表明，HPM-KD：实现10-15倍的压缩，同时保持85%的准确率，消除手动调整的需要，并通过并行化将训练时间减少30-40%。消融研究证实每个组件的独立贡献（0.10-0.98个百分点）。HPM-KD作为开源DeepBridge库的一部分提供。

Summary / 总结

HPM-KD is a framework designed to address limitations in Knowledge Distillation (KD) such as hyperparameter sensitivity, capacity gap, suboptimal multi-teacher coordination, and inefficient resource use. It integrates six components: an Adaptive Configuration Manager, a Progressive Distillation Chain, an Attention-Weighted Multi-Teacher Ensemble, a Meta-Learned Temperature Scheduler, a Parallel Processing Pipeline, and a Shared Optimization Memory. Experiments show that HPM-KD achieves 10x-15x model compression with 85% accuracy retention, eliminates manual tuning, and reduces training time by 30-40%. Ablation studies confirm the independent contribution of each component.

HPM-KD 是一个框架，旨在解决知识蒸馏（KD）中的超参数敏感性、容量差距、多教师协调不足和资源使用效率低等问题。它集成了六个组件：自适应配置管理器、渐进式蒸馏链、注意力加权多教师集成、元学习温度调度器、并行处理管道和共享优化内存。实验表明，HPM-KD 可以实现 10 到 15 倍的模型压缩，保留 85% 的准确率，消除手动调参需求，并通过并行化减少 30-40% 的训练时间。消融研究证实了每个组件的独立贡献。该框架是开源的 DeepBridge 库的一部分。
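
To illustrate how a temperature and per-teacher weights enter a distillation objective, here is a minimal sketch of a classic temperature-scaled KD loss with a weighted teacher mixture. It is not HPM-KD's implementation: the per-sample weights are passed in by hand rather than learned by attention, the temperature is a fixed argument rather than a meta-learned schedule, and the names `softmax` and `multi_teacher_kd_loss` are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def multi_teacher_kd_loss(student_logits, teacher_logits_list, teacher_weights, temperature):
    """KL(weighted teacher mixture || student) at the given temperature.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    weight_sum = sum(teacher_weights)
    weights = [w / weight_sum for w in teacher_weights]  # normalize to sum to 1
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    target = [
        sum(w * probs[k] for w, probs in zip(weights, teacher_probs))
        for k in range(len(student_logits))
    ]
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)
    return temperature ** 2 * kl

# One sample, three classes, two hypothetical teachers weighted 0.7 / 0.3.
loss = multi_teacher_kd_loss(
    student_logits=[0.5, 1.0, 0.2],
    teacher_logits_list=[[0.1, 2.0, 0.3], [0.4, 1.5, 0.9]],
    teacher_weights=[0.7, 0.3],
    temperature=4.0,
)
```

A higher temperature flattens both distributions, so the student is also pushed to match the teachers' relative ranking of the wrong classes.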

Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Authors: Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho

First: 2025-12-10T18:12:29+00:00 · Latest: 2025-12-10T18:12:29+00:00

Abs · PDF · Code1 · Code2

Abstract

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost -- certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

中文标题/摘要

标题：将AI代理与网络安全专业人员在实际渗透测试中的对比

我们首次在实际企业环境中对AI代理与人类网络安全专业人员进行全面评估。我们在一个包含约8,000个主机和12个子网的大学网络上，评估了10名网络安全专业人员和6个现有AI代理以及我们的新代理框架ARTEMIS。ARTEMIS是一个多代理框架，具备动态提示生成、任意子代理和自动漏洞分类功能。在我们的对比研究中，ARTEMIS总体排名第二，发现了9个有效漏洞，有效提交率为82%，并优于10名人类参与者中的9名。虽然现有的框架如Codex和CyAgent在大多数人类参与者中表现不佳，但ARTEMIS展示了与最强参与者相当的技术复杂性和提交质量。我们观察到，AI代理在系统枚举、并行利用和成本方面具有优势——某些ARTEMIS变体的成本为每小时18美元，而专业渗透测试人员的成本为每小时60美元。我们还发现了一些关键能力差距：AI代理的误报率较高，并且难以处理基于GUI的任务。

Summary / 总结

The study evaluates AI agents against human cybersecurity professionals in a real-world penetration testing scenario. Ten professionals, six existing AI agents, and the new ARTEMIS scaffold were tested on a university network of roughly 8,000 hosts across 12 subnets. ARTEMIS, a multi-agent framework that uses dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging, placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of the 10 human participants. Its technical sophistication and submission quality were comparable to those of the strongest human participants, and certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. However, AI agents had higher false-positive rates and struggled with GUI-based tasks.

研究在实际渗透测试环境中评估了AI代理与人类网络安全专业人员的表现。十名专业人员、六个现有AI代理以及新的ARTEMIS框架在一个包含约8000个主机的大学网络上接受了测试。ARTEMIS利用动态提示生成和自动漏洞分类，总体排名第二，发现了9个有效漏洞，有效提交率为82%，超过了10名人类参与者中的9名。ARTEMIS展示了与最强的人类参与者相当的技术熟练度和提交质量，成本上也优于专业渗透测试人员。然而，AI代理的误报率较高，并且在基于GUI的任务上遇到困难。

Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics, Revealing a Three-Stage In-Context Learning Mechanism

Authors: Jiajun Bao, Nicolas Boullé, Toni J. B. Liu, Raphaël Sarfati, Christopher J. Earls

First: 2025-09-08T04:08:50+00:00 · Latest: 2025-12-10T18:10:11+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent three-stage ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.

中文标题/摘要

标题：文本训练的大语言模型可以零样本外推PDE动力学，揭示三阶段上下文学习机制

大型语言模型（LLMs）在一系列任务中展示了广泛的上下文学习（ICL）能力，包括零样本时间序列预测。我们展示了文本训练的基础模型可以在不进行微调或自然语言提示的情况下，从离散化的偏微分方程（PDE）解中准确外推时空动力学。预测准确性随着时间上下文长度的增加而提高，但在更精细的空间离散化时会下降。在多步滚动中，模型在多个时间步长内递归预测未来的空间状态，误差随时间范围呈代数增长，类似于经典有限差分求解器中的全局误差累积。我们将这些趋势解释为上下文神经缩放定律，其中预测质量随着上下文长度和输出长度的增加而可预测地变化。为了更好地理解LLMs如何内部处理PDE解以准确滚动它们，我们分析了输出分布的标记级分布，并发现了一致的三阶段ICL进程：从语法模式模仿开始，过渡到探索性高熵阶段，最终达到自信的、数值上可靠的预测。

Summary / 总结

The study investigates whether text-trained large language models (LLMs) can extrapolate, zero-shot, the spatiotemporal dynamics of discretized partial differential equation (PDE) solutions, without fine-tuning or natural language prompting. Accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. Analysis of token-level output distributions reveals a three-stage in-context learning progression: syntactic pattern imitation, an exploratory high-entropy phase, and confident, numerically grounded predictions.

研究探讨了文本训练的大语言模型（LLMs）在零样本情况下预测偏微分方程（PDEs）的时空动态的能力。模型在更长的时间上下文中表现出更高的准确性，但在更精细的空间离散化时准确性下降。多步滚动中，误差随时间窗口呈代数增长，类似于经典的有限差分求解器。研究发现了一种三阶段的上下文内学习机制：初始的模式模仿阶段、探索性的高熵阶段以及自信且数值上可靠的预测阶段。

Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Authors: Pius Horn, Janis Keuper

First: 2025-12-10T18:01:50+00:00 · Latest: 2025-12-10T18:01:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench

中文标题/摘要

标题：基于PDF数学公式提取的文档解析器基准测试

正确解析PDF中的数学公式对于训练大型语言模型和从学术文献中构建科学知识库至关重要，但现有基准要么完全排除公式，要么缺乏语义感知的评估指标。我们引入了一种新的基准测试框架，以精确的LaTeX地面真实值为中心的合成PDF，使布局、公式和内容特征的系统控制成为可能。一个关键的方法贡献是首次使用LLM作为裁判进行语义公式评估，并结合了一个稳健的两阶段匹配管道来处理解析器输出的一致性问题。通过对250个公式对（30名评估者提供了750个评分）的人工验证，我们证明基于LLM的评估与人工判断的相关性（皮尔逊r=0.78）远高于CDM（r=0.34）和文本相似度（r≈0）。在100个合成文档中对20多种当代PDF解析器（包括专门的OCR模型、视觉-语言模型和基于规则的方法）进行评估，其中包含2000多个公式，揭示了显著的性能差异。我们的研究结果为选择适用于下游应用的解析器的实践者提供了宝贵的见解，并建立了稳健、可扩展的方法，以实现PDF公式提取质量的可重复评估。代码和基准数据：https://github.com/phorn1/pdf-parse-bench

Summary / 总结

The study benchmarks how accurately document parsers extract mathematical formulas from PDFs, a capability that matters for training large language models and building scientific knowledge bases. It introduces a benchmark built on synthetically generated PDFs with precise LaTeX ground truth, uses an LLM as a judge for semantic formula assessment together with a two-stage matching pipeline, and evaluates 20+ contemporary parsers on 100 synthetic documents containing 2,000+ formulas. In human validation on 250 formula pairs, LLM-based evaluation correlates far better with human judgment (Pearson r=0.78) than CDM (r=0.34) or text similarity (r~0). The evaluation reveals significant performance disparities among parsers and provides a methodology for reproducible evaluation.

论文引入了一个新的PDF中数学公式解析基准，关注精确的LaTeX地面真实值合成PDF。它使用LLM作为语义公式评估工具，并采用两阶段匹配管道。人工验证显示，基于LLM的评估与人工判断的相关性很强（皮尔逊r=0.78），优于其他方法。对20多种解析器的评估揭示了显著的性能差异，为从业者提供了宝贵的见解，并建立了可重复的评估方法。
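
The comparison of metrics against human judgment rests on the Pearson correlation coefficient. A minimal sketch of that computation follows; the score lists are made-up illustrative values, not data from the benchmark, and `pearson_r` is a name chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical per-formula scores: human ratings vs. an automatic metric.
human_ratings = [5.0, 4.0, 1.0, 3.0, 2.0]
metric_scores = [0.9, 0.7, 0.2, 0.6, 0.5]
r = pearson_r(human_ratings, metric_scores)
```

A value near 1 means the automatic metric ranks and spaces formula pairs much as the human raters do, while a value near 0 (as reported for text similarity) means it carries almost no linear signal about human judgment.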

FlipLLM: Efficient Bit-Flip Attacks on Multimodal LLMs using Reinforcement Learning

Authors: Khurram Khalil, Khaza Anuarul Hoque

First: 2025-12-10T17:58:18+00:00 · Latest: 2025-12-10T17:58:18+00:00

Comments: Accepted in IEEE HOST 2026

Abs · PDF · Code1 · Code2

Abstract

Generative Artificial Intelligence models, such as Large Language Models (LLMs) and Large Vision Models (VLMs), exhibit state-of-the-art performance but remain vulnerable to hardware-based threats, specifically bit-flip attacks (BFAs). Existing BFA discovery methods lack generalizability and struggle to scale, often failing to analyze the vast parameter space and complex interdependencies of modern foundation models in a reasonable time. This paper proposes FlipLLM, a reinforcement learning (RL) architecture-agnostic framework that formulates BFA discovery as a sequential decision-making problem. FlipLLM combines sensitivity-guided layer pruning with Q-learning to efficiently identify minimal, high-impact bit sets that can induce catastrophic failure. We demonstrate the effectiveness and generalizability of FlipLLM by applying it to a diverse set of models, including prominent text-only LLMs (GPT-2 Large, LLaMA 3.1 8B, and DeepSeek-V2 7B), VLMs such as LLaVA 1.6, and datasets, such as MMLU, MMLU-Pro, VQAv2, and TextVQA. Our results show that FlipLLM can identify critical bits that are vulnerable to BFAs up to 2.5x faster than SOTA methods. We demonstrate that flipping the FlipLLM-identified bits plummets the accuracy of LLaMA 3.1 8B from 69.9% to ~0.2%, and for LLaVA's VQA score from 78% to almost 0%, by flipping as few as 5 and 7 bits, respectively. Further analysis reveals that applying standard hardware protection mechanisms, such as ECC SECDED, to the FlipLLM-identified bit locations completely mitigates the BFA impact, demonstrating the practical value of our framework in guiding hardware-level defenses. FlipLLM offers the first scalable and adaptive methodology for exploring the BFA vulnerability of both language and multimodal foundation models, paving the way for comprehensive hardware-security evaluation.

中文标题/摘要

标题：FlipLLM：使用强化学习针对多模态LLM的高效位翻转攻击

生成式人工智能模型，如大型语言模型（LLM）和大型视觉模型（VLM），表现出最先进的性能，但仍然容易受到基于硬件的威胁，特别是位翻转攻击（BFAs）。现有的BFA发现方法缺乏普适性，难以扩展，往往无法在合理的时间内分析现代基础模型庞大的参数空间和复杂的相互依赖关系。本文提出FlipLLM，这是一种基于强化学习（RL）的架构无关框架，将BFA发现建模为顺序决策问题。FlipLLM结合了敏感性引导的层剪枝与Q学习，以高效地识别出能够导致灾难性失败的最小、高影响位集。我们通过将其应用于包括著名纯文本LLM（GPT-2 Large、LLaMA 3.1 8B、DeepSeek-V2 7B）、视觉语言模型（如LLaVA 1.6）以及数据集（如MMLU、MMLU-Pro、VQAv2、TextVQA）的多样化模型集合，展示了FlipLLM的有效性和普适性。我们的结果表明，FlipLLM可以比当前最佳方法快2.5倍的速度识别出关键的易受BFAs影响的位。我们展示了通过翻转FlipLLM识别的位，可以使LLaMA 3.1 8B的准确率从69.9%降至约0.2%，以及使LLaVA的VQA分数从78%降至几乎0%，仅需翻转5个和7个位。进一步的分析表明，将标准硬件保护机制，如ECC SECDED，应用于FlipLLM识别的位位置可以完全缓解BFAs的影响，证明了我们框架在指导硬件级防御方面的实际价值。FlipLLM提供了探索语言和多模态基础模型BFAs脆弱性的第一种可扩展和自适应方法，为全面的硬件安全评估铺平了道路。

Summary / 总结

FlipLLM is a reinforcement learning-based, architecture-agnostic framework for efficiently discovering the bits in large language models (LLMs) and large vision models (VLMs) that are most vulnerable to bit-flip attacks (BFAs). It combines sensitivity-guided layer pruning with Q-learning to identify minimal, high-impact bit sets that can cause catastrophic failure. FlipLLM identifies critical bits up to 2.5x faster than state-of-the-art methods. Flipping as few as 5 identified bits drops the accuracy of LLaMA 3.1 8B from 69.9% to about 0.2%, and flipping as few as 7 drops LLaVA's VQA score from 78% to almost 0%. Applying standard hardware protection such as ECC SECDED to the identified bit locations completely mitigates the BFA impact, which shows the framework's practical value in guiding hardware-level defenses.

FlipLLM 是一种基于强化学习的框架，旨在高效地在多模态大型语言模型（LLMs）和大型视觉模型（VLMs）中发现对位翻转攻击（BFAs）的脆弱位。它结合了敏感性引导的层剪枝与 Q 学习来识别能够导致灾难性失败的最小且高影响位集。FlipLLM 在各种模型，包括纯文本 LLMs、VLMs 和数据集上展示了优越的有效性和通用性。其识别关键位的速度最多比最先进的方法快 2.5 倍，并表明翻转少数几个位可以显著降低模型的准确性。该框架还强调了在 FlipLLM 识别的位位置上应用硬件保护机制（如 ECC SECDED）对于缓解 BFA 影响的重要性。
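
Why a handful of flipped bits can be catastrophic is easy to see at the level of a single stored weight. The sketch below assumes IEEE 754 float32 storage, which the abstract does not specify (real deployments may use other formats such as quantized integers), and it shows only the effect of one flip, not FlipLLM's search procedure. The helper name `flip_bit` is hypothetical.

```python
import struct

def flip_bit(value, bit_index):
    """Flip one bit (0 = least significant) of a float32 and return the new value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))  # float32 -> raw 32 bits
    bits ^= 1 << bit_index                                   # toggle the chosen bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped

weight = 0.5
low_impact = flip_bit(weight, 0)    # lowest mantissa bit: a tiny perturbation
high_impact = flip_bit(weight, 30)  # top exponent bit: 0.5 becomes 2**127
```

The same single-bit change is negligible in the mantissa and enormous in the exponent, which is why locating the few high-impact positions matters far more than the number of bits flipped.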

Diffusion Posterior Sampler for Hyperspectral Unmixing with Spectral Variability Modeling

Authors: Yimin Zhu, Lincoln Linlin Xu

First: 2025-12-10T17:57:52+00:00 · Latest: 2025-12-10T17:57:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Linear spectral mixture models (LMM) provide a concise form to disentangle the constituent materials (endmembers) and their corresponding proportions (abundance) in a single pixel. The critical challenges are how to model the spectral prior distribution and spectral variability. Prior knowledge and spectral variability can be rigorously modeled under the Bayesian framework, where posterior estimation of abundance is derived by combining observed data with the endmember prior distribution. Considering the key challenges and the advantages of the Bayesian framework, a novel method using a diffusion posterior sampler for semiblind unmixing, denoted as DPS4Un, is proposed to deal with these challenges with the following features: (1) we view the pretrained conditional spectrum diffusion model as a posterior sampler, which can combine the learned endmember prior with observation to get the refined abundance distribution. (2) Instead of using the existing spectral library as prior, which may raise bias, we establish the image-based endmember bundles within superpixels, which are used to train the endmember prior learner with a diffusion model. Superpixels make sure the sub-scene is more homogeneous. (3) Instead of using the image-level data consistency constraint, the superpixel-based data fidelity term is proposed. (4) The endmember is initialized as Gaussian noise for each superpixel region; DPS4Un then iteratively updates the abundance and endmember, contributing to spectral variability modeling. The experimental results on three real-world benchmark datasets demonstrate that DPS4Un outperforms the state-of-the-art hyperspectral unmixing methods.

中文标题/摘要

标题：基于光谱变异建模的扩散后验采样器在高光谱解混中的应用

线性光谱混合模型（LMM）提供了一种简洁的形式来分离单个像素中的组成材料（端元）及其相应的比例（丰度）。关键挑战在于如何建模光谱先验分布和光谱变异。在贝叶斯框架下，可以严格建模先验知识和光谱变异，通过结合观测数据和端元先验分布来推导丰度的后验估计。考虑到关键挑战和贝叶斯框架的优势，提出了一种名为DPS4Un的新型方法，使用扩散后验采样器进行半盲解混，具有以下特点：(1) 将预训练的条件光谱扩散模型视为后验采样器，可以结合学习到的端元先验与观测数据以获得细化的丰度分布。(2) 与使用现有光谱库作为先验可能引起偏差不同，我们基于超像素建立图像端元束，并使用扩散模型训练端元先验学习器。超像素确保子场景更加均匀。(3) 与使用图像级数据一致性约束不同，提出了基于超像素的数据保真项。(4) 对于每个超像素区域，端元初始化为高斯噪声，DPS4Un迭代更新丰度和端元，有助于光谱变异建模。在三个真实世界基准数据集上的实验结果表明，DPS4Un优于最先进的高光谱解混方法。

Summary / 总结

The paper addresses the challenges of modeling spectral prior distribution and variability in linear spectral mixture models (LMM) for hyperspectral unmixing. It proposes a novel method, DPS4Un, which uses a diffusion posterior sampler to refine the abundance distribution by combining learned endmember priors with observations. The method establishes image-based endmember bundles within superpixels to train the endmember prior learner, proposes a superpixel-based data fidelity term, and initializes endmembers as Gaussian noise for each superpixel region. Experimental results show that DPS4Un outperforms existing methods on three real-world benchmark datasets.

该研究提出了一种名为DPS4Un的新方法，用于解决高光谱解混中的光谱先验建模和变异性挑战。该方法使用扩散后验采样器结合学习到的端元先验与观测数据来细化丰度分布。方法在超像素内建立基于图像的端元簇来训练端元先验学习器，并引入了基于超像素的数据保真项。实验结果表明，DPS4Un在三个实际数据集上优于现有方法。
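
For readers new to unmixing, the linear mixture model itself can be sketched in a few lines. This is the plain LMM with fixed, known endmembers and no spectral variability, not DPS4Un: a pixel is a weighted sum of endmember spectra, with abundances that are non-negative and sum to one. The two-endmember closed-form solver `unmix_two` and the toy spectra are assumptions made for illustration.

```python
def mix(endmembers, abundances):
    """Forward LMM: pixel[b] = sum_k abundances[k] * endmembers[k][b]."""
    bands = len(endmembers[0])
    return [
        sum(a * spectrum[b] for a, spectrum in zip(abundances, endmembers))
        for b in range(bands)
    ]

def unmix_two(pixel, e1, e2):
    """Least-squares abundances for two endmembers under sum-to-one
    and non-negativity constraints (a 1-D problem, so clipping is exact)."""
    diff = [p - q for p, q in zip(e1, e2)]
    numerator = sum((y - q) * d for y, q, d in zip(pixel, e2, diff))
    denominator = sum(d * d for d in diff)
    a1 = min(1.0, max(0.0, numerator / denominator))
    return [a1, 1.0 - a1]

# Toy three-band spectra for two hypothetical materials.
soil = [1.0, 0.0, 0.5]
vegetation = [0.0, 1.0, 0.5]
pixel = mix([soil, vegetation], [0.25, 0.75])
recovered = unmix_two(pixel, soil, vegetation)
```

When the true endmember spectra vary from pixel to pixel, a fixed pair like this no longer fits well, which is the spectral variability problem the abstract targets.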

MedForget: Hierarchy-Aware Multimodal Unlearning Testbed for Medical AI

Authors: Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal

First: 2025-12-10T17:55:06+00:00 · Latest: 2025-12-10T17:55:06+00:00

Comments: Dataset and Code: https://github.com/fengli-wu/MedForget

Abs · PDF · Code1 · Code2 · Code3

Abstract

Pretrained Multimodal Large Language Models (MLLMs) are increasingly deployed in medical AI systems for clinical reasoning, diagnosis support, and report generation. However, their training on sensitive patient data raises critical privacy and compliance challenges under regulations such as HIPAA and GDPR, which enforce the "right to be forgotten". Unlearning, the process of tuning models to selectively remove the influence of specific training data points, offers a potential solution, yet its effectiveness in complex medical settings remains underexplored. To systematically study this, we introduce MedForget, a Hierarchy-Aware Multimodal Unlearning Testbed with explicit retain and forget splits and evaluation sets containing rephrased variants. MedForget models hospital data as a nested hierarchy (Institution -> Patient -> Study -> Section), enabling fine-grained assessment across eight organizational levels. The benchmark contains 3840 multimodal (image, question, answer) instances, each hierarchy level having a dedicated unlearning target, reflecting distinct unlearning challenges. Experiments with four SOTA unlearning methods on three tasks (generation, classification, cloze) show that existing methods struggle to achieve complete, hierarchy-aware forgetting without reducing diagnostic performance. To test whether unlearning truly deletes hierarchical pathways, we introduce a reconstruction attack that progressively adds hierarchical level context to prompts. Models unlearned at a coarse granularity show strong resistance, while fine-grained unlearning leaves models vulnerable to such reconstruction. MedForget provides a practical, HIPAA-aligned testbed for building compliant medical AI systems.

中文标题/摘要

标题：MedForget：面向医疗AI的层次感知多模态遗忘测试平台

预训练多模态大型语言模型（MLLMs）在医疗AI系统中越来越多地用于临床推理、诊断支持和报告生成。然而，它们在敏感患者数据上的训练引发了在HIPAA和GDPR等法规下的关键隐私和合规挑战，这些法规赋予了“被遗忘的权利”。遗忘，即通过调整模型以选择性地去除特定训练数据点的影响，提供了一种潜在的解决方案，但在复杂的医疗环境中其有效性仍待探索。为了系统地研究这一问题，我们引入了MedForget，一个层次感知的多模态遗忘测试平台，具有明确的保留和遗忘分割以及包含重述变体的评估集。MedForget将医院数据视为嵌套层次结构（机构 -> 患者 -> 研究 -> 部分），使评估跨越八个组织级别成为可能。基准数据集包含3840个（图像、问题、答案）多模态实例，每个层次级别都有专门的遗忘目标，反映了不同的遗忘挑战。对三个任务（生成、分类、填空）的四种SOTA遗忘方法进行的实验表明，现有方法在不降低诊断性能的情况下难以实现完整的、层次感知的遗忘。为了测试遗忘是否真正删除了层次路径，我们引入了一种重建攻击，逐步向提示中添加层次级别的上下文。粗粒度遗忘的模型表现出强大的抵抗力，而细粒度遗忘使模型容易受到这种重建攻击。MedForget提供了一个实用的、符合HIPAA的测试平台，用于构建合规的医疗AI系统。

Summary / 总结

MedForget is a hierarchy-aware multimodal unlearning testbed designed to study the effectiveness of unlearning in medical AI systems. It uses a nested hierarchy of hospital data (Institution -> Patient -> Study -> Section) to assess unlearning across eight levels. Experiments with four state-of-the-art unlearning methods on three tasks (generation, classification, cloze) reveal that current methods struggle to achieve complete, hierarchy-aware forgetting without compromising diagnostic performance. Additionally, a reconstruction attack shows that coarse-grained unlearning is more resistant to hierarchical pathway reconstruction compared to fine-grained unlearning.

MedForget 是一个层次感知的多模态遗忘测试平台，旨在研究遗忘在医疗AI系统中的有效性。它使用医院数据的嵌套层次结构（机构 -> 患者 -> 研究 -> 部分）来评估八个层次上的遗忘效果。实验表明，现有的四种最先进的遗忘方法在三个任务（生成、分类、填空）上难以实现完全的、层次感知的遗忘，而不损害诊断性能。细粒度的遗忘特别容易受到重建攻击，而粗粒度的遗忘则表现出更强的抵抗力。这项工作提供了一个实用的、符合HIPAA标准的测试平台，用于开发合规的医疗AI系统。

UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

Authors: Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen

First: 2025-12-10T17:50:29+00:00 · Latest: 2025-12-10T17:50:29+00:00

Comments: Project Page: https://seed-uniugp.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

中文标题/摘要

标题：UniUGP：统一理解、生成与规划以实现端到端自动驾驶

自动驾驶（AD）系统在长尾场景中因世界知识有限和视觉动态建模能力弱而遇到困难。现有的基于视觉-语言-动作（VLA）的方法无法利用未标记的视频进行视觉因果学习，而基于世界模型的方法缺乏大型语言模型的推理能力。在本文中，我们构建了多个专门的数据集，提供了复杂场景的推理和规划注释。然后，提出了一种统一的理解-生成-规划框架，名为UniUGP，通过混合专家架构协同场景推理、未来视频生成和轨迹规划。通过整合预训练的VLM和视频生成模型，UniUGP利用视觉动态和语义推理来增强规划性能。通过多帧观测和语言指令作为输入，它生成可解释的推理链、物理上一致的轨迹和连贯的未来视频。我们引入了一种四阶段训练策略，逐步在多个现有AD数据集以及提出的专门数据集上构建这些能力。实验表明，UniUGP在感知、推理和决策方面达到了最先进的性能，并且在应对具有挑战性的长尾情况时具有更好的泛化能力。

Summary / 总结

The paper addresses the limitations of existing autonomous driving systems in handling long-tail scenarios by proposing UniUGP, a unified framework that combines understanding, generation, and planning. It leverages specialized datasets and a hybrid expert architecture integrating pre-trained vision-language models and video generation models to enhance visual dynamics and semantic reasoning. The framework produces interpretable reasoning, physically consistent trajectories, and coherent future videos. Experiments show superior performance in perception, reasoning, and decision-making, especially in challenging long-tail situations.

论文通过提出UniUGP统一框架来解决现有自动驾驶系统在处理长尾场景时的局限性，该框架将理解、生成和规划相结合，利用专门的数据集和结合预训练的视觉语言模型和视频生成模型的混合专家架构来增强推理和规划能力。该框架处理多帧观测和语言指令以生成可解释的推理、物理上一致的轨迹和连贯的未来视频。实验表明，UniUGP在感知、推理和决策方面优于现有方法，特别是在具有挑战性的长尾情况下的泛化能力更强。

Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation

Authors: Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Adel Moumen, Sanchit Gandhi

First: 2025-10-08T12:44:51+00:00 · Latest: 2025-12-10T17:30:55+00:00

Comments: Leaderboard: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard ; Code: https://github.com/huggingface/open_asr_leaderboard

Abs · PDF · Code1 · Code2 · Code3

Abstract

Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including a dedicated multilingual track. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.

中文标题/摘要

标题：开放ASR排行榜：迈向可再现和透明的多语言语音识别评估

尽管取得了快速进展，ASR评估仍然充斥着简短的英语形式，且效率很少被报告。我们提出了开放ASR排行榜，这是一个完全可再现的基准和交互式排行榜，比较了60多个开源和专有系统在11个数据集上的表现，包括一个专门的多语言赛道。我们标准化了文本规范化，并报告了单词错误率（WER）和逆实时因子（RTFx），以实现公平的准确性和效率比较。对于英语转录，Conformer编码器与LLM解码器结合使用在平均WER上表现最佳，但速度较慢，而CTC和TDT解码器在RTFx上表现更好，使其适用于长格式和离线使用。Whisper衍生的编码器针对英语进行微调可以提高准确性，但通常会牺牲多语言覆盖范围。所有代码和数据集加载器均已开源，以支持透明和可扩展的评估。
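
The two reported metrics are simple to state in code. The sketch below computes word error rate as word-level edit distance divided by the reference length, and inverse real-time factor as audio duration divided by processing time. It omits the leaderboard's text normalization step entirely, and the function names are chosen for this example rather than taken from the leaderboard's code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumes a non-empty reference and whitespace tokenization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    previous = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

def inverse_real_time_factor(audio_seconds, processing_seconds):
    """RTFx: seconds of audio transcribed per second of compute (higher is faster)."""
    return audio_seconds / processing_seconds

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
rtfx = inverse_real_time_factor(audio_seconds=600.0, processing_seconds=3.0)
```

Reporting both numbers side by side is what makes the accuracy-efficiency trade-off visible: a system can win on WER while being far slower in RTFx.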

SAFT: Structure-Aware Fine-Tuning of LLMs for AMR-to-Text Generation

Authors: Rafiq Kamel, Filippo Guerranti, Simon Geisler, Stephan Günnemann

First: 2025-07-15T18:12:57+00:00 · Latest: 2025-12-10T17:26:08+00:00

Comments: Accepted at the KDD2025 Workshop on Structured Knowledge for LLMs

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) are increasingly applied to tasks involving structured inputs such as graphs. Abstract Meaning Representations (AMRs), which encode rich semantics as directed graphs, offer a rigorous testbed for evaluating LLMs on text generation from such structures. Yet, current methods often arbitrarily linearize AMRs, discarding key structural cues, or rely on architectures incompatible with standard LLMs. We introduce SAFT, a structure-aware fine-tuning approach that injects graph topology into pretrained LLMs without architectural changes. We compute direction-sensitive positional encodings from the magnetic Laplacian of transformed AMRs and project them into the embedding space of the LLM. While possibly applicable to any graph-structured inputs, we focus on AMR-to-text generation as a representative and challenging benchmark. SAFT sets a new state-of-the-art on AMR 3.0 with a 3.5 BLEU improvement over baselines. Gains scale with graph complexity, highlighting the value of structure-aware representations in enhancing LLM performance. SAFT offers a general and effective pathway for bridging structured data and language models.

中文标题/摘要

标题：SAFT：结构感知的LLM微调以实现AMR到文本生成

大型语言模型（LLMs）越来越多地应用于涉及结构输入的任务，如图形。抽象意义表示（AMRs），它们以有向图的形式编码丰富的语义，为评估LLMs从此类结构生成文本的能力提供了严格的测试平台。然而，当前的方法往往随意将AMRs线性化，丢弃了关键的结构线索，或者依赖于与标准LLMs不兼容的架构。我们提出了SAFT，这是一种结构感知的微调方法，它在不改变架构的情况下将图形拓扑注入预训练的LLMs中。我们从转换后的AMRs的磁拉普拉斯算子中计算方向敏感的位置编码，并将其投影到LLMs的嵌入空间中。虽然可能适用于任何图形结构输入，但我们专注于AMR到文本生成作为代表性且具有挑战性的基准。SAFT在AMR 3.0上达到了新的最佳性能，相对于基线提高了3.5个BLEU分数。收益随着图形复杂性的增加而增加，突显了结构感知表示在增强LLM性能方面的价值。SAFT提供了一种通用且有效的途径，用于连接结构化数据和语言模型。

Summary / 总结

The paper introduces SAFT, a structure-aware fine-tuning approach for Large Language Models (LLMs) to generate text from Abstract Meaning Representations (AMRs). SAFT injects graph topology into pretrained LLMs without changing their architecture, using direction-sensitive positional encodings derived from the magnetic Laplacian of AMRs. On the AMR 3.0 benchmark, SAFT achieves a 3.5 BLEU improvement over baselines, demonstrating the importance of structure-aware representations in enhancing LLM performance, especially with more complex graphs.

论文提出了SAFT，这是一种结构感知的微调方法，用于使大型语言模型（LLMs）能够从抽象意义表示（AMRs）生成文本。SAFT通过使用AMRs的磁拉普拉斯变换得到的方向敏感位置编码，将图拓扑注入到预训练的LLMs中，而不改变其架构。该方法在AMR 3.0基准测试中比基线方法提高了3.5个BLEU分数，展示了结构感知表示在增强LLM性能中的重要性，尤其是在处理更复杂的图时。
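
The magnetic Laplacian is what lets the positional encodings see edge direction. A minimal sketch of the usual construction follows, assuming the common definition L = D_s - A_s * exp(i * 2*pi*q * (A - A^T)) applied elementwise, where A_s is the symmetrized adjacency and D_s its degree matrix. The value q = 0.25 and the function name are illustrative, and the sketch stops at building the matrix: SAFT's AMR transformation, the eigendecomposition, and the projection into the LLM embedding space are not shown.

```python
import cmath
import math

def magnetic_laplacian(adjacency, q=0.25):
    """Build the magnetic Laplacian of a directed graph as nested lists of complex numbers.

    Direction is stored in the complex phase: an edge i -> j and the reverse
    edge j -> i get opposite phases, so the matrix stays Hermitian.
    """
    n = len(adjacency)
    laplacian = [[0j] * n for _ in range(n)]
    for i in range(n):
        degree = 0.0
        for j in range(n):
            symmetric = 0.5 * (adjacency[i][j] + adjacency[j][i])
            phase = 2.0 * math.pi * q * (adjacency[i][j] - adjacency[j][i])
            laplacian[i][j] = -symmetric * cmath.exp(1j * phase)
            degree += symmetric
        laplacian[i][i] += degree  # add the symmetrized degree on the diagonal
    return laplacian

# A single directed edge 0 -> 1.
directed = [[0, 1], [0, 0]]
lap = magnetic_laplacian(directed, q=0.25)
lap_undirected_view = magnetic_laplacian(directed, q=0.0)
```

With q = 0 the phase disappears and the matrix reduces to the ordinary Laplacian of the symmetrized graph, so q controls how strongly direction is encoded.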

ChronusOmni: Improving Time Awareness of Omni Large Language Models

Authors: Yijing Chen, Yihan Wu, Kaisi Guan, Yuchen Ren, Yuyue Wang, Ruihua Song, Liyun Ru

First: 2025-12-10T17:22:42+00:00 · Latest: 2025-12-10T17:22:42+00:00

Comments: Code available at https://github.com/YJCX330/Chronus/

Abs · PDF · Code1 · Code2 · Code3

Abstract

Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on the explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality, and overlook implicit temporal grounding across modalities--for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs--despite such cross-modal temporal relations being prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally-accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV with more than 30% improvement and top results on most metrics on other temporal grounding benchmarks. This highlights the strong temporal awareness of our model across modalities, while preserving general video and audio understanding capabilities.

中文标题/摘要

标题：ChronusOmni：提高全领域大型语言模型的时间意识

时间意识是全领域大型语言模型的一项基本能力，特别是在理解长视频和回答复杂问题方面尤为重要。以往的方法主要针对视觉-语言场景，专注于明确的时间定位问题，如识别视觉事件发生的时间或确定特定时间发生的事件。然而，这些方法往往未能充分利用音频模态，并忽视了跨模态的时间定位——例如，识别角色说话时视觉上呈现的内容，或确定视觉事件发生时所说的内容——尽管这种跨模态的时间关系在现实世界中普遍存在。在本文中，我们提出ChronusOmni，这是一种旨在增强对显性和隐性视听时间定位的时间意识的全领域大型语言模型。首先，我们通过在每个时间单位中交替使用基于文本的时间戳标记和视觉与音频表示，实现跨模态的统一时间建模。其次，为了确保正确的时序关系并加强细粒度的时间推理，我们结合了强化学习和特别设计的奖励函数。此外，我们构建了ChronusAV数据集，这是一个时间准确、模态完整且跨模态对齐的数据集，以支持视听时间定位任务的训练和评估。实验结果表明，ChronusOmni在ChronusAV上达到了最先进的性能，比其他时间定位基准提高了超过30%的性能，并在大多数指标上取得了最佳结果。这突显了我们的模型在跨模态中的强大时间意识，同时保持了对视频和音频的一般理解能力。

Summary / 总结

ChronusOmni is designed to improve the time awareness of omni large language models, particularly for understanding long videos and answering complex questions. It interleaves timestamp tokens with visual and audio representations and uses reinforcement learning to enforce correct temporal ordering. Experimental results show that ChronusOmni outperforms previous methods by more than 30% on the ChronusAV dataset and achieves top results on most metrics on other temporal grounding benchmarks.

ChronusOmni旨在提高大型语言模型的时间意识，特别是在理解长视频和回答复杂问题方面。它通过将文本时间戳标记与视觉和音频表示相结合，并使用强化学习来确保正确的时序顺序来解决先前方法的局限性。实验表明，ChronusOmni在ChronusAV数据集上的性能比现有模型高出30%以上，并在其他时间定位基准上的大多数指标上取得了最佳结果。
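
The interleaving idea can be sketched as plain sequence construction. This is only an illustration of the layout: the `<t=...s>` token format, the one-second time unit, the timestamp-visual-audio ordering, and the function name are all assumptions, and string placeholders stand in for the model's actual visual and audio representations.

```python
def interleave_with_timestamps(visual_units, audio_units, seconds_per_unit=1.0):
    """Place a text timestamp token before the visual and audio content of each time unit."""
    sequence = []
    for index, (visual, audio) in enumerate(zip(visual_units, audio_units)):
        sequence.append(f"<t={index * seconds_per_unit:.1f}s>")  # hypothetical token format
        sequence.append(visual)
        sequence.append(audio)
    return sequence

tokens = interleave_with_timestamps(["v0", "v1"], ["a0", "a1"])
```

Because both modalities of a time unit sit behind the same timestamp token, a question about what is heard when something is seen can be answered by attending within one unit.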

Fast Factorized Learning: Powered by In-Memory Database Systems

Authors: Bernhard Stöckl, Maximilian E. Schüle

First: 2025-12-10T17:14:37+00:00 · Latest: 2025-12-10T17:14:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Learning models over factorized joins avoids redundant computations by identifying and pre-computing shared cofactors. Previous work has investigated the performance gain when computing cofactors on traditional disk-based database systems. Due to the absence of published code, the experiments could not be reproduced on in-memory database systems. This work describes the implementation when using cofactors for in-database factorized learning. We benchmark our open-source implementation for learning linear regression on factorized joins with PostgreSQL -- as a disk-based database system -- and HyPer -- as an in-memory engine. The evaluation shows a performance gain of factorized learning on in-memory database systems of 70% over non-factorized learning and of a factor of 100 over disk-based database systems. Thus, modern database engines can contribute to the machine learning pipeline by pre-computing aggregates prior to data extraction to accelerate training.

中文标题/摘要

标题：快速因子化学习：由内存数据库系统驱动

通过因子化连接学习模型可以避免冗余计算，通过识别和预计算共享因子。先前的工作研究了在传统基于磁盘的数据库系统上计算因子时的性能提升。由于缺乏已发布的代码，实验无法在内存数据库系统上重现。本工作描述了在数据库中使用因子进行因子化学习的实现。我们使用PostgreSQL（基于磁盘的数据库系统）和HyPer（内存引擎）对因子化连接上的线性回归学习进行了基准测试。评估表明，在内存数据库系统上进行因子化学习的性能比非因子化学习提高了70%，比基于磁盘的数据库系统提高了100倍。因此，现代数据库引擎可以通过在数据提取之前预计算聚合来加速训练，从而参与到机器学习管道中。

Summary / 总结

This study explores the performance benefits of using factorized joins in machine learning models, specifically on in-memory database systems. By pre-computing shared cofactors, the study demonstrates a 70% performance gain over non-factorized learning and a 100-fold improvement compared to disk-based systems. The implementation was benchmarked using PostgreSQL and HyPer, showing the potential of modern database engines to accelerate machine learning training.

研究旨在通过使用因子化连接来避免冗余计算，提高机器学习模型的效率。方法包括预计算共享因子，并在PostgreSQL（一种基于磁盘的系统）和HyPer（一种内存引擎）上进行性能基准测试。结果显示，与非因子化学习相比，性能提高了70%，与基于磁盘的系统相比，提高了100倍。
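
To make the cofactor idea concrete, here is a minimal sketch, assuming a two-table join R(k, a) JOIN S(k, b) and plain Python rather than in-database code. The aggregates a linear regression needs (count, sums, and sums of products) are assembled from per-key statistics of each table, so the join is never materialized. All function and variable names are illustrative and do not come from the paper's implementation.

```python
from collections import defaultdict
from itertools import product


def per_key_stats(rows):
    """Per join key: [row count, sum of values, sum of squared values]."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for key, value in rows:
        entry = stats[key]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    return stats


def factorized_cofactors(r_rows, s_rows):
    """Cofactors of R(k, a) JOIN S(k, b) from per-key aggregates only."""
    r, s = per_key_stats(r_rows), per_key_stats(s_rows)
    out = {"n": 0, "a": 0.0, "b": 0.0, "aa": 0.0, "ab": 0.0, "bb": 0.0}
    for key in r.keys() & s.keys():          # keys that actually join
        (rc, ra, raa), (sc, sb, sbb) = r[key], s[key]
        out["n"] += rc * sc                  # number of joined tuples
        out["a"] += ra * sc                  # each a repeats sc times
        out["b"] += rc * sb                  # each b repeats rc times
        out["aa"] += raa * sc
        out["ab"] += ra * sb                 # sum over pairs factorizes
        out["bb"] += rc * sbb
    return out


def naive_cofactors(r_rows, s_rows):
    """Reference result: the same aggregates over the materialized join."""
    out = {"n": 0, "a": 0.0, "b": 0.0, "aa": 0.0, "ab": 0.0, "bb": 0.0}
    for (rk, a), (sk, b) in product(r_rows, s_rows):
        if rk == sk:
            out["n"] += 1
            out["a"] += a
            out["b"] += b
            out["aa"] += a * a
            out["ab"] += a * b
            out["bb"] += b * b
    return out


R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (3, 7.0)]
cofactors = factorized_cofactors(R, S)
```

The factorized version touches each input row once, while the naive version iterates over every pair of rows. The saving grows with the number of tuples that share a join key.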

Interpretation as Linear Transformation: A Cognitive-Geometric Model of Belief and Meaning

Authors: Chainarong Amornbunchornvej

First: 2025-12-10T17:13:01+00:00 · Latest: 2025-12-10T17:13:01+00:00

Comments: The first draft of cognitive geometry model

Abs · PDF · Code1 · Code2

Abstract

This paper develops a geometric framework for modeling belief, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Beliefs are formalized as structured vectors (abstract beings) whose transmission is mediated by linear interpretation maps. A belief survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and belief death. Within this framework, I show how belief distortion, motivational drift, counterfactual evaluation, and the limits of mutual understanding arise from purely algebraic constraints. A central result, the "No-Null-Space Leadership Condition", characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries. The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing belief dynamics across heterogeneous agents.

中文标题/摘要

标题：作为线性变换的解释：信念与意义的认知-几何模型

本文发展了一种几何框架，用于建模认知异质性代理之间的信念、动机和影响。每个代理由一个个性化的价值空间表示，这是一种向量空间，编码代理内部通过其解释和评估意义的维度。信念被形式化为结构化的向量-抽象实体-其传递由线性解释映射中介。信念只有在避免这些映射的零空间时才能通过通信，从而产生可理解性、误解和信念死亡的结构标准。在这一框架内，我展示了信念失真、动机漂移、反事实评估以及相互理解的局限性是如何纯粹由代数约束产生的。一个核心结果-“无零空间领导条件”-将领导性描述为表示可达性的一种属性，而不是说服或权威。更广泛地说，该模型解释了抽象实体如何在穿越不同的认知几何时传播、变异或消失。该解释统一了概念空间、社会认识论和人工智能价值对齐的洞见，通过结构兼容性而非共享信息或理性来奠定意义保存的基础。我主张，这种认知-几何视角澄清了人类和人工系统中影响的知性边界，并为分析异质代理之间的信念动态提供了普遍的基础。

Summary / 总结

This paper proposes a geometric framework to model belief, motivation, and influence among cognitively diverse agents. Each agent is represented by a personalized value space, and beliefs are formalized as vectors interpreted through linear maps. The key finding is that a belief survives communication only if it avoids the null spaces of these maps, defining intelligibility and belief death. The model explains belief distortion and limits of mutual understanding through algebraic constraints and characterizes leadership based on representational reachability rather than persuasion or authority.

本文提出了一种几何框架来建模认知多样性的代理之间的信念、动机和影响。每个代理由一个个性化的价值空间表示，信念被形式化为通过线性映射解释的向量。关键发现是，信念只有在避免这些映射的零空间时才能通过通信，从而定义了可理解性和信念死亡的标准。该模型通过代数约束解释了信念扭曲和相互理解的局限性，并将领导力定义为表征可达性，而非说服或权威。
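
The null-space criterion can be illustrated with a small NumPy sketch. The matrices and belief vectors below are made-up examples, and the helper names `survives` and `transmit` are hypothetical. The sketch only shows the algebra stated in the abstract: a belief that a listener's linear interpretation map sends to zero does not survive communication.

```python
import numpy as np


def survives(interpretation_map, belief, tol=1e-9):
    """True if the linear map does not send the belief to (near) zero."""
    received = interpretation_map @ belief
    return bool(np.linalg.norm(received) > tol)


def transmit(maps, belief):
    """Pass a belief through a chain of interpretation maps, in order."""
    for m in maps:
        belief = m @ belief
    return belief


# Hypothetical listener A: a 2-D value space that ignores the sender's third dimension.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
# Hypothetical listener B: keeps only the second dimension of what A understood.
B = np.array([[0.0, 1.0]])

belief_heard = np.array([0.5, 0.0, 2.0])  # partly outside A's null space
belief_lost = np.array([0.0, 0.0, 3.0])   # entirely inside A's null space

after_chain = transmit([A, B], belief_heard)
```

Here `belief_heard` reaches listener A in distorted form (its third component is dropped) and then vanishes at listener B, while `belief_lost` dies at the first step.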

Matrix-game 2.0: An open-source real-time and streaming interactive world model

Authors: Xianglong He, Chunli Peng, Zexiang Liu, Boyang Wang, Yifan Zhang, Qi Cui, Fei Kang, Biao Jiang, Mengyin An, Yangyang Ren, Baixin Xu, Hao-Xiang Guo, Kaixiong Gong, Size Wu, Wei Li, Xuchen Song, Yang Liu, Yangguang Li, Yahui Zhou

First: 2025-08-18T15:28:53+00:00 · Latest: 2025-12-10T17:10:47+00:00

Comments: Project Page: https://matrix-game-v2.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in interactive video generation have demonstrated the potential of diffusion models as world models that capture complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they struggle to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) A scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) An action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) A few-step distillation based on the causal architecture for real-time and streaming video generation. Matrix-Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.

中文标题/摘要

标题：矩阵游戏2.0：开源实时和流式交互世界模型

近期生成交互视频的进展表明，扩散模型具有作为世界模型的潜力，能够捕捉复杂的物理动态和交互行为。然而，现有的交互世界模型依赖双向注意力和漫长的推理步骤，严重限制了实时性能。因此，它们难以模拟真实世界的动态，其中结果必须基于历史上下文和当前行为即时更新。为了解决这个问题，我们提出了矩阵游戏2.0，这是一种通过少量步骤自回归扩散生成长视频的交互世界模型。我们的框架包括三个关键组件：(1) 用于虚幻引擎和GTA5环境的大规模数据生产流水线，有效生成大量（约1200小时）具有多样交互注释的视频数据；(2) 动作注入模块，使帧级鼠标和键盘输入成为交互条件；(3) 基于因果结构的少量步骤蒸馏，用于实时和流式视频生成。矩阵游戏2.0可以在超快的25 FPS速度下生成跨多种场景的高质量分钟级视频。我们开源了我们的模型权重和代码库，以促进交互世界建模的研究。

Summary / 总结

Matrix-Game 2.0 addresses the limited real-time performance of existing interactive world models, which rely on bidirectional attention and lengthy inference. It generates long videos on the fly via few-step auto-regressive diffusion and combines a scalable data production pipeline (about 1200 hours of annotated video from Unreal Engine and GTA5 environments), an action injection module for frame-level mouse and keyboard inputs, and a few-step distillation process. The model generates high-quality minute-level videos across diverse scenes at 25 FPS, and its weights and codebase are open-sourced.

Matrix-Game 2.0旨在解决现有交互世界模型的实时性和流式传输能力不足的问题，通过使用几步自回归扩散模型实现这一目标。该模型包含一个可扩展的数据生产管道、一个动作注入模块以及一个蒸馏过程。该模型可以在各种场景下以每秒25帧的速度生成高质量的视频。主要发现包括能够生成多样化的交互注释，并支持实时和流式传输视频生成。

LLMs in Interpreting Legal Documents

Authors: Simone Corbo

First: 2025-12-10T17:09:13+00:00 · Latest: 2025-12-10T17:09:13+00:00

Abs · PDF · Code1 · Code2

Abstract

This chapter explores the application of Large Language Models in the legal domain, showcasing their potential to optimise and augment traditional legal tasks by analysing possible use cases, such as assisting in interpreting statutes, contracts, and case law, enhancing clarity in legal summarisation, contract negotiation, and information retrieval. There are several challenges that can arise from the application of such technologies, such as algorithmic monoculture, hallucinations, and compliance with existing regulations, including the EU's AI Act and recent U.S. initiatives, alongside the emerging approaches in China. Furthermore, two different benchmarks are presented.

中文标题/摘要

标题：大型语言模型在法律文件解释中的应用

本章探讨了大型语言模型在法律领域的应用，展示了它们通过分析可能的应用场景来优化和增强传统法律任务的潜力，例如协助解释法律条文、合同和判例法，提高法律总结、合同谈判和信息检索的清晰度。此类技术的应用可能会带来一些挑战，如算法单一化、幻觉以及遵守现有法规，包括欧盟的AI法案和最近的美国举措，以及中国新兴的方法。此外，还介绍了两种不同的基准。

Summary / 总结

This chapter examines the use of Large Language Models (LLMs) in the legal domain, analysing use cases such as assisting in the interpretation of statutes, contracts, and case law, and improving legal summarisation, contract negotiation, and information retrieval. It also discusses challenges that arise from applying such technologies, including algorithmic monoculture, hallucinations, and compliance with existing regulations such as the EU's AI Act, recent U.S. initiatives, and emerging approaches in China. In addition, two different benchmarks are presented.

本章探讨了大型语言模型（LLMs）在法律领域的应用，旨在通过解释法律条文、合同和案例法来增强传统法律流程。研究提出了两个基准来应对算法单一性、幻觉以及遵守欧盟AI法案和美国最新举措等现有法规的问题。主要发现包括LLMs在法律总结、合同谈判和信息检索方面具有提升潜力，但也强调了需要谨慎管理偏见和合规性管理的重要性。

RIFT: A Scalable Methodology for LLM Accelerator Fault Assessment using Reinforcement Learning

Authors: Khurram Khalil, Muhammad Mahad Khaliq, Khaza Anuarul Hoque

First: 2025-12-10T17:07:19+00:00 · Latest: 2025-12-10T17:07:19+00:00

Comments: Accepted in the IEEE DATE 2026 conference

Abs · PDF · Code1 · Code2

Abstract

The massive scale of modern AI accelerators presents critical challenges to traditional fault assessment methodologies, which face prohibitive computational costs and provide poor coverage of critical failure modes. This paper introduces RIFT (Reinforcement Learning-guided Intelligent Fault Targeting), a scalable framework that automates the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment. RIFT transforms the complex search for worst-case faults into a sequential decision-making problem, combining hybrid sensitivity analysis for search space pruning with reinforcement learning to intelligently generate minimal, high-impact test suites. Evaluated on billion-parameter Large Language Model (LLM) workloads using NVIDIA A100 GPUs, RIFT achieves a 2.2× fault assessment speedup over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, all while achieving superior fault coverage. The proposed framework also provides actionable data to enable intelligent hardware protection strategies, demonstrating that RIFT-guided selective error correction code provides a 12.8× improvement in cost-effectiveness (coverage per unit area) compared to uniform triple modular redundancy protection. RIFT automatically generates UVM-compliant verification artifacts, ensuring its findings are directly actionable and integrable into commercial RTL verification workflows.

中文标题/摘要

标题：RIFT：使用强化学习的LLM加速器故障评估可扩展方法

现代AI加速器的庞大规模对传统故障评估方法提出了关键挑战，这些方法面临计算成本高昂的问题，并且对关键故障模式的覆盖不足。本文介绍了RIFT（基于强化学习的智能故障目标），这是一种可扩展的框架，用于自动化发现最小且高影响的故障场景，以实现高效的故障评估。RIFT将复杂的关键故障搜索转化为顺序决策问题，结合混合灵敏度分析进行搜索空间修剪，并使用强化学习智能生成最小且高影响的测试套件。在使用NVIDIA A100 GPU评估具有十亿参数的大型语言模型（LLM）工作负载时，RIFT在故障评估速度上比进化方法快2.2倍，并且与随机故障注入相比，所需的测试向量体积减少了超过99%，同时实现了更优的故障覆盖率。所提出的框架还提供了可操作的数据，以实现智能硬件保护策略，证明RIFT引导的选择性错误纠正码在成本效益（覆盖率每单位面积）方面比均匀的三模冗余保护提高了12.8倍。RIFT自动生成UVM兼容的验证文件，确保其发现可以直接操作并整合到商业RTL验证工作流程中。

Summary / 总结

RIFT is a scalable framework that uses reinforcement learning to automate the discovery of minimal, high-impact fault scenarios for efficient design-time fault assessment in large AI accelerators. It achieves a 2.2x speedup in fault assessment over evolutionary methods and reduces the required test vector volume by over 99% compared to random fault injection, while maintaining superior fault coverage. RIFT also provides actionable data for intelligent hardware protection strategies, demonstrating a 12.8x improvement in cost-effectiveness compared to uniform triple modular redundancy protection.

RIFT 是一个使用强化学习来自动化发现最小且高影响故障场景的可扩展框架，用于大型 AI 加速器的设计时故障评估。它在故障评估方面比进化方法快 2.2 倍，并且与随机故障注入相比，将所需的测试向量体积减少了超过 99%，同时保持了更高的故障覆盖率。RIFT 还提供了用于智能硬件保护策略的可操作数据，证明与均匀三模冗余保护相比，其成本效益提高了 12.8 倍。

Improving Graph Neural Network Training, Defense, Hypergraph Partitioning and Spectral Clustering via Adversarial Robustness Evaluation

Authors: Yongyu Wang

First: 2024-12-19T11:10:48+00:00 · Latest: 2025-12-10T17:03:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph Neural Networks (GNNs) are a highly effective neural network architecture for processing graph-structured data. Unlike traditional neural networks that rely solely on the features of the data as input, GNNs leverage both the graph structure, which represents the relationships between data points, and the feature matrix of the data to optimize their feature representation. This unique capability enables GNNs to achieve superior performance across various tasks. However, it also makes GNNs more susceptible to noise from both the graph structure and data features, which can significantly increase the training difficulty and degrade their performance. To address this issue, this paper proposes a novel method for selecting noise-sensitive training samples from the original training set to construct a smaller yet more effective training set for model training. These samples are used to help improve the model's ability to correctly process data in noisy environments. We have evaluated our approach on three of the most classical GNN models (GCN, GAT, and GraphSAGE) as well as three widely used benchmark datasets: Cora, Citeseer, and PubMed. Our experiments demonstrate that the proposed method can substantially boost the training of Graph Neural Networks compared to using either randomly sampled training sets of the same size or the larger original full training set. We further propose a robust-node based hypergraph partitioning method, an adversarial robustness based graph pruning method for GNN defenses, a robust spectral clustering method, and a related spectral edge attack method.

中文标题/摘要

标题：通过对抗鲁棒性评估改进图神经网络训练、防御、超图分区和谱聚类

图神经网络（GNNs）是一种高效的神经网络架构，用于处理图结构数据。与传统神经网络仅依赖数据特征作为输入不同，GNNs 利用图结构，即数据点之间的关系，以及数据的特征矩阵来优化其特征表示。这种独特的能力使GNNs在各种任务中表现出色。然而，这也使得GNNs更容易受到图结构和数据特征噪声的影响，这会显著增加训练难度并降低其性能。为了解决这一问题，本文提出了一种新颖的方法，从原始训练集选择噪声敏感的训练样本，以构建一个更小但更有效的训练集用于模型训练。这些样本用于帮助提高模型在噪声环境下的数据处理能力。我们已在三种最经典的GNN模型GCN、GAT和GraphSAGE以及三种广泛使用的基准数据集Cora、Citeseer和PubMed上评估了我们的方法。实验结果表明，与从原始训练集随机采样的相同大小的训练集和更大的原始完整训练集相比，所提出的方法可以显著提高图神经网络的训练效果。我们还提出了一种基于鲁棒节点的超图分区方法、一种基于对抗鲁棒性的图神经网络防御的图剪枝方法、一种鲁棒谱聚类方法以及相关谱边缘攻击方法。

Summary / 总结

This paper addresses the challenge of noise in graph-structured data for Graph Neural Networks (GNNs) by proposing a method to select noise-sensitive training samples, which are then used to construct a more effective training set. The method improves GNN training and robustness in noisy environments. Experiments on GCN, GAT, and GraphSAGE with Cora, Citeseer, and PubMed datasets show that this approach significantly enhances GNN training performance compared to random sampling or full training sets. Additionally, the paper introduces a robust-node based hypergraph partitioning method, an adversarial robustness based graph pruning method for GNN defenses, and a robust spectral clustering method.

本文针对图神经网络（GNN）对噪声的敏感性问题，提出了一种选择噪声敏感训练样本的方法以提高模型训练效果。该方法增强了GNN在噪声环境下的性能，并在GCN、GAT和GraphSAGE模型以及Cora、Citeseer和PubMed数据集上进行了测试，显示了在随机采样和完整训练集上的显著改进。此外，本文还提出了基于鲁棒节点的超图划分方法、基于对抗鲁棒性的图剪枝方法用于GNN防御、鲁棒谱聚类方法及相关谱边缘攻击方法。

Deep Operator Learning for High-Fidelity Fluid Flow Field Reconstruction from Sparse Sensor Measurements

Authors: Hiep Vo Dang, Phong C. H. Nguyen

First: 2024-12-11T01:28:48+00:00 · Latest: 2025-12-10T17:02:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Reconstructing high-fidelity fluid flow fields from sparse sensor measurements is vital for many science and engineering applications but remains challenging because of dimensional disparities between state and observational spaces. Due to such dimensional differences, the measurement operator becomes ill-conditioned and non-invertible, making the reconstruction of flow fields from sensor measurements extremely difficult. Although sparse optimization and machine learning address the above problems to some extent, questions about their generalization and efficiency remain, particularly regarding the discretization dependence of these models. In this context, deep operator learning offers a better solution as this approach models mappings between infinite-dimensional functional spaces, enabling superior generalization and discretization-independent reconstruction. We introduce FLRONet, a deep operator learning framework that is trained to reconstruct fluid flow fields from sparse sensor measurements. FLRONet employs a branch-trunk network architecture to represent the inverse measurement operator that maps sensor observations to the original flow field, a continuous function of both space and time. Validation performed on the CFDBench dataset has demonstrated that FLRONet consistently achieves high levels of reconstruction accuracy and robustness, even in scenarios where sensor measurements are inaccurate or missing. Furthermore, the operator learning approach endows FLRONet with the capability to perform zero-shot super-resolution in both spatial and temporal domains, offering a solution for rapid reconstruction of high-fidelity flow fields.

中文标题/摘要

标题：基于稀疏传感器测量的高保真流场重构深度算子学习

从稀疏传感器测量重构高保真流场对于许多科学和工程应用至关重要，但由于状态空间和观测空间维数差异，测量算子变得病态且不可逆，使得从传感器测量重构流场变得极其困难。尽管稀疏优化和机器学习在一定程度上解决了上述问题，但关于它们的泛化能力和效率的问题仍然存在，特别是这些模型的离散化依赖性问题。在此背景下，深度算子学习提供了一个更好的解决方案，因为这种方法建模的是无穷维函数空间之间的映射，能够实现更好的泛化和离散化无关的重构。我们引入了FLRONet，这是一种训练框架，用于从稀疏传感器测量重构流场。FLRONet采用分支-干网络架构来表示逆测量算子，该算子将传感器观测映射到原始流场，这是一个关于空间和时间的连续函数。在CFDBench数据集上的验证表明，FLRONet在重构准确性和鲁棒性方面始终能够达到高水平，即使在传感器测量不准确或缺失的情况下也是如此。此外，算子学习方法赋予FLRONet在空间和时间域上进行零样本超分辨率的能力，为快速重构高保真流场提供了解决方案。

Summary / 总结

The paper addresses the challenge of reconstructing high-fidelity fluid flow fields from sparse sensor measurements, which is crucial for various applications but difficult due to dimensional mismatches. It introduces FLRONet, a deep operator learning framework that uses a branch-trunk network to model the inverse measurement operator, enabling high-accuracy and robust reconstruction even with inaccurate or missing data. The method demonstrates superior generalization and discretization independence, and it can perform zero-shot super-resolution in both spatial and temporal domains.

论文解决了从稀疏传感器测量中重建高保真流体流动场的挑战，这对于各种应用至关重要但因维度差异而困难。它引入了FLRONet，这是一种深度算子学习框架，用于建模逆测量算子以将传感器观测值映射到原始流动场。FLRONet在即使传感器数据不准确或缺失的情况下也能实现高重建准确性和鲁棒性，并且可以在空间和时间域上执行零样本超分辨率。
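
The branch-trunk structure mentioned above can be sketched as a generic DeepONet-style forward pass in NumPy. This is not the FLRONet architecture itself. The layer sizes, the random untrained weights, and the choice of (x, y, t) as query coordinates are assumptions made purely for illustration. The branch network encodes the sensor readings into coefficients, the trunk network encodes each query point into basis values, and their dot product gives the reconstructed field value at that point.

```python
import numpy as np


def init_mlp(rng, sizes):
    """Random (untrained) weights for a small MLP; returns a list of (W, b)."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """tanh MLP with a linear last layer."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b


def deeponet_forward(branch, trunk, sensors, queries):
    """Evaluate the operator output at arbitrary query points."""
    coeffs = mlp(branch, sensors)   # shape (p,): depends only on sensor data
    basis = mlp(trunk, queries)     # shape (n_queries, p): depends only on coordinates
    return basis @ coeffs           # shape (n_queries,)


rng = np.random.default_rng(0)
n_sensors, p = 8, 16
branch = init_mlp(rng, [n_sensors, 32, p])
trunk = init_mlp(rng, [3, 32, p])          # a query is (x, y, t)

sensors = rng.normal(size=n_sensors)       # one snapshot of sparse measurements
coarse = rng.uniform(size=(10, 3))         # a coarse set of query points
fine = rng.uniform(size=(1000, 3))         # a much denser set, same model

out_coarse = deeponet_forward(branch, trunk, sensors, coarse)
out_fine = deeponet_forward(branch, trunk, sensors, fine)
```

Because the trunk takes continuous coordinates, the same weights can be queried on any grid. This is the property behind the discretization-independent reconstruction and zero-shot super-resolution described in the abstract.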

Composing Concepts from Images and Videos via Concept-prompt Binding

Authors: Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao

First: 2025-12-10T16:57:31+00:00 · Latest: 2025-12-10T16:57:31+00:00

Comments: Project page: https://refkxh.github.io/BiCo_Webpage/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.

中文标题/摘要

标题：通过概念提示绑定从图像和视频中构建概念

视觉概念组合旨在将图像和视频中的不同元素整合成一个连贯的视觉输出，但在准确提取复杂概念和灵活组合图像和视频中的概念方面仍存在不足。我们引入了Bind & Compose，这是一种单样本（one-shot）方法，通过将视觉概念与相应的提示标记绑定，并使用各种来源的绑定标记组合目标提示，实现了灵活的视觉概念组合。该方法采用分层绑定结构，在扩散变换器中进行跨注意力条件编码，将视觉概念编码为相应的提示标记，以准确分解复杂的视觉概念。为了提高概念标记绑定的准确性，我们设计了一种多样化吸收机制，使用额外的吸收标记在使用多样化提示进行训练时消除与概念无关的细节的影响。为了增强图像和视频概念之间的兼容性，我们提出了一种时间解耦策略，将视频概念的训练过程分为两个阶段，并采用双分支绑定结构进行时间建模。评估表明，我们的方法在概念一致性、提示保真度和运动质量方面优于现有方法，为视觉创意开辟了新的可能性。

Summary / 总结

The research aims to improve visual concept composition by integrating elements from images and videos into a coherent output. The method, Bind & Compose, uses a hierarchical binder structure in Diffusion Transformers to bind visual concepts with prompt tokens and compose target prompts from various sources. It introduces a Diversify-and-Absorb Mechanism to enhance concept-token binding accuracy and a Temporal Disentanglement Strategy to improve compatibility between image and video concepts. Experiments show that this method outperforms existing approaches in concept consistency, prompt fidelity, and motion quality, advancing visual creativity capabilities.

研究旨在通过准确提取和组合来自图像和视频的概念来改进视觉概念合成。方法Bind & Compose使用Diffusion Transformers中的分层绑定结构将视觉概念与提示令牌绑定并组成一个连贯的输出。它包括一个多样化和吸收机制以增强概念-令牌绑定，并引入一个时间解耦策略以提高图像和视频概念之间的兼容性。实验表明，该方法在概念一致性、提示保真度和运动质量方面优于现有方法，推动了视觉创意的发展。

A roadmap of geospatial soil quality analysis systems

Authors: Habiba BEN ABDERRAHMANE, Slimane Oulad-Naoui, Benameur ZIANI

First: 2025-12-10T16:40:12+00:00 · Latest: 2025-12-10T16:40:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Soil quality (SQ) plays a crucial role in sustainable agriculture, environmental conservation, and land-use planning. Traditional SQ assessment techniques rely on costly, labor-intensive sampling and laboratory analysis, limiting their spatial and temporal coverage. Advances in Geographic Information Systems (GIS), remote sensing, and machine learning (ML) have enabled efficient SQ evaluation. This paper presents a comprehensive roadmap that differs from previous reviews by proposing a unified and modular pipeline that integrates multi-source soil data, GIS and remote sensing tools, and machine learning techniques to support transparent and scalable soil quality assessment. It also includes practical applications. In contrast to existing studies that predominantly target isolated soil parameters or specific modeling methodologies, this approach consolidates recent advancements in GIS, remote sensing technologies, and machine learning algorithms across the entire soil quality assessment pipeline. It also addresses existing challenges and limitations while exploring future developments and emerging trends that can deliver the next generation of soil quality systems, making them more transparent, adaptive, and aligned with sustainable land management.

中文标题/摘要

标题：地理空间土壤质量分析系统的路线图

土壤质量(SQ)在可持续农业、环境保育和土地利用规划中起着关键作用。传统的SQ评估技术依赖于昂贵的、劳动密集型的采样和实验室分析，限制了其空间和时间覆盖范围。地理信息系统(GIS)、遥感和机器学习(ML)的进步使得高效的SQ评估成为可能。本文提出了一种综合的路线图，区别于之前的综述，通过整合多源土壤数据、GIS和遥感工具以及机器学习技术，支持透明和可扩展的土壤质量评估。它还包含实际应用。与现有的大多数研究主要针对孤立的土壤参数或特定建模方法不同，这种方法将地理信息系统(GIS)、遥感技术和机器学习算法的最新进展整合到整个土壤质量评估管道中。同时，它也解决了现有挑战和局限性，探索了该领域的未来发展趋势，以实现下一代更透明、更具适应性和符合可持续土地管理的土壤质量系统。

Summary / 总结

This paper aims to enhance the efficiency and accuracy of soil quality assessment by integrating GIS, remote sensing, and machine learning techniques. The proposed unified pipeline consolidates recent advancements in these technologies to support transparent and scalable soil quality evaluation. Key findings include the development of a modular approach that addresses existing challenges and paves the way for future developments in soil quality systems, making them more adaptive and aligned with sustainable land management practices.

该论文旨在通过解决传统方法成本高、劳动密集的问题，改进土壤质量评估。它提出了一种统一的管道，结合GIS、遥感和机器学习技术，以高效评估土壤质量。关键发现包括整合了这些技术的最新进展，并开发了一个模块化系统，增强了土壤质量评估的透明度和可扩展性，支持可持续的土地管理。

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Authors: Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu

First: 2025-09-24T17:59:04+00:00 · Latest: 2025-12-10T16:37:54+00:00

Comments: Seedream 4.0/4.5 Technical Report

Abs · PDF · Code1 · Code2

Abstract

We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE that can also considerably reduce the number of image tokens. This allows for efficient training of our model and enables it to quickly generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training with strong generalization. By incorporating a carefully fine-tuned VLM, we perform multimodal post-training that trains T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without an LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and it also supports multi-image reference and the generation of multiple output images. This extends traditional T2I systems into a more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. We further scale our model and data as Seedream 4.5. Seedream 4.0 and Seedream 4.5 are accessible on Volcano Engine: https://www.volcengine.com/experience/ark?launch=seedream.

中文标题/摘要

标题：Seedream 4.0：迈向下一代多模态图像生成

我们介绍了Seedream 4.0，这是一个高效且高性能的多模态图像生成系统，将文本到图像（T2I）合成、图像编辑和多图像组合统一在一个框架中。我们开发了一种高效的扩散变换器和强大的VAE，可以显著减少图像令牌的数量。这使得我们的模型能够高效训练，并快速生成原生高分辨率图像（例如1K-4K）。Seedream 4.0基于数十亿个跨不同分类和知识中心概念的文本-图像对进行预训练。通过跨数百个垂直场景的全面数据收集和优化策略，确保了稳定和大规模训练，具有强大的泛化能力。通过结合精心微调的VLM模型，我们进行多模态后训练，同时训练T2I和图像编辑任务。为了加速推理，我们整合了对抗蒸馏、分布匹配和量化，以及推测性解码。它在生成2K图像时的推理时间为1.8秒（不使用LLM/VLM作为PE模型）。全面的评估表明，Seedream 4.0在T2I和多模态图像编辑方面均能达到最先进的结果。特别是在复杂任务中，它展示了出色的多模态能力，包括精确的图像编辑和上下文推理，并允许多图像参考，可以生成多个输出图像。这将传统的T2I系统扩展为一个更具互动性和多维创意工具，推动生成AI在创意和专业应用方面的边界。我们进一步扩展了模型和数据，推出了Seedream 4.5。Seedream 4.0和Seedream 4.5可在火山引擎上访问：https://www.volcengine.com/experience/ark?launch=seedream。

DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

Authors: Zhizhong Wang, Tianyi Chu, Zeyi Huang, Nanyang Wang, Kehan Li

First: 2025-12-10T16:34:03+00:00 · Latest: 2025-12-10T16:34:03+00:00

Abs · PDF · Code1 · Code2

Abstract

Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent interest pertains to the integration of an image prompt adapter to facilitate zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF), (2) the difficulty in retaining fine-grained concept details in reference images, and (3) the limited scalability to multi-subject personalization. To tackle these challenges, we present Dynamic Image Prompt Adapter (DynaIP), a cutting-edge plugin to enhance the fine-grained concept fidelity, CP-PF balance, and subject scalability of SOTA T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits decoupling learning behavior when reference image features are injected into its dual branches via cross attentions. Based on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly enhancing the CP-PF balance and further bolstering the scalability of multi-subject compositions. Moreover, we identify the visual encoder as a key factor affecting fine-grained CP and reveal that the hierarchical features of the commonly used CLIP can capture visual information at diverse granularity levels. Therefore, we introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module to fully leverage the hierarchical features of CLIP, remarkably elevating the fine-grained concept fidelity while also providing flexible control of visual granularity. Extensive experiments across single- and multi-subject PT2I tasks verify that our DynaIP outperforms existing approaches, marking a notable advancement in the field of PT2I generation.

中文标题/摘要

标题：DynaIP：动态图像提示适配器，用于可扩展的零样本个性化文本到图像生成

个性化文本到图像（PT2I）生成旨在根据参考图像生成定制化图像。一个主要兴趣在于集成图像提示适配器，以在测试时无需微调的情况下实现零样本PT2I。然而，当前方法面临三个基本挑战：1. 概念保留（CP）和提示跟随（PF）之间的难以捉摸的平衡，2. 难以保留参考图像中的细粒度概念细节，3. 限制扩展到多主题个性化。为应对这些挑战，我们提出了动态图像提示适配器（DynaIP），这是一种先进的插件，旨在增强SOTA文本到图像多模态扩散变换器（MM-DiT）的细粒度概念保真度、CP-PF平衡和主题扩展性，以实现PT2I生成。我们的主要发现是，MM-DiT在通过交叉注意力将参考图像特征注入其双分支时表现出解耦学习行为。基于此，我们设计了一种创新的动态解耦策略，在推理过程中去除概念无关信息的干扰，显著提高了CP-PF平衡，并进一步增强了多主题组合的扩展性。此外，我们确定视觉编码器是影响细粒度CP的关键因素，并揭示了常用CLIP的层次特征可以捕捉不同粒度级别的视觉信息。因此，我们引入了一种新颖的层次混合专家特征融合模块，充分利用CLIP的层次特征，显著提高了细粒度概念保真度，同时提供了对视觉粒度的灵活控制。广泛的实验验证了我们的DynaIP在单主题和多主题PT2I任务中均优于现有方法，标志着PT2I生成领域的一个重要进步。

Summary / 总结

DynaIP addresses the challenges of Concept Preservation and Prompt Following, fine-grained concept details retention, and scalability in zero-shot personalized text-to-image generation. It introduces a Dynamic Decoupling Strategy and a Hierarchical Mixture-of-Experts Feature Fusion Module to enhance the performance of state-of-the-art multimodal diffusion transformers. Experiments show that DynaIP outperforms existing methods in both single- and multi-subject tasks.

DynaIP 解决了零样本个性化文本到图像生成中的概念保留和提示跟随、细粒度概念细节保留以及可扩展性等挑战。它引入了动态解耦策略和层次混合专家特征融合模块，以提升最先进的多模态扩散变换器的性能。实验表明，DynaIP 在单主题和多主题任务中均优于现有方法，提高了概念保留和提示跟随之间的平衡，并增强了可扩展性。

OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring

Authors: Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary

First: 2025-08-15T13:56:21+00:00 · Latest: 2025-12-10T16:33:31+00:00

Abs · PDF · Code1 · Code2

Abstract

The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community's ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.

中文标题/摘要

标题：OpenConstruction：建筑监测中数据驱动人工智能的系统合成开放视觉数据集

建筑行业越来越多地依赖视觉数据来支持现场监测的人工智能（AI）和机器学习（ML）应用。高质量、领域特定的数据集，包括图像、视频和点云，捕捉现场几何形状和时空动态，包括物体、工人和材料的位置和交互。然而，尽管对利用视觉数据集的兴趣日益增长，现有的资源在规模、数据模态、注释质量和对现实世界建筑条件的代表性方面差异很大。缺乏对这些数据特性和应用背景的系统性分类，限制了社区对数据集景观的全面理解，识别关键缺口，并指导未来方向，以实现更有效的、可靠的和可扩展的建筑领域的AI应用。为解决这一缺口，本研究在学术数据库和开放数据平台中进行了广泛的搜索，获得了2005年至2024年间51个公开可用的视觉数据集。这些数据集按照结构化的数据模式进行分类，涵盖（i）数据基础（如大小和许可），（ii）数据模态（如RGB和点云），（iii）注释框架（如边界框），以及（iv）下游应用领域（如进度跟踪）。本研究将这些发现综合成一个开源目录OpenConstruction，支持数据驱动方法的发展。此外，本研究讨论了现有建筑数据集景观中的几个关键局限性，并提出了基于可发现性、可访问性、互操作性和可重用性（FAIR）原则的数据基础设施未来路线图。通过审查当前景观并概述战略优先事项，本研究支持建筑领域数据驱动解决方案的发展。

Summary / 总结

This study addresses the need for high-quality, domain-specific visual datasets in the construction industry for AI and ML applications. It systematically reviews and categorizes 51 publicly available visual datasets from 2005 to 2024, covering data fundamentals, modalities, annotations, and application domains. The findings are synthesized into an open-source catalog, OpenConstruction, to support data-driven method development and highlight critical gaps in the dataset landscape.

该研究旨在解决建筑行业在AI和ML应用中对高质量领域特定视觉数据集的需求。它系统地审查了2005年至2024年间公开可用的51个视觉数据集，根据数据基础、模态、注释框架和应用领域对其进行分类。研究结果被编译成一个开源目录OpenConstruction，以支持数据驱动的方法开发。该研究还指出了现有数据集景观中的关键局限性，并提出了遵循可发现性、可访问性、互操作性和可重用性（FAIR）原则的数据基础设施未来路线图。

AugLift: Uncertainty Aware Depth Descriptors for Robust 2D to 3D Pose Lifting

Authors: Nikolai Warner, Wenjin Zhang, Hamid Badiozamani, Irfan Essa, Apaar Sadhwani

First: 2025-08-09T22:36:31+00:00 · Latest: 2025-12-10T16:33:18+00:00

Comments: Preprint. Under review

Abs · PDF · Code1 · Code2

Abstract

Lifting based 3D human pose estimators infer 3D joints from 2D keypoints, but often struggle to generalize to real world settings with noisy 2D detections. We revisit the input to lifting and propose AugLift, a simple augmentation of standard lifting that enriches each 2D keypoint (x, y) with an Uncertainty Aware Depth Descriptor (UADD). We run a single off the shelf monocular depth estimator to obtain a depth map, and for every keypoint with detector confidence c we extract depth statistics from its confidence scaled neighborhood, forming a compact, interpretable UADD (c, d, d_min, d_max) that captures both local geometry and reliability. AugLift is modular, requires no new sensors or architectural changes, and integrates by expanding the input layer of existing lifting models. Across four datasets and four lifting architectures, AugLift boosts cross dataset (out of distribution) performance on unseen data by an average of 10.1 percent, while also improving in distribution performance by 4.0 percent as measured by MPJPE. A post hoc analysis clarifies when and why it helps: gains are largest on novel poses and significantly occluded joints, where depth statistics resolve front back ambiguities while confidence calibrates the spatial neighborhoods from which they are drawn. We also study interaction with recent image feature lifting methods and find the signals are complementary: adding UADD to image conditioned lifting yields both ID and OOD gains. A learned depth feature extension (AugLiftV2) improves performance further while trading off interpretability. Together, these results indicate that lightweight, confidence aware depth cues are a powerful plug in for robust 2D to 3D pose lifting.
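
The descriptor construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the square window, the `r_min`/`r_max` bounds, the rule that lower confidence widens the window, and the choice of `d` as the depth at the keypoint pixel are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def uadd(depth_map, x, y, c, r_min=1, r_max=8):
    """Return (c, d, d_min, d_max) for one keypoint.

    Assumption: the neighborhood is a square window whose half-width grows
    as the detector confidence c drops (low confidence -> wider window).
    """
    h, w = depth_map.shape
    c = float(np.clip(c, 0.0, 1.0))
    r = int(round(r_min + (1.0 - c) * (r_max - r_min)))
    xi = int(np.clip(round(float(x)), 0, w - 1))
    yi = int(np.clip(round(float(y)), 0, h - 1))
    # Slice the window, clamping it at the image border.
    patch = depth_map[max(0, yi - r): yi + r + 1, max(0, xi - r): xi + r + 1]
    d = float(depth_map[yi, xi])  # assumption: d is the depth at the keypoint
    return np.array([c, d, patch.min(), patch.max()], dtype=np.float32)

def augment_keypoints(keypoints, confidences, depth_map):
    """Turn (J, 2) keypoints into (J, 6) inputs: x, y, c, d, d_min, d_max."""
    rows = [
        np.concatenate([[x, y], uadd(depth_map, x, y, c)])
        for (x, y), c in zip(keypoints, confidences)
    ]
    return np.stack(rows).astype(np.float32)
```

Under these assumptions, a lifting model that previously took two values per joint would take six, which corresponds to the expanded input layer the abstract mentions.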

中文标题/摘要

标题：AugLift：基于不确定性感知深度描述符的鲁棒2D到3D姿态提升

基于提升的姿态估计器从2D关键点推断3D关节，但在噪声较大的2D检测中难以泛化到现实世界场景。我们重新审视了提升的输入，并提出AugLift，这是一种对标准提升的简单增强，它为每个2D关键点(x, y)添加了一个不确定性感知深度描述符(UADD)。我们使用单个现成的单目深度估计器获取深度图，对于每个具有检测置信度c的关键点，我们从其置信度缩放的邻域中提取深度统计数据，形成一个紧凑且可解释的UADD(c, d, d_min, d_max)，它同时捕捉局部几何结构和可靠性。AugLift模块化，无需新传感器或架构更改，并通过扩展现有提升模型的输入层进行集成。在四个数据集和四个提升架构上，AugLift在未见过的数据上的跨数据集（异分布）性能平均提高了10.1%，同时在分布内性能也提高了4.0%，以MPJPE衡量。事后分析澄清了它何时以及为何有效：在新颖姿态和显著遮挡的关节上，深度统计数据解决了前后方的歧义，而置信度校准了从中抽取的时空邻域。我们还研究了与最近的图像特征提升方法的交互，发现信号是互补的：将UADD添加到图像条件下的提升中，可以同时获得ID和OOD增益。学习深度特征扩展(AugLiftV2)进一步提高了性能，但牺牲了可解释性。这些结果表明，轻量级、置信度感知的深度线索是鲁棒2D到3D姿态提升的强大插件。

Summary / 总结

AugLift enhances 3D human pose estimation from 2D keypoints by enriching each keypoint with an Uncertainty Aware Depth Descriptor (UADD). The method runs a single off-the-shelf monocular depth estimator to obtain a depth map and integrates with existing lifting models by expanding their input layer, without requiring new sensors or architectural changes. Across four datasets and four lifting architectures, AugLift improves cross-dataset performance by an average of 10.1% and in-distribution performance by 4.0% as measured by MPJPE, with the largest gains on novel poses and significantly occluded joints. Adding UADD to image-conditioned lifting methods also yields gains in both in-distribution and out-of-distribution settings.

AugLift 通过为每个关键点引入一个不确定性感知深度描述符（UADD）来增强从2D关键点到3D姿态的估计。该方法使用单个单目深度估计器获取深度统计数据，然后将其集成到现有的提升模型中。在多个数据集和架构上，AugLift 的跨数据集性能提高了10.1%，在分布内性能提高了4.0%，特别是对新颖姿态和严重遮挡的关键点有显著改善。该方法是模块化的，不需要额外的传感器或架构更改。

Human Motion Unlearning

Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso

First: 2025-03-24T13:46:27+00:00 · Latest: 2025-12-10T16:23:59+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce Human Motion Unlearning and motivate it through the concrete task of preventing violent 3D motion synthesis, an important safety requirement given that popular text-to-motion datasets (HumanML3D and Motion-X) contain from 7% to 15% violent sequences spanning both atomic gestures (e.g., a single punch) and highly compositional actions (e.g., loading and swinging a leg to kick). By focusing on violence unlearning, we demonstrate how removing a challenging, multifaceted concept can serve as a proxy for the broader capability of motion "forgetting." To enable systematic evaluation of Human Motion Unlearning, we establish the first motion unlearning benchmark by automatically filtering HumanML3D and Motion-X datasets to create distinct forget sets (violent motions) and retain sets (safe motions). We introduce evaluation metrics tailored to sequential unlearning, measuring both suppression efficacy and the preservation of realism and smooth transitions. We adapt two state-of-the-art, training-free image unlearning methods (UCE and RECE) to leading text-to-motion architectures (MoMask and BAMM), and propose Latent Code Replacement (LCR), a novel, training-free approach that identifies violent codes in a discrete codebook representation and substitutes them with safe alternatives. Our experiments show that unlearning violent motions is indeed feasible and that acting on latent codes strikes the best trade-off between violence suppression and preserving overall motion quality. This work establishes a foundation for advancing safe motion synthesis across diverse applications. Website: https://www.pinlab.org/hmu.
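
The idea of substituting codes in a discrete codebook can be illustrated with a short sketch. The abstract does not say how violent codes are identified or how the safe alternative is chosen, so two things here are assumptions: the set `violent_ids` is taken as given, and each violent code is mapped to its nearest safe code by Euclidean distance. The function names are hypothetical.

```python
import numpy as np

def build_replacement_table(codebook, violent_ids):
    """Map every code index to itself, except violent ones.

    Assumption: a violent code is replaced by the nearest safe code
    (Euclidean distance in the codebook embedding space).
    """
    violent = np.asarray(sorted(violent_ids), dtype=int)
    safe = np.setdiff1d(np.arange(len(codebook)), violent)
    # Pairwise distances between violent and safe codes: (n_violent, n_safe).
    diff = codebook[violent][:, None, :] - codebook[safe][None, :, :]
    nearest = safe[np.linalg.norm(diff, axis=-1).argmin(axis=1)]
    table = np.arange(len(codebook))
    table[violent] = nearest
    return table

def replace_codes(tokens, table):
    """Apply the lookup table to a generated token sequence."""
    return table[np.asarray(tokens, dtype=int)]
```

Because the table is a fixed lookup applied to token indices, no retraining is involved, which fits the training-free setting described above.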

中文标题/摘要

标题：人类运动反学习

我们介绍了人类运动反学习，并通过具体任务防止暴力3D运动合成来激发其动机，鉴于流行的文本到运动数据集（HumanML3D和Motion-X）包含7%到15%的暴力序列，涵盖从原子手势（例如，单次挥拳）到高度组合的动作（例如，加载和挥动腿部踢打）。通过专注于暴力反学习，我们展示了如何通过移除一个复杂概念作为更广泛运动“遗忘”能力的代理。为了系统评估人类运动反学习，我们通过自动过滤HumanML3D和Motion-X数据集建立了第一个运动反学习基准，创建了不同的遗忘集（暴力动作）和保留集（安全动作）。我们引入了针对顺序反学习的评估指标，衡量抑制效果以及保持真实感和平滑过渡。我们将两种最先进的无需训练的图像反学习方法（UCE和RECE）适应领先的文本到运动架构（MoMask和BAMM），并提出了一种新的无需训练的方法——潜在代码替换（LCR），该方法在离散代码表示中识别暴力代码，并用安全替代品替换它们。我们的实验表明，反学习暴力动作是可行的，且在暴力抑制和保持整体运动质量之间采取作用于潜在代码的方法取得了最佳权衡。这项工作为在各种应用中推进安全运动合成奠定了基础。网站：https://www.pinlab.org/hmu.

Summary / 总结

Human Motion Unlearning targets the prevention of violent 3D motion synthesis in text-to-motion models. The authors build the first motion unlearning benchmark by filtering HumanML3D and Motion-X into forget sets (violent motions) and retain sets (safe motions), and compare three training-free methods: UCE and RECE, adapted from image unlearning, and the newly proposed LCR. LCR identifies violent codes in the discrete codebook and replaces them with safe alternatives, striking the best trade-off between violence suppression and overall motion quality. This work paves the way for safer motion synthesis in various applications.

研究引入了Human Motion Unlearning，以解决文本到动作数据集中暴力3D动作合成的安全问题。通过专注于移除暴力序列，研究展示了动作遗忘的更广泛能力。为了评估这一点，作者通过过滤HumanML3D和Motion-X数据集中的暴力动作创建了一个动作遗忘基准。他们使用评估指标来衡量暴力行为的抑制效果同时保持动作的真实性和流畅过渡。研究中采用了现有的图像遗忘方法，并引入了一种新的Latent Code Replacement方法，结果显示移除暴力动作是可行的，使用潜在代码可以在暴力抑制和整体动作质量之间找到最佳平衡。

Benchmarking Web API Integration Code Generation

Authors: Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Jannis Brugger, Mira Mezini

First: 2025-09-24T14:36:44+00:00 · Latest: 2025-12-10T16:17:57+00:00

Comments: Published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track; updated artifact link to point to the right version

Abs · PDF · Code1 · Code2

Abstract

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.
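
The two error classes named above, hallucinated endpoints and incorrect argument usage, can be made concrete with a small checker. This is a hypothetical sketch and not the WAPIIBench pipeline: the `spec` dictionary layout, the `check_call` function, and the example endpoint are all invented for illustration.

```python
def check_call(spec, method, path, args):
    """Return a list of problems found in one generated API call.

    Assumption: `spec` maps (METHOD, path) to a dict holding sets of
    'required' and 'optional' argument names.
    """
    endpoint = spec.get((method.upper(), path))
    if endpoint is None:
        return ["hallucinated endpoint"]
    problems = []
    given = set(args)
    missing = endpoint["required"] - given
    unknown = given - endpoint["required"] - endpoint["optional"]
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    if unknown:
        problems.append(f"unknown arguments: {sorted(unknown)}")
    return problems

# Hypothetical specification with a single endpoint.
EXAMPLE_SPEC = {
    ("GET", "/users"): {"required": {"id"}, "optional": {"fields"}},
}
```

A call passes only when the endpoint exists and its argument names match the specification, so one generated snippet can fail for more than one reason at once.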

中文标题/摘要

标题：Web API集成代码生成基准测试

API集成是数字基础设施的基石，使软件系统能够连接和交互。然而，如许多研究所示，编写或生成正确的代码以调用API，尤其是Web API，是一项挑战。尽管大型语言模型（LLMs）在软件开发中变得流行，但它们在自动化生成Web API集成代码方面的有效性尚未得到探索。为了应对这一挑战，我们提出了WAPIIBench，这是一个数据集和评估管道，旨在评估LLMs生成Web API调用代码的能力。我们的实验显示，生成API调用构成了一个重大挑战，导致生成虚假的端点、错误的参数使用和其他错误。评估的开源模型中没有一个能够解决超过40%的任务。

Summary / 总结

The paper aims to evaluate the capability of large language models (LLMs) in generating web API integration code. WAPIIBench, a dataset and evaluation pipeline, was developed to assess this. Experiments with several open-source LLMs showed that generating correct API invocations is challenging, with errors such as hallucinated endpoints and incorrect argument usage. None of the evaluated open-source models could solve more than 40% of the tasks.

研究旨在评估大型语言模型（LLMs）在生成正确Web API集成代码方面的能力。开发了WAPIIBench数据集和评估管道来测试这一能力。实验显示，生成准确的API调用具有挑战性，开源模型未能解决超过40%的任务，主要问题包括生成虚假的端点和参数使用错误。

Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers

Authors: Zhaolan Huang, Kaspar Schleiser, Gyungmin Myung, Emmanuel Baccelli

First: 2025-12-10T16:13:29+00:00 · Latest: 2025-12-10T16:13:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Low-power microcontroller (MCU) hardware is currently evolving from single-core architectures to predominantly multi-core architectures. In parallel, new embedded software building blocks are more and more written in Rust, while C/C++ dominance fades in this domain. On the other hand, small artificial neural networks (ANN) of various kinds are increasingly deployed in edge AI use cases, thus deployed and executed directly on low-power MCUs. In this context, both incremental improvements and novel innovative services will have to be continuously retrofitted using ANNs execution in software embedded on sensing/actuating systems already deployed in the field. However, there was so far no Rust embedded software platform automating parallelization for inference computation on multi-core MCUs executing arbitrary TinyML models. This paper thus fills this gap by introducing Ariel-ML, a novel toolkit we designed combining a generic TinyML pipeline and an embedded Rust software platform which can take full advantage of multi-core capabilities of various 32bit microcontroller families (Arm Cortex-M, RISC-V, ESP-32). We published the full open source code of its implementation, which we used to benchmark its capabilities using a zoo of various TinyML models. We show that Ariel-ML outperforms prior art in terms of inference latency as expected, and we show that, compared to pre-existing toolkits using embedded C/C++, Ariel-ML achieves comparable memory footprints. Ariel-ML thus provides a useful basis for TinyML practitioners and resource-constrained embedded Rust developers.

中文标题/摘要

标题：Ariel-ML：为异构多核微控制器上的人工神经网络计算并行化嵌入式Rust

低功耗微控制器(MCU)硬件目前正从单核架构向主要的多核架构演变。与此同时，越来越多的嵌入式软件构建块用Rust编写，而C/C++在这一领域的主导地位正在减弱。另一方面，各种类型的微小人工神经网络(ANN)越来越多地部署在边缘AI用例中，因此直接部署并执行在低功耗MCU上。在这种情况下，无论是增量改进还是新型创新服务，都将不得不不断重新适应在已经部署在现役传感/执行系统上的软件嵌入式ANN执行。然而，到目前为止，还没有自动为执行任意TinyML模型的多核MCU上的推理计算进行并行化的Rust嵌入式软件平台。因此，本文通过引入Ariel-ML，一种结合通用TinyML流水线和嵌入式Rust软件平台的新工具包，填补了这一空白，该平台可以充分利用各种32位微控制器家族（Arm Cortex-M、RISC-V、ESP-32）的多核能力。我们发布了其完整开源代码，并使用各种TinyML模型对其进行基准测试。结果显示，Ariel-ML在推理延迟方面优于现有技术，与使用嵌入式C/C++的现有工具包相比，Ariel-ML实现了相当的内存占用。因此，Ariel-ML为TinyML从业者和资源受限的嵌入式Rust开发人员提供了一个有用的基础。

Summary / 总结

Ariel-ML is a toolkit designed to automate parallelization for inference computation on multi-core microcontrollers using embedded Rust. It addresses the need for efficient deployment of artificial neural networks on low-power MCUs. Experimental results show that Ariel-ML outperforms previous methods in terms of inference latency and achieves comparable memory footprints to existing C/C++ toolkits, making it a valuable resource for TinyML practitioners and embedded Rust developers.

该论文旨在解决在多核微控制器上使用Rust进行神经网络推理的并行化需求。它引入了Ariel-ML，一个结合了TinyML流水线和嵌入式Rust平台的工具包，可以利用各种32位微控制器的多核能力。实验结果表明，Ariel-ML在推理延迟方面优于先前的解决方案，并且在内存占用方面与现有的C/C++工具包相当。

Knowledge Diversion for Efficient Morphology Control and Policy Transfer

Authors: Fu Feng, Ruixiao Shi, Yucheng Xie, Jianlu Shen, Jing Wang, Xin Geng

First: 2025-12-10T16:11:51+00:00 · Latest: 2025-12-10T16:11:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Universal morphology control aims to learn a universal policy that generalizes across heterogeneous agent morphologies, with Transformer-based controllers emerging as a popular choice. However, such architectures incur substantial computational costs, resulting in high deployment overhead, and existing methods exhibit limited cross-task generalization, necessitating training from scratch for each new task. To this end, we propose DivMorph, a modular training paradigm that leverages knowledge diversion to learn decomposable controllers. DivMorph factorizes randomly initialized Transformer weights into factor units via SVD prior to training and employs dynamic soft gating to modulate these units based on task and morphology embeddings, separating them into shared learngenes and morphology- and task-specific tailors, thereby achieving knowledge disentanglement. By selectively activating relevant components, DivMorph enables scalable and efficient policy deployment while supporting effective policy transfer to novel tasks. Extensive experiments demonstrate that DivMorph achieves state-of-the-art performance, achieving a 3× improvement in sample efficiency over direct finetuning for cross-task transfer and a 17× reduction in model size for single-agent deployment.
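
The factorize-then-gate idea can be sketched on a single weight matrix. This is a simplified illustration, not DivMorph itself: treating each singular triplet as one factor unit, producing gates with a sigmoid over a linear projection of an embedding, and the names `factorize`, `soft_gates`, and `gated_weight` are all assumptions.

```python
import numpy as np

def factorize(weight):
    """Split a weight matrix into rank-1 factor units via SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def soft_gates(embedding, projection):
    """Assumption: gates are a sigmoid of a linear map of the embedding."""
    return 1.0 / (1.0 + np.exp(-(embedding @ projection)))

def gated_weight(u, s, vt, gates):
    """Recombine the units, scaling unit i by gates[i] in [0, 1]."""
    return (u * (s * gates)) @ vt
```

With every gate at 1 the original matrix is recovered, and units whose gate is 0 drop out entirely, so only the active units would need to be stored at deployment. In this picture, units gated on for all tasks and morphologies play the role of the shared learngenes, and the rest act as tailors.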

中文标题/摘要

标题：知识转移以实现高效形态控制和策略转移

通用形态控制旨在学习一个适用于不同异构代理形态的通用策略，基于Transformer的控制器因其流行性而受到青睐。然而，此类架构会带来巨大的计算成本，导致部署成本高昂，现有方法在跨任务泛化方面表现有限，需要为每个新任务从头开始训练。为此，我们提出了一种名为DivMorph的模块化训练范式，利用知识转移学习可分解的控制器。DivMorph 在训练前通过SVD将随机初始化的Transformer权重分解为因子单元，并使用动态软门控根据任务和形态嵌入来调节这些单元，将它们分离为共享的learngenes和形态及任务特定的tailors，从而实现知识的解纠缠。通过选择性激活相关组件，DivMorph 使策略部署更具可扩展性和效率，同时支持向新任务的有效策略转移。大量实验表明，DivMorph 达到了最先进的性能，与直接微调相比，在跨任务转移中的样本效率提高了3倍，单代理部署时模型大小减少了17倍。

Summary / 总结

The research aims to make universal morphology control, in which a single policy generalizes across heterogeneous agent morphologies, more efficient, addressing the high computational cost and limited cross-task generalization of existing Transformer-based controllers. DivMorph is a modular training paradigm that factorizes randomly initialized Transformer weights via SVD before training and uses dynamic soft gating to separate the resulting units into shared 'learngenes' and morphology- and task-specific 'tailors'. Experiments report a 3× improvement in sample efficiency over direct finetuning for cross-task transfer and a 17× reduction in model size for single-agent deployment.

研究旨在开发高效的方法来控制具有不同形态的代理，解决现有基于Transformer的控制器的高计算成本和有限的跨任务泛化能力问题。DivMorph 提出了一种模块化的训练范式，通过 SVD 和动态软门控将Transformer权重分解为共享的‘learngenes’和任务特定的‘tailors’，从而实现可扩展和高效的策略部署以及有效的策略转移。实验表明，DivMorph 在样本效率上比直接微调高出3倍，并且单个代理部署时模型大小减少了17倍。

TinyDéjàVu: Smaller Memory Footprint & Faster Inference on Sensor Data Streams with Always-On Microcontrollers

Authors: Zhaolan Huang, Emmanuel Baccelli

First: 2025-12-10T16:07:17+00:00 · Latest: 2025-12-10T16:07:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Always-on sensors are increasingly expected to embark a variety of tiny neural networks and to continuously perform inference on time-series of the data they sense. In order to fit lifetime and energy consumption requirements when operating on battery, such hardware uses microcontrollers (MCUs) with tiny memory budget e.g., 128kB of RAM. In this context, optimizing data flows across neural network layers becomes crucial. In this paper, we introduce TinyDéjàVu, a new framework and novel algorithms we designed to drastically reduce the RAM footprint required by inference using various tiny ML models for sensor data time-series on typical microcontroller hardware. We publish the implementation of TinyDéjàVu as open source, and we perform reproducible benchmarks on hardware. We show that TinyDéjàVu can save more than 60% of RAM usage and eliminate up to 90% of redundant compute on overlapping sliding window inputs.
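
The redundancy in overlapping sliding windows can be shown with a small streaming sketch. It is not the TinyDéjàVu algorithm: it assumes a layer that processes each sample independently, so its outputs can be cached per sample and reused by every window that contains that sample. The class name and parameters are hypothetical.

```python
from collections import deque
import numpy as np

class StreamingWindow:
    """Emit layer outputs for sliding windows, running the layer once per sample."""

    def __init__(self, window, hop, layer):
        self.window, self.hop, self.layer = window, hop, layer
        self.cache = deque(maxlen=window)  # cached per-sample layer outputs
        self.calls = 0                     # number of layer evaluations so far
        self._since_emit = 0

    def push(self, sample):
        """Feed one sample; return a window of outputs when one is due, else None."""
        self.cache.append(self.layer(sample))
        self.calls += 1
        self._since_emit += 1
        if len(self.cache) == self.window and self._since_emit >= self.hop:
            self._since_emit = 0
            return np.stack(list(self.cache))
        return None
```

A naive loop would run the layer on all `window` samples of every window. Here each sample is processed once, and the buffer never holds more than `window` cached outputs.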

中文标题/摘要

标题：TinyDéjàVu：在始终开启的微控制器上对传感器数据流进行更快推理并占用更小的内存空间

始终开启的传感器越来越多地被期望搭载各种小型神经网络，并持续对它们所感知的数据的时间序列进行推理。为了满足电池供电操作下的寿命和能耗要求，此类硬件使用具有极小内存预算的微控制器（例如128kB的RAM）。在此背景下，优化神经网络层之间的数据流变得至关重要。在本文中，我们介绍了TinyDéjàVu，这是一种新的框架和新型算法，旨在大幅减少在典型微控制器硬件上使用各种小型机器学习模型对传感器数据时间序列进行推理所需的RAM占用空间。我们已将TinyDéjàVu的实现开源，并在硬件上进行了可重复的基准测试。我们展示了TinyDéjàVu可以节省超过60%的RAM使用，并在重叠滑动窗口输入上消除高达90%的冗余计算。

Summary / 总结

The research aims to reduce memory usage and speed up inference for always-on sensors running on microcontrollers with limited RAM. TinyDéjàVu is a framework that saves more than 60% of the RAM needed for inference on sensor data time-series and eliminates up to 90% of redundant compute on overlapping sliding window inputs.

研究旨在优化使用有限RAM的微控制器上始终开启传感器的数据流和推理速度。提出了TinyDéjàVu框架和算法，以减少对传感器数据时间序列进行推理所需的RAM占用。实验表明，TinyDéjàVu可以节省超过60%的RAM使用，并在重叠滑动窗口输入上减少高达90%的冗余计算。