arXiv Paper Digest (arXiv 论文速递)

Latest digest

HippoCamp: Benchmarking Contextual Agents on Personal Computers

Authors: Zhe Yang, Shulin Tian, Kairui Hu, Shuai Liu, Hoang-Nhat Nguyen, Yichi Zhang, Zujin Guo, Mengying Yu, Zinan Zhang, Jingkang Yang, Chen Change Loy, Ziwei Liu

First: 2026-04-01T17:58:33+00:00 · Latest: 2026-04-01T17:58:33+00:00

Comments: Project Page: https://hippocamp-ai.github.io/

Abstract

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments to model individual user profiles and search massive personal files for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

中文标题/摘要

标题：HippoCamp：在个人计算机上评估上下文代理的能力

我们提出了HippoCamp，一个新的基准，旨在评估代理在多模态文件管理方面的能力。与现有的主要关注网页交互、工具使用或通用环境中的软件自动化等任务的代理基准不同，HippoCamp 在用户中心的环境中评估代理，以建模个体用户配置文件，并在庞大的个人文件中进行上下文感知推理。我们的基准在真实世界配置文件上实例化设备规模的文件系统，这些配置文件涵盖了多种模态，数据量超过42.4 GB，涉及超过2000个真实世界的文件。基于原始文件，我们构建了581个问答对，以评估代理在搜索、证据感知和多步推理方面的能力。为了便于细粒度分析，我们提供了46.1K个密集注释的结构化轨迹，用于逐步故障诊断。我们对HippoCamp上的多种最先进的多模态大型语言模型（MLLMs）和代理方法进行了全面评估。我们的实验表明，即使是最先进的商用模型也只能在用户建模方面达到48.3%的准确率，特别是在长时检索和密集个人文件系统中的跨模态推理方面表现不佳。此外，我们的逐步故障诊断识别出多模态感知和证据定位是主要瓶颈。最终，HippoCamp 暴露了当前代理在现实、用户中心环境中的关键局限性，并为开发下一代个人AI助手提供了坚实的基础。

Summary / 总结

HippoCamp is a new benchmark for evaluating agents' capabilities in managing multimodal personal files. Unlike existing benchmarks, HippoCamp focuses on user-centric environments, assessing agents' abilities in search, evidence perception, and multi-step reasoning. The benchmark comprises 42.4 GB of data across over 2K real-world files, 581 QA pairs, and 46.1K annotated structured trajectories for step-wise failure diagnosis. Experiments show that even the most advanced commercial models achieve only 48.3% accuracy in user profiling and struggle particularly with long-horizon retrieval and cross-modal reasoning. The step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks.

HippoCamp 是一个新的基准，用于评估代理在个人计算机上管理多模态文件系统的能力。与现有基准不同，HippoCamp 关注用户中心的环境，评估代理在搜索、证据感知和多步推理方面的能力。基准包括来自超过 2K 个真实文件的 42.4 GB 数据和 581 个 QA 对。全面的实验显示，即使最先进的模型在用户建模方面的准确率也只有 48.3%，特别是在长时检索和跨模态推理方面存在重大挑战。故障诊断指出，需要在多模态感知和证据定位方面进行改进。

LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

Authors: Yuxuan Bao, Xingyue Zhang, J. Nathan Kutz

First: 2026-04-01T17:55:10+00:00 · Latest: 2026-04-01T17:55:10+00:00

Abstract

Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and can also be limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.

中文标题/摘要

标题：使用浅层递归解码器从短时间序列推断潜在阶段（LAPIS-SHRED）

从空间和时间上稀疏的观测中重构完整的时空动态仍然是复杂系统中的一个核心挑战，因为测量可能在空间上不完整，并且只能局限于狭窄的时间窗口。然而，近似完整的时空轨迹对于机理洞察、理解、模型校准和操作决策至关重要。我们引入了LAPIS-SHRED（使用浅层递归解码器从短时间序列推断潜在阶段），这是一种模块化架构，可以从受限于短时间窗口的稀疏传感器观测中重构和/或预测完整的时空动态。LAPIS-SHRED通过三个阶段的流水线运行：（i）SHRED模型完全在模拟数据上预训练，将传感器时间历史映射到结构化的潜在空间；（ii）时间序列模型在模拟衍生的潜在轨迹上训练，学习在时间上向前或向后传播潜在状态，以覆盖从短观测时间窗口中推断出的未观测时间区域；（iii）在部署时，仅提供来自真实系统的短观测时间窗口的超稀疏传感器测量，冻结的SHRED模型和时间模型共同重构或预测完整的时空轨迹。该框架支持双向推理，从其模块化结构继承了数据同化和多尺度重构能力，并能够适应极端的观测约束，包括单帧终端输入。我们在六个实验中评估了LAPIS-SHRED，这些实验涵盖了复杂的时空物理：湍流流动、多尺度推进物理、易挥发燃烧瞬态以及卫星衍生的环境场，突显了一个轻量级、模块化的架构，适用于观测受限于物理或物流限制的操作环境。

Summary / 总结

LAPIS-SHRED reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. It uses a three-stage pipeline: (i) pre-training a SHRED model on simulation data to map sensor time-histories into a latent space, (ii) training a temporal sequence model on simulation-derived latent trajectories to propagate latent states forward or backward in time, and (iii) at deployment, using the frozen SHRED model and the temporal model to reconstruct or forecast the complete trajectory from a short window of real measurements. Six experiments spanning turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields illustrate bidirectional inference under extreme observational constraints, including single-frame terminal inputs.

LAPIS-SHRED旨在从稀疏且短暂的传感器观测数据中重建和预测完整的时空动态。它采用三阶段流程：在模拟数据上预训练SHRED模型将传感器时间历史映射到潜在空间，训练时间序列模型传播潜在状态，并在部署时利用预训练的SHRED模型和时间序列模型从有限观测中重建或预测完整的时空轨迹。该框架在六个复杂时空物理实验中展示了有效性，展示了其处理极端观测约束和支持双向推理的能力。
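
The three-stage structure described above can be sketched on a toy problem. This is only an illustration under loud assumptions: linear least-squares maps stand in for the SHRED network and for the temporal sequence model, the "physics" is a travelling sine wave, and every name (`simulate`, `sensor_window`, `encoder`, `propagator`) is hypothetical rather than taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_space, n_time, lag, n_sensors = 64, 200, 5, 3
x = np.linspace(0.0, 2.0 * np.pi, n_space, endpoint=False)
t = 0.1 * np.arange(n_time)
sensors = rng.choice(n_space, size=n_sensors, replace=False)

def simulate(phase):
    # Travelling wave; it has rank 2 in space, so two latent variables suffice.
    return np.sin(x[None, :] - t[:, None] + phase)

def sensor_window(field, end):
    # Flattened sensor readings for the `lag` snapshots ending at index `end`.
    return field[end - lag + 1:end + 1, sensors].ravel()

# Stage (i): fit sensor-history -> latent and latent -> field on simulation only.
sim = simulate(0.0)
_, _, vt = np.linalg.svd(sim, full_matrices=False)
modes = vt[:2]                      # linear decoder (assumption: replaces the shallow decoder)
latent = sim @ modes.T              # simulation-derived latent trajectory
ends = np.arange(lag - 1, n_time)
windows = np.stack([sensor_window(sim, e) for e in ends])
encoder, *_ = np.linalg.lstsq(windows, latent[ends], rcond=None)

# Stage (ii): fit a one-step latent propagator on the simulated latent trajectory.
propagator, *_ = np.linalg.lstsq(latent[:-1], latent[1:], rcond=None)

# Stage (iii): the "true" system differs from the simulation (phase shift);
# only its first `lag` sensor snapshots are observed, and both maps stay frozen.
truth = simulate(0.7)
z = sensor_window(truth, lag - 1) @ encoder
forecast = [z @ modes]
for _ in range(lag, n_time):
    z = z @ propagator              # roll the latent state forward in time
    forecast.append(z @ modes)
forecast = np.array(forecast)
max_error = np.abs(forecast - truth[lag - 1:]).max()
```

The toy only works this cleanly because the wave is exactly linear and low rank; the paper's setting uses learned nonlinear models, and backward-in-time propagation is not shown here.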

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

Authors: Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis

First: 2026-04-01T17:53:51+00:00 · Latest: 2026-04-01T17:53:51+00:00

Abstract

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.

中文标题/摘要

标题：食谱比厨房更重要：AI 天气预测管道的数学基础

AI 天气预测取得了快速进展，但尚未形成统一的数学框架来解释决定预报技能的因素。现有理论主要关注特定架构选择，而非整个学习管道。2023-2026 年的运营证据表明，训练方法、损失函数设计和数据多样性至少与架构选择一样重要。本文做出了两项交织的贡献。理论上，我们构建了一个基于球面逼近理论、动力系统理论、信息论和统计学习理论的框架，该框架处理整个学习管道（架构、损失函数、训练策略、数据分布），而不仅仅是架构。我们建立了学习管道误差分解，表明估计误差（损失和数据依赖性）在当前规模上超过了逼近误差（架构依赖性）。我们发展了损失函数谱理论，形式化了均方误差引起的谱模糊，并推导了超出分布外推界线，证明数据驱动模型系统地低估了创纪录的极端事件，偏差随创纪录超阈值线性增长。实验上，我们通过使用 NVIDIA Earth2Studio 和 ERA5 初始条件的十个架构多样化的 AI 天气模型进行推理，评估了六项指标，覆盖了所有季节的 30 个初始化日期。结果证实了 MSE 训练模型在高波数下的普遍光谱能量损失，误差共识比率上升表明大多数预报误差在架构间共享，并且在极端事件期间存在线性负偏差。整体模型评估分数提供了统一的多维评估，而规范框架在训练前使数学评估成为可能。

Summary / 总结

This paper addresses the lack of a unified mathematical framework for AI weather prediction, focusing on the complete learning pipeline including architecture, loss function, training strategy, and data distribution. Theoretical contributions include a Learning Pipeline Error Decomposition showing that estimation error dominates approximation error, and a Loss Function Spectral Theory formalizing spectral blurring. Empirical results across ten diverse AI weather models show universal high wavenumber energy loss, shared forecast error across architectures, and linear negative bias during extreme events.

本文解决了AI天气预测缺乏统一数学框架的问题，关注包括架构、损失函数、训练策略和数据分布在内的完整学习管道。理论贡献包括学习管道误差分解，显示估计误差占主导地位，以及损失函数谱理论，形式化谱模糊。通过使用NVIDIA Earth2Studio和ERA5初始条件的十个不同架构的AI天气模型进行实证验证，确认了普遍的谱能量损失、共享的预报误差以及极端事件中的线性负偏差。引入了综合模型评估得分进行统一评估，并提出了一个处方框架，在训练前对提出的管道进行数学评估。
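
The MSE-induced spectral blurring mentioned in the abstract can be illustrated with a small numerical toy. The setup is an assumption made here, not the paper's experiment: a 1D periodic signal with one predictable low-wavenumber wave and one high-wavenumber wave whose phase is unpredictable. The forecast that minimises expected MSE is the mean over plausible futures, and that mean loses the high-wavenumber energy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_members = 256, 4000
x = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
k_low, k_high = 3, 40   # illustrative wavenumbers (assumption)

# Every plausible future shares the large-scale wave but has a uniformly
# random phase in its small-scale wave.
phases = rng.uniform(0.0, 2.0 * np.pi, size=(n_members, 1))
truths = np.sin(k_low * x) + 0.5 * np.sin(k_high * x + phases)

# The minimiser of expected squared error is the mean of the plausible futures.
mse_forecast = truths.mean(axis=0)

def power(signal, k):
    # Squared magnitude of the normalised Fourier coefficient at wavenumber k.
    coeffs = np.fft.rfft(signal) / len(signal)
    return np.abs(coeffs[k]) ** 2

truth_low, truth_high = power(truths[0], k_low), power(truths[0], k_high)
forecast_low, forecast_high = power(mse_forecast, k_low), power(mse_forecast, k_high)
```

In this toy the forecast keeps essentially all of the low-wavenumber energy while the high-wavenumber energy collapses toward zero, which is the qualitative effect the abstract attributes to MSE training.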

$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

Authors: Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi, Sachin Patro, Nazneen Rajani

First: 2026-04-01T17:52:19+00:00 · Latest: 2026-04-01T17:52:19+00:00

Comments: 16 pages, 10 figures

Abstract

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27 M, followed by GLM-5 at \$1.21 M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating the capability gaps for long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

中文标题/摘要

标题：$\texttt{YC-Bench}$：评估AI代理长期规划和一致执行能力的标准

随着LLM代理处理越来越复杂的任务，一个关键问题是它们是否能在长时间内保持战略连贯性：在不确定性下规划、从延迟反馈中学习以及在早期错误累积时进行调整。我们引入了$\texttt{YC-Bench}$，这是一个基准测试，通过让代理在一个跨越数百回合的一年时间跨度内运行模拟初创企业来评估这些能力。代理必须在部分可观测环境中管理员工、选择任务合同并维持盈利能力，其中敌对客户和不断增长的工资单会因不良决策而产生累积后果。我们对12个模型进行了评估，包括专有模型和开源模型，每个模型进行了3次种子测试。只有三个模型能够始终超过20万美元的起始资本，其中Claude Opus 4.6以127万美元的最高平均最终资金领先，其次是GLM-5，为121万美元，成本仅为前者的1/11。草稿区的使用，这是唯一一种在上下文截断后保持信息的机制，是成功的关键预测因素，而敌对客户检测是主要的失败模式，占破产案例的47%。我们的分析表明，前沿模型仍然通过不同的失败模式（如过度并行化）失败，这表明了长期表现的能力差距。$\texttt{YC-Bench}$是开源、可重现和可配置的。

Summary / 总结

The paper asks whether LLM agents can maintain strategic coherence over long horizons. It introduces $\texttt{YC-Bench}$, a benchmark in which an agent runs a simulated startup over a one-year horizon spanning hundreds of turns, under uncertainty and delayed feedback. Across 12 models evaluated on 3 seeds each, only three consistently surpass the \$200K starting capital, with Claude Opus 4.6 achieving the highest average final funds (\$1.27 M). Scratchpad usage is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for 47% of bankruptcies. The analysis highlights capability gaps in long-horizon performance among frontier models.

研究旨在评估AI代理在长期规划和一致执行中的战略连贯性。研究引入了$\texttt{YC-Bench}$，一个基准，让代理在一个模拟的创业公司中管理一年的时间，面对不确定性及延迟反馈。评估12个模型后，只有三个模型能够持续超过起始资本，Claude Opus 4.6获得最高的平均最终资金。使用记事本是成功的关键预测因素，而对抗性客户检测是主要的失败模式，占破产的47%。分析揭示了前沿模型在长期表现中的能力差距。
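
The role of a scratchpad under context truncation can be sketched as follows. This is a hypothetical illustration, not the benchmark's actual interface: `TruncatingAgentMemory`, its methods, and the message strings are all invented here to show why notes survive when old turns do not.

```python
from collections import deque

class TruncatingAgentMemory:
    """Hypothetical agent memory: a bounded context plus a persistent scratchpad."""

    def __init__(self, max_context_turns):
        # Old turns fall off the left end once the limit is reached.
        self.context = deque(maxlen=max_context_turns)
        # The scratchpad is never truncated.
        self.scratchpad = {}

    def observe(self, turn, message):
        self.context.append((turn, message))

    def note(self, key, value):
        self.scratchpad[key] = value

    def build_prompt(self):
        notes = [f"NOTE {k}: {v}" for k, v in sorted(self.scratchpad.items())]
        recent = [f"TURN {t}: {m}" for t, m in self.context]
        return "\n".join(notes + recent)

memory = TruncatingAgentMemory(max_context_turns=3)
memory.observe(1, "Client A refused to pay for a finished contract")
memory.note("client_A", "adversarial, decline future contracts")
for turn in range(2, 8):
    memory.observe(turn, f"routine payroll update {turn}")
prompt = memory.build_prompt()
```

After a few routine turns the original observation about Client A has left the context, and only the scratchpad note still tells the agent to avoid that client.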

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Authors: Youssef Mroueh, Carlos Fonseca, Brian Belgodere, David Cox

First: 2026-04-01T17:51:26+00:00 · Latest: 2026-04-01T17:51:26+00:00

Abstract

Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .

中文标题/摘要

标题：CliffSearch：理论与代码结构化代理共进化以发现科学算法

科学算法发现是迭代的过程：提出假设，实现，压力测试并修订。当前由LLM指导的搜索系统加速了假设的生成，但往往通过仅优化代码片段来弱化对科学结构的表示，缺乏对正确性和原创性的严格筛选。我们提出了CliffSearch，这是一种代理进化框架，其中核心进化操作（配对选择、交叉、变异和审查）由LLM代理实现，循环设计围绕三个原则：（1）每个节点是结构化的科学制品，以理论+代码或代码仅模式实例化，（2）审查者的正确性和原创性判断是与基准度量优化并列的一级选择门，（3）变异分为探索和纠正路径，具有不同的目标。探索变异从相邻科学领域引入新想法以增加新颖性，而纠正变异则使用来自理论、代码、基准结果和运行时错误的审查者信号进行目标导向的修复。我们通过三个基准导向的研究案例说明了该框架：变压器超连接进化，固定nanoGPT堆栈上的优化器发现，以及一个较小的本地优化器消融研究。在这些设置中，相同的循环支持明确的度量方向、可重复的持久性和在受控搜索条件下发现的审查者门控比较。结果是一个优先考虑科学可解释性和正确性，同时在受控新颖性约束下优化任务度量的发现工作流，而不是仅仅最大化候选数量。完整的运行脚本、交互式可视化和报告研究的最佳节点导出可在https://cliffsearch.ai 获取。

Summary / 总结

CliffSearch is an evolutionary framework where LLM agents implement core evolutionary operators to iteratively discover scientific algorithms. It emphasizes structured scientific artifacts, reviewer judgments as selection gates, and two types of mutation: exploration for novelty and correction for evidence-guided repair. Across three studies, CliffSearch supports explicit metric direction, reproducible persistence, and reviewer-gated comparison under controlled conditions, prioritizing interpretability and correctness over throughput alone.

CliffSearch 是一个进化框架，其中 LLM 代理实现核心进化操作以迭代发现科学算法。它强调结构化的科学制品、评审员判断作为选择门以及两种类型的突变：探索以增加新颖性，修正以进行基于证据的修复。在三个研究中，CliffSearch 支持明确的指标方向、可重复的持久性以及在受控条件下由评审员筛选的比较，优先考虑可解释性和正确性而非单纯的最大化候选数量。
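
The loop structure in the abstract (pair selection, crossover, mutation split into exploration and correction, and reviewer gates on correctness and originality) can be sketched with deterministic stubs. Everything below is an assumption for illustration: in CliffSearch the operators are LLM agents, whereas here they are plain functions, a single float `param` stands in for the code artifact, and the metric is a toy.

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    theory: str
    param: float          # stands in for the code artifact (assumption)
    metric: float = 0.0
    correct: bool = False
    original: bool = False

def benchmark(node):
    # Toy metric with explicit direction: higher is better.
    return -(node.param - 3.0) ** 2

def review(node, population):
    # Stub reviewer: "correct" = parameter in a valid range,
    # "original" = not a near-duplicate of an existing node.
    node.correct = 0.0 <= node.param <= 10.0
    node.original = all(abs(node.param - other.param) > 1e-3 for other in population)
    node.metric = benchmark(node)
    return node

def crossover(a, b):
    return Node(theory=f"blend({a.theory},{b.theory})", param=0.5 * (a.param + b.param))

def explore(node, rng):
    # Exploration mutation: perturb to increase novelty.
    return Node(theory=node.theory + "+idea", param=node.param + rng.gauss(0.0, 2.0))

def repair(node):
    # Correction mutation: targeted fix driven by the reviewer's failure signal.
    return Node(theory=node.theory + "+fix", param=min(max(node.param, 0.0), 10.0))

def evolve(generations=30, seed=0):
    rng = random.Random(seed)
    population = []
    for start in (0.5, 9.0):
        population.append(review(Node("seed", start), population))
    for _ in range(generations):
        parents = sorted(population, key=lambda n: n.metric, reverse=True)[:2]
        child = review(explore(crossover(*parents), rng), population)
        if not child.correct:
            child = review(repair(child), population)
        # Reviewer judgments gate admission alongside the metric.
        if child.correct and child.original:
            population.append(child)
    return max(population, key=lambda n: n.metric)

best = evolve()
```

The point of the sketch is the gating: a candidate with a good metric is still rejected if the reviewer marks it incorrect or unoriginal.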

TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking

Authors: Jiyuan Hu, Zechuan Zhang, Zongxin Yang, Yi Yang

First: 2026-04-01T17:51:00+00:00 · Latest: 2026-04-01T17:51:00+00:00

Comments: 22 pages, 9 figures

Abstract

We present TRACE, a mesh-guided 3DGS editing framework that achieves automated, high-fidelity scene transformation. By anchoring video diffusion with explicit 3D geometry, TRACE uniquely enables fine-grained, part-level manipulation (such as local pose shifting or component replacement) while preserving the structural integrity of the central subject, a capability largely absent in existing editing methods. Our approach comprises three key stages: (1) Multi-view 3D-Anchor Synthesis, which leverages a sparse-view editor trained on our MV-TRACE dataset (the first multi-view consistent dataset dedicated to scene-coherent object addition and modification) to generate spatially consistent 3D-anchors; (2) Tangible Geometry Anchoring (TGA), which ensures precise spatial synchronization between inserted meshes and the 3DGS scene via two-phase registration; and (3) Contextual Video Masking (CVM), which integrates 3D projections into an autoregressive video pipeline to achieve temporally stable, physically-grounded rendering. Extensive experiments demonstrate that TRACE consistently outperforms existing methods, especially in editing versatility and structural integrity.

中文标题/摘要

标题：TRACE：通过可触控重建和几何对齐上下文视频掩模的高保真3D场景编辑

我们提出了TRACE，一种基于网格的3DGS编辑框架，实现了自动化的高保真场景转换。通过将视频扩散与显式的3D几何结构相结合，TRACE独特地实现了细粒度、部分级别的操作——如局部姿态偏移或组件替换——同时保持了中心主体的结构完整性，这是现有编辑方法中很少具备的能力。我们的方法包括三个关键阶段：（1）多视角3D锚合成，利用在我们MV-TRACE数据集上训练的稀疏视角编辑器——这是第一个专门用于场景一致对象添加和修改的多视角一致数据集——生成空间一致的3D锚；（2）可触控几何锚定（TGA），通过两阶段注册确保插入网格与3DGS场景的精确空间同步；（3）上下文视频掩模（CVM），将3D投影整合到自回归视频管道中，实现时间上稳定且物理上可靠的渲染。大量实验表明，TRACE在编辑灵活性和结构完整性方面始终优于现有方法。

Summary / 总结

TRACE is a mesh-guided 3D scene editing framework that uses explicit 3D geometry to enable fine-grained manipulations like local pose shifting and component replacement while preserving the central subject's structural integrity. It consists of three stages: multi-view 3D-anchor synthesis, tangible geometry anchoring, and contextual video masking. Experiments show that TRACE outperforms existing methods in editing versatility and structural integrity.

TRACE 是一个基于网格的 3D 场景编辑框架，通过使用显式的 3D 几何来实现精细的局部姿态调整和组件替换，同时保持主体的结构完整性。它包括三个阶段：多视图 3D 锚点合成、实体几何锚定和上下文视频遮罩。实验表明，TRACE 在编辑灵活性和结构完整性方面优于现有方法。

Neural Harmonic Textures for High-Quality Primitive Based Neural Reconstruction

Authors: Jorge Condor, Nicolas Moenne-Loccoz, Merlin Nimier-David, Piotr Didyk, Zan Gojcic, Qi Wu

First: 2026-04-01T17:48:22+00:00 · Latest: 2026-04-01T17:48:22+00:00

Abstract

Primitive-based methods such as 3D Gaussian Splatting have recently become the state-of-the-art for novel-view synthesis and related reconstruction tasks. Compared to neural fields, these representations are more flexible, adaptive, and scale better to large scenes. However, the limited expressivity of individual primitives makes modeling high-frequency detail challenging. We introduce Neural Harmonic Textures, a neural representation approach that anchors latent feature vectors on a virtual scaffold surrounding each primitive. These features are interpolated within the primitive at ray intersection points. Inspired by Fourier analysis, we apply periodic activations to the interpolated features, turning alpha blending into a weighted sum of harmonic components. The resulting signal is then decoded in a single deferred pass using a small neural network, significantly reducing computational cost. Neural Harmonic Textures yield state-of-the-art results in real-time novel view synthesis while bridging the gap between primitive- and neural-field-based reconstruction. Our method integrates seamlessly into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS. We further demonstrate its generality with applications to 2D image fitting and semantic reconstruction.

中文标题/摘要

标题：基于神经元谐波纹理的高质量基本体神经重建

基于基本体的方法，如3D高斯点云，已成为新型视图合成和相关重建任务的最新技术。与神经场相比，这些表示更加灵活、适应性强且更适合大规模场景。然而，单个基本体的有限表现力使得高频率细节建模具有挑战性。我们引入了神经元谐波纹理，这是一种神经表示方法，将潜在特征向量锚定在每个基本体周围的虚拟支架上。这些特征在射线交点处进行内插。受傅里叶分析的启发，我们对内插特征应用周期激活，将alpha混合转换为谐波分量的加权和。然后使用小型神经网络在单个延迟通过中解码该信号，显著降低了计算成本。神经元谐波纹理在实时新型视图合成中达到了最新技术水平，同时弥合了基于基本体和基于神经场重建之间的差距。我们的方法无缝集成到现有的基于基本体的管道中，如3DGUT、三角点云和2DGS。我们还通过2D图像拟合和语义重建的应用进一步展示了其通用性。

Summary / 总结

The paper targets the limited expressivity of individual primitives, which makes high-frequency detail hard to model in primitive-based reconstruction. It introduces Neural Harmonic Textures, which anchor latent feature vectors on a virtual scaffold around each primitive, interpolate them at ray intersection points, and apply periodic activations to the interpolated features so that alpha blending becomes a weighted sum of harmonic components. Decoding the blended signal in a single deferred pass with a small neural network reduces computational cost. The method achieves state-of-the-art results in real-time novel view synthesis, integrates into existing primitive-based pipelines such as 3DGUT, Triangle Splatting, and 2DGS, and extends to 2D image fitting and semantic reconstruction.

研究旨在通过提高基于基本体的方法来更好地建模高频细节。提出了神经谐波纹理，该方法在每个基本体周围的虚拟支架上使用潜在特征向量，并应用周期激活来插值这些特征。这种方法减少了计算成本，并在实时新视角合成中取得了最先进的成果，同时保持了灵活性和可扩展性。该方法能够很好地与现有的基于基本体的管道集成，并扩展到2D应用中。
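
The per-ray computation described in the abstract can be sketched in NumPy: periodic activations on interpolated features, alpha blending as a weighted sum of harmonic components, then one deferred decode. The feature values, opacities, layer sizes, and the sine activation are assumptions chosen for illustration; the paper's interpolation scheme and decoder are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_primitives, feat_dim, hidden = 5, 8, 16

# Features already interpolated at the ray/primitive intersection points
# (random placeholders here), ordered front to back along one ray.
features = rng.normal(size=(n_primitives, feat_dim))
alphas = rng.uniform(0.1, 0.6, size=n_primitives)

# Standard front-to-back alpha-blending weights: w_i = a_i * prod_{j<i} (1 - a_j).
transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
weights = alphas * transmittance

# Periodic activation first, then blending: a weighted sum of harmonic components.
harmonics = np.sin(features)
blended = weights @ harmonics

# Single deferred decode with a tiny MLP (random weights, illustration only).
w1, b1 = rng.normal(size=(feat_dim, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, 3)), np.zeros(3)

def decode(v):
    return np.maximum(v @ w1 + b1, 0.0) @ w2 + b2

rgb = decode(blended)
```

The decoder runs once per ray on the blended signal rather than once per primitive, which is where the cost saving described in the abstract would come from in this simplified picture.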

Therefore I am. I Think

Authors: Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani

First: 2026-04-01T17:46:23+00:00 · Latest: 2026-04-01T17:46:23+00:00

Abstract

We consider the question: when a large language reasoning model makes a choice, did it think first and then decide, or decide first and then think? In this paper, we present evidence that detectable, early-encoded decisions shape chain-of-thought in reasoning models. Specifically, we show that a simple linear probe successfully decodes tool-calling decisions from pre-generation activations with very high confidence, and in some cases, even before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction leads to inflated deliberation, and flips behavior in many examples (between 7% and 79%, depending on model and benchmark). We also show through behavioral analysis that, when steering changes the decision, the chain-of-thought process often rationalizes the flip rather than resisting it. Together, these results suggest that reasoning models can encode action choices before they begin to deliberate in text.

中文标题/摘要

标题：因此我存在。我思考

我们考虑这样一个问题：当一个大型语言推理模型做出选择时，它是先思考后决定，还是先决定后思考。在本文中，我们提供了证据表明，可检测的、早期编码的决策影响推理模型的推理过程。具体来说，我们展示了简单的线性探针能够以极高的置信度从预生成激活中解码工具调用决策，并且在某些情况下，甚至在生成第一个推理标记之前。激活导向支持这一因果关系：扰动决策方向会导致更多的思考，并在许多示例中改变行为（根据模型和基准的不同，比例在7%到79%之间）。我们还通过行为分析表明，当导向改变决策时，推理过程往往为这种转变进行合理化，而不是抵制它。这些结果共同表明，推理模型可以在开始文本推理之前就编码行动选择。

Summary / 总结

This paper investigates whether large language reasoning models think before deciding or decide before thinking. The authors present evidence that early-encoded decisions shape chain-of-thought: a simple linear probe decodes tool-calling decisions from pre-generation activations with very high confidence, in some cases before a single reasoning token is produced. Activation steering supports this causally: perturbing the decision direction inflates deliberation and flips behavior in 7% to 79% of examples, depending on model and benchmark. Behavioral analysis indicates that when steering changes the decision, the chain-of-thought often rationalizes the flip rather than resisting it, suggesting that reasoning models can encode action choices before text deliberation begins.

本文探讨了大型语言推理模型是在思考后再决定，还是在决定后再思考。研究发现，早期编码的决策显著影响推理过程，一个简单的线性探针能够在生成任何推理令牌之前，以高置信度解码工具调用决策。激活引导进一步证实了这一点，显示扰动决策方向会增加推理并改变模型行为，在许多情况下，扰动变化的决策会导致链式思考过程进行自我合理化，而不是抵抗变化，这表明推理模型可以在文本推理开始前编码行动选择。
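
The two tools named in the abstract, a linear probe and activation steering, can be sketched on synthetic data. The activations below are random vectors with a planted "decision direction", not model activations, and the probe is plain logistic regression trained by gradient descent; the dimensions, learning rate, and steering strength are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 32

# Synthetic "pre-generation activations" with a planted decision direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)                 # 1 = "call the tool"
acts = rng.normal(size=(n, d)) + 3.0 * (2 * labels[:, None] - 1) * true_dir

# Linear probe: logistic regression fitted with batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * (acts.T @ (p - labels) / n)
    b -= 0.5 * np.mean(p - labels)

def predict(a):
    return (a @ w + b > 0).astype(int)

# Accuracy on the training set (no held-out split in this toy).
accuracy = np.mean(predict(acts) == labels)

# Steering: push "call the tool" activations against the probe direction
# and count how often the probe's decision flips.
direction = w / np.linalg.norm(w)
steered = acts[labels == 1] - 8.0 * direction
flip_rate = np.mean(predict(steered) == 0)
```

In a real study the flip would be measured on the model's actual behavior after steering, not on the probe's own prediction as done here.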

SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization

Authors: Ishrith Gowda, Chunwei Liu

Venue: MICCAI 2026

First: 2026-03-17T23:49:46+00:00 · Latest: 2026-04-01T17:34:06+00:00

Comments: 12 pages, 5 figures, 5 tables. Submitted to MICCAI 2026

Abstract

Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $\mathcal{H}\Delta\mathcal{H}$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen's $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.

Summary / 总结

SA-CycleGAN-2.5D is a domain adaptation framework designed to harmonize MRI images across different acquisition protocols. It integrates a 2.5D tri-planar manifold injection, a U-ResNet generator with dense self-attention, and a spectrally-normalized discriminator. Evaluated on 654 glioma patients from two institutional domains (BraTS and UPenn-GBM), the method reduces Maximum Mean Discrepancy by 99.1% (1.729 to 0.015) and degrades domain classifier accuracy to near chance (59.7%). An ablation indicates that global attention matters most for the harder heterogeneous-to-homogeneous translation direction.

SA-CycleGAN-2.5D 是一种用于跨不同站点对 MRI 扫描进行谐调的域适应框架，旨在解决由扫描器引起的协变量偏移问题。该框架结合了2.5D 三平面流形注入、具有密集自注意力的 U-ResNet 生成器和谱归一化判别器。在来自两个机构的654名胶质瘤患者的数据上进行评估，该方法显著降低了最大均值偏差99.1%，并将域分类器的准确性降低到接近随机水平。研究证实，全局注意力对于有效跨异质数据进行谐调至关重要。
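
The headline metric can be made concrete with a small sketch. It computes a biased estimate of squared MMD with a Gaussian kernel on synthetic features and checks the arithmetic behind the reported 99.1% reduction. The kernel, bandwidth, synthetic "sites", and mean-matching "harmonization" are assumptions for illustration; the abstract does not state which MMD estimator or kernel the authors used.

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y (Gaussian kernel)."""
    def kernel(a, b):
        sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq_dist / (2.0 * bandwidth**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

rng = np.random.default_rng(0)
site_a = rng.normal(0.0, 1.0, size=(300, 4))
site_b = rng.normal(1.5, 1.0, size=(300, 4))      # same shape, shifted intensities

# Toy "harmonization": match the mean of site B to site A.
harmonized_b = site_b - site_b.mean(axis=0) + site_a.mean(axis=0)

before = rbf_mmd2(site_a, site_b)
after = rbf_mmd2(site_a, harmonized_b)

# Relative reduction implied by the values quoted in the abstract.
reported_reduction = (1.729 - 0.015) / 1.729
```

The last line confirms that the two quoted values are consistent with the quoted percentage: 1.714 / 1.729 rounds to 99.1%.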

NeuroDDAF: Neural Dynamic Diffusion-Advection Fields with Evidential Fusion for Air Quality Forecasting

Authors: Prasanjit Dey, Soumyabrata Dev, Angela Meyer, Bianca Schoen-Phelan

First: 2026-04-01T17:27:43+00:00 · Latest: 2026-04-01T17:27:43+00:00

Comments: This manuscript is under review

Abstract

Accurate air quality forecasting is crucial for protecting public health and guiding environmental policy, yet it remains challenging due to nonlinear spatiotemporal dynamics, wind-driven transport, and distribution shifts across regions. Physics-based models are interpretable but computationally expensive and often rely on restrictive assumptions, whereas purely data-driven models can be accurate but may lack robustness and calibrated uncertainty. To address these limitations, we propose Neural Dynamic Diffusion-Advection Fields (NeuroDDAF), a physics-informed forecasting framework that unifies neural representation learning with open-system transport modeling. NeuroDDAF integrates (i) a GRU-Graph Attention encoder to capture temporal dynamics and wind-aware spatial interactions, (ii) a Fourier-domain diffusion-advection module with learnable residuals, (iii) a wind-modulated latent Neural ODE to model continuous-time evolution under time-varying connectivity, and (iv) an evidential fusion mechanism that adaptively combines physics-guided and neural forecasts while quantifying uncertainty. Experiments on four urban datasets (Beijing, Shenzhen, Tianjin, and Ancona) across 1-3 day horizons show that NeuroDDAF consistently outperforms strong baselines, including AirPhyNet, achieving up to 9.7% reduction in RMSE and 9.4% reduction in MAE on long-term forecasts. On the Beijing dataset, NeuroDDAF attains an RMSE of 41.63 $\mu$g/m$^3$ for 1-day prediction and 48.88 $\mu$g/m$^3$ for 3-day prediction, representing the best performance among all compared methods. In addition, NeuroDDAF improves cross-city generalization and yields well-calibrated uncertainty estimates, as confirmed by ensemble variance analysis and case studies under varying wind conditions.

中文标题/摘要

标题：NeuroDDAF：神经动态扩散-对流场及其证据融合在空气质量预报中的应用

准确的空气质量预报对于保护公众健康和指导环境政策至关重要，但由于非线性时空动态、风驱动的传输以及区域间的分布变化，这一任务仍然具有挑战性。基于物理的模型具有可解释性，但计算成本高且往往依赖于严格的假设，而纯数据驱动的模型虽然准确，但可能缺乏鲁棒性和校准的不确定性。为了解决这些限制，我们提出了一种名为Neural Dynamic Diffusion-Advection Fields (NeuroDDAF)的基于物理的预报框架，该框架将神经表示学习与开放系统传输建模统一起来。NeuroDDAF结合了(i)一个GRU-图注意力编码器来捕捉时间动态和风向感知的空间交互，(ii)一个傅里叶域扩散-对流模块，具有可学习的残差，(iii)一个风调制的潜在神经ODE来在时间变化的连接下建模连续时间演变，以及(iv)一种证据融合机制，该机制能够适应性地结合物理指导和神经预报，并量化不确定性。在四个城市数据集（北京、深圳、天津和安科纳）上进行的1-3天预报实验表明，NeuroDDAF在所有基准方法中表现最佳，长期预报的RMSE和MAE分别降低了9.7%和9.4%。在北京数据集上，NeuroDDAF的1天预测RMSE为41.63 μg/m³，3天预测为48.88 μg/m³，是所有比较方法中表现最好的。此外，NeuroDDAF提高了跨城市的泛化能力，并提供了良好的不确定性估计，这得到了聚类方差分析和不同风况下的案例研究的证实。

Summary / 总结

NeuroDDAF is a physics-informed forecasting framework that combines neural representation learning with open-system transport modeling to address the challenges of air quality forecasting. It uses a GRU-Graph Attention encoder for temporal and spatial dynamics, a Fourier-domain diffusion-advection module, a wind-modulated Neural ODE for continuous-time evolution, and an evidential fusion mechanism for uncertainty quantification. Experiments on four urban datasets show that NeuroDDAF outperforms strong baselines, reducing RMSE and MAE by up to 9.7% and 9.4% respectively, and achieving the best performance on the Beijing dataset for 1- and 3-day predictions. Additionally, NeuroDDAF shows improved cross-city generalization and well-calibrated uncertainty estimates.

NeuroDDAF 是一种结合神经表示学习和开放系统传输建模的物理启发式预测框架，用于空气质量预报。它使用 GRU-Graph Attention 编码器来捕捉时间和空间动态，使用傅里叶域扩散-对流模块来建模风驱动的传输，使用神经 ODE 来处理连续时间演化，并使用证据融合机制来结合物理导向和神经预测。实验结果显示，NeuroDDAF 在四个城市数据集上的表现优于强基线，长期预测的 RMSE 和 MAE 分别最多减少 9.7% 和 9.4%，在北京的数据集上，1- 和 3 天预测的 RMSE 分别达到 41.63 μg/m³ 和 48.88 μg/m³，是所有比较方法中最好的。此外，NeuroDDAF 提高了跨城市的一般化能力和提供了良好的不确定性估计。
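
Why a Fourier-domain treatment suits diffusion-advection can be shown with a textbook example. The sketch assumes a 1D periodic domain, constant wind speed, and constant diffusivity, and it omits the learnable residuals; it is not the module from the paper, whose details the abstract does not give. Under those assumptions the equation u_t + c u_x = D u_xx decouples per wavenumber k, and each Fourier coefficient is multiplied by exp(-(D k^2 + i c k) t).

```python
import numpy as np

def diffusion_advection_step(u, velocity, diffusivity, dt, length=2.0 * np.pi):
    """Advance u_t + c u_x = D u_xx exactly by dt on a periodic 1D grid."""
    n = u.size
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)    # angular wavenumbers
    u_hat = np.fft.fft(u)
    # Damping from diffusion (real part) and phase shift from advection (imaginary part).
    u_hat *= np.exp(-(diffusivity * k**2 + 1j * velocity * k) * dt)
    return np.fft.ifft(u_hat).real

n = 128
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
u0 = np.sin(x)                                           # initial concentration profile
u1 = diffusion_advection_step(u0, velocity=1.0, diffusivity=0.1, dt=0.5)

# Closed-form solution for this initial condition: decayed and shifted downwind.
exact = np.exp(-0.1 * 0.5) * np.sin(x - 1.0 * 0.5)
max_abs_error = np.abs(u1 - exact).max()
```

The step is exact for any time step in this idealised setting, which is one reason spectral transport updates are attractive as a physics prior.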

Safe learning-based control via function-based uncertainty quantification

Authors: Abdullah Tokmak, Toni Karvonen, Thomas B. Schön, Dominik Baumann

First: 2026-04-01T17:23:30+00:00 · Latest: 2026-04-01T17:23:30+00:00

Comments: Under review for CDC 2026

Abs · PDF · Code1 · Code2

Abstract

Uncertainty quantification is essential when deploying learning-based control methods in safety-critical systems. This is commonly realized by constructing uncertainty tubes that enclose the unknown function of interest, e.g., the reward and constraint functions or the underlying dynamics model, with high probability. However, existing approaches for uncertainty quantification typically rely on restrictive assumptions on the unknown function, such as known bounds on functional norms or Lipschitz constants, and struggle with discontinuities. In this paper, we model the unknown function as a random function from which independent and identically distributed realizations can be generated, and construct uncertainty tubes via the scenario approach that hold with high probability and rely solely on the sampled realizations. We integrate these uncertainty tubes into a safe Bayesian optimization algorithm, which we then use to safely tune control parameters on a real Furuta pendulum.
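The idea of building a tube purely from sampled realizations can be sketched as follows. This is a simplified illustration, not the paper's construction: it takes the pointwise minimum and maximum over sampled functions as the tube and then measures how often fresh realizations leave it. The random-function model `sample_realization` (a discontinuous step plus a random slope) and all constants are hypothetical.

```python
import random

def sample_realization(rng, grid):
    """Hypothetical random function: a step of random height and location plus a random slope."""
    jump_at = rng.uniform(0.3, 0.7)
    height = rng.uniform(0.5, 1.5)
    slope = rng.gauss(0.0, 0.2)
    return [slope * x + (height if x >= jump_at else 0.0) for x in grid]

def envelope_tube(realizations):
    """Pointwise lower and upper envelope of the sampled functions."""
    lower = [min(column) for column in zip(*realizations)]
    upper = [max(column) for column in zip(*realizations)]
    return lower, upper

rng = random.Random(0)
grid = [i / 20 for i in range(21)]
samples = [sample_realization(rng, grid) for _ in range(200)]
lower, upper = envelope_tube(samples)

# Empirical check: fraction of fresh realizations that leave the tube anywhere on the grid.
fresh = [sample_realization(rng, grid) for _ in range(1000)]
violations = sum(
    any(y < lo or y > hi for y, lo, hi in zip(f, lower, upper)) for f in fresh
)
violation_rate = violations / len(fresh)
```

The envelope needs no Lipschitz constant or norm bound and is unaffected by the jump in each realization. The scenario approach is what would turn the sample count into a formal high-probability guarantee; that step is not reproduced here.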

中文标题/摘要

标题：基于函数不确定性量化的方法实现学习导向控制的安全性

在安全关键系统中部署基于学习的控制方法时，不确定性量化是必不可少的。这通常通过构建不确定性管来实现，不确定性管以高概率包围未知函数，例如奖励函数、约束函数或潜在的动力学模型。然而，现有的不确定性量化方法通常依赖于对未知函数的限制性假设，如已知的功能范数或利普希茨常数的边界，并且难以处理不连续性。在本文中，我们将未知函数建模为一个随机函数，可以从该函数中生成独立同分布的实现，并通过场景方法构建不确定性管，这些不确定性管以高概率成立且仅依赖于采样的实现。我们将这些不确定性管整合到一个安全的贝叶斯优化算法中，然后使用该算法在实际的福尔图摆上安全地调整控制参数。

Summary / 总结

The research aims to make learning-based control safer in safety-critical systems through uncertainty quantification that avoids restrictive assumptions. The unknown function is modeled as a random function from which independent and identically distributed realizations can be generated, and the scenario approach turns these sampled realizations into uncertainty tubes that hold with high probability. Because the tubes rely solely on the samples, they do not require known bounds on functional norms or Lipschitz constants, assumptions under which existing approaches struggle with discontinuities. The tubes are integrated into a safe Bayesian optimization algorithm, which is used to safely tune control parameters on a real Furuta pendulum.

研究旨在通过量化不确定性来提高关键系统中基于学习的控制方法的安全性。方法是将未知函数建模为随机函数，并生成独立实现来构建不确定性管。关键发现是，这种方法不依赖于限制性假设，能够处理不连续性，并集成到安全的贝叶斯优化算法中，用于在实际Furuta摆上安全调整控制参数。

Open-Set Supervised 3D Anomaly Detection: An Industrial Dataset and a Generalisable Framework for Unknown Defects

Authors: Hanzhe Liang, Luocheng Zhang, Junyang Xia, HanLiang Zhou, Bingyang Guo, Yingxi Xie, Can Gao, Ruiyun Yu, Jinbao Wang, Pan Li

First: 2026-04-01T17:23:23+00:00 · Latest: 2026-04-01T17:23:23+00:00

Comments: Resources: https://github.com/hzzzzzhappy/open-industry

Abs · PDF · Code1 · Code2 · Code3

Abstract

Although self-supervised 3D anomaly detection assumes that acquiring high-precision point clouds is computationally expensive, in real manufacturing scenarios it is often feasible to collect a limited number of anomalous samples. Therefore, we study open-set supervised 3D anomaly detection, where the model is trained with only normal samples and a small number of known anomalous samples, aiming to identify unknown anomalies at test time. We present Open-Industry, a high-quality industrial dataset containing 15 categories, each with five real anomaly types collected from production lines. We first adapt general open-set anomaly detection methods to accommodate 3D point cloud inputs better. Building upon this, we propose Open3D-AD, a point-cloud-oriented approach that leverages normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. Then, we introduce a simple Correspondence Distributions Subsampling to reduce the overlap between normal and non-normal distributions, enabling stronger dual distributions modeling. Based on these contributions, we establish a comprehensive benchmark and evaluate the proposed method extensively on Open-Industry as well as established datasets including Real3D-AD and Anomaly-ShapeNet. Benchmark results and ablation studies demonstrate the effectiveness of Open3D-AD and further reveal the potential of open-set supervised 3D anomaly detection.

中文标题/摘要

标题：开放集监督3D异常检测：一种针对未知缺陷的工业数据集和可泛化的框架

尽管自监督3D异常检测假设获取高精度点云计算成本高昂，但在实际制造场景中，收集少量异常样本往往是可行的。因此，我们研究了开放集监督3D异常检测，其中模型仅使用正常样本和少量已知异常样本进行训练，旨在测试时识别未知异常。我们提出了Open-Industry，一个高质量的工业数据集，包含15个类别，每个类别有五种实际的异常类型，来自生产线。我们首先将通用的开放集异常检测方法适应于更好地处理3D点云输入。在此基础上，我们提出了Open3D-AD，一种面向点云的方法，利用正常样本、模拟异常和部分观察到的真实异常来建模正常和异常数据的概率密度分布。然后，我们引入了一种简单的对应分布子采样方法，以减少正常和非正常分布之间的重叠，从而增强双分布建模。基于这些贡献，我们建立了一个全面的基准，并在Open-Industry以及Real3D-AD和Anomaly-ShapeNet等标准数据集上广泛评估了所提出的方法。基准结果和消融研究证明了Open3D-AD的有效性，并进一步揭示了开放集监督3D异常检测的潜力。

Summary / 总结

This paper addresses open-set supervised 3D anomaly detection by training models with only normal and a few known anomalous samples to identify unknown anomalies. It introduces Open-Industry, a high-quality industrial dataset, and proposes Open3D-AD, a point-cloud-oriented approach that uses normal samples, simulated anomalies, and partially observed real anomalies to model the probability density distributions of normal and anomalous data. The method also includes a Correspondence Distributions Subsampling technique to reduce distribution overlap. Extensive evaluations on Open-Industry and other datasets show the effectiveness of Open3D-AD and highlight the potential of open-set supervised 3D anomaly detection.

该研究针对仅使用正常样本和少量已知异常样本进行训练以识别未知异常的开放集监督3D异常检测问题。引入了高质量的工业数据集Open-Industry，并提出了Open3D-AD，该方法利用正常样本、模拟异常和部分观察的真实异常来建模正常和异常数据的概率密度分布。该方法还包含一种对应分布子采样技术以减少分布重叠。在Open-Industry和其他数据集上的广泛评估表明Open3D-AD的有效性，并揭示了开放集监督3D异常检测的潜力。

Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning

Authors: Cai Zhou, Zekai Wang, Menghua Wu, Qianyu Julie Zhu, Flora C. Shi, Chenyu Wang, Ashia Wilson, Tommi Jaakkola, Stephen Bates

First: 2026-04-01T17:21:50+00:00 · Latest: 2026-04-01T17:21:50+00:00

Comments: 20 pages

Abs · PDF · Code1 · Code2 · Code3

Abstract

While test-time scaling has enabled large language models to solve highly difficult tasks, state-of-the-art results come at exorbitant compute costs. These inefficiencies can be attributed to the miscalibration of post-trained language models, and the lack of calibration in popular sampling techniques. Here, we present Online Reasoning Calibration (ORCA), a framework for calibrating the sampling process that draws upon conformal prediction and test-time training. Specifically, we introduce a meta-learning procedure that updates the calibration module for each input. This allows us to provide valid confidence estimates under distributional shift, e.g. in thought patterns that occur across different stages of reasoning, or in prompt distributions between model development and deployment. ORCA not only provides theoretical guarantees on conformal risks, but also empirically shows higher efficiency and generalization across different reasoning tasks. At risk level $δ=0.1$, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings up to 47.5% with supervised labels and 40.7% with self-consistency labels. Under zero-shot out-of-domain settings, it improves MATH-500 savings from 24.8% of the static calibration baseline to 67.0% while maintaining a low empirical error rate, and the same trend holds across model families and downstream benchmarks. Our code is publicly available at https://github.com/wzekai99/ORCA.
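For readers unfamiliar with conformal calibration, the sketch below shows the standard split-conformal threshold at risk level δ: the ⌈(n+1)(1−δ)⌉-th smallest calibration score. This is the static textbook procedure, not ORCA's per-input meta-learned update. The function name and the synthetic scores are assumptions for illustration.

```python
import math

def conformal_threshold(calibration_scores, delta):
    """Split-conformal quantile: the ceil((n+1)(1-delta))-th smallest score.

    Returns infinity when the calibration set is too small to certify level delta.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - delta))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]

# Synthetic nonconformity scores 0.01, 0.02, ..., 1.00 (hypothetical data).
scores = [i / 100 for i in range(1, 101)]
threshold = conformal_threshold(scores, delta=0.1)

# With only three calibration points, no finite threshold certifies delta = 0.1.
tiny = conformal_threshold([0.2, 0.5, 0.1], delta=0.1)
```

A fixed threshold of this kind is only valid when calibration and test data are exchangeable. That is the gap the abstract says test-time updates are meant to close under distributional shift.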

中文标题/摘要

标题：在线推理校准：测试时训练使通用可形语言模型推理具备泛化能力

尽管测试时扩展使大型语言模型能够解决极其困难的任务，但最先进的结果却伴随着高昂的计算成本。这些低效率可以归因于后训练语言模型的校准不准确，以及流行采样技术缺乏校准。在此，我们提出了在线推理校准（ORCA）框架，该框架利用了可形预测和测试时训练来校准采样过程。具体而言，我们引入了一种元学习程序，该程序为每个输入更新校准模块。这使我们能够在分布变化的情况下提供有效的置信度估计，例如在不同推理阶段出现的思维模式变化，或在模型开发与部署之间的提示分布变化。ORCA不仅提供了可形风险的理论保证，还通过实验证明了在不同推理任务上的更高效率和泛化能力。在风险水平δ=0.1的情况下，ORCA在有监督标签下将Qwen2.5-32B的同分布任务效率提高了47.5%，在自我一致性标签下提高了40.7%。在零样本跨域设置下，它将静态校准基线的MATH-500节省率从24.8%提高到67.0%，同时保持了较低的经验错误率，且该趋势在不同模型家族和下游基准测试中保持一致。我们的代码可在https://github.com/wzekai99/ORCA公开获取。

Summary / 总结

The paper introduces Online Reasoning Calibration (ORCA), a framework that calibrates the sampling process of large language models using conformal prediction and test-time training. A meta-learning procedure updates the calibration module for each input, providing valid confidence estimates under distributional shift. At risk level δ=0.1, ORCA improves Qwen2.5-32B efficiency on in-distribution tasks with savings of up to 47.5% with supervised labels and 40.7% with self-consistency labels. In zero-shot out-of-domain settings, it raises MATH-500 savings from 24.8% for the static calibration baseline to 67.0% while maintaining a low empirical error rate.

研究旨在通过校准采样过程来解决大型语言模型的低效问题。ORCA 使用 conformal 预测和测试时训练来为每个输入更新校准模块，提供在分布变化下的有效置信估计。该方法在各种推理任务中提高了效率和泛化能力，在分布内任务中实现了高达 47.5% 的效率提升，在零样本跨域设置中将节省比例从静态校准基线的 24.8% 提高到 67.0%，并且在不同模型家族和下游基准测试中保持了较低的经验错误率。

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models

Authors: Jack Young

First: 2026-04-01T17:21:15+00:00 · Latest: 2026-04-01T17:21:15+00:00

Comments: 15 pages (10 main + 5 appendix), 3 figures, code at https://github.com/jackyoung27/s0-tuning

Abs · PDF · Code1 · Code2 · Code3

Abstract

Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval. The method, which we call S0 tuning, optimizes one state matrix per recurrent layer while freezing all model weights. On Qwen3.5-4B (GatedDeltaNet hybrid), S0 tuning improves greedy pass@1 by +23.6 +/- 1.7 pp (10 seeds). On FalconH1-7B (Mamba-2 hybrid), S0 reaches 71.8% +/- 1.3 and LoRA reaches 71.4% +/- 2.4 (3 seeds), statistically indistinguishable at this sample size while requiring no weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp, p = 0.00002, 8 seeds) and GSM8K (+2.8 pp, p = 0.0003, 10 seeds); a text-to-SQL benchmark (Spider) shows no transfer, consistent with the trajectory-steering mechanism. A prefix-tuning control on a pure Transformer (Qwen2.5-3B) degrades performance by -13.9 pp under all nine configurations tested. On Qwen3.5, a per-step state-offset variant reaches +27.1 pp, above both S0 and LoRA but with per-step inference cost. Taken together, the results show that recurrent state initialization is a strong zero-inference-overhead PEFT surface for hybrid language models when verified supervision is scarce. The tuned state is a ~48 MB file; task switching requires no weight merging or model reload. Code and library: https://github.com/jackyoung27/s0-tuning.
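The core mechanism, tuning only the initial recurrent state while every weight stays frozen, can be illustrated on a toy scalar linear recurrence. This is a sketch under strong simplifying assumptions: real hybrid models carry a state matrix per layer, and the recurrence, data, learning rate, and function names below are hypothetical rather than taken from the S0 tuning code.

```python
def run_recurrence(h0, inputs, a, b, c):
    """Frozen toy recurrent layer: h_t = a*h_{t-1} + b*x_t, output = c*h_T."""
    h = h0
    for x in inputs:
        h = a * h + b * x
    return c * h

def mse(h0, data, a, b, c):
    return sum((run_recurrence(h0, xs, a, b, c) - y) ** 2 for xs, y in data) / len(data)

def tune_initial_state(h0, data, a, b, c, lr=0.1, steps=200):
    """Gradient descent on h0 only; the weights a, b, c are never updated."""
    for _ in range(steps):
        grad = 0.0
        for xs, y in data:
            err = run_recurrence(h0, xs, a, b, c) - y
            # d(output)/d(h0) = c * a**T for a length-T sequence.
            grad += 2.0 * err * c * (a ** len(xs))
        h0 -= lr * grad / len(data)
    return h0

a, b, c = 0.9, 0.5, 1.0  # frozen weights
# Targets generated with a "true" initial state of 1.0.
data = [([1.0, 0.0], 1.26), ([0.0, 1.0], 1.31), ([1.0, 1.0], 1.76)]

loss_before = mse(0.0, data, a, b, c)
h0_tuned = tune_initial_state(0.0, data, a, b, c)
loss_after = mse(h0_tuned, data, a, b, c)
```

Because the learned quantity is just the starting state, using it at inference changes no weights and adds no per-token computation, consistent with the "zero inference overhead" claim in the abstract.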

中文标题/摘要

标题：S0 调优：混合递归注意模型的零开销适应

使用大约48个执行验证的人类评估训练解决方案，对每个递归层的单个初始状态矩阵进行调优，零推理开销情况下，S0调优在人类评估上的表现比LoRA高出10.8个百分点（p < 0.001）。我们称之为S0调优的方法，在优化每个递归层的状态矩阵的同时冻结所有模型权重。在Qwen3.5-4B（GatedDeltaNet混合）上，S0调优使贪婪通过@1提高了23.6 ± 1.7个百分点（10个种子）。在FalconH1-7B（Mamba-2混合）上，S0达到71.8% ± 1.3，而LoRA达到71.4% ± 2.4（3个种子），在样本量下统计上无显著差异，且无需权重合并。跨领域转移在MATH-500上显著提高了4.8个百分点（p = 0.00002，8个种子）和GSM8K上提高了2.8个百分点（p = 0.0003，10个种子）；一个文本到SQL基准（Spider）显示没有转移，这与轨迹引导机制一致。纯Transformer（Qwen2.5-3B）的前缀调优在所有九种测试配置下性能下降了13.9个百分点。在Qwen3.5上，每步状态偏移变体达到了27.1个百分点的提升，高于S0和LoRA，但每步有推理成本。综合来看，结果表明，在验证监督稀缺时，递归状态初始化是混合语言模型的强零推理开销PEFT表面。调优的状态是一个约48 MB的文件；任务切换无需权重合并或模型重新加载。代码和库：https://github.com/jackyoung27/s0-tuning.

Summary / 总结

The research targets parameter-efficient adaptation of hybrid recurrent-attention models with zero inference overhead. S0 tuning optimizes one initial state matrix per recurrent layer while freezing all model weights, and with roughly 48 execution-verified training solutions it outperforms LoRA by 10.8 percentage points on HumanEval. On Qwen3.5-4B, S0 tuning improves greedy pass@1 by 23.6 ± 1.7 percentage points (10 seeds); on FalconH1-7B, it reaches 71.8% ± 1.3 versus 71.4% ± 2.4 for LoRA (3 seeds), statistically indistinguishable at this sample size and without weight merging. Cross-domain transfer is significant on MATH-500 (+4.8 pp) and GSM8K (+2.8 pp) but absent on the text-to-SQL benchmark Spider. A per-step state-offset variant reaches +27.1 pp on Qwen3.5 at the price of per-step inference cost.

研究旨在通过最小的开销提升混合递归注意模型的性能。S0 调整方法在每层递归层优化一个状态矩阵，同时固定其他权重，相比 LoRA 在 HumanEval 上提高了 10.8 个百分点。在 Qwen3.5-4B 上，S0 调整方法将贪婪通过@1 提高了 23.6 个百分点，而在 FalconH1-7B 上，它达到了 71.8%，无需合并权重。该方法在 MATH-500 和 GSM8K 上显示出显著的跨域迁移，但在 Spider 上没有迁移，这与轨迹导向机制一致。纯 Transformer 的前缀调整控制方法在所有九种配置下都降低了性能。调整后的状态是一个小文件，切换任务无需额外的推理成本或重新加载模型。

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

Authors: Mohammad R. Abu Ayyash

First: 2026-04-01T17:08:25+00:00 · Latest: 2026-04-01T17:08:25+00:00

Comments: 26 pages, 13 figures, 4 tables

Abs · PDF · Code1 · Code2

Abstract

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.
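Component (4), constraining a new stack to directions orthogonal to earlier ones, can be sketched with plain vectors. The abstract says the paper obtains the prior subspace via randomized SVD; the sketch below substitutes Gram-Schmidt orthogonalization, which is enough to show the projection itself. The vectors and function names are hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def orthonormal_basis(directions, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the prior directions."""
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            coeff = dot(r, q)
            r = [ri - coeff * qi for ri, qi in zip(r, q)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([ri / norm for ri in r])
    return basis

def project_to_null_space(update, prior_directions):
    """Remove from `update` every component lying in the span of the prior directions."""
    r = list(update)
    for q in orthonormal_basis(prior_directions):
        coeff = dot(r, q)
        r = [ri - coeff * qi for ri, qi in zip(r, q)]
    return r

prior = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]  # directions used by earlier stacks
update = [0.5, -2.0, 3.0, 1.0]                        # raw update for a new stack
projected = project_to_null_space(update, prior)
```

After projection, the new update has no component along anything an earlier stack relied on. This is the intuition behind the "zero forgetting in isolation" property described in the abstract.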

中文标题/摘要

标题：Brainstacks：通过冻结MoE-LoRA堆栈实现跨域认知能力的连续大规模语言模型学习

我们提出了Brainstacks，一种模块化架构，用于连续多域大规模语言模型的微调，将领域专长打包为冻结适配器堆栈，在推理时叠加在共享冻结基底之上。五个相互嵌套的组件：(1) QLoRA 4位量化下的Shazeer风格噪声上2路由的MoE-LoRA；(2) 内循环通过冻结训练堆栈并添加新堆栈进行残差增强；(3) 外循环训练具有课程顺序依赖性的序列领域特定堆栈；(4) 随机SVD的零空间投影，通过约束新堆栈到与先前方向正交的子空间，实现隔离时的零遗忘；(5) 基于结果的Sigmoid元路由器，根据经验发现的领域组合目标训练，选择性地加权堆栈，实现跨域组合。两个边界实验：(6) PSN预训练在随机初始化模型上；(7) 每个领域的RL（DPO/GRPO）验证与后SFT对齐的兼容性。在TinyLlama-1.1B（4个领域，9个堆栈）和Gemma 3 12B IT（5个领域，10个堆栈）上验证，MoE-LoRA比参数匹配的单个LoRA快2.5倍的收敛速度，残差增强突破了单堆栈的天花板，路由系统恢复了由未加门堆栈积累破坏的生成质量。核心发现：基于结果的路由器发现领域堆栈编码的是可转移的认知原语（指令遵循清晰度、数值推理、程序逻辑、链式思维结构）而非领域特定知识，即使在堆栈中没有医学数据的情况下，医学提示在97%的情况下路由到聊天+数学堆栈。

Summary / 总结

Brainstacks is a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base. It has five components: MoE-LoRA with noisy top-2 routing under QLoRA 4-bit quantization, an inner loop for residual boosting, an outer loop that trains sequential domain-specific stacks, null-space projection that constrains new stacks to subspaces orthogonal to prior directions, and an outcome-based sigmoid meta-router. On TinyLlama-1.1B and Gemma 3 12B IT, MoE-LoRA converges 2.5x faster than a parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The key finding is that domain stacks encode transferable cognitive primitives rather than domain-specific knowledge: medical prompts route to chat+math stacks in 97% of cases even though those stacks contain no medical data.

Brainstacks 是一种模块化架构，用于持续多域微调大型语言模型，将领域专长作为冻结适配器堆栈进行包装。它包括五个组件：MoE-LoRA 与 QLoRA 量化、内环进行残差增强、外环进行顺序领域特定训练、零空间投影以防止遗忘以及基于结果的 Sigmoid 门控路由器。实验表明，Brainstacks 实现了更快的收敛、突破了单堆栈天花板，并恢复了生成质量。关键发现是，领域堆栈编码的是可转移的认知原语而非领域特定知识，例如，医疗提示在 97% 的情况下会路由到聊天+数学堆栈，而这些堆栈中没有医疗数据。

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

Authors: Aaron Rose, Carissa Cullen, Brandon Gary Kaplowitz, Christian Schroeder de Witt

First: 2026-04-01T17:08:05+00:00 · Latest: 2026-04-01T17:08:05+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at https://github.com/aaronrose227/narcbench.
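The step from per-agent scores to a group-level decision can be illustrated with two simple aggregators and a pairwise AUROC. The abstract does not specify the paper's five probing techniques, so the `max` and `mean` aggregators, the scores, and the scenario lists below are purely hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def mean(scores):
    return sum(scores) / len(scores)

AGGREGATORS = {"max": max, "mean": mean}

# Each inner list holds hypothetical per-agent deception-probe scores for one scenario.
colluding = [[0.9, 0.1, 0.1], [0.6, 0.6, 0.6]]
benign = [[0.5, 0.5, 0.5], [0.2, 0.2, 0.2]]

results = {
    name: auroc([agg(s) for s in colluding], [agg(s) for s in benign])
    for name, agg in AGGREGATORS.items()
}
```

In this toy data, a single strongly flagged agent is caught by `max` but diluted by `mean`. This gives a small-scale picture of why the choice of aggregation can matter for different collusion patterns.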

中文标题/摘要

标题：通过多智能体可解释性检测多智能体合谋

随着LLM智能体在多智能体系统中的部署增加，它们引入了规避标准人类监督的隐蔽协调风险。虽然针对模型激活的线性探针在单智能体设置中检测欺骗方面显示出前景，但合谋本质上是多智能体现象，利用内部表示检测智能体之间的合谋尚未被探索。我们引入了NARCBench，这是一个在环境分布转移下评估合谋检测的基准，并提出五种探针技术，将每个智能体的欺骗评分聚合以在群体层面分类场景。我们的探针在分布内实现了1.00 AUROC，并在结构不同的多智能体场景和隐写术黑杰克牌点数统计任务上零样本转移实现了0.60-0.86 AUROC。我们发现没有单一探针技术在所有合谋类型中占主导地位，这表明不同形式的合谋在激活空间中表现不同。我们还发现初步证据表明，该信号在标记级别上是局部化的，合谋智能体的激活在处理其伙伴消息的编码部分时会特别激增。这项工作朝着多智能体可解释性迈出了一步：将白盒检查从单模型扩展到多智能体环境，其中检测需要在智能体之间聚合信号。这些结果表明，模型内部提供了与文本级监控互补的信号，特别是对于可以访问模型激活信息的组织来说，对于检测多智能体合谋特别有效。代码和数据可在https://github.com/aaronrose227/narcbench/获得。

Summary / 总结

This study addresses the risk of covert coordination among LLM agents in multi-agent systems, which may evade standard forms of human oversight. It introduces NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and proposes five probing techniques that aggregate per-agent deception scores for group-level classification. The probes achieve 1.00 AUROC in-distribution and 0.60 to 0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. No single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. The study concludes that model internals can provide a complementary signal to text-level monitoring for detecting multi-agent collusion, especially for organizations with access to model activations.

该研究关注LLM代理在多代理系统中可能存在的隐蔽协调风险，这种风险可能逃避人类监督。研究引入了NARCBench基准，用于评估在分布变化下的合谋检测，并提出了五种探针技术来汇总每个代理的欺骗得分以进行组级分类。这些探针在分布内实现了高AUROC分数，并在不同场景下实现了合理的AUROC分数，表明不同形式的合谋在激活空间中表现不同。研究结果表明，模型内部可以为检测多代理合谋提供与文本级监控互补的信号，尤其是对于可以访问模型激活的组织而言。

Deep Reinforcement Learning for Robotic Manipulation under Distribution Shift with Bounded Extremum Seeking

Authors: Shaifalee Saxena, Rafael Fierro, Alexander Scheinker

First: 2026-04-01T16:59:01+00:00 · Latest: 2026-04-01T16:59:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning has shown strong performance in robotic manipulation, but learned policies often degrade in performance when test conditions differ from the training distribution. This limitation is especially important in contact-rich tasks such as pushing and pick-and-place, where changes in goals, contact conditions, or robot dynamics can drive the system out-of-distribution at inference time. In this paper, we investigate a hybrid controller that combines reinforcement learning with bounded extremum seeking to improve robustness under such conditions. In the proposed approach, deep deterministic policy gradient (DDPG) policies are trained under standard conditions on the robotic pushing and pick-and-place tasks, and are then combined with bounded ES during deployment. The RL policy provides fast manipulation behavior, while bounded ES ensures robustness of the overall controller to time variations when operating conditions depart from those seen during training. The resulting controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.
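As background on the second ingredient, the sketch below simulates one commonly cited form of bounded extremum seeking, dθ/dt = sqrt(αω) cos(ωt + k J(θ)), on a scalar quadratic cost. Its average behavior is gradient descent on J, while the update rate can never exceed sqrt(αω). How the paper couples this with the DDPG policy is not described in the abstract, so the cost function, the gains, and the Euler integration here are assumptions for illustration only.

```python
import math

def bounded_es(cost, theta0, alpha=0.25, k=2.0, omega=100.0, dt=1e-4, duration=20.0):
    """Euler simulation of d(theta)/dt = sqrt(alpha*omega) * cos(omega*t + k*cost(theta))."""
    theta = theta0
    max_rate = 0.0
    for i in range(int(duration / dt)):
        t = i * dt
        # Only the measured cost value enters the update; no gradient is needed.
        rate = math.sqrt(alpha * omega) * math.cos(omega * t + k * cost(theta))
        max_rate = max(max_rate, abs(rate))
        theta += dt * rate
    return theta, max_rate

def cost(theta):
    """Hypothetical tracking cost with its minimum at theta = 1."""
    return (theta - 1.0) ** 2

theta_final, max_rate = bounded_es(cost, theta0=2.0)
```

The parameter ends up dithering in a small neighborhood of the minimizer. Because the cosine is bounded, the adaptation speed stays bounded even if the measured cost becomes very large or noisy.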

中文标题/摘要

标题：基于有界极值搜索的分布偏移下机器人操作的深度强化学习

强化学习在机器人操作中表现出强大的性能，但当测试条件与训练分布不同时，学习到的策略往往会降低性能。这一限制在接触丰富的任务（如推拉和取放）中尤为重要，因为目标、接触条件或机器人动力学的变化可能会在推理时将系统推离分布。在本文中，我们研究了一种结合强化学习和有界极值搜索的混合控制器，以提高在这些条件下的鲁棒性。在所提出的方法中，深度确定性策略梯度（DDPG）策略在标准条件下针对机器人推拉和取放任务进行训练，然后在部署时与有界ES结合使用。RL策略提供快速的操作行为，而有界ES确保在操作条件与训练期间看到的情况不同时，整体控制器的鲁棒性。该控制器在多种离分布设置下进行了评估，包括时间变化的目标和空间变化的摩擦区域。

Summary / 总结

This paper addresses the degradation of learned policies in robotic manipulation under distribution shift by investigating a hybrid controller that combines deep reinforcement learning with bounded extremum seeking (ES). DDPG policies are trained under standard conditions on pushing and pick-and-place tasks and are then combined with bounded ES during deployment: the RL policy provides fast manipulation behavior, while bounded ES is intended to keep the overall controller robust to time variations when operating conditions depart from those seen in training. The controller is evaluated under several out-of-distribution settings, including time-varying goals and spatially varying friction patches.

本文针对机器人操作任务中测试条件偏离训练分布时策略性能下降的问题，提出了一种结合深度确定性策略梯度（DDPG）和有界极值搜索（ES）的混合控制器以提高鲁棒性。DDPG策略在标准条件下进行训练，部署时与有界ES结合，以确保操作条件发生变化时整体控制器的鲁棒性。该控制器在时间变化的目标和空间变化的摩擦区域等分布外设置下进行了评估。

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Redirection

Authors: Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang

First: 2026-01-01T09:11:09+00:00 · Latest: 2026-04-01T16:54:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations, and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.

中文标题/摘要

标题：ActErase：一种基于激活重定向的无训练精确概念擦除范式

文本到图像扩散模型的最新进展展示了卓越的生成能力，但同时也引发了关于安全、版权和伦理问题的重大关切。现有的概念擦除方法通过从预训练模型中移除敏感概念来应对这些风险，但大多数方法依赖于数据密集型和计算成本高昂的微调，这构成了一个关键限制。为克服这些挑战，受观察到模型的激活主要由通用概念组成，仅有一小部分可以表示目标概念的启发，我们提出了一种新颖的无训练方法（ActErase）以实现高效的概念擦除。具体而言，该方法通过提示对分析识别激活差异区域，在前向传递过程中提取目标激活并动态替换输入激活。在三个关键擦除任务（裸体、艺术风格和物体移除）上的全面评估表明，我们的无训练方法实现了最先进的擦除性能，同时有效保留了模型的整体生成能力。我们的方法还表现出强大的对抗攻击鲁棒性，确立了一种轻量级且有效的扩散模型概念操控新范式。

Summary / 总结

The paper introduces ActErase, a training-free method for precise concept erasure in text-to-image diffusion models. It uses prompt-pair analysis to identify activation difference regions, extracts the target activations, and dynamically replaces input activations during forward passes, avoiding the need for fine-tuning. Across three erasure tasks (nudity, artistic style, and object removal), ActErase achieves state-of-the-art erasure performance while preserving the model's overall generative capability and remaining robust against adversarial attacks.

研究通过提出ActErase，一种无需训练的方法，来精确移除文本生成图像中的敏感概念，以解决安全和伦理问题。该方法通过提示对分析识别激活差异区域，并在前向传递过程中动态替换输入激活。该方法在三个关键移除任务中表现出色，同时保持模型的整体生成能力，并且对对抗攻击具有很强的鲁棒性。

ReinDriveGen: Reinforcement Post-Training for Out-of-Distribution Driving Scene Generation

Authors: Hao Zhang, Lue Fan, Weikang Bian, Zehuan Wu, Lewei Lu, Zhaoxiang Zhang, Hongsheng Li

First: 2026-04-01T16:48:20+00:00 · Latest: 2026-04-01T16:48:20+00:00

Comments: Project page: https://drive-sim.github.io/ReinDriveGen/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present ReinDriveGen, a framework that enables full controllability over dynamic driving scenes, allowing users to freely edit actor trajectories to simulate safety-critical corner cases such as front-vehicle collisions, drifting cars, vehicles spinning out of control, pedestrians jaywalking, and cyclists cutting across lanes. Our approach constructs a dynamic 3D point cloud scene from multi-frame LiDAR data, introduces a vehicle completion module to reconstruct full 360° geometry from partial observations, and renders the edited scene into 2D condition images that guide a video diffusion model to synthesize realistic driving videos. Since such edited scenarios inevitably fall outside the training distribution, we further propose an RL-based post-training strategy with a pairwise preference model and a pairwise reward mechanism, enabling robust quality improvement under out-of-distribution conditions without ground-truth supervision. Extensive experiments demonstrate that ReinDriveGen outperforms existing approaches on edited driving scenarios and achieves state-of-the-art results on novel ego viewpoint synthesis.

中文标题/摘要

标题：ReinDriveGen：基于强化学习的离分布驾驶场景生成后训练

我们提出了ReinDriveGen，一种框架，使用户能够完全控制动态驾驶场景，自由编辑演员轨迹以模拟前方车辆碰撞、漂移车辆、失控车辆、行人乱穿马路和骑自行车者横穿车道等关键安全案例。我们的方法从多帧LiDAR数据构建动态3D点云场景，引入车辆完成模块从部分观察中重建360°几何结构，并将编辑后的场景渲染为2D条件图像，指导视频扩散模型生成逼真的驾驶视频。由于这些编辑场景不可避免地落在训练分布之外，我们进一步提出了一种基于RL的后训练策略，包含成对偏好模型和成对奖励机制，在无真实监督的情况下，在离分布条件下实现稳健的质量提升。大量实验表明，ReinDriveGen在编辑驾驶场景上优于现有方法，并在新型第一人称视角合成上达到最先进的效果。

Summary / 总结

ReinDriveGen is a framework designed to generate fully controllable dynamic driving scenes, enabling users to edit actor trajectories for simulating safety-critical scenarios. It constructs 3D scenes from multi-frame LiDAR data, uses a vehicle completion module to reconstruct full geometry, and renders edited scenes into 2D images to guide video diffusion models for realistic video synthesis. The approach also includes an RL-based post-training strategy to improve quality under out-of-distribution conditions without ground-truth supervision, outperforming existing methods on edited driving scenarios and novel ego viewpoint synthesis.

ReinDriveGen 是一个框架，旨在生成完全可控的动态驾驶场景，允许用户模拟各种安全关键情况。它从 LiDAR 数据构建 3D 场景，完成车辆几何，并渲染编辑场景以进行视频合成。为了处理超出训练分布的场景，ReinDriveGen 使用基于强化学习的后训练策略，结合成对偏好模型和奖励机制，无需地面真实监督即可提高质量。实验表明，ReinDriveGen 在编辑的驾驶场景上优于现有方法，并在新型第一人称视角合成上达到最先进的结果。

Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers

Authors: Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki, Kiyoharu Aizawa

First: 2026-04-01T16:48:04+00:00 · Latest: 2026-04-01T16:48:04+00:00

Comments: Project Page: https://agent4science-utokyo.github.io/PaperRecon_HP/

Abs · PDF · Code1 · Code2 · Project1

Abstract

This paper introduces the first systematic evaluation framework for quantifying the quality and risks of papers written by modern coding agents. While AI-driven paper writing has become a growing concern, rigorous evaluation of the quality and potential risks of AI-written papers remains limited, and a unified understanding of their reliability is still lacking. We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal additional resources, and the result is subsequently compared against the original paper. PaperRecon disentangles the evaluation of the AI-written papers into two orthogonal dimensions, Presentation and Hallucination, where Presentation is evaluated using a rubric and Hallucination is assessed via agentic evaluation grounded in the original paper source. For evaluation, we introduce PaperWrite-Bench, a benchmark of 51 papers from top-tier venues across diverse domains published after 2025. Our experiments reveal a clear trade-off: while both ClaudeCode and Codex improve with model advances, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. This work takes a first step toward establishing evaluation frameworks for AI-driven paper writing and improving the understanding of its risks within the research community.

中文标题/摘要

标题：论文重建评估：评估AI撰写的论文的质量和幻觉

本文介绍了首个系统性的评估框架，用于量化现代编码代理撰写的论文的质量和风险。尽管AI驱动的论文写作已成为一个日益增长的担忧，但对AI撰写的论文的质量和潜在风险的严格评估仍然有限，对其可靠性的统一理解也仍然缺乏。我们引入了论文重建评估（PaperRecon），这是一种评估框架，在该框架中，从现有论文中创建概览（overview.md），然后基于概览和少量额外资源，让代理生成一篇完整的论文，之后将结果与原始论文进行比较。PaperRecon 将对AI撰写的论文的评估拆分为两个正交维度：呈现和幻觉，其中呈现使用评分表进行评估，幻觉则通过基于原始论文源的代理评估进行评估。为了评估，我们引入了PaperWrite-Bench，这是一个包含51篇来自2025年后顶级会议的跨学科论文基准。我们的实验揭示了一个明显的权衡：虽然ClaudeCode和Codex都随着模型的进步而改进，但ClaudeCode在呈现质量上更高，平均每篇论文超过10个幻觉，而Codex则产生更少的幻觉但呈现质量较低。这项工作迈出了建立AI驱动论文写作评估框架的第一步，并在研究社区中提高了对其风险的理解。

Summary / 总结

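This paper presents Paper Reconstruction Evaluation (PaperRecon), a framework for evaluating the quality and risks of AI-written papers: an agent regenerates a full paper from an overview (overview.md) of an existing paper plus minimal additional resources, and the result is compared against the original. The evaluation is split into two orthogonal dimensions, Presentation (scored with a rubric) and Hallucination (assessed by agentic evaluation grounded in the original paper source). On PaperWrite-Bench, a benchmark of 51 papers, ClaudeCode achieves higher presentation quality at the cost of more than 10 hallucinations per paper on average, whereas Codex produces fewer hallucinations but lower presentation quality. The work is a first step toward evaluation frameworks for AI-driven paper writing.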

本文提出了Paper Reconstruction Evaluation (PaperRecon) 框架，通过将AI生成的论文与原始论文进行比较来评估AI论文的质量和风险。评估集中在两个维度上：Presentation和Hallucination。实验表明，虽然ClaudeCode在Presentation方面表现优异，但它包含的幻觉更多，而Codex则包含更少的幻觉但Presentation质量较低。这项工作推进了对AI驱动论文写作风险的理解。

But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

Authors: Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen

First: 2025-05-23T11:34:02+00:00 · Latest: 2026-04-01T16:42:00+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM-as-a-judge is widely used as a scalable substitute for human evaluation, yet current approaches rely on black-box access and struggle to detect subtle dishonesty, such as sycophancy and manipulation. We introduce Judge Using Safety-Steered Alternatives (JUSSA), a framework that leverages a model's internal representations to optimize an honesty-promoting steering vector from a single training example, generating contrastive alternatives that give judges a reference point for detecting dishonesty. We test JUSSA on a novel manipulation benchmark with human-validated response pairs at varying dishonesty levels, finding AUROC improvements across both GPT-4.1 (0.893 $\to$ 0.946) and Claude Haiku (0.859 $\to$ 0.929) judges, though performance degrades when task complexity is mismatched to judge capability, suggesting contrastive evaluation helps most when the task is challenging but within the judge's reach. Layer-wise analysis further shows that steering is most effective in middle layers, where model representations begin to diverge between honest and dishonest prompt processing. Our work demonstrates that steering vectors can serve as tools for evaluation rather than for improving model outputs at inference, opening a new direction for thorough white-box auditing.

中文标题/摘要

标题：但你的诚实行事的真实回答是什么？使用导向向量帮助LLM法官提供诚实行事的替代方案

LLM作为法官广泛用于大规模替代人类评估，但当前方法依赖于黑盒访问，难以检测微妙的不诚实行为，如阿谀奉承和操控。我们引入了使用安全导向替代方案的法官（JUSSA）框架，该框架利用模型的内部表示，从单个训练示例中优化一个促进诚实性的导向向量，生成对比替代方案，为法官提供检测不诚实行为的参考点。我们使用一个包含不同不诚实程度的人类验证响应对的新操控基准测试JUSSA，发现GPT-4.1（0.893 $\to$ 0.946）和Claude Haiku（0.859 $\to$ 0.929）法官的AUROC改进，尽管当任务复杂度与法官能力不匹配时，性能会下降，表明对比评估在任务具有挑战性但仍在法官能力范围内的时候最有效。逐层分析进一步表明，导向最有效的层是中间层，此时模型表示开始在诚实和不诚实提示处理之间出现分歧。我们的工作表明，导向向量可以作为评估工具，而不是在推理时改进模型输出的工具，为彻底的白盒审计开辟了新方向。

Summary / 总结

The paper addresses the problem of detecting subtle dishonesty, such as sycophancy and manipulation, when large language models (LLMs) are used as judges. It introduces JUSSA, a framework that uses a model's internal representations to optimize an honesty-promoting steering vector from a single training example and generate contrastive alternatives, giving judges a reference point for detecting dishonesty. Experiments on a manipulation benchmark show AUROC improvements for both GPT-4.1 and Claude Haiku judges, with performance degrading when task complexity is mismatched to judge capability, indicating that contrastive evaluation is most effective when the task is challenging but within the judge's reach.

论文通过引入JUSSA框架，利用模型的内部表示生成对比替代方案，以帮助评委识别不诚实的回答。实验表明，该方法在GPT-4.1和Claude Haiku评委上的AUROC指标有所提升，但当任务复杂度超出评委能力时，性能会下降。分析显示，模型中间层的表示在诚实和不诚实的提示处理之间开始出现差异时，引导最有效。
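
The abstract above reports judge quality as AUROC. As a minimal sketch of what that metric measures (not the authors' evaluation code), AUROC can be computed as the probability that a randomly chosen dishonest response receives a higher judge score than a randomly chosen honest one. The function name `auroc` and all scores below are hypothetical, made-up values for illustration only.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUROC.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical judge scores (higher = judged more dishonest); not from the paper.
dishonest = [0.9, 0.8, 0.4]
honest = [0.3, 0.5, 0.1]
result = auroc(dishonest, honest)
print(f"AUROC = {result:.3f}")
```

An AUROC of 0.5 corresponds to a judge that cannot separate the two groups, and 1.0 to perfect separation, which is the scale on which the reported 0.893 to 0.946 improvement should be read.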

Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Authors: Reyhaneh Ahani Manghotay, Jie Liang

First: 2026-04-01T16:41:04+00:00 · Latest: 2026-04-01T16:41:04+00:00

Comments: 14 pages, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.

中文标题/摘要

标题：轻量级提示引导CLIP适应性方法在单目深度估计中的应用

利用视觉语言模型（VLMs）如CLIP的丰富语义特征进行单目深度估计任务是一个有前景的方向，但通常需要大量的微调或缺乏几何精度。我们提出了一种参数高效的框架MoA-DepthCLIP，该框架通过最少的监督将预训练的CLIP表示适应单目深度估计。该方法将轻量级混合适配器（MoA）模块集成到结合了最终层选择性微调的预训练视觉变换器（ViT-B/32）主干中。这种设计能够实现空间感知的适应，由全局语义上下文向量和深度区间分类与直接回归相结合的混合预测架构引导。为了提高结构准确性，我们采用了一种复合损失函数来施加几何约束。在NYU Depth V2基准测试中，MoA-DepthCLIP取得了具有竞争力的结果，显著优于DepthCLIP基线，将δ1精度从0.390提高到0.745，将RMSE从1.176降低到0.520。这些结果仅需少量可训练参数，表明轻量级、提示引导的MoA是将VLM知识转移到精细单目深度估计任务的有效策略。

Summary / 总结

The research aims to leverage the semantic features of pretrained CLIP models for monocular depth estimation with minimal fine-tuning. The MoA-DepthCLIP framework integrates a lightweight Mixture-of-Adapters module into the pretrained ViT-B/32 backbone, enabling spatially-aware adaptation guided by a semantic context vector and a hybrid prediction architecture. The method achieves competitive results on the NYU Depth V2 benchmark, significantly improving $δ_1$ accuracy and reducing RMSE compared to the DepthCLIP baseline, while using fewer trainable parameters.

该研究提出了一种参数高效的框架MoA-DepthCLIP，通过最小监督将预训练的CLIP表示适应于单目深度估计任务。该方法将轻量级的Mixture-of-Adapters模块集成到Vision Transformer主干中，并对最终层进行选择性微调。该方法使用全局语义上下文向量和混合预测架构来增强空间感知和结构准确性。在NYU Depth V2基准测试上，MoA-DepthCLIP将$δ_1$精度显著提高到0.745，将RMSE降低到0.520，超越了DepthCLIP基线。
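
The two metrics quoted for NYU Depth V2 can be sketched as follows. This uses the standard definitions from the depth-estimation literature: $δ_1$ is the fraction of pixels whose ratio max(pred/gt, gt/pred) is below 1.25, and RMSE is the root mean squared error. The function names and the toy depth values are assumptions for illustration and do not come from the paper.

```python
import math


def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold.

    Assumes all depths are strictly positive.
    """
    ok = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return ok / len(gt)


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Toy depths in metres (made-up values, flattened to a list of pixels).
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.1, 2.0, 4.0, 4.0]
print(delta1_accuracy(pred, gt))  # 3 of 4 pixels fall within the 1.25 ratio
print(rmse(pred, gt))
```

Higher $δ_1$ and lower RMSE are better, so the reported move from 0.390 to 0.745 and from 1.176 to 0.520 are both improvements.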

Adversarial Moral Stress Testing of Large Language Models

Authors: Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, Kawser Wazed Nafi

First: 2026-04-01T16:34:20+00:00 · Latest: 2026-04-01T16:34:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk, and temporal behavioral drift across interaction rounds. We evaluate AMST on several state-of-the-art LLMs, including LLaMA-3-8B, GPT-4o, and DeepSeek-v3, using a large set of adversarial scenarios generated under controlled stress conditions. The results demonstrate substantial differences in robustness profiles across models and expose degradation patterns that are not observable under conventional single-round evaluation protocols. In particular, robustness has been shown to depend on distributional stability and tail behavior rather than on average performance alone. Additionally, AMST provides a scalable and model-agnostic stress-testing methodology that enables robustness-aware evaluation and monitoring of LLM-enabled software systems operating in adversarial environments.

中文标题/摘要

标题：大型语言模型的对抗道德压力测试

在软件系统中部署大型语言模型（LLMs）的道德稳健性评估仍然具有挑战性，尤其是在持续的对抗性用户交互下。现有的安全性基准通常依赖于单轮评估和综合指标，如毒性评分和拒绝率，这些指标对在现实多轮交互中可能出现的行为不稳定性提供有限的可见性。因此，在部署前，罕见但影响重大的道德失败和逐步退化效应可能会被遗漏。本文介绍了对抗道德压力测试（AMST），这是一种基于压力的评估框架，用于评估在对抗性多轮交互下的道德稳健性。AMST 对提示应用结构化压力变换，并通过分布感知的稳健性指标评估模型行为，这些指标捕捉了交互轮次中的变化性、尾部风险和时间行为漂移。我们使用在受控压力条件下生成的一组大量对抗性场景，对包括 LLaMA-3-8B、GPT-4o 和 DeepSeek-v3 在内的几种最先进的 LLM 进行了 AMST 评估。结果表明，不同模型的稳健性特征存在显著差异，并揭示了在常规单轮评估协议下不可见的退化模式。特别是，稳健性被证明不仅依赖于平均性能，还依赖于分布稳定性与尾部行为。此外，AMST 提供了一种可扩展且模型无关的压力测试方法，使人们能够对在对抗环境中运行的 LLM 启动软件系统进行稳健性意识的评估和监控。

Summary / 总结

This paper addresses the challenge of evaluating the ethical robustness of large language models (LLMs) under sustained adversarial interactions. It introduces Adversarial Moral Stress Testing (AMST), a framework that applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics. The study finds that AMST reveals significant differences in robustness profiles across models and exposes degradation patterns not detectable by conventional single-round evaluations. The results highlight the importance of distributional stability and tail behavior in model robustness.

本文提出了对抗道德压力测试（AMST），一种评估大型语言模型（LLMs）在对抗性多轮交互中的伦理鲁棒性的框架。AMST 使用结构化的压力变换和分布感知的鲁棒性指标来捕捉方差、尾部风险和时间行为漂移。对LLM，如LLaMA-3-8B、GPT-4o和DeepSeek-v3的评估揭示了鲁棒性分布的巨大差异，并暴露了常规单轮评估协议无法检测的退化模式。研究强调了分布稳定性与尾部行为在模型鲁棒性中的重要性。
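
The abstract names three quantities that distribution-aware metrics capture: variance, tail risk, and temporal drift across rounds. The sketch below shows one plausible way to compute each from per-response scores. The specific definitions (population variance, mean of the worst fraction of scores, least-squares slope over rounds) are assumptions chosen for illustration, not AMST's actual metrics, and all numbers are made up.

```python
import statistics


def tail_risk(scores, fraction=0.1):
    """Mean of the worst `fraction` of scores (assumes lower score = worse behavior)."""
    k = max(1, int(len(scores) * fraction))
    return statistics.mean(sorted(scores)[:k])


def temporal_drift(round_means):
    """Least-squares slope of the mean score against the round index."""
    n = len(round_means)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(round_means)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(round_means))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Hypothetical ethical-robustness scores in [0, 1] for ten responses.
scores = [0.9, 0.8, 0.85, 0.2, 0.95, 0.9, 0.88, 0.1, 0.92, 0.87]
print("mean     :", statistics.mean(scores))
print("variance :", statistics.pvariance(scores))
print("tail risk:", tail_risk(scores, fraction=0.2))

# Hypothetical mean score per interaction round; a negative slope means degradation.
print("drift    :", temporal_drift([0.9, 0.8, 0.7, 0.6]))
```

In this toy data the mean looks acceptable while the tail-risk value is far lower, which illustrates the abstract's point that average performance alone can hide rare but severe failures.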

Genesis: Evolving Attack Strategies for LLM Web Agent Red-Teaming

Authors: Zheng Zhang, Jiarui He, Yuchen Cai, Deheng Ye, Peilin Zhao, Ruili Feng, Hao Wang

Venue: ICME 2026

First: 2025-10-21T05:49:37+00:00 · Latest: 2026-04-01T16:25:41+00:00

Comments: Accepted by ICME 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

As large language model (LLM) agents increasingly automate complex web tasks, they boost productivity while simultaneously introducing new security risks. However, relevant studies on web agent attacks remain limited. Existing red-teaming approaches mainly rely on manually crafted attack strategies or static models trained offline. Such methods fail to capture the underlying behavioral patterns of web agents, making it difficult to generalize across diverse environments. In web agent attacks, success requires the continuous discovery and evolution of attack strategies. To this end, we propose Genesis, a novel agentic framework composed of three modules: Attacker, Scorer, and Strategist. The Attacker generates adversarial injections by integrating the genetic algorithm with a hybrid strategy representation. The Scorer evaluates the target web agent's responses to provide feedback. The Strategist dynamically uncovers effective strategies from interaction logs and compiles them into a continuously growing strategy library, which is then re-deployed to enhance the Attacker's effectiveness. Extensive experiments across various web tasks show that our framework discovers novel strategies and consistently outperforms existing attack baselines. Our code is available at https://github.com/CjangCjengh/web_agent_attack.

中文标题/摘要

标题：创世：LLM网络代理红队攻击策略的演化

随着大型语言模型（LLM）代理自动化复杂网络任务，它们提高了生产力，同时也引入了新的安全风险。然而，关于网络代理攻击的相关研究仍然有限。现有的红队方法主要依赖于手工构建的攻击策略或离线训练的静态模型。这些方法无法捕捉网络代理的底层行为模式，使得难以在不同环境中进行泛化。在网络代理攻击中，成功需要不断发现和演化攻击策略。为此，我们提出了一种名为创世的新型代理框架，由三个模块组成：攻击者、评分器和策略师。攻击者通过结合遗传算法和混合策略表示生成对抗性注入。评分器评估目标网络代理的响应以提供反馈。策略师从交互日志中动态发现有效的策略，并将它们编译成不断增长的策略库，然后重新部署以增强攻击者的有效性。在各种网络任务的广泛实验中，我们的框架发现了新的策略，并且始终优于现有的攻击基准。我们的代码可在https://github.com/CjangCjengh/web_agent_attack获取。

Summary / 总结

The research aims to address the security risks posed by large language model (LLM) web agents by evolving attack strategies. Genesis, a novel framework, consists of Attacker, Scorer, and Strategist modules. The Attacker uses a genetic algorithm to generate adversarial injections, the Scorer evaluates the target web agent's responses, and the Strategist compiles effective strategies into a library. Experiments across various web tasks demonstrate that Genesis discovers novel strategies and outperforms existing attack methods.

研究旨在通过演化攻击策略来应对大型语言模型（LLM）网络代理带来的安全风险。Genesis 是一个新颖的代理框架，包含三个模块：攻击者、评分器和策略师。攻击者使用遗传算法生成对抗性注入，评分器评估目标网络代理的响应，策略师从交互日志中动态发现有效策略并不断更新策略库。实验结果表明，Genesis 发现了新的策略并优于现有的攻击基线。

DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models

Authors: Guanzhi Deng, Bo Li, Ronghao Chen, Xiujin Liu, Zhuo Han, Huacan Wang, Lijie Wen, Linqi Song

First: 2026-01-08T10:58:51+00:00 · Latest: 2026-04-01T16:21:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning methods, such as LoRA, are widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches typically assign identical LoRA ranks to all expert modules, ignoring the heterogeneous specialization of pretrained experts. This uniform allocation leads to a resource mismatch: task-relevant experts are under-provisioned, while less relevant ones receive redundant parameters. To address this, we propose DR-LoRA, a Dynamic Rank LoRA framework for fine-tuning pretrained MoE models. Specifically, DR-LoRA initializes all expert LoRA modules with a small active rank and uses an expert saliency score, which combines routing frequency and gradient-based rank importance, to identify which experts would benefit most from additional capacity. It then periodically expands the active ranks of the task-critical expert LoRA, progressively constructing a heterogeneous rank distribution tailored to the target task. Experiments on three MoE models across six tasks show that DR-LoRA consistently outperforms LoRA and other strong baselines, demonstrating that task-adaptive heterogeneous rank allocation is an effective strategy to improve active capacity utilization in MoE fine-tuning.

中文标题/摘要

标题：DR-LoRA：动态秩LoRA预训练混合专家模型的细调方法

混合专家（MoE）已成为扩展大型语言模型（LLMs）的主要范式。参数高效细调方法，如LoRA，广泛用于适应预训练的MoE LLMs到下游任务。然而，现有方法通常为所有专家模块分配相同的LoRA秩，忽略了预训练专家的异质专业化。这种统一分配导致了资源不匹配：与任务相关的专家资源不足，而不太相关的专家则接收冗余参数。为了解决这个问题，我们提出了DR-LoRA，一种动态秩LoRA框架，用于预训练MoE模型的细调。具体而言，DR-LoRA 以小活跃秩初始化所有专家LoRA模块，并使用结合路由频率和梯度基秩重要性的专家显著性得分来识别哪些专家最能从额外容量中受益。然后，DR-LoRA 会定期扩展关键任务专家LoRA的活跃秩，逐步构建针对目标任务的异质秩分布。在三个MoE模型上的六项任务实验表明，DR-LoRA 一致优于LoRA和其他强基线，证明了任务适应性异质秩分配是提高MoE细调中活跃容量利用率的有效策略。

Summary / 总结

The research aims to improve the efficiency of fine-tuning Mixture-of-Experts (MoE) models by addressing the resource mismatch issue in existing LoRA methods. DR-LoRA, a Dynamic Rank LoRA framework, initializes all expert LoRA modules with a small active rank and uses an expert saliency score to identify task-critical experts. It then periodically expands the active ranks of these experts, leading to a heterogeneous rank distribution that better matches the target task. Experiments show that DR-LoRA outperforms LoRA and other strong baselines across three MoE models and six tasks, indicating that task-adaptive heterogeneous rank allocation enhances active capacity utilization in MoE fine-tuning.

研究旨在通过解决LoRA均匀分配问题来提高Mixture-of-Experts (MoE)模型的微调效率。DR-LoRA动态Rank LoRA框架以小活跃秩初始化所有专家LoRA模块，并使用专家显著性得分来识别关键任务的专家。然后，它会周期性地扩展这些关键专家的活跃秩，形成一个针对目标任务的异构秩分布。实验表明，DR-LoRA在多个MoE模型和任务上均优于LoRA和其他强基线，表明任务自适应的异构秩分配可以提高MoE微调中的活跃容量利用率。
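
The rank-growth step described above can be sketched in a few lines. The abstract says the expert saliency score combines routing frequency with gradient-based rank importance but does not give the formula, so the product used here is an assumption, as are the names `saliency` and `expand_ranks`, the step size, the rank cap, and every number in the example.

```python
def saliency(routing_freq, grad_importance):
    """Per-expert saliency. Assumption: the two signals are combined by multiplication."""
    return [f * g for f, g in zip(routing_freq, grad_importance)]


def expand_ranks(active_ranks, scores, top_k=1, step=2, max_rank=16):
    """Grow the active LoRA rank of the top_k most salient experts by `step`."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    new_ranks = list(active_ranks)
    for i in top:
        new_ranks[i] = min(max_rank, new_ranks[i] + step)
    return new_ranks


# Four experts, all starting from the same small active rank.
ranks = [2, 2, 2, 2]
freq = [0.5, 0.1, 0.3, 0.1]   # hypothetical share of tokens routed to each expert
grad = [0.2, 0.9, 0.8, 0.1]   # hypothetical gradient-based importance
scores = saliency(freq, grad)
ranks = expand_ranks(ranks, scores, top_k=2)
print(ranks)  # [4, 2, 4, 2]: experts 2 and 0 receive extra capacity
```

Calling `expand_ranks` periodically during training would gradually produce the heterogeneous rank distribution the abstract describes, with frequently routed and high-gradient experts ending up with larger ranks.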

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

Authors: Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik

First: 2026-04-01T16:12:31+00:00 · Latest: 2026-04-01T16:12:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. In PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. In LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.

中文标题/摘要

标题：TRACE: 无需训练的部分音频深度伪造检测通过语音基础模型嵌入轨迹分析

部分音频深度伪造，其中合成片段被插入到真实的录音中，特别具有欺骗性，因为大部分音频仍然是真实的。现有的检测器是监督式的：它们需要帧级注释，对特定的合成管道过拟合，并且必须随着新生成模型的出现重新训练。我们认为这种监督是不必要的。我们假设语音基础模型隐含地编码了一个法医信号：真实的语音形成了平滑、缓慢变化的嵌入轨迹，而插入边界引入了帧级过渡中的突然中断。基于此，我们提出了TRACE（无需训练的基于表示的音频反制措施，通过嵌入动态），这是一种无需训练的框架，通过分析冻结的语音基础模型表示的一阶动态来检测部分音频深度伪造，无需任何训练、标注数据或架构修改。我们在两种语言的四个基准上评估了TRACE，使用了六个语音基础模型。在PartialSpoof中，TRACE实现了8.08%的EER，与微调的监督基线相当。在LlamaPartialSpoof中，这是最具挑战性的基准，包含由LLM驱动的商业合成，TRACE在没有任何目标域数据的情况下超越了监督基线（24.12% vs. 24.49% EER）。这些结果表明，语音基础模型中的时间动态提供了有效的、通用的信号，用于无需训练的音频法医分析。

Summary / 总结

The research aims to detect partial audio deepfakes, which are deceptive because they only alter parts of genuine recordings. TRACE, a training-free framework, analyzes the first-order dynamics of frozen speech foundation model representations to detect these deepfakes without requiring labeled data or retraining. On four benchmarks, TRACE achieves competitive results, surpassing supervised baselines in the most challenging LlamaPartialSpoof benchmark without needing target-domain data.

研究旨在检测部分音频深伪，因为它们只篡改了真实录音的一部分而具有欺骗性。TRACE 是一个无需训练的框架，通过分析冻结的语音基础模型表示的一阶动态来检测这些深伪，无需标注数据或重新训练。在四个基准测试中，TRACE 达到了竞争性的结果，在最具挑战性的 LlamaPartialSpoof 基准测试中超越了监督基线，无需使用目标域数据。
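
The hypothesis above, that genuine speech traces a smooth embedding trajectory while splice boundaries cause abrupt frame-to-frame jumps, can be illustrated with a toy example. This is only a sketch of the general idea: the cosine-distance measure, the median-based threshold, and the 2-D "embeddings" are all assumptions for illustration, not TRACE's actual scoring rule or real speech foundation model features.

```python
import math
import statistics


def cosine_distance(a, b):
    """1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def first_order_dynamics(frames):
    """Distance between each pair of consecutive frame embeddings."""
    return [cosine_distance(frames[t], frames[t + 1]) for t in range(len(frames) - 1)]


def boundary_candidates(frames, ratio=5.0):
    """Indices t where the jump t -> t+1 exceeds `ratio` times the median jump."""
    deltas = first_order_dynamics(frames)
    median = statistics.median(deltas)
    return [t for t, d in enumerate(deltas) if d > ratio * median]


# Toy trajectory: slow rotation in 2-D, with one abrupt jump between frames 3 and 4.
angles = [0.00, 0.05, 0.10, 0.15, 1.50, 1.55, 1.60]
frames = [(math.cos(a), math.sin(a)) for a in angles]
print(boundary_candidates(frames))  # [3]
```

No parameters are fitted anywhere in this sketch, which mirrors the training-free character of the approach: the only inputs are frozen embeddings and a threshold.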

ReMoGen: Real-time Human Interaction-to-Reaction Generation via Modular Learning from Diverse Data

Authors: Yaoqin Ye, Yiteng Xu, Qin Sun, Xinge Zhu, Yujing Sun, Yuexin Ma

Venue: CVPR 2026

First: 2026-04-01T16:12:23+00:00 · Latest: 2026-04-01T16:12:23+00:00

Comments: accepted by CVPR 2026, project page: https://4dvlab.github.io/project_page/remogen/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Human behaviors in real-world environments are inherently interactive, with an individual's motion shaped by surrounding agents and the scene. Such capabilities are essential for applications in virtual avatars, interactive animation, and human-robot collaboration. We target real-time human interaction-to-reaction generation, which generates the ego's future motion from dynamic multi-source cues, including others' actions, scene geometry, and optional high-level semantic inputs. This task is fundamentally challenging due to (i) limited and fragmented interaction data distributed across heterogeneous single-person, human-human, and human-scene domains, and (ii) the need to produce low-latency yet high-fidelity motion responses during continuous online interaction. To address these challenges, we propose ReMoGen (Reaction Motion Generation), a modular learning framework for real-time interaction-to-reaction generation. ReMoGen leverages a universal motion prior learned from large-scale single-person motion datasets and adapts it to target interaction domains through independently trained Meta-Interaction modules, enabling robust generalization under data-scarce and heterogeneous supervision. To support responsive online interaction, ReMoGen performs segment-level generation together with a lightweight Frame-wise Segment Refinement module that incorporates newly observed cues at the frame level, improving both responsiveness and temporal coherence without expensive full-sequence inference. Extensive experiments across human-human, human-scene, and mixed-modality interaction settings show that ReMoGen produces high-quality, coherent, and responsive reactions, while generalizing effectively across diverse interaction scenarios.

中文标题/摘要

标题：ReMoGen：通过模块化学习从多样化数据生成实时人类互动反应

现实世界环境中的人类行为本质上是互动的，个体的运动受到周围代理和场景的影响。这种能力对于虚拟化身、交互动画和人机协作等应用至关重要。我们旨在实现实时人类互动反应生成，该任务从动态多源线索中生成主体的未来运动，包括他人的动作、场景几何结构以及可选的高层语义输入。由于(i) 有限且碎片化的互动数据分布在异构的单人、人-人和人-场景领域之间，以及(ii) 需要在持续的在线互动中产生低延迟但高保真的运动响应，因此该任务具有根本性的挑战性。为了解决这些挑战，我们提出了ReMoGen（反应运动生成），一种用于实时互动反应生成的模块化学习框架。ReMoGen利用从大规模单人运动数据集中学习到的通用运动先验，并通过独立训练的元互动模块适应目标互动领域，从而在数据稀缺和异构监督下实现稳健的泛化。为了支持响应式的在线互动，ReMoGen在段级生成的同时，通过一个轻量级的帧级段细化模块在帧级整合新观察到的线索，从而提高响应性和时间连贯性，而无需昂贵的全序列推理。在人-人、人-场景和混合模态互动设置下的广泛实验表明，ReMoGen生成了高质量、连贯且响应式的反应，并且在多种互动场景中表现出有效的泛化。

Summary / 总结

ReMoGen is a modular learning framework designed for real-time generation of human reactions to interactions, addressing challenges of limited and fragmented data from diverse sources. It uses a universal motion prior from large-scale single-person datasets and adapts it through Meta-Interaction modules to handle various interaction domains. ReMoGen also includes a Frame-wise Segment Refinement module for improved responsiveness and temporal coherence. Experiments demonstrate high-quality, coherent, and responsive reactions across different interaction settings, showing effective generalization.

ReMoGen 是一个模块化学习框架，旨在实现实时的人类交互反应生成，解决有限且碎片化的交互数据以及低延迟、高保真运动响应的需求。它利用大规模单人动作数据集中的通用运动先验，并通过 Meta-Interaction 模块适应不同的交互领域。ReMoGen 还包含一个帧级段细化模块，以提高响应性和时间连贯性。实验表明，ReMoGen 能在各种交互设置中生成高质量、连贯且响应迅速的反应。

ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Authors: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang

Venue: CVPR 2026

First: 2026-04-01T16:11:59+00:00 · Latest: 2026-04-01T16:11:59+00:00

Comments: Accepted to CVPR 2026. The source code is publicly available at https://github.com/7uHeng/ProOOD

Abs · PDF · Code1 · Code2 · Code3

Abstract

3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.

中文标题/摘要

标题：ProOOD：基于原型引导的分布外3D占用率预测

3D语义占用率预测是自主驾驶的核心，但当前方法容易受到长尾类别偏差和分布外(OOD)输入的影响，往往对异常情况过度自信地归类为稀有类别。我们提出了ProOOD，这是一种轻量级、即插即用的方法，结合了原型引导的细化和无需训练的OOD评分。ProOOD 包括 (i) 原型引导的语义插补，用类别一致的特征填充被遮挡的区域，(ii) 原型引导的尾部挖掘，增强稀有类别的表示以减少OOD吸收，以及 (iii) EchoOOD，它将局部logit一致性与局部和全局原型匹配融合，生成可靠的体素级OOD评分。在五个数据集上的广泛实验表明，ProOOD 在分布内3D占用率预测和OOD检测方面均达到了最先进的性能。在SemanticKITTI上，它整体mIoU提高了3.57%，尾部类mIoU提高了24.80%；在VAA-KITTI上，它提高了AuPRCr 19.34分，且在各个基准上均有所提升。这些改进在安全关键的城市驾驶中提供了更准确的占用率估计和更可靠的OOD检测。源代码可在 https://github.com/7uHeng/ProOOD 公开获取。

Summary / 总结

ProOOD addresses the limitations of current 3D semantic occupancy prediction methods by introducing a prototype-guided approach that strengthens rare-class representations and provides reliable out-of-distribution (OOD) detection. It consists of prototype-guided semantic imputation, prototype-guided tail mining, and EchoOOD for training-free voxel-level OOD scoring. Experiments on five datasets show that ProOOD outperforms existing methods: on SemanticKITTI it improves mIoU by +3.57% overall and +24.80% on tail classes, and on VAA-KITTI it improves AuPRCr by +19.34 points, leading to more reliable occupancy predictions and OOD detection in autonomous driving scenarios.

ProOOD通过引入原型引导的方法来增强稀有类别的表示，并提供可靠的异常输入（OOD）检测，解决了当前3D语义占用预测方法的局限性。它包括语义填充、尾部挖掘和EchoOOD用于OOD评分。在五个数据集上的实验表明，ProOOD在尾部类别的平均交并比（mIoU）和面积下的精确召回曲线（AuPRCr）上优于现有方法，从而在自动驾驶场景中提供了更可靠的占用预测和OOD检测。
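
Prototype matching, one ingredient of the OOD scoring described above, can be sketched as follows. Here a class prototype is taken to be the mean feature of that class, and the OOD score of a voxel is one minus its highest cosine similarity to any prototype. This is a simplified illustration under those assumptions: EchoOOD as described also fuses local logit coherence and distinguishes local from global prototypes, which this sketch omits. All names and feature values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def class_prototypes(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def ood_score(feature, prototypes):
    """Higher means less similar to every known class prototype."""
    return 1.0 - max(cosine(feature, p) for p in prototypes.values())


# Hypothetical 2-D voxel features for two known classes.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["road", "road", "car", "car"]
protos = class_prototypes(features, labels)

in_dist = ood_score([1.0, 0.05], protos)     # close to the "road" prototype
anomaly = ood_score([-1.0, -1.0], protos)    # far from both prototypes
print(in_dist, anomaly)
```

Because the score needs only stored prototypes and a similarity computation, it can be applied at inference time without additional training, consistent with the training-free OOD scoring the abstract mentions.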

LG-HCC: Local Geometry-Aware Hierarchical Context Compression for 3D Gaussian Splatting

Authors: Xuan Deng, Xiandong Meng, Hengyu Man, Qiang Zhu, Tiange Zhang, Debin Zhao, Xiaopeng Fan

First: 2026-03-30T13:39:35+00:00 · Latest: 2026-04-01T16:10:38+00:00

Comments: 10

Abs · PDF · Code1 · Code2

Abstract

Although 3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, its prohibitive storage overhead severely hinders practical deployment. Recent anchor-based 3DGS compression schemes reduce Gaussian redundancy through advanced context models. However, they overlook explicit geometric dependencies, leading to structural degradation and suboptimal rate-distortion performance. In this paper, we propose a Local Geometry-aware Hierarchical Context Compression framework for 3DGS (LG-HCC) that incorporates inter-anchor geometric correlations into anchor pruning and entropy coding for compact representation. Specifically, we introduce a Neighborhood-Aware Anchor Pruning (NAAP) strategy, which evaluates anchor importance via weighted neighborhood feature aggregation and then merges low-contribution anchors into salient neighbors, yielding a compact yet geometry-consistent anchor set. Moreover, we develop a hierarchical entropy coding scheme, in which coarse-to-fine priors are exploited through a lightweight Geometry-Guided Convolution (GG-Conv) operator to enable spatially adaptive context modeling and rate-distortion optimization. Extensive experiments show that LG-HCC effectively alleviates structural preservation issues, achieving superior geometric integrity and rendering fidelity while reducing storage by up to 30.85x compared to the Scaffold-GS baseline on the Mip-NeRF360 dataset.

中文标题/摘要

标题：LG-HCC：局部几何感知层次上下文压缩用于3D高斯点绘制

尽管3D高斯点绘制（3DGS）能够实现高保真实时渲染，但其高昂的存储开销严重阻碍了其实用部署。最近基于锚点的3DGS压缩方案通过一些先进的上下文模型减少了高斯冗余，但它们忽略了显式的几何依赖性，导致结构退化和次优的率失真性能。本文提出了一种局部几何感知层次上下文压缩框架（LG-HCC），将锚点之间的几何相关性融入到锚点剪枝和熵编码中，以实现紧凑表示。具体而言，我们引入了一种基于邻域感知锚点剪枝（NAAP）策略，通过加权邻域特征聚合评估锚点的重要性，然后将低贡献锚点合并到显著邻居中，从而获得一个紧凑且几何一致的锚点集。此外，我们进一步开发了一种层次熵编码方案，在该方案中，通过轻量级几何引导卷积（GG-Conv）操作利用粗到细的先验知识，实现空间自适应上下文建模和率失真优化。大量实验表明，LG-HCC 有效缓解了结构保存问题，实现了更好的几何完整性和渲染保真度，同时与Mip-NeRF360数据集上的Scaffold-GS基线相比，存储量最多减少了30.85倍。

Summary / 总结

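LG-HCC addresses the storage overhead of 3D Gaussian Splatting (3DGS) by incorporating inter-anchor geometric dependencies into anchor pruning and entropy coding. It introduces a Neighborhood-Aware Anchor Pruning strategy that keeps the anchor set geometry-consistent, and a hierarchical entropy coding scheme that uses a Geometry-Guided Convolution operator for spatially adaptive context modeling. Experiments show that LG-HCC improves geometric integrity and rendering fidelity while reducing storage by up to 30.85x compared to the Scaffold-GS baseline on the Mip-NeRF360 dataset.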

LG-HCC 通过将几何依赖性引入锚点剪枝和熵编码来解决 3D 高斯斑点 (3DGS) 的存储开销问题。它引入了邻域感知锚点剪枝策略以保持几何一致性，并且开发了一种分层熵编码方案，其中使用几何引导卷积操作符来实现空间自适应上下文建模。实验表明，LG-HCC 在减少高达 30.85 倍的存储量的同时，提高了几何完整性和渲染保真度。

VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Authors: Huilin Deng, Hongchen Luo, Wei Zhai, Yang Cao, Yu Kang

First: 2024-09-30T09:51:29+00:00 · Latest: 2026-04-01T15:59:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects by establishing feature mapping between textual prompts and inspection images, demonstrating excellent research value in flexible industrial manufacturing. However, existing ZSAD methods are limited by closed-world settings and struggle to handle unseen defects with predefined prompts. Recently, adapting Multimodal Large Language Models (MLLMs) for Industrial Anomaly Detection (IAD) has emerged as a viable solution. Unlike fixed-prompt methods, MLLMs follow a generative paradigm with open-ended text interpretation, enabling more adaptive anomaly analysis. However, this adaptation faces inherent challenges, as anomalies often manifest in fine-grained regions and exhibit minimal visual discrepancies from normal samples. To address these challenges, we propose a novel framework, VMAD (Visual-enhanced MLLM Anomaly Detection), that enhances the MLLM with visual-based IAD knowledge and fine-grained perception, simultaneously providing precise detection and comprehensive analysis of anomalies. Specifically, we design a Defect-Sensitive Structure Learning scheme that transfers patch-similarity cues from the visual branch to our MLLM for improved anomaly discrimination. In addition, we introduce a novel visual projector, Locality-enhanced Token Compression, which mines multi-level features in local contexts to enhance fine-grained detection. Furthermore, we introduce Real Industrial Anomaly Detection (RIAD), a comprehensive IAD dataset with detailed anomaly descriptions and analyses, offering a valuable resource for MLLM-based IAD development. Extensive experiments on zero-shot benchmarks, including the MVTec-AD, Visa, WFDD, and RIAD datasets, demonstrate our superior performance over state-of-the-art methods. The code and dataset will be available soon.

中文标题/摘要

标题：VMAD：增强视觉的多模态大型语言模型在零样本异常检测中的应用

零样本异常检测（ZSAD）通过文本提示与检查图像之间的特征映射识别和定位未见过的对象中的异常，展示了在灵活工业制造中的出色研究价值。然而，现有的ZSAD方法受限于封闭世界设置，难以使用预定义的提示识别未见过的缺陷。最近，将多模态大型语言模型（MLLMs）应用于工业异常检测（IAD）提供了一种可行的解决方案。与固定提示方法不同，MLLMs展示了一种生成范式，具有开放的文本解释，能够进行更适应性的异常分析。然而，这种适应面临固有的挑战，因为异常往往在细粒度区域表现，并且与正常样本在视觉上几乎没有差异。为了解决这些挑战，我们提出了一种名为VMAD（增强视觉的MLLM异常检测）的新型框架，该框架通过视觉基础的IAD知识和细粒度感知增强MLLM，同时提供精确的异常检测和全面的异常分析。具体而言，我们设计了一种缺陷敏感结构学习方案，将视觉分支中的块相似性线索转移到我们的MLLM中，以提高异常区分能力。此外，我们引入了一种新的视觉投影器，局部增强的标记压缩，它在局部上下文中挖掘多级特征以增强细粒度检测。此外，我们引入了全面的工业异常检测（RIAD）数据集，该数据集包含详细的异常描述和分析，为基于MLLM的IAD开发提供了宝贵的资源。在零样本基准测试中的广泛实验，包括MVTec-AD、Visa、WFDD和RIAD数据集，证明了我们方法的优越性能。代码和数据集将很快提供。

Summary / 总结

VMAD is a novel framework for zero-shot anomaly detection that enhances multimodal large language models with visual-based industrial anomaly detection knowledge and fine-grained perception. It includes a Defect-Sensitive Structure Learning scheme and a Locality-enhanced Token Compression visual projector to improve anomaly discrimination and detection. Experiments on various datasets show VMAD outperforms existing methods in zero-shot anomaly detection tasks.

VMAD 是一种新颖的零样本异常检测框架，结合了增强的多模态大语言模型和工业异常检测知识。它包括缺陷敏感结构学习方案和局部增强的标记压缩视觉投影仪，以提高异常识别和检测。VMAD 在 MVTec-AD、Visa、WFDD 和 RIAD 等基准测试中优于现有方法。

PHASOR: Anatomy- and Phase-Consistent Volumetric Diffusion for CT Virtual Contrast Enhancement

Authors: Zilong Li, Dongyang Li, Chenglong Ma, Zhan Feng, Dakai Jin, Junping Zhang, Hao Luo, Fan Wang, Hongming Shan

First: 2026-04-01T15:57:18+00:00 · Latest: 2026-04-01T15:57:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Contrast-enhanced computed tomography (CECT) is pivotal for highlighting tissue perfusion and vascularity, yet its clinical ubiquity is impeded by the invasive nature of contrast agents and radiation risks. While virtual contrast enhancement (VCE) offers an alternative by synthesizing CECT from non-contrast CT (NCCT), existing methods struggle with anatomical heterogeneity and spatial misalignment, leading to inconsistent enhancement patterns and incorrect details. This paper introduces PHASOR, a volumetric diffusion framework for high-fidelity CT VCE. By treating CT volumes as coherent sequences, we leverage a video diffusion model to enhance structural coherence and volumetric accuracy. To ensure anatomy- and phase-consistent synthesis, we introduce two complementary modules. First, anatomy-routed mixture-of-experts (AR-MoE) anchors distinct enhancement patterns to anatomical semantics, with organ-specific memory to capture salient details. Second, intensity-phase aware representation alignment (IP-REPA) highlights intricate contrast signals while mitigating the impact of imperfect spatial alignment. Extensive experiments across three datasets demonstrate that PHASOR significantly outperforms state-of-the-art methods in both synthesis quality and enhancement accuracy.

中文标题/摘要

标题：PHASOR：解剖和相位一致的体积扩散CT虚拟对比增强

对比增强计算机断层扫描（CECT）对于突出组织灌注和血管性状至关重要，但由于对比剂的侵入性和辐射风险，其临床应用受到限制。虽然虚拟对比增强（VCE）提供了一种从非对比CT（NCCT）合成CECT的替代方法，但现有方法在解剖异质性和空间错位方面存在困难，导致增强模式不一致和错误的细节。本文介绍了一种用于高保真CT VCE的体积扩散框架PHASOR。通过将CT体积视为一致的序列，我们利用视频扩散模型增强结构连贯性和体积准确性。为了确保解剖-相位一致的合成，我们引入了两个互补模块。首先，解剖导向的混合专家模型（AR-MoE）将不同的增强模式锚定到解剖语义上，并使用器官特定的记忆来捕捉关键细节。其次，强度-相位感知的表示对齐（IP-REPA）突出复杂的对比信号，同时减轻空间对齐不完美带来的影响。在三个数据集上的广泛实验表明，PHASOR在合成质量和增强准确性方面均显著优于现有方法。

Summary / 总结

PHASOR is a volumetric diffusion framework designed for virtual contrast enhancement (VCE) in computed tomography (CT) to improve tissue perfusion visualization without the need for contrast agents. It addresses the issues of anatomical heterogeneity and spatial misalignment by using a video diffusion model and introducing two modules: AR-MoE and IP-REPA. AR-MoE anchors enhancement patterns to anatomical semantics, while IP-REPA highlights contrast signals and mitigates spatial alignment errors. Experiments show that PHASOR outperforms existing methods in synthesis quality and enhancement accuracy across three datasets.

PHASOR 是一种用于计算机断层扫描 (CT) 虚拟对比增强的体素扩散框架，旨在无需使用对比剂的情况下提高组织灌注可视化。它通过使用视频扩散模型并引入两个模块 AR-MoE 和 IP-REPA 来解决解剖异质性和空间对齐问题。AR-MoE 将增强模式锚定到解剖语义上，而 IP-REPA 强调对比信号并减轻空间对齐误差的影响。实验表明，PHASOR 在三个数据集中的合成质量和增强准确性方面优于现有方法。

Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability

Authors: Jiahui Song, Sagar Shrestha, Xiao Fu

First: 2026-03-23T02:55:16+00:00 · Latest: 2026-04-01T15:55:46+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper addresses the fusion of a spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) pair covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving the HSI unchanged. Supervised deep learning approaches have been proposed for HSI super-resolution, but they rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolved MSI and HSI are established under reasonable generative models -- providing, to the best of our knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.

中文标题/摘要

标题：未配准光谱图像融合：解混、对抗学习及可恢复性

本文研究一对覆盖大致重叠区域、空间上未配准的高光谱图像（HSI）和多光谱图像（MSI）的融合问题。HSI具有高光谱分辨率但空间分辨率较低，而MSI则相反。目标是整合它们互补的信息，以提高HSI的空间分辨率和MSI的光谱分辨率。尽管高光谱-多光谱融合（HMF）已被广泛研究，但未配准的情况仍然具有挑战性。许多现有方法仅专注于MSI的超分辨率，而未改变HSI。监督深度学习方法曾被提出用于HSI的超分辨率，但依赖于准确的训练数据，而这些数据往往不可用。此外，理论分析主要针对已配准的情况，使得未配准HMF的理解相对不足。本文提出了一种无监督框架，同时对MSI和HSI进行超分辨率。该方法结合了用于MSI超分辨率的耦合光谱解混和用于HSI超分辨率的潜在空间对抗学习。在合理的生成模型下，建立了超分辨率MSI和HSI的可恢复性理论保证；据我们所知，这是首次对未配准HMF提供此类见解。该方法在不同条件下的半真实和真实HSI-MSI对上进行了验证。

Summary / 总结

This paper proposes an unsupervised framework for fusing spatially unregistered hyperspectral and multispectral images. The method combines coupled spectral unmixing for multispectral image super-resolution and latent-space adversarial learning for hyperspectral image super-resolution. Theoretical guarantees on the recoverability of the super-resolved images are provided, which are the first such insights for unregistered hyperspectral-multispectral fusion. Experiments on semi-real and real image pairs demonstrate the effectiveness of the proposed approach under various conditions.

本文提出了一种用于融合空间未对齐的高光谱和多光谱图像的无监督框架。该方法结合了光谱解混用于多光谱图像超分辨率，以及潜在空间对抗学习用于高光谱图像超分辨率。首次为未对齐的高光谱-多光谱融合提供了超分辨率图像可恢复性的理论保证。实验结果表明该方法在半真实和真实图像对上的有效性。

CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs (Brief version)

Authors: A. Chervov, D. Fedoriaka, E. Konstantinova, A. Naumov, I. Kiselev, A. Sheveleva, I. Koltsov, S. Lytkin, A. Smolensky, A. Soibelman, F. Levkovich-Maslyuk, R. Grimov, D. Volovich, A. Isakov, A. Kostin, M. Litvinov, N. Vilkin-Krom, A. Bidzhiev, A. Krasnyi, M. Evseev, E. Geraseva, L. Grunwald, S. Galkin, E. Koldunov, S. Diner, A. Chevychelov, E. Kudasheva, A. Sychev, A. Kravchenko, Z. Kogan, A. Natyrova, L. Shishina, L. Cheldieva, V. Zamkovoy, D. Kovalenko, O. Papulov, S. Kudashev, D. Shiltsov, R. Turtayev, O. Nikitina, D. Mamayeva, S. Nikolenko, M. Obozov, A. Titarenko, A. Dolgorukova, A. Aparnev, O. Debeaupuis, S. Alami C., H. Isambert

First: 2025-09-23T15:40:36+00:00 · Latest: 2026-04-01T15:54:42+00:00

Comments: 46 pages, 30 figures; v2: typos fixed

Abs · PDF · Code1 · Code2

Abstract

This is the third paper of the CayleyPy project applying artificial intelligence to problems in group theory. We announce the first public release of CayleyPy, an open source Python library for computations with Cayley and Schreier graphs. Compared with systems such as GAP and Sage, CayleyPy handles much larger graphs and performs several orders of magnitude faster. Using CayleyPy we obtained about 200 new conjectures on Cayley and Schreier graphs, focused on diameters and growth. For many Cayley graphs of symmetric groups Sn we observe quasi polynomial diameter formulas: a small set of quadratic or linear polynomials indexed by n mod s. We conjecture that this is a general phenomenon, giving efficient diameter computation despite the problem being NP hard. We propose a refinement of the Babai type conjecture on diameters of Sn: n^2/2 + 4n upper bounds in the undirected case, compared to previous O(n^2) bounds. We also provide explicit generator families, related to involutions in a square with whiskers pattern, conjectured to maximize the diameter; search confirms this for all n up to 15. We further conjecture an answer to a question posed by V M Glushkov in 1968 on directed Cayley graphs generated by a cyclic shift and a transposition. For nilpotent groups we conjecture an improvement of J S Ellenberg's results on upper unitriangular matrices over Z/pZ, showing linear dependence of diameter on p. Some conjectures are LLM friendly, naturally stated as sorting problems verifiable by algorithms or Python code. To benchmark path finding we created more than 10 Kaggle datasets. CayleyPy works with arbitrary permutation or matrix groups and includes over 100 predefined generators. Our growth computation code outperforms GAP and Sage up to 1000 times in speed and size.

中文标题/摘要

标题：CayleyPy Growth：高效的增长计算与数百个关于 Cayley 图的新猜想（简要版本）

这是 CayleyPy 项目中的第三篇论文，该项目利用人工智能解决群论问题。我们宣布 CayleyPy 的首次公开发布，这是一个用于 Cayley 和 Schreier 图计算的开源 Python 库。与 GAP 和 Sage 等系统相比，CayleyPy 可以处理大得多的图，并且执行速度快几个数量级。使用 CayleyPy，我们获得了大约 200 个关于 Cayley 和 Schreier 图的新猜想，集中在直径和增长方面。对于对称群 Sn 的许多 Cayley 图，我们观察到准多项式直径公式：一小组按 n mod s 索引的二次或线性多项式。我们猜测这是一个普遍现象，尽管该问题是 NP 难的，仍可由此高效地计算直径。我们提出了关于 Sn 直径的 Babai 型猜想的一个改进：在无向情况下上界为 n^2/2 + 4n，而此前的上界为 O(n^2)。我们还给出了显式生成元族，它们与“带须正方形”模式中的对合相关，并猜测这些生成元族使直径最大化；搜索确认这一点对所有 n ≤ 15 成立。我们进一步对 V M Glushkov 于 1968 年提出的一个问题给出了猜想性的答案，该问题涉及由一个循环移位和一个对换生成的有向 Cayley 图。对于幂零群，我们猜想可以改进 J S Ellenberg 关于 Z/pZ 上的上单位三角矩阵的结果，表明直径与 p 呈线性依赖关系。一些猜想对 LLM 友好，可以自然地表述为排序问题，并可通过算法或 Python 代码验证。为了对路径查找进行基准测试，我们创建了超过 10 个 Kaggle 数据集。CayleyPy 可以处理任意置换群或矩阵群，并包含超过 100 个预定义生成元。我们的增长计算代码在速度和规模上最高可超过 GAP 和 Sage 1000 倍。

Summary / 总结

This paper introduces CayleyPy, an open-source Python library for computations with Cayley and Schreier graphs, which is significantly faster than systems like GAP and Sage. Using CayleyPy, the authors generated about 200 new conjectures, particularly on diameters and growth of Cayley graphs of symmetric groups, proposing a refined Babai-type conjecture with a tighter upper bound. They also conjectured explicit generator families that maximize the diameter and provided evidence for these conjectures up to n=15. The library supports arbitrary permutation or matrix groups and includes over 100 predefined generators, with growth computation outperforming GAP and Sage by up to 1000 times.

该论文介绍了CayleyPy，一个用于Cayley和Schreier图计算的开源Python库，其速度远超GAP和Sage等系统。使用CayleyPy，作者生成了约200个新的猜想，特别是关于对称群的Cayley图的直径和增长，提出了一个改进的Babai型猜想，给出了更紧的上界。他们还提出了猜测能够最大化直径的显式生成元族，并提供了这些猜想在n=15以内的证据。该库支持任意置换群或矩阵群，并包含超过100个预定义的生成元，其增长计算性能最高可比GAP和Sage快1000倍。
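The notion of "growth" used in this entry can be illustrated with a minimal sketch in plain Python. It does not use CayleyPy and does not reflect its API; the function names `cayley_growth` and `adjacent_transpositions` are hypothetical, and the generator set (adjacent transpositions of Sn) is chosen only because its growth is well known. The sketch runs a breadth-first search from the identity and records how many group elements sit at each distance; the diameter is the index of the last non-empty layer.

```python
def adjacent_transpositions(n):
    """Hypothetical helper: generators swapping positions i and i+1 of S_n."""
    gens = []
    for i in range(n - 1):
        g = list(range(n))
        g[i], g[i + 1] = g[i + 1], g[i]
        gens.append(tuple(g))
    return gens


def cayley_growth(n, generators):
    """Return BFS layer sizes from the identity of the Cayley graph.

    Each generator is a tuple g; applying it to a state rearranges the
    state's entries so that new_state[i] = state[g[i]].
    """
    identity = tuple(range(n))
    seen = {identity}
    frontier = [identity]
    growth = []
    while frontier:
        growth.append(len(frontier))
        next_frontier = []
        for state in frontier:
            for g in generators:
                new_state = tuple(state[i] for i in g)
                if new_state not in seen:
                    seen.add(new_state)
                    next_frontier.append(new_state)
        frontier = next_frontier
    return growth


growth = cayley_growth(4, adjacent_transpositions(4))
diameter = len(growth) - 1
print(growth, diameter)  # [1, 3, 5, 6, 5, 3, 1] 6
```

For adjacent transpositions the distance from the identity equals the number of inversions, so the diameter is n(n-1)/2. A set-based search like this stores every group element and is only practical for small n.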

Adversarial Attacks in AI-Driven RAN Slicing: SLA Violations and Recovery

Authors: Deemah H. Tashman, Soumaya Cherkaoui

First: 2026-04-01T15:54:06+00:00 · Latest: 2026-04-01T15:54:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Next-generation (NextG) cellular networks are designed to support emerging applications with diverse data rate and latency requirements, such as immersive multimedia services and large-scale Internet of Things deployments. A key enabling mechanism is radio access network (RAN) slicing, which dynamically partitions radio resources into virtual resource blocks to efficiently serve heterogeneous traffic classes, including enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (URLLC). In this paper, we study the impact of adversarial attacks on AI-driven RAN slicing decisions, where a budget-constrained adversary selectively jams slice transmissions to bias deep reinforcement learning (DRL)-based resource allocation, and quantify the resulting service level agreement (SLA) violations and post-attack recovery behavior. Our results indicate that budget-constrained adversarial jamming can induce severe and slice-dependent steady-state SLA violations. Moreover, the DRL agent's reward converges toward the clean baseline only after a non-negligible recovery period.

中文标题/摘要

标题：AI驱动的RAN切片中的对抗攻击：SLA违约与恢复

下一代(NextG)蜂窝网络旨在支持具有不同数据速率和延迟要求的新兴应用，如沉浸式多媒体服务和大规模物联网部署。关键实现机制是无线接入网络(RAN)切片，它动态地将无线资源划分为虚拟资源块，以高效地服务于包括增强型移动宽带(eMBB)、大规模机器类型通信(mMTC)和超可靠低延迟通信(URLLC)在内的异构流量类别。在本文中，我们研究了对抗攻击对AI驱动的RAN切片决策的影响，其中预算有限的攻击者有选择地干扰切片传输以偏倚基于深度强化学习(DRL)的资源分配，并量化由此产生的服务级别协议(SLA)违约和攻击后的恢复行为。我们的结果表明，预算有限的对抗性干扰可以导致严重的、切片依赖的稳态SLA违约。此外，DRL代理的奖励仅在非可忽略的恢复期后才收敛到干净的基本水平。

Summary / 总结

This paper investigates the impact of adversarial attacks on AI-driven RAN slicing decisions, where a constrained adversary selectively jams slice transmissions to bias DRL-based resource allocation. The study quantifies severe and slice-dependent SLA violations and shows that recovery to the clean baseline reward requires a significant period of time.

本文研究了恶意攻击对基于AI的RAN切片决策的影响，其中受预算限制的攻击者选择性地干扰切片传输以偏倚基于DRL的资源分配。研究量化了严重的、切片依赖的SLA违约情况，并表明恢复到干净基线奖励需要相当长的时间。

A global dataset of continuous urban dashcam driving

Authors: Md Shadab Alam, Olena Bazilinska, Pavlo Bazilinskyy

First: 2026-04-01T15:52:17+00:00 · Latest: 2026-04-01T15:52:17+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute-scale, temporally contiguous, unedited, front-facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment-level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT); e.g. person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign, etc. CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.

中文标题/摘要

标题：全球连续城市行车记录仪驾驶数据集

我们介绍了CROWD（城市道路观察与行车记录仪），这是一个由人工整理的普通、分钟级、时间连续、未经编辑的前方城市行车记录仪片段数据集，这些片段是从公开的YouTube视频中筛选和分割出来的。CROWD旨在通过优先考虑常规驾驶并明确排除事故、事故后的处理和其他编辑或事件集中内容来支持跨域鲁棒性和交互分析。该数据集包含51,753个片段记录，覆盖20,275.56小时（42,032个视频），涉及来自六大有人居住大陆（非洲、亚洲、欧洲、北美洲、南美洲和大洋洲）238个国家和地区中的7,103个命名居住地，每个片段的手动标签包括时间段（白天或夜晚）和车辆类型。为了降低基准测试的门槛，我们提供了所有80个MS-COCO类的机器生成检测的片段级CSV文件，使用YOLOv11x生成，以及片段局部多对象轨迹（BoT-SORT）；例如，行人、自行车、摩托车、汽车、公共汽车、卡车、交通灯、停车标志等。CROWD以视频标识符和片段边界以及衍生注释的形式分发，使研究人员能够进行可重复的研究而不重新分发底层视频。

Summary / 总结

The research introduces CROWD, a manually curated dataset of urban dashcam footage designed for cross-domain robustness and interaction analysis. It consists of 51,753 segments totalling 20,275.56 hours from 42,032 videos, covering 7,103 named inhabited places in 238 countries and territories, and it excludes crashes and edited content. The dataset includes segment-level manual labels for time of day and vehicle type, and provides machine-generated detections for all 80 MS-COCO classes, along with multi-object tracks, to facilitate benchmarking. It is distributed as video identifiers with segment boundaries and derived annotations, which supports reproducible research without redistributing the underlying videos.

CROWD 是一个人工整理的城市行车记录仪片段数据集，旨在支持跨域鲁棒性和交互分析。该数据集包括来自238个国家和地区7,103个地点的51,753个片段，总时长为20,275.56小时，附有昼夜时段和车辆类型的手动标签。数据集还提供了80个MS-COCO类别的机器生成检测和多对象轨迹，便于基准测试，并且无需重新分发底层视频。

ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Authors: Fengyuan Yang, Luying Huang, Jiazhi Guan, Quanwei Yang, Dongwei Pan, Jianglin Fu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Angela Yao

First: 2026-04-01T15:52:00+00:00 · Latest: 2026-04-01T15:52:00+00:00

Comments: 23 pages, 7 figures

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained and independent editing of subjects and scenes remains a critical challenge. Recent attempts to incorporate richer environment control through rigid 3D geometric compositions often encounter a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignments. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project is available at: https://martayang.github.io/ONE-SHOT/.

中文标题/摘要

标题：ONE-SHOT：通过空间解耦运动注入和混合上下文集成实现组成性的人类-环境视频合成

近期视频基础模型（VFMs）的进展已经革新了以人类为中心的视频合成，但对主体和场景进行精细且独立的编辑仍然是一个关键挑战。通过刚性3D几何组合来引入更丰富的环境控制的尝试，往往在精确控制和生成灵活性之间遇到明显的权衡。此外，繁重的3D预处理仍然限制了实际的可扩展性。在本文中，我们提出了一种参数高效的框架ONE-SHOT，用于组成性的人类-环境视频生成。我们的关键见解是将生成过程分解为独立的信号。具体而言，我们引入了一种标准空间注入机制，通过交叉注意力将人类动态与环境线索解耦。我们还提出了一种新颖的位置嵌入策略——动态地面RoPE，它在没有任何启发式3D对齐的情况下建立了不同空间域之间的空间对应关系。为了支持长时合成，我们引入了一种混合上下文集成机制，以在分钟级生成中保持主体和场景的一致性。实验表明，我们的方法显著优于现有最佳方法，提供了视频合成中更好的结构控制和创意多样性。我们的项目已可在以下链接获取：https://martayang.github.io/ONE-SHOT/

Summary / 总结

The research aims to address the challenge of fine-grained and independent editing of subjects and scenes in human-centric video synthesis. The proposed ONE-SHOT framework decouples human dynamics from environmental cues using a canonical-space injection mechanism and establishes spatial correspondences through Dynamic-Grounded-RoPE. It also introduces a Hybrid Context Integration mechanism to maintain consistency across long-horizon synthesis. Experiments show that ONE-SHOT outperforms existing methods in terms of structural control and creative diversity.

研究旨在解决以人类为中心的视频合成中对主体和场景进行精细且独立编辑的挑战。提出的ONE-SHOT框架将生成过程分解为解耦的信号，使用标准空间注入机制来解耦人类动态和环境线索，并使用Dynamic-Grounded-RoPE策略在没有启发式3D对齐的情况下建立空间对应关系，同时引入混合上下文集成机制以在长时合成中保持一致性。实验表明，ONE-SHOT在结构控制和创意多样性方面优于现有方法。

RoboNeuron: A Middle-Layer Infrastructure for Agent-Driven Orchestration in Embodied AI

Authors: Weifan Guan, Qinghao Hu, Huasen Xi, Chenxiao Zhang, Aosheng Li, Jian Cheng

First: 2025-12-11T07:58:19+00:00 · Latest: 2026-04-01T15:51:58+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language-action (VLA) models and LLM agents have advanced rapidly, yet reliable deployment on physical robots is often hindered by an interface mismatch between agent tool APIs and robot middleware. Current implementations typically rely on ad-hoc wrappers that are difficult to reuse, and changes to the VLA backend or serving stack often necessitate extensive re-integration. We introduce RoboNeuron, a middleware layer that connects the Model Context Protocol (MCP) for LLM agents with robot middleware such as ROS2. RoboNeuron bridges these ecosystems by deriving agent-callable tools directly from ROS schemas, providing a unified execution abstraction that supports both direct commands and modular composition, and localizing backend, runtime, and acceleration-preset changes within a stable inference boundary. We evaluate RoboNeuron in simulation and on hardware through multi-platform base control, arm motion, and VLA-based grasping tasks, demonstrating that it enables modular system orchestration under a unified interface while supporting backend transitions without system rewiring. The full code implementation of this work is available at github repo: https://github.com/guanweifan/RoboNeuron

中文标题/摘要

标题：RoboNeuron：面向具身智能中代理驱动编排的中间层基础设施

视觉-语言-行动（VLA）模型和LLM代理已经取得了快速进展，但在物理机器人上的可靠部署往往受到代理工具API与机器人中间件之间接口不匹配的阻碍。当前实现通常依赖于难以重用的临时包装器，而VLA后端或服务堆栈的更改通常需要进行大量重新集成。我们引入了RoboNeuron，这是一种中间层，它将LLM代理的模型上下文协议（MCP）与ROS2等机器人中间件连接起来。RoboNeuron通过直接从ROS模式派生代理可调用的工具来连接这些生态系统，提供了一种统一的执行抽象，支持直接命令和模块化组合，并将后端、运行时和加速预设的变化局限在一个稳定的推理边界内。我们通过多平台基础控制、手臂运动和基于VLA的抓取任务在仿真和硬件上评估了RoboNeuron，证明了它能够在统一接口下实现模块化系统编排，同时支持后端转换而不需重新布线。此工作的完整代码实现可在github仓库中获得：https://github.com/guanweifan/RoboNeuron

Summary / 总结

RoboNeuron addresses the interface mismatch between LLM agent tool APIs and robot middleware by connecting the Model Context Protocol (MCP) with middleware such as ROS2. It derives agent-callable tools directly from ROS schemas, provides a unified execution abstraction for both direct commands and modular composition, and localizes backend, runtime, and acceleration-preset changes within a stable inference boundary. Evaluations in simulation and on hardware, covering multi-platform base control, arm motion, and VLA-based grasping, show that it enables modular system orchestration under a unified interface and supports backend transitions without system rewiring.

RoboNeuron旨在解决LLM代理与机器人中间件之间的接口不匹配问题，通过Model Context Protocol (MCP)连接这些生态系统。通过从ROS模式中派生代理可调用的工具，RoboNeuron提供了一种统一的执行抽象，支持直接命令和模块化组合，允许后端更改而不需重新布线。RoboNeuron在仿真和硬件上通过各种任务进行了评估，展示了其在统一接口下进行模块化系统编排的能力，并支持后端转换而不需重新集成。

Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL

Authors: Manuel Serra Nunes, Atabak Dehban, Yiannis Demiris, José Santos-Victor

First: 2024-05-27T13:32:43+00:00 · Latest: 2026-04-01T15:51:54+00:00

Comments: 13 pages, 8 figures, conference

Abs · PDF · Code1 · Code2

Abstract

Despite the significant advances in Deep Reinforcement Learning (RL) observed in the last decade, the amount of training experience necessary to learn effective policies remains one of the primary concerns in both simulated and real environments. Looking to solve this issue, previous work has shown that improved efficiency can be achieved by separately modeling the agent and environment, but usually requires a supervisory signal. In contrast to RL, humans can perfect a new skill from a small number of trials and often do so without a supervisory signal, making neuroscientific studies of human development a valuable source of inspiration for RL. In particular, we explore the idea of motor prediction, which states that humans develop an internal model of themselves and of the consequences that their motor commands have on the immediate sensory inputs. Our insight is that the movement of the agent provides a cue that allows the duality between the agent and environment to be learned. To instantiate this idea, we present Ego-Foresight (EF), a self-supervised method for disentangling agent information based on motion and prediction. Our main finding is that, when used as an auxiliary task in feature learning, self-supervised agent awareness improves the sample-efficiency and performance of the underlying RL algorithm. To test our approach, we study the ability of EF to predict agent movement and disentangle agent information. Then, we integrate EF with model-free and model-based RL algorithms to solve simulated control tasks, showing improved sample-efficiency and performance.

中文标题/摘要

标题：自我预见：自我监督学习代理意识表示以提高强化学习

尽管在过去的十年中，深度强化学习（RL）取得了显著的进步，但在模拟和真实环境中学习有效策略所需的训练经验量仍然是主要关切之一。为了解决这一问题，先前的工作表明，通过分别建模代理和环境可以提高效率，但通常需要监督信号。与RL不同，人类可以从少量的试错中掌握新技能，并且通常不需要监督信号，因此神经科学研究为RL提供了宝贵的灵感来源。特别是，我们探讨了运动预测的想法，即人类发展了对自己及其运动命令对即时感官输入后果的内部模型。我们的见解是，代理的运动提供了一种线索，使代理与环境之间的二元性得以学习。为了实现这一想法，我们提出了自我预见（EF），一种基于运动和预测的自我监督方法，用于分离代理信息。我们的主要发现是，当作为特征学习的辅助任务使用时，自我监督的代理意识可以提高底层RL算法的样本效率和性能。为了测试我们的方法，我们研究了EF预测代理运动和分离代理信息的能力。然后，我们将EF与无模型和基于模型的RL算法结合，以解决模拟控制任务，显示出改进的样本效率和性能。

Summary / 总结

The paper addresses the challenge of sample efficiency in reinforcement learning by proposing Ego-Foresight (EF), a self-supervised method that disentangles agent information based on motion and prediction. Used as an auxiliary task in feature learning, EF improves the sample efficiency and performance of the underlying RL algorithm. The experiments first examine EF's ability to predict agent movement and disentangle agent information, then integrate EF with model-free and model-based RL algorithms on simulated control tasks, where sample efficiency and performance both improve.

论文提出了基于运动和预测的自我监督方法Ego-Foresight (EF)，以解决强化学习中的样本效率问题；该方法通过分离代理信息来学习具有代理意识的表示。实验表明，将EF与无模型和基于模型的RL算法结合，可以在模拟控制任务中提高样本效率和性能。

Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

Authors: Aditi Singh, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei, Athanasios V. Vasilakos

First: 2025-01-15T20:40:25+00:00 · Latest: 2026-04-01T15:51:06+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) have advanced artificial intelligence by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multi-step reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multi-agent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows through operational structures ranging from sequential steps to adaptive collaboration. This integration enables Agentic RAG systems to deliver flexibility, scalability, and context-awareness across diverse applications. This paper presents an analytical survey of Agentic RAG systems. It traces the evolution of RAG paradigms, introduces a principled taxonomy of Agentic RAG architectures based on agent cardinality, control structure, autonomy, and knowledge representation, and provides a comparative analysis of design trade-offs across existing frameworks. The survey examines applications in healthcare, finance, education, and enterprise document processing, and distills practical lessons for system designers and practitioners. Finally, it identifies key open research challenges related to evaluation, coordination, memory management, efficiency, and governance, outlining directions for future research.

中文标题/摘要

标题：代理检索增强生成：代理RAG综述

大型语言模型（LLMs）通过实现类人文本生成和自然语言理解推动了人工智能的发展。然而，它们依赖于静态训练数据，限制了它们对动态、实时查询的响应能力，导致输出过时或不准确。检索增强生成（RAG）作为一种解决方案出现，通过整合实时数据检索来增强LLMs，提供上下文相关和最新的响应。尽管具有潜力，传统RAG系统受限于静态工作流程，缺乏多步推理和复杂任务管理所需的适应性。代理检索增强生成（Agentic RAG）超越了这些限制，将自主AI代理嵌入到RAG管道中。这些代理利用代理设计模式中的反思、规划、工具使用和多代理协作，动态管理检索策略，迭代细化上下文理解，并通过从顺序步骤到适应性协作的操作结构适应工作流程。这种整合使Agentic RAG系统能够在各种应用中实现灵活性、可扩展性和上下文感知。本文对Agentic RAG系统进行了分析综述。它追溯了RAG范式的演变，基于代理基数、控制结构、自主性和知识表示引入了代理RAG架构的原理性分类，并对现有框架的设计权衡进行了比较分析。综述考察了医疗保健、金融、教育和企业文档处理等应用，并提炼了系统设计师和实践者的实用经验教训。最后，它指出了与评估、协调、内存管理、效率和治理相关的关键开放研究挑战，并概述了未来研究的方向。

Summary / 总结

Agentic Retrieval-Augmented Generation (Agentic RAG) addresses the limitations of traditional RAG systems by embedding autonomous AI agents that dynamically manage retrieval strategies and adapt workflows. This approach enhances flexibility and scalability, enabling context-aware responses in diverse applications such as healthcare, finance, and education. The paper provides a comprehensive survey of Agentic RAG, introducing a taxonomy based on agent cardinality, control structure, autonomy, and knowledge representation, and identifies key research challenges for future work.

Agentic Retrieval-Augmented Generation (Agentic RAG)通过嵌入自主AI代理来动态管理检索策略并调整工作流程，克服了传统RAG系统的局限性，增强了系统的灵活性和可扩展性，使其能够在多种应用场景中提供上下文感知的响应。该综述基于代理基数、控制结构、自主性和知识表示提出了Agentic RAG架构的分类体系，并考察了其在医疗保健、金融、教育和企业文档处理中的应用。综述还指出了评估、协调和内存管理等开放研究挑战，并为未来的研究指明了方向。

Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A., Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Venue: ICLR 2026 poster

First: 2026-03-13T01:11:23+00:00 · Latest: 2026-04-01T15:49:01+00:00

Comments: Accepted as a poster at ICLR 2026 workshop ICBINB, typo fixed

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.

中文标题/摘要

标题：空间推理并非免费午餐：LLaVA 的受控研究

视觉-语言模型（VLMs）取得了快速进展，但仍难以处理基本的空间推理。尽管在通用基准测试中表现出色，现代 VLMs 在理解二维空间关系（如相对位置、布局和计数）方面仍然脆弱。我们认为这种失败不仅仅是数据问题，而是与当前 VLM 管线中的主导设计选择密切相关：依赖 CLIP 风格的图像编码器和将图像扁平化为一维 token 序列并使用一维位置编码。我们在一个受控的诊断研究中在 LLaVA 框架内隔离这些选择如何影响空间定位。我们评估了前沿模型和 LLaVA 变体在一系列空间基准测试上的表现，将基于 CLIP 的编码器与使用更密集或生成性目标训练的替代编码器进行比较，以及带有二维位置编码的变体。我们的结果显示模型在空间性能上存在一致的差距，并表明编码器目标和位置结构影响空间行为，但并未完全解决这一问题。

Summary / 总结

The study investigates why vision-language models struggle with spatial reasoning despite good performance on general benchmarks. It finds that the failure is not just due to data issues but is linked to the design choices in VLMs, such as reliance on CLIP-style image encoders and flattening images into 1D token sequences. By evaluating models within the LLaVA framework, the researchers show that different encoder objectives and positional encodings affect spatial performance but do not fully resolve the issue.

研究探讨了尽管视觉语言模型在通用基准上表现良好，但为何在空间推理方面仍存在问题。发现这种失败不仅是因为数据问题，还与视觉语言模型的设计选择有关，如依赖CLIP风格的图像编码器和将图像扁平化为1D令牌序列。通过在LLaVA框架中评估模型，研究者表明不同的编码器目标和位置编码会影响空间性能，但并不能完全解决这一问题。
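The point about flattening images into 1D token sequences can be made concrete with a small sketch. It assumes row-major flattening of a hypothetical 24x24 patch grid; the grid size and the helper names `flatten_index` and `sequence_gap` are illustrative and not taken from the paper. It shows that two horizontally adjacent patches stay adjacent in the sequence, while two vertically adjacent patches end up a full row apart, so a 1D positional index does not directly encode 2D neighbourhood.

```python
def flatten_index(row, col, width):
    """Row-major position of patch (row, col) in the flattened token sequence."""
    return row * width + col


def sequence_gap(patch_a, patch_b, width):
    """Distance between two patches measured in 1D sequence positions."""
    return abs(flatten_index(*patch_a, width) - flatten_index(*patch_b, width))


WIDTH = 24  # assumed patch-grid width, for illustration only

horizontal_gap = sequence_gap((5, 5), (5, 6), WIDTH)  # right-hand neighbour
vertical_gap = sequence_gap((5, 5), (6, 5), WIDTH)    # neighbour directly below

print(horizontal_gap, vertical_gap)  # 1 24
```

A 2D positional encoding would instead give each patch a (row, col) pair, under which both neighbours are one step away.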
