EventNeuS: 3D Mesh Reconstruction from a Single Event Camera
Authors: Shreyas Sachan, Viktor Rudnev, Mohamed Elgharib, Christian Theobalt, Vladislav Golyanik
Venue: International Conference on 3D Vision (3DV) 2026
First: 2026-02-03T18:59:57+00:00 · Latest: 2026-02-03T18:59:57+00:00
Comments: 13 pages, 10 figures, 3 tables; project page: https://4dqv.mpi-inf.mpg.de/EventNeuS/
Abstract
Event cameras offer a considerable alternative to RGB cameras in many scenarios. While there are recent works on event-based novel-view synthesis, dense 3D mesh reconstruction remains scarcely explored and existing event-based techniques are severely limited in their 3D reconstruction accuracy. To address this limitation, we present EventNeuS, a self-supervised neural model for learning 3D representations from monocular colour event streams. Our approach, for the first time, combines 3D signed distance function and density field learning with event-based supervision. Furthermore, we introduce spherical harmonics encodings into our model for enhanced handling of view-dependent effects. EventNeuS outperforms existing approaches by a significant margin, achieving 34% lower Chamfer distance and 31% lower mean absolute error on average compared to the best previous method.
中文标题/摘要
标题:EventNeuS:单事件相机的3D网格重建
事件相机在许多场景中提供了RGB相机的显著替代方案。尽管已经有一些基于事件的新型视图合成工作,但密集的3D网格重建仍然鲜有探索,现有的基于事件的方法在3D重建精度上也受到严重限制。为解决这一限制,我们提出了EventNeuS,这是一种用于学习单目彩色事件流的自监督神经模型。我们的方法首次将3D符号距离函数和密度场学习与基于事件的监督相结合。此外,我们还将球谐编码引入到我们的模型中,以增强对视点依赖效应的处理能力。EventNeuS在Chamfer距离和平均绝对误差上分别比最佳先前方法低34%和31%,显著优于现有方法。
Summary / 总结
EventNeuS is a self-supervised neural model for 3D mesh reconstruction from monocular color event streams captured by a single event camera. It combines 3D signed distance function and density field learning with event-based supervision and adds spherical harmonics encodings to handle view-dependent effects. Compared to the best previous method, it lowers Chamfer distance by 34% and mean absolute error by 31% on average.
EventNeuS 是一种自监督神经模型,用于从单个事件相机采集的单目彩色事件流中重建 3D 网格。它将 3D 符号距离函数和密度场学习与基于事件的监督相结合,并引入球谐编码以处理视角依赖效应。与此前最佳方法相比,EventNeuS 的 Chamfer 距离平均降低 34%,平均绝对误差平均降低 31%。
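The Chamfer distance quoted above compares a reconstructed surface with a reference one through nearest-neighbour distances between sampled points. The following is a minimal sketch of one common symmetric variant; the paper's exact definition (squared versus unsquared distances, sampling density, normalisation) is not given in this digest, so the function name `chamfer_distance` and the formula used here are assumptions.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (lists of 3D tuples).

    Assumed variant: mean unsquared nearest-neighbour distance, summed over
    both directions. Other variants square the distances or average the sides.
    """
    def one_sided(src, dst):
        # Mean distance from each source point to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)

# Toy example: two points per set, one of them displaced by one unit.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

This brute-force version is quadratic in the number of points; evaluations on dense meshes usually rely on a spatial index instead.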
Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes
Authors: Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie
First: 2026-01-26T18:57:00+00:00 · Latest: 2026-02-03T18:58:52+00:00
Abstract
Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.
中文标题/摘要
标题:重用你的FLOPs:通过条件化非常离策前缀扩展强化学习到难题
典型的强化学习(RL)方法在LLM推理中对难题浪费计算资源,因为在这些难题上正确的策略轨迹罕见,策略梯度消失,学习停滞。为了更高效地启动RL,我们考虑重用旧的采样FLOPs(来自先前推理或RL训练)以离策轨迹的形式。标准的离策方法使用离策数据进行监督,导致RL优化过程中出现不稳定性。我们引入了前缀RL,其中我们条件化于成功的离策轨迹的前缀,并运行在策RL来完成它们,从而绕过离策不稳定性。前缀RL通过调节离策前缀长度来调整问题的难度,从而增强在难题上的学习信号。我们证明前缀RL目标不仅与标准的RL目标一致,而且更具样本效率。实验中,我们发现反向泛化:仅在前缀问题上进行训练可以推广到未见过的无前缀性能,且学习策略往往与前缀中的不同。在我们的实验中,我们通过拒绝采样基模型来源的离策轨迹,创建了一个自我改进循环。在难题推理问题上,前缀RL比最强基线(在离策数据上进行SFT然后RL)快2倍达到相同的训练奖励,即使考虑了初始拒绝采样的计算成本,且最终奖励提高了3倍。这些收益转移到了保留的基准上,即使离策轨迹源自不同的模型家族,前缀RL仍然有效,验证了其在实际应用中的灵活性。
Summary / 总结
The paper addresses the inefficiency of reinforcement learning (RL) on hard problems, where correct on-policy traces are rare, by proposing PrefixRL: the policy is conditioned on prefixes of successful off-policy traces and completes them with on-policy RL, which side-steps off-policy instabilities. Varying the prefix length modulates problem difficulty and boosts the learning signal. In the experiments, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data followed by RL), even after accounting for the initial rejection-sampling compute, and increases the final reward by 3x. It also exhibits back-generalization: training only on prefixed problems improves performance on unprefixed ones.
论文针对强化学习(RL)方法在正确在策轨迹稀少的难题上效率低下的问题,提出了PrefixRL方法:以成功离策轨迹的前缀为条件,再用在策RL完成其余部分,从而避免离策训练的不稳定性。实验表明,PrefixRL在难题上达到相同训练奖励的速度比最强基线快2倍,并将最终奖励提高了3倍,还表现出反向泛化:仅在带前缀的问题上训练也能提升无前缀问题上的表现。
Investigating Quantum Circuit Designs Using Neuro-Evolution
Authors: Devroop Kar, Daniel Krutz, Travis Desell
First: 2026-02-03T18:57:39+00:00 · Latest: 2026-02-03T18:57:39+00:00
Comments: Submitted to The Genetic and Evolutionary Computation Conference (GECCO) 2026. Under Review
Abstract
Designing effective quantum circuits remains a central challenge in quantum computing, as circuit structure strongly influences expressivity, trainability, and hardware feasibility. Current approaches, whether using manually designed circuit templates, fixed heuristics, or automated rules, face limitations in scalability, flexibility, and adaptability, often producing circuits that are poorly matched to the specific problem or quantum hardware. In this work, we propose the Evolutionary eXploration of Augmenting Quantum Circuits (EXAQC), an evolutionary approach to the automated design and training of parameterized quantum circuits (PQCs) which leverages and extends on strategies from neuroevolution and genetic programming. The proposed method jointly searches over gate types, qubit connectivity, parameterization, and circuit depth while respecting hardware and noise constraints. The method supports both Qiskit and Pennylane libraries, allowing the user to configure every aspect. This work highlights evolutionary search as a critical tool for advancing quantum machine learning and variational quantum algorithms, providing a principled pathway toward scalable, problem-aware, and hardware-efficient quantum circuit design. Preliminary results demonstrate that circuits evolved on classification tasks are able to achieve over 90% accuracy on most of the benchmark datasets with a limited computational budget, and are able to emulate target circuit quantum states with high fidelity scores.
中文标题/摘要
标题:使用神经进化研究量子电路设计
设计有效的量子电路仍然是量子计算中的一个核心挑战,因为电路结构强烈影响其表达能力、可训练性和硬件可行性。当前的方法,无论是使用手动设计的电路模板、固定的启发式规则还是自动化的规则,都面临着可扩展性、灵活性和适应性方面的限制,往往生成的电路与特定问题或量子硬件不匹配。在本工作中,我们提出了进化增强量子电路探索(EXAQC)方法,这是一种利用和扩展神经进化和遗传编程策略的进化方法,用于自动设计和训练参数化量子电路(PQCs)。该方法同时搜索门类型、量子比特连接性、参数化和电路深度,同时遵守硬件和噪声约束。该方法支持Qiskit和Pennylane库,允许用户配置每个方面。本工作强调进化搜索是推进量子机器学习和变分量子算法的关键工具,提供了一条有原则的途径,以实现可扩展、问题感知和硬件高效的量子电路设计。初步结果表明,用于分类任务的电路在有限的计算预算下,能够在大多数基准数据集上实现超过90%的准确率,并且能够以高保真度模拟目标电路的量子态。
Summary / 总结
This study addresses the challenge of designing effective quantum circuits by proposing EXAQC, an evolutionary approach that jointly searches over gate types, qubit connectivity, parameterization, and circuit depth while respecting hardware and noise constraints. The method supports the Qiskit and Pennylane libraries. Preliminary results show that evolved circuits achieve over 90% accuracy on most of the benchmark classification datasets within a limited computational budget and emulate target quantum states with high fidelity.
该研究通过提出EXAQC,一种进化方法,搜索门类型、量子比特连接性、参数化和电路深度,同时遵守硬件约束,来解决有效量子电路设计的挑战。该方法支持Qiskit和Pennylane库,在有限的计算资源下,对于分类任务的基准数据集实现了超过90%的准确率,并且能够以高保真度模拟目标量子状态。
Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves
Authors: Alessio Borgi, Fabrizio Silvestri, Pietro Liò
Venue: ICML 2026
First: 2025-11-28T23:10:54+00:00 · Latest: 2026-02-03T18:57:37+00:00
Comments: Under Review at ICML 2026
Abstract
Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and linear learnable restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.
中文标题/摘要
标题:多项式神经纤维扩散:基于谱滤波的细胞纤维上的一种方法
纤维神经网络为图结构配备了细胞纤维:一种几何结构,它将局部向量空间(stalks)和线性可学习的限制/传输映射分配给节点和边,从而产生一种边感知的归纳偏置,能够处理异质性和限制过度平滑。然而,常见的神经纤维扩散实现依赖于基于SVD的纤维规范化和密集的边限制映射,这些映射与stalk维度成比例,需要频繁重建拉普拉斯矩阵,并产生脆弱的梯度。为了解决这些限制,我们引入了多项式神经纤维扩散(PolyNSD),这是一种新的纤维扩散方法,其传播算子是归一化纤维拉普拉斯矩阵的度为K的多项式,通过谱缩放操作的稳定三项递推计算。这在一个层中提供了一个显式的K跳感受野(与stalk维度无关),并且可训练的谱响应是K+1个正交多项式基响应的凸混合。PolyNSD通过凸混合、谱缩放和残差/门控路径确保稳定性,达到了在同质性和异质性基准测试上的新最佳结果,逆转了神经纤维扩散的趋势,仅使用对角限制映射即可获得这些结果,解耦性能与大stalk维度无关,同时减少了运行时间和内存需求。
Summary / 总结
The research aims to improve the scalability and stability of Neural Sheaf Diffusion by addressing limitations such as the use of SVD-based sheaf normalization and dense per-edge restriction maps. The proposed Polynomial Neural Sheaf Diffusion (PolyNSD) uses a polynomial in a normalized sheaf Laplacian, evaluated via a stable three-term recurrence, to provide an explicit K-hop receptive field in a single layer. This method achieves state-of-the-art results on both homophilic and heterophilic benchmarks, using only diagonal restriction maps, which decouples performance from large stalk dimensions and reduces runtime and memory requirements.
研究旨在解决神经纤维扩散的局限性(如依赖基于SVD的纤维规范化和密集的逐边限制映射),以提升其可扩展性和稳定性。作者引入了多项式神经纤维扩散(PolyNSD),其传播算子是规范化纤维拉普拉斯的K次多项式,并通过稳定的三项递推计算。这种方法在单层中提供了显式的K跳感受野,并使用正交多项式基响应的凸混合来获得可训练的频谱响应。PolyNSD在同配性(homophilic)和异配性(heterophilic)基准测试中均达到了最先进的结果,且仅使用对角限制映射,使性能不再依赖较大的stalk维度,同时减少了运行时间和内存需求。
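The propagation step described in the PolyNSD abstract (a degree-K polynomial of a rescaled Laplacian, evaluated by a three-term recurrence and mixed with convex weights) can be sketched as follows. This is only an illustration under stated assumptions: the Chebyshev basis, the softmax used to obtain convex weights, the default `lam_max=2.0`, and the use of an ordinary normalised graph Laplacian in place of a sheaf Laplacian are all choices made here, not details confirmed by the abstract. The names `poly_filter`, `matvec`, and `softmax` are hypothetical.

```python
import math

def matvec(matrix, vec):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def softmax(logits):
    """Map unconstrained logits to non-negative weights that sum to one."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def poly_filter(laplacian, x, logits, lam_max=2.0):
    """Return sum_k theta_k * T_k(L_hat) x with theta = softmax(logits).

    len(logits) == K + 1 gives a degree-K polynomial, i.e. a K-hop filter.
    """
    n = len(x)
    # Spectral rescaling (assumed form): map [0, lam_max] to [-1, 1].
    l_hat = [[(2.0 / lam_max) * laplacian[i][j] - (1.0 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    theta = softmax(logits)
    t_prev = list(x)                                  # T_0(L_hat) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(l_hat, x)                         # T_1(L_hat) x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # Chebyshev recurrence: T_k = 2 * L_hat * T_{k-1} - T_{k-2}
        t_next = [2.0 * a - b for a, b in zip(matvec(l_hat, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out

# Normalised Laplacian of a two-node path graph (eigenvalues 0 and 2).
lap = [[1.0, -1.0], [-1.0, 1.0]]
print(poly_filter(lap, [1.0, 1.0], [0.0, 0.0, 0.0]))
```

Because only matrix-vector products with the Laplacian are needed, no eigendecomposition or SVD appears in this step, which is the property the abstract emphasises.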
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL
Authors: Erfan Miahi, Eugene Belilovsky
First: 2026-02-03T18:56:48+00:00 · Latest: 2026-02-03T18:56:48+00:00
Comments: 32 pages, 14 figures
Abstract
Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.
中文标题/摘要
标题:理解并利用权重更新稀疏性以实现通信高效的分布式强化学习
强化学习(RL)是后训练大型语言模型(LLMs)的关键组成部分。然而,在带宽受限的分布式RL中,可扩展性通常受到训练器到推理工作者同步策略权重的瓶颈限制,特别是在普通网络或去中心化设置中。虽然最近的研究表明,RL更新仅修改少量模型参数,但这些观察通常是基于粗略的检查点差异。我们对权重更新稀疏性进行了系统性的经验研究,包括步长级和多步级粒度,考察了其在训练动态、离策略延迟和模型规模方面的演变。我们发现,更新稀疏性在实际相关设置中始终很高,经常超过99%。利用这种结构,我们提出了PULSE(基于无损稀疏编码的补丁更新)方法,该方法仅传输修改参数的索引和值,从而实现无损权重同步。PULSE对传输错误具有鲁棒性,并避免了累加差分方案固有的浮点漂移。在带宽受限的去中心化环境中,我们的方法在保持与完整权重同步相同的训练动态和性能的同时,实现了超过100倍(14 GB到~108 MB)的通信减少。通过利用这种结构,PULSE使去中心化RL训练接近集中式吞吐量,将权重同步所需的带宽从20 Gbit/s降低到0.2 Gbit/s,以保持高GPU利用率。
Summary / 总结
This paper addresses communication efficiency in distributed reinforcement learning (RL) by studying the sparsity of weight updates. It finds that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Based on this observation, the authors propose PULSE, a lossless method that transmits only the indices and values of modified parameters, achieving over 100x communication reduction (14 GB to ~108 MB) while keeping training dynamics bit-identical to full weight synchronization. This lets decentralized RL training approach centralized throughput, cutting the bandwidth needed for weight synchronization from 20 Gbit/s to 0.2 Gbit/s.
该论文通过研究权重更新的稀疏性来解决分布式强化学习中的通信效率问题。作者发现权重更新在不同训练场景下高度稀疏,经常超过99%。他们提出了一种名为PULSE的方法,仅传输修改参数的索引和值,实现了超过100倍的通信减少,同时保持训练动态和性能。这使得分布式RL训练接近集中式吞吐量,显著减少了所需的带宽。
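The idea behind PULSE, sending only the indices and new values of parameters that changed, can be illustrated with a small sketch. It is not the authors' implementation: the wire format, compression, and error handling are not described in the abstract, and the helper names `encode_patch`, `apply_patch`, and `update_sparsity` are hypothetical. Weights are plain Python floats here rather than tensors.

```python
import struct

def _bits(value):
    """Exact 64-bit pattern of a float, so comparisons are bit-level."""
    return struct.pack("<d", value)

def encode_patch(old_weights, new_weights):
    """Return (indices, values) for entries whose bit pattern changed."""
    indices = [i for i, (a, b) in enumerate(zip(old_weights, new_weights))
               if _bits(a) != _bits(b)]
    return indices, [new_weights[i] for i in indices]

def apply_patch(weights, patch):
    """Overwrite the listed entries with the transmitted values."""
    indices, values = patch
    updated = list(weights)
    for i, v in zip(indices, values):
        updated[i] = v  # direct overwrite: no addition, so no rounding drift
    return updated

def update_sparsity(old_weights, patch):
    """Fraction of parameters left untouched by the update."""
    return 1.0 - len(patch[0]) / len(old_weights)

old = [0.1, 0.2, 0.3, 0.4]
new = [0.1, 0.2000001, 0.3, 0.4]
patch = encode_patch(old, new)
print(patch, update_sparsity(old, patch))
```

Transmitting values instead of additive deltas is what keeps the receiver bit-identical to the sender: applying `old + (new - old)` in floating point does not always reproduce `new` exactly, whereas an overwrite does.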
PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization
Authors: Erzhen Hu, Frederik Brudy, David Ledo, George Fitzmaurice, Fraser Anderson
First: 2026-02-03T18:56:40+00:00 · Latest: 2026-02-03T18:56:40+00:00
Comments: 21 pages, 13 figures; accepted and to appear at CHI 2026
Abstract
In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film's possibilities before full-scale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers for filmmakers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical consideration in AI-assisted filmmaking.
中文标题/摘要
标题:PrevizWhiz:结合粗糙3D场景和2D视频以指导生成式视频预可视化
在前期制作中,电影制作人和3D动画专家必须在全面投入制作之前快速将想法原型化,以探索影片的各种可能性,但传统方法需要在效率和表现力之间做出权衡。手绘故事板往往缺乏复杂摄影所需的空间精度,而3D预可视化则需要专业知识和高质量的绑定资产。为了解决这一差距,我们提出了PrevizWhiz系统,该系统利用粗糙的3D场景与生成式图像和视频模型相结合,创建风格化的视频预览。工作流程包括相似度可调的帧级图像重制、基于运动路径或外部视频输入的时间编辑,以及最终细化为高保真视频片段。一项针对电影制作人的研究显示,我们的系统降低了电影制作人的技术门槛,加速了创意迭代,并有效地弥合了沟通差距,同时也揭示了AI辅助电影制作中的连贯性、作者身份和伦理考虑等挑战。
Summary / 总结
PrevizWhiz is a system that combines rough 3D scenes with generative models to create stylized video previews, addressing the trade-offs in efficiency and expressiveness in pre-production. The system allows for frame-level image restyling, time-based editing, and refinement into high-fidelity video clips. A study with filmmakers shows that PrevizWhiz lowers technical barriers, accelerates creative iteration, and effectively bridges communication gaps, though it also highlights challenges in continuity, authorship, and ethical considerations in AI-assisted filmmaking.
PrevizWhiz 是一个系统,结合粗糙的 3D 场景和生成模型来创建风格化的视频预览,以解决预生产中的效率和表达性权衡问题。该系统允许帧级图像重制、时间编辑和向高保真视频片段的细化。研究显示,PrevizWhiz 降低了技术门槛,加速了创意迭代,并有效地弥合了沟通差距,但也提出了连续性、作者权和 AI 辅助电影制作中的伦理考虑等挑战。
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
Authors: Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang
Venue: ICLR 2026
First: 2026-02-03T18:41:43+00:00 · Latest: 2026-02-03T18:41:43+00:00
Comments: Accepted at ICLR 2026
Abstract
High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset and huggingface space are released at https://github.com/ResearAI/AutoFigure.
中文标题/摘要
标题:AutoFigure:生成和优化出版级科学插图
高质量的科学插图对于有效传达复杂的科学和技术概念至关重要,但其手动创建仍然是学术界和工业界公认的瓶颈。我们提出了FigureBench,这是首个用于从长篇科学文本生成科学插图的大规模基准。它包含3,300个高质量的科学文本-插图对,涵盖了来自科学论文、综述、博客和教科书的多种文本到插图任务。此外,我们提出了AutoFigure,这是首个基于长篇科学文本自动生成高质量科学插图的代理框架。具体而言,在最终呈现结果之前,AutoFigure 会进行广泛的思考、重组和验证,以生成既结构合理又美观的布局,输出同时具备结构完整性和美学吸引力的科学插图。利用FigureBench中的高质量数据,我们进行了广泛的实验,测试AutoFigure相对于各种基线方法的性能。结果表明,AutoFigure 一致地超越了所有基线方法,生成了出版级的科学插图。代码、数据集和huggingface空间已发布在https://github.com/ResearAI/AutoFigure。
Summary / 总结
The research aims to address the bottleneck of manually creating high-quality scientific illustrations. AutoFigure, an agentic framework, is proposed to automatically generate such illustrations from long-form scientific texts. It involves extensive thinking, recombination, and validation to produce structurally sound and aesthetically refined illustrations. Experiments show that AutoFigure outperforms baseline methods, generating publication-ready scientific illustrations. The dataset and code are publicly available.
论文针对科学插图制作耗时且难以手工完成的问题,提出了一个包含3,300个图文对的基准数据集FigureBench,并开发了AutoFigure框架,能够从长文本中自动生成结构合理且视觉上吸引人的科学插图。实验结果显示,AutoFigure在性能上超越了现有方法,生成的插图可以直接用于出版。
Multi-Agent Pathfinding Under Team-Connected Communication Constraint via Adaptive Path Expansion and Dynamic Leading
Authors: Hoang-Dung Bui, Erion Plaku, Gregory J. Stein
First: 2025-01-06T05:21:18+00:00 · Latest: 2026-02-03T18:36:02+00:00
Abstract
This paper proposes a novel planning framework to handle a multi-agent pathfinding problem under a team-connected communication constraint, where all agents must have a connected communication channel to the rest of the team during their entire movements. Standard multi-agent path finding approaches (e.g., priority-based search) have potential in this domain but fail when neighboring configurations at start and goal differ. Their single-expansion approach -- computing each agent's path from the start to the goal in just a single expansion -- cannot reliably handle planning under communication constraints for agents as their neighbors change during navigation. Similarly, leader-follower approaches (e.g., platooning) are effective at maintaining team communication, but fixing the leader at the outset of planning can cause planning to become stuck in dense-clutter environments, limiting their practical utility. To overcome this limitation, we propose a novel two-level multi-agent pathfinding framework that integrates two techniques: adaptive path expansion to expand agent paths to their goals in multiple stages; and a dynamic leading technique that enables the reselection of the leading agent during each agent path expansion whenever progress cannot be made. Simulation experiments show the efficiency of our planners, which can handle up to 25 agents across five environment types under a limited communication range constraint and up to 11-12 agents on three environment types under a line-of-sight communication constraint, exceeding a 90% success rate where baselines routinely fail.
中文标题/摘要
标题:基于团队连接通信约束的自适应路径扩展与动态领航的多智能体路径规划
本文提出了一种新的规划框架,用于处理团队连接通信约束下的多智能体路径规划问题,其中所有智能体在整个移动过程中都必须与团队的其他部分保持连接的通信通道。标准的多智能体路径规划方法(例如基于优先级的搜索)在该领域具有潜力,但当起点处与终点处的邻居配置不同时会失效。它们的单次扩展方法(仅通过一次扩展为每个智能体计算从起点到目标的路径)无法可靠地处理通信约束下的路径规划,因为智能体在导航过程中其邻居会不断变化。同样,领航跟随方法(例如编队)在保持团队通信方面非常有效,但在规划初期固定领航者会导致在密集障碍环境中规划陷入困境,限制了其实际应用。为克服这一限制,我们提出了一种新的两层多智能体路径规划框架,该框架结合了两种技术:自适应路径扩展,以多阶段的方式扩展智能体路径到目标;以及动态领航技术,允许在每次智能体路径扩展过程中,当无法取得进展时重新选择领航智能体。仿真实验表明,我们的规划器在有限通信范围约束下可以处理多达25个智能体在五种环境类型中的路径规划,在视线通信约束下可以处理多达11-12个智能体在三种环境类型中的路径规划,成功率超过90%,而基线方法通常会失败。
Summary / 总结
This paper addresses the multi-agent pathfinding problem under a team-connected communication constraint, where all agents must maintain a connected communication channel throughout their movements. It proposes a two-level framework combining adaptive path expansion and dynamic leading to overcome the limitations of standard priority-based search and leader-follower approaches. The proposed method successfully handles up to 25 agents in various environments under limited communication range and up to 11-12 agents under line-of-sight communication, achieving over 90% success rates where baseline methods fail.
本文提出了一种新的框架,结合了自适应路径扩展和动态领导技术,以解决团队连接通信约束下的多智能体路径规划问题。该方法通过多阶段扩展路径并在必要时重新选择领导智能体来克服标准方法的局限性。实验结果表明,所提出的规划器在各种环境下的成功率超过90%,最多可处理25个智能体在有限通信约束下的问题,优于基线方法。
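The team-connected constraint under a limited communication range amounts to requiring that the communication graph over the agents stays connected at every step. A minimal sketch of that check for a single time step is given below. It assumes that messages may be relayed over multiple hops and that only distance matters; the line-of-sight variant would additionally need an obstacle test, which is omitted. The function name `team_connected` is hypothetical and is not taken from the paper.

```python
import math
from collections import deque

def team_connected(positions, comm_range):
    """True if the range-limited communication graph over all agents is connected.

    positions: list of (x, y) agent positions at one time step.
    Two agents are linked when their distance is at most comm_range.
    """
    n = len(positions)
    if n == 0:
        return True
    seen = {0}
    queue = deque([0])
    while queue:
        # Breadth-first search from agent 0 over the communication links.
        i = queue.popleft()
        for j in range(n):
            if j not in seen and math.dist(positions[i], positions[j]) <= comm_range:
                seen.add(j)
                queue.append(j)
    return len(seen) == n

print(team_connected([(0, 0), (1, 0), (2, 0)], comm_range=1.0))
print(team_connected([(0, 0), (1, 0), (3, 0)], comm_range=1.0))
```

A planner would run such a check on every candidate joint configuration along the paths, rejecting expansions that disconnect the team.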
ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning
Authors: Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider
Venue: Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25 - 29, 2026, IFAAMAS, 19 pages
First: 2024-06-20T01:55:08+00:00 · Latest: 2026-02-03T18:35:29+00:00
Comments: Published in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract
Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learn from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm compatible with any credit assignment mechanism that satisfies the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM: ME-QMIX and ME-QPLEX, in non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.
中文标题/摘要
标题:ME-IGM:个体-全局-最大在最大熵多智能体强化学习中的应用
多智能体信用分配是合作多智能体强化学习(MARL)中的一个基本挑战,其中一组智能体通过共享奖励信号进行学习。个体-全局-最大(IGM)条件是多智能体信用分配中广泛使用的原则,要求由个体Q函数确定的联合动作最大化全局Q值。同时,最大熵原则已被用于增强MARL中的探索。然而,我们发现现有最大熵MARL方法的一个关键局限性:局部策略与最大化全局Q值的联合策略之间存在偏差,导致违反了IGM条件。为解决这一偏差,我们提出了一种保持顺序的变换。在此基础上,我们引入了ME-IGM,这是一种与任何满足IGM条件的信用分配机制兼容的新颖最大熵MARL算法,同时享有最大熵探索的好处。我们通过非单调矩阵游戏评估了ME-IGM的两种变体:ME-QMIX和ME-QPLEX,并在SMAC-v2和Overcooked中展示了其在17个场景中的先进性能。
Summary / 总结
The paper addresses the challenge of multi-agent credit assignment in cooperative MARL by proposing ME-IGM, a maximum entropy MARL algorithm that ensures the IGM condition is met. It introduces an order-preserving transformation to align local policies with the global policy, and evaluates ME-IGM variants ME-QMIX and ME-QPLEX in non-monotonic matrix games, showing superior performance in 17 scenarios of SMAC-v2 and Overcooked.
论文通过提出ME-IGM,一种确保局部策略与全局Q值最大化相一致的极大熵MARL算法,来解决合作MARL中的多智能体信用分配问题。它引入了一种有序保持变换以满足IGM条件,并在非单调矩阵游戏中评估了ME-IGM的变体ME-QMIX和ME-QPLEX,结果显示在SMAC-v2和Overcooked的17个场景中表现优于现有方法。
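The Individual-Global-Max (IGM) condition described above states that the joint action assembled from each agent's greedy local action must also maximise the global Q-value. The sketch below checks this for a small tabular game. The payoff matrix and the individual Q-values are illustrative numbers chosen here, not results from the paper, and `satisfies_igm` is a hypothetical helper name.

```python
def satisfies_igm(q_joint, q_individual):
    """Check IGM for a tabular game.

    q_joint: dict mapping joint-action tuples to global Q-values.
    q_individual: list with one list of local Q-values per agent.
    """
    # Each agent greedily picks the action with the highest local Q-value.
    greedy_joint = tuple(max(range(len(q)), key=q.__getitem__)
                         for q in q_individual)
    return q_joint[greedy_joint] == max(q_joint.values())

# Illustrative non-monotonic two-agent payoff: coordinating on action 0 is
# best, but deviating from it unilaterally is heavily penalised.
payoff = [[8, -12, -12],
          [-12, 0, 0],
          [-12, 0, 0]]
q_joint = {(i, j): payoff[i][j] for i in range(3) for j in range(3)}

aligned = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]      # both agents prefer action 0
misaligned = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # both agents prefer action 1
print(satisfies_igm(q_joint, aligned), satisfies_igm(q_joint, misaligned))
```

The misaligned case corresponds to the failure mode the abstract points at: local preferences that no longer select the joint action with the highest global value.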
Continuous Control of Editing Models via Adaptive-Origin Guidance
Authors: Alon Wolf, Chen Katzir, Kfir Aberman, Or Patashnik
First: 2026-02-03T18:33:39+00:00 · Latest: 2026-02-03T18:33:39+00:00
Comments: Project page at https://adaor-paper.github.io/
Abstract
Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.
中文标题/摘要
标题:通过自适应起始点引导实现编辑模型的连续控制
基于扩散的编辑模型已成为语义图像和视频操作的强大工具。然而,现有模型缺乏一种机制来平滑控制文本引导编辑的强度。在标准文本条件生成中,无分类引导(CFG)影响提示的依从性,这表明它可能是编辑模型中编辑强度控制的潜在机制。然而,我们表明,在这些模型中按比例放大CFG不会在输入和编辑结果之间产生平滑过渡。我们将这种行为归因于无条件预测,它作为引导起始点,在低引导比例下主导生成,而代表对输入内容的任意操作。为了实现连续控制,我们引入了自适应起始点引导(AdaOr),这是一种方法,通过使用与身份操作相对应的身份条件自适应起始点调整标准引导起始点。通过根据编辑强度将身份预测与标准无条件预测进行插值,我们确保从输入到编辑结果的连续过渡。我们在图像和视频编辑任务上评估了我们的方法,证明它提供了比当前基于滑块的编辑方法更平滑和更一致的控制。我们的方法将身份指令纳入标准训练框架,使推理时能够实现细粒度控制,而无需针对每个编辑过程或依赖专门的数据集。
Summary / 总结
The paper addresses the lack of smooth control over the intensity of text-guided edits in diffusion-based editing models. It introduces Adaptive-Origin Guidance (AdaOr), which adjusts the standard guidance origin with an identity-conditioned adaptive origin to enable continuous control. The method ensures a smooth transition from the input to the edited result by interpolating the identity prediction with the standard unconditional prediction based on the edit strength. Experiments on image and video editing tasks show that AdaOr provides smoother and more consistent control compared to existing slider-based approaches.
论文解决了在基于扩散的编辑模型中平滑控制文本引导编辑强度的挑战。它引入了自适应原点引导(AdaOr),通过使用与身份操作对应的身份指令调整标准引导原点,以确保从输入到编辑结果的连续过渡。实验表明,AdaOr相比现有的滑块基编辑方法提供了更平滑和更一致的控制。
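The abstract describes replacing the usual guidance origin (the unconditional prediction) with an origin interpolated between an identity-conditioned prediction and the unconditional one, according to edit strength. A rough sketch of that idea follows. The exact schedule is not given in the abstract, so the linear interpolation, the coupling of the guidance scale to the strength, and the name `adaptive_origin_guidance` are all assumptions made for illustration; predictions are plain lists of floats rather than model outputs.

```python
def adaptive_origin_guidance(eps_cond, eps_uncond, eps_identity,
                             strength, max_scale):
    """Guided prediction with a strength-dependent origin (illustrative only).

    strength = 0 returns the identity prediction (input left unchanged);
    strength = 1 returns standard classifier-free guidance at max_scale.
    """
    s = min(max(strength, 0.0), 1.0)
    # Assumed: the origin moves linearly from identity to unconditional.
    origin = [(1.0 - s) * i + s * u for i, u in zip(eps_identity, eps_uncond)]
    # Assumed: the effective guidance scale grows with the edit strength.
    scale = s * max_scale
    return [o + scale * (c - o) for o, c in zip(origin, eps_cond)]

eps_c, eps_u, eps_i = [1.0, 2.0], [0.0, 0.0], [0.5, -0.5]
for strength in (0.0, 0.5, 1.0):
    print(strength, adaptive_origin_guidance(eps_c, eps_u, eps_i, strength, 3.0))
```

With a fixed unconditional origin, a low guidance scale would return something close to the unconditional prediction, which the abstract describes as an arbitrary manipulation of the input; anchoring the low-strength end at the identity prediction is what makes the transition start from the input itself.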
Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion
Authors: Oscar Ovanger, Levi Harris, Timothy H. Keitt
First: 2026-02-03T18:21:13+00:00 · Latest: 2026-02-03T18:21:13+00:00
Abstract
Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce Fusion under INdependent Conditional Hypotheses (FINCH), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family contains the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available at https://anonymous.4open.science/r/birdnoise-85CD/README.md (anonymous repository).
中文标题/摘要
标题:适应性证据加权的音频时空融合
许多机器学习系统可以访问同一预测目标的多种证据来源,但这些来源在不同输入上的可靠性和信息量往往不同。在生物声学分类中,物种身份可以从声学信号和时空上下文(如位置和季节)中推断;虽然贝叶斯推理支持乘法证据组合,但在实践中我们通常只能访问判别性预测器而不是校准的生成模型。我们引入了Fusion under INdependent Conditional Hypotheses (FINCH)——一种适应性对数线性证据融合框架,将预训练的音频分类器与结构化的时空预测器结合起来。FINCH 学习一个样本级门控函数,以不确定性与信息量统计来估计上下文信息的可靠性。由此产生的融合家族包含仅基于音频的分类器作为特殊情况,并明确限制上下文证据的影响,从而形成一个具有可解释音频仅依赖备选方案的风险可控假设类。在基准测试中,FINCH 一致地优于固定权重融合和仅基于音频的基线,即使在上下文信息单独较弱时也能提高稳健性和错误权衡。我们使用一种轻量级、可解释、基于证据的方法在CBI上达到最佳性能,并在BirdSet的几个子集上达到竞争或改进的性能。代码可在:https://anonymous.4open.science/r/birdnoise-85CD/README.md 获取
Summary / 总结
The paper introduces FINCH, an adaptive log-linear evidence fusion framework for integrating audio and spatiotemporal data in bioacoustic classification. It learns a per-sample gating function to weight the reliability of contextual information, providing a flexible fusion method that includes an audio-only classifier as a special case. Across various benchmarks, FINCH outperforms fixed-weight fusion and audio-only baselines, demonstrating improved robustness and error trade-offs, especially when contextual information is weak. The approach is lightweight and interpretable, achieving state-of-the-art performance on CBI and competitive results on BirdSet subsets.
论文提出了一种名为FINCH的自适应对数线性证据融合框架,用于在生物声学分类中融合音频证据与时空上下文证据。它学习一个样本级别的门控函数来估计上下文信息的可靠性,得到一个风险可控的假设类,并以可解释的纯音频分类器作为回退。在基准测试中,FINCH优于固定权重融合和纯音频基线,即使上下文信息单独较弱时也能改善鲁棒性和错误权衡。该方法轻量、可解释且基于证据,在CBI上实现了最先进的性能,并在BirdSet的若干子集上取得了有竞争力或更优的性能。
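The gated log-linear fusion described for FINCH can be sketched as adding a bounded multiple of the context log-probabilities to the audio log-probabilities and renormalising. This is a sketch under assumptions: in the paper the gate is learned per sample from uncertainty and informativeness statistics, whereas here it is simply passed in as a number, and the bound `w_max` and the function name `fuse` are hypothetical.

```python
import math

def fuse(p_audio, p_context, gate, w_max=1.0):
    """Log-linear fusion: log p is proportional to log p_audio + w * log p_context.

    gate is clipped to [0, 1], so the context weight w lies in [0, w_max].
    gate = 0 recovers the audio-only classifier exactly (up to rounding).
    All probabilities are assumed to be strictly positive.
    """
    w = w_max * min(max(gate, 0.0), 1.0)
    scores = [math.log(a) + w * math.log(c) for a, c in zip(p_audio, p_context)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # stable renormalisation
    total = sum(exps)
    return [e / total for e in exps]

audio = [0.7, 0.3]      # classifier is fairly sure about species 0
context = [0.2, 0.8]    # location and season favour species 1
print(fuse(audio, context, gate=0.0))
print(fuse(audio, context, gate=1.0))
```

Bounding the weight is what limits how far unreliable context can move the prediction away from the audio-only output.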
SymPlex: A Structure-Aware Transformer for Symbolic PDE Solving
Authors: Yesom Park, Annie C. Lu, Shao-Ching Huang, Qiyang Hu, Y. Sungtaek Ju, Stanley Osher
First: 2026-02-03T18:18:30+00:00 · Latest: 2026-02-03T18:18:30+00:00
Comments: 27 pages
Abstract
We propose SymPlex, a reinforcement learning framework for discovering analytical symbolic solutions to partial differential equations (PDEs) without access to ground-truth expressions. SymPlex formulates symbolic PDE solving as tree-structured decision-making and optimizes candidate solutions using only the PDE and its boundary conditions. At its core is SymFormer, a structure-aware Transformer that models hierarchical symbolic dependencies via tree-relative self-attention and enforces syntactic validity through grammar-constrained autoregressive decoding, overcoming the limited expressivity of sequence-based generators. Unlike numerical and neural approaches that approximate solutions in discretized or implicit function spaces, SymPlex operates directly in symbolic expression space, enabling interpretable and human-readable solutions that naturally represent non-smooth behavior and explicit parametric dependence. Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods.
中文标题/摘要
标题:SymPlex:一种结构感知的变换器用于符号偏微分方程求解
我们提出SymPlex,一种强化学习框架,用于在不访问真实表达式的情况下发现偏微分方程(PDE)的解析符号解。SymPlex 将符号PDE求解表述为树结构决策,并仅使用PDE及其边界条件优化候选解。其核心是SymFormer,一种结构感知的变换器,通过树相关的自注意力建模分层的符号依赖关系,并通过语法约束的自回归解码确保语法有效性,克服了基于序列生成器的有限表达能力。与在离散化或隐函数空间中近似解的数值和神经方法不同,SymPlex 直接在符号表达空间中操作,能够生成可解释且人类可读的解,自然地表示非光滑行为和显式参数依赖。实验证明,基于深度学习的符号方法可以精确恢复非光滑和参数化的PDE解。
Summary / 总结
SymPlex is a reinforcement learning framework designed to find analytical symbolic solutions to partial differential equations (PDEs) without needing ground-truth expressions. It uses a structure-aware Transformer, SymFormer, which models hierarchical symbolic dependencies and ensures syntactic validity. SymPlex directly operates in symbolic expression space, producing interpretable and human-readable solutions. Experiments show that SymPlex can exactly recover non-smooth and parametric PDE solutions using deep learning-based symbolic methods.
SymPlex 是一个强化学习框架,旨在在无需依赖真实表达式的情况下发现偏微分方程(PDE)的解析符号解。它使用一种结构感知的Transformer,SymFormer,来建模符号依赖关系并确保语法有效性。SymPlex 使用偏微分方程及其边界条件来优化候选解,直接在符号表达空间中操作。关键实验发现是,SymPlex 可以精确恢复非光滑和参数化的PDE解,展示了基于深度学习的符号方法的有效性。
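One common way to enforce syntactic validity during autoregressive generation of expression trees, as mentioned in the SymPlex entry above, is to mask out tokens that would make the tree impossible to complete. The sketch below assumes prefix (Polish) notation and a tiny hypothetical vocabulary `ARITY`; it is an arity-based mask, not SymFormer's actual grammar.

```python
# Hypothetical vocabulary: token -> number of child expressions it requires.
ARITY = {"add": 2, "mul": 2, "sin": 1, "x": 0, "t": 0, "c": 0}

def valid_next_tokens(prefix, max_len):
    """Return tokens that keep a prefix-notation expression completable.

    `open_slots` counts sub-expressions still to be written. Each token fills
    one slot and opens ARITY[token] new ones. A token is allowed only if the
    slots left open after it can each be closed with one leaf token within
    the remaining length budget.
    """
    open_slots = 1
    for tok in prefix:
        open_slots += ARITY[tok] - 1
    if open_slots == 0:
        return []  # the expression is already complete
    remaining = max_len - len(prefix)
    allowed = []
    for tok, arity in ARITY.items():
        slots_after = open_slots - 1 + arity
        if slots_after <= remaining - 1:
            allowed.append(tok)
    return allowed
```

With a budget of three tokens, any token may start the expression, but after `add` only leaves remain valid because two operands must still fit.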
Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
Authors: Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen
First: 2026-02-03T18:18:11+00:00 · Latest: 2026-02-03T18:18:11+00:00
Abstract
Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency, which stems from their massive model sizes and large numbers of visual tokens. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we explore another substantial research direction for efficient training: reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, the slow-mode further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, these modes let DualSpeed achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1x and LLaVA-NeXT by 4.0x, retaining over 99% performance. Code: https://github.com/dingkun-zhang/DualSpeed
中文标题/摘要
标题:通过视觉标记剪枝实现多模态大型语言模型快速高效训练
多模态大型语言模型(MLLMs)在训练过程中面临严重的效率问题,这与它们庞大的模型规模和视觉标记数量有关。现有的高效训练方法主要集中在减少模型规模或可训练参数。受视觉标记剪枝(VTP)在提高推理效率方面取得成功的启发,我们探索了通过减少视觉标记来实现高效训练的另一个重要研究方向。然而,在训练阶段应用VTP会导致训练-推理不匹配:剪枝训练的模型在对完整的视觉标记序列进行推理时表现不佳。为了解决这一问题,我们提出了DualSpeed,这是一种用于MLLMs高效训练的快慢模式框架。快模式是主要模式,它将现有的VTP方法作为插件来减少视觉标记,并包含一个模式隔离器以隔离模型的行为。慢模式是辅助模式,在此模式下,模型在完整的视觉序列上进行训练以保持训练-推理一致性。为了提高其训练效率,它进一步利用自我蒸馏从充分训练的快模式中学习。综上所述,DualSpeed可以同时实现训练效率和非退化性能。实验表明,DualSpeed将LLaVA-1.5的训练加速了2.1倍,将LLaVA-NeXT的训练加速了4.0倍,同时保持了超过99%的性能。代码:https://github.com/dingkun-zhang/DualSpeed
Summary / 总结
The paper addresses the training inefficiency of Multimodal Large Language Models (MLLMs) by proposing DualSpeed, a fast-slow training framework. The primary fast-mode plugs in existing Visual Token Pruning (VTP) methods to reduce visual tokens and uses a mode isolator to isolate the model's behaviors between modes. The auxiliary slow-mode trains on full visual sequences to retain training-inference consistency, and uses self-distillation from the sufficiently trained fast-mode to boost its own training. Experiments show DualSpeed accelerates LLaVA-1.5 and LLaVA-NeXT training by 2.1x and 4.0x, respectively, while retaining over 99% performance.
研究旨在通过提出DualSpeed框架解决Multimodal Large Language Models (MLLMs)的训练低效问题,该框架在快速模式下通过视觉令牌剪枝(VTP)减少视觉令牌数量,并在慢速模式下保持训练推断一致性。实验表明,DualSpeed显著加速了LLaVA-1.5和LLaVA-NeXT的训练,分别加速了2.1倍和4.0倍,同时保持了超过99%的原始性能。
Antidistillation Fingerprinting
Authors: Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter
First: 2026-02-03T18:15:50+00:00 · Latest: 2026-02-03T18:15:50+00:00
Comments: 26 pages, 11 figures
Abstract
Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.
中文标题/摘要
标题:抗蒸馏指纹识别
模型蒸馏能够高效地模拟前沿的大语言模型(LLMs),因此需要稳健的机制来检测第三方学生模型是否在教师模型的输出上进行了训练。然而,现有的用于检测此类蒸馏的指纹识别技术依赖于启发式的扰动,这在生成质量和指纹识别强度之间造成了陡峭的权衡,通常需要显著降低实用性以确保指纹被学生模型有效地吸收。我们提出了抗蒸馏指纹识别(ADFP),这是一种原理性的方法,将指纹识别目标与学生的学习动态相一致。基于抗蒸馏采样的梯度框架,ADFP 使用代理模型来识别并采样那些在微调后能够直接最大化学生模型中指纹可检测性的令牌,而不是依赖于对更简单的水印的非目标偏见的偶然吸收。在GSM8K和OASST1基准上的实验表明,ADFP 在保持最小实用性影响的情况下,显著优于最先进的基线方法,提供了更强的检测置信度,即使学生模型的架构未知。
Summary / 总结
ADFP is a new method for detecting model distillation by aligning the fingerprinting objective with the student model's learning dynamics. It uses a proxy model to sample tokens that maximize the detectability of the fingerprint, avoiding the quality degradation of previous techniques. Experiments show ADFP outperforms existing methods with higher detection confidence and minimal impact on utility, even when the student's architecture is unknown.
研究引入了抗蒸馏指纹技术(ADFP),该方法将指纹识别目标与学生模型的学习动态对齐,以检测蒸馏现象。ADFP 使用代理模型采样能够最大化学生模型在微调后检测指纹能力的标记,避免了之前技术的质量下降。实验表明,ADFP 在 GSM8K 和 OASST1 基准上优于现有方法,提供了更强的检测能力且几乎不影响模型的实用性,即使不知道学生模型的架构。
Enhancing Imbalanced Node Classification via Curriculum-Guided Feature Learning and Three-Stage Attention Network
Authors: Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang
First: 2026-02-03T18:10:40+00:00 · Latest: 2026-02-03T18:10:40+00:00
Abstract
Imbalanced node classification in graph neural networks (GNNs) happens when some labels are much more common than others, which causes the model to learn unfairly and perform badly on the less common classes. To solve this problem, we propose a Curriculum-Guided Feature Learning and Three-Stage Attention Network (CL3AN-GNN), a learning network that uses a three-step attention system (Engage, Enact, Embed) similar to how humans learn. The model begins by engaging with structurally simpler features, defined as (1) local neighbourhood patterns (1-hop), (2) low-degree node attributes, and (3) class-separable node pairs identified via initial graph convolutional networks and graph attention networks (GCN and GAT) embeddings. This foundation enables stable early learning despite label skew. The Enact stage then addresses complicated aspects: (1) connections that require multiple steps, (2) edges that connect different types of nodes, and (3) nodes at the edges of minority classes by using adjustable attention weights. Finally, Embed consolidates these features via iterative message passing and curriculum-aligned loss weighting. We evaluate CL3AN-GNN on eight Open Graph Benchmark datasets spanning social, biological, and citation networks. Experiments show consistent improvements across all datasets in accuracy, F1-score, and AUC over recent state-of-the-art methods. The model's step-by-step method works well with different types of graph datasets, showing quicker results than training everything at once, better performance on new, imbalanced graphs, and clear explanations of each step using gradient stability and attention correlation learning curves. This work provides both a theoretically grounded framework for curriculum learning in GNNs and practical evidence of its effectiveness against imbalances, validated through metrics, convergence speeds, and generalisation tests.
中文标题/摘要
标题:通过课程引导特征学习和三阶段注意力网络增强图神经网络中的不平衡节点分类
图神经网络(GNN)中的不平衡节点分类发生在某些标签远比其他标签常见时,这会导致模型学习不公平且在较少常见的类别上表现不佳。为了解决这一问题,我们提出了一种课程引导特征学习和三阶段注意力网络(CL3AN-GNN),这是一种使用三步注意力系统(参与、执行、嵌入)的学习网络,类似于人类的学习过程。模型首先通过参与结构上更简单的特征开始学习,这些特征定义为(1)局部邻域模式(1-跳),(2)低度节点属性,以及(3)通过初始图卷积网络和图注意力网络(GCN和GAT)嵌入识别的可分节点对。这一基础使模型能够在标签分布不均的情况下稳定地进行早期学习。执行阶段则处理复杂方面:(1)需要多步的连接,(2)连接不同节点类型的边,以及(3)少数类节点,通过可调注意力权重。最后,嵌入阶段通过迭代消息传递和课程对齐的损失加权来整合这些特征。我们在八个跨越社交、生物和引用网络的开放图基准数据集上评估了CL3AN-GNN。实验结果显示,与最近的先进方法相比,该模型在所有数据集上的准确率、F1分数和AUC均有所提高。该模型的逐步方法适用于不同类型的图数据集,显示出比一次性训练所有内容更快的结果,对新、不平衡的图具有更好的性能,并通过梯度稳定性和注意力相关性学习曲线清晰地解释了每一步。这项工作为GNN中的课程学习提供了理论基础框架,并通过指标、收敛速度和泛化测试验证了其有效性。
Summary / 总结
The paper addresses the issue of imbalanced node classification in graph neural networks by proposing CL3AN-GNN, which uses a three-stage attention system (Engage, Enact, Embed) to learn features in a curriculum-like manner. The Engage stage starts with simpler features, the Enact stage handles more complex connections, and the Embed stage consolidates these features. Experiments on eight datasets show consistent improvements in accuracy, F1-score, and AUC over existing methods, indicating the effectiveness of this approach in handling imbalanced node classification problems.
论文提出CL3AN-GNN,使用三阶段注意力网络解决图神经网络中的不平衡节点分类问题。模型从简单特征开始,然后处理复杂连接,最后通过迭代消息传递进行特征合并。实验结果显示,在八个数据集上,该模型在准确率、F1分数和AUC方面均优于最新方法,证明了其在处理不平衡数据方面的有效性。
Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation
Authors: Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun
First: 2026-02-03T18:08:41+00:00 · Latest: 2026-02-03T18:08:41+00:00
Abstract
Recently, there has been significant research interest in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinder wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. We also analyze LLMs' in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.
中文标题/摘要
标题:连接线上与线下RL:多轮代码生成的上下文臂学习
近年来,研究人员对使用强化学习(RL)训练大规模语言模型(LLMs)以完成现实世界任务(如多轮代码生成)产生了浓厚兴趣。尽管线上RL通常优于线下RL,但其较高的训练成本和不稳定性阻碍了其广泛应用。本文基于多轮代码生成可以被表述为一步可恢复的马尔可夫决策过程的观察,提出了一种结合线上和线下RL优点的新方法——上下文臂学习(Cobalt),该方法使用离线轨迹。Cobalt 首先使用参考LLM收集代码生成轨迹,并将其划分为部分轨迹作为上下文提示。然后,在线上臂学习过程中,LLM通过单步代码生成来完成每个部分轨迹提示的训练。Cobalt 在LiveCodeBench上将R1-Distill 8B和Qwen3 8B的绝对Pass@1得分分别提高了9.0和6.2,超过了基于GRPO和VeRPO的两种多轮线上RL基线。此外,我们分析了LLMs的上下文奖励作弊行为,并通过扰动轨迹来增强Cobalt的训练以减轻这一问题。总体而言,我们的结果表明Cobalt 是一种有前景的多轮代码生成等迭代决策任务的解决方案。我们的代码和数据可在https://github.com/OSU-NLP-Group/cobalt/ 获取。
Summary / 总结
This paper addresses the challenge of training large language models (LLMs) for multi-turn code generation using reinforcement learning (RL). It proposes Cobalt, a method that combines the benefits of online and offline RL. Cobalt first collects trajectories using a reference LLM and divides them into partial prompts. During online bandit learning, the LLM is trained to complete these prompts. Cobalt outperforms two online RL baselines and improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench, respectively. The paper also analyzes reward hacking behaviors and mitigates them with perturbed trajectories.
本文旨在解决使用强化学习(RL)训练大型语言模型进行多轮代码生成的挑战。提出了一种名为Cobalt的方法,该方法结合了在线和离线RL的优点。Cobalt首先使用参考LLM收集轨迹,并将它们分成部分轨迹作为提示。在线学习过程中,LLM被训练以完成这些提示。Cobalt在LiveCodeBench上将R1-Distill 8B和Qwen3 8B的Pass@1得分分别提高了最多9.0和6.2的绝对值。作者还通过使用扰动轨迹来缓解在上下文中的奖励作弊问题,从而增强训练。
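The data-preparation step in the Cobalt entry above, dividing reference trajectories into partial trajectories that serve as contextual prompts, might look roughly like the following. The trajectory format (a list of `(code, feedback)` turns) and the function name `partial_trajectory_prompts` are assumptions for illustration; the abstract does not specify either.

```python
def partial_trajectory_prompts(task, turns):
    """Split one offline trajectory into single-step bandit contexts.

    `turns` is a list of (code, feedback) pairs produced by a reference
    model (assumed format). The k-th prompt contains the task plus the
    first k turns; the policy being trained would then generate only the
    next code attempt, which is scored immediately.
    """
    prompts = []
    for k in range(len(turns)):
        prompts.append({"task": task, "history": list(turns[:k])})
    return prompts

example_turns = [
    ("def f(x): return x", "test 2 failed"),
    ("def f(x): return x + 1", "test 3 failed"),
    ("def f(x): return abs(x) + 1", "all tests passed"),
]
contexts = partial_trajectory_prompts("Implement f", example_turns)
```

Each context is completed with one generation step, so training behaves like a contextual bandit while still exposing the model to multi-turn histories.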
Conformal Reachability for Safe Control in Unknown Environments
Authors: Xinhang Ma, Junlin Wu, Yiannis Kantaros, Yevgeniy Vorobeychik
First: 2026-02-03T18:01:38+00:00 · Latest: 2026-02-03T18:01:38+00:00
Abstract
Designing provably safe control is a core problem in trustworthy autonomy. However, most prior work in this regard assumes either that the system dynamics are known or deterministic, or that the state and action space are finite, significantly limiting application scope. We address this limitation by developing a probabilistic verification framework for unknown dynamical systems which combines conformal prediction with reachability analysis. In particular, we use conformal prediction to obtain valid uncertainty intervals for the unknown dynamics at each time step, with reachability then verifying whether safety is maintained within the conformal uncertainty bounds. Next, we develop an algorithmic approach for training control policies that optimize nominal reward while also maximizing the planning horizon with sound probabilistic safety guarantees. We evaluate the proposed approach in seven safe control settings spanning four domains -- cartpole, lane following, drone control, and safe navigation -- for both affine and nonlinear safety specifications. Our experiments show that the policies we learn achieve the strongest provable safety guarantees while still maintaining high average reward.
中文标题/摘要
标题:未知环境中的保证安全控制的符合性可达性
设计可证明安全的控制是可信赖自主的核心问题。然而,大多数相关工作要么假设系统动力学已知或确定性,要么假设状态和动作空间是有限的,这极大地限制了应用范围。我们通过结合符合性预测与可达性分析,开发了一种概率验证框架来解决这一限制。具体而言,我们使用符合性预测在每个时间步获得未知动力学的有效不确定性区间,然后通过可达性验证安全是否在符合性不确定性界限内得到保持。接下来,我们开发了一种算法方法,用于训练优化名义奖励的同时,最大化规划时间窗口,并提供稳健的概率安全保证。我们通过涵盖四个领域——单杆、车道跟随、无人机控制和安全导航——中的七个安全控制场景,对提出的框架进行了评估,包括线性和非线性安全规范。我们的实验表明,我们学习的策略不仅实现了最强的可证明安全保证,同时还能保持较高的平均奖励。
Summary / 总结
This paper addresses the challenge of designing provably safe control strategies for unknown dynamical systems by integrating conformal prediction and reachability analysis. The method uses conformal prediction to obtain valid uncertainty intervals for the unknown dynamics at each time step and then applies reachability analysis to verify that safety is maintained within these intervals. The approach also trains control policies that optimize nominal reward while maximizing the planning horizon with sound probabilistic safety guarantees. Experiments in seven safe control settings spanning four domains (cartpole, lane following, drone control, and safe navigation) show that the learned policies achieve the strongest provable safety guarantees while maintaining high average reward.
研究旨在开发一种方法,在未知环境中设计可证明安全的控制系统,解决先前工作要么假设已知动力学,要么假设状态/动作空间有限的问题。该方法结合了利用收敛预测估计未知动力学的不确定性区间,并通过可达性分析确保在这些区间内的安全性。该方法还优化控制策略以最大化奖励同时保持稳健的概率安全性保证。实验结果显示,所学习的策略提供了最强的安全保证,同时实现了高平均奖励。
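To make the combination of conformal prediction and reachability in the entry above more concrete, here is a minimal sketch of standard split conformal prediction for a scalar one-step dynamics model, followed by an interval safety check. It is a simplification under stated assumptions: the state is one-dimensional, the calibration residuals are exchangeable with test residuals, and the helper names `conformal_radius` and `step_is_safe` are hypothetical. The paper's multi-step verification procedure is not reproduced here.

```python
import math

def conformal_radius(residuals, alpha):
    """Split conformal radius from calibration residuals |x_next - f_hat(x, u)|.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual, which
    covers a new residual with probability at least 1 - alpha under
    exchangeability. If n is too small for the requested level, the
    interval is unbounded.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]

def step_is_safe(predicted_next, radius, safe_low, safe_high):
    """Check that the whole conformal interval lies inside the safe set."""
    return safe_low <= predicted_next - radius and predicted_next + radius <= safe_high

calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
radius = conformal_radius(calibration, alpha=0.2)
```

A verifier in this style would declare a step safe only when every state consistent with the conformal interval stays inside the safe set, which is why a larger calibration set (and hence a tighter radius) allows longer verified horizons.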
FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
Authors: Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
First: 2026-02-03T18:01:34+00:00 · Latest: 2026-02-03T18:01:34+00:00
Abstract
Assisting non-expert users to develop complex interactive websites has become a popular task for LLM-powered code agents. However, existing code agents tend to only generate frontend web pages, masking the lack of real full-stack data processing and storage with fancy visual effects. Notably, constructing production-level full-stack web applications is far more challenging than only generating frontend web pages, demanding careful control of data flow, comprehensive understanding of constantly updating packages and dependencies, and accurate localization of obscure bugs in the codebase. To address these difficulties, we introduce FullStack-Agent, a unified agent system for full-stack agentic coding that consists of three parts: (1) FullStack-Dev, a multi-agent framework with strong planning, code editing, codebase navigation, and bug localization abilities. (2) FullStack-Learn, an innovative data-scaling and self-improving method that back-translates crawled and synthesized website repositories to improve the backbone LLM of FullStack-Dev. (3) FullStack-Bench, a comprehensive benchmark that systematically tests the frontend, backend and database functionalities of the generated website. Our FullStack-Dev outperforms the previous state-of-the-art method by 8.7%, 38.2%, and 15.9% on the frontend, backend, and database test cases respectively. Additionally, FullStack-Learn raises the performance of a 30B model by 9.7%, 9.5%, and 2.8% on the three sets of test cases through self-improvement, demonstrating the effectiveness of our approach. The code is released at https://github.com/mnluzimu/FullStack-Agent.
中文标题/摘要
标题:全栈代理:通过开发导向的测试和仓库回译增强代理全栈网页编码
使用LLM驱动的代码代理辅助非专家用户开发复杂的交互式网站已成为一项流行的任务。然而,现有的代码代理通常仅生成前端网页,通过花哨的视觉效果掩盖了实际全栈数据处理和存储的缺乏。值得注意的是,构建生产级别的全栈网页应用程序远比仅生成前端网页更具挑战性,需要仔细控制数据流、全面理解不断更新的包和依赖关系,并准确定位代码库中的隐秘错误。为了解决这些困难,我们引入了全栈代理,这是一个由三个部分组成的统一代理系统:(1)全栈开发,一个具有强大规划、代码编辑、代码库导航和错误定位能力的多代理框架。(2)全栈学习,一种创新的数据扩展和自我改进方法,通过回译爬取和合成的网站仓库来提高全栈开发的骨干LLM。(3)全栈基准,一个全面的基准测试,系统地测试生成网站的前端、后端和数据库功能。我们的全栈开发在前端、后端和数据库测试案例中分别优于之前最先进的方法8.7%、38.2%和15.9%。此外,全栈学习通过自我改进在三组测试案例中分别提高了30B模型的性能9.7%、9.5%和2.8%,证明了我们方法的有效性。代码发布在https://github.com/mnluzimu/FullStack-Agent。
Summary / 总结
The research aims to enhance the capability of LLM-powered code agents to develop full-stack web applications, addressing the limitations of existing agents that primarily generate frontend pages. The method involves FullStack-Agent, a unified system comprising FullStack-Dev, FullStack-Learn, and FullStack-Bench. FullStack-Dev is a multi-agent framework with advanced planning and bug localization capabilities, while FullStack-Learn improves the backbone LLM through back-translation of website repositories. FullStack-Bench comprehensively evaluates the generated website's functionalities. Experimental results show that FullStack-Dev outperforms previous methods by 8.7%, 38.2%, and 15.9% in frontend, backend, and database test cases, respectively. Additionally, FullStack-Learn enhances a 30B model by 9.7%, 9.5%, and 2.8% in the same test cases, validating the approach's effectiveness.
FullStack-Agent旨在通过解决现有代码代理主要生成前端页面的局限性,帮助非专家用户开发全栈网络应用程序。它包括FullStack-Dev,一个用于规划、代码编辑和错误定位的多代理框架;FullStack-Learn,一种通过网站仓库的反向翻译来改进LLM的方法;以及FullStack-Bench,一个全面的基准测试,用于测试前端、后端和数据库功能。FullStack-Dev在前端、后端和数据库测试案例中的表现分别优于之前的方法8.7%、38.2%和15.9%,而FullStack-Learn通过自我改进将30B模型的性能分别提高了9.7%、9.5%和2.8%。
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Authors: Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai
First: 2026-02-03T17:59:09+00:00 · Latest: 2026-02-03T17:59:09+00:00
Comments: Project Page: https://hjrphoebus.github.io/3DiMo/
Abstract
Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
中文标题/摘要
标题:基于3D感知的视点自适应人体视频生成运动控制
现有的人体运动控制方法在视频生成中通常依赖于2D姿态或显式的3D参数模型(例如SMPL)作为控制信号。然而,2D姿态刚性地将运动绑定到驱动视点,限制了新视点合成。尽管显式的3D模型在结构上具有信息性,但由于深度模糊和不准确的动力学等固有不准确性,在作为强约束使用时,会覆盖大型视频生成器的强大内在3D感知。在本文中,我们从3D感知的角度重新审视运动控制,提倡一种视点无关的隐式运动表示,这种表示自然地与生成器的空间先验对齐,而不是依赖于外部重建的约束。我们引入了3DiMo,它联合训练了一个运动编码器和一个预训练的视频生成器,将驱动帧提炼为紧凑的、视点无关的运动令牌,并通过交叉注意力语义地注入。为了促进3D感知,我们使用视点丰富的监督(即单视点、多视点和移动摄像机视频)进行训练,强制运动在不同视点之间保持一致性。此外,我们使用辅助几何监督,仅在早期初始化时利用SMPL,并逐渐减少到零,使模型能够从外部3D指导过渡到从数据和生成器的先验中学习真正的3D空间运动理解。实验结果证实,3DiMo能够灵活地根据文本驱动的相机控制准确地再现驱动运动,显著优于现有方法在运动保真度和视觉质量方面的表现。
Summary / 总结
This work addresses the limitations of existing human motion control methods for video generation, which rely on 2D poses or explicit 3D parametric models, by proposing 3DiMo, which uses an implicit, view-agnostic motion representation. The method jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact motion tokens that are injected via cross-attention, using view-rich supervision and SMPL-based geometric supervision that is annealed to zero. Experiments show that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control and significantly surpasses existing methods in motion fidelity and visual quality.
该研究针对现有用于视频生成中的人体动作控制方法的局限性,这些方法要么依赖2D姿态,要么使用显式的3D模型。它提出了3DiMo,该方法使用隐式的、视角无关的动作表示,以与生成器的空间先验对齐。该方法通过训练一个动作编码器与预训练的视频生成器来生成紧凑的动作令牌,并使用丰富的视角监督来确保在不同视角下动作的一致性。实验表明,3DiMo在动作保真度和视觉质量方面显著优于现有方法,能够实现灵活的摄像机控制和动作的忠实再现。
BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks
Authors: Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang
First: 2026-02-03T17:56:28+00:00 · Latest: 2026-02-03T17:56:28+00:00
Abstract
Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .
中文标题/摘要
标题:BridgeV2W:通过体感掩码将视频生成模型与体感世界模型对接
体感世界模型已成为机器人领域的一个有前途的范式,大多数模型利用大规模互联网视频或预训练的视频生成模型来丰富视觉和运动先验知识。然而,它们仍然面临关键挑战:坐标空间动作与像素空间视频之间的不匹配、对摄像机视角的敏感性以及体感模型之间的非统一架构。为此,我们提出了BridgeV2W,它将坐标空间动作转换为从URDF和摄像机参数渲染的像素对齐的体感掩码。然后,这些掩码通过一种类似于ControlNet的路径注入到预训练的视频生成模型中,这使得动作控制信号与预测的视频对齐,增加了视角特定的条件以适应摄像机视角,并在体感模型中实现了统一的世界模型架构。为了减轻对静态背景的过度拟合,BridgeV2W 进一步引入了一种基于流动的运动损失,专注于学习动态和任务相关区域。在单臂(DROID)和双臂(AgiBot-G1)数据集上的实验,涵盖了多种多样的具有未见过的视角和场景的挑战性条件,表明BridgeV2W 在视频生成质量上优于先前的最先进方法。我们进一步展示了BridgeV2W 在下游实际任务中的潜力,包括策略评估和目标条件规划。更多结果可以在我们的项目网站 https://BridgeV2W.github.io 上找到。
Summary / 总结
BridgeV2W addresses the challenges in embodied world models by converting coordinate-space actions into pixel-aligned embodiment masks, rendered from the URDF and camera parameters, and injecting them into a pretrained video generation model via a ControlNet-style pathway. This aligns action control signals with predicted videos, accommodates different camera viewpoints, and provides a unified architecture across embodiments; a flow-based motion loss mitigates overfitting to static backgrounds. Experiments on DROID and AgiBot-G1 show that BridgeV2W improves video generation quality over prior methods under unseen viewpoints and scenes, and the authors demonstrate its potential on downstream tasks such as policy evaluation and goal-conditioned planning.
BridgeV2W通过将坐标空间的动作转换为像素对齐的掩码并注入预训练的视频生成模型来解决体态世界模型的关键挑战。该方法将动作控制信号与预测视频对齐,适应不同的摄像机视角,并生成统一的架构。实验表明,BridgeV2W在多样且具有挑战性的条件下提高了视频生成质量,并在下游任务如策略评估和目标条件规划中表现出色。
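The idea of turning coordinate-space quantities into pixel-aligned masks, as described in the BridgeV2W entry above, rests on projecting 3D geometry through the camera model. The sketch below assumes a simple pinhole camera and a set of 3D points already expressed in the camera frame (for instance, points sampled on robot links after forward kinematics); a real pipeline would rasterize URDF meshes instead. The function name and parameters are hypothetical.

```python
def project_points_to_mask(points_cam, fx, fy, cx, cy, width, height):
    """Rasterize camera-frame 3D points into a binary image mask.

    Uses the pinhole model u = fx * x / z + cx, v = fy * y / z + cy.
    Points behind the camera or outside the image are skipped.
    """
    mask = [[0] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0:
            continue  # behind the camera plane
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            mask[v][u] = 1
    return mask

points = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)]
mask = project_points_to_mask(points, fx=1.0, fy=1.0, cx=2.0, cy=1.0, width=4, height=3)
```

Because the mask lives in the same pixel grid as the video frames, it can be fed to an image-conditioned pathway without the generator needing to interpret joint angles or end-effector coordinates directly.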
Admissibility of Stein Shrinkage for Batch Normalization in the Presence of Adversarial Attacks
Authors: Sofia Ivolgina, P. Thomas Fletcher, Baba C. Vemuri
First: 2025-07-11T02:13:42+00:00 · Latest: 2026-02-03T17:54:39+00:00
Abstract
Batch normalization (BN) is a ubiquitous operation in deep neural networks, primarily used to improve stability and regularization during training. BN centers and scales feature maps using sample means and variances, which are naturally suited for Stein's shrinkage estimation. Applying such shrinkage yields more accurate mean and variance estimates of the batch in the mean-squared-error sense. In this paper, we prove that the Stein shrinkage estimator of the mean and variance dominates over the sample mean and variance estimators, respectively, in the presence of adversarial attacks modeled using sub-Gaussian distributions. Furthermore, by construction, the James-Stein (JS) BN yields a smaller local Lipschitz constant compared to the vanilla BN, implying better regularity properties and potentially improved robustness. This facilitates and justifies the application of Stein shrinkage to estimate the mean and variance parameters in BN and the use of it in image classification and segmentation tasks with and without adversarial attacks. We present SOTA performance results using this Stein-corrected BN in a standard ResNet architecture applied to the task of image classification using CIFAR-10 data, 3D CNN on PPMI (neuroimaging) data, and image segmentation using HRNet on Cityscape data with and without adversarial attacks.
中文标题/摘要
标题:Stein收缩在对抗攻击存在下的批量归一化可接受性
批量归一化(BN)是深度神经网络中的一种普遍操作,主要用于提高训练过程中的稳定性和正则化。BN使用样本均值和方差对特征图进行中心化和缩放,这自然适合于Stein收缩估计。应用这种收缩可以更准确地估计批量的均值和方差(均方误差意义上)。在本文中,我们证明,在使用次高斯分布建模的对抗攻击下,Stein收缩估计的均值和方差分别优于样本均值和方差估计。此外,通过构造,James-Stein(JS)BN相比传统的BN具有更小的局部利普希茨常数,这意味着更好的正则化性质和潜在的更好鲁棒性。这促进了并证明了在BN中估计均值和方差参数时应用Stein收缩的应用,并在有无对抗攻击的情况下使用Stein收缩进行图像分类和分割任务。我们使用Stein校正的BN在标准的ResNet架构上对CIFAR-10数据进行图像分类任务,在PPMI(神经影像学)数据上使用3D CNN进行任务,并在Cityscape数据上使用HRNet进行图像分割任务,有无对抗攻击均展示了SOTA性能结果。
Summary / 总结
This paper investigates the admissibility of Stein shrinkage for batch normalization (BN) in the context of adversarial attacks. It proves that the Stein shrinkage estimators of the mean and variance dominate the sample mean and variance estimators, respectively, when adversarial attacks are modeled using sub-Gaussian distributions. By construction, the James-Stein (JS) BN has a smaller local Lipschitz constant than vanilla BN, implying better regularity and potentially improved robustness. The authors report state-of-the-art results with Stein-corrected BN for image classification (ResNet on CIFAR-10), a 3D CNN on PPMI neuroimaging data, and image segmentation (HRNet on Cityscape), with and without adversarial attacks.
本文研究了在对抗攻击背景下,Stein收缩是否适用于批量归一化(BN)。研究证明,在亚高斯攻击模型下,Stein收缩估计器在准确性上优于样本均值和方差估计器。此外,James-Stein(JS)BN具有较小的局部Lipschitz常数,表明更好的正则性和鲁棒性。实验结果显示,Stein校正的BN在图像分类、3D CNN任务和图像分割中均实现了最先进的性能,无论是有还是没有对抗攻击。
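For readers unfamiliar with Stein shrinkage, the entry above can be illustrated with the classical positive-part James-Stein estimator that shrinks a vector of per-channel sample means toward their grand mean. This is a textbook form under the assumption of a known, common noise variance for each sample mean; it is not necessarily the exact estimator used in the paper, which also treats the variance. The function name `james_stein_means` is hypothetical.

```python
def james_stein_means(sample_means, noise_var):
    """Positive-part James-Stein shrinkage toward the grand mean.

    `sample_means` holds p per-channel batch means and `noise_var` is the
    (assumed known) variance of each sample mean. Shrinking toward the
    grand mean requires p >= 4; otherwise the input is returned unchanged.
    """
    p = len(sample_means)
    if p < 4:
        return list(sample_means)
    grand = sum(sample_means) / p
    spread = sum((m - grand) ** 2 for m in sample_means)
    if spread == 0.0:
        return list(sample_means)
    shrink = max(0.0, 1.0 - (p - 3) * noise_var / spread)
    return [grand + shrink * (m - grand) for m in sample_means]

shrunk = james_stein_means([0.0, 1.0, 2.0, 3.0, 4.0], noise_var=1.0)
```

When the noise variance is large relative to the spread of the means, the shrinkage factor drops to zero and all channels collapse to the grand mean; when the noise is small, the estimates stay close to the raw sample means.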
Model Optimization for Multi-Camera 3D Detection and Tracking
Authors: Ethan Anderson, Justin Silva, Kyle Zheng, Sameer Pusegaonkar, Yizhou Wang, Zheng Tang, Sujit Biswas
First: 2026-01-31T01:51:30+00:00 · Latest: 2026-02-03T17:47:07+00:00
Abstract
Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.
中文标题/摘要
标题:多相机3D检测与跟踪模型优化
面向室内的多相机感知越来越重要,其中一组静态相机必须在遮挡和异构视角下支持多目标跟踪。我们评估了Sparse4D,这是一种基于查询的空间-时间3D检测与跟踪框架,它在共享世界坐标系中融合多视角特征,并通过实例记忆传播稀疏对象查询。我们研究了降低输入帧率、后训练量化(INT8和FP8)、向WILDTRACK基准转移以及Transformer Engine混合精度微调。为了更好地捕捉身份稳定性,我们报告了平均跟踪持续时间(AvgTrackDur),它以秒为单位衡量身份持续时间。Sparse4D在适度降低FPS时保持稳定,但低于2 FPS时,即使检测稳定,身份关联也会崩溃。主干和颈部的选择性量化提供了最佳的速度-准确度权衡,而与注意力相关的模块始终对低精度敏感。在WILDTRACK上,低FPS预训练在基点检中提供了显著的零样本增益,而小规模微调提供的额外益处有限。Transformer Engine混合精度降低了延迟并提高了相机的可扩展性,但可能会导致身份传播不稳定,从而促使进行稳定性意识验证。
Summary / 总结
The research aims to optimize multi-camera 3D detection and tracking in indoor environments, focusing on Sparse4D, a query-based spatiotemporal framework. The study evaluates the framework's performance under reduced input frame rates, post-training quantization, and fine-tuning with Transformer Engine mixed precision. Key findings include Sparse4D's stability under moderate FPS reductions but identity association collapse below 2 FPS. Selective quantization of the backbone and neck modules provides the best speed-accuracy trade-off, while attention-related modules are sensitive to low precision. On the WILDTRACK benchmark, low-FPS pretraining offers significant zero-shot gains, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency but can destabilize identity propagation, emphasizing the need for stability-aware validation.
研究旨在优化室内环境下的多摄像头3D检测与跟踪,重点是基于查询的时空框架Sparse4D。研究评估了该框架在降低输入帧率、后训练量化和使用Transformer Engine混合精度微调下的表现。关键发现包括Sparse4D在中等帧率降低下保持稳定,但在低于2 FPS时身份关联会崩溃,最佳速度与准确率权衡是选择性量化主干和颈部。在WILDTRACK基准上,低帧率预训练提供了显著的零样本增益,而微调提供的额外益处有限。Transformer Engine混合精度可以减少延迟但可能破坏身份传播,强调需要稳定性意识验证。
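A minimal sketch of how a metric like AvgTrackDur might be computed. The abstract above only says that it measures identity persistence in seconds, so the definition used here (mean time between the first and last observation of each predicted track ID), the function name `avg_track_duration`, and the input layout are all assumptions for illustration.

```python
def avg_track_duration(observations):
    """Mean lifetime in seconds of each predicted track ID.

    observations: iterable of (timestamp_seconds, track_id) pairs.
    Assumed definition: a track's duration is last_seen - first_seen.
    """
    first_seen = {}
    last_seen = {}
    for timestamp, track_id in observations:
        if track_id not in first_seen or timestamp < first_seen[track_id]:
            first_seen[track_id] = timestamp
        if track_id not in last_seen or timestamp > last_seen[track_id]:
            last_seen[track_id] = timestamp
    if not first_seen:
        return 0.0
    durations = [last_seen[t] - first_seen[t] for t in first_seen]
    return sum(durations) / len(durations)


# One person followed for 4 s under a single ID, versus the same 4 s
# split across two IDs after an identity switch.
stable = [(0.0, 1), (2.0, 1), (4.0, 1)]
switched = [(0.0, 1), (2.0, 1), (2.5, 2), (4.0, 2)]
```

Under this assumed definition, the identity switch lowers the score even though the person is detected throughout, which is the kind of failure the abstract describes at very low frame rates.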
Inference-time Unlearning Using Conformal Prediction
Authors: Somnath Basu Roy Chowdhury, Rahul Kidambi, Avinava Dubey, David Wang, Gokhan Mergen, Amr Ahmed, Aranyak Mehta
First: 2026-02-03T17:46:50+00:00 · Latest: 2026-02-03T17:46:50+00:00
Abstract
Machine unlearning is the process of efficiently removing specific information from a trained machine learning model without retraining from scratch. Existing unlearning methods, which often provide provable guarantees, typically involve retraining a subset of model parameters based on a forget set. While these approaches show promise in certain scenarios, their underlying assumptions are often challenged in real-world applications -- particularly when applied to generative models. Furthermore, updating parameters using these unlearning procedures often degrades the general-purpose capabilities the model acquired during pre-training. Motivated by these shortcomings, this paper considers the paradigm of inference time unlearning -- wherein, the generative model is equipped with an (approximately correct) verifier that judges whether the model's response satisfies appropriate unlearning guarantees. This paper introduces a framework that iteratively refines the quality of the generated responses using feedback from the verifier without updating the model parameters. The proposed framework leverages conformal prediction to reduce computational overhead and provide distribution-free unlearning guarantees. This paper's approach significantly outperforms existing state-of-the-art methods, reducing unlearning error by up to 93% across challenging unlearning benchmarks.
中文标题/摘要
标题:基于一致预测的推理时遗忘
机器遗忘是高效地从训练好的机器学习模型中移除特定信息的过程,而无需从头开始重新训练。现有的遗忘方法通常提供可证明的保证,这些方法通常涉及基于遗忘集重新训练模型的一部分参数。虽然这些方法在某些场景中显示出前景,但在实际应用中,它们的基本假设往往受到挑战——尤其是在应用于生成模型时。此外,使用这些遗忘程序更新参数通常会降低模型在预训练期间获得的通用能力。受这些不足的启发,本文考虑了推理时遗忘的范式——其中,生成模型配备了(近似正确的)验证器,用于判断模型的响应是否满足适当的遗忘保证。本文引入了一个框架,该框架通过验证器的反馈逐步提高生成响应的质量,而不更新模型参数。所提出的框架利用一致预测来减少计算开销并提供无分布的遗忘保证。本文的方法在具有挑战性的遗忘基准测试中显著优于现有最先进的方法,将遗忘误差降低了高达93%。
Summary / 总结
This paper addresses the challenge of machine unlearning by proposing a new framework for inference-time unlearning. The method uses a verifier to judge the model's responses and iteratively refine them without updating model parameters. It employs conformal prediction to minimize computational overhead and provide unlearning guarantees. The approach significantly improves unlearning performance, reducing errors by up to 93% on various benchmarks.
该论文提出了一种基于一致预测的推理时遗忘框架,以应对机器遗忘的挑战。现有方法通常需要重新训练部分参数并会削弱模型的通用能力;受此启发,作者引入验证器来判断模型的响应是否满足遗忘保证。该框架无需更新模型参数即可迭代优化响应,在多个具有挑战性的基准测试中将遗忘误差最多降低93%。
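For readers unfamiliar with conformal prediction, the following is a minimal sketch of the standard split-conformal calibration step. The abstract does not say how the paper scores responses or applies the threshold, so the verifier scores, the function name `conformal_threshold`, and the example values are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, so that a
    fresh exchangeable score falls at or below it with probability at
    least 1 - alpha, without assumptions on the score distribution.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


# Hypothetical verifier scores on held-out calibration responses.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
threshold = conformal_threshold(scores, alpha=0.25)
```

The guarantee relies only on exchangeability between calibration and test scores, which is what "distribution-free" refers to.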
AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration
Authors: Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang
First: 2026-02-03T17:46:16+00:00 · Latest: 2026-02-03T17:46:16+00:00
Abstract
Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple (Instruction, Context, Tools, Model). This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system, AOrchestra, in which the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. This design reduces human engineering effort and remains framework-agnostic, with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto efficiency. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra
中文标题/摘要
标题:AOrchestra: 自动化子代理创建以实现自主编排
语言代理在任务自动化方面展现了强大的潜力。为了实现这一潜力,特别是在越来越复杂、长期的任务中,已经推动了子代理作为工具的次级代理范式的兴起,用于多轮次任务解决。然而,现有的设计仍然缺乏对子代理的动态抽象视图,从而影响了适应性。我们通过一个统一的、框架无关的代理抽象来应对这一挑战,将任何代理建模为指令、上下文、工具、模型的元组。这个元组充当了能力组合的食谱,使系统能够根据需要生成专门的执行器。基于这一抽象,我们引入了一个自主系统AOrchestra,其中中央协调器在每一步具体化该元组:它精选任务相关的上下文,选择工具和模型,并通过自动代理创建进行即时执行委派。这样的设计能够减少人力工程努力,并且保持框架无关性,支持插拔式多种代理作为任务执行器。它还能够实现可控的性能-成本权衡,使系统能够接近帕累托有效。在三个具有挑战性的基准测试(GAIA、SWE-Bench、Terminal-Bench)中,AOrchestra在与Gemini-3-Flash配对时,相对最强基线实现了16.28%的改进。代码可在:https://github.com/FoundationAgents/AOrchestra 获取
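The four-part agent abstraction in the abstract above (Instruction, Context, Tools, Model) can be pictured as a small record that an orchestrator fills in per subtask. This is only an illustrative sketch: `AgentSpec`, `make_subagent`, the keyword-overlap context filter, and the model name are hypothetical and are not taken from the AOrchestra code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AgentSpec:
    """One sub-agent described as (instruction, context, tools, model)."""
    instruction: str
    context: List[str]
    tools: Dict[str, Callable[[str], str]]
    model: str


def make_subagent(task, notes, toolbox, tool_names, model):
    """Hypothetical orchestrator step: concretize the tuple for one task."""
    keywords = set(task.lower().split())
    # Crude relevance filter: keep notes sharing at least one word with the task.
    context = [n for n in notes if keywords & set(n.lower().split())]
    tools = {name: toolbox[name] for name in tool_names}
    return AgentSpec(instruction=task, context=context, tools=tools, model=model)


toolbox = {"echo": lambda s: s, "upper": lambda s: s.upper()}
notes = ["tests live in the tests/ folder", "deploy runs on Fridays"]
spec = make_subagent("run the tests", notes, toolbox, ["upper"], "small-model")
```

Because the record is plain data, a different executor or a cheaper model can be swapped in per task without changing the orchestrator, which is one way to read the performance-cost trade-off mentioned in the abstract.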
Moonworks Lunara Aesthetic II: An Image Variation Dataset
Authors: Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan
First: 2026-02-02T05:37:28+00:00 · Latest: 2026-02-03T17:45:15+00:00
Abstract
We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood, while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara's signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.
中文标题/摘要
标题:Moonworks Lunara美学II:图像变体数据集
我们介绍了Lunara Aesthetic II,这是一个公开发布、伦理来源的图像数据集,旨在支持现代图像生成和编辑系统中上下文一致性控制评估和学习。该数据集包含2,854个锚链接变体对,源自Moonworks创作的原始艺术和摄影作品。每个变体对应用了如光照、天气、视角、场景构图、色彩基调或情绪等上下文变换,同时保持稳定的底层身份。Lunara Aesthetic II将身份保留的上下文变体作为监督信号进行操作,同时保留Lunara的标志性高美学评分。结果显示,身份稳定性高,目标属性实现能力强,美学特征稳健,超过大规模网络数据集。Lunara Aesthetic II在Apache 2.0许可证下发布,旨在用于图像生成和图像到图像系统的基准测试、微调和分析,这些系统具有可解释的、关系型的监督信号,以评估上下文泛化、身份保留和编辑稳健性。数据集可在以下网址获取:https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations。
Summary / 总结
Lunara Aesthetic II is a publicly available image dataset designed to evaluate and learn contextual consistency in image generation and editing systems. It consists of 2,854 pairs of images derived from original art and photographs, each pair applying contextual transformations while preserving the underlying identity. The dataset demonstrates high identity stability, strong realization of target attributes, and a robust aesthetic profile, surpassing large-scale web datasets. It is intended for benchmarking and fine-tuning image generation systems with interpretable supervision.
Lunara Aesthetic II 是一个公开的图像数据集,旨在评估和学习图像生成和编辑系统中的上下文一致性。它包含 2,854 个锚链接的变体对,这些变体对保持了稳定的身份,同时应用了上下文变换。该数据集展示了高度的身份稳定性、强烈的属性实现以及稳健的美学特征,超越了大规模的网络数据集。它旨在用于图像生成和图像到图像系统的基准测试和分析,具有可解释的监督信号。
Efficient Estimation of Kernel Surrogate Models for Task Attribution
Authors: Zhenshuo Zhang, Minxuan Duan, Hongyang R. Zhang
Venue: ICLR 2026
First: 2026-02-03T17:43:48+00:00 · Latest: 2026-02-03T17:43:48+00:00
Comments: 27 pages. To appear in ICLR 2026
Abstract
Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task's performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than 2% relative error without repeated retraining. Experiments across multiple domains -- including math reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a 40% improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.
中文标题/摘要
标题:核代理模型的高效估计及其在任务归因中的应用
现代AI代理,如大型语言模型,同时在翻译、代码生成、数学推理和文本预测等多种任务上进行训练。一个关键问题是量化每个单独的训练任务如何影响目标任务的性能,我们称之为任务归因。直接方法,即逐一重新训练,测量移除每个任务的影响,但在大规模下计算上不可行。一种替代方法是构建代理模型来预测任何子集训练任务的目标任务性能,最近文献中出现了这种方法。先前的工作集中在线性代理模型上,这些模型捕捉一阶关系,但忽略了非线性交互,如协同作用、对抗作用或XOR型效应。在本文中,我们首先考虑了一个统一的任务加权框架来分析任务归因方法,并通过二阶分析展示了线性代理模型与影响函数之间的新联系。然后,我们引入了核代理模型,更有效地表示二阶任务交互。为了高效学习核代理模型,我们开发了一种基于梯度的估计程序,利用预训练模型的一阶近似;实验证明,这种方法在不到2%的相对误差下无需重复重新训练即可获得准确估计。在包括变压器中的数学推理、上下文学习和多目标强化学习等多个领域中,核代理模型的实验表明其有效性。它们与线性代理模型和影响函数基线相比,与逐一重新训练的地面真相的相关性高出25%。当用于下游任务选择时,核代理模型在上下文学习和多目标强化学习基准测试中将演示选择的性能提高了40%。
Summary / 总结
This paper addresses the challenge of quantifying how individual training tasks influence performance on a target task in large language models. It proposes kernel surrogate models to better capture nonlinear task interactions compared to linear models. The authors develop an efficient gradient-based estimation method that avoids repeated retraining, achieving less than 2% relative error. Experiments show that kernel surrogate models outperform linear surrogates and influence-function baselines, with a 25% higher correlation to leave-one-out ground truth and a 40% improvement in downstream task selection for in-context learning and multi-objective reinforcement learning benchmarks.
本文解决了在大型语言模型中量化单个训练任务对目标任务影响的问题。它引入了核代理模型,比线性模型更有效地捕捉非线性任务交互。作者提出了一种基于梯度的估计方法,避免了重复训练,相对误差低于2%。实验结果显示,核代理模型在多个领域中优于线性代理模型和影响函数基线,具有更高的相关性和更好的下游任务选择性能。
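To illustrate why a linear surrogate can miss the XOR-type interactions mentioned in the abstract above, here is a small self-contained sketch. It fits a linear model and a kernel ridge model with a degree-2 polynomial kernel to made-up scores over subsets of two training tasks. The scores, the kernel choice, and the tiny ridge term are assumptions for illustration; the paper's kernel and its gradient-based estimator are not reproduced here.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def poly_kernel(u, v):
    """Degree-2 polynomial kernel; its feature map contains pairwise products."""
    return (1.0 + sum(x * y for x, y in zip(u, v))) ** 2


# Each tuple marks which of two training tasks is included (1) or left out (0).
subsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Hypothetical target-task scores with an XOR-type interaction.
scores = [0.0, 1.0, 1.0, 0.0]

# Linear surrogate: least squares on features [1, s1, s2] via normal equations.
feats = [(1.0, float(s[0]), float(s[1])) for s in subsets]
gram = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
rhs = [sum(f[i] * y for f, y in zip(feats, scores)) for i in range(3)]
weights = solve(gram, rhs)
linear_pred = [sum(w * x for w, x in zip(weights, f)) for f in feats]

# Kernel surrogate: kernel ridge regression with a tiny ridge term.
ridge = 1e-9
K = [[poly_kernel(u, v) + (ridge if i == j else 0.0)
      for j, v in enumerate(subsets)] for i, u in enumerate(subsets)]
alpha = solve(K, scores)
kernel_pred = [sum(a * poly_kernel(u, v) for a, v in zip(alpha, subsets))
               for u in subsets]
```

In this toy setting the linear fit collapses to a constant prediction, while the kernel fit recovers all four scores because the kernel's feature space contains the product of the two task indicators.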
QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization
Authors: Yuhao Xu, Yantai Yang, Zhenyang Fan, Yufan Liu, Yuming Li, Bing Li, Zhipeng Zhang
First: 2026-02-03T17:43:45+00:00 · Latest: 2026-02-03T17:43:45+00:00
Comments: ICLR2026
Abstract
The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On LIBERO, the quantized version of OpenVLA-OFT produced by our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
中文标题/摘要
标题:QVLA:视觉-语言-行动模型量化中并非所有通道都平等
视觉-语言-行动(VLA)模型的出现标志着嵌入式智能的重大飞跃,但其巨大的计算需求严重阻碍了在资源受限的机器人平台上的部署。直觉上,低比特量化是大规模模型压缩中普遍且优选的技术。然而,我们发现对VLA模型的量化系统性分析严重缺乏。我们认为,将大型语言模型(LLM)中的均匀比特量化直接应用于机器人领域是不合理的,因为这些方法侧重于被动的数据保真度,而忽视了小的动作偏差如何累积成灾难性的任务失败。为解决这一问题,我们提出了QVLA,这是第一个专为嵌入式控制设计的动作导向量化框架。与基于LLM的方法的僵硬、均匀比特量化形成鲜明对比,QVLA引入了一种高度细化的、按通道分配比特的策略。其核心机制是在量化每个通道到不同比特宽度时直接测量最终动作空间的敏感度。这一过程产生了一个精确的、按通道的重要性度量,指导全局优化,将量化和剪枝(0比特)优雅地统一到一个单一、连贯的框架中。在不同基线上的广泛评估表明了我们方法的优越性。在LIBERO中,使用我们方法的量化版本OpenVLA-OFT仅需原始模型VRAM的29.2%,同时保持98.9%的原始性能,并实现1.49倍的加速,这比LLM衍生的SmoothQuant方法提高了22.6%的性能。我们的工作为在机器人领域压缩VLA模型奠定了新的、原则性的基础,为在实际硬件上部署强大的大规模模型铺平了道路。代码将被发布。
Summary / 总结
The paper addresses the challenge of deploying Vision-Language-Action (VLA) models on resource-constrained robotic platforms by introducing QVLA, an action-centric quantization framework. Unlike uniform-bit quantization methods, QVLA allocates bits channel-wise based on the sensitivity of the final action space, treating pruning as 0-bit quantization. On LIBERO, the quantized OpenVLA-OFT requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup, a 22.6% performance improvement over SmoothQuant, a method derived from Large Language Models. This work provides a new foundation for compressing VLA models in robotics, enabling more efficient deployment on real-world hardware.
论文提出了一种新的以动作为中心的量化框架QVLA,以解决在资源受限的机器人平台上部署Vision-Language-Action (VLA)模型的挑战。QVLA根据最终动作空间的敏感性按通道分配比特数,而不是采用统一比特量化。在LIBERO上,量化后的OpenVLA-OFT仅需原始模型29.2%的VRAM,同时保持98.9%的原始性能并实现1.49倍的加速,性能比源自大型语言模型的SmoothQuant方法高22.6%。这项工作为在实际硬件上高效部署VLA模型奠定了新的基础。
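The idea of turning per-channel action sensitivity into a bit allocation can be sketched with a simple greedy heuristic. This is not the paper's global optimization, whose details the abstract does not give; the function `allocate_bits`, the sensitivity table, and the budget are hypothetical values chosen for illustration.

```python
def allocate_bits(sensitivity, bit_options, budget):
    """Greedy per-channel bit allocation under a total bit budget.

    sensitivity[c][b]: hypothetical action-space error when channel c is
    stored with b bits (b = 0 means the channel is pruned).
    """
    levels = sorted(bit_options)
    choice = {c: 0 for c in sensitivity}  # index into levels, start at lowest
    used = levels[0] * len(sensitivity)
    while True:
        best = None
        for c, idx in choice.items():
            if idx + 1 >= len(levels):
                continue  # channel already at the highest bit-width
            extra = levels[idx + 1] - levels[idx]
            if used + extra > budget:
                continue  # upgrade would exceed the budget
            drop = sensitivity[c][levels[idx]] - sensitivity[c][levels[idx + 1]]
            gain = drop / extra  # error reduction per added bit
            if best is None or gain > best[0]:
                best = (gain, c, extra)
        if best is None or best[0] <= 0:
            break
        _, c, extra = best
        choice[c] += 1
        used += extra
    return {c: levels[i] for c, i in choice.items()}


sensitivity = {
    "ch0": {0: 0.90, 4: 0.10, 8: 0.01},  # pruning this channel hurts actions
    "ch1": {0: 0.02, 4: 0.01, 8: 0.00},  # nearly irrelevant to actions
    "ch2": {0: 0.50, 4: 0.30, 8: 0.02},  # needs high precision
}
bits = allocate_bits(sensitivity, bit_options=[0, 4, 8], budget=12)
```

With these made-up numbers the low-sensitivity channel ends up at 0 bits, which shows how pruning can fall out of the same allocation procedure as quantization.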
PAINT: Parallel-in-time Neural Twins for Dynamical System Reconstruction
Authors: Andreas Radler, Vincent Seyfried, Johannes Brandstetter, Thomas Lichtenegger
First: 2025-10-14T14:22:45+00:00 · Latest: 2026-02-03T17:39:05+00:00
Comments: 28 pages, 23 figures
Abstract
Neural surrogates have shown great potential in simulating dynamical systems, while offering real-time capabilities. We envision Neural Twins as a progression of neural surrogates, aiming to create digital replicas of real systems. A neural twin consumes measurements at test time to update its state, thereby enabling context-specific decision-making. We argue that a critical property of neural twins is their ability to remain on-trajectory, i.e., to stay close to the true system state over time. We introduce Parallel-in-time Neural Twins (PAINT), an architecture-agnostic family of methods for modeling dynamical systems from measurements. PAINT trains a generative neural network to model the distribution of states in parallel over time. At test time, states are predicted from measurements in a sliding window fashion. Our theoretical analysis shows that PAINT is on-trajectory, whereas autoregressive models generally are not. Empirically, we evaluate our method on a challenging two-dimensional turbulent fluid dynamics problem. The results demonstrate that PAINT stays on-trajectory and predicts system states from sparse measurements with high fidelity. These findings underscore PAINT's potential for developing neural twins that stay on-trajectory, enabling more accurate state estimation and decision-making.
中文标题/摘要
标题:PAINT:并行时间神经孪生模型在动力系统重构中的应用
神经代理在模拟动力系统方面显示出巨大的潜力,同时提供实时能力。我们设想神经孪生作为神经代理的进一步发展,旨在创建真实系统的数字复制品。神经孪生在测试时消耗测量值以更新其状态,从而实现情境特定的决策。我们认为,神经孪生的一个关键属性是其保持轨迹的能力,即在长时间内保持接近真实系统状态。我们引入了并行时间神经孪生(PAINT),这是一种用于从测量值建模动力系统的架构无关方法族。PAINT通过并行训练生成神经网络来建模状态分布。在测试时,状态以滑动窗口的方式从测量值中预测。我们的理论分析表明,PAINT保持轨迹,而自回归模型通常不保持轨迹。通过在具有挑战性的二维湍流流体动力学问题上评估我们的方法,实验证明PAINT保持轨迹,并且能够从稀疏测量中高保真地预测系统状态。这些发现强调了PAINT在开发保持轨迹的神经孪生方面的潜力,从而实现更准确的状态估计和决策。
Summary / 总结
PAINT is a method for creating neural twins of dynamical systems that remain on-trajectory over time. It trains a generative neural network to model state distributions in parallel, and predicts states from sparse measurements at test time using a sliding window approach. Empirical results show that PAINT stays on-trajectory and accurately predicts system states in a challenging fluid dynamics problem.
PAINT 是一种方法,用于创建保持在真实轨迹上的动态系统的神经孪生模型。它通过并行训练生成神经网络来建模状态分布,并使用滑动窗口方法从测量值预测状态。实验证明,PAINT 能够从稀疏测量中高保真地预测系统状态,展示了其在实时、准确的状态估计和决策中的潜力。
DiffLOB: Diffusion Models for Counterfactual Generation in Limit Order Books
Authors: Zhuohan Wang, Carmine Ventre
First: 2026-02-03T17:34:56+00:00 · Latest: 2026-02-03T17:34:56+00:00
Comments: 12 pages, 8 figures
Abstract
Modern generative models for limit order books (LOBs) can reproduce realistic market dynamics, but remain fundamentally passive: they either model what typically happens without accounting for hypothetical future market conditions, or they require interaction with another agent to explore alternative outcomes. This limits their usefulness for stress testing, scenario analysis, and decision-making. We propose DiffLOB, a regime-conditioned diffusion model for controllable and counterfactual generation of LOB trajectories. DiffLOB explicitly conditions the generative process on future market regimes (trend, volatility, liquidity, and order-flow imbalance), which enables the model to answer counterfactual queries of the form: "If the future market regime were X instead of Y, how would the limit order book evolve?" Our systematic evaluation framework for counterfactual LOB generation consists of three criteria: (1) controllable realism, measuring how well generated trajectories can reproduce marginal distributions, temporal dependence structure and regime variables; (2) counterfactual validity, testing whether interventions on future regimes induce consistent changes in the generated LOB dynamics; (3) counterfactual usefulness, assessing whether synthetic counterfactual trajectories improve downstream prediction of future market regimes.
中文标题/摘要
标题:DiffLOB:限价订单簿中的扩散模型用于反事实生成
现代限价订单簿(LOB)的生成模型可以再现现实的市场动态,但仍然本质上是被动的:它们要么模拟通常发生的情况而不考虑假设的未来市场条件,要么需要与另一个代理进行交互以探索替代结果。这限制了它们在压力测试、情景分析和决策中的应用。我们提出了一种名为DiffLOB的基于状态条件的Diffusion模型,用于可控和反事实生成LOB轨迹。DiffLOB明确地将生成过程条件化于未来的市场状态——包括趋势、波动性、流动性以及订单流量失衡,这使模型能够回答形式为“如果未来市场状态是X而不是Y,限价订单簿将如何演变?”的反事实查询。我们系统性的反事实LOB生成评估框架包括三个标准:(1)可控现实性,衡量生成轨迹再现边缘分布、时间依赖结构和状态变量的能力;(2)反事实有效性,测试对未来状态干预是否导致生成的LOB动态的一致变化;(3)反事实有用性,评估合成的反事实轨迹是否改善对未来市场状态的下游预测。
Summary / 总结
The research aims to address the limitations of existing generative models for limit order books (LOBs) by proposing DiffLOB, a regime-conditioned diffusion model for generating counterfactual LOB trajectories. The model explicitly conditions on future market regimes, enabling it to answer counterfactual queries. The accompanying evaluation framework assesses generated trajectories on three criteria: controllable realism, counterfactual validity, and counterfactual usefulness. The model is aimed at stress testing, scenario analysis, and decision-making in financial markets.
研究旨在通过实现反事实生成来增强限价订单簿(LOB)生成模型的实用性。提出了一个基于未来市场状态条件的扩散模型DiffLOB,以生成可控制的LOB轨迹。该模型能够回答在不同市场条件下LOB将如何演变的反事实问题。评估框架包括三个标准:可控现实性、反事实有效性以及反事实有用性,表明DiffLOB能够有效再现市场动态并提高对未来市场状态的预测能力。
An Empirical Study of Collective Behaviors and Social Dynamics in Large Language Model Agents
Authors: Farnoosh Hashemi, Michael W. Macy
First: 2026-02-03T17:34:32+00:00 · Latest: 2026-02-03T17:34:32+00:00
Abstract
Large Language Models (LLMs) increasingly mediate our social, cultural, and political interactions. While they can simulate some aspects of human behavior and decision-making, it is still underexplored whether repeated interactions with other agents amplify their biases or lead to exclusionary behaviors. To this end, we study Chirper.ai, an LLM-driven social media platform, analyzing 7M posts and interactions among 32K LLM agents over a year. We start with homophily and social influence among LLMs, finding that, like human social networks, their networks exhibit these fundamental phenomena. Next, we study the toxic language of LLMs, its linguistic features, and their interaction patterns, finding that LLMs show different structural patterns in toxic posting than humans. After studying the ideological leaning of LLMs' posts and the polarization in their community, we focus on how to prevent potentially harmful activities. We present a simple yet effective method, called Chain of Social Thought (CoST), that reminds LLM agents to avoid harmful posting.
中文标题/摘要
标题:大型语言模型代理中的集体行为和社会动态实证研究
大型语言模型(LLMs)越来越多地调节我们的社会、文化和政治互动。虽然它们可以模拟某些人类行为和决策方面,但仍然未被充分探索的是与其他代理反复互动是否会放大它们的偏见或导致排他性行为。为此,我们研究了Chirper.ai——一个由LLM驱动的社交媒体平台,分析了700万条帖子和32000个LLM代理一年内的互动。我们从LLM之间的同质性和社会影响开始,发现它们的社会网络也表现出这些基本现象,类似于人类的。接着,我们研究了LLM的有毒语言、其语言特征及其互动模式,发现LLM在有毒发帖方面的结构模式与人类不同。在研究了LLM帖子中的意识形态倾向及其社区中的极化后,我们关注如何防止它们潜在的有害活动。我们提出了一种简单而有效的方法,称为社会思维链(CoST),提醒LLM代理避免有害发帖。
Summary / 总结
This study investigates collective behaviors and social dynamics in large language model (LLM) agents through an analysis of 7 million posts and interactions among 32,000 LLM agents on the Chirper.ai platform over a year. The research finds that LLMs exhibit homophily and social influence similar to humans. It also identifies different structural patterns in toxic language posted by LLMs compared to humans and observes ideological polarization in their community. The study proposes a method called Chain of Social Thought (CoST) to prevent harmful activities by reminding LLM agents to avoid such postings.
研究通过分析Chirper.ai平台32,000个LLM代理在一年内产生的700万条帖子和互动,探讨了大型语言模型代理的集体行为和社会动态。研究发现,LLM在社交网络中表现出与人类相似的同质性和社会影响,它们在有毒帖子中的结构模式与人类不同。研究还识别出LLM社区中的意识形态极化,并提出了一种名为Chain of Social Thought (CoST)的方法,通过提醒LLM代理避免有害行为来防止潜在的有害活动。
PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm
Authors: Jing-Jing Li, Joel Mire, Eve Fleisig, Valentina Pyatkin, Anne Collins, Maarten Sap, Sydney Levine
First: 2026-01-13T19:41:11+00:00 · Latest: 2026-02-03T17:34:31+00:00
Abstract
Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
中文标题/摘要
标题:PluriHarms:全面评估人工智能危害的人类判断基准
当前的人工智能安全框架往往将危害性视为二元的,缺乏处理人类意见分歧的灵活性,特别是在边缘案例中。为了构建更加多元化的系统,必须超越共识,而是理解分歧在哪里以及为什么产生分歧。我们引入了PluriHarms基准,旨在系统地研究人类对危害的判断,涵盖两个关键维度——危害轴(从无害到有害)和一致性轴(从一致到分歧)。我们的可扩展框架生成了能够捕捉各种AI危害和人类价值观的提示,同时针对高分歧率的案例,这些案例通过人类数据得到了验证。基准包括150个提示,来自100名人类注释者的15,000个评分,这些评分丰富了人口统计学和心理特征以及提示级别的有害行为、影响和价值观特征。我们的分析表明,与迫在眉睫的风险和具体危害相关的提示会增强感知的危害性,而注释员特征(如毒性经验、教育水平)及其与提示内容的相互作用解释了系统性的分歧。我们在PluriHarms上对AI安全模型和对齐方法进行了基准测试,发现虽然个性化显著提高了对人类危害判断的预测,但仍有很大的改进空间。通过明确针对价值观多样性和分歧,我们的工作提供了一个原则性的基准,以超越“一刀切”的安全措施,迈向多元化的安全AI。
Summary / 总结
The research aims to address the limitations of current AI safety frameworks by introducing PluriHarms, a benchmark that evaluates human judgments on AI harm across the harm and agreement axes. The method involves generating diverse prompts that capture various AI harms and human values, with a focus on cases where there is significant disagreement. Key findings show that imminent risks and tangible harms increase perceived harmfulness, and that annotator traits and their interactions with prompt content explain systematic disagreements. The study benchmarks AI safety models and finds that while personalization improves predictions, there is still significant room for improvement.
研究旨在通过引入PluriHarms基准来解决当前AI安全框架的局限性,该基准评估人类对AI危害的判断,涵盖危害和一致性维度。方法包括生成多样化的提示,捕捉各种AI危害和人类价值观,并重点关注存在显著分歧的情况。关键发现表明,紧迫的风险和实际的危害会增加感知的危害性,而注释者的特质及其与提示内容的交互解释了系统的分歧。研究还对AI安全模型进行了基准测试,发现虽然个性化可以提高预测准确性,但仍存在很大的改进空间。
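The two axes described in the abstract above can be made concrete with a toy calculation: a prompt's position on the harm axis as its mean rating, and on the agreement axis as the spread of its ratings. The use of mean and population standard deviation, the rating scale, and the example ratings are assumptions for illustration; the abstract does not state how the benchmark computes these quantities.

```python
from statistics import mean, pstdev


def harm_and_disagreement(ratings):
    """Return (mean rating, population standard deviation) for one prompt."""
    return mean(ratings), pstdev(ratings)


# Hypothetical ratings from five annotators on an assumed 1-5 harm scale.
consensus_harmful = [5, 5, 4, 5, 5]
borderline = [1, 5, 1, 5, 3]

harm_a, spread_a = harm_and_disagreement(consensus_harmful)
harm_b, spread_b = harm_and_disagreement(borderline)

# Dataset size reported in the abstract: 150 prompts x 100 annotators.
total_ratings = 150 * 100
```

A prompt like `borderline` sits mid-scale on the harm axis but far out on the disagreement axis, which is the kind of case the benchmark is said to target.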
Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL
Authors: Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar
First: 2026-02-03T17:34:04+00:00 · Latest: 2026-02-03T17:34:04+00:00
Comments: preprint
Abstract
Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we also show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, due to the improved summary-conditioned generation abilities learned through training.
中文标题/摘要
标题:推理缓存:通过短期强化学习在长时间跨度上持续改进
能够超越训练预算持续改进的大语言模型(LLMs)能够在测试时解决越来越复杂的问题,我们将其称为外推。然而,标准强化学习(RL)在固定的问题分布和训练预算上运行,这限制了在测试时面对分布变化时的外推能力。为了解决这个问题,我们引入了RC,这是一种迭代解码算法,在训练和推理过程中都替代了标准的自回归解码。RC 利用大语言模型在响应生成和总结能力之间的不对称性,构建了一致改进的推理链。使用 RC 训练的模型可以在推理时间跨度上持续改进,远超过训练期间看到的时间跨度。实验中,使用 RC 训练一个 4B 模型并在 16k 词的训练预算下,测试时性能从 40% 提高到接近 70%,优于同等规模的模型和许多更大的推理大语言模型。最后,我们还展示了使用 RC 训练的模型能够更有效地利用现有的支架来进一步扩展测试时的性能,这是由于通过训练学习到的改进的摘要条件生成能力。
Summary / 总结
The paper introduces Reasoning Cache (RC), an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference so that Large Language Models (LLMs) can keep improving beyond their training budgets. By exploiting the asymmetry between response generation and summarization capabilities, RC constructs reasoning chains that consistently improve across iterations, enabling models to extrapolate over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, a 4B model trained with RC under a 16k-token training budget improves on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs.
论文提出了Reasoning Cache (RC) 算法,通过替换标准自回归解码,使大型语言模型(LLMs)能够在长时间范围内不断改进。这种方法允许模型在测试时适应并解决越来越复杂的问题,显示出显著的性能提升。具体来说,一个4B模型通过RC训练,在HMMT 2025上的性能从40%提高到接近70%,使用0.5m测试时的tokens,并且优于同等规模和更大规模的推理LLMs。
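One way to picture an iterative decoding loop of the kind described above is to alternate a bounded-length generation with a summary that is carried into the next round. The sketch below is a guess at that general shape only: the function `reasoning_cache_decode`, its signature, and the toy `generate` and `summarize` stand-ins are hypothetical, and the paper's actual prompts and training procedure are not shown.

```python
def reasoning_cache_decode(problem, generate, summarize, iterations):
    """Alternate short generations with summaries of the work so far."""
    summary = ""
    response = ""
    for _ in range(iterations):
        # Bounded-length attempt, conditioned on the cached summary.
        response = generate(problem, summary)
        # Compress the attempt into a compact cache for the next round.
        summary = summarize(summary, response)
    return response, summary


# Toy stand-ins for the model calls: the "problem" is a target number and
# each generation can advance the current guess by at most 3.
def toy_generate(problem, summary):
    guess = int(summary) if summary else 0
    return str(min(guess + 3, problem))


def toy_summarize(summary, response):
    return response


answer, cache = reasoning_cache_decode(10, toy_generate, toy_summarize, 4)
partial, _ = reasoning_cache_decode(10, toy_generate, toy_summarize, 2)
```

In the toy run, no single generation can reach the target, but repeated rounds do, which mirrors the abstract's claim that short-horizon steps can compound over a much longer horizon.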
The Epistemic Planning Domain Definition Language: Official Guideline
Authors: Alessandro Burigana, Francesco Fabiano
First: 2026-01-28T19:10:52+00:00 · Latest: 2026-02-03T17:32:48+00:00
Abstract
Epistemic planning extends (multi-agent) automated planning by making agents' knowledge and beliefs first-class aspects of the planning formalism. One of the most well-known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers a rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL makes DEL-based epistemic planning a challenging problem to tackle, both theoretically and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes on no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL-like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our main contributions are: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL's syntax and semantics grounded in DEL with abstract event models. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.
中文标题/摘要
标题:知识规划领域定义语言:官方指南
知识规划通过将代理的知识和信念作为规划形式主义的一等要素,扩展了(多代理)自动化规划。知识规划中最著名的框架之一是动态知识逻辑(DEL),它为在这个环境中建模问题提供了丰富且自然的语义。DEL提供的高表达力使得基于DEL的知识规划在理论和实际实现中都成为一个具有挑战性的问题。因此,现有的知识规划器通常针对不同的DEL片段,通常依赖于半正式的语言来表示基准,有时甚至没有任何语言。这种碎片化阻碍了比较、重用和系统基准开发。我们通过引入知识规划领域定义语言(EPDDL)来解决这些问题。EPDDL提供了一种独特的类似于PDDL的表示法,能够捕捉整个DEL语义,从而统一指定知识规划任务。我们的主要贡献包括:1. 对抽象事件模型的正式开发,这是一种用于定义我们语言语义的新颖表示法;2. 对EPDDL的语法和语义的正式规范,基于DEL和抽象事件模型。通过代表性基准的示例,我们说明了EPDDL如何促进互操作性、可重复评估以及知识规划的未来进展。
Summary / 总结
The paper addresses fragmentation in epistemic planning, where existing planners target different fragments of Dynamic Epistemic Logic (DEL) and rely on ad hoc benchmark languages. It introduces EPDDL, a PDDL-like language that captures the entire DEL semantics and provides a uniform representation for epistemic planning tasks. Key contributions include a formal development of abstract event models and a formal specification of EPDDL's syntax and semantics. Examples of representative benchmarks illustrate how the language facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.
论文引入了EPDDL,这是一种形式化的语言,用于捕获动态知识逻辑(DEL)的全部语义。它旨在通过提供统一的表示来解决现有知识规划器中的碎片化问题。主要贡献包括对抽象事件模型的正式开发以及对EPDDL语法和语义的DEL基础规范。通过代表性基准的示例,论文说明了EPDDL如何促进互操作性、可重复评估和知识规划的未来进展。
Decision-oriented benchmarking to transform AI weather forecast access: Application to the Indian monsoon
Authors: Rajat Masiwal, Colin Aitken, Adam Marchakitus, Mayank Gupta, Katherine Kowal, Hamid A. Pahlavan, Tyler Yang, Y. Qiang Sun, Michael Kremer, Amir Jina, William R. Boos, Pedram Hassanzadeh
First: 2026-02-03T17:27:22+00:00 · Latest: 2026-02-03T17:27:22+00:00
Abstract
Artificial intelligence weather prediction (AIWP) models now often outperform traditional physics-based models on common metrics while requiring orders-of-magnitude less computing resources and time. Open-access AIWP models thus hold promise as transformational tools for helping low- and middle-income populations make decisions in the face of high-impact weather shocks. Yet, current approaches to evaluating AIWP models focus mainly on aggregated meteorological metrics without considering local stakeholders' needs in decision-oriented, operational frameworks. Here, we introduce such a framework that connects meteorology, AI, and social sciences. As an example, we apply it to the 150-year-old problem of Indian monsoon forecasting, focusing on benefits to rain-fed agriculture, which is highly susceptible to climate change. AIWP models skillfully predict an agriculturally relevant onset index at regional scales weeks in advance when evaluated out-of-sample using deterministic and probabilistic metrics. This framework informed a government-led effort in 2025 to send 38 million Indian farmers AI-based monsoon onset forecasts, which captured an unusual weeks-long pause in monsoon progression. This decision-oriented benchmarking framework provides a key component of a blueprint for harnessing the power of AIWP models to help large vulnerable populations adapt to weather shocks in the face of climate variability and change.
中文标题/摘要
标题:面向决策的基准测试以变革人工智能天气预报接入:以印度季风为例
人工智能天气预测(AIWP)模型现在在常见指标上通常优于传统的基于物理的模型,同时所需的计算资源和时间要少得多数量级。开放访问的AIWP模型因此有望成为帮助低收入和中等收入人群在面对高影响天气冲击时做出决策的变革性工具。然而,目前评估AIWP模型的方法主要集中在聚合的气象指标上,而没有考虑当地利益相关者在决策导向的操作框架中的需求。在这里,我们介绍了一种连接气象学、人工智能和社会科学的框架。作为示例,我们将其应用于150年的印度季风预报问题,重点关注对高度易受气候变化影响的雨养农业的好处。当使用确定性和概率性指标进行离样本评估时,AIWP模型能够提前数周准确预测具有农业相关性的季节开始指数。该框架为2025年由政府领导的一项努力提供了指导,即向3800万印度农民发送基于人工智能的季风开始预报,这些预报捕捉到了季风进程异常持续数周的暂停。这种面向决策的基准测试框架为利用AIWP模型的力量帮助大量脆弱人群适应气候变异性与变化提供了一个关键组成部分。
Summary / 总结
The study introduces a decision-oriented benchmarking framework to evaluate artificial intelligence weather prediction (AIWP) models, focusing on their utility for local stakeholders. The framework connects meteorology, AI, and social sciences, and is applied to Indian monsoon forecasting. AIWP models predict an agriculturally relevant onset index weeks in advance, which was used to inform a government-led effort to send forecasts to 38 million farmers. This approach highlights the potential of AIWP models to help vulnerable populations adapt to weather shocks.
研究旨在通过整合气象学、人工智能和社会科学来评估AI天气预测模型,更好地服务于决策需求,特别是低收入和中等收入地区。方法是使用确定性和概率性指标来评估AIWP模型在区域尺度上提前数周预测农业相关降水开始指数的能力。关键发现表明,AIWP模型能够有效预测季风开始,有助于雨养农业适应气候变化。2025年,这一框架使政府能够向3800万农民提供基于AI的季风预报,捕捉到了季风异常暂停的进程。
Conditional Flow Matching for Visually-Guided Acoustic Highlighting
Authors: Hugo Malard, Gael Le Lan, Daniel Wong, David Lou Alon, Yi-Chiao Wu, Sanjeel Parekh
First: 2026-02-03T17:24:47+00:00 · Latest: 2026-02-03T17:24:47+00:00
Abstract
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
中文标题/摘要
标题:基于视觉引导的声学高亮条件流匹配
视觉引导的声学高亮旨在重新平衡音频,使其与伴随的视频保持一致,创造一个连贯的视听体验。虽然视觉显著性和增强已经得到了广泛的研究,但声学高亮仍然被忽视,经常导致视觉和听觉焦点的不一致。现有的方法使用判别模型,但在音频混音中固有的模糊性面前,这些模型难以应对,因为不良平衡和良好平衡的音频混音之间并不存在自然的一对一映射。为了解决这一局限性,我们将此任务重新定义为生成问题,并引入了一种条件流匹配(CFM)框架。迭代流生成中的一个关键挑战是早期预测错误——在选择正确的源进行增强——会在步骤中累积并使轨迹偏离流形。为了解决这个问题,我们引入了一种回放损失,该损失在最终步骤中惩罚漂移,鼓励自我纠正的轨迹并稳定长期流的整合。我们还提出了一种条件模块,在向量场回归之前融合音频和视觉线索,使跨模态源选择变得明确。广泛的定量和定性评估表明,我们的方法始终超越了之前的最佳判别方法,证明了视觉引导的音频混音最好通过生成建模来解决。
Summary / 总结
The paper addresses the challenge of visually-guided acoustic highlighting, where the goal is to align audio with video to create a coherent audio-visual experience. It introduces a Conditional Flow Matching (CFM) framework that reframes the task as a generative problem. The method includes a rollout loss to stabilize long-range flow integration and a conditioning module to fuse audio and visual cues. Experimental results show that the proposed approach outperforms existing discriminative models, demonstrating the effectiveness of generative modeling for visually-guided audio remixing.
论文解决了视觉引导下的声学突出问题,目标是调整音频以与视频内容相匹配。提出了一种条件流匹配(CFM)框架,将问题重新定义为生成任务。方法使用卷出损失来稳定长期流集成,并提出了一种条件模块来融合音频和视觉线索。实验结果表明,所提出的方法在视觉和听觉焦点的对齐方面优于现有模型。
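Illustrative sketch (not from the paper): the Conditional Flow Matching entry above describes regressing a vector field and adding a rollout loss that penalizes drift at the final integration step. The plain-Python sketch below shows one common way such losses are written, assuming a straight-line interpolation path between a source sample `x0` and a target sample `x1` and a fixed-step Euler integrator. The path, the integrator, and every name here (`velocity_fn`, `cond`, `cfm_loss`, `rollout_loss`) are hypothetical stand-ins; the paper's actual formulation may differ.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1). Assumed path."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def cfm_loss(velocity_fn, x0, x1, cond, t):
    """Mean squared error between the predicted velocity and the
    constant target velocity (x1 - x0) of the straight-line path."""
    x_t = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    pred = velocity_fn(x_t, t, cond)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def rollout(velocity_fn, x0, cond, steps=8):
    """Integrate the learned field from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def rollout_loss(velocity_fn, x0, x1, cond, steps=8):
    """Penalize the gap between the end of the rollout and the target,
    so that errors made early in the trajectory are visible to training."""
    x_hat = rollout(velocity_fn, x0, cond, steps)
    return sum((a - b) ** 2 for a, b in zip(x_hat, x1)) / len(x1)


if __name__ == "__main__":
    # Toy check with an "oracle" field that always returns the true velocity,
    # passed in through the conditioning argument for simplicity.
    x0, x1 = [0.0, 1.0], [1.0, 3.0]
    true_velocity = [b - a for a, b in zip(x0, x1)]
    oracle = lambda x, t, cond: list(cond)
    print(cfm_loss(oracle, x0, x1, true_velocity, t=0.5))
    print(rollout_loss(oracle, x0, x1, true_velocity))
```

In a real system `velocity_fn` would be a neural network conditioned on fused audio and visual features, and both losses would be averaged over batches and random times `t`; the toy oracle above only demonstrates that both losses vanish when the field is exact.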
RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images
Authors: Mishal Fatima, Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Michael Moeller, Margret Keuper
First: 2026-02-03T17:22:45+00:00 · Latest: 2026-02-03T17:22:45+00:00
Comments: *Equal Contribution
Abstract
Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality and detail, and generalization in low-bit RAW image processing. The dataset and code will be released upon acceptance.
中文标题/摘要
标题:RAWDet-7:一种针对量化RAW图像的目标检测与描述多场景基准
大多数视觉模型是在经过ISP管道优化以适应人类感知的RGB图像上进行训练的,这些管道可能会丢弃对机器推理有用的传感器级信息。RAW图像保留了未处理的场景数据,使模型能够利用更丰富的线索进行目标检测和目标描述,捕捉细粒度的细节、空间关系和在处理图像中经常丢失的上下文信息。为了支持该领域的研究,我们引入了RAWDet-7,这是一个包含约25000个训练和7600个测试RAW图像的大规模数据集,这些图像来自多种相机、光照条件和环境,并按照MS-COCO和LVIS的规范进行了密集标注。此外,我们还提供了来自相应高分辨率sRGB图像的目标级描述,便于研究RAW图像处理和低比特量化下目标级信息的保留情况。该数据集允许在模拟的4比特、6比特和8比特量化下进行评估,反映了现实的传感器约束,并为研究低比特RAW图像处理下的检测性能、描述质量和泛化能力提供了基准。接受后提供数据集和代码。
Summary / 总结
The research aims to evaluate object detection and description on RAW images, which preserve sensor-level information. The study introduces RAWDet-7, a dataset of approximately 25,000 training and 7,600 test RAW images from various cameras and environments, annotated for seven object categories. The dataset supports evaluation under simulated quantization levels (4-bit, 6-bit, and 8-bit) and includes object-level descriptions derived from the corresponding sRGB images, enabling the study of information preservation and detection performance in low-bit quantized RAW images.
研究旨在通过利用未处理的场景数据探索RAW图像在目标检测和描述中的潜力。研究引入了RAWDet-7数据集,包含来自不同相机和环境的25,000张训练图像和7,600张测试图像,并对七类物体进行了标注。数据集还包括从高分辨率sRGB图像中提取的目标级描述,用于评估在不同量化级别(4位、6位和8位)下的检测性能、描述质量和泛化能力。
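Illustrative sketch (not from the paper): the RAWDet-7 entry above mentions evaluation under simulated 4-bit, 6-bit, and 8-bit quantization. One simple way to simulate a lower bit depth is uniform requantization that drops low-order bits, shown below in plain Python. The 12-bit source depth, the bit-shift scheme, and the function name `quantize_raw` are assumptions made for illustration; the abstract does not specify how the dataset's quantization is simulated.

```python
def quantize_raw(pixels, src_bits=12, dst_bits=4):
    """Requantize integer sensor values from src_bits to dst_bits.

    Values are clamped to the valid source range, then the low-order
    bits are dropped (uniform quantization). Hypothetical helper.
    """
    if not 1 <= dst_bits <= src_bits:
        raise ValueError("dst_bits must be between 1 and src_bits")
    shift = src_bits - dst_bits
    max_src = (1 << src_bits) - 1
    return [
        [min(max(value, 0), max_src) >> shift for value in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # A tiny 2x3 "RAW" patch with assumed 12-bit values (0..4095).
    patch = [[0, 2048, 4095], [17, 255, 1024]]
    for bits in (4, 6, 8):
        print(bits, quantize_raw(patch, src_bits=12, dst_bits=bits))
```

At 4 bits only 16 distinct levels remain, so nearby intensities such as 17 and 255 collapse to the same code, which is the kind of information loss a low-bit benchmark is meant to probe.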
See-through: Single-image Layer Decomposition for Anime Characters
Authors: Jian Lin, Chengze Li, Haoyun Qin, Kwun Wang Chan, Yanghua Jin, Hanyuan Liu, Stephen Chun Wang Choy, Xueting Liu
First: 2026-02-03T17:12:36+00:00 · Latest: 2026-02-03T17:12:36+00:00
Comments: 23 pages, 20 figures, preprint version only
Abstract
We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Current professional workflows require tedious manual segmentation and the artistic ``hallucination'' of occluded regions to enable motion. Our approach overcomes this by decomposing a single image into fully inpainted, semantically distinct layers with inferred drawing orders. To address the scarcity of training data, we introduce a scalable engine that bootstraps high-quality supervision from commercial Live2D models, capturing pixel-perfect semantics and hidden geometry. Our methodology couples a diffusion-based Body Part Consistency Module, which enforces global geometric coherence, with a pixel-level pseudo-depth inference mechanism. This combination resolves the intricate stratification of anime characters, e.g., interleaving hair strands, allowing for dynamic layer reconstruction. We demonstrate that our approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications.
中文标题/摘要
标题:透明化:单图像分层分解用于动漫角色
我们提出了一种框架,自动将静态动漫插图转换为可操作的2.5D模型。当前的专业工作流程需要繁琐的手动分割和艺术上的“想象”被遮挡的区域,以实现运动效果。我们的方法通过将单个图像分解为完全填充、语义上独立的图层并推断绘制顺序来克服这一问题。为了解决训练数据稀缺的问题,我们引入了一个可扩展的引擎,从商业Live2D模型中提取高质量的监督信息,捕捉像素级的语义和隐藏的几何结构。我们的方法结合了一个基于扩散的体部一致性模块,该模块确保全局几何一致性,以及一个像素级伪深度推断机制。这种结合解决了动漫角色复杂的分层问题,例如交错的发丝,允许动态图层重建。我们证明,我们的方法生成了高保真度、可操作的模型,适用于专业实时动画应用。
Summary / 总结
The research aims to automate the transformation of static anime illustrations into manipulatable 2.5D models by decomposing a single image into distinct layers with inferred drawing orders. The method uses a Body Part Consistency Module based on diffusion and a pixel-level pseudo-depth inference mechanism to achieve global geometric coherence and resolve intricate layer stratifications. Key findings show that the approach produces high-fidelity, manipulatable models suitable for professional animation applications.
研究旨在通过将单张图像分解为完全填充、语义上独立的多层并推断绘制顺序,自动化将静态动漫插图转换为可操控的2.5D模型。方法结合了基于扩散的体部一致性模块和像素级伪深度推断机制,以实现全局几何一致性并解决复杂的分层问题。实验结果表明,该方法生成了高质量、可操控的模型,适用于专业的实时动画应用。
LIVE: Long-horizon Interactive Video World Modeling
Authors: Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang
First: 2026-02-03T17:10:03+00:00 · Latest: 2026-02-03T17:10:03+00:00
Comments: 18 pages, 22 figures
Abstract
Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
中文标题/摘要
标题:LIVE:长时程交互视频世界建模
自回归视频世界模型在动作条件下预测未来视觉观察。虽然在短时程上有效,但这些模型在长时程生成中往往难以应对,因为小的预测误差会随着时间累积。先前的方法通过引入预训练教师模型和序列级分布匹配来缓解这一问题,但这些方法会增加额外的计算成本,并且无法防止错误传播超过训练时程。在本文中,我们提出了一种名为LIVE的长时程交互视频世界模型,通过一种新颖的循环一致性目标来强制执行误差累积的边界,从而消除了基于教师的蒸馏的需要。具体而言,LIVE首先从真实帧进行正向滚动,然后应用反向生成过程重建初始状态。随后在重建的终端状态上计算扩散损失,从而提供对长时程误差传播的显式约束。此外,我们提供了一个统一的观点,涵盖了不同的方法,并引入了渐进式训练课程以稳定训练。实验表明,LIVE在长时程基准测试中达到了最先进的性能,生成了远超训练滚动长度的稳定、高质量的视频。
Summary / 总结
LIVE is a long-horizon interactive video world model that addresses the issue of error accumulation in autoregressive models by using a cycle-consistency objective. It performs a forward rollout from ground-truth frames and then reverses the process to reconstruct the initial state, with the diffusion loss computed on the reconstructed terminal state to constrain long-horizon errors. Experiments show that LIVE outperforms existing methods on long-horizon benchmarks, generating stable and high-quality videos beyond the training rollout lengths.
研究动机是解决长时序视频世界模型中误差累积的问题,这些模型在短时序上有效但在长期预测中表现不佳。主要方法是LIVE,一种新颖的循环一致性目标,该目标通过防止误差累积来增强模型性能,无需依赖教师模型进行蒸馏,从而降低计算成本。关键实验发现表明,LIVE在长时序基准测试中优于先前的方法,生成了超出训练时序长度的稳定且高质量的视频。
How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation
Authors: Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Dhruv Kumar
First: 2025-12-11T08:28:33+00:00 · Latest: 2026-02-03T17:04:09+00:00
Comments: This manuscript has been withdrawn by the authors because the methodology and results have been superseded by a more rigorous framework (SPACI and AST-ASIP). The corrected and expanded findings are now available in arXiv:2601.21360. Please cite the new manuscript instead
Abstract
The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. But their reliability can be compromised by students who may employ adversarial prompting strategies in order to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in an academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies to AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references. (iii) To capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.
中文标题/摘要
标题:如何愚弄你的AI助教:学术环境中文本生成模型代码评估破解的系统研究
大型语言模型(LLMs)作为代码评估的自动评判者在学术环境中的应用越来越普遍。但学生可能会通过采用对抗性提示策略来诱导误判,从而获得不应得的学术优势,从而损害其可靠性。本文首次对学术环境中基于LLM的自动代码评估器破解进行了大规模研究。我们的贡献包括:(i) 我们系统地适应了20多种破解策略,用于破解学术环境中的AI代码评估器,定义了一类新的攻击类型,称为学术破解。(ii) 我们发布了一个包含25000个对抗性学生提交的中毒数据集,专门设计用于学术代码评估环境,来源于多样化的实际课程作业,并配有人工评分参考和评分标准。(iii) 为了捕捉学术破解的多维影响,我们系统地适应并定义了三个破解指标(破解成功率、分数膨胀率和危害性)。(iv) 我们使用六种LLM全面评估了学术破解攻击。我们发现这些模型对有说服力和角色扮演的攻击表现出显著的脆弱性(最高97%的JSR)。我们的对抗性数据集和基准套件为下一代在学术代码评估中具有更强鲁棒性的LLM评估器奠定了基础。
Summary / 总结
This paper investigates the vulnerability of large language models (LLMs) used for academic code evaluation to adversarial prompting strategies, termed 'academic jailbreaking'. The authors developed 20+ jailbreaking strategies, created a poisoned dataset of 25K adversarial student submissions, and defined three metrics to evaluate the impact. They found that LLMs are highly vulnerable, especially to persuasive and role-play-based attacks, with a jailbreak success rate up to 97%. The study provides a benchmark for developing more robust LLM-based evaluators in academic settings.
本文研究了大型语言模型(LLMs)在学术环境中的代码评估自动判分系统的脆弱性。作者系统地适应了20多种越狱策略,并发布了一个包含25000个对抗性学生提交的污染数据集。他们定义了三个指标来评估这些攻击的影响,并发现LLMs对有说服力和角色扮演的攻击特别脆弱,越狱成功率高达97%。该研究为在学术代码评估中开发更鲁棒的基于LLM的评估器提供了基准。
Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment
Authors: Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi
First: 2026-02-03T17:03:46+00:00 · Latest: 2026-02-03T17:03:46+00:00
Abstract
Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
中文标题/摘要
标题:优化边缘的视觉-语言模型用于地下基础设施评估
对地下基础设施,如污水和涵洞系统的自主检查对于公共安全和城市可持续性至关重要。尽管配备视觉传感器的机器人平台可以高效地检测结构缺陷,但从这些检测中自动生成人类可读的摘要仍然是一项重大挑战,特别是在资源受限的边缘设备上。本文提出了一种新颖的两阶段端到端缺陷摘要管道,结合了我们轻量级的RAPID-SCAN分割模型和在边缘计算平台上部署的微调视觉-语言模型(VLM)。第一阶段使用RAPID-SCAN(资源感知管道检查和缺陷分割紧凑自适应网络),仅使用0.64M参数实现了0.834的F1分数,以实现高效的缺陷分割。第二阶段利用微调的Phi-3.5 VLM从分割输出生成简洁的、领域特定的自然语言摘要。我们引入了一个包含手动验证描述的检查图像数据集,用于VLM的微调和评估。为了实现实时性能,我们使用后训练量化和硬件特定优化,显著减少了模型大小和推理延迟,而不牺牲摘要质量。我们在移动机器人平台上部署和评估了完整的管道,展示了其在实际检查场景中的有效性。我们的结果表明,可部署于边缘的集成AI系统有可能弥合自动缺陷检测与基础设施维护可操作见解之间的差距,为更可扩展和自主的检查解决方案铺平了道路。
Summary / 总结
This paper addresses the challenge of generating human-readable summaries from automated defect detection in underground infrastructure inspections. It proposes a two-stage pipeline using RAPID-SCAN for efficient defect segmentation and a fine-tuned Phi-3.5 Vision-Language Model for summarization. The pipeline achieves 0.834 F1-score for segmentation and generates concise summaries. Post-training quantization and hardware optimization reduce model size and inference latency. The system is deployed on a mobile robotic platform, showing effectiveness in real-world scenarios and potential for scalable autonomous inspection solutions.
本文解决了自动检测地下基础设施缺陷后生成人类可读摘要的挑战。提出了一种两阶段管道,使用轻量级的RAPID-SCAN模型进行高效的缺陷分割,并使用微调的Vision-Language模型生成简洁的摘要。通过后训练量化和硬件特定优化实现实时性能,展示了在实际场景中的有效性,并为可扩展的自主检测解决方案铺平了道路。