arXiv Paper Digest

Choreographing a World of Dynamic Objects

Authors: Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu

First: 2026-01-07T18:59:40+00:00 · Latest: 2026-01-07T18:59:40+00:00

Abstract

Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing such phenomena. Traditional rule-based graphics pipelines for creating these dynamics rely on category-specific heuristics and are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories of interest. Our approach instead inherits the universality of video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies. Project page: https://yanzhelyu.github.io/chord
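
The Eulerian/Lagrangian distinction mentioned in the abstract can be illustrated with a toy example. This is only a conceptual sketch, not the CHORD pipeline: the velocity field (a rigid rotation) and the function names are assumptions made for illustration. The Eulerian view describes motion at fixed locations, much as a video describes what happens at each pixel; the Lagrangian view follows one material point over time.

```python
import math


def velocity_field(x, y, t):
    """Eulerian description: velocity observed at the fixed location (x, y).

    Toy assumption: rigid rotation about the origin at 1 rad per unit time.
    """
    return (-y, x)


def track_point(x0, y0, t_end, steps=1000):
    """Lagrangian description: follow one material point through the field.

    Uses midpoint (second-order Runge-Kutta) integration.
    """
    x, y = x0, y0
    dt = t_end / steps
    for i in range(steps):
        t = i * dt
        vx, vy = velocity_field(x, y, t)
        # Evaluate the field again at the midpoint of the step.
        mx, my = x + 0.5 * dt * vx, y + 0.5 * dt * vy
        vx, vy = velocity_field(mx, my, t + 0.5 * dt)
        x, y = x + dt * vx, y + dt * vy
    return x, y


# A point starting at (1, 0) rotated by a quarter turn ends near (0, 1).
end = track_point(1.0, 0.0, math.pi / 2)
print(end)
```

Recovering trajectories like `end` from per-location observations is the kind of information the abstract calls Lagrangian; how CHORD distills it from a video model is not specified in the abstract.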

中文标题/摘要

标题：编排动态物体的世界

在我们物理的4D（3D + 时间）世界中，动态物体不断演变、变形并与其它物体相互作用，导致多样的4D场景动态。本文中，我们提出了一种通用的生成管道CHORD，用于编排动态物体和场景并合成此类现象。传统的基于规则的图形管道创建这些动态是基于特定类别的启发式方法，但劳动密集且不具扩展性。最近的学习方法通常需要大规模数据集，但可能无法涵盖所有感兴趣的对象类别。相反，我们的方法通过提出一种基于蒸馏的管道，从2D视频的欧拉表示中提取丰富的拉格朗日运动信息，继承了视频生成模型的通用性。我们的方法是通用的、多功能的且类别无关的。我们通过实验展示了其有效性，生成了多种多体4D动态，展示了其相对于现有方法的优势，并展示了其在生成机器人操作策略方面的应用。项目页面：https://yanzhelyu.github.io/chord

Summary / 总结

This paper addresses the challenge of generating diverse 4D dynamics for objects that evolve, deform, and interact. It introduces CHORD, a universal generative pipeline that distills Lagrangian motion information from the Eulerian representations of 2D videos, inheriting the universality of video generative models while avoiding category-specific heuristics and large-scale datasets. Experiments show an advantage over existing methods in generating multi-body 4D dynamics, and the authors demonstrate applicability to generating robotics manipulation policies.

本文解决了在物理世界中生成动态物体的多样化4D动态的挑战。它提出了CHORD，一种通用生成管道，从2D视频的欧拉表示中提取拉格朗日运动信息以合成复杂的4D动态。实验表明，CHORD在生成多体4D动态方面优于现有方法，并且无需特定类别的人工启发式或大规模数据集即可应用于机器人操作策略。

Embedding Autonomous Agents in Resource-Constrained Robotic Platforms

Authors: Negar Halakou, Juan F. Gutierrez, Ye Sun, Han Jiang, Xueming Wu, Yilun Song, Andres Gomez

First: 2026-01-07T18:57:32+00:00 · Latest: 2026-01-07T18:57:32+00:00

Comments: This is an open-access, author-archived version of a manuscript published in European Conference on Multi-Agent Systems 2025

Abstract

Many embedded devices operate under resource constraints and in dynamic environments, requiring local decision-making capabilities. Enabling devices to make independent decisions in such environments can improve the responsiveness of the system and reduce the dependence on constant external control. In this work, we integrate an autonomous agent, programmed using AgentSpeak, with a small two-wheeled robot that explores a maze using its own decision-making and sensor data. Experimental results show that the agent successfully solved the maze in 59 seconds using 287 reasoning cycles, with decision phases taking less than one millisecond. These results indicate that the reasoning process is efficient enough for real-time execution on resource-constrained hardware. This integration demonstrates how high-level agent-based control can be applied to resource-constrained embedded systems for autonomous operation.
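
The reported figures allow a rough back-of-the-envelope check of how small the decision phase is relative to a full cycle. The sketch below assumes, beyond what the abstract states, that the 287 reasoning cycles are spread evenly over the 59-second run; the variable names are illustrative.

```python
# Figures taken from the abstract.
total_seconds = 59.0
reasoning_cycles = 287
decision_phase_ms_upper = 1.0  # "less than one millisecond"

# Assumption: cycles are evenly spaced over the whole run.
avg_cycle_ms = 1000.0 * total_seconds / reasoning_cycles
decision_share_upper = decision_phase_ms_upper / avg_cycle_ms

print(f"average cycle period: {avg_cycle_ms:.1f} ms")
print(f"decision share of a cycle: below {100 * decision_share_upper:.2f}%")
```

Under that assumption each cycle lasts roughly 206 ms, so the decision phase accounts for less than half a percent of it, which is consistent with the claim that reasoning is not the bottleneck.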

中文标题/摘要

标题：在资源受限的机器人平台上嵌入自主代理

许多嵌入式设备在资源受限和动态环境中运行，需要本地决策能力。使设备能够在这些环境中独立做出决策可以提高系统的响应性并减少对外部控制的依赖。在本研究中，我们使用AgentSpeak编程将一个自主代理与一个小型两轮机器人集成，该机器人利用自身的决策能力和传感器数据探索迷宫。实验结果表明，代理在59秒内成功解决了迷宫，使用了287次推理循环，决策阶段耗时不到一毫秒。这些结果表明，推理过程在资源受限的硬件上实现实时执行是高效的。这种集成展示了如何将基于代理的高级控制应用于资源受限的嵌入式系统以实现自主操作。

Summary / 总结

The research aims to enable autonomous decision-making in resource-constrained robotic platforms to enhance system responsiveness and reduce external control dependency. An autonomous agent programmed with AgentSpeak was integrated into a small two-wheeled robot navigating a maze. The agent successfully solved the maze in 59 seconds with 287 reasoning cycles, demonstrating efficient real-time execution on limited hardware.

研究旨在通过将自主代理与两轮机器人集成，增强资源受限嵌入式设备的自主性和响应性。该代理使用AgentSpeak编程，基于自身的推理和传感器数据做出决策。机器人成功在59秒内通过迷宫，使用了287次推理循环，表明在有限硬件资源上实现了高效的实时执行。

ImLoc: Revisiting Visual Localization with Image-based Representation

Authors: Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys

First: 2026-01-07T18:51:51+00:00 · Latest: 2026-01-07T18:51:51+00:00

Comments: Code will be available at https://github.com/cvg/Hierarchical-Localization

Abstract

Existing visual localization methods are typically either 2D image-based, which are easy to build and maintain but limited in effective geometric reasoning, or 3D structure-based, which achieve high accuracy but require a centralized reconstruction and are difficult to update. In this work, we revisit visual localization with a 2D image-based representation and propose to augment each image with estimated depth maps to capture the geometric structure. Supported by the effective use of dense matchers, this representation is not only easy to build and maintain, but also achieves the highest accuracy in challenging conditions. With compact compression and a GPU-accelerated LO-RANSAC implementation, the whole pipeline is efficient in both storage and computation and allows for a flexible trade-off between accuracy and memory efficiency. Our method achieves a new state-of-the-art accuracy on various standard benchmarks and outperforms existing memory-efficient methods at comparable map sizes. Code will be available at https://github.com/cvg/Hierarchical-Localization.
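
One way to see why a per-image depth map adds geometric structure: with depth, a matched pixel in a database image can be lifted to a 3D point, so a 2D-2D match with the query becomes a 2D-3D correspondence of the kind a pose solver inside RANSAC consumes. The sketch below assumes an ideal pinhole camera and made-up intrinsics; it is not taken from the ImLoc code, and the function names are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth along the optical axis to a 3D point
    in the camera frame, assuming an ideal pinhole camera."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

p = backproject(420.0, 140.0, 2.0, fx, fy, cx, cy)
uv = project(p, fx, fy, cx, cy)
print(p, uv)
```

The round trip through `project` returns the original pixel, which is a quick sanity check on the intrinsics convention.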

中文标题/摘要

标题：ImLoc：基于图像的表示重新审视视觉定位

现有的视觉定位方法通常是基于2D图像的，这类方法易于构建和维护，但几何推理能力有限；或者基于3D结构的，这类方法能够实现高精度，但需要集中重建且难以更新。在本文中，我们基于2D图像的表示重新审视视觉定位，并提出通过添加估计的深度图来捕捉几何结构。借助密集匹配器的有效使用，这种表示不仅易于构建和维护，而且在具有挑战性的条件下实现了最高的精度。通过紧凑压缩和GPU加速的LO-RANSAC实现，整个管道在存储和计算上都高效，并允许在精度和最高内存效率之间灵活权衡。我们的方法在各种标准基准上达到了新的最先进的精度，并在相似的地图大小下优于现有的内存高效方法。代码将在https://github.com/cvg/Hierarchical-Localization公开。

Summary / 总结

This paper revisits visual localization using a 2D image-based representation, enhancing each image with estimated depth maps to capture geometric structure. The method leverages dense matchers for effective geometric reasoning and is efficient in both storage and computation. It achieves state-of-the-art accuracy on various benchmarks and outperforms existing memory-efficient methods at comparable map sizes.

本文重新审视了使用基于2D图像的表示进行视觉定位的方法，通过添加估计的深度图来捕捉几何结构。该方法利用密集匹配器进行有效的几何推理，并在存储和计算上都非常高效。它在各种基准测试中达到了最先进的准确性，并且在相似的地图大小下优于现有的高效内存方法。

HONEYBEE: Efficient Role-based Access Control for Vector Databases via Dynamic Partitioning [Technical Report]

Authors: Hongbin Zhong, Matthew Lentz, Nina Narodytska, Adriana Szekeres, Kexin Rong

First: 2025-05-02T18:59:31+00:00 · Latest: 2026-01-07T18:49:03+00:00

Comments: Accepted by SIGMOD 2026

Abstract

Enterprise deployments of vector databases require access control policies to protect sensitive data. These systems often implement access control through hybrid vector queries that combine nearest-neighbor search with relational predicates based on user permissions. However, existing approaches face a fundamental trade-off: dedicated per-user indexes minimize query latency but incur high memory redundancy, while shared indexes with post-search filtering reduce memory overhead at the cost of increased latency. This paper introduces HONEYBEE, a dynamic partitioning framework that leverages the structure of Role-Based Access Control (RBAC) policies to create a smooth trade-off between these extremes. RBAC policies organize users into roles and assign permissions at the role level, creating a natural "thin waist" in the permission structure that is ideal for partitioning decisions. Specifically, HONEYBEE produces overlapping partitions where vectors can be strategically replicated across different partitions to reduce query latency while controlling memory overhead. To guide these decisions, HONEYBEE develops analytical models of vector search performance and recall, and formulates partitioning as a constrained optimization problem that balances memory usage, query efficiency, and recall. Evaluations on RBAC workloads demonstrate that HONEYBEE achieves up to 13.5X lower query latency than row-level security with only a 1.24X increase in memory usage, while achieving comparable query performance to dedicated, per-role indexes with 90.4% reduction in additional memory consumption, offering a practical middle ground for secure and efficient vector search.
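
The two extremes that HONEYBEE interpolates between can be sketched on toy data. This is a minimal illustration, not the HONEYBEE algorithm: it uses brute-force search in place of a real vector index, and the documents, roles, and function names are all hypothetical.

```python
import math

# Hypothetical toy data: doc id -> (vector, roles allowed to read it).
DOCS = {
    0: ((0.0, 0.0), {"eng"}),
    1: ((1.0, 0.0), {"eng", "sales"}),
    2: ((0.0, 1.0), {"sales"}),
    3: ((1.0, 1.0), {"eng", "sales"}),
}


def search_shared_postfilter(query, role, k):
    """Shared index: rank every vector, then drop those the role cannot read."""
    ranked = sorted(DOCS, key=lambda d: math.dist(DOCS[d][0], query))
    return [d for d in ranked if role in DOCS[d][1]][:k]


def build_role_partitions():
    """Per-role partitions: a vector readable by several roles is replicated."""
    parts = {}
    for doc_id, (_, roles) in DOCS.items():
        for role in sorted(roles):
            parts.setdefault(role, []).append(doc_id)
    return parts


def search_partition(parts, query, role, k):
    """Search only the partition that belongs to the querying role."""
    ranked = sorted(parts.get(role, []),
                    key=lambda d: math.dist(DOCS[d][0], query))
    return ranked[:k]


parts = build_role_partitions()
# Memory cost of full replication relative to storing each vector once.
replication = sum(len(p) for p in parts.values()) / len(DOCS)

query = (0.9, 0.2)
print(search_shared_postfilter(query, "sales", 2))
print(search_partition(parts, query, "sales", 2))
print(replication)
```

Both searches return the same permitted neighbours; the difference is that the shared index scans vectors the role cannot read, while the per-role partitions store shared vectors more than once. Overlapping partitions, as described in the abstract, sit between these two layouts.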

中文标题/摘要

标题：HONEYBEE：通过动态分区实现向量数据库的高效角色基础访问控制[技术报告]

企业部署向量数据库需要访问控制策略来保护敏感数据。这些系统通常通过结合最近邻搜索和基于用户权限的关系谓词的混合向量查询来实现访问控制。然而，现有方法面临一个基本的权衡：为每个用户创建专用索引可以最小化查询延迟，但会产生高内存冗余，而共享索引在搜索后过滤可以减少内存开销，但会增加延迟。本文介绍了一种动态分区框架HONEYBEE，该框架利用角色基础访问控制（RBAC）策略的结构来在这些极端之间实现平滑的权衡。RBAC策略将用户组织成角色，并在角色级别分配权限，从而形成一个自然的“细腰”，非常适合分区决策。具体而言，HONEYBEE 生成重叠分区，使向量可以在不同分区之间战略性地复制，以减少查询延迟并控制内存开销。为了指导这些决策，HONEYBEE 开发了向量搜索性能和召回率的分析模型，并将分区问题表述为一个受约束的优化问题，该问题平衡了内存使用、查询效率和召回率。在RBAC工作负载上的评估表明，HONEYBEE 的查询延迟比行级安全低13.5倍，内存使用量仅增加1.24倍，同时在90.4%的额外内存消耗减少的情况下，查询性能与专用、按角色索引相当，提供了一种实用的中间地带，以实现安全和高效的向量搜索。

Summary / 总结

HONEYBEE is a dynamic partitioning framework for vector databases that balances query latency and memory usage by leveraging the structure of Role-Based Access Control (RBAC) policies. It creates overlapping partitions, replicating vectors across partitions to reduce query latency while controlling memory overhead, and uses analytical models of search performance and recall to formulate partitioning as a constrained optimization problem. On RBAC workloads, HONEYBEE achieves up to 13.5X lower query latency than row-level security with only a 1.24X increase in memory usage, and query performance comparable to dedicated per-role indexes with a 90.4% reduction in additional memory consumption.

HONEYBEE 是一种动态分区框架，旨在优化向量数据库中的基于角色的访问控制。它利用 RBAC 策略创建重叠分区，以减少查询延迟并控制内存使用量。HONEYBEE 使用分析模型来平衡内存使用、查询效率和召回率，实现了与行级安全相比高达 13.5 倍的查询延迟降低，同时内存使用量仅增加 1.24 倍，并且与基于角色的专用索引相比，额外内存消耗减少了 90.4%。

Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition

Authors: Nia Touko, Matthew O A Ellis, Cristiano Capone, Alessio Burrello, Elisa Donati, Luca Manneschi

First: 2026-01-07T18:48:31+00:00 · Latest: 2026-01-07T18:48:31+00:00

Abstract

Reliable long-term decoding of surface electromyography (EMG) is hindered by signal drift caused by electrode shifts, muscle fatigue, and posture changes. While state-of-the-art models achieve high intra-session accuracy, their performance often degrades sharply across sessions. Existing solutions typically demand large datasets or high-compute pipelines that are impractical for energy-efficient wearables. We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone. We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) a Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration. Evaluated on the NinaPro DB6 multi-session dataset, our framework significantly bridges the inter-session accuracy gap with minimal overhead. Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes using only a fraction of the data required by current benchmarks. This work establishes a path toward robust, "plug-and-play" myoelectric control for long-term prosthetic use.
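
Strategy (i) can be pictured as normalization statistics that keep updating at test time using only past and current samples. The sketch below is one plausible reading, not the authors' implementation: it tracks a single channel with an exponential moving average, and the class name, momentum value, and toy signal are assumptions.

```python
import math


class CausalAdaptiveNorm:
    """Single-channel normalizer whose statistics adapt online.

    Only the current and earlier samples are used, so it can run in real time.
    """

    def __init__(self, mean=0.0, var=1.0, momentum=0.1, eps=1e-5):
        self.mean = mean          # statistics learned on the source session
        self.var = var
        self.momentum = momentum  # hypothetical adaptation rate
        self.eps = eps

    def step(self, x):
        # Exponentially weighted update of mean and variance.
        delta = x - self.mean
        self.mean += self.momentum * delta
        self.var = (1.0 - self.momentum) * (self.var + self.momentum * delta * delta)
        return (x - self.mean) / math.sqrt(self.var + self.eps)


# Toy drifted session: the signal baseline has shifted from 0 to 5.
norm = CausalAdaptiveNorm()
outputs = [norm.step(5.0 + (1.0 if t % 2 == 0 else -1.0)) for t in range(200)]
print(norm.mean, outputs[-1])
```

After a short transient the tracked mean settles near the new baseline, so downstream layers again see roughly zero-centred inputs without any labelled data from the new session.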

中文标题/摘要

标题：基于EMG的手势识别测试时轻量级适应

由于电极位移、肌肉疲劳和姿势变化导致的信号漂移，长期可靠的表面肌电图（EMG）解码受到阻碍。尽管最先进的模型在单会话中能实现高精度，但其性能往往在会话间急剧下降。现有解决方案通常需要大量数据集或高计算量的管道，这在能量高效的可穿戴设备上是不切实际的。我们提出了一种基于时间卷积网络（TCN）骨干的测试时轻量级适应（TTA）框架。我们引入了三种部署就绪的策略：（i）因果自适应批量归一化进行实时统计对齐；（ii）经验回放的高斯混合模型（GMM）对齐以防止遗忘；（iii）元学习以实现快速、少样本校准。在NinaPro DB6多会话数据集上评估，我们的框架在最小开销下显著缩小了会话间精度差距。我们的结果表明，经验回放更新在有限数据下提供了更好的稳定性，而元学习仅使用当前基准所需数据的一小部分便在单样本和两样本情况下实现了竞争力的性能。这项工作为长期假肢使用中的稳健、即插即用的肌电控制奠定了道路。

Summary / 总结

The paper addresses signal drift in EMG-based gesture recognition, proposing a lightweight Test-Time Adaptation framework with a TCN backbone. It introduces three strategies: causal adaptive batch normalization, GMM alignment with experience replay, and meta-learning for rapid few-shot calibration. On the NinaPro DB6 multi-session dataset, the framework narrows the inter-session accuracy gap with minimal overhead: experience-replay updates yield superior stability under limited data, while meta-learning is competitive in one- and two-shot regimes.

论文针对EMG手势识别中的信号漂移问题，提出了一种轻量级的Test-Time Adaptation框架，使用了TCN骨干网络。引入了三种策略：因果自适应批量归一化、基于经验重播的GMM对齐以及元学习快速校准。该框架在保持低开销的同时显著提高了跨会话的准确率，显示出在有限数据下的优越稳定性和竞争力。

Robust Physics Discovery from Highly Corrupted Data: A PINN Framework Applied to the Nonlinear Schrödinger Equation

Authors: Pietro de Oliveira Esteves

First: 2026-01-07T18:43:11+00:00 · Latest: 2026-01-07T18:43:11+00:00

Comments: 9 pages, 4 figures, 2 tables. Code available at https://github.com/p-esteves/pinn-nlse-2026

Abstract

We demonstrate a deep learning framework capable of recovering physical parameters from the Nonlinear Schrödinger Equation (NLSE) under severe noise conditions. By integrating Physics-Informed Neural Networks (PINNs) with automatic differentiation, we achieve reconstruction of the nonlinear coefficient beta with less than 0.2 percent relative error using only 500 sparse, randomly sampled data points corrupted by 20 percent additive Gaussian noise, a regime where traditional finite difference methods typically fail due to noise amplification in numerical derivatives. We validate the method's generalization capabilities across different physical regimes (beta between 0.5 and 2.0) and varying data availability (between 100 and 1000 training points), demonstrating consistent sub-1 percent accuracy. Statistical analysis over multiple independent runs confirms robustness (standard deviation less than 0.15 percent for beta equals 1.0). The complete pipeline executes in approximately 80 minutes on modest cloud GPU resources (NVIDIA Tesla T4), making the approach accessible for widespread adoption. Our results indicate that physics-based regularization acts as an effective filter against high measurement uncertainty, positioning PINNs as a viable alternative to traditional optimization methods for inverse problems in spatiotemporal dynamics where experimental data is scarce and noisy. All code is made publicly available to facilitate reproducibility.
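
The noise-amplification argument can be made concrete with a short variance calculation. This is a sketch under assumptions not stated in the abstract: measurements $u_j = \bar{u}_j + \varepsilon_j$ lie on a uniform grid with spacing $h$, and the noise terms $\varepsilon_j$ are independent with zero mean and standard deviation $\sigma$.

```latex
% Noise contribution to the central first difference
\operatorname{Var}\!\left[\frac{\varepsilon_{j+1} - \varepsilon_{j-1}}{2h}\right]
  = \frac{\sigma^2 + \sigma^2}{4h^2}
  = \frac{\sigma^2}{2h^2}
\quad\Longrightarrow\quad
\text{std} = \frac{\sigma}{\sqrt{2}\,h}

% Noise contribution to the central second difference
\operatorname{Var}\!\left[\frac{\varepsilon_{j+1} - 2\varepsilon_j + \varepsilon_{j-1}}{h^2}\right]
  = \frac{(1 + 4 + 1)\,\sigma^2}{h^4}
  = \frac{6\sigma^2}{h^4}
\quad\Longrightarrow\quad
\text{std} = \frac{\sqrt{6}\,\sigma}{h^2}
```

Refining the grid reduces truncation error but inflates the noise term as $1/h$ and $1/h^2$, so a second spatial derivative computed from noisy samples quickly becomes unusable. A PINN avoids differencing the data directly: derivatives are taken of the smooth network output via automatic differentiation, and the noisy samples enter only through the data-fit term of the loss.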

中文标题/摘要

标题：从严重受损的数据中稳健地发现物理规律：应用于非线性薛定谔方程的 PINN 框架

我们展示了一种深度学习框架，能够在严重噪声条件下从非线性薛定谔方程（NLSE）中恢复物理参数。通过将物理信息神经网络（PINNs）与自动微分相结合，我们仅使用500个稀疏、随机采样的数据点，这些数据点被20%的高斯噪声污染，就能以小于0.2%的相对误差重建非线性系数β，而在传统有限差分方法中，由于噪声在数值导数中的放大作用，通常在这种情况下会失败。我们验证了该方法在不同物理区域（β在0.5到2.0之间）和不同数据可用性（100到1000个训练点之间）下的泛化能力，显示出一致的低于1%的准确率。多次独立运行的统计分析证实了其稳健性（β等于1.0时的标准差小于0.15%）。整个管道在适度的云GPU资源（NVIDIA Tesla T4）上大约80分钟内执行完毕，使该方法易于广泛采用。我们的结果表明，基于物理的正则化作为一种有效的滤波器，可以对抗高测量不确定性，将PINNs定位为在实验数据稀缺且噪声的情况下，时空动态逆问题中传统优化方法的可行替代方案。所有代码已公开发布，以促进可重复性。

Summary / 总结

The research aims to recover physical parameters from the Nonlinear Schrödinger Equation (NLSE) under highly corrupted data conditions using Physics-Informed Neural Networks (PINNs). By integrating PINNs with automatic differentiation, the method reconstructs the nonlinear coefficient beta with less than 0.2 percent relative error using only 500 sparse data points corrupted by 20 percent additive Gaussian noise. The approach demonstrates consistent sub-1 percent accuracy for beta between 0.5 and 2.0 and for 100 to 1000 training points, with robustness confirmed through statistical analysis over multiple independent runs. The complete pipeline runs in approximately 80 minutes on modest cloud GPU resources (NVIDIA Tesla T4), making it accessible for inverse problems with scarce and noisy data.

研究提出了一种使用物理信息神经网络（PINNs）在高度噪声数据条件下从非线性薛定谔方程（NLSE）中恢复物理参数的深度学习框架。通过结合自动微分，该方法成功地用500个稀疏的噪声数据点重建了非线性系数beta，相对误差低于0.2%，而传统的有限差分方法在这种条件下通常会失效。该方法在不同的物理条件下和不同的数据可用性下表现出一致的低于1%的误差，并通过统计分析验证了其鲁棒性。整个流程在普通的云GPU资源（NVIDIA Tesla T4）上约80分钟即可运行完毕，使其易于实际应用。

Agentic Rubrics as Contextual Verifiers for SWE Agents

Authors: Mohit Raghavendra, Anisha Gunjal, Bing Liu, Yunzhong He

First: 2026-01-07T18:38:23+00:00 · Latest: 2026-01-07T18:38:23+00:00

Comments: 31 pages, 11 Figures

Abstract

Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
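
The selection step under parallel test-time scaling can be sketched as scoring each candidate patch against the rubric and keeping the best one. This is a simplified illustration, not the paper's system: in practice the per-item judgments would come from a model grading the patch, whereas here they are hard-coded, and the rubric items, weights, and function names are hypothetical.

```python
def rubric_score(judgments, weights):
    """Weighted fraction of rubric items that a patch satisfies."""
    total = sum(weights.values())
    earned = sum(weights[item] for item, ok in judgments.items() if ok)
    return earned / total


def select_patch(candidates, weights):
    """Score every candidate patch and return the name of the best one."""
    return max(candidates, key=lambda name: rubric_score(candidates[name], weights))


# Hypothetical rubric produced after inspecting the repository.
weights = {
    "handles_empty_input": 2.0,
    "updates_call_sites": 1.0,
    "keeps_public_api": 1.0,
}

# Hypothetical per-item judgments for two candidate patches.
candidates = {
    "patch_a": {"handles_empty_input": False, "updates_call_sites": True, "keeps_public_api": True},
    "patch_b": {"handles_empty_input": True, "updates_call_sites": True, "keeps_public_api": False},
}

best = select_patch(candidates, weights)
print(best, rubric_score(candidates[best], weights))
```

Because each item is judged separately, the score also shows which criterion a rejected patch failed, which is the kind of granular signal the abstract contrasts with pass/fail test execution.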

中文标题/摘要

标题：代理评分表作为软件工程代理上下文验证器

验证对于提高代理至关重要：它为强化学习提供奖励信号，并通过测试时缩放（TTS）在推理时获得收益。尽管如此，软件工程（SWE）代理环境中的验证通常依赖于代码执行，这由于环境设置开销而难以扩展。可扩展的替代方案，如补丁分类器和启发式方法存在，但它们在代码库上下文中的依据较少，且难以解释。为此，我们探索了代理评分表：专家代理与仓库交互以创建基于上下文的评分表检查表，候选补丁随后根据其评分表进行评分，无需执行测试。在SWE-Bench Verified的并行TTS评估下，代理评分表在Qwen3-Coder-30B-A3B上的得分为54.2%，在Qwen3-32B上的得分为40.6%，与我们比较集中最强基线相比至少提高了3.5个百分点。我们进一步分析了评分表的行为，显示评分表得分与真实测试结果一致，同时也能标记测试未捕捉到的问题。我们的消融实验表明，代理上下文收集对于生成代码库特定的、无歧义的标准至关重要。综上所述，这些结果表明代理评分表为SWE代理提供了高效、可扩展且精细的验证信号。

Summary / 总结

The paper aims to improve verification for software engineering agents by proposing Agentic Rubrics: context-grounded rubric checklists created by an expert agent that interacts with the repository. Candidate patches are scored against the rubric without test execution, providing a scalable alternative to code execution. On SWE-Bench Verified under parallel TTS evaluation, the method achieves 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, at least a +3.5 percentage-point gain over the strongest baseline in the comparison set. Rubric scores are consistent with ground-truth tests and also flag issues that tests do not capture, and ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria.

研究旨在通过提供一种可扩展的替代方案来改进软件工程代理的验证过程，该方案替代了代码执行。Agentic Rubrics 是基于仓库上下文由专家代理生成的检查表，用于在无需执行测试的情况下对候选补丁进行评分。在 SWE-Bench Verified 上，Agentic Rubrics 较现有方法获得了更高的分数，且相对于最强基线方法有至少 3.5 个百分点的提升。此外，这些检查表的评分与真实（ground-truth）测试结果一致，同时还能标记出测试未捕捉到的问题，表明它们在提供细粒度和高效的验证信号方面的实用性。

FedDUAL: A Dual-Strategy with Adaptive Loss and Dynamic Aggregation for Mitigating Data Heterogeneity in Federated Learning

Authors: Pranab Sahoo, Ashutosh Tripathi, Sriparna Saha, Samrat Mondal

First: 2024-12-05T18:42:29+00:00 · Latest: 2026-01-07T18:33:05+00:00

Comments: Transactions on Machine Learning Research (TMLR)

Abstract

Federated Learning (FL) marks a transformative approach to distributed model training by combining locally optimized models from various clients into a unified global model. While FL preserves data privacy by eliminating centralized storage, it encounters significant challenges such as performance degradation, slower convergence, and reduced robustness of the global model due to the heterogeneity in client data distributions. Among the various forms of data heterogeneity, label skew emerges as a particularly formidable and prevalent issue, especially in domains such as image classification. To address these challenges, we begin with comprehensive experiments to pinpoint the underlying issues in the FL training process. Based on our findings, we then introduce an innovative dual-strategy approach designed to effectively resolve these issues. First, we introduce an adaptive loss function for client-side training, meticulously crafted to preserve previously acquired knowledge while maintaining an optimal equilibrium between local optimization and global model coherence. Secondly, we develop a dynamic aggregation strategy for aggregating client models at the server. This approach adapts to each client's unique learning patterns, effectively addressing the challenges of diverse data across the network. Our comprehensive evaluation, conducted across three diverse real-world datasets, coupled with theoretical convergence guarantees, demonstrates the superior efficacy of our method compared to several established state-of-the-art approaches.
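
Server-side aggregation in general can be written as a weighted average of client parameters. The sketch below shows that generic step only: the abstract does not specify how FedDUAL computes its dynamic weights, so the weights are plain inputs here, and the function name and toy values are assumptions.

```python
def aggregate(client_models, weights):
    """Weighted average of client parameter vectors.

    client_models: list of equal-length lists of floats
    weights: one non-negative weight per client
    """
    total = sum(weights)
    n_params = len(client_models[0])
    return [
        sum(w * model[i] for w, model in zip(weights, client_models)) / total
        for i in range(n_params)
    ]


client_models = [[1.0, 2.0], [3.0, 4.0]]

# FedAvg-style weights: proportional to local sample counts.
sample_counts = [100, 300]
fedavg_model = aggregate(client_models, sample_counts)

# A dynamic scheme would recompute the weights every round from client
# behaviour; these values are placeholders, not FedDUAL's rule.
dynamic_weights = [0.5, 0.5]
dynamic_model = aggregate(client_models, dynamic_weights)

print(fedavg_model, dynamic_model)
```

Under label skew, fixed sample-count weights can let a client with an unrepresentative label distribution dominate the average, which is the motivation for making the weights adaptive.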

中文标题/摘要

标题：FedDUAL：一种适应性损失和动态聚合的双策略方法，用于缓解联邦学习中的数据异质性

联邦学习（FL）通过将来自不同客户端的局部优化模型整合成统一的全局模型，标志着分布式模型训练的一种变革性方法。尽管FL通过消除集中存储来保护数据隐私，但它面临着诸如性能下降、收敛速度减慢和全局模型鲁棒性降低等重大挑战，这些挑战主要是由于客户端数据分布的异质性。在各种形式的数据异质性中，标签偏差尤为突出且普遍，尤其是在图像分类等领域。为了解决这些挑战，我们首先进行了全面的实验，以确定FL训练过程中的根本问题。基于这些发现，我们提出了一种创新的双策略方法，旨在有效解决这些问题。首先，我们引入了一种适应性损失函数，用于客户端训练，精心设计以保留先前获得的知识，同时在局部优化和全局模型一致性之间保持最佳平衡。其次，我们开发了一种动态聚合策略，用于在服务器端聚合客户端模型。该方法能够适应每个客户端的独特学习模式，有效解决了网络中多样化的数据挑战。我们在三个不同的真实世界数据集上进行了全面评估，并结合了理论收敛保证，证明了我们方法在与多个现有先进方法相比时的优越效果。

Summary / 总结

The paper addresses the challenges of data heterogeneity in Federated Learning, particularly label skew, which can degrade model performance. It proposes FedDUAL, a dual-strategy approach combining an adaptive loss function for client-side training and a dynamic aggregation strategy for server-side model aggregation. The method is evaluated across three real-world datasets and shows superior performance compared to existing methods.

论文针对联邦学习中的数据异质性问题，特别是图像分类中的标签偏差，提出了FedDUAL方法，该方法包括客户端训练的自适应损失函数和服务器端模型聚合的动态策略。该方法在三个真实世界数据集上进行了评估，并显示出比现有方法更好的性能。

Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models

Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen

First: 2026-01-07T18:24:12+00:00 · Latest: 2026-01-07T18:24:12+00:00

Abstract

Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
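
The failure mode in which AUC stays stable while calibration degrades can be reproduced with a few lines of arithmetic. This is a toy illustration, not the study's analysis: it assumes a scanner change shifts every logit by the same constant, and the data, offset, and function names are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier(probs, labels):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


logits = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]
scanner_offset = 1.5  # hypothetical systematic shift on a second scanner

probs_a = [sigmoid(z) for z in logits]
probs_b = [sigmoid(z + scanner_offset) for z in logits]

print(auc(probs_a, labels), auc(probs_b, labels))
print(brier(probs_a, labels), brier(probs_b, labels))
```

AUC depends only on the ranking of scores, so a monotone shift leaves it unchanged, while the Brier score worsens because the predicted probabilities are now systematically too high. This is why a ranking metric alone can hide scanner-dependent bias.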

中文标题/摘要

标题：扫描诱导的领域偏移削弱了病理基础模型的稳健性

病理基础模型（PFMs）已成为计算病理学的核心，旨在提供从全切片图像（WSIs）中提取特征的一般编码器。尽管在基准测试中表现出色，但PFMs对现实世界技术领域偏移（如全切片扫描仪设备的变异性）的稳健性仍然知之甚少。我们系统地评估了14种PFMs对扫描诱导变异性（包括最先进的模型、早期的自监督模型以及基于自然图像训练的基线模型）的稳健性。使用包含384例乳腺癌WSIs的多扫描仪数据集（在五种设备上扫描），我们独立地隔离了扫描器效应，排除了生物学和实验室混杂因素的影响。稳健性通过互补的无监督嵌入分析和一系列临床病理学监督预测任务进行评估。我们的结果表明，当前的PFMs对扫描诱导的领域偏移缺乏不变性。大多数模型在其嵌入空间中编码了明显的扫描器特定变异性。虽然AUC通常保持稳定，但这掩盖了一个关键的失败模式：扫描器变异性系统地改变了嵌入空间，并影响了下游模型预测的校准，导致扫描器依赖性偏差，可能影响临床应用的可靠性。我们进一步表明，稳健性不是简单地由训练数据规模、模型大小或模型的最新程度决定的。没有一种模型能够可靠地抵抗扫描器诱导的变异性。虽然训练数据最多样化的模型（这里代表视觉-语言模型）似乎在稳健性方面具有优势，但它们在下游监督任务中表现不佳。我们得出结论，PFM的开发和评估需要超越以准确性为中心的基准，转向明确评估和优化在现实获取变异性下的嵌入稳定性和校准。

Summary / 总结

The study evaluates the robustness of 14 pathology foundation models (PFMs) to scanner-induced variability using a multiscanner dataset of breast cancer whole-slide images. Despite strong benchmark performance, the PFMs showed significant scanner-specific variability in their embedding spaces, leading to scanner-dependent bias that impacts model calibration and reliability. The robustness was assessed through unsupervised embedding analyses and supervised prediction tasks, indicating that none of the models provided reliable robustness against scanner-induced variability.

研究评估了14种病理基础模型（PFMs）在多扫描器数据集中的鲁棒性，该数据集包含乳腺癌全切片图像。研究结果显示，当前PFMs对扫描器引起的领域变化缺乏不变性，大多数模型在其嵌入空间中编码了扫描器特有的变化。虽然AUC分数通常保持稳定，但这掩盖了关键问题，即扫描器变化会影响下游模型预测的校准，导致扫描器依赖性偏差。研究结论指出，鲁棒性不能仅基于训练数据规模、模型大小或模型的最新性来假设，PFMs需要在现实的获取变化下评估嵌入稳定性和校准的鲁棒性。

Attention Needs to Focus: A Unified Perspective on Attention Allocation

Authors: Zichuan Fu, Wentao Song, Guojing Li, Yejing Wang, Xian Wu, Yimin Deng, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao

First: 2026-01-01T08:39:15+00:00 · Latest: 2026-01-07T18:20:49+00:00

Comments: preprint

Abstract

The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. However, despite its power, the standard attention mechanism is plagued by well-documented issues: representational collapse and attention sink. Although prior work has proposed approaches for these issues, they are often studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root -- improper attention allocation. We identify two failure modes: 1) Attention Overload, where tokens receive comparable high weights, blurring semantic features that lead to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention is still forced to distribute, resulting in spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance compared to both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
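
The underload problem follows from the softmax constraint that weights must sum to one. The sketch below contrasts standard softmax with a simple relaxation that adds one implicit logit fixed at zero to the denominator. That relaxation is a stand-in chosen for illustration; it is not Elastic-Softmax, whose definition the abstract does not give.

```python
import math


def softmax(scores):
    """Standard softmax: the weights always sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def softmax_with_null(scores):
    """Softmax with one extra implicit logit fixed at 0 that absorbs mass.

    Illustrative relaxation only; not the paper's Elastic-Softmax.
    """
    m = max(max(scores), 0.0)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps) + math.exp(-m)
    return [e / z for e in exps]


# Every key is a poor match for the query (all scores strongly negative).
irrelevant = [-6.0, -6.0, -6.0, -6.0]

print(softmax(irrelevant))
print(softmax_with_null(irrelevant))
```

Standard softmax still spreads a full unit of attention over the four irrelevant tokens, whereas the relaxed version lets the total attention fall close to zero, so no token has to act as a sink.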

中文标题/摘要

标题：注意力需要聚焦：统一视角下的注意力分配

Transformer架构，现代大型语言模型（LLMs）的核心基石，在序列建模中取得了非凡的成功，主要归功于其注意力机制。然而，尽管其强大，标准的注意力机制仍然存在已知的问题：表示坍塌和注意力陷阱。尽管先前的工作提出了针对这些问题的方法，但它们通常被孤立研究，掩盖了它们之间的深层联系。在本文中，我们提出了一种统一的视角，认为这些问题都可以追溯到一个共同根源——不恰当的注意力分配。我们识别了两种失败模式：1）注意力过载，其中令牌接收相似的高权重，模糊了语义特征，导致表示坍塌；2）注意力不足，其中没有令牌具有语义相关性，但注意力仍然被迫分配，导致虚假关注，如注意力陷阱。基于这一洞察，我们引入了懒惰注意力机制，这是一种旨在实现更聚焦的注意力分配的新机制。为了缓解过载，它在头部和维度之间使用位置区分来增强令牌的区别。为了对抗不足，它结合了弹性-softmax，这是一种修改的规范化函数，放松了标准的softmax约束，以抑制对不相关令牌的注意力。在FineWeb-Edu语料库上的实验，通过九个不同的基准进行评估，表明懒惰注意力机制成功地缓解了注意力陷阱，并在与标准注意力和现代架构相比时实现了竞争力的性能，同时达到高达59.58%的注意力稀疏度。

Summary / 总结

This paper addresses the issues of representational collapse and attention sink in the standard attention mechanism of Transformer architectures. It proposes a unified perspective, identifying two failure modes: Attention Overload and Attention Underload. To tackle these issues, the authors introduce Lazy Attention, which uses positional discrimination and Elastic-Softmax to improve attention allocation. Experiments show that Lazy Attention effectively mitigates attention sink and achieves competitive performance with up to 59.58% attention sparsity.

本文探讨了标准Transformer注意力机制中的表示坍塌和注意力陷阱问题，认为这些问题都源于不恰当的注意力分配。提出了Lazy Attention机制，通过位置区分来增强token的区别，并结合弹性Softmax来放松标准Softmax约束，抑制无关注意力。实验结果表明，Lazy Attention在FineWeb-Edu语料库上取得了与标准注意力和现代架构相当的性能，并且达到了高达59.58%的注意力稀疏性。

ToTMNet: FFT-Accelerated Toeplitz Temporal Mixing Network for Lightweight Remote Photoplethysmography

Authors: Vladimir Frants, Sos Agaian, Karen Panetta

First: 2026-01-07T18:15:09+00:00 · Latest: 2026-01-07T18:15:09+00:00

Abstract

Remote photoplethysmography (rPPG) estimates a blood volume pulse (BVP) waveform from facial videos captured by commodity cameras. Although recent deep models improve robustness compared to classical signal-processing approaches, many methods increase computational cost and parameter count, and attention-based temporal modeling introduces quadratic scaling with respect to the temporal length. This paper proposes ToTMNet, a lightweight rPPG architecture that replaces temporal attention with an FFT-accelerated Toeplitz temporal mixing layer. The Toeplitz operator provides a full-sequence temporal receptive field using a number of parameters that is linear in the clip length, and it can be applied in near-linear time using circulant embedding and FFT-based convolution. ToTMNet integrates the global Toeplitz temporal operator into a compact gated temporal mixer that combines a local depthwise temporal convolution branch with gated global Toeplitz mixing, enabling efficient long-range temporal filtering with only 63k parameters. Experiments on two datasets, UBFC-rPPG (real videos) and SCAMPS (synthetic videos), show that ToTMNet achieves strong heart-rate estimation accuracy with a compact design. On UBFC-rPPG intra-dataset evaluation, ToTMNet reaches 1.055 bpm MAE with Pearson correlation 0.996. In a synthetic-to-real setting (SCAMPS to UBFC-rPPG), ToTMNet reaches 1.582 bpm MAE with Pearson correlation 0.994. Ablation results confirm that the gating mechanism is important for effectively using global Toeplitz mixing, especially under domain shift. The main limitation of this preprint study is the use of only two datasets; nevertheless, the results indicate that Toeplitz-structured temporal mixing is a practical and efficient alternative to attention for rPPG.
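
The circulant-embedding trick can be sketched in a few lines. This is a generic illustration of the technique, not the ToTMNet layer: a Toeplitz matrix of size n is defined by 2n - 1 numbers, it is embedded in a larger circulant matrix, and the matrix-vector product then becomes a circular convolution computed with FFTs. The pure-Python FFT and all function names are written for this example.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def toeplitz_matvec_fft(col, row, x):
    """Compute y = T x, where T[i][j] = col[i-j] if i >= j else row[j-i]."""
    n = len(x)
    m = 1
    while m < 2 * n - 1:  # circulant size: power of two, at least 2n - 1
        m *= 2
    v = [0.0] * m  # first column of the circulant embedding
    for k in range(n):
        v[k] = col[k]
    for k in range(1, n):
        v[m - k] = row[k]
    xp = list(x) + [0.0] * (m - n)  # zero-pad the input
    fv, fx = fft(v), fft(xp)
    y = fft([a * b for a, b in zip(fv, fx)], inverse=True)
    return [val.real / m for val in y[:n]]


def toeplitz_matvec_direct(col, row, x):
    """Reference O(n^2) product, used only to check the FFT version."""
    n = len(x)
    return [
        sum((col[i - j] if i >= j else row[j - i]) * x[j] for j in range(n))
        for i in range(n)
    ]


col, row, x = [1.0, 2.0, 3.0], [1.0, 4.0, 5.0], [1.0, 0.0, -1.0]
y_fft = toeplitz_matvec_fft(col, row, x)
y_ref = toeplitz_matvec_direct(col, row, x)
print(y_fft, y_ref)
```

The FFT route costs O(n log n) per product instead of O(n^2), while every output position still depends on every input position, which is what gives the operator a full-sequence receptive field.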

中文标题/摘要

标题：ToTMNet：FFT加速Toeplitz时间混合网络用于轻量级远程光电容积脉搏波描记

远程光电容积脉搏波描记（rPPG）从普通相机拍摄的面部视频中估计血容积脉冲（BVP）波形。尽管最近的深度模型在鲁棒性方面优于经典信号处理方法，但许多方法增加了计算成本和参数数量，基于注意力的时间建模引入了时间长度的二次缩放。本文提出了一种轻量级的rPPG架构ToTMNet，用FFT加速的Toeplitz时间混合层替代了时间注意力。Toeplitz算子使用与剪辑长度成线性关系的参数数量提供了整个序列的时间感受野，并且可以通过循环嵌入和FFT卷积在接近线性时间内应用。ToTMNet将全局Toeplitz时间算子整合到一个紧凑的门控时间混合器中，该混合器结合了局部深度时间卷积分支和门控全局Toeplitz混合，能够在仅拥有63k参数的情况下实现高效的长程时间滤波。在两个数据集UBFC-rPPG（真实视频）和SCAMPS（合成视频）上的实验表明，ToTMNet在紧凑设计下实现了强大的心率估计准确性。在UBFC-rPPG内部数据集评估中，ToTMNet达到1.055 bpm MAE和皮尔逊相关系数0.996。在合成到现实的设置（SCAMPS到UBFC-rPPG）中，ToTMNet达到1.582 bpm MAE和皮尔逊相关系数0.994。消融结果证实，门控机制对于有效利用全局Toeplitz混合至关重要，尤其是在领域转移的情况下。这项预印本研究的主要局限性是仅使用了两个数据集；然而，结果表明，Toeplitz结构的时间混合是rPPG中注意力的一种实用且高效的替代方案。

CktGen: Automated Analog Circuit Design with Generative Artificial Intelligence

Authors: Yuxuan Hou, Hehe Fan, Jianrong Zhang, Yue Zhang, Hua Chen, Min Zhou, Faxin Yu, Roger Zimmermann, Yi Yang

First: 2024-10-01T18:35:44+00:00 · Latest: 2026-01-07T18:11:26+00:00

Comments: Paper accepted by Engineering

Abs · PDF · Code1 · Code2

Abstract

The automatic synthesis of analog circuits presents significant challenges. Most existing approaches formulate the problem as a single-objective optimization task, overlooking that design specifications for a given circuit type vary widely across applications. To address this, we introduce specification-conditioned analog circuit generation, a task that directly generates analog circuits based on target specifications. The motivation is to leverage existing well-designed circuits to improve automation in analog circuit design. Specifically, we propose CktGen, a simple yet effective variational autoencoder that maps discretized specifications and circuits into a joint latent space and reconstructs the circuit from that latent vector. Notably, as a single specification may correspond to multiple valid circuits, naively fusing specification information into the generative model does not capture these one-to-many relationships. To address this, we decouple the encoding of circuits and specifications and align their mapped latent space. Then, we employ contrastive training with a filter mask to maximize differences between encoded circuits and specifications. Furthermore, classifier guidance along with latent feature alignment promotes the clustering of circuits sharing the same specification, avoiding model collapse into trivial one-to-one mappings. By canonicalizing the latent space with respect to specifications, we can search for an optimal circuit that meets valid target specifications. We conduct comprehensive experiments on the open circuit benchmark and introduce metrics to evaluate cross-model consistency. Experimental results demonstrate that CktGen achieves substantial improvements over state-of-the-art methods.

中文标题/摘要

标题：CktGen：基于生成人工智能的自动化模拟电路设计

模拟电路的自动综合面临着重大挑战。大多数现有方法将问题形式化为单目标优化任务，忽略了给定电路类型在不同应用中的设计规范差异广泛。为了解决这一问题，我们引入了基于规范条件的模拟电路生成任务，该任务可以直接根据目标规范生成模拟电路。动机是利用现有设计良好的电路来提高模拟电路设计的自动化程度。具体来说，我们提出了CktGen，这是一种简单而有效的变分自编码器，将离散化规范和电路映射到联合潜在空间，并从潜在向量中重建电路。值得注意的是，一种规范可能对应多个有效电路，简单地将规范信息融合到生成模型中不能捕捉这些一对多的关系。为了解决这一问题，我们将电路编码和规范编码解耦，并对它们的映射潜在空间进行对齐。然后，我们使用过滤掩码的对比训练来最大化编码电路和规范之间的差异。此外，分类器指导与潜在特征对齐促进了具有相同规范的电路的聚类，避免了模型退化为简单的一对一映射。通过对规范进行潜在空间的规范化，我们可以搜索满足有效目标规范的最优电路。我们在开放电路基准上进行了全面的实验，并引入了评估跨模型一致性的度量。实验结果表明，CktGen 在最先进的方法上取得了显著的改进。

Summary / 总结

The paper addresses the challenge of automatically synthesizing analog circuits by introducing CktGen, a variational autoencoder that maps specifications and circuits into a joint latent space. By decoupling the encoding of circuits and specifications and using contrastive training with a filter mask, CktGen captures the one-to-many relationships between specifications and valid circuits. The method also employs classifier guidance and latent feature alignment to promote clustering of circuits that share the same specification, avoiding collapse into trivial one-to-one mappings. Experiments on the open circuit benchmark, which include newly introduced metrics for cross-model consistency, show that CktGen substantially outperforms state-of-the-art methods.

CktGen 通过引入基于规格的模拟电路生成方法来解决自动模拟电路合成的挑战。它利用变分自编码器将规格和电路映射到联合潜在空间，并通过对比训练和分类器指导来对齐和聚类具有相同规格的电路。实验结果表明，CktGen 在生成符合目标规格的电路方面优于现有方法。

Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning

Authors: Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag

First: 2026-01-07T18:05:08+00:00 · Latest: 2026-01-07T18:05:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.

中文标题/摘要

标题：扩散-DRF：可微奖励流用于视频扩散微调

直接偏好优化(DPO)最近通过提高视觉保真度和文本对齐性改善了文本到视频(T2V)生成。然而，当前方法依赖于人类注释或学习奖励模型中的非可微偏好信号。这种依赖性使得训练耗时、易产生偏差且容易被操纵，这通常会导致奖励劫持和训练不稳定。我们提出了一种可微奖励流(Diffusion-DRF)，使用冻结的现成视觉-语言模型(VLM)作为无训练的批评家，用于微调视频扩散模型。Diffusion-DRF直接通过扩散去噪链反向传播VLM反馈，将logit级响应转换为可优化的token感知梯度。我们提出了一种自动化的、按方面结构化的提示管道，以获得可靠的多维VLM反馈，而梯度检查点技术使得通过最终去噪步骤进行高效更新成为可能。Diffusion-DRF在提高视频质量和语义对齐的同时，减轻了奖励劫持和崩溃问题——无需额外的奖励模型或偏好数据集。它具有模型通用性，并且可以轻松应用于其他基于扩散的生成任务。

Summary / 总结

The paper proposes Diffusion-DRF, a method for fine-tuning video diffusion models using a differentiable reward flow. It leverages a frozen Vision-Language Model (VLM) to provide token-aware gradients for optimization, avoiding the need for additional reward models or preference datasets. This approach improves video quality and semantic alignment while mitigating reward hacking and collapse issues.

论文针对当前文本到视频（T2V）生成中的直接偏好优化（DPO）方法依赖非可微偏好信号的局限性，提出了一个可微奖励流Diffusion-DRF，该方法使用冻结的视觉-语言模型（VLM）提供可微的梯度进行优化。该方法提高了视频质量和语义对齐，避免了奖励作弊和崩溃，且不需要额外的奖励模型或偏好数据集。

Klear: Unified Multi-Task Audio-Video Joint Generation

Authors: Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Chen Zhang, Pengfei Wan

First: 2026-01-07T18:03:45+00:00 · Latest: 2026-01-07T18:03:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which can stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Klear and delve into three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime, from random modality masking to joint optimization across tasks, together with a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions, and introduce a novel automated data-construction pipeline which annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Klear scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.

中文标题/摘要

标题：Klear：统一多任务音频-视频联合生成

音频-视频联合生成取得了快速进展，但仍面临重大挑战。非商业方法仍存在音频-视觉不一致、唇音对齐差和单模态退化等问题，这些问题可能源于弱的音频-视觉对应建模、有限的泛化能力和稀缺的高质量密集字幕数据。为解决这些问题，我们引入了Klear，并深入探讨了三个维度——模型架构、训练策略和数据整理。从架构上看，我们采用了一塔设计，使用统一的DiT块和全域注意力机制，实现了紧密的音频-视觉对齐和强大的可扩展性。从训练上看，我们采用了渐进式多任务制度——随机模态遮蔽以实现跨任务联合优化，以及多阶段课程，从而生成稳健的表示，增强音频-视觉对齐的世界知识，并防止单模态崩溃。对于数据集，我们首次提出了一个大规模的带有密集字幕的音频-视频数据集，并引入了一种新的自动化数据构建管道，该管道标注和过滤了数百万个多样、高质量、严格对齐的音频-视频-字幕三元组。基于此，Klear能够扩展到大规模数据集，提供高保真度、语义和时间上对齐的指令跟随生成，在联合和单模态设置中均表现出色，并且在分布外场景中表现出强大的泛化能力。在各个任务上，它大幅优于先前的方法，并实现了与Veo 3相当的性能，提供了一条统一、可扩展的通往下一代音频-视频合成的道路。

Summary / 总结

Klear addresses the challenges in audio-video joint generation by introducing a unified model architecture, a progressive multitask training strategy, and a novel data curation pipeline. The model uses a single-tower design with DiT blocks and Omni-Full Attention, achieving tight audio-visual alignment. The training strategy includes random modality masking and a multistage curriculum, which enhance robustness and prevent unimodal collapse. The accompanying large-scale audio-video dataset with dense captions is built with an automated annotation and filtering pipeline. Klear outperforms previous methods and achieves performance comparable to Veo 3, demonstrating high-fidelity, semantically and temporally aligned generation in both joint and unimodal settings.

Klear通过引入单塔模型和统一的DiT块以及全注意力机制，并采用渐进多任务训练策略来解决音频-视觉联合生成的挑战。该方法使用随机模态遮蔽和多阶段课程来增强音频-视觉对齐并防止单模态退化。Klear还提出了一套大规模的带有密集字幕的数据集和一个自动数据构建管道。实验结果表明，Klear在多个任务上显著优于先前的方法，并且在联合生成和单模态设置中实现了与Veo 3相当的性能，展示了强大的泛化能力和高保真生成。

Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Authors: Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Shangtong Zhang, Yanjun Qi

First: 2025-05-21T16:15:01+00:00 · Latest: 2026-01-07T17:58:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Reinforcement learning (RL) is a framework for solving sequential decision-making problems. In this work, we demonstrate that, surprisingly, RL emerges during the inference time of large language models (LLMs), a phenomenon we term in-context RL (ICRL). To reveal this capability, we introduce a simple multi-round prompting framework, which we call ICRL prompting, for inference-time self-improvement. The goal of ICRL prompting is to guide LLMs to perform reinforcement learning during inference for self-improvement on a given task. After each response, the model receives numerical scalar feedback, denoted as a reward. In the next round, we prompt the LLM again together with a context that concatenates all prior responses and their associated rewards. We consistently observe that response quality improves as the context grows. In other words, the LLM can optimize scalar reward signals during inference, exhibiting behavior analogous to reinforcement learning. We evaluate ICRL prompting on Game of 24, creative writing, ScienceWorld, and Olympiad-level math competitions (AIME and HMMT), demonstrating significant improvements over baselines such as Self-Refine and Reflexion. Notably, even when the reward signals are generated by the same LLM, ICRL prompting still improves performance, highlighting a promising new paradigm for test-time scaling.

中文标题/摘要

标题：奖励足够：大语言模型是上下文强化学习者

强化学习（RL）是一种解决顺序决策问题的框架。在本研究中，我们展示了令人惊讶的现象：在大型语言模型（LLM）的推理过程中，RL 会自然出现，我们将其称为上下文强化学习（ICRL）。为了揭示这一能力，我们引入了一种简单的多轮提示框架，称为 ICRL 提示，用于推理时的自我改进。ICRL 提示的目标是引导 LLM 在推理过程中进行强化学习，以在给定任务上进行自我改进。每次响应后，模型会收到一个数值标量反馈，称为奖励。在下一轮中，我们再次提示 LLM 并提供一个上下文，该上下文是所有先前响应及其相关奖励的串联。我们观察到，随着上下文的增长，响应质量不断提高。换句话说，LLM 可以在推理过程中优化标量奖励信号，表现出类似于强化学习的行为。我们在 24 点游戏、创意写作、ScienceWorld 以及奥林匹克级别的数学竞赛（AIME 和 HMMT）中评估了 ICRL 提示，展示了其相对于 Self-Refine 和 Reflexion 等基线方法的显著改进。值得注意的是，即使奖励信号由相同的 LLM 生成，ICRL 提示仍然提高了性能，突显出一种新的测试时扩展范式。
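
The multi-round loop described in the ICRL abstract (respond, receive a scalar reward, re-prompt with all prior responses and rewards concatenated) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `llm` and `reward_fn` callables, and the toy task are all assumptions, and `stub_llm` is a deterministic offline stand-in that does not learn anything by itself.

```python
from typing import Callable, List, Tuple

def icrl_prompting(
    task: str,
    llm: Callable[[str], str],
    reward_fn: Callable[[str], float],
    rounds: int = 5,
) -> List[Tuple[str, float]]:
    """Run multi-round prompting; each round sees all prior responses and rewards."""
    history: List[Tuple[str, float]] = []
    for _ in range(rounds):
        # Concatenate every earlier response with its scalar reward.
        context = "".join(
            f"Attempt {i + 1}: {resp}\nReward: {rew}\n"
            for i, (resp, rew) in enumerate(history)
        )
        prompt = f"Task: {task}\n{context}Give a new attempt that earns a higher reward."
        response = llm(prompt)
        history.append((response, reward_fn(response)))
    return history

def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM so the loop runs offline:
    # it answers with the number of attempts already shown in the prompt.
    return str(prompt.count("Attempt "))

def toy_reward(response: str) -> float:
    # Hypothetical reward: closeness of the guessed integer to a hidden target of 3.
    return float(-abs(int(response) - 3))

history = icrl_prompting("Guess the hidden integer.", stub_llm, toy_reward, rounds=5)
best_response, best_reward = max(history, key=lambda item: item[1])
```

In a real setting `llm` would wrap a model call and `reward_fn` a task-specific scorer (which, per the abstract, may even be the same LLM); the only structural requirement is that the growing history is fed back each round.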

Wow, wo, val! A Comprehensive Embodied World Model Evaluation Turing Test

Authors: Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang

First: 2026-01-07T17:50:37+00:00 · Latest: 2026-01-07T17:50:37+00:00

Abs · PDF · Code1 · Code2

Abstract

As world models gain momentum in Embodied AI, an increasing number of works explore using video foundation models as predictive world models for downstream embodied tasks like 3D prediction or interactive generation. However, before exploring these downstream tasks, two critical questions about video foundation models remain unanswered: (1) whether their generative generalization is sufficient to maintain perceptual fidelity in the eyes of human observers, and (2) whether they are robust enough to serve as a universal prior for real-world embodied agents. To provide a standardized framework for answering these questions, we introduce the Embodied Turing Test benchmark: WoW-World-Eval (Wow, wo, val). Built on 609 robot manipulation data samples, Wow-wo-val examines five core abilities: perception, planning, prediction, generalization, and execution. We propose a comprehensive evaluation protocol with 22 metrics to assess the models' generation ability, which achieves a high Pearson correlation between the overall score and human preference (>0.93) and establishes a reliable foundation for the Human Turing Test. On Wow-wo-val, models achieve only 17.27 on long-horizon planning and at best 68.02 on physical consistency, indicating limited spatiotemporal consistency and physical reasoning. For the Inverse Dynamic Model Turing Test, we first use an IDM to evaluate the video foundation models' execution accuracy in the real world. However, most models collapse to approximately 0% success, while WoW maintains a 40.74% success rate. These findings point to a noticeable gap between the generated videos and the real world, highlighting the urgency and necessity of benchmarking world models in Embodied AI.

中文标题/摘要

标题：Wow, wo, val！全面的具身世界模型评估图灵测试

随着具身人工智能中世界模型的发展，越来越多的研究工作探索使用视频基础模型作为下游具身任务（如3D预测或交互生成）的预测世界模型。然而，在探索这些下游任务之前，视频基础模型仍然有两个关键问题未得到解答：（1）它们的生成泛化是否足够维持人类观察者的感知保真度，（2）它们是否足够稳健，能够作为现实世界具身代理的通用先验。为了提供一个标准化框架来回答这些问题，我们引入了具身图灵测试基准：Wow-wo-val（Wow, wo, val）。基于609个机器人操作数据，Wow-wo-val 检查了五个核心能力，包括感知、规划、预测、泛化和执行。我们提出了一种全面的评估协议，包含22个指标来评估模型的生成能力，该协议在整体评分与人类偏好之间的皮尔逊相关系数超过0.93，并为人类图灵测试建立了可靠的基础。在Wow-wo-val上，模型在长时规划上仅达到17.27分，在物理一致性上最高达到68.02分，表明空间时间一致性有限和物理推理能力有限。对于逆动力学模型图灵测试，我们首先使用逆动力学模型来评估视频基础模型在现实世界中的执行准确性。然而，大多数模型的准确率降至约0%，而Wow保持了40.74%的成功率。这些发现表明生成的视频与现实世界之间存在明显的差距，突显了在具身人工智能中基准测试世界模型的紧迫性和必要性。

Summary / 总结

This paper introduces WoW-World-Eval (Wow, wo, val), a benchmark for evaluating embodied world models in terms of perceptual fidelity and robustness. The evaluation applies 22 metrics to 609 robot manipulation data samples, covering perception, planning, prediction, generalization, and execution. Models show limited spatiotemporal consistency and physical reasoning, scoring only 17.27 on long-horizon planning and at best 68.02 on physical consistency. In the Inverse Dynamic Model Turing Test, most models collapse to roughly 0% real-world success while WoW maintains 40.74%, pointing to a noticeable gap between generated videos and real-world execution.

该论文提出了WoW-wo-val基准，用于评估具身人工智能中的世界模型。它关注视频基础模型能否保持感知保真度，以及是否足够稳健、可作为现实世界具身智能体的通用先验。该基准评估了五个核心能力，并使用22个指标来评估模型，结果显示模型的时空一致性和物理推理能力有限。在逆动力学模型图灵测试中，大多数模型的成功率降至约0%，而WoW保持了40.74%的成功率，突显了生成视频与现实世界执行之间的差距。

ContextFocus: Activation Steering for Contextual Faithfulness in Large Language Models

Authors: Nikhil Anand, Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, Koyel Mukherjee

First: 2026-01-07T17:45:20+00:00 · Latest: 2026-01-07T17:45:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) encode vast amounts of parametric knowledge during pre-training. As world knowledge evolves, effective deployment increasingly depends on their ability to faithfully follow externally retrieved context. When such evidence conflicts with the model's internal knowledge, LLMs often default to memorized facts, producing unfaithful outputs. In this work, we introduce ContextFocus, a lightweight activation steering approach that improves context faithfulness in such knowledge-conflict settings while preserving fluency and efficiency. Unlike prior approaches, our solution requires no model finetuning and incurs minimal inference-time overhead, making it highly efficient. We evaluate ContextFocus on the ConFiQA benchmark, comparing it against strong baselines including ContextDPO, COIECD, and prompting-based methods. Furthermore, we show that our method is complementary to prompting strategies and remains effective on larger models. Extensive experiments show that ContextFocus significantly improves contextual faithfulness. Our results highlight the effectiveness, robustness, and efficiency of ContextFocus in improving the contextual faithfulness of LLM outputs.

中文标题/摘要

标题：ContextFocus: 用于大型语言模型上下文忠实性的激活导向控制

大型语言模型（LLMs）在预训练过程中编码了大量的参数化知识。随着世界知识的演变，有效部署越来越多地依赖于它们能够忠实跟随外部检索到的上下文的能力。当这种证据与模型内部知识发生冲突时，LLMs 经常默认使用记忆中的事实，产生不忠实的输出。在本研究中，我们引入了 ContextFocus，一种轻量级的激活导向控制方法，该方法在知识冲突设置中提高了上下文忠实性，同时保持流畅性和效率。与先前的方法不同，我们的解决方案不需要模型微调，并且在推理时间上几乎没有额外开销，使其非常高效。我们在 ConFiQA 基准上评估了 ContextFocus，将其与 ContextDPO、COIECD 和基于提示的方法等强基线进行比较。此外，我们展示了我们的方法与提示策略的互补性，并且在更大规模的模型上仍然有效。广泛的实验表明，ContextFocus 显著提高了上下文忠实性。我们的结果突显了 ContextFocus 在提高LLM输出上下文忠实性方面的有效性和鲁棒性以及效率。

Summary / 总结

The research aims to enhance the contextual faithfulness of Large Language Models (LLMs) by addressing knowledge conflicts between the model's internal knowledge and externally retrieved context. ContextFocus, a lightweight activation steering approach, is introduced to steer the model's activations towards the external context without requiring model fine-tuning or significant inference overhead. Experiments on the ConFiQA benchmark demonstrate that ContextFocus significantly improves contextual faithfulness compared to strong baselines, while maintaining fluency and efficiency, and it is effective on larger models as well.

本文针对大型语言模型在外部上下文与内部知识冲突时产生不忠实输出的问题，引入了ContextFocus，这是一种轻量级的激活导向方法，能够在不需模型微调和显著增加推理时间开销的情况下提升上下文忠实度。ConFiQA基准测试的实验结果表明，ContextFocus显著提高了上下文忠实度，且在强基线方法和更大规模的模型上仍然有效。
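
The ContextFocus abstract does not spell out how its steering direction is built, so the following only illustrates the general idea of activation steering with one common recipe: take the mean difference between hidden activations collected under two conditions and add a scaled copy of it to a hidden state at inference time. Everything here (function names, the mean-difference construction, the scale `alpha`, the random stand-in activations) is an assumption for illustration, not the paper's method.

```python
import numpy as np

def steering_vector(acts_with_context, acts_without_context):
    """Mean-difference direction between two activation sets of shape (n, d_model)."""
    return acts_with_context.mean(axis=0) - acts_without_context.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Shift a hidden state along the steering direction; no weights are changed."""
    return hidden + alpha * vector

# Random arrays stand in for activations recorded from a real model.
rng = np.random.default_rng(0)
d_model = 16
acts_ctx = rng.standard_normal((32, d_model)) + 0.5   # hypothetical "uses context" runs
acts_no_ctx = rng.standard_normal((32, d_model))      # hypothetical "ignores context" runs

v = steering_vector(acts_ctx, acts_no_ctx)
h = rng.standard_normal(d_model)
h_steered = steer(h, v, alpha=2.0)
```

Because the intervention is a single vector addition per steered layer, it needs no finetuning and adds little inference cost, which is consistent with the efficiency argument in the abstract.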

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Authors: Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu

First: 2026-01-07T17:40:08+00:00 · Latest: 2026-01-07T17:40:08+00:00

Comments: Work In Progress

Abs · PDF · Code1 · Code2

Abstract

GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well at generating a single webpage, building a realistic and functional website with many interconnected pages remains challenging. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. Our system also generates verifiable task evaluators, enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.

中文标题/摘要

标题：InfiniteWeb：大规模生成GUI代理训练用的可扩展网络环境

代表用户与图形界面交互的GUI代理是实用AI助手的一个有前途的方向。然而，训练这些代理受到合适环境稀缺的阻碍。我们提出了InfiniteWeb，这是一种自动大规模生成功能性的网络环境的系统，用于GUI代理训练。虽然大语言模型在生成单个网页方面表现良好，但构建具有许多相互连接页面的现实且功能性的网站面临挑战。我们通过统一规范、以任务为中心的测试驱动开发以及网站种子与参考设计图像的结合来解决这些挑战，以确保多样性。我们的系统还生成可验证的任务评估器，为强化学习提供密集的奖励信号。实验表明，InfiniteWeb在现实网站构建方面超越了商业编码代理，而基于我们生成环境训练的GUI代理在OSWorld和Online-Mind2Web上实现了显著的性能提升，证明了该系统的有效性。

Summary / 总结

The research aims to address the challenge of training GUI agents by synthesizing scalable web environments. The method involves using a unified specification and task-centric test-driven development to generate diverse and functional websites. Experiments show that InfiniteWeb outperforms commercial coding agents in constructing realistic websites, and GUI agents trained on these environments perform better on OSWorld and Online-Mind2Web tasks, validating the system's effectiveness.

研究旨在通过合成可扩展的网络环境来解决训练GUI代理的挑战。方法包括使用统一规范和以任务为中心的测试驱动开发来生成多样且功能齐全的网站。实验表明，InfiniteWeb在构建真实网站方面优于商业编码代理，并且在这些生成环境上训练的GUI代理在OSWorld和Online-Mind2Web任务上表现更好，验证了该系统的有效性。

SpatialTree: How Spatial Abilities Branch Out in MLLMs

Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang

First: 2025-12-23T18:59:46+00:00 · Latest: 2026-01-07T17:31:29+00:00

Comments: webpage: https://spatialtree.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.

中文标题/摘要

标题：SpatialTree：空间能力在多模态大语言模型中的分支发展

认知科学表明，空间能力是逐步发展的，从感知到推理和互动。然而，在多模态大语言模型（MLLMs）中，这种层次结构仍然不甚明了，因为大多数研究都集中在少数任务上。我们引入了SpatialTree，这是一种认知科学启发式的层次结构，将空间能力分为四个层次：低级感知（L1）、心理制图（L2）、模拟（L3）和能动性（L4）。基于这种分类法，我们构建了第一个能力导向的层次基准，全面评估了主流MLLMs的27个子能力。评估结果揭示了一个清晰的结构：L1技能大多相互独立，而更高层次的技能则高度相关，表明相互依赖性在增加。通过有针对性的监督微调，我们发现了一个令人惊讶的转移动态：L1内的负向转移，但低级到高级能力之间存在强烈的跨层次转移，且具有显著的协同效应。最后，我们探讨了如何改进整个层次结构。我们发现，鼓励大量“思考”的简单强化学习是不可靠的：它有助于复杂推理，但损害了直观感知。我们提出了一种简单的自动思考策略，抑制不必要的思考，使强化学习能够一致地提高所有层次的性能。通过构建SpatialTree，我们提供了一个概念验证框架，用于理解和系统地扩展MLLMs中的空间能力。

Summary / 总结

The research aims to understand the development of spatial abilities in multimodal LLMs (MLLMs) by introducing SpatialTree, a cognitive-science-inspired hierarchy. The study evaluates 27 sub-abilities across four levels: perception, mental mapping, simulation, and agentic competence. Results show that lowest-level (L1) skills are largely orthogonal, while higher-level skills are strongly correlated. Through targeted supervised fine-tuning, the study reveals negative transfer within the lowest level but strong cross-level transfer from lower to higher abilities. Naive reinforcement learning (RL) that encourages extensive "thinking" helps complex reasoning but hurts intuitive perception; a simple auto-think strategy that suppresses unnecessary deliberation lets RL improve performance consistently across all levels.

研究旨在通过引入名为SpatialTree的认知科学启发式层次结构来理解多模态语言模型（MLLMs）中的空间能力发展，该层次结构将空间能力分为四个层次：感知、心理映射、模拟和主动能力。研究评估了27个子能力在各种MLLMs中的表现，并发现较低层次的能力大多独立，而较高层次的能力则高度相关。通过有针对性的微调，研究发现较低层次内存在负向转移，但较低层次到较高层次的能力之间存在强大的跨层次转移。研究还建议，强化学习（RL）应谨慎使用，因为它可以提高复杂推理但会损害直观感知。提出了一种简单的自动思考策略来抑制不必要的思考，从而实现所有层次上的一致性能提升。

A Single-Loop Bilevel Deep Learning Method for Optimal Control of Obstacle Problems

Authors: Yongcun Song, Shangzhi Zeng, Jin Zhang, Lvgang Zhang

First: 2026-01-07T17:30:42+00:00 · Latest: 2026-01-07T17:30:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Optimal control of obstacle problems arises in a wide range of applications and is computationally challenging due to its nonsmoothness, nonlinearity, and bilevel structure. Classical numerical approaches rely on mesh-based discretization and typically require solving a sequence of costly subproblems. In this work, we propose a single-loop bilevel deep learning method, which is mesh-free, scalable to high-dimensional and complex domains, and avoids repeated solution of discretized subproblems. The method employs constraint-embedding neural networks to approximate the state and control and preserves the bilevel structure. To train the neural networks efficiently, we propose a Single-Loop Stochastic First-Order Bilevel Algorithm (S2-FOBA), which eliminates nested optimization and does not rely on restrictive lower-level uniqueness assumptions. We analyze the convergence behavior of S2-FOBA under mild assumptions. Numerical experiments on benchmark examples, including distributed and obstacle control problems with regular and irregular obstacles on complex domains, demonstrate that the proposed method achieves satisfactory accuracy while reducing computational cost compared to classical numerical methods.

中文标题/摘要

标题：一种单环 bilevel 深度学习方法用于障碍问题的最优控制

障碍问题的最优控制在多种应用中出现，但由于其非光滑性、非线性和 bilevel 结构，计算上具有挑战性。经典的数值方法依赖于基于网格的离散化，并通常需要解决一系列昂贵的子问题。在本文中，我们提出了一种单环 bilevel 深度学习方法，该方法无网格、可扩展到高维和复杂域，并避免了重复求解离散化子问题。该方法使用嵌入约束的神经网络来近似状态和控制，并保持 bilevel 结构。为了高效训练神经网络，我们提出了单环随机一阶 bilevel 算法 (S2-FOBA)，该算法消除了嵌套优化，并不依赖于下层问题的严格唯一性假设。在温和假设下，我们分析了 S2-FOBA 的收敛行为。在基准示例上的数值实验，包括分布式和障碍控制问题，具有规则和不规则障碍的复杂域，表明所提出的方法在计算成本降低的同时达到了满意的精度，优于经典数值方法。

Summary / 总结

This paper addresses the computational challenges of optimal control of obstacle problems, which are common in various applications. The authors propose a single-loop bilevel deep learning method that is mesh-free and scalable to high-dimensional domains, avoiding the need to repeatedly solve discretized subproblems. The method uses constraint-embedding neural networks and a novel Single-Loop Stochastic First-Order Bilevel Algorithm (S2-FOBA) for efficient training. Experiments show that the proposed method achieves good accuracy while reducing computational cost compared to traditional numerical methods.

本文针对障碍问题的最优控制计算难题，提出了一个无网格、高维可扩展的单环双层深度学习方法。该方法使用嵌入约束的神经网络来近似状态和控制，并保持双层结构。为了高效训练神经网络，作者引入了单环随机一阶双层算法（S2-FOBA），该算法避免了嵌套优化，并不要求严格的下层唯一性假设。数值实验表明，所提出的方法在保持良好精度的同时，减少了计算成本，优于传统数值方法。

GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu

First: 2026-01-07T17:26:41+00:00 · Latest: 2026-01-07T17:26:41+00:00

Abs · PDF · Code1 · Code2

Abstract

The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

中文标题/摘要

标题：GeoReason: 通过逻辑一致性强化学习使遥感视觉语言模型的思考与回答保持一致

遥感视觉语言模型(RS-VLMs)的发展强调了从感知中心的识别向高级演绎推理过渡的重要性，以增强复杂空间任务中的认知可靠性。然而，当前模型常常遭受逻辑幻觉的问题，即正确的答案是基于有缺陷的推理链或依赖于位置捷径而非空间逻辑。这种脱节削弱了在战略空间决策中的可靠性。为了解决这一问题，我们提出了GeoReason框架，旨在使内部思考与最终决策同步。我们首先构建了GeoReason-Bench，这是一个逻辑驱动的数据集，包含4,000条从几何原语和专家知识中合成的推理轨迹。然后我们制定了两阶段训练策略：(1) 监督知识初始化，以使模型具备推理语法和领域专业知识；(2) 一致性感知强化学习，以提高演绎可靠性。这一阶段整合了一种新颖的逻辑一致性奖励，通过选项排列策略惩罚逻辑漂移，以使决策基于可验证的推理轨迹。实验结果表明，我们的框架显著提高了RS-VLMs的认知可靠性和可解释性，达到了与其他先进方法相比的最优性能。

Summary / 总结

GeoReason is a framework that aims to improve the cognitive reliability of Remote Sensing Vision-Language Models (RS-VLMs) by aligning internal reasoning with final decisions. It introduces a two-stage training strategy: supervised knowledge initialization and consistency-aware reinforcement learning. The latter uses a Logical Consistency Reward to penalize logical drift, ensuring that decisions are based on verifiable reasoning. Experiments show that GeoReason significantly enhances the reliability and interpretability of RS-VLMs, outperforming other advanced methods.

GeoReason 是一个框架，通过解决逻辑幻觉问题来提升遥感视觉语言模型 (RS-VLMs) 的认知可靠性。它采用两阶段训练策略：监督知识初始化和一致性意识强化学习。后者使用逻辑一致性奖励来惩罚逻辑漂移，确保决策基于可验证的推理。实验表明，GeoReason 提高了 RS-VLMs 的可解释性和可靠性，并超越了其他先进方法。
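
The GeoReason abstract mentions an option permutation strategy but gives no formula for the Logical Consistency Reward. One plausible reading, sketched below purely as an assumption, is to re-ask the same multiple-choice question with the options reordered and reward the model only when it keeps selecting the same option content, which exposes positional shortcuts. The function names, the use of cyclic rotations instead of sampled permutations, and the two stub models are all hypothetical.

```python
from typing import Callable, List

def permutation_consistency(
    question: str,
    options: List[str],
    answer_fn: Callable[[str, List[str]], int],
) -> float:
    """Fraction of reordered presentations where the same option content is chosen.

    answer_fn returns the index of the chosen option in the list it is shown.
    Requires at least two distinct options.
    """
    base_choice = options[answer_fn(question, list(options))]
    # Cyclic rotations: a deterministic stand-in for sampling permutations.
    rotations = [options[k:] + options[:k] for k in range(1, len(options))]
    agree = sum(rot[answer_fn(question, rot)] == base_choice for rot in rotations)
    return agree / len(rotations)

def content_model(question: str, opts: List[str]) -> int:
    # Stub that answers by content, wherever the option appears.
    return opts.index("airport")

def position_model(question: str, opts: List[str]) -> int:
    # Stub with a positional shortcut: always picks the first option.
    return 0

options = ["airport", "harbor", "farmland", "forest"]
question = "Which land-use class does the scene show?"
r_content = permutation_consistency(question, options, content_model)
r_position = permutation_consistency(question, options, position_model)
```

The content-based stub scores 1.0 and the positional stub scores 0.0, so a reward of this shape would penalize exactly the positional shortcuts the abstract describes.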

Equivariant Neural Networks for Force-Field Models of Lattice Systems

Authors: Yunhao Fan, Gia-Wei Chern

First: 2026-01-07T17:09:04+00:00 · Latest: 2026-01-07T17:09:04+00:00

Comments: 13 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Machine-learning (ML) force fields enable large-scale simulations with near-first-principles accuracy at substantially reduced computational cost. Recent work has extended ML force-field approaches to adiabatic dynamical simulations of condensed-matter lattice models with coupled electronic and structural or magnetic degrees of freedom. However, most existing formulations rely on hand-crafted, symmetry-aware descriptors, whose construction is often system-specific and can hinder generality and transferability across different lattice Hamiltonians. Here we introduce a symmetry-preserving framework based on equivariant neural networks (ENNs) that provides a general, data-driven mapping from local configurations of dynamical variables to the associated on-site forces in a lattice Hamiltonian. In contrast to ENN architectures developed for molecular systems -- where continuous Euclidean symmetries dominate -- our approach aims to embed the discrete point-group and internal symmetries intrinsic to lattice models directly into the neural-network representation of the force field. As a proof of principle, we construct an ENN-based force-field model for the adiabatic dynamics of the Holstein Hamiltonian on a square lattice, a canonical system for electron-lattice physics. The resulting ML-enabled large-scale dynamical simulations faithfully capture mesoscale evolution of the symmetry-breaking phase, illustrating the utility of lattice-equivariant architectures for linking microscopic electronic processes to emergent dynamical behavior in condensed-matter lattice systems.

中文标题/摘要

标题：等变神经网络在晶格系统力场模型中的应用

机器学习（ML）力场能够以接近第一性原理的准确性进行大规模模拟，同时大幅降低计算成本。最近的工作将ML力场方法扩展到包含电子和结构或磁性自由度的凝聚态晶格模型的绝热动力学模拟中。然而，大多数现有方法依赖于手工构建、具有对称性意识的描述符，其构建往往针对特定系统，这会妨碍在不同晶格哈密顿量之间的通用性和可转移性。在此，我们引入了一种基于等变神经网络（ENN）的对称性保持框架，该框架提供了一种从动力学变量的局部配置到晶格哈密顿量中相应位点力的一般、数据驱动的映射。与分子系统中开发的ENN架构——其中连续欧几里得对称性占主导地位——不同，我们的方法旨在直接将晶格模型固有的离散点群对称性和内部对称性嵌入到力场的神经网络表示中。作为原理证明，我们构建了一个基于ENN的力场模型，用于描述平方晶格上霍尔斯坦哈密顿量的绝热动力学，这是一个电子-晶格物理的典型系统。由此产生的ML增强的大规模动力学模拟准确地捕捉到了对称性破缺相的中间尺度演化，展示了晶格等变架构在连接凝聚态晶格系统中的微观电子过程与涌现动力学行为之间的实用性。

Summary / 总结

This paper introduces a symmetry-preserving framework using equivariant neural networks (ENNs) for force-field models of lattice systems, addressing the limitations of hand-crafted descriptors in existing methods. The ENN-based approach embeds the discrete point-group and internal symmetries intrinsic to lattice models, enabling a general and data-driven mapping from local configurations to on-site forces. As a proof of principle, the model simulates the adiabatic dynamics of the Holstein Hamiltonian on a square lattice and faithfully captures the mesoscale evolution of the symmetry-breaking phase, demonstrating the potential of lattice-equivariant architectures for condensed-matter physics simulations.

本文提出了一种使用等变神经网络（ENN）的对称性保持框架，用于晶格系统的力场模型，解决了手工构建描述符的局限性。该方法直接将晶格模型固有的离散点群对称性和内部对称性嵌入到神经网络中，实现了从局部构型到位点力的通用、数据驱动的映射。基于ENN的模型被用于模拟方形晶格上霍尔斯坦哈密顿量的绝热动力学，捕捉到了对称性破缺相的介观尺度演化。

Quantifying the Impact of Modules and Their Interactions in the PSO-X Framework

Authors: Christian L. Camacho-Villalón, Ana Nikolikj, Katharina Dost, Eva Tuba, Sašo Džeroski, Tome Eftimov

First: 2026-01-07T17:06:05+00:00 · Latest: 2026-01-07T17:06:05+00:00

Abs · PDF · Code1 · Code2

Abstract

The PSO-X framework incorporates dozens of modules that have been proposed for solving single-objective continuous optimization problems using particle swarm optimization. While modular frameworks enable users to automatically generate and configure algorithms tailored to specific optimization problems, the complexity of this process increases with the number of modules in the framework and the degrees of freedom defined for their interaction. Understanding how modules affect the performance of algorithms for different problems is critical to making the process of finding effective implementations more efficient and identifying promising areas for further investigation. Despite their practical applications and scientific relevance, there is a lack of empirical studies investigating which modules matter most in modular optimization frameworks and how they interact. In this paper, we analyze the performance of 1424 particle swarm optimization algorithms instantiated from the PSO-X framework on the 25 functions in the CEC'05 benchmark suite with 10 and 30 dimensions. We use functional ANOVA to quantify the impact of modules and their combinations on performance in different problem classes. In practice, this allows us to identify which modules have greater influence on PSO-X performance depending on problem features such as multimodality, mathematical transformations and varying dimensionality. We then perform a cluster analysis to identify groups of problem classes that share similar module effect patterns. Our results show low variability in the importance of modules in all problem classes, suggesting that particle swarm optimization performance is driven by a few influential modules.
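To make the functional ANOVA step more concrete, here is a minimal sketch of its first-order part: the share of performance variance explained by each module on its own, computed over a full factorial grid of configurations. The module names, the binary levels and the toy score function are hypothetical and are not taken from PSO-X. The paper's analysis also covers module combinations, which this sketch leaves as the unexplained remainder.

```python
from itertools import product
from statistics import mean, pvariance

def main_effect_shares(levels, score):
    """Fraction of total variance explained by each factor's main effect.

    levels: dict mapping a factor name to its list of possible values.
    score:  function taking a dict {factor: value} and returning a number.
    """
    names = list(levels)
    configs = list(product(*(levels[n] for n in names)))
    values = [score(dict(zip(names, c))) for c in configs]
    total = pvariance(values)
    shares = {}
    for i, name in enumerate(names):
        # Marginal mean of the score for each level, averaging out other factors.
        marginals = [
            mean(v for c, v in zip(configs, values) if c[i] == lvl)
            for lvl in levels[name]
        ]
        shares[name] = pvariance(marginals) / total
    return shares

# Hypothetical modules with two options each, and a toy performance model
# containing one interaction term (topology x perturbation).
LEVELS = {"topology": [0, 1], "inertia": [0, 1], "perturbation": [0, 1]}

def toy_score(cfg):
    return 4 * cfg["topology"] + cfg["inertia"] + cfg["topology"] * cfg["perturbation"]

shares = main_effect_shares(LEVELS, toy_score)
```

In this toy setting one module dominates the variance, and the shares sum to less than one because part of the variance sits in the interaction term.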

中文标题/摘要

标题：量化PSO-X框架中模块及其交互的影响

PSO-X框架整合了用于解决单目标连续优化问题的数十个模块，采用粒子群优化方法。模块化框架使用户能够自动生成和配置针对特定优化问题的算法，但随着框架中模块数量的增加和它们之间自由度的定义，这一过程的复杂性也随之增加。理解模块如何影响不同问题的算法性能对于提高找到有效实现的效率以及确定进一步研究的有希望领域至关重要。尽管模块化优化框架具有实际应用和科学意义，但缺乏研究探讨哪些模块在这些框架中最重要以及它们如何相互作用的实证研究。在本文中，我们分析了从PSO-X框架实例化出的1424个粒子群优化算法在CEC'05基准套件中的25个函数（10和30维）上的性能。我们使用函数ANOVA来量化不同问题类别中模块及其组合对性能的影响。在实践中，这使我们能够根据问题特征（如多模态性、数学变换和不同维度）识别出对PSO-X性能影响更大的模块。然后，我们进行聚类分析以识别具有相似模块效应模式的问题类别组。我们的结果表明，在所有问题类别中，模块的重要性变化很小，这表明粒子群优化性能主要由少数几个有影响力的模块驱动。

Summary / 总结

This paper aims to understand the impact of modules and their interactions in the PSO-X framework, which is used for solving single-objective continuous optimization problems. The authors use functional ANOVA to analyze the performance of 1424 particle swarm optimization algorithms on 25 benchmark functions, identifying which modules have the greatest influence on performance based on problem features. The results indicate that particle swarm optimization performance is primarily driven by a few influential modules, with low variability across different problem classes.

本文旨在理解PSO-X框架中模块及其相互作用的影响，该框架用于解决单目标连续优化问题。作者使用函数型方差分析（functional ANOVA）分析了1424个粒子群优化算法在25个基准函数上的性能，并根据问题特征确定了哪些模块对性能影响最大。结果表明，粒子群优化性能主要由少数几个有影响力的模块驱动，模块重要性在不同问题类别之间的差异较小。

S2Vec: Self-Supervised Geospatial Embeddings for the Built Environment

Authors: Shushman Choudhury, Elad Aharoni, Chandrakumari Suvarna, Iveel Tsogsuren, Abdul Rahman Kreidieh, Chun-Ta Lu, Neha Arora

Venue: ACM Transactions on Spatial Algorithms and Systems 2026

First: 2025-04-10T20:16:02+00:00 · Latest: 2026-01-07T16:58:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Scalable general-purpose representations of the built environment are crucial for geospatial artificial intelligence applications. This paper introduces S2Vec, a novel self-supervised framework for learning such geospatial embeddings. S2Vec uses the S2 Geometry library to partition large areas into discrete S2 cells, rasterizes built environment feature vectors within cells as images, and applies masked autoencoding on these rasterized images to encode the feature vectors. This approach yields task-agnostic embeddings that capture local feature characteristics and broader spatial relationships. We evaluate S2Vec on several large-scale geospatial prediction tasks, both random train/test splits (interpolation) and zero-shot geographic adaptation (extrapolation). Our experiments show S2Vec's competitive performance against several baselines on socioeconomic tasks, especially the geographic adaptation variant, with room for improvement on environmental tasks. We also explore combining S2Vec embeddings with image-based embeddings downstream, showing that such multimodal fusion can often improve performance. Our findings highlight how S2Vec can learn effective general-purpose geospatial representations of the built environment features it is provided, and how it can complement other data modalities in geospatial artificial intelligence.

中文标题/摘要

标题：S2Vec：建筑环境的自监督地理空间嵌入

可扩展的通用建筑环境表示对于地理空间人工智能应用至关重要。本文介绍了S2Vec，这是一种新颖的自监督框架，用于学习此类地理空间嵌入。S2Vec 使用S2几何库将大面积区域划分为离散的S2单元，将单元内的建筑环境特征向量栅格化为图像，并在这些栅格化图像上应用掩码自编码以编码特征向量。该方法生成了任务无关的嵌入，能够捕捉局部特征特性和更广泛的空间关系。我们在多个大规模地理空间预测任务上评估了S2Vec，包括随机训练/测试拆分（内插）和零样本地理适应（外推）。我们的实验表明，S2Vec 在社会经济任务上与几个基线相比具有竞争力，特别是在地理适应变体方面，但在环境任务上仍有改进空间。我们还探索了在下游将S2Vec嵌入与基于图像的嵌入结合使用，表明这种多模态融合通常可以提高性能。我们的研究结果突显了S2Vec如何针对所提供的建筑环境特征学习有效的通用地理空间表示，以及它如何在地理空间人工智能中补充其他数据模态。

Summary / 总结

S2Vec is a self-supervised framework for learning geospatial embeddings of the built environment. It uses S2 Geometry to partition areas into cells, rasterizes feature vectors as images, and applies masked autoencoding. S2Vec shows competitive performance on socioeconomic tasks, particularly in geographic adaptation, and can improve performance when combined with image-based embeddings. The method effectively captures local and spatial relationships, making it a valuable tool for geospatial AI applications.

S2Vec 是一个自监督框架，通过 S2 几何将大区域划分为离散单元，并在栅格化特征向量上应用掩码自编码来学习任务无关的地理空间嵌入。它在社会经济预测任务上表现出有竞争力的性能，尤其是在地理适应方面，而在环境任务上仍有改进空间；与图像嵌入结合使用时通常还能进一步提升性能。

Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction

Authors: Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao

First: 2026-01-07T16:57:30+00:00 · Latest: 2026-01-07T16:57:30+00:00

Comments: Project page: https://xdimlab.github.io/Gen3R/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.

中文标题/摘要

标题：Gen3R：三维场景生成与前馈重建的结合

我们提出了Gen3R，一种将基础重建模型的强大先验与视频扩散模型结合用于场景级三维生成的方法。我们重新利用了VGGT重建模型，通过训练一个适配器在其标记上生成几何潜在变量，并将这些潜在变量正则化以与预训练的视频扩散模型的外观潜在变量对齐。通过联合生成这些解耦但对齐的潜在变量，Gen3R 生成了RGB视频及其对应的3D几何，包括相机姿态、深度图和全局点云。实验表明，我们的方法在单图和多图条件下的3D场景生成中达到了最先进的效果。此外，我们的方法可以通过利用生成先验增强重建的鲁棒性，展示了重建和生成模型紧密耦合的互惠益处。

Summary / 总结

Gen3R integrates the geometric priors of reconstruction models with the generative power of video diffusion models to generate 3D scenes. It trains an adapter on VGGT tokens to produce geometric latents that align with appearance latents from pre-trained video diffusion models. This approach jointly generates RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments show that Gen3R outperforms existing methods in single- and multi-image conditioned 3D scene generation and enhances reconstruction robustness through generative priors.

Gen3R 是一种结合重建模型先验和视频扩散模型的方法，用于 3D 场景生成。它通过训练 VGGT 重建标记上的适配器来生成几何潜变量，并与预训练的视频扩散模型的外观潜变量对齐。这种方法生成了包括 RGB 视频和相应的 3D 几何（如相机姿态、深度图和全局点云）。实验表明，Gen3R 在单图像和多图像条件下的 3D 场景生成中优于现有方法，并通过生成先验增强重建的鲁棒性。

Cells on Autopilot: Adaptive Cell (Re)Selection via Reinforcement Learning

Authors: Marvin Illian, Ramin Khalili, Antonio A. de A. Rocha, Lin Wang

First: 2026-01-07T16:51:33+00:00 · Latest: 2026-01-07T16:51:33+00:00

Comments: 11 pages, 12 figures

Abs · PDF · Code1 · Code2

Abstract

The widespread deployment of 5G networks, together with the coexistence of 4G/LTE networks, provides mobile devices a diverse set of candidate cells to connect to. However, associating mobile devices to cells to maximize overall network performance, a.k.a. cell (re)selection, remains a key challenge for mobile operators. Today, cell (re)selection parameters are typically configured manually based on operator experience and rarely adapted to dynamic network conditions. In this work, we ask: Can an agent automatically learn and adapt cell (re)selection parameters to consistently improve network performance? We present a reinforcement learning (RL)-based framework called CellPilot that adaptively tunes cell (re)selection parameters by learning spatiotemporal patterns of mobile network dynamics. Our study with real-world data demonstrates that even a lightweight RL agent can outperform conventional heuristic reconfigurations by up to 167%, while generalizing effectively across different network scenarios. These results indicate that data-driven approaches can significantly improve cell (re)selection configurations and enhance mobile network performance.

中文标题/摘要

标题：小区自动驾驶：通过强化学习实现自适应小区（重）选择

5G网络的广泛部署以及4G/LTE网络的共存，为移动设备提供了多样化的候选小区连接选择。然而，将移动设备与小区关联起来以最大化整体网络性能，即小区（重）选择，仍然是移动运营商面临的关键挑战。目前，小区（重）选择参数通常基于运营商经验手动配置，并且很少适应动态网络条件。在本研究中，我们提出的问题是：是否可以使用代理自动学习和适应小区（重）选择参数，以持续提升网络性能？我们提出了一种基于强化学习（RL）的框架CellPilot，通过学习移动网络动态的空间和时间模式来自适应调整小区（重）选择参数。我们的研究使用实际数据表明，即使是一个轻量级的RL代理，也可以比传统的启发式重新配置提高高达167%的性能，同时在不同网络场景中表现出良好的泛化能力。这些结果表明，数据驱动的方法可以显著改善小区（重）选择配置并增强移动网络性能。

Summary / 总结

This paper addresses the challenge of cell (re)selection in 5G networks by proposing a reinforcement learning (RL) framework called CellPilot. The framework automatically tunes cell (re)selection parameters to adapt to dynamic network conditions, improving overall network performance. Experimental results show that CellPilot outperforms conventional methods by up to 167% and generalizes well across different network scenarios.

该论文通过提出基于强化学习（RL）的框架CellPilot来解决5G网络中的小区（重）选择挑战。该框架能够自动学习和调整小区（重）选择参数以提升网络性能。实验结果表明，CellPilot相较于传统方法可提升高达167%的性能，并且在不同网络场景下具有良好的泛化能力。

User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

Authors: Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue

First: 2025-10-23T16:38:26+00:00 · Latest: 2026-01-07T16:46:55+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) are rapidly being adopted for tasks like drafting emails, summarizing meetings, and answering health questions. In these settings, users may need to share private information (e.g., contact details, health records). To evaluate LLMs' ability to identify and redact such information, prior work introduced real-life, scenario-based benchmarks (e.g., ConfAIde, PrivacyLens) and found that LLMs can leak private information in complex scenarios. However, these evaluations relied on proxy LLMs to judge the helpfulness and privacy-preservation quality of LLM responses, rather than directly measuring users' perceptions. To understand how users perceive the helpfulness and privacy-preservation quality of LLM responses to privacy-sensitive scenarios, we conducted a user study ($n=94$) using 90 PrivacyLens scenarios. We found that users had low agreement with each other when evaluating identical LLM responses. In contrast, five proxy LLMs reached high agreement, yet each proxy LLM had low correlation with users' evaluations. These results indicate that proxy LLMs cannot accurately estimate users' wide range of perceptions of utility and privacy in privacy-sensitive scenarios. We discuss the need for more user-centered studies to measure LLMs' ability to help users while preserving privacy, and for improving alignment between LLMs and users in estimating perceived privacy and utility.

中文标题/摘要

标题：用户对隐私敏感场景中LLM响应的隐私感知与帮助感知

大型语言模型（LLMs）正迅速被用于撰写邮件、总结会议和回答健康问题等任务。在这种情况下，用户可能需要分享私人信息（如联系方式、健康记录）。为了评估LLMs识别和屏蔽此类信息的能力，先前的研究引入了基于真实场景的实际基准（如ConfAIde、PrivacyLens），并发现LLMs在复杂场景中可能会泄露私人信息。然而，这些评估依赖于代理LLMs来判断LLM响应的帮助性和隐私保护质量，而不是直接测量用户的感知。为了了解用户如何感知隐私敏感场景中LLM响应的帮助性和隐私保护质量，我们使用90个PrivacyLens场景进行了用户研究（n=94）。我们发现，用户在评估相同的LLM响应时意见分歧很大。相反，五个代理LLMs达成了一致，但每个代理LLM与用户评估的相关性都很低。这些结果表明，代理LLMs无法准确估计用户在隐私敏感场景中对实用性和隐私的广泛感知。我们讨论了需要更多以用户为中心的研究来衡量LLMs帮助用户同时保护隐私的能力，并提高LLMs与用户在估计感知隐私和实用性的匹配度。

Summary / 总结

The study evaluates users' perceptions of privacy and helpfulness in LLM responses to privacy-sensitive scenarios using 90 PrivacyLens scenarios. Despite high agreement among five proxy LLMs, users showed low agreement in their evaluations, indicating that proxy LLMs cannot accurately estimate users' perceptions. The research highlights the need for user-centered studies to better measure LLMs' ability to help while preserving privacy and improving alignment between LLMs and users in estimating perceived privacy and utility.

研究通过使用90个PrivacyLens场景进行94名用户的实验，评估用户对隐私敏感场景中LLM响应的隐私感知和帮助感知。结果显示，用户之间对LLM响应的评价分歧较大，但五个代理LLM之间达成高度一致，表明代理LLM无法准确估计用户的隐私和实用性感知。研究强调了需要进行用户中心的评估，以更好地使LLM与用户感知相一致。

Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts

Authors: Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin

First: 2026-01-07T16:39:34+00:00 · Latest: 2026-01-07T16:39:34+00:00

Comments: 10 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.

中文标题/摘要

标题：大型多模态模型在跨模态冲突下的推理一致性分析

大型多模态模型（LMMs）在通过思维链（CoT）进行视频推理方面展现了令人印象深刻的能力。然而，它们推理链的鲁棒性仍然值得怀疑。在本文中，我们识别出一种关键的失败模式，称为文本惯性，即在思维过程中一旦出现文本幻觉，模型往往会盲目地坚持错误的文本，而忽视与之冲突的视觉证据。为了系统地研究这一问题，我们提出了LogicGraph扰动协议，该协议结构化地向不同LMMs的推理链中注入扰动，涵盖原生推理架构和提示驱动范式，以评估它们的自我反思能力。结果显示，模型仅在不到10%的情况下成功自我纠正，大多数情况下会陷入盲目的文本错误传播。为了缓解这一问题，我们引入了主动视觉上下文精炼，这是一种无需训练的推理范式，它通过主动的视觉重新定位机制强制执行细粒度验证，并结合自适应上下文精炼策略来总结和去噪推理历史。实验表明，我们的方法显著抑制了幻觉传播并增强了推理鲁棒性。

Semantic-E2VID: a Semantic-Enriched Paradigm for Event-to-Video Reconstruction

Authors: Jingqian Wu, Yunbo Jia, Shengpeng Xu, Edmund Y. Lam

First: 2025-10-20T09:45:13+00:00 · Latest: 2026-01-07T16:35:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Event cameras provide a promising sensing modality for high-speed and high-dynamic-range vision by asynchronously capturing brightness changes. A fundamental task in event-based vision is event-to-video (E2V) reconstruction, which aims to recover intensity videos from event streams. Most existing E2V approaches formulate reconstruction as a temporal-spatial signal recovery problem, relying on temporal aggregation and spatial feature learning to infer intensity frames. While effective to some extent, this formulation overlooks a critical limitation of event data: due to the change-driven sensing mechanism, event streams are inherently semantically under-determined, lacking object-level structure and contextual information that are essential for faithful reconstruction. In this work, we revisit E2V from a semantic perspective and argue that effective reconstruction requires going beyond temporal and spatial modeling to explicitly account for missing semantic information. Based on this insight, we propose Semantic-E2VID, a semantic-enriched end-to-end E2V framework that reformulates reconstruction as a process of semantic learning, fusing and decoding. Our approach first performs semantic abstraction by bridging event representations with semantics extracted from a pretrained Segment Anything Model (SAM), while avoiding modality-induced feature drift. The learned semantics are then fused into the event latent space in a representation-compatible manner, enabling event features to capture object-level structure and contextual cues. Furthermore, semantic-aware supervision is introduced to explicitly guide the reconstruction process toward semantically meaningful regions, complementing conventional pixel-level and temporal objectives. Extensive experiments on six public benchmarks demonstrate that Semantic-E2VID consistently outperforms state-of-the-art E2V methods.

中文标题/摘要

标题：Semantic-E2VID：一种语义增强的事件到视频重建范式

事件相机通过异步捕捉亮度变化提供了一种有前景的高帧率和高动态范围视觉传感模态。事件驱动视觉中的一个基本任务是事件到视频（E2V）重建，其目标是从事件流中恢复亮度视频。大多数现有的E2V方法将重建公式化为一个时域-空域信号恢复问题，依赖于时间聚合和空域特征学习来推断亮度帧。虽然在一定程度上是有效的，但这种公式忽略了事件数据的一个关键限制：由于基于变化的传感机制，事件流本质上是语义欠定的，缺乏对象级结构和上下文信息，这些对于忠实重建是必不可少的。在本文中，我们从语义角度重新审视E2V，并认为有效的重建需要超越时间和空间建模，明确考虑缺失的语义信息。基于这一见解，我们提出了Semantic-E2VID，这是一种语义增强的端到端E2V框架，将重建公式化为语义学习、融合和解码的过程。我们的方法首先通过将事件表示与预训练的Segment Anything Model (SAM) 提取的语义进行桥梁，进行语义抽象，同时避免模态引起的特征漂移。学习到的语义以表示兼容的方式融合到事件潜在空间中，使事件特征能够捕捉对象级结构和上下文线索。此外，引入了语义感知监督，明确指导重建过程向语义有意义的区域发展，补充传统的像素级和时域目标。在六个公开基准上的广泛实验表明，Semantic-E2VID 一致地优于最先进的E2V方法。

Summary / 总结

The paper addresses the challenge of event-to-video reconstruction using event cameras, which capture brightness changes asynchronously. Most existing methods focus on temporal and spatial signal recovery, but this approach overlooks the semantic limitations of event data. To address this, the authors propose Semantic-E2VID, which enriches the reconstruction process by incorporating semantic information. This is achieved through semantic abstraction using a pretrained Segment Anything Model and fusing learned semantics into the event latent space. The method also introduces semantic-aware supervision to guide the reconstruction process. Experiments show that Semantic-E2VID outperforms existing methods on six public benchmarks.

研究旨在通过引入语义信息来解决现有事件到视频重建方法的局限性。提出的Semantic-E2VID框架将重建过程重新定义为一个语义学习任务，将预训练的Segment Anything Model提取的语义信息整合到事件潜空间中。这种方法增强了捕捉对象级结构和上下文线索的能力，从而提高了重建质量。在六个公开基准上的实验表明，Semantic-E2VID优于现有方法。

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Authors: Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

First: 2026-01-07T16:32:17+00:00 · Latest: 2026-01-07T16:32:17+00:00

Comments: Under Review

Abs · PDF · Code1 · Code2

Abstract

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
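The abstract does not spell out the region-aware loss, so the following is only a sketch under assumptions: it takes a Diffusion-DPO-style objective on per-element denoising errors and restricts the error averages to a binary mask marking the corrupted region. The function names, the flat list representation and the `beta` value are hypothetical.

```python
import math

def masked_mean(errors, mask):
    """Average the per-element errors over positions where mask == 1."""
    return sum(e * m for e, m in zip(errors, mask)) / sum(mask)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def region_dpo_loss(err_w, err_w_ref, err_l, err_l_ref, mask, beta=1.0):
    """DPO-style preference loss computed only inside the masked region.

    err_w / err_l:         denoising errors of the trained model on the
                           preferred (real) and rejected (restored) sample.
    err_w_ref / err_l_ref: the same errors under the frozen reference model.
    """
    gain_w = masked_mean(err_w, mask) - masked_mean(err_w_ref, mask)
    gain_l = masked_mean(err_l, mask) - masked_mean(err_l_ref, mask)
    logit = -beta * (gain_w - gain_l)
    return softplus(-logit)  # equals -log(sigmoid(logit))
```

When the trained model matches the reference, the loss equals log 2; it falls below that value once the model fits the preferred sample better than the reference does inside the mask. Errors outside the mask have no effect.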

中文标题/摘要

标题：关注生成细节：面向视频扩散模型的直接局部细节偏好优化

使文本到视频的扩散模型与人类偏好对齐对于生成高质量视频至关重要。现有的直接偏好优化（DPO）方法依赖于多样本排名和特定任务的批评模型，这既不高效，又经常导致模糊的全局监督。为了解决这些限制，我们提出了一种名为LocalDPO的新型后训练框架，该框架从真实视频中构建局部偏好对，并在时空区域级别优化对齐。我们设计了一个自动流水线来高效收集偏好对数据，该数据通过每次提示仅进行一次推理生成偏好对，从而消除对外部批评模型或手动注释的需求。具体而言，我们将高质量的真实视频视为正样本，并通过局部添加随机时空遮罩并仅恢复被遮罩区域来生成相应的负样本，使用冻结的基础模型。在训练过程中，我们引入了一种区域感知的DPO损失，该损失限制偏好学习仅在被破坏的区域进行，以实现快速收敛。在Wan2.1和CogVideoX上的实验表明，LocalDPO在视频保真度、时间连贯性和人类偏好得分方面均优于其他后训练方法，建立了更高效和精细的视频生成器对齐范式。

Summary / 总结

The research aims to improve the alignment between text-to-video diffusion models and human preferences. It introduces LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. The method uses an automated pipeline to generate preference pairs with a single inference per prompt, and introduces a region-aware DPO loss to restrict preference learning to corrupted areas. Experiments show that LocalDPO enhances video fidelity, temporal coherence, and human preference scores compared to other post-training approaches.

研究旨在通过解决现有直接偏好优化（DPO）方法的效率低和模糊性问题，改善文本到视频扩散模型与人类偏好的对齐。LocalDPO是一种新型后训练框架，从真实视频中构建局部偏好对，并在时空区域级别优化对齐。它使用自动化流水线生成每个提示单次推理的偏好对，避免了外部批评模型或人工标注的需要。实验表明，LocalDPO在提高视频保真度、时间连贯性和人类偏好评分方面优于其他后训练方法。

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Authors: Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

First: 2026-01-07T16:26:33+00:00 · Latest: 2026-01-07T16:26:33+00:00

Comments: Project page: https://lin-shan.com/CLAP/

Abs · PDF · Code1 · Code2

Abstract

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from visual entanglement, capturing noise rather than manipulation skills. To address this, we propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. By employing contrastive learning, CLAP maps video transitions onto a quantized, physically executable codebook. Building on this representation, we introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation. Furthermore, we propose a Knowledge Matching (KM) regularization strategy to mitigate catastrophic forgetting during fine-tuning. Extensive experiments demonstrate that CLAP significantly outperforms strong baselines, enabling the effective transfer of skills from human videos to robotic execution. Project page: https://lin-shan.com/CLAP/.
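The abstract says only that contrastive learning aligns the video latent space with the proprioceptive one. A common way to do this is a symmetric or one-directional InfoNCE loss over paired embeddings, sketched below in one direction. Whether CLAP uses this exact loss is an assumption, and the names, the temperature value and the toy vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(video_latents, proprio_latents, temperature=0.1):
    """One-directional InfoNCE: row i of each input is a matched pair."""
    videos = [normalize(v) for v in video_latents]
    proprios = [normalize(p) for p in proprio_latents]
    total = 0.0
    for i, v in enumerate(videos):
        logits = [sum(a * b for a, b in zip(v, p)) / temperature for p in proprios]
        peak = max(logits)
        # log-sum-exp over all candidates, shifted by the peak for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(videos)

# Toy check: matched pairs give a much lower loss than mismatched ones.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(videos, [[2.0, 0.0], [0.0, 3.0]])
shuffled_loss = info_nce(videos, [[0.0, 3.0], [2.0, 0.0]])
```

Minimizing this loss pulls each video transition toward its paired robot-trajectory embedding and pushes it away from the other trajectories in the batch.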

中文标题/摘要

标题：CLAP：用于从人类视频中学习视觉-语言-动作模型的对比潜在动作预训练

通用的视觉-语言-动作模型目前受到机器人数据稀缺的限制，相比之下，人类视频示范却非常丰富。现有的潜在动作模型试图利用视频数据，但往往受到视觉纠缠的影响，捕获的是噪声而非操作技能。为了解决这个问题，我们提出了对比潜在动作预训练（CLAP），这是一种将视频中的视觉潜在空间与机器人轨迹中的本体感受潜在空间对齐的框架。通过使用对比学习，CLAP将视频过渡映射到一个量化且物理可执行的码本上。在此基础上，我们引入了一种双形式的VLA框架，包括CLAP-NTP，一种在指令跟随和对象泛化方面表现出色的自回归模型，以及CLAP-RF，一种用于高频率精确操作的修正流策略。此外，我们提出了一种知识匹配（KM）正则化策略，以减轻微调过程中的灾难性遗忘。广泛的实验表明，CLAP显著优于强大的基线模型，能够有效将人类视频中的技能转移到机器人执行中。项目页面：https://lin-shan.com/CLAP/

Summary / 总结

The research aims to improve generalist Vision-Language-Action models by addressing the scarcity of robotic data. The proposed Contrastive Latent Action Pretraining (CLAP) framework aligns video data with robot trajectories using contrastive learning. CLAP creates a quantized, physically executable codebook and introduces a dual-formulation VLA framework, including CLAP-NTP for instruction following and CLAP-RF for precise manipulation. Experiments show that CLAP outperforms strong baselines, effectively transferring skills from human videos to robotic execution.

研究旨在通过解决机器人数据稀缺性问题，改善通用的视觉-语言-动作模型。CLAP框架通过对比学习对齐视频和机器人轨迹的潜在空间，创建一个量化且物理可执行的代码本。研究引入了CLAP-NTP和CLAP-RF模型，分别在指令跟随和高频精确操作方面表现出色。知识匹配正则化策略有助于防止遗忘。实验表明，CLAP显著优于强基线，实现了从人类视频到机器人执行的有效技能转移。

Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding

Authors: Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan, Xiaowen Chu, Weijie Hong, Xu Chen

First: 2025-12-08T09:32:47+00:00 · Latest: 2026-01-07T16:24:34+00:00

Comments: Accepted by IEEE International Conference on Computer Communications 2026

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address that, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus proposes an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes incoming queries from memory, and employs a threshold-based progressive sampling algorithm for keyframe selection that enhances diversity and adaptively balances system cost and reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.

中文标题/摘要

标题：Venus：一种面向基于VLM的在线视频理解的高效边缘记忆与检索系统

视觉语言模型（VLMs）展示了令人印象深刻的多模态理解能力，并被部署在越来越多的在线视频理解应用中。尽管最近的工作广泛探索了在这些场景下增强VLMs的推理能力，但部署约束被忽视了，导致实际部署中的系统开销过大。为了解决这个问题，我们提出了Venus，一种用于高效在线视频理解的端侧记忆与检索系统。Venus提出了一种边缘-云分离架构，将记忆构建和关键帧检索从云端下沉到边缘，并分两个阶段运行。在摄取阶段，Venus通过场景分割和聚类持续处理边缘侧的流式视频，选定的关键帧由多模态嵌入模型进行嵌入，以构建便于高效存储和检索的层次化记忆。在查询阶段，Venus从记忆中为传入的查询进行索引，并采用基于阈值的渐进采样算法进行关键帧选择，以增强多样性并自适应地平衡系统成本和推理准确性。我们的大量评估表明，与最先进的方法相比，Venus在总响应延迟上实现了15倍至131倍的加速，能够在几秒钟内实现实时响应，同时保持相当甚至更优的推理准确性。

Summary / 总结

Venus is an on-device memory-and-retrieval system designed to enhance the efficiency of online video understanding using vision-language models (VLMs). It proposes an edge-cloud disaggregated architecture to reduce system overhead, with two stages: ingestion and querying. During ingestion, Venus processes streaming videos and builds a hierarchical memory for efficient storage and retrieval. In the querying stage, it indexes incoming queries and uses a threshold-based progressive sampling algorithm to select keyframes, balancing system cost and reasoning accuracy. Venus achieves a 15x-131x speedup in total response latency compared to existing methods, enabling real-time responses within seconds while maintaining comparable or superior reasoning accuracy.

Venus 是一种在设备端的内存和检索系统，旨在通过视觉语言模型提高在线视频理解的效率。它提出了一种边缘-云分离架构来减少系统开销。Venus 通过场景分割和聚类处理流媒体视频，将选定的关键帧嵌入多模态嵌入模型以构建高效存储和检索的层次化内存。在查询阶段，它使用基于阈值的渐进采样算法选择关键帧，平衡系统成本和推理准确性。Venus 在总响应延迟上实现了 15 倍至 131 倍的加速，能够在几秒钟内实现实时响应，同时保持或提高推理准确性。

Minimum distance classification for nonlinear dynamical systems

Authors: Dominique Martinez

First: 2026-01-07T16:21:47+00:00 · Latest: 2026-01-07T16:21:47+00:00

Abs · PDF · Code1 · Code2

Abstract

We address the problem of classifying trajectory data generated by some nonlinear dynamics, where each class corresponds to a distinct dynamical system. We propose Dynafit, a kernel-based method for learning a distance metric between training trajectories and the underlying dynamics. New observations are assigned to the class with the most similar dynamics according to the learned metric. The learning algorithm approximates the Koopman operator which globally linearizes the dynamics in a (potentially infinite) feature space associated with a kernel function. The distance metric is computed in feature space independently of its dimensionality by using the kernel trick common in machine learning. We also show that the kernel function can be tailored to incorporate partial knowledge of the dynamics when available. Dynafit is applicable to various classification tasks involving nonlinear dynamical systems and sensors. We illustrate its effectiveness on three examples: chaos detection with the logistic map, recognition of handwritten dynamics and of visual dynamic textures.
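The kernel trick mentioned above rests on the identity ‖φ(x) − φ(y)‖² = k(x, x) − 2k(x, y) + k(y, y), which gives a feature-space distance without ever forming φ. The sketch below checks the identity against the explicit feature map of a degree-2 polynomial kernel and then applies it to a Gaussian kernel, whose feature space is infinite-dimensional. It illustrates only the identity, not Dynafit's learned metric; the function names and the choice of kernels are hypothetical.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    """Explicit feature map satisfying phi(x) . phi(y) == poly2_kernel(x, y)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel; its feature space is infinite-dimensional."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance_sq(x, y, kernel):
    """Squared feature-space distance using only kernel evaluations."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def explicit_distance_sq(x, y):
    """Same distance computed directly in the polynomial feature space."""
    fx, fy = poly2_features(x), poly2_features(y)
    return sum((a - b) ** 2 for a, b in zip(fx, fy))

x, y = (1.0, 2.0), (0.5, -1.5)
via_kernel = kernel_distance_sq(x, y, poly2_kernel)
via_features = explicit_distance_sq(x, y)
rbf_distance = kernel_distance_sq(x, y, rbf_kernel)
```

The two polynomial results agree up to rounding, while the Gaussian case has no finite explicit map to compare against, which is exactly where the kernel form is needed.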

中文标题/摘要

标题：非线性动力系统中的最小距离分类

我们研究由非线性动力学生成的轨迹数据的分类问题，其中每个类别对应一个不同的动力系统。我们提出了Dynafit，一种基于核的方法，用于学习训练轨迹与潜在动力学之间的距离度量。新的观测值根据学习到的度量被分配到动力学最相似的类别。学习算法近似了Koopman算子，该算子在与核函数相关联的（可能是无限维的）特征空间中将动力学全局线性化。通过使用机器学习中常见的核技巧，距离度量可以在特征空间中计算，且与其维度无关。我们还表明，当掌握部分动力学知识时，可以对核函数进行定制以纳入这些知识。Dynafit适用于涉及非线性动力系统和传感器的各种分类任务。我们通过三个示例展示了其有效性：基于逻辑斯谛映射的混沌检测、手写动态识别和视觉动态纹理识别。

LinkD: AutoRegressive Diffusion Model for Mechanical Linkage Synthesis

Authors: Yayati Jadhav, Amir Barati Farimani

First: 2026-01-07T16:19:11+00:00 · Latest: 2026-01-07T16:19:11+00:00

Abs · PDF · Code1 · Code2

Abstract

Designing mechanical linkages to achieve target end-effector trajectories presents a fundamental challenge due to the intricate coupling between continuous node placements, discrete topological configurations, and nonlinear kinematic constraints. The highly nonlinear motion-to-configuration relationship means small perturbations in joint positions drastically alter trajectories, while the combinatorially expanding design space renders conventional optimization and heuristic methods computationally intractable. We introduce an autoregressive diffusion framework that exploits the dyadic nature of linkage assembly by representing mechanisms as sequentially constructed graphs, where nodes correspond to joints and edges to rigid links. Our approach combines a causal transformer with a Denoising Diffusion Probabilistic Model (DDPM), both conditioned on target trajectories encoded via a transformer encoder. The causal transformer autoregressively predicts discrete topology node-by-node, while the DDPM refines each node's spatial coordinates and edge connectivity to previously generated nodes. This sequential generation enables adaptive trial-and-error synthesis where problematic nodes exhibiting kinematic locking or collisions can be selectively regenerated, allowing autonomous correction of degenerate configurations during design. Our graph-based, data-driven methodology surpasses traditional optimization approaches, enabling scalable inverse design that generalizes to mechanisms with arbitrary node counts. We demonstrate successful synthesis of linkage systems containing up to 20 nodes with extensibility to N-node architectures. This work advances autoregressive graph generation methodologies and computational kinematic synthesis, establishing new paradigms for scalable inverse design of complex mechanical systems.
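For readers unfamiliar with DDPMs, the sketch below shows the standard forward (noising) process that such a model learns to invert, applied here to 2-D joint coordinates: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, its endpoints, the step count and the function names are assumptions for illustration; the abstract does not state LinkD's schedule or coordinate parameterization.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_joints(joints, alpha_bar, rng):
    """Sample noisy joint positions x_t from clean positions x_0."""
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [
        (keep * x + spread * rng.gauss(0.0, 1.0),
         keep * y + spread * rng.gauss(0.0, 1.0))
        for x, y in joints
    ]

# Hypothetical four-joint mechanism, noised at the final diffusion step.
rng = random.Random(0)
schedule = alpha_bar_schedule(1000)
clean = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (0.5, 1.2)]
noisy = noise_joints(clean, schedule[-1], rng)
```

At the last step ᾱ_t is close to zero, so the noisy coordinates are almost pure Gaussian noise; generation runs the learned reverse process from such noise back to joint positions.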

中文标题/摘要

标题：LinkD：机械连杆合成的自回归扩散模型

设计机械连杆以实现目标末端执行器轨迹由于连续节点放置、离散拓扑配置和非线性运动学约束之间的复杂耦合而构成了一个基本挑战。高度非线性的运动到配置关系意味着关节位置的小幅变化会极大地改变轨迹，而设计空间的组合性扩展使得传统优化和启发式方法在计算上不可行。我们提出了一种自回归扩散框架，该框架利用连杆装配的二元性质，将机构表示为顺序构建的图，其中节点对应于关节，边对应于刚性连杆。我们的方法结合了一个因果变换器和一个去噪扩散概率模型（DDPM），两者都通过变换编码器对目标轨迹进行条件化。因果变换器按顺序预测离散拓扑节点，而DDPM细化每个节点的空间坐标和与先前生成节点的边连接性。这种顺序生成使我们能够进行自适应的试错合成，在设计过程中，可以有选择地重新生成表现出运动学锁定或碰撞的节点，从而自主纠正退化配置。基于图的数据驱动方法超越了传统的优化方法，使逆向设计具有可扩展性，并能够推广到具有任意节点数的机构。我们展示了成功合成包含多达20个节点的连杆系统，并且该方法可以扩展到N节点架构。这项工作推进了自回归图生成方法和计算运动合成，建立了复杂机械系统可扩展逆向设计的新范式。

Summary / 总结

The paper addresses the challenge of designing mechanical linkages to achieve specific end-effector trajectories by introducing a novel autoregressive diffusion model. This model uses a causal transformer and a Denoising Diffusion Probabilistic Model (DDPM) to generate and refine the topology and spatial coordinates of the linkage nodes sequentially. The method overcomes the computational intractability of traditional optimization techniques by enabling adaptive trial-and-error synthesis, which corrects degenerate configurations. Key findings include successful synthesis of up to 20-node linkage systems, demonstrating the model's scalability and generalization to arbitrary node counts.

论文提出了LinkD，一种自回归扩散模型，用于合成机械连杆以实现目标末端执行器轨迹。该方法使用因果Transformer自回归地预测离散拓扑结构，并使用去噪扩散概率模型细化空间坐标和边连接性。该方法能够进行自适应的试错合成，允许纠正退化配置，并在任意节点数的机构中超越传统优化方法，在可扩展性和泛化能力方面表现出色。
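
The abstract refers to the dyadic nature of linkage assembly and to nodes that exhibit kinematic locking. As a rough illustration of those two ideas only, and not of LinkD's learned model, the sketch below places a new joint from two existing joints and two link lengths by intersecting two circles, and reports a lock when the links cannot close. The function names and the four-bar numbers are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def solve_dyad(a: Point, b: Point, len_a: float, len_b: float,
               elbow_up: bool = True) -> Optional[Point]:
    """Place a joint at distance len_a from a and len_b from b.

    Returns None when the two circles do not intersect, i.e. the two
    rigid links cannot be assembled for this pose (a locked dyad).
    """
    ax, ay = a
    bx, by = b
    d = math.hypot(bx - ax, by - ay)
    if d == 0 or d > len_a + len_b or d < abs(len_a - len_b):
        return None
    # Distance from a to the foot of the perpendicular on segment a-b.
    t = (len_a ** 2 - len_b ** 2 + d ** 2) / (2 * d)
    h = math.sqrt(max(len_a ** 2 - t ** 2, 0.0))
    mx = ax + t * (bx - ax) / d
    my = ay + t * (by - ay) / d
    sign = 1.0 if elbow_up else -1.0
    return (mx - sign * h * (by - ay) / d, my + sign * h * (bx - ax) / d)


def first_locking_angle(crank_len: float, ground: Point, len_a: float,
                        len_b: float, steps: int = 360) -> Optional[float]:
    """Rotate a crank about the origin and return the first crank angle
    (radians) at which the dyad fails to close, or None if it never locks."""
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        tip = (crank_len * math.cos(theta), crank_len * math.sin(theta))
        if solve_dyad(tip, ground, len_a, len_b) is None:
            return theta
    return None


print(solve_dyad((0.0, 0.0), (4.0, 0.0), 3.0, 5.0))
print(first_locking_angle(1.0, (4.0, 0.0), 3.5, 3.0))  # closes over the full turn
print(first_locking_angle(1.0, (4.0, 0.0), 1.0, 1.0))  # links too short, locks
```

In a generate-and-check loop, a check of this kind could flag the node to regenerate; how LinkD itself detects and regenerates problematic nodes is not specified in the abstract.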

UniVideo: Unified Understanding, Generation, and Editing for Videos

Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

First: 2025-10-09T16:01:30+00:00 · Latest: 2026-01-07T16:04:47+00:00

Comments: Project Website https://congwei1230.github.io/UniVideo/

Abstract

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design preserves the MLLM's original text generation capabilities, enables accurate interpretation of complex multimodal instructions, and maintains visual consistency in the generated content. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as changing the environment or altering materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we released our model and code.

中文标题/摘要

标题：UniVideo：统一的视频理解、生成和编辑

统一的多模态模型在多模态内容生成和编辑方面取得了令人鼓舞的结果，但主要局限于图像领域。在本文中，我们提出了UniVideo，这是一种多功能框架，将统一建模扩展到视频领域。UniVideo 采用双流设计，结合多模态大型语言模型（MLLM）进行指令理解，以及多模态DiT（MMDiT）进行视频生成。这种设计保留了MLLM的原始文本生成能力，能够准确解释复杂的多模态指令，并在生成的内容中保持视觉一致性。基于此架构，UniVideo 统一了多种视频生成和编辑任务，并在它们之间联合训练。广泛的实验表明，UniVideo 在文本/图像到视频生成、上下文视频生成和上下文视频编辑方面与最先进的特定任务基线相当或超越。值得注意的是，UniVideo 统一的设计使其具备两种形式的泛化能力。首先，UniVideo 支持任务组合，例如通过在单一指令中整合多种能力，将编辑与风格迁移相结合。其次，即使没有针对自由形式视频编辑进行显式训练，UniVideo 也能从大规模图像编辑数据中转移其编辑能力，处理诸如改变环境或在视频中改变材料等未见过的指令。除了这些核心能力，UniVideo 还支持基于视觉提示的视频生成，其中MLLM 解释视觉提示并在合成过程中引导MMDiT。为了促进未来的研究，我们发布了我们的模型和代码。

Summary / 总结

UniVideo is a unified framework that extends multimodal modeling to the video domain, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. It handles diverse video generation and editing tasks under a single multimodal instruction paradigm and matches or surpasses task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. The unified design enables task composition within a single instruction and transfers editing capability from large-scale image editing data to free-form video editing, a task it was not explicitly trained on. UniVideo also supports visual-prompt-based video generation. The model and code have been released.

UniVideo 是一个统一框架，用于视频的理解、生成和编辑，将多模态建模扩展到视频领域。它采用双流设计，使用多模态大型语言模型进行指令理解，以及多模态 DiT 进行视频生成，保持视觉一致性并能够准确解释复杂的多模态指令。实验表明，UniVideo 在各种视频生成和编辑任务中表现优于或匹配最先进的特定任务基线，并通过任务组合和从图像编辑数据中进行迁移学习来处理未见过的指令。此外，它还支持基于视觉提示的视频生成。该模型和代码已公开发布。

MobileDreamer: Generative Sketch World Model for GUI Agent

Authors: Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, Wan Guanglu

First: 2026-01-07T15:51:44+00:00 · Latest: 2026-01-07T15:51:44+00:00

Abstract

Mobile GUI agents have shown strong potential in real-world automation and practical applications. However, most existing agents remain reactive, making decisions mainly from the current screen, which limits their performance on long-horizon tasks. Building a world model from repeated interactions enables forecasting action outcomes and supports better decision making for mobile GUI agents. This is challenging because the model must predict post-action states with spatial awareness while remaining efficient enough for practical deployment. In this paper, we propose MobileDreamer, an efficient world-model-based lookahead framework that equips GUI agents with the future imagination provided by the world model. It consists of a textual sketch world model and a rollout imagination strategy for the GUI agent. The textual sketch world model forecasts post-action states through a learning process that transforms digital images into key task-related sketches, and uses a novel order-invariant learning strategy to preserve the spatial information of GUI elements. The rollout imagination strategy optimizes the action-selection process by leveraging the prediction capability of the world model. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%. World model evaluations further verify that our textual sketch modeling accurately forecasts key GUI elements.

中文标题/摘要

标题：MobileDreamer：面向GUI代理的生成式草图世界模型

移动GUI代理在现实世界的自动化和实际应用中展现了强大的潜力。然而，现有的大多数代理仍然具有反应性，主要从当前屏幕做出决策，这限制了它们在长期任务中的表现。通过从重复交互中构建世界模型，可以预测动作结果并支持更好的决策。这具有挑战性，因为模型必须在保持空间意识的同时进行预测，并且足够高效以实现实际部署。在本文中，我们提出了一种高效的基于世界模型的前瞻框架MobileDreamer，以通过世界模型提供的未来想象来装备GUI代理。它由文本草图世界模型和GUI代理的展开想象组成。文本草图世界模型通过学习过程预测动作后的状态，将数字图像转换为与任务相关的关键草图，并设计了一种新颖的顺序不变学习策略以保留GUI元素的空间信息。GUI代理的展开想象策略通过利用世界模型的预测能力优化了动作选择过程。在Android World上的实验表明，MobileDreamer实现了最先进的性能，并将任务成功率提高了5.25%。世界模型评估进一步证实了我们文本草图建模准确地预测了关键GUI元素。

Summary / 总结

MobileDreamer is designed to enhance mobile GUI agents with a world model that forecasts post-action states, enabling better decision-making on long-horizon tasks. It uses a textual sketch world model to forecast post-action states and a rollout imagination strategy to optimize action selection. Experiments on Android World show that MobileDreamer achieves state-of-the-art performance and improves task success by 5.25%.

MobileDreamer 是一种用于移动 GUI 代理的前瞻框架，通过构建世界模型预测未来状态并改善长期任务中的决策。它使用文本草图世界模型来预测动作后的状态，并使用展开想象策略来优化动作选择。实验表明，MobileDreamer 的性能优于现有方法，并将任务成功率提高了 5.25%。文本草图模型准确预测了关键 GUI 元素，支持移动 GUI 代理的高效和有效的决策。
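
The rollout imagination idea, choosing an action by imagining its consequences with a world model, can be sketched as a small depth-limited lookahead. This is a minimal sketch under stated assumptions: the transition table, the scoring function, and every name below are hypothetical stand-ins, and the learned textual sketch world model from the paper is replaced by a hand-written toy.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[str, ...]  # a "textual sketch": key elements visible on screen


def lookahead_select(
    state: State,
    actions: Sequence[str],
    world_model: Callable[[State, str], State],
    score: Callable[[State], float],
    depth: int = 2,
) -> str:
    """Return the action whose best imagined rollout of `depth` steps scores highest."""

    def best_value(s: State, remaining: int) -> float:
        if remaining == 0:
            return score(s)
        return max(best_value(world_model(s, a), remaining - 1) for a in actions)

    return max(actions, key=lambda a: best_value(world_model(state, a), depth - 1))


# Toy stand-in (hypothetical): a hand-written table instead of a learned model.
def toy_world_model(state: State, action: str) -> State:
    if action == "open_settings":
        return ("settings_page",)
    if action == "tap_wifi_toggle" and "settings_page" in state:
        return ("settings_page", "wifi_on")
    if action == "go_home":
        return ("home",)
    return state  # the action has no effect on this screen


def toy_score(state: State) -> float:
    """Task: turn Wi-Fi on. Reward only the goal state."""
    return 1.0 if "wifi_on" in state else 0.0


ACTIONS = ["go_home", "tap_wifi_toggle", "open_settings"]
print(lookahead_select(("home",), ACTIONS, toy_world_model, toy_score, depth=2))
```

A purely reactive agent scoring only the next screen sees no difference between the three actions from the home screen; the two-step lookahead prefers opening settings because the toggle becomes reachable afterwards.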

Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model

Authors: Yuan Wang, Borui Liao, Huijuan Huang, Jinda Lu, Ouxiang Li, Kuien Liu, Meng Wang, Xiang Wang

First: 2026-01-07T15:47:14+00:00 · Latest: 2026-01-07T15:47:14+00:00

Abstract

Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.

中文标题/摘要

标题：基于帧的思考：通过帧奖励模型评估生成视频失真

近期视频奖励模型和后训练策略的进步提高了文本到视频（T2V）生成的质量。尽管这些模型通常评估视觉质量、运动质量和文本对齐，但它们往往忽略了关键的结构性失真，如异常对象外观和交互，这些失真会降低生成视频的整体质量。为解决这一问题，我们引入了REACT，这是一种专门用于生成视频结构性失真评估的帧级奖励模型。REACT通过推理视频帧，分配点级评分和归因标签，专注于识别失真。为此，我们构建了一个大规模的人类偏好数据集，基于我们提出的结构性失真分类法进行标注，并使用高效的链式思考（CoT）合成流水线生成额外数据。REACT采用两阶段框架进行训练：首先进行带掩码损失的监督微调以注入领域知识，然后使用组相对策略优化（GRPO）和成对奖励进行强化学习，以增强推理能力和使输出评分与人类偏好对齐。在推理过程中，引入了动态采样机制，专注于最有可能出现失真的帧。我们还提出了REACT-Bench，这是一个生成视频失真评估基准。实验结果表明，REACT在评估结构性失真方面补充了现有的奖励模型，实现了准确的定量评估和可解释的归因分析。

Summary / 总结

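The research aims to improve the evaluation of structural distortions in generative videos, such as abnormal object appearances and interactions, which existing reward models often overlook. REACT, a frame-level reward model, assesses these distortions by reasoning over video frames and assigning point-wise scores and attribution labels. It is trained in two stages, supervised fine-tuning with masked loss followed by reinforcement learning with GRPO and pairwise rewards, and at inference uses dynamic sampling to focus on the frames most likely to exhibit distortion. Experiments indicate that REACT complements existing reward models, giving accurate quantitative evaluations and interpretable attribution analysis.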

论文提出了REACT，一种用于评估生成视频中结构性失真的帧级奖励模型。REACT采用两阶段训练框架，包括监督微调和强化学习，以评估异常物体外观和交互等失真。该模型基于大规模的人类偏好数据集和高效的合成管道进行训练，实验表明它可与现有奖励模型互补，在定量评估和归因分析方面均表现良好。
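
The second training stage uses GRPO with pairwise rewards. The core of GRPO as commonly described is that each sampled response is scored relative to the other responses in its group rather than against a learned value baseline. The sketch below shows only that normalization; the use of the population standard deviation, the `eps` constant, and the binary form of `pairwise_reward` are assumptions for illustration, since the abstract does not give REACT's exact reward definition.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


def pairwise_reward(score_preferred: float, score_rejected: float) -> float:
    """Hypothetical pairwise reward: 1 if the model ranks the human-preferred
    video above the rejected one, else 0."""
    return 1.0 if score_preferred > score_rejected else 0.0


# Four sampled responses for one video pair; two of them rank the pair correctly.
group = [
    pairwise_reward(4.0, 2.0),
    pairwise_reward(1.5, 3.0),
    pairwise_reward(2.0, 2.0),
    pairwise_reward(3.5, 1.0),
]
print(group_relative_advantages(group))
```

Responses that rank the pair correctly receive a positive advantage and the others a negative one, so the policy update is driven by within-group differences only.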

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Authors: Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

First: 2025-11-03T15:05:44+00:00 · Latest: 2026-01-07T15:44:03+00:00

Comments: 22 pages, includes figures and tables; introduces the EngTrace benchmark

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark comprising 90 templates across three major engineering branches, nine core domains and 20 distinct areas. Through domain-aware parameterization, we generate 1,350 unique, contamination-resistant test cases to stress-test generalization. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 24 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.

中文标题/摘要

标题：EngTrace：一种可验证工程推理过程监督的符号基准

大型语言模型（LLMs）正越来越多地进入由严格定量标准和不可变物理法则管理的专业化、安全关键型工程工作流程中，因此对其推理能力进行严格评估变得至关重要。然而，现有的基准测试如MMLU、MATH和HumanEval评估的是孤立的认知技能，未能捕捉到工程中基于物理的推理核心，其中科学原理、定量建模和实际约束必须相互结合。为了在工程中实现可验证的过程监督，我们引入了EngTrace，这是一种包含90个模板的符号基准，覆盖了三大工程分支、九个核心领域和20个不同的领域。通过领域感知参数化，我们生成了1,350个独特的、抗污染的测试案例，以测试泛化能力。超越结果匹配，我们引入了一种可验证的两阶段评估框架，该框架通过分层协议使用自动化程序检查和异构AI法庭验证中间推理轨迹和最终答案。我们对24种领先LLM的评估揭示了数值精度和轨迹保真度之间的权衡，指出了一个复杂性悬崖，在此悬崖上，抽象的数学预训练无法转化为高级工程任务所需的综合推理。

Summary / 总结

EngTrace is a symbolic benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) in safety-critical engineering workflows. It consists of 1,350 unique test cases generated from 90 templates, covering three major engineering branches, nine core domains, and 20 distinct areas. The benchmark evaluates LLMs not only on final answers but also on intermediate reasoning traces, using a verifiable two-stage framework that combines automated procedural checks with a heterogeneous AI Tribunal. The evaluation of 24 leading LLMs shows a trade-off between numeric precision and trace fidelity and identifies a complexity cliff, where abstract mathematical pre-training fails to translate into the integrative reasoning that advanced engineering tasks require.

EngTrace 是一个符号基准，旨在评估大型语言模型（LLMs）在安全关键工程工作流中的推理能力。它包含三个工程分支和20个不同领域的90个模板，生成1,350个独特的测试案例。该基准使用可验证的两阶段评估框架来评估中间推理轨迹和最终答案，揭示了24种领先LLM在数值精度和轨迹保真度之间的权衡。
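
To make the idea of a parameterized symbolic template concrete, here is a minimal sketch of one template that draws parameters, computes the ground-truth intermediate quantities, and checks a claimed trace step by step as well as the final answer. The circuit problem, the parameter ranges, the tolerance, and all names are hypothetical and are not taken from EngTrace. For scale, 1,350 cases over 90 templates works out to 15 per template if they are spread evenly, which the abstract does not state.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Instance:
    question: str
    steps: Dict[str, float]  # named intermediate quantities of the reference trace
    answer: float


def series_circuit_template(rng: random.Random) -> Instance:
    """Hypothetical template: power dissipated by two series resistors."""
    volts = rng.choice([5.0, 9.0, 12.0, 24.0])  # plausible supply voltages
    r1 = float(rng.randint(10, 100))
    r2 = float(rng.randint(10, 100))
    r_total = r1 + r2
    current = volts / r_total
    power = volts * current
    question = (
        f"A {volts:g} V source drives two series resistors of {r1:g} ohm "
        f"and {r2:g} ohm. Find the total dissipated power in watts."
    )
    return Instance(question, {"r_total": r_total, "current": current}, power)


def check(instance: Instance, claimed_steps: Dict[str, float],
          claimed_answer: float, rel_tol: float = 1e-3) -> Tuple[bool, bool]:
    """Return (trace_ok, answer_ok) so the two can be scored separately."""
    steps_ok = all(
        name in claimed_steps
        and math.isclose(claimed_steps[name], value, rel_tol=rel_tol)
        for name, value in instance.steps.items()
    )
    answer_ok = math.isclose(claimed_answer, instance.answer, rel_tol=rel_tol)
    return steps_ok, answer_ok


inst = series_circuit_template(random.Random(0))
print(inst.question)
print(check(inst, inst.steps, inst.answer))
```

Scoring the trace and the answer separately is what lets a benchmark of this kind distinguish a correct number reached through a wrong derivation from a sound derivation with an arithmetic slip.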

Towards Understanding Feature Learning in Parameter Transfer

Authors: Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui

First: 2025-09-26T08:37:54+00:00 · Latest: 2026-01-07T15:40:45+00:00

Abstract

Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. To the best of our knowledge, our theory is the first to provide a dynamic analysis for parameter transfer and also the first to prove the existence of negative transfer theoretically. Numerical experiments and real-world data experiments are conducted to empirically validate our theoretical findings.

中文标题/摘要

标题：理解参数转移中的特征学习

参数转移是迁移学习中的一个核心范式，通过在上游模型和下游模型之间共享模型参数，实现跨任务和领域的知识重用。然而，当仅将上游模型的一部分参数转移到下游模型时，仍然缺乏理论理解，即在这种部分参数重用的情况下何时有益，以及哪些因素决定了其有效性。为了解决这一缺口，我们分析了一个设置，其中上游模型和下游模型都是ReLU卷积神经网络（CNNs）。在这一理论框架内，我们描述了继承参数作为普遍知识载体的作用，并确定了增强其对目标任务有益影响的关键因素。此外，我们的分析还揭示了为什么在某些情况下，参数转移可能会导致目标任务的测试准确率低于从头开始训练新模型。据我们所知，我们的理论是第一个提供参数转移的动态分析，并且是第一个从理论上证明存在负迁移的理论。进行了数值实验和实际数据实验来实证验证我们的理论发现。

Summary / 总结

The paper aims to understand the conditions under which partial parameter transfer from an upstream to a downstream model is beneficial in transfer learning. The authors analyze ReLU convolutional neural networks and identify key factors that enhance the effectiveness of parameter transfer. They also provide theoretical evidence that in some cases, parameter transfer can result in lower test accuracy compared to training from scratch. Experiments support their theoretical insights.

论文旨在理解在迁移学习中，从上游模型部分参数转移到下游模型是有益的条件。作者分析了ReLU卷积神经网络，并发现继承的参数作为通用知识的载体，可以增强目标任务的性能。他们还确定了增强这些参数有益影响的因素，并提供了理论证据表明，在某些情况下，参数转移可能导致测试准确率低于从零开始训练新模型。实验支持了理论发现。
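
The setting analyzed here, where only a subset of upstream parameters is inherited and the rest are freshly initialized, can be pictured with a minimal sketch. Filters are plain lists of floats, and the choice of which indices to transfer is arbitrary; both are illustrative assumptions, and the paper's ReLU CNN setup and theory are not reproduced.

```python
import random
from typing import List, Sequence

Filters = List[List[float]]


def init_filters(n_filters: int, size: int, rng: random.Random, std: float = 0.01) -> Filters:
    """Fresh small-Gaussian initialization for a downstream model."""
    return [[rng.gauss(0.0, std) for _ in range(size)] for _ in range(n_filters)]


def transfer_subset(upstream: Filters, downstream: Filters, indices: Sequence[int]) -> Filters:
    """Return a copy of `downstream` in which the filters at `indices`
    are replaced by the corresponding upstream filters."""
    merged = [list(f) for f in downstream]
    for i in indices:
        merged[i] = list(upstream[i])
    return merged


rng = random.Random(0)
upstream = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [4.0, 4.0, 4.0]]
downstream = init_filters(4, 3, rng)
model = transfer_subset(upstream, downstream, indices=[0, 2])
print(model[0], model[2])  # inherited filters
print(model[1], model[3])  # freshly initialized filters, trained from scratch
```

Whether the inherited filters help or hurt on the target task is exactly the question the paper studies; the sketch only shows the mechanics of partial reuse.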

Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots

Authors: Chenhao Li, Andreas Krause, Marco Hutter

First: 2025-04-23T12:58:15+00:00 · Latest: 2026-01-07T15:37:21+00:00

Abstract

Reinforcement Learning (RL) has achieved impressive results in robotics, yet high-performing pipelines remain highly task-specific, with little reuse of prior data. Offline Model-based RL (MBRL) offers greater data efficiency by training policies entirely from existing datasets, but suffers from compounding errors and distribution shift in long-horizon rollouts. Although existing methods have shown success in controlled simulation benchmarks, robustly applying them to the noisy, biased, and partially observed datasets typical of real-world robotics remains challenging. We present a principled pipeline for making offline MBRL effective on physical robots. Our RWM-U extends autoregressive world models with epistemic uncertainty estimation, enabling temporally consistent multi-step rollouts with uncertainty effectively propagated over long horizons. We combine RWM-U with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the stable, on-policy PPO framework for real-world control. We evaluate our approach on diverse manipulation and locomotion tasks in simulation and on real quadruped and humanoid robots, training policies entirely from offline datasets. The resulting policies consistently outperform model-free and uncertainty-unaware model-based baselines, and fusing real-world data in model learning further yields robust policies that surpass online model-free baselines trained solely in simulation.

中文标题/摘要

标题：不确定性感知的机器人世界模型使离线模型导向强化学习在真实机器人上可行

强化学习（RL）在机器人学中取得了令人印象深刻的成果，但高性能的管道仍然高度任务特定，很少重用先前的数据。离线模型导向强化学习（MBRL）通过完全从现有数据集训练策略来提供更高的数据效率，但长时序滚动中会遭受累积错误和分布偏移。尽管现有方法在受控的模拟基准测试中取得了成功，但在真实世界机器人典型的嘈杂、有偏见和部分观测数据上稳健地应用它们仍然具有挑战性。我们提出了一种原理性的管道，使离线MBRL在物理机器人上有效。我们的RWM-U扩展了自回归世界模型，加入了认知不确定性估计，使多步滚动具有时间一致性，并有效地在长时序中传播不确定性。我们将RWM-U与MOPO-PPO结合，后者将不确定性惩罚策略优化适应稳定的同策略（on-policy）PPO框架，用于现实世界的控制。我们在模拟和真实四足和人形机器人上评估了我们的方法，训练策略完全来自离线数据集。结果表明，这些策略在各种操作和运动任务中始终优于无模型和不感知不确定性的模型导向基线，并且在模型学习中融合现实世界数据进一步产生了稳健的策略，这些策略超越了仅在模拟中训练的在线无模型基线。

Summary / 总结

This paper addresses the challenge of applying offline Model-based Reinforcement Learning (MBRL) to real-world robotics, where datasets are often noisy, biased, and partially observed. The authors propose RWM-U, which adds epistemic uncertainty estimation to autoregressive world models, enabling temporally consistent multi-step rollouts with uncertainty propagated over long horizons. They combine this with MOPO-PPO, which adapts uncertainty-penalized policy optimization to the on-policy PPO framework for real-world control. Experiments on manipulation and locomotion tasks, in simulation and on real quadruped and humanoid robots, show that the approach outperforms both model-free and uncertainty-unaware model-based baselines, and that incorporating real-world data in model learning further improves policy robustness, surpassing online model-free baselines trained only in simulation.

该论文旨在解决将离线模型导向强化学习（MBRL）应用于真实世界机器人的问题，其中数据通常噪声较大且偏差明显。作者提出了一种名为RWM-U的方法，将认知不确定性整合到自回归世界模型中，以实现更稳健的长时序模拟。他们将RWM-U与MOPO-PPO结合，后者是为真实世界控制而调整的不确定性惩罚策略优化方法。实验结果显示，该方法在各种操作和运动任务上优于无模型方法和不感知不确定性的MBRL方法，并且将真实世界数据融合到模型学习中进一步提高了策略的鲁棒性，使其超越仅在仿真中训练的在线无模型基线方法。
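
Uncertainty-penalized policy optimization in the MOPO style trains on a reward reduced in proportion to the model's uncertainty, r' = r - lambda * u, so the policy is discouraged from exploiting regions where the learned model is unreliable. The sketch below shows that penalty applied along an imagined rollout. Measuring epistemic uncertainty as disagreement across an ensemble is an assumption made for illustration; the abstract does not say how RWM-U estimates it, and all names and numbers are hypothetical.

```python
import statistics
from typing import Sequence


def ensemble_disagreement(predictions: Sequence[Sequence[float]]) -> float:
    """Assumed epistemic proxy: std across ensemble members, averaged over
    state dimensions. `predictions[i]` is member i's next-state prediction."""
    per_dim = zip(*predictions)
    return statistics.fmean([statistics.pstdev(values) for values in per_dim])


def penalized_reward(reward: float, uncertainty: float, lam: float = 1.0) -> float:
    """MOPO-style penalty: r' = r - lam * u."""
    return reward - lam * uncertainty


def penalized_return(rewards: Sequence[float], uncertainties: Sequence[float],
                     lam: float = 1.0, gamma: float = 0.99) -> float:
    """Discounted return of an imagined rollout under the penalized reward."""
    return sum(
        (gamma ** t) * penalized_reward(r, u, lam)
        for t, (r, u) in enumerate(zip(rewards, uncertainties))
    )


# Two imagined steps: the ensemble agrees at step 0 and disagrees at step 1.
step0 = [[0.10, 0.20], [0.10, 0.20], [0.10, 0.20]]
step1 = [[0.00, 0.50], [0.30, 0.50], [0.60, 0.50]]
u = [ensemble_disagreement(step0), ensemble_disagreement(step1)]
print(u)
print(penalized_return([1.0, 1.0], u, lam=2.0))
```

The coefficient `lam` trades off return against caution: larger values keep the policy closer to the states the offline data covers well.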