arXiv Paper Digest

Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition

Authors: Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Venue: NeurIPS 2025 Spotlight

First: 2025-11-05T18:59:35+00:00 · Latest: 2025-11-05T18:59:35+00:00

Comments: NeurIPS 2025 Spotlight paper. Project page: https://jong980812.github.io/DANCE/

Abstract

Effective explanations of video action recognition models should disentangle how movements unfold over time from the surrounding spatial context. However, existing methods based on saliency produce entangled explanations, making it unclear whether predictions rely on motion or spatial context. Language-based approaches offer structure but often fail to explain motions due to their tacit nature -- intuitively understood but difficult to verbalize. To address these challenges, we propose Disentangled Action aNd Context concept-based Explainable (DANCE) video action recognition, a framework that predicts actions through disentangled concept types: motion dynamics, objects, and scenes. We define motion dynamics concepts as human pose sequences. We employ a large language model to automatically extract object and scene concepts. Built on an ante-hoc concept bottleneck design, DANCE enforces prediction through these concepts. Experiments on four datasets -- KTH, Penn Action, HAA500, and UCF-101 -- demonstrate that DANCE significantly improves explanation clarity with competitive performance. We validate the superior interpretability of DANCE through a user study. Experimental results also show that DANCE is beneficial for model debugging, editing, and failure analysis.
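
The concept bottleneck idea can be illustrated with a small sketch. This is not the DANCE implementation: it assumes a linear layer over concept activations, and the concept names, weights, and function name are hypothetical. It only shows why routing the prediction through typed concepts yields a per-concept explanation.

```python
def predict_with_concepts(activations, class_weights):
    """Score each action as a weighted sum of concept activations.

    activations: dict mapping concept name -> activation in [0, 1].
    class_weights: dict mapping action name -> {concept name: weight}.
    Returns the top action and its per-concept contributions, largest first.
    """
    scores = {
        action: sum(w * activations.get(concept, 0.0) for concept, w in weights.items())
        for action, weights in class_weights.items()
    }
    best = max(scores, key=scores.get)
    contributions = sorted(
        ((concept, w * activations.get(concept, 0.0))
         for concept, w in class_weights[best].items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return best, contributions


# Hypothetical concepts, prefixed by type: motion dynamics, object, scene.
activations = {"motion:overhead_swing": 0.9, "object:racket": 0.8, "scene:pool": 0.1}
class_weights = {
    "tennis_serve": {"motion:overhead_swing": 1.5, "object:racket": 1.0},
    "swimming": {"motion:overhead_swing": 0.4, "scene:pool": 2.0},
}
action, why = predict_with_concepts(activations, class_weights)
```

Because the score is a sum over named concepts, the sorted contributions show whether a prediction leaned on a motion concept or on an object or scene concept.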

Summary

The research aims to provide clear explanations for video action recognition models by disentangling motion dynamics from spatial context. The DANCE framework predicts actions through three disentangled concept types: motion dynamics (defined as human pose sequences), objects, and scenes. Experiments on four datasets (KTH, Penn Action, HAA500, and UCF-101) show that DANCE improves explanation clarity while maintaining competitive performance. A user study supports its interpretability, and further experiments show that it helps with model debugging, editing, and failure analysis.

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Authors: Richard Dewey, Janos Botyanszki, Ciamac C. Moallemi, Andrew T. Zheng

First: 2025-11-05T18:58:18+00:00 · Latest: 2025-11-05T18:58:18+00:00

Abstract

AI researchers have long focused on poker-like games as a testbed for environments characterized by multi-player dynamics, imperfect information, and reasoning under uncertainty. While recent breakthroughs have matched elite human play at no-limit Texas hold'em, the multi-player dynamics are subdued: most hands converge quickly with only two players engaged through multiple rounds of bidding. In this paper, we present Solly, the first AI agent to achieve elite human play in reduced-format Liar's Poker, a game characterized by extensive multi-player engagement. We trained Solly using self-play with a model-free, actor-critic, deep reinforcement learning algorithm. Solly played at an elite human level as measured by win rate (won over 50% of hands) and equity (money won) in heads-up and multi-player Liar's Poker. Solly also outperformed large language models (LLMs), including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized play effectively, and was not easily exploitable by world-class human players.

Summary

This paper presents Solly, an AI agent that achieves elite human play in reduced-format Liar's Poker, a game with extensive multi-player engagement. Solly was trained through self-play with a model-free, actor-critic, deep reinforcement learning algorithm. It played at an elite human level in both heads-up and multi-player games, as measured by win rate (over 50% of hands won) and equity (money won), and it outperformed large language models, including those with reasoning abilities, on the same metrics. Solly developed novel bidding strategies, randomized its play effectively, and was not easily exploited by world-class human players.

Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

Authors: Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

First: 2025-11-05T18:43:15+00:00 · Latest: 2025-11-05T18:43:15+00:00

Comments: Preprint. Under Review

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean for each prompt. Statistically, this centering acts as a control variate (or baseline), reducing the variance of the policy-gradient estimator. Typically, the mean reward is estimated using per-prompt empirical averages for each prompt in a batch. Drawing inspiration from Stein's paradox, we propose using shrinkage estimators that combine per-prompt and across-prompt means to improve the overall per-prompt mean estimation accuracy -- particularly in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our proposed baseline serves as a drop-in replacement for existing per-prompt mean baselines, requiring no additional hyper-parameters or computation. Empirically, shrinkage baselines consistently outperform standard empirical-mean baselines, leading to lower-variance gradient updates and improved training stability.
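
A minimal sketch of the shrinkage idea follows. It is not the paper's estimator: the positive-part James-Stein-style coefficient, the pooled variance estimate, and the function name are assumptions chosen for illustration, and the sketch does not address whether the baseline is independent of the sample it is subtracted from.

```python
from statistics import fmean


def shrinkage_baselines(rewards_per_prompt):
    """Return one baseline per prompt, shrinking each per-prompt mean toward the grand mean.

    rewards_per_prompt: list of lists, the rewards of the sampled generations per prompt.
    Assumes every prompt has the same number of generations n >= 2.
    """
    k = len(rewards_per_prompt)
    n = len(rewards_per_prompt[0])
    means = [fmean(r) for r in rewards_per_prompt]
    grand = fmean(means)
    # Pooled within-prompt variance of one reward, then the variance of a per-prompt mean.
    within = fmean([
        sum((x - m) ** 2 for x in r) / (n - 1)
        for r, m in zip(rewards_per_prompt, means)
    ])
    noise = within / n
    spread = sum((m - grand) ** 2 for m in means)
    if spread == 0.0:
        return [grand] * k
    # keep = 1 leaves the per-prompt mean untouched; keep = 0 uses the grand mean.
    keep = min(1.0, max(0.0, 1.0 - (k - 3) * noise / spread))
    return [grand + keep * (m - grand) for m in means]


# Four prompts, two generations each, binary verifiable rewards.
rewards = [[1, 0], [0, 0], [1, 1], [0, 1]]
baselines = shrinkage_baselines(rewards)
advantages = [[x - b for x in r] for r, b in zip(rewards, baselines)]
```

With few generations per prompt the per-prompt means are noisy, so the coefficient pulls them toward the across-prompt mean; with many generations `noise` shrinks and the baseline approaches the usual per-prompt mean.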

Summary

The paper addresses the challenge of stabilizing training in Reinforcement Learning with Verifiable Rewards (RLVR) by proposing shrinkage baselines. These baselines combine per-prompt and across-prompt means to reduce the variance of the policy-gradient estimator, especially in the low-generation regime. Theoretical analysis shows that shrinkage baselines yield lower-variance estimators across different algorithms. Empirical results demonstrate that shrinkage baselines outperform standard empirical-mean baselines, leading to more stable training and lower-variance gradient updates.

GDS Agent for Graph Algorithmic Reasoning

Authors: Borun Shi, Ioannis Panagiotas

First: 2025-08-28T10:35:44+00:00 · Latest: 2025-11-05T18:39:38+00:00

Comments: Technical report

Abstract

Large language models (LLMs) have shown remarkable multimodal information processing and reasoning ability. When equipped with tools through function calling and enhanced with retrieval-augmented techniques, compound LLM-based systems can access closed data sources and answer questions about them. However, they still struggle to process and reason over large-scale graph-structure data. We introduce the GDS (Graph Data Science) agent in this technical report. The GDS agent introduces a comprehensive set of graph algorithms as tools, together with preprocessing (retrieval) and postprocessing of algorithm results, in a model context protocol (MCP) server. The server can be used with any modern LLM out-of-the-box. GDS agent allows users to ask any question that implicitly and intrinsically requires graph algorithmic reasoning about their data, and quickly obtain accurate and grounded answers. We introduce new benchmarks that evaluate intermediate tool calls as well as final responses. The results indicate that GDS agent is able to solve a wide spectrum of graph tasks. We also provide detailed case studies for more open-ended tasks and study scenarios where the agent struggles. Finally, we discuss the remaining challenges and the future roadmap.
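
The pattern of exposing a graph algorithm as a callable tool can be sketched as follows. This is an illustration only: it uses neither an MCP SDK nor the GDS library, and the registry, the call format, and all names are hypothetical.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first search on an undirected, unweighted graph given as (u, v) pairs."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # target is unreachable from source


# Hypothetical tool registry: the model picks a name and arguments, the host runs the algorithm.
TOOLS = {"shortest_path": shortest_path}


def run_tool_call(call, edges):
    """Dispatch a tool call of the form {"name": ..., "arguments": {...}}."""
    return TOOLS[call["name"]](edges, **call["arguments"])


edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "c")]
call = {"name": "shortest_path", "arguments": {"source": "a", "target": "c"}}
result = run_tool_call(call, edges)
```

The answer comes from running the algorithm on the data rather than from the model's own estimate, which is what makes the response grounded.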

AnaFlow: Agentic LLM-based Workflow for Reasoning-Driven Explainable and Sample-Efficient Analog Circuit Sizing

Authors: Mohsen Ahmadzadeh, Kaichang Chen, Georges Gielen

First: 2025-11-05T18:24:01+00:00 · Latest: 2025-11-05T18:24:01+00:00

Comments: This article was accepted by 2025 International Conference on Computer-Aided Design (ICCAD 2025) and was presented in Munich, October 2025

Abstract

Analog/mixed-signal circuits are key for interfacing electronics with the physical world. Their design, however, remains a largely handcrafted process, resulting in long and error-prone design cycles. While the recent rise of AI-based reinforcement learning and generative AI has created new techniques to automate this task, the need for many time-consuming simulations is a critical bottleneck hindering the overall efficiency. Furthermore, the lack of explainability of the resulting design solutions hampers widespread adoption of the tools. To address these issues, a novel agentic AI framework for sample-efficient and explainable analog circuit sizing is presented. It employs a multi-agent workflow where specialized Large Language Model (LLM)-based agents collaborate to interpret the circuit topology, to understand the design goals, and to iteratively refine the circuit's design parameters towards the target goals with human-interpretable reasoning. The adaptive simulation strategy creates an intelligent control that yields a high sample efficiency. The AnaFlow framework is demonstrated for two circuits of varying complexity and is able to complete the sizing task fully automatically, differently from pure Bayesian optimization and reinforcement learning approaches. The system learns from its optimization history to avoid past mistakes and to accelerate convergence. The inherent explainability makes this a powerful tool for analog design space exploration and a new paradigm in analog EDA, where AI agents serve as transparent design assistants.

Summary

The paper introduces AnaFlow, an agentic AI framework for automating analog circuit sizing with improved sample efficiency and explainability. It uses a multi-agent workflow in which specialized Large Language Model (LLM)-based agents interpret the circuit topology, understand the design goals, and iteratively refine the design parameters with human-interpretable reasoning. An adaptive simulation strategy improves sample efficiency, and the system learns from its optimization history to avoid past mistakes and accelerate convergence. AnaFlow completes the sizing task fully automatically for two circuits of varying complexity, which the authors contrast with pure Bayesian optimization and reinforcement learning approaches.

Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Authors: Seungyong Lee, Jeong-gi Kwak

Venue: SIGGRAPH Asia 2025

First: 2025-08-06T19:10:58+00:00 · Latest: 2025-11-05T18:23:44+00:00

Comments: Accepted to SIGGRAPH Asia 2025, project page: https://nxnai.github.io/Voost/

Abstract

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
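
Attention temperature scaling can be sketched in a few lines. The abstract does not say how Voost chooses the temperature, so this sketch only shows where such a factor enters standard scaled dot-product attention; the function name and parameter are assumptions.

```python
import math


def attention_weights(query, keys, temperature=1.0):
    """Softmax attention weights of one query over a list of keys.

    temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    """
    d = len(query)
    scale = 1.0 / (math.sqrt(d) * temperature)
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
default = attention_weights(query, keys)
sharper = attention_weights(query, keys, temperature=0.5)
```

When the number of attended tokens changes with resolution or mask size, the softmax spreads over a different number of entries, which is one reason a temperature adjustment at inference time can help.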

Summary

Voost is a unified framework using a diffusion transformer to jointly learn virtual try-on and try-off, addressing the challenge of modeling garment-body correspondence under pose and appearance variation. It enhances relational reasoning between garments and bodies without task-specific networks or additional labels, and introduces techniques like attention temperature scaling and self-corrective sampling to improve robustness and consistency. Voost achieves state-of-the-art results in alignment accuracy, visual fidelity, and generalization on benchmarks for both try-on and try-off tasks, outperforming strong baselines.

Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

Authors: Lipeng Zu, Hansong Zhou, Xiaonan Zhang

First: 2025-11-05T18:20:23+00:00 · Latest: 2025-11-05T18:20:23+00:00

Abstract

Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.
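
The dual-objective loss can be pictured with a toy sketch. The abstract gives no formula, so everything here is an assumption: the linear mapping from uncertainty to weight, the squared-distance behavior term, and the function names are hypothetical and are not BAQ's actual objective.

```python
def behavior_weight(uncertainty, max_weight=1.0):
    """Map an uncertainty estimate to a regularization weight in [0, max_weight]."""
    clipped = min(1.0, max(0.0, uncertainty))
    return max_weight * clipped


def adaptive_loss(td_error, policy_action, behavior_action, uncertainty):
    """Squared TD error plus an uncertainty-weighted pull toward the offline behavior."""
    td_term = td_error ** 2
    behavior_term = sum((p - b) ** 2 for p, b in zip(policy_action, behavior_action))
    return td_term + behavior_weight(uncertainty) * behavior_term


# High uncertainty: the policy is penalized for deviating from the offline behavior.
early = adaptive_loss(0.5, [1.0, 0.0], [0.0, 0.0], uncertainty=1.0)
# Low uncertainty: the constraint is relaxed and only the TD term remains.
late = adaptive_loss(0.5, [1.0, 0.0], [0.0, 0.0], uncertainty=0.0)
```

As online experience accumulates and the uncertainty estimate falls, the behavior term fades out, matching the gradual relaxation the abstract describes.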

Summary

The research aims to address the challenges of deploying offline-trained policies in dynamic environments by introducing Behavior-Adaptive Q-Learning (BAQ), which uses an implicit behavioral model from offline data to guide online fine-tuning. The method employs a dual-objective loss function to align the online policy with offline behavior when uncertainty is high and gradually relaxes this constraint as more confident online experience is gained. Experimental results show that BAQ outperforms previous approaches in terms of faster recovery, improved robustness, and higher overall performance across standard benchmarks.

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Authors: Xingyao Wang, Simon Rosenberg, Juan Michelini, Calvin Smith, Hoang Tran, Engel Nyst, Rohit Malhotra, Xuhui Zhou, Valerie Chen, Robert Brennan, Graham Neubig

First: 2025-11-05T18:16:44+00:00 · Latest: 2025-11-05T18:16:44+00:00

Abstract

Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents, which has 64k+ GitHub stars. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.

Summary

The OpenHands Software Agent SDK is designed to simplify the development of production-ready software engineering agents by providing a flexible, secure, and interactive framework. It includes a simple interface for implementing agents, seamless local-to-remote execution, integrated REST/WebSocket services, and direct connection to various user interfaces. Empirical results on SWE-Bench Verified and GAIA benchmarks show strong performance, making it a practical foundation for prototyping and deploying agents at scale.

Generative View Stitching

Authors: Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann

First: 2025-10-28T17:59:58+00:00 · Latest: 2025-11-05T17:59:23+00:00

Comments: Updated acknowledgements and fixed figure visibility issue on Safari. Project website: https://andrewsonga.github.io/gvs

Abstract

Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd's Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.

Summary

Generative View Stitching (GVS) addresses the limitation of autoregressive video diffusion models by proposing a sampling algorithm that generates the entire sequence in parallel, ensuring the generated scene is faithful to the predefined camera trajectory. GVS uses Omni Guidance to enhance temporal consistency and enables loop-closing for long-range coherence. The method is compatible with any off-the-shelf video model trained with Diffusion Forcing and achieves stable, collision-free, and frame-to-frame consistent camera-guided video generation for various predefined camera paths, including the Impossible Staircase. Results are best viewed as videos at the project website: https://andrewsonga.github.io/gvs.

MAROON: A Framework for the Joint Characterization of Near-Field High-Resolution Radar and Optical Depth Imaging Techniques

Authors: Vanessa Wirth, Johanna Bräunig, Nikolai Hofmann, Martin Vossiek, Tim Weyrich, Marc Stamminger

First: 2024-11-01T11:53:10+00:00 · Latest: 2025-11-05T17:58:25+00:00

Abstract

Utilizing the complementary strengths of wavelength-specific range or depth sensors is crucial for robust computer-assisted tasks such as autonomous driving. Despite this, there is still little research done at the intersection of optical depth sensors and radars operating close range, where the target is decimeters away from the sensors. Together with a growing interest in high-resolution imaging radars operating in the near field, the question arises how these sensors behave in comparison to their traditional optical counterparts. In this work, we take on the unique challenge of jointly characterizing depth imagers from both, the optical and radio-frequency domain using a multimodal spatial calibration. We collect data from four depth imagers, with three optical sensors of varying operation principle and an imaging radar. We provide a comprehensive evaluation of their depth measurements with respect to distinct object materials, geometries, and object-to-sensor distances. Specifically, we reveal scattering effects of partially transmissive materials and investigate the response of radio-frequency signals. All object measurements will be made public in form of a multimodal dataset, called MAROON.
