arXiv Paper Digest

VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Authors: Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, Yihong Gong

First: 2025-11-24T18:59:56+00:00 · Latest: 2025-11-24T18:59:56+00:00

Abstract

We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
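
The step from (caption, score) trajectories to preference tuples, followed by an easy-to-hard ordering, can be sketched as below. This is a minimal illustration, not the authors' pipeline: the abstract does not say how pairs are chosen or how difficulty is measured, so pairing the best-scored caption against the worst-scored one and treating a larger score gap as "easier" are assumptions, and `Pair`, `trajectory_to_pair`, and `easy_to_hard` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Pair(NamedTuple):
    video_id: str
    chosen: str    # caption treated as preferred
    rejected: str  # caption treated as dispreferred
    gap: float     # score difference between the two captions


def trajectory_to_pair(video_id: str, trajectory: list) -> Optional[Pair]:
    """Turn one trajectory of (caption, score) tuples into a preference pair.

    Assumption: the highest-scored caption is 'chosen' and the lowest-scored
    one is 'rejected'. Trajectories with no score spread carry no signal.
    """
    if len(trajectory) < 2:
        return None
    best = max(trajectory, key=lambda cs: cs[1])
    worst = min(trajectory, key=lambda cs: cs[1])
    if best[1] == worst[1]:
        return None
    return Pair(video_id, best[0], worst[0], best[1] - worst[1])


def easy_to_hard(pairs: list) -> list:
    """Assumption: a larger score gap means an easier pair, so it comes first."""
    return sorted(pairs, key=lambda p: p.gap, reverse=True)


# Toy trajectories; the captions and scores are made up for illustration.
trajectories = {
    "vid_a": [("a man walks", 55.0),
              ("a man in a red coat walks a dog along a pier", 90.0)],
    "vid_b": [("two cars", 70.0), ("two cars wait at a light", 78.0), ("cars", 60.0)],
    "vid_c": [("a cat", 80.0)],  # single caption: dropped
}

pairs = []
for vid, traj in trajectories.items():
    pair = trajectory_to_pair(vid, traj)
    if pair is not None:
        pairs.append(pair)

curriculum = easy_to_hard(pairs)
for p in curriculum:
    print(p.video_id, p.gap)
```

The ordered list could then be fed to a preference-optimization trainer in stages; the staging itself is outside what the abstract specifies.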

Summary

VDC-Agent is a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. It forms a closed loop of caption generation, principle-guided scoring, and prompt refinement, with a self-reflection path that amends the update when caption quality regresses. Running the loop on unlabeled videos yields (caption, score) trajectories, which are converted into 18,886 automatically constructed preference pairs (VDC-Agent-19K) and used to fine-tune the base MLLM with easy-to-hard curriculum direct preference optimization. The resulting VDC-Agent-7B reaches 49.08% average accuracy and a 2.50 score on the VDC benchmark, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score.

Are Image-to-Video Models Good Zero-Shot Image Editors?

Authors: Zechuan Zhang, Zhenyuan Chen, Zongxin Yang, Yi Yang

First: 2025-11-24T18:59:54+00:00 · Latest: 2025-11-24T18:59:54+00:00

Comments: technical report

Abstract

Large-scale video diffusion models show strong world simulation and temporal reasoning abilities, but their use as zero-shot image editors remains underexplored. We introduce IF-Edit, a tuning-free framework that repurposes pretrained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three key challenges: prompt misalignment, redundant temporal latents, and blurry late-stage frames. It includes (1) a chain-of-thought prompt enhancement module that transforms static editing instructions into temporally grounded reasoning prompts; (2) a temporal latent dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving semantic and temporal coherence; and (3) a self-consistent post-refinement step that sharpens late-stage frames using a short still-video trajectory. Experiments on four public benchmarks, covering non-rigid editing, physical and temporal reasoning, and general instruction edits, show that IF-Edit performs strongly on reasoning-centric tasks while remaining competitive on general-purpose edits. Our study provides a systematic view of video diffusion models as image editors and highlights a simple recipe for unified video-image generative reasoning.
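
The idea of compressing frame latents partway through denoising can be sketched as follows. This is only a toy illustration under stated assumptions: the abstract does not describe how frames are selected, so keeping every `stride`-th frame plus the last one is an assumption, `switch_step` merely stands in for the expert-switch point, and the denoiser is replaced by an identity placeholder. All function names are hypothetical.

```python
def temporal_latent_dropout(frame_latents: list, stride: int = 2):
    """Keep every `stride`-th frame latent and always keep the last frame.

    Returns the kept latents and their positions in the input list.
    """
    n = len(frame_latents)
    keep = sorted(set(range(0, n, stride)) | {n - 1})
    return [frame_latents[i] for i in keep], keep


def denoise(frame_latents: list, num_steps: int, switch_step: int, stride: int = 2):
    """Run a placeholder denoising loop that drops frames once at `switch_step`."""
    kept_indices = list(range(len(frame_latents)))
    for step in range(num_steps):
        if step == switch_step:
            frame_latents, keep = temporal_latent_dropout(frame_latents, stride)
            kept_indices = [kept_indices[i] for i in keep]
        # Placeholder for one denoising update (identity here); a real model
        # would update each latent, now over fewer frames after the switch.
        frame_latents = list(frame_latents)
    return frame_latents, kept_indices


latents = [f"z{i}" for i in range(8)]  # strings stand in for latent tensors
final, kept = denoise(latents, num_steps=10, switch_step=4, stride=3)
print(kept, final)
```

Every step after the switch then operates on fewer frame latents, which is where a speedup would come from; the final frame is retained because it is the one that becomes the edited image in this sketch.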

Summary

The study examines whether image-to-video diffusion models can serve as zero-shot image editors. IF-Edit, a tuning-free framework, targets three challenges (prompt misalignment, redundant temporal latents, and blurry late-stage frames) with a chain-of-thought prompt enhancement module, a temporal latent dropout strategy, and a self-consistent post-refinement step. Experiments on four public benchmarks show strong performance on reasoning-centric tasks and competitive results on general-purpose edits.

Mixture of Horizons in Action Chunking

Authors: Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

First: 2025-11-24T18:59:51+00:00 · Latest: 2025-11-24T18:59:51+00:00

Comments: 15 pages, 14 figures

Abstract

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the action chunk length used during training, termed the horizon. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a mixture of horizons (MoH) strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over the flow-based policies $\pi_0$ and $\pi_{0.5}$ and the one-step regression policy $\pi_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state of the art with a 99% average success rate on LIBERO after only 30k training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
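
Fusing per-horizon predictions into one action chunk can be illustrated with the toy sketch below. It is not the MoH implementation: actions are scalars rather than vectors, the fixed `gate_logits` stand in for the learned linear gate, and normalizing the gate only over the horizons that cover a given step is an assumption. `fuse_horizons` is a hypothetical name.

```python
import math


def fuse_horizons(predictions: dict, gate_logits: dict) -> list:
    """Fuse action predictions made at several horizons into one chunk.

    predictions: {horizon: [a_0, ..., a_{horizon-1}]} with scalar actions.
    gate_logits: {horizon: logit}, a stand-in for a learned gate.
    At step t, only horizons longer than t contribute; their gate weights
    are softmax-normalized over that subset (an assumption of this sketch).
    """
    max_h = max(predictions)
    fused = []
    for t in range(max_h):
        covering = [h for h in predictions if t < h]
        weights = [math.exp(gate_logits[h]) for h in covering]
        total = sum(weights)
        fused.append(
            sum(w * predictions[h][t] for w, h in zip(weights, covering)) / total
        )
    return fused


# Two horizons: a short one (2 steps) and a long one (4 steps).
preds = {2: [1.0, 1.0], 4: [3.0, 3.0, 3.0, 3.0]}
fused = fuse_horizons(preds, {2: 0.0, 4: 0.0})
print(fused)
```

With equal gate logits, the first two steps average both horizons and the last two fall back to the long horizon alone, which shows how short-horizon precision and long-horizon foresight could be combined in one chunk.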

Summary

The study shows that vision-language-action models for robotic manipulation are sensitive to the action chunk length (horizon) used in training: longer horizons give stronger global foresight, while shorter ones give finer local control. The proposed mixture of horizons (MoH) strategy splits the action chunk into segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. MoH yields consistent gains across the $\pi_0$, $\pi_{0.5}$, and $\pi_{\text{reg}}$ policies on simulated and real-world tasks. Its dynamic inference reaches 2.5$\times$ higher throughput than baselines, and $\pi_{0.5}$ with MoH attains a 99% average success rate on LIBERO after only 30k training iterations.

Cloud4D

Authors: Jacob Lin, Edward Gryspeerdt, Ronald Clark

Venue: NeurIPS 2025 Spotlight

First: 2025-11-24T18:59:37+00:00 · Latest: 2025-11-24T18:59:37+00:00

Comments: NeurIPS 2025 Spotlight, project page: https://cloud4d.jacob-lin.com/

Abstract

There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.

Summary

Cloud4D is the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. A homography-guided 2D-to-3D transformer infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution, and tracking these retrievals over time additionally yields horizontal wind vectors. Across a two-month deployment with six skyward cameras, Cloud4D delivers an order-of-magnitude improvement in space-time resolution over state-of-the-art satellite measurements, with a relative error below 10% against collocated radar measurements.

Cost-Aware Contrastive Routing for LLMs

Authors: Reza Shirkavand, Shangqian Gao, Peiran Yu, Heng Huang

First: 2025-08-17T20:16:44+00:00 · Latest: 2025-11-24T18:59:36+00:00

Abstract

We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
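
Routing as a nearest-neighbor lookup in a shared embedding space can be sketched as below. This is a brute-force stand-in, not CSCR: the abstract uses a FAISS index and a trained contrastive encoder, whereas here the embeddings are hand-written 2-D vectors, the search is a plain sort by cosine similarity, and picking the cheapest expert among the k nearest is an assumed selection rule. The expert names and costs are made up.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route(prompt_vec: list, experts: list, k: int = 2) -> str:
    """Return the cheapest expert among the k nearest to the prompt.

    experts: list of (name, embedding, cost) tuples living in the same
    space as prompt_vec. Brute-force search replaces an ANN index here.
    """
    nearest = sorted(
        experts, key=lambda e: cosine(prompt_vec, e[1]), reverse=True
    )[:k]
    return min(nearest, key=lambda e: e[2])[0]


experts = [
    ("small-model", [0.9, 0.1], 1.0),
    ("medium-model", [0.7, 0.7], 5.0),
    ("large-model", [0.1, 0.9], 20.0),
]

easy = route([1.0, 0.0], experts)  # prompt embedded near the small model
hard = route([0.0, 1.0], experts)  # prompt embedded near the large model
print(easy, hard)

# Adding an expert only appends one entry; nothing is retrained.
experts.append(("tiny-model", [0.95, 0.05], 0.2))
print(route([1.0, 0.0], experts))
```

Because routing is only a lookup, changing the expert pool amounts to editing the list (or index), which mirrors the abstract's claim that no retraining is needed when experts change.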

Summary

The work studies cost-aware routing for large language models across diverse and dynamic model pools. It introduces Cost-Spectrum Contrastive Routing (CSCR), which maps prompts and models into a shared embedding space, using logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands, so that routing at inference time reduces to a single k-NN lookup with microsecond latency and no retraining when the expert pool changes. Across multiple benchmarks, CSCR improves the accuracy-cost tradeoff by up to 25% over baselines and generalizes to unseen LLMs and out-of-distribution prompts.

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Authors: Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

Venue: AAAI 2026 Oral

First: 2025-11-24T18:59:17+00:00 · Latest: 2025-11-24T18:59:17+00:00

Comments: Accepted to AAAI 2026 (Oral). The code is available at https://github.com/H-EmbodVis/GRANT

Abstract

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
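
The gain from parallelizable subtasks can be made concrete with a small worked example. The model below is an assumption for illustration, not the ORS3D formulation: each subtask has a hands-on part that occupies the agent and an unattended part (such as a microwave running) that does not, and the schedule uses a longest-wait-first greedy rule. The subtask names and durations are made up.

```python
def sequential_time(subtasks: list) -> int:
    """Total time if the agent idles through every unattended wait."""
    return sum(active + wait for _, active, wait in subtasks)


def parallel_time(subtasks: list) -> int:
    """Completion time when unattended waits overlap with later hands-on work.

    subtasks: list of (name, active_minutes, wait_minutes).
    Greedy rule (an assumption of this sketch): start the subtasks with the
    longest unattended wait first, then do hands-on parts back to back.
    """
    order = sorted(subtasks, key=lambda s: s[2], reverse=True)
    clock = 0   # time at which the agent becomes free
    finish = 0  # time at which the last subtask completes
    for _, active, wait in order:
        clock += active
        finish = max(finish, clock + wait)
    return finish


subtasks = [
    ("clean sink", 4, 0),
    ("microwave food", 1, 5),  # 1 min to start it, 5 min unattended
    ("wipe table", 2, 0),
]

print(sequential_time(subtasks))  # 12
print(parallel_time(subtasks))    # 7
```

Starting the microwave first lets its 5 unattended minutes overlap with cleaning the sink, so the whole job takes 7 minutes instead of 12.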

Summary

This work introduces ORS3D, a task-scheduling problem for embodied AI that combines language understanding, 3D grounding, and efficiency optimization: agents must minimize total completion time by exploiting parallelizable subtasks. To support it, the authors construct ORS3D-60K, a dataset of 60K composite tasks across 4K real-world scenes, and propose GRANT, an embodied multi-modal large language model with a scheduling token mechanism that generates efficient task schedules and grounded actions. Experiments on ORS3D-60K validate GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is publicly available.

Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

Authors: Yun Zhou, Yaoting Wang, Guangquan Jie, Jinyu Liu, Henghui Ding

First: 2025-11-24T18:58:22+00:00 · Latest: 2025-11-24T18:58:22+00:00

Comments: Code: https://github.com/FudanCVL/Ref-SAM3D

Abstract

SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.

Summary

SAM3D cannot reconstruct a specific object referred to by a textual description, a capability needed for applications such as 3D editing, game development, and virtual environments. Ref-SAM3D extends SAM3D by incorporating the textual description as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. In extensive qualitative experiments, Ref-SAM3D delivers competitive, high-fidelity zero-shot reconstruction guided only by natural language and a single 2D view, bridging 2D visual cues and 3D geometric understanding.

Prompt Less, Smile More: MTP with Semantic Engineering in Lieu of Prompt Engineering

Authors: Jayanaka L. Dantanarayana, Savini Kashmira, Thakee Nathees, Zichen Zhang, Krisztian Flautner, Lingjia Tang, Jason Mars

First: 2025-11-24T18:58:22+00:00 · Latest: 2025-11-24T18:58:22+00:00

Abstract

AI-Integrated programming is emerging as a foundational paradigm for building intelligent systems with large language models (LLMs). Recent approaches such as Meaning Typed Programming (MTP) automate prompt generation by leveraging the semantics already present in code. However, many real-world applications depend on contextual cues, developer intent, and domain-specific reasoning that extend beyond what static code semantics alone can express. To address this limitation, we introduce Semantic Engineering, a lightweight method for enriching program semantics so that LLM-based systems can more accurately reflect developer intent without requiring full manual prompt design. We present Semantic Context Annotations (SemTexts), a language-level mechanism that allows developers to embed natural-language context directly into program constructs. Integrated into the Jac programming language, Semantic Engineering extends MTP to incorporate these enriched semantics during prompt generation. We further introduce a benchmark suite designed to reflect realistic AI-Integrated application scenarios. Our evaluation shows that Semantic Engineering substantially improves prompt fidelity, achieving performance comparable to Prompt Engineering while requiring significantly less developer effort.

Summary

The paper addresses a limitation of Meaning Typed Programming (MTP): static code semantics alone cannot express contextual cues, developer intent, and domain-specific reasoning. It introduces Semantic Engineering, a lightweight method that enriches program semantics through Semantic Context Annotations (SemTexts), a language-level mechanism for embedding natural-language context directly into program constructs. Integrated into the Jac programming language, the method extends MTP to use these enriched semantics during prompt generation. On a benchmark suite of realistic AI-Integrated applications, Semantic Engineering substantially improves prompt fidelity and achieves performance comparable to Prompt Engineering with significantly less developer effort.

SLMFix: Leveraging Small Language Models for Error Fixing with Reinforcement Learning

Authors: David Jiahao Fu, Aryan Gupta, Aaron Councilman, David Grove, Yu-Xiong Wang, Vikram Adve

First: 2025-11-24T18:56:47+00:00 · Latest: 2025-11-24T18:56:47+00:00

Abstract

Recent advancements in large language models (LLMs) have shown very impressive capabilities in code generation across many programming languages. However, even state-of-the-art LLMs generate programs that contain syntactic errors and fail to complete the given tasks, especially for low-resource programming languages (LRPLs). In addition, high training cost makes finetuning LLMs unaffordable with constrained computational resources, further undermining the effectiveness of LLMs for code generation. In this work, we propose SLMFix, a novel code generation pipeline that leverages a small language model (SLM) finetuned using reinforcement learning (RL) techniques to fix syntactic errors in LLM-generated programs, improving their quality for domain-specific languages (DSLs). Specifically, we apply RL to the SLM for the program repair task, using a reward calculated from both a static validator and a static semantic similarity metric. Our experimental results demonstrate the effectiveness and generalizability of our approach across multiple DSLs, achieving a pass rate of more than 95% on the static validator. Notably, SLMFix brings substantial improvement to the base model and outperforms the supervised finetuning approach even for 7B models on an LRPL, showing the potential of our approach as an alternative to traditional finetuning approaches.
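
A reward that combines a static validity check with a similarity score can be sketched as below. This is an illustrative stand-in, not the SLMFix reward: Python's own parser replaces a DSL validator, `difflib` measures textual rather than semantic similarity, and the 0.5 weighting is an arbitrary assumption. The function names are hypothetical.

```python
import ast
import difflib


def passes_validator(program: str) -> bool:
    """Stand-in static validator: does the program parse as Python?"""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False


def similarity(candidate: str, reference: str) -> float:
    """Stand-in similarity in [0, 1]; textual, not semantic."""
    return difflib.SequenceMatcher(None, candidate, reference).ratio()


def repair_reward(candidate: str, reference: str, validity_weight: float = 0.5) -> float:
    """Weighted sum of validity (0 or 1) and similarity to a reference program."""
    valid = 1.0 if passes_validator(candidate) else 0.0
    return validity_weight * valid + (1.0 - validity_weight) * similarity(
        candidate, reference
    )


reference = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b)\n    return a + b\n"    # missing colon
repaired = "def add(a, b):\n    return a + b\n"

print(repair_reward(broken, reference))
print(repair_reward(repaired, reference))
```

The similarity term keeps a repair model from earning reward by emitting any valid but unrelated program, while the validity term rewards actually fixing the syntax error.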

Summary

SLMFix is a code generation pipeline in which a small language model (SLM), fine-tuned with reinforcement learning (RL), fixes syntactic errors in programs generated by large language models (LLMs) for domain-specific languages. The RL reward for the repair task combines a static validator with a static semantic similarity metric. Experiments across multiple DSLs show a pass rate above 95% on the static validator. SLMFix substantially improves the base model and outperforms supervised fine-tuning even for 7B models on a low-resource programming language.

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Authors: Yiming Qin, Bomin Wei, Jiaxin Ge, Konstantinos Kallidromitis, Stephanie Fu, Trevor Darrell, Xudong Wang

First: 2025-11-24T18:55:19+00:00 · Latest: 2025-11-24T18:55:19+00:00

Comments: Project page: https://wakalsprojectpage.github.io/comt-website/

Abstract

Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens, compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth, segmentation, edges, and DINO features). At inference, the model reasons directly in the continuous visual token space, preserving efficiency while optionally decoding dense predictions for interpretability. Evaluated across more than ten diverse perception benchmarks, including CV-Bench, MMVP, RealWorldQA, MMStar, WorldMedQA, and HRBench, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA consistently improves performance by 3% to 16% and demonstrates that compact continuous visual thinking enables more precise, grounded, and interpretable multimodal intelligence.

Summary

The work aims to improve the perceptual understanding of VLMs with Chain-of-Visual-Thought (COVT), which lets a model reason through roughly 20 continuous visual tokens distilled from lightweight vision experts. During training, the VLM autoregressively predicts these tokens to reconstruct dense supervision signals such as depth, segmentation, edges, and DINO features. At inference, it reasons directly in the visual token space and can optionally decode dense predictions for interpretability. Across more than ten perception benchmarks, integrating COVT into strong VLMs such as Qwen2.5-VL and LLaVA improves performance by 3% to 16%.

Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

Authors: James Y. Huang, Sheng Zhang, Qianchu Liu, Guanghui Qin, Tinghui Zhu, Tristan Naumann, Muhao Chen, Hoifung Poon

First: 2025-11-24T18:55:16+00:00 · Latest: 2025-11-24T18:55:16+00:00

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in challenging, knowledge-intensive reasoning tasks. However, extending LLMs to perceive and reason over a new modality (e.g., vision) often requires costly development of large-scale vision language models (VLMs) with LLMs as backbones. Smaller VLMs are more efficient and adaptable but often lack the broad knowledge and reasoning capabilities of frontier LLMs. In this work, we propose BeMyEyes, a modular, multi-agent framework for extending LLMs to multimodal reasoning by orchestrating collaboration between efficient, adaptable VLMs as perceivers and powerful LLMs as reasoners through conversations. We then introduce a data synthesis and supervised fine-tuning pipeline to train the perceiver agent to effectively collaborate with the reasoner agent. By combining the complementary strengths of perception and reasoning agents, BeMyEyes avoids the need for training large-scale multimodal models, preserves the generalization and reasoning capabilities of LLMs, and allows flexible extension to new domains and modalities. Experiments show that our framework unlocks multimodal reasoning capabilities for LLMs, enabling a lightweight and fully open-source solution, i.e., text-only DeepSeek-R1 equipped with a Qwen2.5-VL-7B perceiver, to outperform large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks. These results demonstrate the effectiveness, modularity, and scalability of our multi-agent approach for building future multimodal reasoning systems.

Summary

BeMyEyes is a modular multi-agent framework that extends large language models (LLMs) to multimodal reasoning through conversations between an efficient vision language model (VLM) acting as perceiver and a powerful LLM acting as reasoner. A data synthesis and supervised fine-tuning pipeline trains the perceiver to collaborate effectively with the reasoner. Experiments show that a lightweight, fully open-source combination (text-only DeepSeek-R1 with a Qwen2.5-VL-7B perceiver) outperforms large-scale proprietary VLMs such as GPT-4o on a wide range of knowledge-intensive multimodal tasks, while preserving the reasoning capabilities of the LLM and allowing flexible extension to new domains and modalities.

The Loss of Control Playbook: Degrees, Dynamics, and Preparedness

Authors: Charlotte Stix, Annika Hallensleben, Alejandro Ortega, Matteo Pistillo

First: 2025-11-19T20:10:39+00:00 · Latest: 2025-11-24T18:52:00+00:00

Abstract

This research report addresses the absence of an actionable definition for Loss of Control (LoC) in AI systems by developing a novel taxonomy and preparedness framework. Despite increasing policy and research attention, existing LoC definitions vary significantly in scope and timeline, hindering effective LoC assessment and mitigation. To address this issue, we draw from an extensive literature review and propose a graded LoC taxonomy, based on the metrics of severity and persistence, that distinguishes between Deviation, Bounded LoC, and Strict LoC. We model pathways toward a societal state of vulnerability in which sufficiently advanced AI systems have acquired or could acquire the means to cause Bounded or Strict LoC once a catalyst, either misalignment or pure malfunction, materializes. We argue that this state becomes increasingly likely over time, absent strategic intervention, and propose a strategy to avoid reaching a state of vulnerability. Rather than focusing solely on intervening on AI capabilities and propensities potentially relevant for LoC or on preventing potential catalysts, we introduce a complementary framework that emphasizes three extrinsic factors: Deployment context, Affordances, and Permissions (the DAP framework). Compared to work on intrinsic factors and catalysts, this framework has the unfair advantage of being actionable today. Finally, we put forward a plan to maintain preparedness and prevent the occurrence of LoC outcomes should a state of societal vulnerability be reached, focusing on governance measures (threat modeling, deployment policies, emergency response) and technical controls (pre-deployment testing, control measures, monitoring) that could maintain a condition of perennial suspension.

中文标题/摘要

标题：失控手册：程度、动态与准备

本研究报告通过开发一种新颖的分类法和准备框架，解决了人工智能系统中缺乏可操作的失控（LoC）定义的问题。尽管政策和研究的关注度不断增加，现有的LoC定义在范围和时间上差异显著，阻碍了有效的LoC评估和缓解。为解决这一问题，我们借鉴了广泛文献综述，并提出了一种基于严重性和持续性的分级LoC分类法，区分了偏差、有界失控和严格失控。我们建模了通向社会脆弱状态的路径，在这种状态下，足够先进的AI系统一旦出现催化剂（无论是失准还是纯粹故障），就可能获得或能够获得导致有界或严格失控的手段。我们认为，在缺乏战略干预的情况下，这种状态随着时间的推移变得越来越可能。我们提出了一个策略，以避免达到脆弱状态。我们不仅关注干预可能与LoC相关的AI能力和倾向，或防止潜在催化剂，还引入了一个互补框架，强调三个外在因素：部署背景、便利条件和许可（DAP框架）。与内在因素和催化剂的工作相比，该框架今天具有不可比拟的可操作性。最后，我们提出了一项计划，以维持准备状态并防止在社会脆弱状态下发生LoC结果，重点关注治理措施（威胁建模、部署政策、应急响应）和技术控制（预部署测试、控制措施、监控），以维持一种持久的暂停状态。

Summary / 总结

This research develops a novel taxonomy and preparedness framework for Loss of Control (LoC) in AI systems to address the lack of an actionable definition. By distinguishing between Deviation, Bounded LoC, and Strict LoC based on severity and persistence, the study models pathways to societal vulnerability and proposes a DAP framework focusing on Deployment context, Affordances, and Permissions to enhance preparedness. The findings suggest that strategic intervention is necessary to avoid reaching a state of vulnerability and that a combination of governance measures and technical controls can maintain a condition of perennial suspension.

研究开发了一种新的LoC分类和准备框架，以解决AI系统中缺乏明确定义的问题。通过根据严重性和持续性区分偏离、有界LoC和严格LoC，研究指出了可能导致社会脆弱性的路径，即先进的AI可能造成重大危害。提出的DAP框架侧重于部署背景、便利性和许可三个外部因素，以增强准备性。关键发现包括在缺乏战略干预的情况下，社会脆弱性变得越来越可能，并且需要治理措施和技术控制来维持持续的暂停状态。

SING: SDE Inference via Natural Gradients

Authors: Amber Hu, Henry Smith, Scott Linderman

Venue: NeurIPS

First: 2025-06-21T19:36:11+00:00 · Latest: 2025-11-24T18:49:51+00:00

Comments: To appear in Advances in Neural Information Processing Systems (NeurIPS), 2025

Abstract

Latent stochastic differential equation (SDE) models are important tools for the unsupervised discovery of dynamical systems from data, with applications ranging from engineering to neuroscience. In these complex domains, exact posterior inference of the latent state path is typically intractable, motivating the use of approximate methods such as variational inference (VI). However, existing VI methods for inference in latent SDEs often suffer from slow convergence and numerical instability. We propose SDE Inference via Natural Gradients (SING), a method that leverages natural gradient VI to efficiently exploit the underlying geometry of the model and variational posterior. SING enables fast and reliable inference in latent SDE models by approximating intractable integrals and parallelizing computations in time. We provide theoretical guarantees that SING approximately optimizes the intractable, continuous-time objective of interest. Moreover, we demonstrate that better state inference enables more accurate estimation of nonlinear drift functions using, for example, Gaussian process SDE models. SING outperforms prior methods in state inference and drift estimation on a variety of datasets, including a challenging application to modeling neural dynamics in freely behaving animals. Altogether, our results illustrate the potential of SING as a tool for accurate inference in complex dynamical systems, especially those characterized by limited prior knowledge and non-conjugate structure.

中文标题/摘要

标题：SING：通过自然梯度进行隐含SDE推断

隐含随机微分方程（SDE）模型是用于从数据中无监督发现动力系统的重要工具，应用范围从工程到神经科学。在这些复杂的领域中，隐含状态路径的确切后验推断通常是不可行的，因此推动了近似方法如变分推断（VI）的应用。然而，现有的用于隐含SDE推断的VI方法往往存在收敛速度慢和数值不稳定的问题。我们提出了SDE推断通过自然梯度（SING）的方法，该方法利用自然梯度VI来高效地利用模型和变分后验的内在几何结构。SING通过近似不可积分和并行化时间计算，实现了隐含SDE模型的快速可靠推断。我们提供了理论保证，表明SING近似优化了目标的连续时间不可解目标。此外，我们证明了更好的状态推断能够更准确地估计非线性漂移函数，例如使用高斯过程SDE模型。SING在多种数据集上的状态推断和漂移估计性能优于先前的方法，包括对自由活动动物神经动力学建模这一具有挑战性的应用。总之，我们的结果展示了SING作为复杂动力系统准确推断工具的潜力，尤其是在先验知识有限和非共轭结构的情况下。

Summary / 总结

SING is a method for inferring latent state paths in stochastic differential equation (SDE) models using natural gradient variational inference, which improves convergence and numerical stability compared to existing methods. Theoretical guarantees show that SING approximately optimizes the intractable continuous-time objective, and experimental results demonstrate its superior performance in state inference and drift estimation across various datasets, including neural dynamics in freely behaving animals.

SING 是一种使用自然梯度变分推断来推断 SDE 模型中潜在状态路径的方法，相比现有方法，它能提高收敛性和数值稳定性。理论保证表明 SING 能优化连续时间目标，实验结果表明它在状态推断和漂移估计方面表现出色，包括自由行为动物的神经动力学数据集。
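
To make the natural-gradient idea in the abstract concrete, here is a minimal sketch on a toy conjugate Gaussian model. It is not the SING algorithm itself: it assumes a scalar latent x with prior N(m0, v0) and observations y_i ~ N(x, s2), and all function names are illustrative. In natural parameters, a natural-gradient step on the variational objective reduces to an interpolation between the current parameters and the prior-plus-likelihood target, so a step size of 1 recovers the exact posterior in this conjugate case.

```python
def to_natural(m, v):
    """Convert mean/variance to natural parameters (m / v, -1 / (2 v))."""
    return m / v, -1.0 / (2.0 * v)


def to_mean_var(eta):
    """Convert natural parameters back to mean/variance."""
    v = -1.0 / (2.0 * eta[1])
    return eta[0] * v, v


def natural_gradient_step(eta, prior, ys, s2, rho):
    """One natural-gradient VI step for q(x) = N(m, v), step size rho."""
    # Target = prior natural parameters plus the likelihood contributions.
    target1 = prior[0] + sum(ys) / s2
    target2 = prior[1] - len(ys) / (2.0 * s2)
    # In natural parameters the step is a plain interpolation toward the target.
    return ((1 - rho) * eta[0] + rho * target1,
            (1 - rho) * eta[1] + rho * target2)


prior = to_natural(0.0, 1.0)   # x ~ N(0, 1)
ys = [1.0, 2.0, 3.0]           # toy observations
s2 = 1.0                       # known noise variance
eta = natural_gradient_step(to_natural(5.0, 10.0), prior, ys, s2, rho=1.0)
mean, var = to_mean_var(eta)   # matches the closed-form posterior N(1.5, 0.25)
```

In non-conjugate models such as the latent SDEs discussed above, the target is no longer available in closed form, which is where approximations of the intractable integrals come in.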

Learning Robust Social Strategies with Large Language Models

Authors: Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch, Aaron Courville

First: 2025-11-24T18:43:46+00:00 · Latest: 2025-11-24T18:43:46+00:00

Abstract

As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation and non-exploitability. We then introduce a group-relative baseline that simplifies advantage computation in iterated games, enabling multi-agent training at LLM scale. We also contribute a novel social dilemma environment, Trust and Split, which requires natural language communication to achieve high collective welfare. Across a wide range of social dilemmas, policies learned with Advantage Alignment achieve higher collective payoffs while remaining robust against exploitation by greedy agents.

中文标题/摘要

标题：使用大型语言模型学习稳健的社会策略

随着自主AI的普及，具有不同且可能冲突目标的代理将以复杂的方式相互作用。这些多代理交互提出了一个基本挑战，特别是在社会困境中，代理的个人激励可能会损害集体福利。虽然强化学习（RL）在单代理领域有效于使大型语言模型（LLMs）对齐，但先前的小网络结果表明，在多代理环境中标准RL通常会收敛到自私的策略。我们展示了在LLMs中存在相同效应：尽管有合作先验，RL训练的LLM代理会发展出机会主义行为，甚至可以利用先进的闭源模型。为了解决RL倾向于收敛到不良均衡的趋势，我们采用了一种最近的对手学习意识算法——优势对齐，对LLMs进行微调以促进多代理合作和不可剥削性。我们还引入了一种基于群体相对基准，简化了迭代游戏中优势计算的方法，从而在LLM规模上实现多代理训练。我们还贡献了一个新的社会困境环境——信任与分配，该环境需要自然语言交流以实现高集体福利。在一系列社会困境中，使用优势对齐学习到的策略实现了更高的集体收益，同时保持了对贪婪代理的抗剥削性。

Summary / 总结

The research aims to develop robust social strategies for large language models (LLMs) in multi-agent interactions, where standard reinforcement learning (RL) often leads to self-interested policies. To address this, the study adapts Advantage Alignment, an opponent-learning awareness algorithm, to fine-tune LLMs for cooperation and non-exploitability. The Trust and Split environment, requiring natural language communication, was developed to test these policies. Results show that Advantage Alignment-trained LLMs achieve higher collective payoffs in various social dilemmas and remain robust against exploitation by greedy agents.

该研究旨在解决大型语言模型（LLMs）在多智能体社会困境中的对齐问题，其中标准强化学习（RL）通常会导致自私行为。为克服这一问题，研究人员采用了一种名为Advantage Alignment的对手学习意识算法，以促进合作和不可被利用性。他们引入了一个新的社会困境环境，Trust and Split，要求通过自然语言交流实现高集体福利。结果表明，使用Advantage Alignment训练的LLMs在各种社会困境中实现了更高的集体收益，并且更能抵抗贪婪智能体的利用。
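
The abstract mentions a group-relative baseline that simplifies advantage computation but does not give its exact form. The sketch below assumes the simplest variant, in which each rollout's advantage is its return minus the mean return of all rollouts in the same group; the function name is hypothetical and the paper's formulation may differ.

```python
def group_relative_advantages(returns):
    """Subtract the group mean return from each rollout's return."""
    if not returns:
        return []
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]


# Three rollouts of the same iterated game, with total returns 3, 1 and 2.
advantages = group_relative_advantages([3.0, 1.0, 2.0])
```

Because the baseline comes from the group itself, no separate value network is needed, which is one plausible reason such a baseline helps at LLM scale.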

In-Video Instructions: Visual Signals as Generative Control

Authors: Gongfan Fang, Xinyin Ma, Xinchao Wang

First: 2025-11-24T18:38:45+00:00 · Latest: 2025-11-24T18:38:45+00:00

Abstract

Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.

中文标题/摘要

标题：视频内指令：视觉信号作为生成控制

大规模视频生成模型最近展示了强大的视觉能力，能够预测符合当前观察中的逻辑和物理线索的未来帧。在本文中，我们研究了这些能力是否可以用于可控的图像到视频生成，通过将嵌入在帧中的视觉信号解释为指令，我们提出了一种称为视频内指令的范式。与基于提示的控制相比，后者提供的是全局且粗糙的文本描述，视频内指令直接将用户指导编码到视觉领域，通过叠加文本、箭头或轨迹等元素。这使得视觉主题与其预期动作之间具有明确的空间意识和无歧义的对应关系，通过为不同的对象分配不同的指令。在三个最先进的生成器（包括Veo 3.1、Kling 2.5和Wan 2.2）上的广泛实验表明，视频模型可以可靠地解释和执行这些嵌入在视觉中的指令，尤其是在复杂的多对象场景中。

Summary / 总结

This study explores the use of visual signals within video frames as instructions for generating future frames, termed In-Video Instruction. Unlike text-based prompts, these visual cues are spatially explicit and can guide the generation of actions for specific objects. Experiments on three advanced video generators show that these models can reliably interpret and follow visual instructions, especially in scenarios involving multiple objects.

本研究探讨了将视觉信号嵌入视频帧作为控制图像到视频生成的指令，称为In-Video Instruction。与使用全局文本描述的提示控制不同，In-Video Instruction直接将用户指导嵌入视觉领域。实验表明，Veo 3.1、Kling 2.5和Wan 2.2等三种先进视频生成器能够可靠地解释和执行嵌入的视觉指令，尤其是在复杂的多对象场景中。

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Authors: Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

First: 2025-11-24T18:35:54+00:00 · Latest: 2025-11-24T18:35:54+00:00

Abstract

Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

中文标题/摘要

标题：DR Tulu：基于演化评分标准的强化学习用于深度研究

深度研究模型通过多步研究生成长篇、有据可查的答案。然而，大多数开放的深度研究模型是通过可验证奖励的强化学习（RLVR）在易于验证的短问答任务上训练的，这无法推广到现实中的长篇任务。我们通过基于演化评分标准的强化学习（RLER）来解决这一问题：我们构建并维护在训练过程中与策略模型共同演化的评分标准；这使得评分标准能够纳入模型新探索到的信息，并提供具有区分度的同策略（on-policy）反馈。使用RLER，我们开发了Deep Research Tulu（DR Tulu-8B），这是第一个直接针对开放式、长篇深度研究进行训练的开放模型。在科学、医疗保健和通用领域的四个长篇深度研究基准测试中，DR Tulu 显著优于现有的开放深度研究模型，并达到或超过专有的深度研究系统，同时模型规模明显更小、每次查询的成本明显更低。为了促进未来的研究，我们发布了所有数据、模型和代码，包括我们新的基于MCP的深度研究系统智能体基础设施。

Real-Time Object Tracking with On-Device Deep Learning for Adaptive Beamforming in Dynamic Acoustic Environments

Authors: Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin, Sandra Roger, Maximo Cobos

First: 2025-11-24T18:33:50+00:00 · Latest: 2025-11-24T18:33:50+00:00

Abstract

Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.

中文标题/摘要

标题：基于嵌入式深度学习的实时目标跟踪在动态声学环境中的自适应波束形成

目标跟踪和声学波束形成的进步正在推动监视、人机交互和机器人技术的新能力。本研究提出了一种嵌入式系统，该系统将基于深度学习的跟踪与波束形成相结合，以在动态环境中实现精确的声音源定位和定向音频捕获。该方法结合单目深度估计和立体视觉，以实现移动物体的精确三维定位。使用MEMS麦克风构建的平面同心圆麦克风阵列提供了一个紧凑、节能的平台，支持在方位和仰角上进行2D波束扫描。实时跟踪输出不断调整阵列的焦点，使声学响应与目标位置同步。通过结合学习的空间意识与动态扫描，系统在存在多个或移动声源的情况下保持稳健性能。实验评估表明，该设计在信号与干扰比方面取得了显著提升，使其非常适合用于视频会议、智能家居设备和辅助技术。

Summary / 总结

This work introduces an embedded system that integrates deep learning-based object tracking with acoustic beamforming to enhance sound source localization and directional audio capture in dynamic environments. The system uses single-camera depth estimation and stereo vision for 3D object localization and a MEMS microphone array for 2D beam steering. Real-time tracking outputs adapt the microphone array's focus, synchronizing the acoustic response with the target's position. Experimental results show significant improvements in signal-to-interference ratio, making the system suitable for applications like teleconferencing and smart home devices.

该研究提出了一种嵌入式系统，结合了基于深度学习的目标跟踪和声束形成技术，以实现动态环境中的精确声源定位和方向性音频捕获。系统利用单目深度估计和立体视觉进行三维目标定位，并使用MEMS麦克风阵列进行二维波束扫描。实时跟踪不断调整麦克风阵列的焦点，显著提高了信干比，使系统适用于如远程会议和智能家居设备等应用。关键发现包括在动态声学环境中显著提高的信干比，展示了系统的鲁棒性能。
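
The 2D beam steering described above can be illustrated with a far-field delay-and-sum sketch. It assumes a planar array in the z = 0 plane, a speed of sound of 343 m/s, and elevation measured from the array plane; the array geometry and function names are hypothetical and are not taken from the paper.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for room-temperature air


def concentric_array(radii, mics_per_ring):
    """Mic (x, y) positions in metres: one centre mic plus evenly spaced rings."""
    positions = [(0.0, 0.0)]
    for radius in radii:
        for k in range(mics_per_ring):
            angle = 2.0 * math.pi * k / mics_per_ring
            positions.append((radius * math.cos(angle), radius * math.sin(angle)))
    return positions


def steering_delays(positions, azimuth, elevation):
    """Far-field delay-and-sum delays in seconds; angles are in radians."""
    # Unit vector toward the source; only x and y matter for a planar array.
    ux = math.cos(elevation) * math.cos(azimuth)
    uy = math.cos(elevation) * math.sin(azimuth)
    # A mic closer to the source hears the wavefront earlier, so it is delayed
    # by (p . u) / c to line up with the centre reference.
    raw = [(x * ux + y * uy) / SPEED_OF_SOUND for x, y in positions]
    # Shift so that every delay is non-negative (causal).
    offset = min(raw)
    return [d - offset for d in raw]


mics = concentric_array(radii=[0.05], mics_per_ring=4)
delays = steering_delays(mics, azimuth=0.0, elevation=0.0)
```

In a tracking loop, the azimuth and elevation would be recomputed from the tracked 3D position on every frame and the delays updated accordingly.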

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Authors: Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh

Venue: NeurIPS 2025

First: 2025-08-04T21:29:07+00:00 · Latest: 2025-11-24T18:31:13+00:00

Comments: Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)

Abstract

Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent's own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.

中文标题/摘要

标题：传达计划，而非感知：基于具身世界模型的可扩展多智能体协调

在多智能体系统中，稳健的协调对于有效决策至关重要，尤其是在部分可观测的情况下。多智能体强化学习（MARL）中的一个核心问题是：应当人工设计通信协议，还是通过端到端方式学习协议。我们使用具身世界模型来研究这一二分法。我们针对一个合作任务分配问题提出并比较了两种通信策略。第一种，学习直接通信（LDC），通过端到端方式学习协议。第二种，意图通信，使用一种工程化的归纳偏置：一个紧凑的、通过学习得到的世界模型，即想象轨迹生成模块（ITGM），它使用智能体自身的策略来模拟未来状态。然后，消息生成网络（MGN）将这个计划压缩成一条消息。我们在网格世界中的目标导向交互任务上评估了这些方法，网格世界是具身人工智能问题的经典抽象，同时我们逐步提高环境复杂性。我们的实验表明，在简单环境中，涌现通信是可行的，但工程化、基于世界模型的方法在复杂性增加时表现出更优的性能、样本效率和可扩展性。这些发现支持将结构化的预测模型整合到MARL智能体中，以实现主动、目标驱动的协调。

Summary / 总结

This paper explores the effectiveness of different communication strategies in multi-agent systems, particularly in scenarios with partial observability. It compares Learned Direct Communication (LDC), which learns a communication protocol end-to-end, with Intention Communication, which uses an engineered inductive bias: an Imagined Trajectory Generation Module (ITGM) simulates future states with the agent's own policy, and a Message Generation Network (MGN) compresses the resulting plan into a message. The study evaluates these methods in a grid world task, demonstrating that the world model-based approach outperforms the end-to-end learned protocol in terms of performance, sample efficiency, and scalability with increasing complexity.

论文研究了在部分可观测性条件下不同通信策略在多智能体系统中的有效性。它比较了从头学习通信协议的Learned Direct Communication (LDC)方法与使用紧凑世界模型生成消息的工程化先验偏置的Intention Communication方法。研究在网格世界环境中评估了这些方法，结果显示，随着环境复杂性的增加，基于世界模型的方法表现更好且更具可扩展性。
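
The Intention Communication pipeline described above (imagine a trajectory with the agent's own policy, then compress it into a message) can be sketched with hand-written stand-ins. In the paper both the ITGM and the MGN are learned modules; the greedy policy, deterministic dynamics, and subsampling "compression" below are assumptions made only to show the data flow.

```python
def greedy_policy(pos, goal):
    """Stand-in policy: move one cell toward the goal, x first, then y."""
    (x, y), (gx, gy) = pos, goal
    if x != gx:
        return (1 if gx > x else -1, 0)
    if y != gy:
        return (0, 1 if gy > y else -1)
    return (0, 0)


def world_model(pos, action):
    """Stand-in dynamics: deterministic move on an unbounded grid."""
    return (pos[0] + action[0], pos[1] + action[1])


def imagine_trajectory(pos, goal, horizon):
    """Roll the policy forward inside the world model for `horizon` steps."""
    trajectory = []
    for _ in range(horizon):
        pos = world_model(pos, greedy_policy(pos, goal))
        trajectory.append(pos)
    return trajectory


def make_message(trajectory, stride=2):
    """Stand-in for message compression: keep every `stride`-th imagined state."""
    return trajectory[stride - 1::stride]


plan = imagine_trajectory((0, 0), (2, 1), horizon=4)
message = make_message(plan)
```

Other agents would receive `message` rather than raw observations, which is the "plans, not percepts" idea in the title.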

Predicting partially observable dynamical systems via diffusion models with a multiscale inference scheme

Authors: Rudy Morel, Francesco Pio Ramunno, Jeff Shen, Alberto Bietti, Kyunghyun Cho, Miles Cranmer, Siavash Golkar, Olexandr Gugnin, Geraud Krawezik, Tanya Marwah, Michael McCabe, Lucas Meyer, Payel Mukhopadhyay, Ruben Ohana, Liam Parker, Helen Qu, François Rozet, K. D. Leka, François Lanusse, David Fouhey, Shirley Ho

First: 2025-11-24T18:30:04+00:00 · Latest: 2025-11-24T18:30:04+00:00

Abstract

Conditional diffusion models provide a natural framework for probabilistic prediction of dynamical systems and have been successfully applied to fluid dynamics and weather prediction. However, in many settings, the available information at a given time represents only a small fraction of what is needed to predict future states, either due to measurement uncertainty or because only a small fraction of the state can be observed. This is true for example in solar physics, where we can observe the Sun's surface and atmosphere, but its evolution is driven by internal processes for which we lack direct measurements. In this paper, we tackle the probabilistic prediction of partially observable, long-memory dynamical systems, with applications to solar dynamics and the evolution of active regions. We show that standard inference schemes, such as autoregressive rollouts, fail to capture long-range dependencies in the data, largely because they do not integrate past information effectively. To overcome this, we propose a multiscale inference scheme for diffusion models, tailored to physical processes. Our method generates trajectories that are temporally fine-grained near the present and coarser as we move farther away, which enables capturing long-range temporal dependencies without increasing computational cost. When integrated into a diffusion model, we show that our inference scheme significantly reduces the bias of the predicted distributions and improves rollout stability.

中文标题/摘要

标题：通过多尺度推理方案的扩散模型预测部分可观测的动力系统

条件扩散模型为概率预测动力系统提供了自然框架，并已在流体动力学和天气预测中取得成功应用。然而，在许多情况下，给定时间可用的信息仅占预测未来状态所需信息的一小部分，这可能是由于测量不确定性或只能观察到状态的一部分。例如，在太阳物理学中，我们只能观测到太阳的表面和大气，但其演化是由我们无法直接测量的内部过程驱动的。在本文中，我们解决了部分可观测、长记忆动力系统的概率预测问题，应用于太阳动力学和活跃区域的演化。我们展示了标准的推理方案，如自回归展开，无法捕捉数据中的长期依赖性，主要是因为它们无法有效地整合过去的信息。为克服这一问题，我们提出了一种针对物理过程的多尺度推理方案，应用于扩散模型。我们的方法在靠近现在的时间上生成细粒度的轨迹，而在更远的时间上则更粗粒度，这使得在不增加计算成本的情况下能够捕捉长期时间依赖性。当整合到扩散模型中时，我们展示了我们的推理方案显著减少了预测分布的偏差并提高了展开的稳定性。

Summary / 总结

This paper addresses the challenge of predicting partially observable, long-memory dynamical systems, particularly in solar physics, where the Sun's surface and atmosphere can be observed but the internal processes that drive its evolution cannot be measured directly. The authors propose a multiscale inference scheme for diffusion models to capture long-range dependencies, which standard autoregressive rollouts fail to do. The method generates trajectories that are fine-grained near the present and coarser farther from it, reducing the bias of the predicted distributions and improving rollout stability without increasing computational cost.

本文针对太阳物理学中的部分可观测动态系统预测问题，提出了一种适用于扩散模型的多尺度推理方案。传统的推理方法难以捕捉长期依赖关系，但所提方案在接近当前时刻生成精细轨迹，而更远的时间段则生成较粗的轨迹，从而提高了预测准确性和稳定性，显著降低了预测分布的偏差并增强了扩散模型的性能。
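
The abstract describes trajectories that are temporally fine-grained near the present and coarser farther away, without stating the exact spacing. One simple way to build such a grid, assumed here purely for illustration, is to double the time step after every fixed number of frames; the function name and parameters are hypothetical.

```python
def multiscale_offsets(num_levels, frames_per_level):
    """Time offsets from the present: step 1 at level 0, step 2 at level 1, and so on."""
    offsets = []
    current = 0
    for level in range(num_levels):
        step = 2 ** level
        for _ in range(frames_per_level):
            current += step
            offsets.append(current)
    return offsets


offsets = multiscale_offsets(num_levels=3, frames_per_level=4)
```

With these settings, 12 frames cover a span of 28 time steps, whereas a uniform grid at the finest resolution would need 28 frames for the same span. This is the sense in which a multiscale grid can reach long-range dependencies at fixed cost.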

UISearch: Graph-Based Embeddings for Multimodal Enterprise UI Screenshots Retrieval

Authors: Maroun Ayli, Youssef Bakouny, Tushar Sharma, Nader Jalloul, Hani Seifeddine, Rima Kilany

First: 2025-11-24T18:20:08+00:00 · Latest: 2025-11-24T18:20:08+00:00

Comments: 12 pages, 2 figures, 3 algorithms, 4 tables

Abstract

Enterprise software companies maintain thousands of user interface screens across products and versions, creating critical challenges for design consistency, pattern discovery, and compliance checks. Existing approaches rely on visual similarity or text semantics, lacking explicit modeling of structural properties fundamental to user interface (UI) composition. We present a novel graph-based representation that converts UI screenshots into attributed graphs encoding hierarchical relationships and spatial arrangements, potentially generalizable to document layouts, architectural diagrams, and other structured visual domains. A contrastive graph autoencoder learns embeddings preserving multi-level similarity across visual, structural, and semantic properties. A comprehensive analysis demonstrates that our structural embeddings achieve better discriminative power than state-of-the-art Vision Encoders, representing a fundamental advance in the expressiveness of the UI representation. We implement this representation in UISearch, a multi-modal search framework that combines structural embeddings with semantic search through a composable query language. On 20,396 financial software UIs, UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency (P95: 124ms), scaling to 20,000+ screens. The hybrid indexing architecture enables complex queries and supports fine-grained UI distinction impossible with vision-only approaches.

中文标题/摘要

标题：UISearch：基于图的嵌入表示在多模态企业UI屏幕截图检索中的应用

企业软件公司维护着成千上万的产品和版本中的用户界面屏幕，这为企业设计一致性、模式发现和合规检查带来了关键挑战。现有方法依赖于视觉相似性或文本语义，缺乏对用户界面（UI）组成中至关重要的结构属性的显式建模。我们提出了一种新颖的基于图的表示方法，将UI屏幕截图转换为编码层次关系和空间布局的带属性图，该方法可能适用于文档布局、建筑图纸和其他结构化视觉领域。对比图自编码器学习保留多级视觉、结构和语义属性相似性的嵌入表示。全面的分析表明，我们的结构嵌入在区分能力上优于最先进的视觉编码器，代表了UI表示表达能力上的根本性进步。我们在此表示中实现了UISearch，这是一种多模态搜索框架，通过可组合查询语言结合结构嵌入和语义搜索。在20,396个金融软件UI上，UISearch实现了0.92的Top-5准确率，中位延迟为47.5毫秒（P95：124毫秒），可扩展到20,000多个屏幕。混合索引架构能够支持复杂查询，并支持与仅视觉方法无法实现的精细粒度的UI区分。

Summary / 总结

The research addresses the challenges of design consistency, pattern discovery, and compliance checks in enterprise software by proposing UISearch, a multimodal search framework. It uses a graph-based representation to encode UI screenshots, capturing hierarchical relationships and spatial arrangements. The contrastive graph autoencoder learns embeddings that preserve multi-level similarity, outperforming state-of-the-art Vision Encoders. UISearch achieves 0.92 Top-5 accuracy with 47.5ms median latency on 20,396 financial software UIs, demonstrating its effectiveness and scalability.

研究旨在通过开发基于图的表示方法来解决企业软件中的设计一致性、模式发现和合规检查问题。方法是将UI截图转换为包含层次关系和空间布局的标记图，并使用对比图自编码器来学习保留多级相似性的嵌入。关键发现表明，结构嵌入在区分能力上优于最先进的视觉编码器，而UISearch框架在大规模金融软件UI数据集中实现了高准确率和低延迟的搜索。
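
As a rough illustration of turning detected UI elements into an attributed graph with hierarchical relationships, the sketch below derives parent-child edges from bounding-box containment. The abstract does not describe how UISearch builds its graphs, so the `Element` fields and the smallest-enclosing-box rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    kind: str    # node attribute, e.g. "panel" or "button"
    box: tuple   # (left, top, right, bottom) in pixels


def contains(outer, inner):
    """True if box `outer` fully encloses box `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def build_hierarchy(elements):
    """Return {child_name: parent_name}, using the smallest enclosing box as parent."""
    parents = {}
    for child in elements:
        enclosing = [e for e in elements
                     if e is not child and contains(e.box, child.box)
                     and area(e.box) > area(child.box)]
        if enclosing:
            parents[child.name] = min(enclosing, key=lambda e: area(e.box)).name
    return parents


screen = [
    Element("screen", "window", (0, 0, 100, 100)),
    Element("form", "panel", (10, 10, 90, 60)),
    Element("ok", "button", (20, 40, 40, 50)),
    Element("footer", "panel", (0, 80, 100, 100)),
]
edges = build_hierarchy(screen)
```

A graph encoder would then consume these edges together with node attributes such as `kind` and `box`.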

Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics

Authors: Chenhao Li, Andreas Krause, Marco Hutter

First: 2025-01-17T10:39:09+00:00 · Latest: 2025-11-24T18:17:51+00:00

Abstract

Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.

中文标题/摘要

标题：机器人世界模型：一种用于机器人领域稳健策略优化的神经网络模拟器

学习稳健且通用的世界模型对于在真实环境中实现高效和可扩展的机器人控制至关重要。在本文中，我们提出了一种新的框架，用于学习能够准确捕捉复杂、部分可观测和随机动力学的世界模型。所提出的方法采用了一种双自回归机制和自我监督训练，以实现可靠的长时预测，而不依赖于特定领域的归纳偏置，从而确保在各种机器人任务中的适应性。我们还提出了一种策略优化框架，该框架利用世界模型在想象环境中高效训练，并无缝部署到实际系统中。本文通过解决长时预测、误差累积和仿真实用性转移的挑战，推进了基于模型的强化学习。通过提供一种可扩展且稳健的框架，所介绍的方法为实际应用中的适应性和高效机器人系统铺平了道路。

PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers

Authors: Yibo Zhong, Haoxiang Jiang, Lincan Li, Ryumei Nakada, Tianci Liu, Linjun Zhang, Huaxiu Yao, Haoyu Wang

First: 2024-10-02T17:29:23+00:00 · Latest: 2025-11-24T18:17:37+00:00

Abstract

Fine-tuning large pre-trained foundation models often yields excellent downstream performance but is prohibitively expensive when updating all parameters. Parameter-efficient fine-tuning (PEFT) methods such as LoRA alleviate this by introducing lightweight update modules, yet they commonly rely on weight-agnostic linear approximations, limiting their expressiveness. In this work, we propose PEANuT, a novel PEFT framework that introduces weight-aware neural tweakers, compact neural modules that generate task-adaptive updates conditioned on frozen pre-trained weights. PEANuT provides a flexible yet efficient way to capture complex update patterns without full model tuning. We theoretically show that PEANuT achieves equivalent or greater expressivity than existing linear PEFT methods with comparable or fewer parameters. Extensive experiments across four benchmarks with over twenty datasets demonstrate that PEANuT consistently outperforms strong baselines in both NLP and vision tasks, while maintaining low computational overhead.

中文标题/摘要

标题：PEANuT: 参数高效适应与权重感知神经微调器

对大型预训练基础模型进行微调通常能获得出色的下游性能，但当更新所有参数时成本高昂。参数高效微调（PEFT）方法如LoRA通过引入轻量级更新模块来缓解这一问题，但它们通常依赖于权重无关的线性近似，限制了其表达能力。在本文中，我们提出了一种新的PEFT框架PEANuT，该框架引入了权重感知神经微调器，这是一种紧凑的神经模块，根据冻结的预训练权重生成任务适应性更新。PEANuT提供了一种灵活且高效的捕获复杂更新模式的方法，无需完全模型微调。我们从理论上证明，PEANuT在参数数量相当或更少的情况下，其表达能力与现有线性PEFT方法相当或更优。在四个基准上的广泛实验，涵盖了超过二十个数据集，表明PEANuT在自然语言处理和视觉任务中始终优于强大的基线模型，同时保持较低的计算开销。

Summary / 总结

PEANuT is a parameter-efficient fine-tuning framework that introduces weight-aware neural tweakers to generate task-adaptive updates conditioned on frozen pre-trained weights. This approach provides a flexible and efficient way to capture complex update patterns without full model tuning. Experiments across multiple benchmarks show that PEANuT outperforms strong baselines in both NLP and vision tasks while maintaining low computational overhead.

PEANuT 是一种参数高效微调框架，通过引入基于权重的神经微调器来生成针对冻结预训练权重的任务自适应更新。这种方法提供了一种灵活且高效的方式来捕捉复杂的更新模式，而无需进行完整的模型调整。实验结果表明，PEANuT 在 NLP 和视觉任务中均优于强基线，并且保持了较低的计算开销。
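
The contrast between a weight-agnostic update (LoRA adds a low-rank product that does not depend on the frozen weight) and a weight-aware update can be sketched as follows. The abstract does not specify the tweaker architecture, so the form delta = relu(W0 @ A) @ B, the shapes, and all names are assumptions chosen for illustration.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def weight_aware_update(w0, a, b):
    """Assumed tweaker: delta = relu(W0 @ A) @ B, a nonlinear function of W0."""
    return matmul(relu(matmul(w0, a)), b)


w0 = [[1.0, -2.0], [0.5, 3.0]]   # frozen 2x2 pre-trained weight
a = [[1.0], [1.0]]               # trainable, shape 2x1
b = [[0.5, -1.0]]                # trainable, shape 1x2
delta = weight_aware_update(w0, a, b)
# Adapted weight used at inference: W0 + delta, with W0 itself left unchanged.
adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(w0, delta)]
```

Only `a` and `b` would be trained (4 parameters here), and the update changes whenever `w0` changes, unlike a LoRA-style product of two free matrices.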

SketchDeco: Training-Free Latent Composition for Precise Sketch Colourisation

Authors: Chaitat Utintu, Pinaki Nath Chowdhury, Aneeshan Sain, Subhadeep Koley, Ayan Kumar Bhunia, Yi-Zhe Song

First: 2024-05-29T02:53:59+00:00 · Latest: 2025-11-24T18:15:06+00:00

Comments: Project Page: https://chaitron.github.io/SketchDeco/

Abstract

We introduce SketchDeco, a training-free approach to sketch colourisation that bridges the gap between professional design needs and intuitive, region-based control. Our method empowers artists to use simple masks and colour palettes for precise spatial and chromatic specification, avoiding both the tediousness of manual assignment and the ambiguity of text-based prompts. We reformulate this task as a novel, training-free composition problem. Our core technical contribution is a guided latent-space blending process: we first leverage diffusion inversion to precisely "paint" user-defined colours into specified regions, and then use a custom self-attention mechanism to harmoniously blend these local edits with a globally consistent base image. This ensures both local colour fidelity and global harmony without requiring any model fine-tuning. Our system produces high-quality results in 15-20 inference steps on consumer GPUs, making professional-quality, controllable colourisation accessible.

中文标题/摘要

标题：SketchDeco：无需训练的隐空间组合用于精确素描着色

我们介绍了SketchDeco，这是一种无需训练的方法，用于素描着色，填补了专业设计需求与直观的区域控制之间的空白。我们的方法使艺术家能够使用简单的蒙版和颜色调色板进行精确的空间和色彩指定，避免了手动分配的繁琐和基于文本提示的模糊性。我们将此任务重新定义为一种新颖的、无需训练的组合问题。我们核心的技术贡献是一种引导下的隐空间混合过程：我们首先利用扩散反演精确地“绘制”用户定义的颜色到指定的区域，然后使用自定义的注意力机制将这些局部编辑和谐地与全局一致的基础图像融合。这确保了局部色彩的忠实性和全局的和谐性，而无需任何模型微调。我们的系统在消费级GPU上进行15-20步推理即可生成高质量的结果，使专业级别的、可控的着色变得可行。

Summary / 总结

SketchDeco is a training-free approach for sketch colorization that allows artists to use simple masks and color palettes for precise spatial and chromatic specification. The method reformulates the task as a composition problem and uses a guided latent-space blending process involving diffusion inversion and self-attention to achieve high-quality results. The system produces professional-quality, controllable colorization in 15-20 inference steps on consumer GPUs.

SketchDeco 是一种无需训练的方法，用于素描着色，允许艺术家使用简单的遮罩和颜色调色板进行精确的空间和色彩指定。该方法将任务重新定义为一个合成问题，并使用引导的潜空间混合过程，结合扩散反向将用户定义的颜色精确地绘制到指定区域，并使用自定义的注意力机制将这些局部编辑和谐地与全局一致的基础图像融合。这种方法确保了局部颜色的忠实性和全局和谐性，并在消费级 GPU 上以 15-20 个推理步骤生成高质量的结果。

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

Authors: Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan, Guangyu Wu, Zihan Fang, Haotian Meng, Xia Du, Ji-Zhe Zhou, Heming Cui, Jun Luo, Yue Gao

First: 2025-11-24T18:03:59+00:00 · Latest: 2025-11-24T18:03:59+00:00

Comments: 15 pages, 9 figures

Abstract

Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non-stationarity results in unstable training and poor policy convergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent's learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.

中文标题/摘要

标题：面向移动系统多智能体强化学习的LLM驱动平稳性感知专家演示

多智能体强化学习（MARL）在许多实际应用中被越来越多地采用。虽然MARL能够在资源受限的边缘设备上实现去中心化的部署，但由于智能体策略的同步更新，它存在严重的非平稳性问题。这种非平稳性导致训练不稳定和策略收敛不良，尤其是在智能体数量增加时更为明显。在本文中，我们提出了一种名为RELED的可扩展MARL框架，该框架将大型语言模型（LLM）驱动的专家演示与自主智能体探索相结合。RELED包含一个平稳性感知专家演示模块，该模块利用理论上的非平稳性界来提升LLM生成的专家轨迹的质量，从而为每个智能体提供高奖励且训练稳定的样本。此外，混合专家-智能体策略优化模块自适应地平衡每个智能体从专家生成轨迹和智能体生成轨迹中的学习，加速策略收敛并提高泛化能力。基于OpenStreetMap真实城市网络的大量实验表明，RELED的性能优于最先进的MARL方法。

An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification

Authors: Saniah Kayenat Chowdhury, Rusab Sarmun, Muhammad E. H. Chowdhury, Sohaib Bassam Zoghoul, Israa Al-Hashimi, Adam Mushtak, Amith Khandakar

First: 2025-11-24T18:01:47+00:00 · Latest: 2025-11-24T18:01:47+00:00

Abstract

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that is central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e., we measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.

中文标题/摘要

标题：一种基于解剖学的混合深度学习框架用于肺癌肿瘤分期分类

准确的肺癌肿瘤分期对于预后和治疗计划至关重要。然而，端到端的深度学习方法仍然面临挑战，因为这些方法往往忽略了肿瘤-淋巴结-转移系统中至关重要的空间和解剖信息。肿瘤分期依赖于多个定量标准，包括肿瘤大小及其与最近解剖结构的距离，微小的变化会影响分期结果。我们提出了一种基于医学原理的混合管道，通过明确测量肿瘤的大小和距离属性，而不是将其视为纯粹的图像分类任务来进行分期。我们的方法使用专门的编码-解码网络精确分割肺部及其邻近的解剖结构，包括肺叶、肿瘤、纵隔和膈肌。随后，我们提取必要的肿瘤属性，即测量肿瘤的最大尺寸，并通过分割掩模的定量分析计算肿瘤与邻近解剖结构之间的距离。最后，我们应用与医学指南一致的规则进行肿瘤分期。该新型框架在Lung-PET-CT-Dx数据集上进行了评估，其整体分类准确率达到了91.36%，优于传统的深度学习模型。我们报告了各分期的F1分数，分别为0.93（T1）、0.89（T2）、0.96（T3）和0.90（T4），这是先前文献中经常忽略的重要评估方面。据我们所知，这是首次将明确的临床背景嵌入到肿瘤分期分类中的研究。与标准卷积神经网络在不可解释的“黑箱”中运作不同，我们的方法提供了最先进的性能和透明的决策支持。

Summary / 总结

The research aims to improve the accuracy of lung cancer tumor staging by incorporating spatial and anatomical information, which is often neglected by end-to-end deep learning approaches. The proposed hybrid framework uses specialized encoder-decoder networks to segment the lung and its anatomical structures, followed by quantitative analysis to measure tumor properties and apply rule-based staging. The method achieves an overall classification accuracy of 91.36% and per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), outperforming traditional deep learning models.

论文针对准确分期肺癌肿瘤这一关键问题，提出了一个混合深度学习框架，该框架通过明确测量肿瘤大小和与解剖结构的距离，而不是将其视为纯粹的图像分类任务来实现。该方法使用专门的编码-解码网络来分割肺部及其邻近的解剖结构，并应用与医学指南一致的规则进行分期。该框架在Lung-PET-CT-Dx数据集上的总体分类准确率为91.36%，各分期的F1分数分别为0.93（T1）、0.89（T2）、0.96（T3）和0.90（T4）。
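
The measure-then-apply-rules step described in the abstract above can be illustrated with a small sketch. It is not the authors' code: the function names are hypothetical, masks are given as lists of voxel indices, and the size cut-offs (30, 50, 70 mm) are an assumption taken from the size descriptors of the TNM 8th edition, since the abstract does not state which thresholds are applied. Invasion-based criteria, which the paper handles through tumor-to-structure distances, are left out of the staging rule here.

```python
from itertools import combinations
from math import dist


def to_mm(voxel, spacing_mm):
    """Convert a voxel index (i, j, k) to physical coordinates in millimetres."""
    return tuple(index * step for index, step in zip(voxel, spacing_mm))


def largest_dimension_mm(tumor_voxels, spacing_mm):
    """Largest centre-to-centre distance between any two tumor voxels.

    Brute force over all pairs, so only suitable for small masks.
    """
    points = [to_mm(v, spacing_mm) for v in tumor_voxels]
    if len(points) < 2:
        return 0.0
    return max(dist(a, b) for a, b in combinations(points, 2))


def min_distance_mm(tumor_voxels, structure_voxels, spacing_mm):
    """Smallest distance between the tumor mask and another structure's mask."""
    return min(
        dist(to_mm(t, spacing_mm), to_mm(s, spacing_mm))
        for t in tumor_voxels
        for s in structure_voxels
    )


def t_stage_by_size(size_mm):
    """Size-only T descriptor (assumed cut-offs: 30 / 50 / 70 mm)."""
    if size_mm <= 30:
        return "T1"
    if size_mm <= 50:
        return "T2"
    if size_mm <= 70:
        return "T3"
    return "T4"


# Toy example: two tumor voxels 40 mm apart on a 1 mm isotropic grid.
tumor = [(10, 10, 5), (10, 50, 5)]
size = largest_dimension_mm(tumor, (1.0, 1.0, 1.0))
print(size, t_stage_by_size(size))
```

Because every intermediate quantity is a physical measurement, each staging decision can be traced back to a number, which is the transparency argument the abstract makes against end-to-end classifiers.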

Growing with the Generator: Self-paced GRPO for Video Generation

Authors: Rui Li, Yuanzhi Liang, Ziqi Ni, Haibing Huang, Chi Zhang, Xuelong Li

First: 2025-11-24T17:56:03+00:00 · Latest: 2025-11-24T17:56:03+00:00

Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.

中文标题/摘要

标题：随着生成器成长：自适应GRPO的视频生成

组相对策略优化（GRPO）已成为用于后训练视频生成模型的强大强化学习范式。然而，现有的GRPO管道依赖于静态、固定容量的奖励模型，这些模型在训练过程中冻结了评估行为。这种刚性的奖励引入了分布偏差，随着生成器的改进迅速饱和，并最终限制了基于强化学习的对齐的稳定性和有效性。我们提出了自适应GRPO，这是一种能力感知的GRPO框架，其中奖励反馈与生成器共同进化。我们的方法引入了一种渐进的奖励机制，该机制随着生成质量的提高，自动从粗略的视觉保真度转移到时间连贯性以及细粒度的文本-视频语义对齐。这种自适应课程缓解了奖励-策略不匹配，减轻了奖励利用，并提供了更稳定的优化。在VBench上对多个视频生成骨干网络的实验表明，与具有静态奖励的GRPO基线相比，自适应GRPO在视觉质量和语义对齐方面均表现出一致的改进，验证了自适应GRPO的有效性和普适性。

Summary / 总结

The paper addresses the limitations of existing Group Relative Policy Optimization (GRPO) pipelines, which use static reward models that can introduce distributional bias and limit the generator's effectiveness. It introduces Self-Paced GRPO, a competence-aware framework that evolves the reward feedback alongside the generator. This method shifts the reward emphasis from visual fidelity to temporal coherence and semantic alignment as the generator improves, leading to more stable optimization and better performance. Experiments on VBench show consistent improvements in visual quality and semantic alignment over GRPO baselines with static rewards.

论文针对现有使用静态奖励模型的Group Relative Policy Optimization (GRPO)管道存在的问题，这些问题可能导致分布偏差并限制生成器的效果。提出了一个能力感知的Self-Paced GRPO框架，该框架随着生成器的改进逐步调整奖励反馈，从视觉保真度转向时间连贯性和语义对齐。实验结果表明，与静态奖励基线相比，在视觉质量和语义对齐方面都有持续的改进，验证了Self-Paced GRPO的有效性和通用性。
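
To make the idea of a reward that shifts emphasis as the generator improves more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's mechanism: the `competence` scalar in [0, 1], the linear weighting schedule, and all function names are hypothetical. The only part taken from GRPO itself is the group-relative advantage, where each sample's reward is standardized against the other samples drawn for the same prompt (using the population standard deviation here is also a choice made for this sketch).

```python
from statistics import mean, pstdev


def self_paced_weights(competence):
    """Hypothetical schedule: weight moves from fidelity to temporal/semantic terms."""
    c = min(max(competence, 0.0), 1.0)  # clamp to [0, 1]
    return {"fidelity": 1.0 - c, "temporal": c / 2.0, "semantic": c / 2.0}


def blended_reward(scores, competence):
    """Combine per-aspect scores with competence-dependent weights."""
    weights = self_paced_weights(competence)
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two candidate videos for one prompt, scored on three aspects.
group = [
    {"fidelity": 0.9, "temporal": 0.2, "semantic": 0.3},  # sharp but incoherent
    {"fidelity": 0.6, "temporal": 0.8, "semantic": 0.7},  # softer but coherent
]
for competence in (0.0, 1.0):
    rewards = [blended_reward(s, competence) for s in group]
    print(competence, group_relative_advantages(rewards))
```

With a weak generator (`competence = 0.0`) the sharper sample receives the positive advantage; once competence is high, the coherent and better-aligned sample does, which is the kind of shift in emphasis the abstract describes.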

Leveraging LLMs for reward function design in reinforcement learning control tasks

Authors: Franklin Cardenoso, Wouter Caarls

First: 2025-11-24T17:55:46+00:00 · Latest: 2025-11-24T17:55:46+00:00

Abstract

The challenge of designing effective reward functions in reinforcement learning (RL) represents a significant bottleneck, often requiring extensive human expertise and being time-consuming. Previous work and recent advancements in large language models (LLMs) have demonstrated their potential for automating the generation of reward functions. However, existing methodologies often require preliminary evaluation metrics, human-engineered feedback for the refinement process, or the use of environmental source code as context. To address these limitations, this paper introduces LEARN-Opt (LLM-based Evaluator and Analyzer for Reward functioN Optimization). This LLM-based, fully autonomous, and model-agnostic framework eliminates the need for preliminary metrics and environmental source code as context to generate, execute, and evaluate reward function candidates from textual descriptions of systems and task objectives. LEARN-Opt's main contribution lies in its ability to autonomously derive performance metrics directly from the system description and the task objective, enabling unsupervised evaluation and selection of reward functions. Our experiments indicate that LEARN-Opt achieves performance comparable to or better than that of state-of-the-art methods, such as EUREKA, while requiring less prior knowledge. We find that automated reward design is a high-variance problem, where the average-case candidate fails, requiring a multi-run approach to find the best candidates. Finally, we show that LEARN-Opt can unlock the potential of low-cost LLMs to find high-performing candidates that are comparable to, or even better than, those of larger models. This demonstrated performance affirms its potential to generate high-quality reward functions without requiring any preliminary human-defined metrics, thereby reducing engineering overhead and enhancing generalizability.

中文标题/摘要

标题：利用LLM在强化学习控制任务中设计奖励函数

在强化学习（RL）中设计有效的奖励函数是一项重大瓶颈，通常需要大量的专业知识和时间。先前的工作和最近大型语言模型（LLMs）的进步表明，它们有潜力自动化奖励函数的生成。然而，现有方法通常需要初步评估指标、人工工程反馈以进行细化过程，或使用环境源代码作为上下文。为了解决这些限制，本文介绍了LEARN-Opt（LLM基于的评估和分析奖励函数优化框架）。这个基于LLM的、完全自主且模型无关的框架消除了初步指标和环境源代码作为上下文的需要，从系统的文本描述和任务目标生成、执行和评估奖励函数候选方案。LEARN-Opt的主要贡献在于能够直接从系统描述和任务目标中自主推导出性能指标，实现无监督的评估和奖励函数的选择。我们的实验表明，LEARN-Opt在性能上与EUREKA等最先进的方法相当或更好，同时需要较少的先验知识。我们发现，自动化奖励设计是一个高方差问题，平均情况下的候选方案失败，需要多运行方法来找到最佳候选方案。最后，我们展示了LEARN-Opt可以利用低成本的LLM找到与更大模型相当甚至更好的高性能候选方案。这种表现证明了它在无需任何初步人工定义的指标的情况下生成高质量奖励函数的潜力，从而减少工程开销并增强泛化能力。

Summary / 总结

This paper addresses the challenge of designing effective reward functions in reinforcement learning by introducing LEARN-Opt, an LLM-based framework that autonomously generates, executes, and evaluates reward function candidates from textual descriptions. LEARN-Opt eliminates the need for preliminary metrics and environmental source code, and it can derive performance metrics directly from system descriptions and task objectives. Experiments show that LEARN-Opt achieves performance comparable to or better than state-of-the-art methods such as EUREKA while requiring less prior knowledge. Automated reward design proves to be a high-variance problem in which the average candidate fails, so a multi-run approach is needed to find the best candidates; with it, even low-cost LLMs can produce high-quality reward functions.

本文通过引入LEARN-Opt框架，解决强化学习中有效奖励函数设计的挑战。LEARN-Opt是一个基于LLM的框架，能够自主生成、执行和评估从系统描述和任务目标中提取的奖励函数候选方案，无需初步指标和环境源代码。实验表明，LEARN-Opt在性能上与EUREKA等最先进的方法相当或更优，需要较少的先验知识，并展示了低成本LLM找到高性能奖励函数的潜力。

Node Preservation and its Effect on Crossover in Cartesian Genetic Programming

Authors: Mark Kocherovsky, Illya Bakurov, Wolfgang Banzhaf

First: 2025-11-01T17:26:56+00:00 · Latest: 2025-11-24T17:55:01+00:00

Comments: Draft to cite in another paper before both papers are peer-reviewed for the evo*2026 conference, 21 pages, 5 figures

Abstract

While crossover is a critical and often indispensable component in other forms of Genetic Programming, such as Linear- and Tree-based, it has consistently been claimed that it deteriorates search performance in Cartesian Genetic Programming (CGP). As a result, a mutation-alone $(1+λ)$ evolutionary strategy has become the canonical approach for CGP. Although several operators have been developed that demonstrate an increased performance over the canonical method, a general solution to the problem is still lacking. In this paper, we compare basic crossover methods, namely one-point and uniform, to variants in which nodes are "preserved," including the subgraph crossover developed by Roman Kalkreuth, the difference being that when "node preservation" is active, crossover is not allowed to break apart instructions. We also compare a node mutation operator to the traditional point mutation; the former simply replaces an entire node with a new one. We find that node preservation in both mutation and crossover improves search using symbolic regression benchmark problems, moving the field towards a general solution to CGP crossover.

中文标题/摘要

标题：节点保存及其对笛卡尔遗传编程交叉操作的影响

虽然在其他形式的遗传编程中，如线性和树形，交叉是至关重要的且通常是必不可少的组成部分，但在笛卡尔遗传编程(CGP)中，交叉一直被认为会降低搜索性能。因此，仅使用变异的$(1+λ)$进化策略已成为CGP的经典方法。尽管已经开发出了一些表现出比经典方法更好的性能的操作符，但仍然缺乏解决该问题的一般方法。在本文中，我们将基本的交叉方法，即一点交叉和均匀交叉，与节点“保存”变体进行比较，包括由罗马·卡尔克鲁特开发的子图交叉，其区别在于当“节点保存”激活时，交叉不能打断指令。我们还将节点变异操作符与传统的点变异进行比较；前者只是用一个新的节点替换整个节点。我们发现，在变异和交叉中都进行节点保存可以提高使用符号回归基准问题的搜索性能，使该领域朝着CGP交叉的一般解决方案迈进。

Summary / 总结

This paper investigates the impact of node preservation on crossover in Cartesian Genetic Programming (CGP), comparing it to traditional crossover methods like one-point and uniform crossover. The study also contrasts a node mutation operator with point mutation. The key experimental findings show that node preservation enhances search performance in symbolic regression tasks, suggesting a potential general solution to improve CGP crossover performance.

本文研究了节点保存对Cartesian Genetic Programming (CGP) 中交叉操作的影响。它比较了基本的交叉方法如一点交叉和均匀交叉，以及节点保存变体，并且对比了节点突变操作与传统的点突变。研究使用符号回归基准问题来证明节点保存可以提高CGP的搜索性能，使该领域更接近于解决交叉问题的一般方案。
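
The distinction between basic and node-preserving crossover can be sketched on a flat integer genome. The layout below (three genes per node: a function gene followed by two input genes, with output genes omitted) and the function names are assumptions made for illustration; the abstract does not specify the encoding used in the paper. With `preserve_nodes=True`, cut points and swap units are restricted to node boundaries, so no instruction is split between parents.

```python
import random

NODE_SIZE = 3  # assumed node layout: [function, input_a, input_b]


def one_point_crossover(parent_a, parent_b, rng, preserve_nodes=False):
    """One-point crossover; returns both children and the cut index used."""
    assert len(parent_a) == len(parent_b)
    step = NODE_SIZE if preserve_nodes else 1
    # Candidate cut points: every gene boundary, or only node boundaries.
    cut = rng.choice(range(step, len(parent_a), step))
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2, cut


def uniform_crossover(parent_a, parent_b, rng, preserve_nodes=False):
    """Uniform crossover; each unit (gene or whole node) comes from one parent."""
    assert len(parent_a) == len(parent_b)
    step = NODE_SIZE if preserve_nodes else 1
    child = []
    for start in range(0, len(parent_a), step):
        source = parent_a if rng.random() < 0.5 else parent_b
        child.extend(source[start:start + step])
    return child


rng = random.Random(42)
a = [0, 0, 1, 1, 2, 0, 2, 3, 1]   # three nodes
b = [3, 1, 0, 2, 0, 1, 0, 4, 2]
print(one_point_crossover(a, b, rng, preserve_nodes=True))
print(uniform_crossover(a, b, rng, preserve_nodes=True))
```

In the non-preserving variants a child node can pair one parent's function gene with the other parent's input genes, which is the kind of broken instruction that node preservation rules out.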

CellFMCount: A Fluorescence Microscopy Dataset, Benchmark, and Methods for Cell Counting

Authors: Abdurahman Ali Mohammed, Catherine Fonder, Ying Wei, Wallapak Tavanapong, Donald S Sakaguchi, Qi Li, Surya K. Mallapragada

First: 2025-11-24T17:53:59+00:00 · Latest: 2025-11-24T17:53:59+00:00

Comments: The IEEE International Conference on Data Mining (ICDM) 2025

Abstract

Accurate cell counting is essential in various biomedical research and clinical applications, including cancer diagnosis, stem cell research, and immunology. Manual counting is labor-intensive and error-prone, motivating automation through deep learning techniques. However, training reliable deep learning models requires large amounts of high-quality annotated data, which is difficult and time-consuming to produce manually. Consequently, existing cell-counting datasets are often limited, frequently containing fewer than 500 images. In this work, we introduce a large-scale annotated dataset comprising 3,023 images from immunocytochemistry experiments related to cellular differentiation, containing over 430,000 manually annotated cell locations. The dataset presents significant challenges: high cell density, overlapping and morphologically diverse cells, a long-tailed distribution of cell count per image, and variation in staining protocols. We benchmark three categories of existing methods (regression-based, crowd-counting, and cell-counting techniques) on a test set with cell counts ranging from 10 to 2,126 cells per image. We also evaluate how the Segment Anything Model (SAM) can be adapted for microscopy cell counting using only dot-annotated datasets. As a case study, we implement a density-map-based adaptation of SAM (SAM-Counter) and report a mean absolute error (MAE) of 22.12, which outperforms existing approaches (second-best MAE of 27.46). Our results underscore the value of the dataset and the benchmarking framework for driving progress in automated cell counting and provide a robust foundation for future research and development.

中文标题/摘要

标题：CellFMCount：一种荧光显微镜数据集、基准和细胞计数方法

准确的细胞计数在各种生物医学研究和临床应用中至关重要，包括癌症诊断、干细胞研究和免疫学。手工计数劳动密集且容易出错，推动了通过深度学习技术实现自动化。然而，训练可靠的深度学习模型需要大量高质量的标注数据，这需要大量时间和精力。因此，现有的细胞计数数据集通常有限，经常包含少于500张图像。在本研究中，我们引入了一个大规模标注数据集，包含3,023张免疫细胞化学实验图像，涉及细胞分化，包含超过430,000个手动标注的细胞位置。该数据集提出了显著的挑战：高细胞密度、重叠且形态多样的细胞、每张图像细胞计数的长尾分布以及染色协议的差异。我们对包含每张图像10到2,126个细胞的测试集中的三种现有方法类别进行了基准测试：基于回归的方法、人群计数方法和细胞计数技术。我们还评估了如何仅使用点标注数据集适应Segment Anything Model (SAM)进行显微镜细胞计数。作为案例研究，我们实现了一种基于密度图的SAM适应版本（SAM-Counter），并报告了平均绝对误差（MAE）为22.12，优于现有方法（第二好的MAE为27.46）。我们的结果强调了数据集和基准测试框架的价值，推动了自动化细胞计数的进步，并为未来的研究和发展提供了坚实的基础。

Summary / 总结

The paper introduces CellFMCount, a large annotated dataset of 3,023 images with over 430,000 manually annotated cell locations, addressing the need for high-quality data in automated cell counting. The dataset includes challenges such as high cell density and morphological diversity. The authors benchmark existing methods and adapt the Segment Anything Model (SAM) for cell counting, achieving a mean absolute error of 22.12, compared with 27.46 for the second-best approach.

该研究介绍了CellFMCount，包含3,023张图像和超过430,000个手动标注的细胞位置，解决了高细胞密度、重叠细胞和不同染色协议带来的挑战。它对现有的基于回归、人群计数和细胞计数技术进行了基准测试，并评估了仅使用点标注数据集的Segment Anything Model (SAM)。研究结果显示，SAM-Counter的平均绝对误差（MAE）为22.12，优于现有方法的最佳MAE为27.46，突显了该数据集在推动自动化细胞计数研究方面的价值。
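
For readers unfamiliar with how density-map counters are scored, the following sketch shows the usual convention: the predicted count for an image is the sum over its density map, and MAE is the mean absolute difference between predicted and annotated counts across images. The function names and the toy numbers are hypothetical; nothing here reproduces SAM-Counter itself.

```python
def count_from_density(density_map):
    """Predicted cell count: the integral (sum) of a 2D density map."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and annotated counts."""
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("count lists must be non-empty and of equal length")
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)


# Toy density maps for two images; each dot annotation contributes mass 1 in total.
image_1 = [[0.25, 0.25], [0.5, 1.0]]   # sums to 2.0
image_2 = [[1.0, 1.0], [1.0, 0.5]]     # sums to 3.5
predicted = [count_from_density(image_1), count_from_density(image_2)]
print(mean_absolute_error(predicted, [2, 5]))
```

MAE is reported in cells per image, so the 22.12 in the abstract should be read against per-image counts that range from 10 to 2,126 in the test set.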

Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning

Authors: Qihan Huang, Haofei Zhang, Rong Wei, Yi Wang, Rui Tang, Mingli Song, Jie Song

First: 2025-11-24T17:42:29+00:00 · Latest: 2025-11-24T17:42:29+00:00

Abstract

Reinforcement learning (RL) methods (e.g., GRPO) for improving the perception ability of multimodal LLMs (MLLMs) have attracted wide research interest owing to their remarkable generalization ability. Nevertheless, existing reinforcement learning methods still face the problem of low data quality, where data samples cannot elicit diverse responses from MLLMs, thus restricting the exploration scope for MLLM reinforcement learning. Some methods attempt to mitigate this problem by imposing constraints on entropy, but none address it at its root. Therefore, to tackle this problem, this work proposes Syn-GRPO (Synthesis-GRPO), which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training. Specifically, Syn-GRPO consists of two components: (1) data server; (2) GRPO workflow. The data server synthesizes new samples from existing ones using an image generation model, featuring a decoupled and asynchronous scheme to achieve high generation efficiency. The GRPO workflow provides the data server with the new image descriptions, and it leverages a diversity reward to supervise the MLLM to predict image descriptions for synthesizing samples with diverse responses. Experiment results across three visual perception tasks demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significantly better performance than existing MLLM perception methods, and Syn-GRPO presents promising potential for scaling long-term self-evolving RL. Our code is available at https://github.com/hqhQAQ/Syn-GRPO.

中文标题/摘要

标题：Syn-GRPO：自演化数据合成以提升MLLM感知推理能力

由于强化学习方法（例如GRPO）在MLLM（多模态大模型）感知能力方面表现出显著的泛化能力，因此基于RL的方法引起了广泛的研究兴趣。然而，现有的强化学习方法仍然面临数据质量低的问题，即数据样本无法激发MLLM的多样响应，从而限制了MLLM强化学习的探索范围。一些方法试图通过施加熵约束来缓解这一问题，但没有从根本上解决问题。因此，为了解决这一问题，这项工作提出了Syn-GRPO（合成-GRPO），它使用在线数据生成器在GRPO训练中合成具有多样响应的高质量训练数据。具体而言，Syn-GRPO包括两个组件：（1）数据服务器；（2）GRPO工作流。数据服务器使用图像生成模型从现有样本中合成新的样本，采用解耦和异步方案以实现高效的生成。GRPO工作流向数据服务器提供新的图像描述，并利用多样性奖励来监督MLLM预测图像描述以合成具有多样响应的样本。在三个视觉感知任务上的实验结果表明，Syn-GRPO大幅提高了数据质量，实现了显著优于现有MLLM感知方法的性能，并展示了Syn-GRPO在长期自演化RL扩展方面的巨大潜力。我们的代码可在https://github.com/hqhQAQ/Syn-GRPO获取。

Summary / 总结

Syn-GRPO is designed to enhance the data quality for MLLM perception by synthesizing diverse training samples using an online data generator. It consists of a data server that generates new samples from existing ones and a GRPO workflow that uses a diversity reward to guide MLLM predictions. Experiments across three visual perception tasks show that Syn-GRPO significantly improves data quality and outperforms existing methods, indicating its potential for long-term self-evolving reinforcement learning.

该研究针对强化学习方法在MLLM感知中的低数据质量问题，提出了Syn-GRPO，通过在线数据生成器合成多样化的训练数据。Syn-GRPO 包含一个数据服务器，使用图像生成模型生成新样本，以及一个GRPO工作流提供多样性的监督。实验结果显示，Syn-GRPO 显著提高了数据质量，并优于现有方法，展示了长期自我进化的强化学习的潜力。

Understanding the Staged Dynamics of Transformers in Learning Latent Structure

Authors: Rohan Saha, Farzane Aminmansour, Alona Fyshe

First: 2025-11-24T17:20:42+00:00 · Latest: 2025-11-24T17:20:42+00:00

Comments: Preprint

Abstract

While transformers can discover latent structure from context, the dynamics of how they acquire different components of the latent structure remain poorly understood. In this work, we use the Alchemy benchmark to investigate the dynamics of latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing rules from partial contextual information, 2) composing simple rules to solve multi-step sequences, and 3) decomposing complex multi-step examples to infer intermediate steps. By factorizing each task into interpretable events, we show that the model acquires capabilities in discrete stages, first learning the coarse-grained rules, before learning the complete latent structure. We also identify a crucial asymmetry, where the model can compose fundamental rules robustly, but struggles to decompose complex examples to discover the fundamental rules. These findings offer new insights into understanding how a transformer model learns latent structures, providing a granular view of how these capabilities evolve during training.

中文标题/摘要

标题：理解Transformer在学习潜在结构中的分阶段动态

虽然Transformer可以从上下文中发现潜在结构，但它们如何习得潜在结构的不同组成部分，其动态过程仍然知之甚少。在本研究中，我们使用Alchemy基准测试来探讨潜在结构学习的动态。我们在三种任务变体上训练了一个小型仅解码器Transformer：1) 从部分上下文信息中推断缺失的规则，2) 组合简单的规则以解决多步序列问题，3) 分解复杂的多步示例以推断中间步骤。通过将每个任务分解为可解释的事件，我们表明模型在离散阶段获得能力，首先学习粗粒度规则，然后学习完整的潜在结构。我们还发现一个关键的不对称性，即模型可以稳健地组合基本规则，但在分解复杂示例以发现基本规则方面存在困难。这些发现为理解Transformer模型如何学习潜在结构提供了新的见解，并提供了这些能力在训练过程中如何演变的细粒度视图。

Summary / 总结

This work investigates how transformers learn latent structures by training a small decoder-only transformer on three tasks: inferring missing rules, composing simple rules, and decomposing complex examples. The model acquires capabilities in stages, first learning coarse-grained rules before mastering the complete latent structure. The study reveals that while the model can robustly compose fundamental rules, it struggles with decomposing complex examples to discover these rules, providing new insights into transformer learning dynamics.

研究通过在三个任务上（推断缺失规则、组合简单规则和分解复杂示例）训练一个小型仅解码器Transformer，来探讨Transformer如何学习潜在结构。模型的能力分阶段发展，首先学习粗粒度规则，然后是完整的潜在结构。研究揭示了一个不对称性，即模型可以稳健地组合基本规则，但在分解复杂示例以发现这些基本规则方面存在困难，为理解Transformer学习动态提供了新的见解。

MonoMSK: Monocular 3D Musculoskeletal Dynamics Estimation

Authors: Farnoosh Koleini, Hongfei Xue, Ahmed Helmy, Pu Wang

First: 2025-11-24T17:20:17+00:00 · Latest: 2025-11-24T17:20:17+00:00

Abstract

Reconstructing biomechanically realistic 3D human motion - recovering both kinematics (motion) and kinetics (forces) - is a critical challenge. While marker-based systems are lab-bound and slow, popular monocular methods use oversimplified, anatomically inaccurate models (e.g., SMPL) and ignore physics, fundamentally limiting their biomechanical fidelity. In this work, we introduce MonoMSK, a hybrid framework that bridges data-driven learning and physics-based simulation for biomechanically realistic 3D human motion estimation from monocular video. MonoMSK jointly recovers both kinematics (motions) and kinetics (forces and torques) through an anatomically accurate musculoskeletal model. By integrating transformer-based inverse dynamics with differentiable forward kinematics and dynamics layers governed by ODE-based simulation, MonoMSK establishes a physics-regulated inverse-forward loop that enforces biomechanical causality and physical plausibility. A novel forward-inverse consistency loss further aligns motion reconstruction with the underlying kinetic reasoning. Experiments on BML-MoVi, BEDLAM, and OpenCap show that MonoMSK significantly outperforms state-of-the-art methods in kinematic accuracy, while for the first time enabling precise monocular kinetics estimation.

中文标题/摘要

标题：MonoMSK：单目3D肌肉骨骼动力学估计

重建符合生物力学真实性的3D人体运动，即同时恢复运动学（运动）和动力学（力），是一个关键挑战。虽然基于标记的系统局限于实验室且速度较慢，流行的单目方法使用过于简化的、解剖学不准确的模型（例如SMPL），并且忽略了物理规律，从根本上限制了其生物力学的准确性。在本工作中，我们引入了MonoMSK，这是一种将数据驱动学习与基于物理的模拟相结合的混合框架，用于从单目视频中估计符合生物力学真实性的3D人体运动。MonoMSK通过一个解剖学准确的肌肉骨骼模型联合恢复了运动学（运动）和动力学（力和扭矩）。通过将基于Transformer的逆动力学与由基于ODE的模拟控制的可微前向运动学和动力学层相结合，MonoMSK建立了受物理调控的逆-前向循环，确保了生物力学因果性和物理合理性。一种新颖的前向-逆向一致性损失进一步使运动重建与潜在的动力学推理对齐。在BML-MoVi、BEDLAM和OpenCap上的实验表明，MonoMSK在运动学准确性上显著优于现有方法，同时首次实现了精确的单目动力学估计。

Summary / 总结

MonoMSK is a hybrid framework that combines data-driven learning and physics-based simulation to estimate 3D human motion and forces from monocular video. It uses an anatomically accurate musculoskeletal model and an inverse-forward loop to enforce biomechanical causality. Experiments show MonoMSK outperforms existing methods in kinematic accuracy and enables precise monocular kinetics estimation for the first time.

MonoMSK 是一种结合数据驱动学习和物理仿真框架，用于从单目视频估计 3D 人体运动，同时恢复运动和力学特性。它使用了解剖学上准确的肌肉骨骼模型，并通过基于 ODE 的仿真调节逆向-正向循环来确保生物力学因果关系。实验表明，MonoMSK 在运动精度方面优于现有方法，并首次实现了精确的单目力学估计。

Generative Query Expansion with Multilingual LLMs for Cross-Lingual Information Retrieval

Authors: Olivia Macmillan-Scott, Roksana Goworek, Eda B. Özyiğit

First: 2025-11-24T17:18:25+00:00 · Latest: 2025-11-24T17:18:25+00:00

Abstract

Query expansion is the reformulation of a user query by adding semantically related information, and is an essential component of monolingual and cross-lingual information retrieval used to ensure that relevant documents are not missed. Recently, multilingual large language models (mLLMs) have shifted query expansion from semantic augmentation with synonyms and related words to pseudo-document generation. Pseudo-documents both introduce additional relevant terms and bridge the gap between short queries and long documents, which is particularly beneficial in dense retrieval. This study evaluates recent mLLMs and fine-tuned variants across several generative expansion strategies to identify factors that drive cross-lingual retrieval performance. Results show that query length largely determines which prompting technique is effective, and that more elaborate prompts often do not yield further gains. Substantial linguistic disparities persist: cross-lingual query expansion can produce the largest improvements for languages with the weakest baselines, yet retrieval is especially poor between languages written in different scripts. Fine-tuning is found to lead to performance gains only when the training and test data are of similar format. These outcomes underline the need for more balanced multilingual and cross-lingual training and evaluation resources.

中文标题/摘要

标题：使用多语言大语言模型进行生成式查询扩展以实现跨语言信息检索

查询扩展是通过添加语义相关的信息来重新表述用户查询的过程，是单语言和跨语言信息检索中的一个必不可少的组成部分，用于确保不会遗漏相关文档。最近，多语言大型语言模型（mLLMs）已经将查询扩展从同义词和相关词的语义增强转变为伪文档生成。伪文档不仅引入了额外的相关术语，还缩小了短查询与长文档之间的差距，这在密集检索中尤其有益。本研究评估了多种生成扩展策略下的近期mLLMs及其微调变体，以确定影响跨语言检索性能的因素。结果显示，查询长度很大程度上决定了哪种提示技术有效，而更复杂的提示通常不会带来进一步的收益。语言差异仍然显著：跨语言查询扩展对基础最弱的语言产生的改进最大，但不同书写系统的语言之间的检索效果尤其差。研究发现，只有当训练和测试数据格式相似时，微调才会带来性能提升。这些结果强调了需要更多平衡的多语言和跨语言训练及评估资源的必要性。

Summary / 总结

This study investigates the use of multilingual large language models (mLLMs) for generative query expansion in cross-lingual information retrieval. It evaluates various prompting techniques and fine-tuning methods across different languages and scripts. The research finds that query length is a critical factor in determining the effectiveness of prompting techniques, with more elaborate prompts not always providing additional benefits. The study also highlights significant linguistic disparities, where query expansion is most effective for languages with weaker baselines but retrieval performance is particularly poor between languages using different scripts. Fine-tuning only improves performance when training and test data formats are similar.

该研究探讨了使用多语言大型语言模型（mLLMs）在跨语言信息检索中的查询扩展应用，重点关注生成性扩展策略。研究评估了各种提示技术，发现查询长度是确定这些技术有效性的关键因素。虽然微调可以提高性能，但只有当训练和测试数据具有相似格式时才有效。研究强调了为不同语言和书写系统提供更平衡的多语言和跨语言资源的重要性，以实现更好的性能。

What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models

Authors: Roksana Goworek, Olivia Macmillan-Scott, Eda B. Özyiğit

First: 2025-11-24T17:17:40+00:00 · Latest: 2025-11-24T17:17:40+00:00

Abstract

Cross-lingual information retrieval (CLIR) enables access to multilingual knowledge but remains challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models. Existing pipelines often rely on translation and monolingual retrieval heuristics, which add computational overhead and noise, degrading performance. This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets. We find that dense retrieval models trained specifically for CLIR consistently outperform lexical matching methods and derive little benefit from document translation. Contrastive learning mitigates language biases and yields substantial improvements for encoders with weak initial alignment, and re-ranking can be effective, but depends on the quality of the cross-encoder training data. Although high-resource languages still dominate overall performance, gains over lexical and document-translated baselines are most pronounced for low-resource and cross-script pairs. These findings indicate that cross-lingual search systems should prioritise semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, particularly for cross-script and under-resourced languages.

中文标题/摘要

标题：跨语言排名的动力：基于多语言语言模型的检索方法

跨语言信息检索（CLIR）能够访问多语言知识，但由于资源和书写系统的差异以及嵌入模型中跨语言语义对齐较弱，仍具有挑战性。现有流水线通常依赖于翻译和单语言检索启发式方法，这增加了计算开销和噪声，降低了性能。本研究在三个基准数据集上系统地评估了四种干预类型，即文档翻译、基于预训练编码器的多语言密集检索、在词、短语和查询-文档级别进行的对比学习，以及交叉编码器重排序。我们发现，专门针对CLIR训练的密集检索模型始终优于词法匹配方法，并且从文档翻译中获益甚微。对比学习减轻了语言偏见，并显著提高了初始对齐较弱的编码器的性能，而重排序可以有效，但取决于交叉编码器训练数据的质量。尽管高资源语言仍然主导整体性能，但相对于词法和文档翻译基线的改进在低资源和跨书写系统语言对上最为显著。这些发现表明，跨语言搜索系统应优先考虑语义多语言嵌入和有针对性的基于学习的对齐，而不是基于翻译的流水线，特别是对于跨书写系统和资源不足的语言。

Summary / 总结

This study evaluates different approaches to cross-lingual information retrieval, including document translation, multilingual dense retrieval, contrastive learning, and cross-encoder re-ranking, across three benchmark datasets. The research finds that dense retrieval models outperform lexical matching methods and that contrastive learning improves performance for encoders with weak initial alignment. Re-ranking can be effective but depends on the quality of the cross-encoder training data. The study highlights the importance of semantic multilingual embeddings and targeted learning-based alignment over translation-based pipelines, especially for low-resource and cross-script languages.

研究评估了文档翻译、多语言密集检索、对比学习和交叉编码器重排序在三个基准数据集上的效果。研究发现，密集检索模型优于词汇匹配方法和文档翻译，对比学习可以改善初始对齐较弱的编码器，而重排序可以提升结果但依赖于训练数据的质量。研究强调了语义多语言嵌入和基于学习的对齐方法的重要性，尤其是在低资源和跨书写系统语言对方面。
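
As an illustration of what contrastive learning at the query-document level typically looks like, here is a small in-batch contrastive (InfoNCE-style) loss. This is an assumption about the general technique, not a description of the paper's training objective: the abstract does not name the loss, the temperature value is arbitrary, and the function names are hypothetical. Document `i` is treated as the positive for query `i`, and all other documents in the batch act as negatives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean in-batch contrastive loss; doc i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [dot(query, doc) / temperature for doc in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Queries in one language, documents in another, embedded in a shared space.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
misaligned_docs = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(queries, aligned_docs), info_nce(queries, misaligned_docs))
```

The loss is near zero when each query sits closest to its own translation-equivalent document and large when the pairing is wrong, which is how such an objective pulls weakly aligned languages together in the embedding space.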

Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them

Authors: Marc Brinner, Tarek Al Mustafa, Sina Zarrieß

Venue: EMNLP 2025

First: 2025-03-27T21:51:24+00:00 · Latest: 2025-11-24T17:17:31+00:00

Comments: Published in the Findings of the Association for Computational Linguistics: EMNLP 2025

Abstract

We investigate the use of LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data, using the scientific domain of invasion biology as a case study. To this end, we leverage domain-specific ontologies by enriching them with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model for concept definitions. To evaluate the effectiveness of this method, we compile a benchmark specifically designed for assessing model performance in invasion biology. After demonstrating substantial improvements over standard LLM pretraining, we investigate the feasibility of applying the proposed approach to domains without comprehensive ontologies by substituting ontological concepts with concepts automatically extracted from a small corpus of scientific abstracts and establishing relationships between concepts through distributional statistics. Our results demonstrate that this automated approach achieves comparable performance using only a small set of scientific abstracts, resulting in a fully automated pipeline for enhancing domain-specific understanding of small encoder models that is especially suited for application in low-resource settings and achieves performance comparable to masked language modeling pretraining on much larger datasets.

中文标题/摘要

标题：利用LLM生成数据增强领域特定编码模型：如何利用领域本体以及如何无需它们

我们研究了使用LLM生成的数据对具有有限训练数据的专业领域中的编码模型进行持续预训练的方法，以入侵生物学领域为例。为此，我们利用领域特定本体，通过添加LLM生成的数据来丰富本体，并将编码模型预训练为基于本体的概念定义嵌入模型。为了评估该方法的有效性，我们编制了一个专门用于评估入侵生物学中模型性能的基准。在证明了该方法在标准LLM预训练上的显著改进后，我们研究了将该方法应用于缺乏全面本体的领域的方法，通过用从少量科学摘要中自动提取的概念替换本体概念，并通过分布统计建立概念之间的关系。我们的结果表明，仅使用少量科学摘要即可实现自动化方法，从而实现一个完全自动化的管道，用于增强领域特定的小型编码模型理解，特别适用于低资源环境，并且性能可与在更大数据集上进行掩码语言模型预训练相媲美。

Summary / 总结

This study explores the use of LLM-generated data for continual pretraining of encoder models in the domain of invasion biology, which has limited training data. By enriching domain-specific ontologies with LLM-generated data and pretraining the encoder model as an ontology-informed embedding model, the study shows significant improvements over standard LLM pretraining. The research also demonstrates that an automated approach using concepts extracted from a small set of scientific abstracts can achieve comparable performance, making the method suitable for low-resource settings without comprehensive ontologies.

该研究探讨了使用LLM生成的数据对入侵生物学这一数据有限的领域进行持续预训练的方法。通过使用领域特定的本体论并结合LLM生成的数据进行预训练，该研究展示了与标准LLM预训练相比的巨大改进。研究还表明，使用从科学摘要中自动提取的概念并利用分布统计来建立概念之间的关系，可以实现与大规模数据集掩码语言模型预训练相当的性能，特别适用于资源有限的环境。
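
A minimal sketch of how relationships between automatically extracted concepts could be established through distributional statistics. It assumes pointwise mutual information (PMI) over abstract-level co-occurrence as the statistic; the abstract does not name the statistic the authors use. The toy abstracts, the concept list, and the function name `concept_pmi` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def concept_pmi(abstracts, concepts):
    """Score concept pairs by PMI of their co-occurrence within abstracts.

    Assumption: a concept "occurs" in an abstract if its surface form
    appears as a substring. Real concept extraction would be more careful.
    """
    n_docs = len(abstracts)
    doc_freq = Counter()   # number of abstracts mentioning each concept
    pair_freq = Counter()  # number of abstracts mentioning both concepts
    for text in abstracts:
        lowered = text.lower()
        present = sorted(c for c in concepts if c in lowered)
        doc_freq.update(present)
        pair_freq.update(combinations(present, 2))

    # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with document-level counts.
    return {
        (a, b): math.log(n_ab * n_docs / (doc_freq[a] * doc_freq[b]))
        for (a, b), n_ab in pair_freq.items()
    }


# Hypothetical toy corpus, for illustration only.
abstracts = [
    "Invasive species reduce native biodiversity through competition.",
    "Propagule pressure predicts establishment of invasive species.",
    "Native biodiversity declines where propagule pressure is high.",
    "Competition with invasive species lowers native biodiversity.",
]
concepts = [
    "invasive species",
    "native biodiversity",
    "propagule pressure",
    "competition",
]

scores = concept_pmi(abstracts, concepts)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Pairs with positive PMI co-occur more often than chance and could serve as related-concept pairs for embedding pretraining. With a corpus this small the estimates are noisy, so a practical version would likely need smoothing or a minimum-count threshold.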

When do World Models Successfully Learn Dynamical Systems?

Authors: Edmund Ross, Claudia Drygala, Leonhard Schwarz, Samir Kaiser, Francesca di Mare, Tobias Breiten, Hanno Gottschalk

First: 2025-07-07T11:29:18+00:00 · Latest: 2025-11-24T17:16:42+00:00

Abstract

In this work, we explore the use of compact latent representations with learned time dynamics ('World Models') to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low-dimensional space and then concatenating to form a history ('Tokenization') is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least-squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full-scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto-Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.

中文标题/摘要

标题：何时世界模型成功学习动力系统？

在本研究中，我们探讨了使用紧凑的潜在表示（“世界模型”）和学习时间动力学来模拟物理系统的方法。借鉴控制理论的概念，我们提出了一种理论框架，解释了将时间切片投影到低维空间并连接起来形成历史（“标记化”）为何如此有效地学习物理数据集，并且描述了在什么情况下底层动力学允许从先前标记化帧的历史重建到下一个帧的映射。为了验证这些主张，我们开发了一系列从简单最小二乘回归到简单的线性层、浅层对抗学习者，最终到完整的生成对抗网络（GAN）的模型。我们在包括修改后的热传导方程和波动方程、混沌区域二维库拉莫托-西瓦什金斯基方程以及一个具有挑战性的二维卡门涡街计算流体力学（CFD）数据集上评估了这些模型，其中我们的模型成功地再现了流场。
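
A minimal sketch of the simplest rung of the model ladder described in the abstract: project each time slice into a low-dimensional space, concatenate a short history of tokens, and fit the next token by least-squares regression. It assumes an SVD (POD) basis for the projection and a standing-wave solution of the 1D wave equation as data; neither choice is taken from the paper. The function names are hypothetical.

```python
import numpy as np


def tokenize(snapshots, rank):
    """Project each time slice (one row) onto the top `rank` SVD modes."""
    _, _, vt = np.linalg.svd(snapshots, full_matrices=False)
    basis = vt[:rank].T                      # shape (n_space, rank)
    return snapshots @ basis, basis          # tokens: (n_time, rank)


def fit_history_map(tokens, history):
    """Least-squares linear map from `history` stacked tokens to the next."""
    n_time = tokens.shape[0]
    # Row t of X is [tokens[t], ..., tokens[t + history - 1]].
    X = np.hstack([tokens[i:n_time - history + i] for i in range(history)])
    Y = tokens[history:]                     # row t is tokens[t + history]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def rollout(seed_tokens, W, steps):
    """Autoregressively predict `steps` tokens from the seed history."""
    history = len(seed_tokens)
    window = [tok for tok in seed_tokens]
    for _ in range(steps):
        stacked = np.concatenate(window[-history:])
        window.append(stacked @ W)
    return np.array(window[history:])


# Toy data: u(x, t) = sin(x)cos(t) + 0.5 sin(2x)cos(2t) solves u_tt = u_xx.
x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
t = 0.1 * np.arange(200)
snapshots = (np.sin(x)[None, :] * np.cos(t)[:, None]
             + 0.5 * np.sin(2 * x)[None, :] * np.cos(2 * t)[:, None])

tokens, basis = tokenize(snapshots, rank=2)
W = fit_history_map(tokens[:150], history=2)          # train on 150 frames
predicted = rollout(tokens[148:150], W, steps=50)     # forecast the rest

max_token_error = float(np.max(np.abs(predicted - tokens[150:])))
max_frame_error = float(np.max(np.abs(predicted @ basis.T - snapshots[150:])))
print(max_token_error, max_frame_error)
```

A single frame of this wave does not carry the velocity, so a history of one token would not determine the next; two stacked tokens do. This is a toy instance of the question the abstract raises about when the history of tokenized frames admits a mapping to the next frame.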

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

Authors: Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma

First: 2025-11-24T17:15:55+00:00 · Latest: 2025-11-24T17:15:55+00:00

Comments: 10 pages, with supp

Abstract

Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.

中文标题/摘要

标题：SteadyDancer: 保留首帧身份的一致且协调的人像动画

在人像动画中，保留首帧身份并确保精确的运动控制是一个基本挑战。主导的参考到视频（R2V）范式的图像到运动绑定过程忽略了现实世界应用中常见的时空错位，导致身份漂移和视觉伪影等问题。我们提出了SteadyDancer框架，这是一个基于图像到视频（I2V）范式的框架，实现了协调且一致的动画，并且是第一个能够稳健地保留首帧身份的方法。首先，我们提出了一种条件协调机制，以协调两种冲突条件，从而实现精确控制而不牺牲保真度。其次，我们设计了协同姿态调制模块，生成一种高度兼容参考图像的适应性和一致性的姿态表示。最后，我们采用了一种分阶段解耦目标训练管道，逐级优化模型以提高运动保真度、视觉质量和时间一致性。实验表明，SteadyDancer在外观保真度和运动控制方面达到了最先进的性能，同时所需训练资源远少于同类方法。

Summary / 总结

SteadyDancer addresses the challenge of preserving first-frame identity and ensuring precise motion control in human image animation. It introduces a Condition-Reconciliation Mechanism and Synergistic Pose Modulation Modules to harmonize conflicting conditions and generate a compatible pose representation. The framework also uses a Staged Decoupled-Objective Training Pipeline for optimization. Experiments show SteadyDancer outperforms existing methods in appearance fidelity and motion control with fewer training resources.

SteadyDancer 解决了在人体图像动画中保持第一帧身份和精确运动控制的挑战。它引入了条件协调机制和协同姿态调制模块来协调冲突条件并生成兼容的姿态表示。该框架还使用分阶段解耦目标训练管道来优化运动保真度、视觉质量和时间连贯性。实验表明，SteadyDancer 在外观保真度和运动控制方面优于现有方法，并且所需的训练资源更少。

SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

Authors: Lingwei Dang, Zonghan Li, Juntong Li, Hongwen Zhang, Liang An, Yebin Liu, Qingyao Wu

First: 2025-11-24T17:14:19+00:00 · Latest: 2025-11-24T17:14:19+00:00

Comments: Project Page: https://droliven.github.io/SyncMV4D

Abstract

Hand-Object Interaction (HOI) generation plays a critical role in advancing applications across animation and robotics. Current video-based methods are predominantly single-view, which impedes comprehensive 3D geometry perception and often results in geometric distortions or unrealistic motion patterns. While 3D HOI approaches can generate dynamically plausible motions, their dependence on high-quality 3D data captured in controlled laboratory settings severely limits their generalization to real-world scenarios. To overcome these limitations, we introduce SyncMV4D, the first model that jointly generates synchronized multi-view HOI videos and 4D motions by unifying visual prior, motion dynamics, and multi-view geometry. Our framework features two core innovations: (1) a Multi-view Joint Diffusion (MJD) model that co-generates HOI videos and intermediate motions, and (2) a Diffusion Points Aligner (DPA) that refines the coarse intermediate motion into globally aligned 4D metric point tracks. To tightly couple 2D appearance with 4D dynamics, we establish a closed-loop, mutually enhancing cycle. During the diffusion denoising process, the generated video conditions the refinement of the 4D motion, while the aligned 4D point tracks are reprojected to guide next-step joint generation. Experimentally, our method demonstrates superior performance to state-of-the-art alternatives in visual realism, motion plausibility, and multi-view consistency.

中文标题/摘要

标题：SyncMV4D：同步多视角联合扩散的手物交互合成

手物交互（HOI）生成在动画和机器人技术的应用中起着关键作用。当前基于视频的方法主要为单视角，这妨碍了对全面3D几何结构的感知，并经常导致几何失真或不现实的运动模式。虽然3D HOI方法可以生成动态合理的运动，但它们依赖于在受控实验室环境中捕获的高质量3D数据，这严重限制了它们在现实世界场景中的泛化能力。为克服这些限制，我们提出了SyncMV4D，这是第一个通过统一视觉先验、运动动力学和多视角几何结构来联合生成同步多视角HOI视频和4D运动的模型。我们的框架包含两个核心创新：（1）多视角联合扩散（MJD）模型，该模型联合生成HOI视频和中间运动；（2）扩散点对齐器（DPA），该模块将粗略的中间运动细化为全局对齐的4D度量点轨迹。为了紧密耦合2D外观与4D动力学，我们建立了闭环、相互增强的循环。在扩散去噪过程中，生成的视频作为4D运动细化的条件，而对齐的4D点轨迹被重新投影以指导下一步的联合生成。实验表明，我们的方法在视觉真实感、运动合理性以及多视角一致性方面优于最先进的替代方法。

Summary / 总结

SyncMV4D addresses the limitations of single-view and 3D HOI methods by introducing a model that jointly generates synchronized multi-view HOI videos and 4D motions. It uses a Multi-view Joint Diffusion (MJD) model to co-generate HOI videos and intermediate motions, and a Diffusion Points Aligner (DPA) to refine these into globally aligned 4D metric point tracks. Experiments show that SyncMV4D outperforms existing methods in visual realism, motion plausibility, and multi-view consistency.

SyncMV4D通过引入一个能够联合生成同步多视角手物交互（HOI）视频和4D运动的模型来解决单视角方法在手物交互生成中的局限性。该模型使用Multi-view Joint Diffusion (MJD) 模型和Diffusion Points Aligner (DPA) 共同生成HOI视频并细化中间运动为全局对齐的4D点轨迹。实验结果表明，SyncMV4D在视觉真实感、运动合理性以及多视角一致性方面优于现有方法。

The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet

Authors: Brennen A. Hill, Zhang Xinyu, Timothy Putra Prasetio

Venue: NeurIPS 2025

First: 2025-08-05T01:52:42+00:00 · Latest: 2025-11-24T17:11:32+00:00

Comments: Published in the proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Symmetry and Geometry in Neural Representations (NeurReps). Additionally accepted for presentation in NeurIPS 2025 Workshop: Interpreting Cognition in Deep Learning Models (CogInterp)

Abstract

Despite their success, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. These shortcomings can be traced to a lack of inductive biases that reflect the inherent geometric structure of the visual world. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural and computational principles, which evolved to internalize these structures, may offer a blueprint for more capable artificial vision. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet is framed as a geometric framework that emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation for learning disentangled representations, and top-down predictive feedback for representation refinement. We interpret these mechanisms through the lens of geometry and dynamical systems, positing that they guide the learning of structured, low-dimensional neural manifolds. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset, which probes sensitivity to natural textures, and a light field image classification task, which requires processing higher-dimensional visual data. Our results show that VCNet achieves state-of-the-art accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating high-level neuroscientific principles, viewed through a geometric lens, can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.

中文标题/摘要

标题：皮层计算的几何学：VCNet中的流形解缠与预测动力学

尽管现代卷积神经网络（CNN）取得了成功，但它们仍然存在根本性的局限性，包括数据效率低下、不良的分布外泛化以及对抗性扰动的脆弱性。这些不足可以追溯到缺乏反映视觉世界内在几何结构的归纳偏置。相比之下，灵长类视觉系统表现出更高的效率和鲁棒性，表明其架构和计算原理，这些原理进化以内部化这些结构，可能为更强大的人工视觉提供蓝图。本文介绍了视觉皮层网络（VCNet），这是一种新型神经网络架构，其设计受到灵长类视觉皮层宏观组织的启发。VCNet被构想为一个几何框架，模拟了关键的生物机制，包括跨不同皮层区域的分层处理、用于学习解缠表示的双流信息分离以及自上而下的预测反馈以改进表示。我们通过几何学和动力系统的眼光来解释这些机制，认为它们指导了结构化、低维神经流形的学习。我们使用两个专门的基准测试评估了VCNet：Spots-10动物模式数据集，该数据集测试了对自然纹理的敏感性，以及一个光场图像分类任务，该任务需要处理高维视觉数据。我们的结果表明，VCNet在Spots-10上的准确率为92.1%，在光场数据集上的准确率为74.4%，超过了同等规模的当代模型。这项工作表明，通过几何学视角整合高级神经科学原理可以导致更高效和更鲁棒的模型，为解决机器学习中的长期挑战提供了有希望的方向。

Summary / 总结

This paper addresses the limitations of modern CNNs by introducing VCNet, a novel architecture inspired by the primate visual cortex. VCNet incorporates hierarchical processing, dual-stream information segregation, and top-down predictive feedback, which are interpreted through a geometric and dynamical systems lens. On two specialized benchmarks, VCNet outperforms contemporary models of comparable size, achieving 92.1% accuracy on Spots-10 and 74.4% on the light field classification task, which suggests that neuroscience-informed design can yield more efficient and robust models.

本文提出了Visual Cortex Network (VCNet)，一种受灵长类视觉皮层启发的神经网络架构，旨在解决现代CNN的数据效率低下和泛化能力差等问题。VCNet 使用几何框架来模拟分层处理、去纠缠表示学习和预测反馈。在专门的基准测试中，VCNet 在Spots-10 数据集上达到了92.1% 的最佳准确率，在光线场数据集上达到了74.4% 的准确率，超过了同等规模的当代模型。

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Authors: Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

First: 2025-11-24T17:11:00+00:00 · Latest: 2025-11-24T17:11:00+00:00

Abstract

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

中文标题/摘要

标题：评估数据水印在微调定制扩散模型可追溯性中的应用：全面的基准测试与去除方法

最近的扩散模型微调技术能够生成特定图像集，如特定面孔或艺术风格，但也带来了版权和安全风险。数据水印已被提出通过在训练图像中嵌入不可感知的水印来确保可追溯性，即使在微调后这些水印仍然可以在输出中被检测到。然而，当前的方法缺乏统一的评估框架。为了解决这一问题，本文建立了一个通用的威胁模型，并引入了一个全面的评估框架，涵盖通用性、可传递性和鲁棒性。实验表明，现有方法在通用性和可传递性方面表现良好，并且在对抗常见的图像处理操作时表现出一定的鲁棒性，但在现实世界的威胁场景下仍然存在不足。为了揭示这些漏洞，本文进一步提出了一种实用的数据水印去除方法，该方法可以在不影响微调的情况下完全消除数据水印，突显了未来研究中的一个关键挑战。

Summary / 总结

This paper evaluates dataset watermarking techniques for ensuring traceability in fine-tuned diffusion models, which can reproduce specific image sets but pose copyright and security risks. It establishes a general threat model and introduces a comprehensive evaluation framework covering universality, transmissibility, and robustness. The experiments show that existing methods perform well in universality and transmissibility and withstand common image processing operations to some degree, but fall short under real-world threat scenarios. The paper also proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

该论文评估了用于保证微调扩散模型可追溯性的数据集水印技术；这类模型能够复现特定图像集，但也带来版权和安全风险。论文引入了一个综合评估框架，重点关注通用性、可传递性和鲁棒性。实验表明，现有方法在通用性和可传递性方面表现良好，但在应对真实世界威胁方面仍存在不足。论文还提出了一种实用的水印去除方法，可在不影响微调的情况下完全消除水印，揭示了这些技术当前的局限性。