arXiv Paper Digest

Snapshot: 20260327_0407

DreamerAD: Efficient Reinforcement Learning via Latent World Model for Autonomous Driving

Authors: Pengxuan Yang, Yupeng Zheng, Deheng Qian, Zebin Xing, Qichao Zhang, Linbo Wang, Yichen Zhang, Shaoyu Guo, Zhongpu Xia, Qiang Chen, Junyu Han, Lingyun Xu, Yifeng Pan, Dongbin Zhao

First: 2026-03-25T17:57:09+00:00 · Latest: 2026-03-25T17:57:09+00:00

Comments: first version


Abstract

We introduce DreamerAD, the first latent world model framework that enables efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while maintaining visual interpretability. Training RL policies on real-world driving data incurs prohibitive costs and safety risks. While existing pixel-level diffusion world models enable safe imagination-based training, they suffer from multi-step diffusion inference latency (2 s/frame) that prevents high-frequency RL interaction. Our approach leverages denoised latent features from video generation models through three key mechanisms: (1) shortcut forcing that reduces sampling complexity via recursive multi-resolution step compression, (2) an autoregressive dense reward model operating directly on latent representations for fine-grained credit assignment, and (3) Gaussian vocabulary sampling for GRPO that constrains exploration to physically plausible trajectories. DreamerAD achieves 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and demonstrating that latent-space RL is effective for autonomous driving.

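Title/Abstract (translated from Chinese)

Title: DreamerAD: Efficient Reinforcement Learning for Autonomous Driving via a Latent World Model

We present DreamerAD, the first latent world model framework to enable efficient reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, delivering an 80x speedup while preserving visual interpretability. Training RL policies on real-world driving data brings high costs and safety risks. Existing pixel-level diffusion world models allow safe imagination-based training, but their multi-step diffusion inference latency (2 s per frame) prevents high-frequency RL interaction. Our method exploits the denoised latent features of video generation models through three key mechanisms: (1) shortcut forcing, which lowers sampling complexity through recursive multi-resolution step compression; (2) an autoregressive dense reward model that operates directly on latent representations for fine-grained credit assignment; and (3) Gaussian vocabulary sampling for GRPO, which restricts exploration to physically plausible trajectories. DreamerAD reaches 87.7 EPDMS on NavSim v2, establishing state-of-the-art performance and showing that latent-space RL is effective for autonomous driving.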

Summary

DreamerAD is a latent world model framework that accelerates reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, achieving an 80x speedup while maintaining visual interpretability. It uses shortcut forcing, an autoregressive dense reward model, and Gaussian vocabulary sampling to reduce sampling complexity, enable fine-grained credit assignment, and constrain exploration to physically plausible trajectories, respectively. The approach achieves 87.7 EPDMS on NavSim v2, which the authors report as state-of-the-art performance.

DreamerAD is a latent world model framework that speeds up reinforcement learning for autonomous driving by compressing diffusion sampling from 100 steps to 1, an 80x acceleration that keeps visual interpretability. The method relies on shortcut forcing, an autoregressive dense reward model, and Gaussian vocabulary sampling to lower sampling complexity, provide fine-grained credit assignment, and confine exploration to physically feasible trajectories. It reaches 87.7 EPDMS on NavSim v2, setting the latest performance mark for autonomous driving on that benchmark.
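Two parts of the DreamerAD entry above can be illustrated with a few lines of code: the latency arithmetic (2 s per frame with an 80x speedup is about 25 ms per frame) and the group-relative advantage that GRPO is built on. The sketch below is a toy under stated assumptions, not the authors' implementation: `gaussian_vocab_sample`, the two-number trajectory encoding, `sigma`, and the reward are hypothetical, and the abstract does not define how Gaussian vocabulary sampling actually works.

```python
import math
import random

def gaussian_vocab_sample(vocab, center, sigma, k, rng):
    """Draw k indices from a fixed trajectory vocabulary.

    Hypothetical reading of "Gaussian vocabulary sampling": each entry is
    weighted by a Gaussian of its distance to the policy's predicted
    trajectory `center`, so far-away candidates are rarely explored.
    """
    weights = []
    for traj in vocab:
        d2 = sum((a - b) ** 2 for a, b in zip(traj, center))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return rng.choices(range(len(vocab)), weights=weights, k=k)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Latency implied by the abstract: 2 s/frame divided by the 80x speedup.
seconds_per_frame = 2.0 / 80

# Toy vocabulary: (lateral offset, speed) pairs standing in for trajectories.
vocab = [(-1.0, 5.0), (0.0, 5.0), (1.0, 5.0), (0.0, 8.0)]
rng = random.Random(0)
picked = gaussian_vocab_sample(vocab, center=(0.0, 5.0), sigma=1.0, k=4, rng=rng)
rewards = [1.0 - abs(vocab[i][0]) for i in picked]  # made-up reward
advantages = group_relative_advantages(rewards)
```

Because advantages are standardized within the group, only the relative ranking of the sampled trajectories drives the update; no separate value network is needed for the baseline.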

TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Authors: Jiaying Zhou, Zhihao Zhan, Ruifeng Zhai, Qinhan Lyu, Hao Liu, Keze Wang, Liang Lin, Guangrun Wang

First: 2026-03-25T17:56:32+00:00 · Latest: 2026-03-25T17:56:32+00:00


Abstract

Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.

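Title/Abstract (translated from Chinese)

Title: TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

Vision-Language-Action (VLA) policies have made notable progress in mapping language instructions and visual observations to robot actions, but their reliability drops in cluttered scenes that contain distractors. Analyzing failure cases, we find that many errors stem not from infeasible motions but from instance-level grounding failures: the policy often produces a plausible-looking grasp trajectory that ends up slightly off target or even on the wrong object instance. To address this, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts the policy's predictions under the original observation and under an object-erased observation, and uses the difference as a residual steering signal that strengthens the influence of object evidence on the decision. TAG requires no change to the policy architecture and can be combined with existing VLA policies with minimal changes to training and inference. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.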

Summary

The paper addresses instance-level grounding failures in Vision-Language-Action (VLA) policies, which often result in near-miss or wrong-object executions in cluttered scenes. To tackle this, the authors propose TAG (Target-Agnostic Guidance), a simple inference-time mechanism that contrasts policy predictions with and without object information to reduce bias. TAG does not require modifying the policy architecture and can be integrated into existing VLA models with minimal changes. Experiments on standard manipulation benchmarks show that TAG improves robustness under clutter and reduces errors such as near-misses and wrong-object executions.

The work targets the grounding failures that VLA policies exhibit in cluttered scenes, which often lead to near misses or wrong-object executions. To address them, the authors propose TAG (Target-Agnostic Guidance), a simple inference-time mechanism that contrasts policy predictions with and without object information to produce a residual guidance signal. The method requires no change to the policy architecture, and evaluation on standard manipulation benchmarks shows improved robustness under clutter.
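The contrast described in the TAG entry above can be sketched with the standard classifier-free-guidance combination rule. This is an assumption: the abstract says the difference between the two predictions is used as a residual steering signal but does not give the formula, so `tag_guided_action`, `weight`, and the example numbers are hypothetical.

```python
def tag_guided_action(action_full, action_erased, weight):
    """CFG-style combination: push the prediction away from the object-erased one.

    action_full   -- policy output for the original observation
    action_erased -- policy output for the object-erased observation
    weight        -- guidance strength (0 returns the unguided action)
    """
    return [f + weight * (f - e) for f, e in zip(action_full, action_erased)]

# Toy 3-D end-effector targets; the numbers are invented for illustration.
full = [0.40, 0.10, 0.25]     # prediction when the target object is visible
erased = [0.35, 0.20, 0.25]   # prediction with the target erased
guided = tag_guided_action(full, erased, weight=1.5)
```

Coordinates on which the two predictions agree are left untouched, so the correction acts only where the presence of the object changes the policy's output.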

Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

Authors: Linbo Wang, Yupeng Zheng, Qiang Chen, Shiwei Li, Yichen Zhang, Zebin Xing, Qichao Zhang, Xiang Li, Deheng Qian, Pengxuan Yang, Yihang Dong, Ce Hao, Xiaoqing Ye, Junyu Han, Yifeng Pan, Dongbin Zhao

First: 2026-03-25T17:56:07+00:00 · Latest: 2026-03-25T17:56:07+00:00


Abstract

We introduce Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially-aware and dynamics-informed latent world representations. Existing world-model-based planners suffer from inadequately compressed representations, limited spatial understanding, and underutilized temporal dynamics, resulting in sub-optimal planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE) that distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens via learnable queries, and a Dynamic Latent World Model (DLWM) that employs a causal Transformer to autoregressively predict future world status conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM demonstrate new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

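Title/Abstract (translated from Chinese)

Title: Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving

We present Latent-WAM, an efficient end-to-end autonomous driving framework that achieves strong trajectory planning through spatially aware, dynamics-informed latent world representations. Existing world-model-based planners suffer from insufficiently compressed representations, limited spatial understanding, and underused temporal dynamics, which leads to poor planning under constrained data and compute budgets. Latent-WAM addresses these limitations with two core modules: a Spatial-Aware Compressive World Encoder (SCWE), which distills geometric knowledge from a foundation model and compresses multi-view images into compact scene tokens through learnable queries; and a Dynamic Latent World Model (DLWM), which uses a causal Transformer to autoregressively predict future world states conditioned on historical visual and motion representations. Extensive experiments on NAVSIM v2 and HUGSIM show new state-of-the-art results: 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, 3.2 EPDMS above the best prior perception-free method, with significantly less training data and a compact 104M-parameter model.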

Summary

Latent-WAM is an end-to-end autonomous driving framework that improves trajectory planning through spatially-aware and dynamics-informed latent world representations. It uses a Spatial-Aware Compressive World Encoder to compress multi-view images into compact scene tokens and a Dynamic Latent World Model to predict future world status. Experiments show Latent-WAM achieves new state-of-the-art results with 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, surpassing the best prior perception-free method by 3.2 EPDMS with significantly less training data and a compact 104M-parameter model.

Latent-WAM is an end-to-end autonomous driving framework that improves trajectory planning with spatially aware, dynamics-informed latent world representations. Its Spatial-Aware Compressive World Encoder and Dynamic Latent World Model overcome the limitations of existing world-model-based planners. The framework sets new state-of-the-art results, with 89.3 EPDMS on NAVSIM v2 and 28.9 HD-Score on HUGSIM, using less training data and a compact 104M-parameter model.
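Compression "via learnable queries", as mentioned in the Latent-WAM entry above, is commonly realised as cross-attention from a small set of query vectors to many input tokens. The sketch below shows that general pattern only; it is not the SCWE module. The sizes, the random stand-ins for learned queries, and the omission of key/value projections are simplifying assumptions.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def compress_with_queries(tokens, queries):
    """Cross-attend M queries over N tokens; returns M compressed tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim)
                  for t in tokens]
        attn = softmax(scores)
        out.append([sum(a * t[j] for a, t in zip(attn, tokens))
                    for j in range(dim)])
    return out

rng = random.Random(0)
dim, n_tokens, n_queries = 8, 64, 4
image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
# In a trained model these queries would be learned parameters.
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
scene_tokens = compress_with_queries(image_tokens, queries)
```

Here 64 input tokens become 4 scene tokens, and the cost of any downstream sequence model then scales with the number of queries rather than with the number of image patches.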

Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam

First: 2026-03-25T17:54:39+00:00 · Latest: 2026-03-25T17:54:39+00:00


Abstract

Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.

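Title/Abstract (translated from Chinese)

Title: Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but reaching reliability sufficient for expert use remains challenging in domains marked by dense legal language and evolving, overlapping regulatory frameworks. We study RAG for AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences through Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question-answering performance. In some cases, stronger retrieval instead leads to more confident hallucinations when the relevant documents are missing from the corpus. These results highlight a key concern for builders of policy-focused RAG systems: improving individual components does not necessarily yield more reliable answers. Our findings offer practical insights for designing grounded question-answering systems over dynamic regulatory corpora.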

Summary

The study investigates the effectiveness of retrieval-augmented generation (RAG) systems in AI governance and policy analysis using the AGORA corpus. The system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences with Direct Preference Optimization (DPO). Experiments show that while domain-specific fine-tuning improves retrieval metrics, it does not consistently enhance end-to-end question answering performance, and stronger retrieval can lead to more confident hallucinations when relevant documents are missing from the corpus. This highlights the need for careful design in policy-focused RAG systems to ensure reliable answers.

Using the AGORA corpus, the study examines RAG systems for AI governance and policy analysis. It pairs a ColBERT retriever with a generator fine-tuned with DPO to improve retrieval and answer relevance. Experiments show that domain-specific fine-tuning can raise retrieval metrics but does not uniformly improve end-to-end question-answering performance, and that stronger retrieval can produce more confident hallucinations when relevant documents are missing. In policy-focused RAG systems, improving individual components therefore does not guarantee more reliable answers.
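For readers unfamiliar with ColBERT-style retrieval, mentioned in the entry above, its scoring rule is late interaction: each query token is matched to its most similar document token, and the maxima are summed. The sketch below shows only that MaxSim rule on invented 2-D embeddings; the document ids and vectors are hypothetical and nothing here reproduces the paper's fine-tuned retriever.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token take its best-matching document
    token (max dot product), then sum those maxima over the query tokens."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs)
    return total

def rank(query_vecs, corpus):
    """Return document ids sorted by descending MaxSim score."""
    scored = {doc_id: maxsim_score(query_vecs, vecs)
              for doc_id, vecs in corpus.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Made-up 2-D token embeddings, one list of vectors per text.
query = [[1.0, 0.0], [0.0, 1.0]]
corpus = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    "doc_b": [[0.6, 0.8], [0.8, 0.6]],
}
ranking = rank(query, corpus)
```

Note that `rank` always returns some top document, even when nothing in the corpus is actually relevant. A generator that trusts the top hit unconditionally is exposed to the failure mode the abstract reports for queries whose relevant documents are absent.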

Vision-Language Models vs Human: Perceptual Image Quality Assessment

Authors: Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan

First: 2026-03-25T17:54:07+00:00 · Latest: 2026-03-25T17:54:07+00:00


Abstract

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.

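Title/Abstract (translated from Chinese)

Title: Vision-Language Models vs Human: Perceptual Image Quality Assessment

Psychophysical experiments remain the most reliable method for perceptual image quality assessment (IQA), but their cost and limited scalability motivate automated approaches. We study whether Vision Language Models (VLMs) can approximate human perceptual judgments on three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. Through comparison with human psychophysical data, this work systematically evaluates VLMs for perceptual IQA. The results show strong attribute-dependent variability: models that align closely with humans on colorfulness (ρ up to 0.93) perform worse on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs give colorfulness more weight than contrast when judging overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive trade-off: the most self-consistent models are not necessarily the closest to humans, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. In addition, human-VLM agreement rises with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clear.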

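Summary

This study evaluates Vision Language Models (VLMs) for perceptual image quality assessment (IQA) across three scales: contrast, colorfulness, and overall preference. Six VLMs, four proprietary and two open-weight, were benchmarked against human psychophysical data. Alignment is attribute-dependent: models that track human colorfulness judgments closely (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis indicates that most VLMs prioritize colorfulness over contrast in overall preference judgments, as humans do. The study also finds that the most self-consistent VLMs are not always the most aligned with human judgments, and that human-VLM agreement improves with the perceptual separability of stimuli.

The study evaluates Vision Language Models (VLMs) for image quality assessment (IQA) by comparing their judgments with human psychophysical data on three attributes: contrast, colorfulness, and overall preference. Six VLMs, proprietary and open-weight, are benchmarked. Models that align closely with humans on colorfulness perform worse on contrast. Attribute weighting analysis shows that VLMs give more weight to colorfulness when judging overall preference, similar to human judgments. Intra-model consistency analysis shows that the most consistent models are not necessarily the closest to human responses, which suggests that response variability reflects sensitivity to scene-specific perceptual cues. Human-VLM agreement also increases as perceptual differences become clearer, so VLMs are more reliable when stimulus differences are obvious.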
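The agreement figure quoted in the entry above (ρ up to 0.93) is a correlation between model and human ratings. Assuming ρ denotes Spearman's rank correlation, which the abstract does not state, a minimal pure-Python computation looks like the following. The score lists are invented and are not data from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human_scores = [1.2, 2.5, 3.1, 4.0, 4.8]   # invented human ratings
vlm_scores = [2.0, 2.2, 3.9, 3.5, 4.9]     # invented model ratings
rho = spearman_rho(human_scores, vlm_scores)
```

A rank-based measure ignores the scale each rater uses, so a model can reach a high ρ while its absolute scores differ from the human ones, as long as it orders the images similarly.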

EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Authors: Falong Fan, Yi Xie, Arnis Lektauers, Bo Liu, Jerzy Rozenblit

First: 2026-03-25T17:53:34+00:00 · Latest: 2026-03-25T17:53:34+00:00


Abstract

Accurate 3D reconstruction of deformable soft tissues is essential for surgical robotic perception. However, low-texture surfaces, specular highlights, and instrument occlusions often fragment geometric continuity, posing a challenge for existing fixed-topology approaches. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Rather than using static spatial neighborhoods, DeGAT dynamically constructs feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This enables robust propagation of structural cues across occlusions, enforcing global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method significantly improves fidelity, increasing PSNR by 24.6% and SSIM by 9.1% over prior state-of-the-art. Crucially, EndoVGGT exhibits strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the efficacy of dynamic feature-space modeling for consistent surgical 3D reconstruction.

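Title/Abstract (translated from Chinese)

Title: EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction

Accurate 3D reconstruction of deformable soft tissue is essential for surgical robot perception. However, low-texture surfaces, specular highlights, and instrument occlusions often break geometric continuity, which challenges existing fixed-topology methods. To address this, we propose EndoVGGT, a geometry-centric framework equipped with a Deformation-aware Graph Attention (DeGAT) module. Instead of static spatial neighborhoods, DeGAT dynamically builds feature-space semantic graphs to capture long-range correlations among coherent tissue regions. This lets structural cues propagate robustly across occlusions, strengthening global consistency and improving non-rigid deformation recovery. Extensive experiments on SCARED show that our method markedly improves fidelity, raising PSNR by 24.6% and SSIM by 9.1% over the prior state of the art. Most importantly, EndoVGGT shows strong zero-shot cross-dataset generalization to the unseen SCARED and EndoNeRF domains, confirming that DeGAT learns domain-agnostic geometric priors. These results highlight the effectiveness of dynamic feature-space modeling for consistent surgical 3D reconstruction.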

Summary

The research aims to enhance the accuracy of 3D reconstruction of soft tissues in surgery, addressing challenges posed by low-texture surfaces and occlusions. The proposed EndoVGGT framework uses a Deformation-aware Graph Attention (DeGAT) module to dynamically construct semantic graphs, enabling robust propagation of structural cues and improving global consistency. Experiments on SCARED show PSNR and SSIM gains of 24.6% and 9.1% over the prior state of the art, and the method demonstrates strong zero-shot generalization across datasets, indicating that it learns domain-agnostic geometric priors.

The research aims to improve the accuracy of soft-tissue 3D reconstruction in surgery by addressing the challenges of low-texture surfaces and occlusions. The proposed EndoVGGT framework uses a Deformation-aware Graph Attention (DeGAT) module that dynamically builds semantic graphs, captures long-range correlations, and improves the propagation of structural cues. Experiments on SCARED show that the method clearly outperforms prior methods on PSNR and SSIM and generalizes well zero-shot to new datasets, indicating that it has learned domain-agnostic geometric priors.
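The contrast between static spatial neighbourhoods and feature-space graphs in the EndoVGGT entry above can be shown on a toy example. The sketch below builds a k-nearest-neighbour graph from feature similarity and aggregates along it with softmax weights. It is a generic illustration and not the DeGAT module: the function names, the 2-D features, and `k` are hypothetical.

```python
import math

def knn_feature_graph(features, k):
    """For each node, the indices of its k nearest neighbours in feature space."""
    neighbours = []
    for i, fi in enumerate(features):
        dists = []
        for j, fj in enumerate(features):
            if i != j:
                dists.append((sum((a - b) ** 2 for a, b in zip(fi, fj)), j))
        dists.sort()
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

def attention_aggregate(features, neighbours):
    """Replace each feature by a softmax(dot-product)-weighted mean of its neighbours."""
    out = []
    for fi, nbrs in zip(features, neighbours):
        scores = [sum(a * b for a, b in zip(fi, features[j])) for j in nbrs]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * features[j][d] for wj, j in zip(w, nbrs)) / z
                    for d in range(len(fi))])
    return out

# Four patches on a line (index = spatial position); patches 0 and 3 look alike.
features = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.9, 0.1]]
graph = knn_feature_graph(features, k=1)
smoothed = attention_aggregate(features, graph)
```

Patch 0 is linked to patch 3 even though they sit at opposite ends of the line, which is the kind of long-range connection a fixed spatial neighbourhood would miss.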

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Authors: Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang

First: 2026-03-25T17:52:43+00:00 · Latest: 2026-03-25T17:52:43+00:00

Comments: Code is available at https://github.com/gxyes/MARS_Chameleon


Abstract

Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.

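Title/Abstract (translated from Chinese)

Title: Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

Robotic manipulation often needs memory: occlusion and state changes can make decision-time observations perceptually aliased, so action selection is non-Markovian at the observation level, because the same observation can arise from different interaction histories. Most embodied agents implement memory through semantically compressed traces and similarity-based retrieval, which discards the fine-grained perceptual cues needed for disambiguation and can return episodes that look similar but are irrelevant to the decision. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and achieves goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset covering episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.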

Summary

The paper addresses the need for memory in robotic manipulation, especially when occlusions and state changes make observations ambiguous. It introduces Chameleon, which uses geometry-grounded multimodal tokens to preserve disambiguating context and a differentiable memory stack for goal-directed recall. Experiments show that Chameleon improves decision reliability and long-horizon control compared to strong baselines in perceptually confusable settings.

The paper proposes Chameleon, a system that uses geometry-grounded multimodal tokens to improve robotic manipulation under perceptual aliasing. By preserving disambiguating context and using a differentiable memory stack for goal-directed recall, Chameleon strengthens decision reliability and long-horizon control. It outperforms strong baselines across the tasks of Camo-Dataset, a real-robot UR5e dataset that covers episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing.

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Authors: Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna

First: 2026-03-25T17:52:23+00:00 · Latest: 2026-03-25T17:52:23+00:00


Abstract

Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.

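Title/Abstract (translated from Chinese)

Title: VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Scalable Vector Graphics (SVG) is an indispensable format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, the original vector source files are often lost or inaccessible, leaving only "flat" raster versions (e.g., PNG or JPEG) that are hard to modify or scale. Rebuilding these figures by hand is labor-intensive and requires specialized skill to recover the original geometric intent. To close this gap, we propose VFIG, a family of vision-language models for complex, high-fidelity figure-to-SVG conversion. Although the task is inherently data-driven, existing datasets are usually small and lack the complexity of professional diagrams. We address this with VFIG-DATA, a dataset of 66K high-quality figure-SVG pairs drawn from a varied mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are built from recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that starts with supervised fine-tuning (SFT) to learn atomic primitives and then moves to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with new metrics for measuring the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH.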

Summary

VFIG is a family of Vision-Language Models designed for converting complex rasterized figures into Scalable Vector Graphics (SVG). It leverages a large-scale dataset, VFIG-DATA, consisting of 66K figure-SVG pairs, and employs a training curriculum that starts with supervised fine-tuning and transitions to reinforcement learning to optimize global diagram fidelity. VFIG is state of the art among open-source models and on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on the VFIG-BENCH evaluation suite.

VFIG is a vision-language model for converting complex rasterized figures into Scalable Vector Graphics (SVG). It uses the large-scale VFIG-DATA dataset of 66,000 figure-SVG pairs and a training curriculum that starts with supervised fine-tuning and moves to reinforcement learning to optimize global diagram fidelity. VFIG scores 0.829 on VLM-Judge on the VFIG-BENCH evaluation suite, the best result among open-source models.

Towards Training-Free Scene Text Editing

Authors: Yubo Li, Xugong Qin, Peng Zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang

Venue: CVPR 2026

First: 2026-03-25T17:50:31+00:00 · Latest: 2026-03-25T17:50:31+00:00

Comments: Accepted by CVPR 2026


Abstract

Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow

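Title/Abstract (translated from Chinese)

Title: Towards Training-Free Scene Text Editing

Scene text editing aims to modify the text in natural images while preserving visual realism and semantic consistency. Existing methods usually need task-specific training or paired data, which limits their scalability and adaptability. In this paper we propose TextFlow, a training-free scene text editing framework that combines the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to deliver flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves structural and style consistency by modeling the visual flow of character and background regions, while AttnBoost improves the rendering of text content through attention-based guidance. Using these complementary modules together, our method performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments show that the framework matches or exceeds training-based methods in visual quality and text accuracy and generalizes well across varied scenes and languages. This study moves scene text editing toward a more efficient, more general, training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow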

Summary

The paper addresses the challenge of scene text editing by proposing TextFlow, a training-free framework that combines AttnBoost and FMS to achieve high-fidelity text manipulation without additional training. FMS ensures structural and style consistency by modeling visual flow, while AttnBoost enhances text rendering through attention-based guidance. Experiments show that TextFlow matches or outperforms training-based methods in visual quality and text accuracy across various scenes and languages, advancing the field towards a more efficient and generalizable paradigm.

The paper proposes TextFlow, a training-free scene text editing framework that combines AttnBoost and FMS. AttnBoost improves text rendering through an attention mechanism, while FMS preserves structural and style consistency by modeling visual flow. Experiments show that TextFlow matches or exceeds training-based methods in visual quality and text accuracy and generalizes well across scenes and languages.

Completeness of Unbounded Best-First Minimax and Descent Minimax

Authors: Quentin Cohen-Solal

First: 2026-03-25T17:50:31+00:00 · Latest: 2026-03-25T17:50:31+00:00


Abstract

In this article, we focus on search algorithms for two-player perfect information games, whose objective is to determine the best possible strategy, and ideally a winning strategy. Unfortunately, some search algorithms for games in the literature are not able to always determine a winning strategy, even with an infinite search time. This is the case, for example, of the following algorithms: Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in state-of-the-art knowledge-free reinforcement learning. They were then improved with the so-called completion technique. However, whether this technique sufficiently improves these algorithms to allow them to always determine a winning strategy remained an open question until now. To answer this question, we generalize the two algorithms (their versions using the completion technique), and we show that any algorithm of this class of algorithms computes the best strategy. Finally, we experimentally show that the completion technique improves winning performance.

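Title/Abstract (translated from Chinese)

Title: Completeness of Unbounded Best-First Minimax and Descent Minimax

This article concerns search algorithms for two-player perfect-information games, whose goal is to determine the best strategy and, ideally, a winning strategy. Unfortunately, some game search algorithms in the literature cannot always determine a winning strategy, even with infinite search time. This is the case, for example, for Unbounded Best-First Minimax and Descent Minimax, which are core algorithms of state-of-the-art knowledge-free reinforcement learning. These algorithms were later improved with the so-called completion technique. However, whether this technique improves them enough to always determine a winning strategy has remained an open question until now. To answer it, we generalize the two algorithms (in their versions using the completion technique) and prove that any algorithm in this class computes the best strategy. Finally, we show experimentally that the completion technique improves winning performance.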

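Summary

This study addresses the completeness of search algorithms for two-player perfect information games, specifically Unbounded Best-First Minimax and Descent Minimax, which are core algorithms in knowledge-free reinforcement learning. The paper generalizes the completion-based versions of both algorithms and shows that every algorithm in the resulting class computes the best strategy, which settles the open question of whether the completion technique is sufficient. Experiments confirm that the completion technique enhances winning performance.

The study examines search algorithms for two-player perfect-information games, in particular Unbounded Best-First Minimax and Descent Minimax, which are central to knowledge-free reinforcement learning. By generalizing these algorithms with the completion technique, it proves that they can always determine the best strategy, including a winning strategy. Experiments show that the completion technique improves winning performance.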
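The role of marking solved subtrees, as discussed in the minimax entry above, can be illustrated on a toy game tree. The sketch below is a simplified stand-in, not the generalized algorithm analysed in the paper: the `resolved` flag, the tuple encoding of states, and the rule of descending only through unresolved children are assumptions made for illustration. It runs a best-first descent that always extends the currently best-looking open line, stops once the root value is exact, and compares the result with plain minimax.

```python
class SearchNode:
    """A node of the partial search tree.

    A state is either a number (terminal outcome for the first player) or a
    tuple (heuristic_value, [child_states]) for a non-terminal position.
    """
    def __init__(self, state):
        self.state = state
        self.terminal = not isinstance(state, tuple)
        self.value = state if self.terminal else state[0]
        self.resolved = self.terminal   # exact value known?
        self.children = None            # None until expanded

def backup(node, maximizing):
    """Recompute value and resolved flag from the children."""
    pick = max if maximizing else min
    node.value = pick(c.value for c in node.children)
    node.resolved = all(c.resolved for c in node.children)

def iterate(node, maximizing):
    """One best-first descent below an unresolved node, then back up."""
    if node.children is None:
        node.children = [SearchNode(s) for s in node.state[1]]
    else:
        open_children = [c for c in node.children if not c.resolved]
        pick = max if maximizing else min
        iterate(pick(open_children, key=lambda c: c.value), not maximizing)
    backup(node, maximizing)

def solve(root_state):
    """Iterate until the root is resolved; return its exact value."""
    root = SearchNode(root_state)
    while not root.resolved:
        iterate(root, True)
    return root.value

def minimax(state, maximizing=True):
    """Reference: exhaustive minimax on the same encoding."""
    if not isinstance(state, tuple):
        return state
    pick = max if maximizing else min
    return pick(minimax(s, not maximizing) for s in state[1])

# The first move looks best heuristically (0.9) but actually loses (-1).
tree = (0.0, [(0.9, [-1, -1]), (0.1, [1, (0.5, [1, -1])])])
value = solve(tree)
```

On a finite tree each iteration expands one new node and resolved subtrees are never revisited, so the loop terminates with the exact value. The paper's contribution concerns the much harder question of proving such a guarantee for the actual completion-based algorithms.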

Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Authors: Duc Vu, Anh Nguyen, Chi Tran, Anh Tran

Venue: CVPR 2026

First: 2026-03-25T17:48:10+00:00 · Latest: 2026-03-25T17:48:10+00:00

Comments: Accepted to CVPR 2026 (Main Conference)


Abstract

Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

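Title/Abstract (translated from Chinese)

Title: Anti-I2V: Safeguarding Your Photos from Malicious Image-to-Video Generation

Advances in diffusion-based video generation models have markedly improved human animation, but they also create a risk of misuse: fake videos can be generated from a specific person's photo and text prompts. Recent work has focused on adversarial attacks that add carefully crafted perturbations to protect images from diffusion models. However, most existing methods target image generation, relatively few explicitly address image-to-video diffusion models (VDMs), and most concentrate on UNet-based architectures. Their effectiveness against Diffusion Transformer (DiT) models therefore remains largely unexplored, since these models show better feature retention and stronger temporal consistency thanks to their larger capacity and advanced attention mechanisms. In this work we introduce Anti-I2V, a novel defense against malicious human image-to-video generation that applies across diverse diffusion backbones. Rather than restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinctive semantic features during denoising and design training objectives that maximize the degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V shows state-of-the-art defense performance against a range of video diffusion models, providing an effective solution.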

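Summary

This paper addresses the threat of generating fake videos from a person's photo using diffusion-based video generation models. It introduces Anti-I2V, a defense mechanism that operates in both the L*a*b* and frequency domains to protect images from such models. Anti-I2V identifies the network layers that capture the most distinct semantic features during denoising and uses them to design objectives that degrade temporal coherence and generation fidelity. Extensive validation shows that Anti-I2V achieves state-of-the-art defense performance against various video diffusion models.

The paper proposes Anti-I2V, a novel defense against malicious human image-to-video generation that applies to a variety of diffusion model architectures. Anti-I2V strengthens robustness by applying noise updates in the L*a*b* and frequency domains and concentrating on salient pixels. The method identifies the network layers that capture the most distinctive semantic features during denoising and designs training objectives to reduce temporal coherence and generation fidelity. Extensive validation shows that Anti-I2V performs strongly against a range of video diffusion models and offers an effective way to prevent fake videos from being generated from an image and text prompts.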

POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan

Authors: Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das, Monorama Swain, Yufang Hou, Elisabeth Andre, Khalid Mahmood Malik, Markus Schedl, Shah Nawaz

Venue: ACM MM 2026

First: 2026-03-25T17:47:00+00:00 · Latest: 2026-03-25T17:47:00+00:00

Comments: Grand challenge at ACM MM 2026


Abstract

Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.

中文标题/摘要

标题：POLY-SIM：缺失模态下的多语种说话人识别大挑战赛 2026 评估计划

多模态说话人识别系统通常假设在训练和测试过程中音频-视觉模态是完整且一致的。然而，在实际应用中，这种假设往往不成立。视觉信息可能由于遮挡、摄像机故障或隐私限制而缺失，而多语种说话人则由于语言间的语言变异增加了额外的复杂性。这些挑战显著影响了多模态说话人识别系统的鲁棒性和泛化能力。POLY-SIM 2026年挑战旨在推进在缺失模态和跨语言条件下多模态说话人识别的研究。具体而言，挑战鼓励开发能够有效利用不完整多模态输入并保持在不同语言中强大性能的稳健方法。本报告介绍了POLY-SIM 2026年挑战的设计和组织，包括数据集、任务定义、评估协议和基线模型。通过提供标准化的基准和评估框架，挑战旨在促进更稳健和实用的多模态说话人识别系统的发展。

Summary / 总结

The POLY-SIM Grand Challenge 2026 addresses the robustness of multimodal speaker identification systems under missing modality and cross-lingual conditions. It evaluates methods that can effectively use incomplete multimodal inputs and perform well across different languages. The challenge includes a standardized dataset, task formulation, and evaluation protocol to advance research in this area.

POLY-SIM 2026 大挑战旨在解决在缺失模态和跨语言条件下多模态说话人识别系统的鲁棒性问题。它评估能够处理不完整音频-视觉输入并在不同语言中表现良好的方法。该挑战提供标准化的数据集、任务定义、评估协议和基线模型，以推动该领域的研究。

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Authors: Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

First: 2026-02-18T14:19:01+00:00 · Latest: 2026-03-25T17:42:42+00:00

Comments: 8 pages

Abs · PDF · Code1 · Code2

Abstract

Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. We propose Team-of-Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator-driven paradigm. Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own domain-specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team-of-Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).

中文标题/摘要

标题：思想团队：通过协调工具调用实现代理系统测试时高效扩展

现有的多代理系统（MAS）通常依赖于同质模型配置，未能充分利用不同后训练架构中固有的多样专长。我们提出了一种异构MAS框架——思想团队，该框架将不同的模型视为在协调者驱动范式下的专门工具。思想团队引入了两个新颖组件：（1）协调者校准，用于识别具有卓越协调和综合能力的模型；（2）代理自我评估，一种工具代理自我评估其领域特定优势的协议，以指导选择。在推理时，协调者根据这些档案动态激活最兼容的代理，以最大化能力覆盖。在五个数学推理和代码生成基准测试中，思想团队始终优于单个模型和现有MAS基线。值得注意的是，在AIME24和LiveCodeBench上，思想团队分别实现了96.00%和77.91%的准确率，显著优于同质角色扮演基线（80.00%和65.93%）。

Summary / 总结

The research aims to enhance the efficiency and performance of Multi-Agent Systems (MAS) by leveraging the diverse expertise of different post-trained models. It introduces Team-of-Thoughts, a heterogeneous framework with two components: Orchestrator Calibration, which identifies models with strong coordination and synthesis capabilities, and Agent Self-Assessment, in which tool agents profile their own domain-specific strengths to guide selection. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts outperforms individual models and existing MAS baselines, achieving 96.00% and 77.91% accuracy on AIME24 and LiveCodeBench, respectively, compared with 80.00% and 65.93% for homogeneous role-play baselines.

研究旨在通过利用不同模型的多样化专长来提升多智能体系统的效率和性能。提出了Team-of-Thoughts，这是一种异构MAS框架，包括一个调度器用于动态激活模型和一个代理自我评估机制来评估其专长。在五个基准测试中，Team-of-Thoughts的表现优于单一模型和现有MAS基线，分别在AIME24和LiveCodeBench上达到96.00%和77.91%的准确率，显著超越了同质角色扮演基线。
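The selection step described above, where the orchestrator activates the agents whose self-assessed profiles best cover the task, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_agents`, the profile format (a score per skill in [0, 1]), and the greedy coverage rule are all assumptions made for illustration.

```python
from typing import Dict, List


def select_agents(profiles: Dict[str, Dict[str, float]],
                  required: List[str],
                  budget: int) -> List[str]:
    """Greedily pick up to `budget` agents that best cover the required skills.

    `profiles` maps a (hypothetical) agent name to its self-assessed score per skill.
    """
    chosen: List[str] = []
    best = {skill: 0.0 for skill in required}  # best covered score per skill so far

    def gain(agent: str) -> float:
        # Total improvement in coverage if this agent joined the team.
        return sum(max(profiles[agent].get(s, 0.0) - best[s], 0.0) for s in required)

    for _ in range(budget):
        candidates = [a for a in profiles if a not in chosen]
        if not candidates:
            break
        top = max(candidates, key=gain)
        if gain(top) <= 0.0:
            break  # nobody left adds coverage
        chosen.append(top)
        for s in required:
            best[s] = max(best[s], profiles[top].get(s, 0.0))
    return chosen


# Hypothetical self-assessment profiles.
profiles = {
    "model_a": {"math": 0.9, "code": 0.4},
    "model_b": {"math": 0.3, "code": 0.8},
    "model_c": {"math": 0.6, "code": 0.5},
}
team = select_agents(profiles, ["math", "code"], budget=2)
```

A greedy rule is only one way to approximate "maximize capability coverage"; the paper may use a different criterion.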

The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems

Authors: Martin Jaraiz

First: 2026-03-25T17:41:25+00:00 · Latest: 2026-03-25T17:41:25+00:00

Comments: 26 pages, 3 figures, 2 tables, draft

Abs · PDF · Code1 · Code2

Abstract

We introduce the Free-Market Algorithm (FMA), a novel metaheuristic inspired by free-market economics. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing -- which require prescribed fitness functions and fixed search spaces -- FMA uses distributed supply-and-demand dynamics where fitness is emergent, the search space is open-ended, and solutions take the form of hierarchical pathway networks. Autonomous agents discover rules, trade goods, open and close firms, and compete for demand with no centralized controller. FMA operates through a three-layer architecture: a universal market mechanism (supply, demand, competition, selection), pluggable domain-specific behavioral rules, and domain-specific observation. The market mechanism is identical across applications; only the behavioral rules change. FMA is validated in two unrelated domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), FMA discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop -- with up to 240 independent synthesis routes per product. In macroeconomic forecasting, reading a single input-output table with zero estimated parameters, FMA achieves a Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters and portable to 33 countries. Assembly Theory alignment shows that FMA provides the first explicit, tunable mechanism for the selection signatures described by Sharma et al. (Nature, 2023). The event-driven assembly dynamics resonate with foundational programs in physics -- causal set theory, relational quantum mechanics, constructor theory -- suggesting that Darwinian market dynamics may reflect a deeper organizational principle that leads to the unfolding of Nature itself.

中文标题/摘要

标题：自由市场算法：开放复杂系统的自我组织优化

我们引入了自由市场算法（FMA），这是一种受自由市场经济启发的新型元启发式算法。与遗传算法、粒子群优化和模拟退火不同，FMA 不需要预设的适应度函数和固定的搜索空间，而是利用分布式供需动态，其中适应度是涌现的，搜索空间是开放的，解决方案表现为分层路径网络。自主代理发现规则、交易商品、开设和关闭企业，并在没有中央控制器的情况下竞争需求。FMA 通过三层架构运行：通用市场机制（供应、需求、竞争、选择）、可插拔的领域特定行为规则和领域特定观察。市场机制在所有应用中都是相同的；只有行为规则会改变。FMA 在两个不相关的领域中得到了验证。在前生物化学中，从900个裸原子（C、H、O、N）开始，FMA 在笔记本电脑上不到5分钟就发现了所有12种可行的氨基酸分子式、所有5种核碱基、甲醛聚糖（formose）糖链和克雷布斯循环中间体，每种产物最多有240条独立合成路线。在宏观经济预测中，仅读取一个投入产出表且无需估计任何参数，FMA 的非危机时期GDP预测平均绝对误差为0.42个百分点，与专业预测者相当，并且可以移植到33个国家。与组装理论（Assembly Theory）的对照表明，FMA 为 Sharma 等人（Nature, 2023）描述的选择特征提供了第一个明确、可调的机制。事件驱动的组装动力学与物理学中的基础性研究纲领（因果集理论、关系量子力学、构造理论）相呼应，表明达尔文式的市场动力学可能反映了一种更深层次的组织原则，这一原则引导了自然本身的展开。

Summary / 总结

The Free-Market Algorithm (FMA) is a metaheuristic inspired by free-market economics, designed for open-ended complex systems. Unlike traditional algorithms that require prescribed fitness functions and fixed search spaces, FMA lets fitness emerge from distributed supply-and-demand dynamics. It was validated in two domains: in prebiotic chemistry it discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates from 900 bare atoms, and in macroeconomic forecasting it achieves a Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters. FMA's three-layer architecture comprises a universal market mechanism, pluggable domain-specific behavioral rules, and domain-specific observation.

自由市场算法（FMA）是一种受自由市场经济启发的元启发式算法，适用于开放复杂的系统。不同于需要预设适应度函数和固定搜索空间的传统算法，FMA 的适应度由供需动态自发涌现。该算法已在前生物化学和宏观经济预测中得到验证，发现了多种分子，并达到了与专业预测者相当的准确性。FMA 的三层架构包括通用市场机制、可插拔的行为规则和领域特定的观察，显示出在多种应用中的潜力。

Knot-10: A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis

Authors: Shiheng Nie, Yunguang Yue

First: 2026-03-24T14:50:34+00:00 · Latest: 2026-03-25T17:39:26+00:00

Comments: 48 pages, 12 figures, 10 supplementary sections

Abs · PDF · Code1 · Code2

Abstract

Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.

中文标题/摘要

标题：Knot-10：面向真实世界绳结分类的松紧度分层基准及拓扑难度分析

物理绳结分类是细粒度视觉分类（FGVC）的一种场景，其中外观线索被故意抑制：不同类别共享相同的绳索材料、颜色和背景，类别身份主要在于交叉结构。我们引入了Knots-10基准，包含1440张图像，并采用面向部署的划分方式，训练集包含松散打结的绳结，测试集包含紧密打结的绳结。Swin-T和TransFG的平均准确率均为97.2%；PMG得分为94.5%，这与“拼图打乱会破坏交叉连续性”的假设一致。McNemar检验无法区分五种通用骨干网络中的四种，因此对微小的排名差距应谨慎解释。Mantel排列检验显示，在五种模型中的三种中，拓扑距离与混淆模式显著相关（p < 0.01）。我们提出了TACA正则化，它在不提高分类准确率的情况下，将嵌入-拓扑对齐从ρ=0.46提高到ρ=0.65；随机距离消融实验得到了相近的对齐效果，表明其好处可能是由通用正则化驱动的。一项使用100张手机照片的试点跨域测试揭示了58-69个百分点的准确率下降，表明绳索外观偏差是主要的失败模式。

Summary / 总结

The research introduces a benchmark for classifying physical knots, a fine-grained visual classification setting in which class identity resides primarily in crossing structure rather than appearance. The Knots-10 benchmark comprises 1,440 images and uses a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy, while PMG scores 94.5%. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). TACA regularization improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy, and a random-distance ablation yields comparable alignment, so the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs shows a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.

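该研究提出了一个物理绳结分类基准。在这一细粒度视觉分类场景中，类别身份主要取决于交叉结构而非外观。Knots-10基准包含1,440张图像，采用面向部署的划分方式：在松散打结的绳结上训练，在紧密打结的绳结上测试。Swin-T和TransFG的平均准确率均为97.2%，而PMG得分为94.5%。Mantel排列检验显示，在五种模型中的三种中，拓扑距离与混淆模式显著相关（p < 0.01）。TACA正则化将嵌入-拓扑对齐从ρ=0.46提高到ρ=0.65，但未提高分类准确率；随机距离消融得到了相近的对齐效果，说明其好处可能来自通用正则化。在使用100张手机照片进行的试点跨域测试中，准确率下降了58-69个百分点，表明绳索外观偏差是主要的失败模式。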
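The Mantel permutation test mentioned above checks whether two distance matrices over the same items are correlated, here topological distance between knot classes and a dissimilarity derived from model confusions. The following is a minimal standard-library sketch of the general technique, not the paper's code: it uses Pearson correlation on the upper triangles (the paper may use a rank correlation), and the toy matrices are invented for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(vx * vy)


def upper_triangle(m, order):
    """Flatten the strict upper triangle of m after relabelling items by `order`."""
    n = len(order)
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    n = len(dist_a)
    identity = list(range(n))
    b_flat = upper_triangle(dist_b, identity)
    observed = pearson(upper_triangle(dist_a, identity), b_flat)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the items of one matrix only
        if pearson(upper_triangle(dist_a, perm), b_flat) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: "topological" distances from invented 2-D points, and a
# "confusion" dissimilarity that tracks them up to a small perturbation.
points = [(0, 0), (1, 0), (3, 1), (4, 4), (0, 5), (2, 7), (6, 2), (7, 6)]
topo = [[math.dist(p, q) for q in points] for p in points]
confusion = [[topo[i][j] + 0.01 * ((i + j) % 3) if i != j else 0.0
              for j in range(len(points))] for i in range(len(points))]
r, p_value = mantel_test(topo, confusion)
```

Permuting rows and columns together, rather than shuffling individual entries, is what keeps the test valid for distance matrices, whose entries are not independent.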

LensWalk: Agentic Video Understanding by Planning How You See in Videos

Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan

Venue: CVPR 2026

First: 2026-03-25T17:38:54+00:00 · Latest: 2026-03-25T17:38:54+00:00

Comments: To be published in CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.

中文标题/摘要

标题：LensWalk：通过规划如何观看视频实现自主视频理解

视频的密集和时间特性为自动化分析带来了巨大的挑战。尽管使用了强大的视觉-语言模型，现有的视频理解方法仍然受限于推理与感知之间的固有脱节：它们依赖于静态的、预先处理的信息，而不能在其理解过程中主动寻求视频中的原始证据。为了解决这一问题，我们提出了LensWalk，这是一种灵活的自主框架，赋予大型语言模型推理器主动控制其视觉观察的能力。LensWalk建立了一个紧密的推理-计划-观察循环，在每个步骤中，代理动态地指定它所观察的视频的时间范围和采样密度。利用这些规范参数化的各种多功能视觉-语言模型工具，代理可以进行广泛的线索扫描，专注于特定段落进行事实提取，并从多个时刻拼接证据以实现整体验证。这种设计允许代理根据其不断发展的思维链进行渐进的、按需的证据收集。无需对任何模型进行微调，LensWalk在多个模型配方上实现了显著的即插即用性能提升，在具有挑战性的长视频基准测试LVBench和Video-MME上，其准确性提高了超过5%。我们的分析表明，使代理能够控制其如何观看是实现更准确、更稳健和更具可解释性的视频推理的关键。

Summary / 总结

LensWalk is a flexible framework that allows a Large Language Model to actively control its visual observation in videos, forming a reason-plan-observe loop. This enables the model to gather evidence progressively and adaptively, improving accuracy by over 5% on long-video benchmarks. The key finding is that allowing agents to control their visual input leads to more accurate, robust, and interpretable video reasoning.

LensWalk 是一个灵活的框架，使大型语言模型能够主动控制其对视频的视觉观察，形成一个推理-计划-观察的循环。该模型能够动态地指定视频观察的时间范围和采样密度，从而实现渐进和按需的证据收集。LensWalk 在长视频基准如 LVBench 和 Video-MME 上，无需模型微调即可提高各种视觉-语言模型配方的准确性超过 5%。
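To make "temporal scope and sampling density" concrete, the sketch below turns one observation request into frame timestamps. It is an illustration only: the helper `sample_timestamps`, its parameters, and the frame cap are hypothetical and are not taken from LensWalk.

```python
def sample_timestamps(start_s, end_s, frames_per_second, max_frames=64):
    """Evenly spaced frame timestamps (seconds) inside [start_s, end_s].

    The scope is the (start_s, end_s) window; the density is frames_per_second.
    """
    if end_s <= start_s or frames_per_second <= 0:
        return []
    count = min(max_frames, max(1, int((end_s - start_s) * frames_per_second)))
    step = (end_s - start_s) / count
    # Take the midpoint of each sub-interval so samples stay inside the window.
    return [start_s + step * (i + 0.5) for i in range(count)]


# A broad, sparse scan followed by a narrow, dense look at one segment.
broad_scan = sample_timestamps(0.0, 120.0, frames_per_second=0.5)
focused_look = sample_timestamps(100.0, 110.0, frames_per_second=2.0)
```

In an agentic loop, the reasoner would choose the window and density at each step and pass the resulting frames to a vision-language tool; that part is omitted here.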

Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Authors: Giacomo Borghi, Hyesung Im, Lorenzo Pareschi

First: 2026-03-20T09:48:45+00:00 · Latest: 2026-03-25T17:38:13+00:00

Abs · PDF · Code1 · Code2

Abstract

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.

中文标题/摘要

标题：双时间尺度学习动力学：神经网络训练的群体视角

基于群体的学习范式，包括进化策略、基于群体的训练（PBT）以及最近的模型合并方法，结合了模型内部的快速优化与群体层面较慢的适应。尽管它们在实验上取得了成功，但对由此产生的集体训练动力学的通用数学描述仍然不完整。我们引入了基于双时间尺度群体动力学的神经网络训练理论框架。我们将神经网络群体视为一个相互作用的代理系统，在该系统中，网络参数通过快速的SGD/朗之万类型的噪声梯度更新演化，而超参数则通过较慢的选择-突变动力学演化。我们证明了参数和超参数联合分布的大群体极限，并在强时间尺度分离下推导出超参数密度的选择-突变方程。对于每个固定的超参数，快速参数动力学收敛到玻尔兹曼-吉布斯测度，从而诱导出慢演化中的有效适应度。平均动力学将基于群体的学习与双层优化和经典的复制-突变模型联系起来，给出了群体均值向最适应超参数移动的条件，并阐明了噪声和多样性在平衡优化与探索中的作用。数值实验既展示了大群体情形，也展示了简化后的双时间尺度动力学，并表明获得有效适应度（无论是以闭式形式还是通过群体级估计）都可以改善群体级更新。

Summary / 总结

This paper introduces a theoretical framework for neural network training based on two-time-scale population dynamics. It models a population of neural networks in which parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection-mutation dynamics. The authors prove the large-population limit for the joint distribution and, under strong time-scale separation, derive a selection-mutation equation for the hyperparameter density, together with conditions under which the population mean moves toward the fittest hyperparameter. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, in closed form or through population-level estimation, can improve population-level updates.

论文提出了一种基于两时间尺度的理论框架来理解神经网络的训练动态，结合快速参数更新和较慢的超参数演化。它将神经网络群体建模为相互作用的代理系统，并证明了参数和超参数联合分布的大群体极限。关键发现包括推导出超参数密度的选择-突变方程，快速参数动态的玻尔兹曼-吉布斯分布收敛，以及与双层优化和经典模型的联系。数值实验表明，有效适应度的访问可以增强群体级别的更新。
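The structure described above can be written schematically as follows. This is a generic textbook form under assumed notation, not the paper's exact equations: $L(\theta, h)$ is a training loss, $\sigma$ a noise level, $f(\theta, h)$ a per-network fitness (for instance a negative validation loss), $\mu$ a mutation rate, and the diffusion-type mutation term is one common modelling choice.

```latex
% Fast scale: overdamped Langevin dynamics for parameters \theta at fixed hyperparameter h
d\theta_t = -\nabla_\theta L(\theta_t, h)\,dt + \sqrt{2\sigma}\,dW_t

% Its stationary Boltzmann--Gibbs density
\rho_h(\theta) = \frac{1}{Z_h}\exp\!\left(-\frac{L(\theta, h)}{\sigma}\right),
\qquad
Z_h = \int \exp\!\left(-\frac{L(\theta, h)}{\sigma}\right) d\theta

% Effective fitness seen by the slow scale
F(h) = \int f(\theta, h)\,\rho_h(\theta)\,d\theta

% Slow scale: a replicator--mutator equation for the hyperparameter density g(h, t)
\partial_t g(h, t) = \Big(F(h) - \int F(h')\,g(h', t)\,dh'\Big)\,g(h, t) + \mu\,\Delta_h g(h, t)
```

The first bracket in the last equation is the replicator term: hyperparameters whose effective fitness exceeds the population average gain mass, while the Laplacian spreads mass to nearby values and maintains diversity.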

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Authors: Samuel Taiwo, Mohd Amaluddin Yusoff

Venue: Computer Science & Information Technology (CS & IT), pp. 49-67, 2026

First: 2026-03-25T17:35:24+00:00 · Latest: 2026-03-25T17:35:24+00:00

Comments: Presented at CCSEIT 2026. This version matches the published proceedings

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.

中文标题/摘要

标题：评估石油和天然气企业文档中检索增强生成的分块策略

检索增强生成（RAG）已成为解决大型语言模型（LLMs）限制的框架。然而，其有效性在很大程度上取决于文档分块，这是一个经常被忽视的质量决定因素。本文通过实证研究量化了四种分块策略之间的性能差异：固定大小滑动窗口、递归、基于断点的语义分块和结构感知分块。我们使用了石油和天然气企业文档的专有语料库，包括文本密集的手册、表格密集的规范以及管道和仪表图（P&ID），评估了这些方法。我们的研究结果表明，结构感知分块在总体检索效果上更高，特别是在top-K指标上，并且与语义或基线策略相比，计算成本显著降低。至关重要的是，所有四种方法在P&ID上都表现出有限的效果，突显了纯文本RAG在视觉和空间编码文档中的核心局限性。我们得出结论，虽然明确的结构保留对于专门领域是必不可少的，但未来的工作必须结合多模态模型以克服当前的局限性。

Summary / 总结

This paper evaluates four chunking strategies for Retrieval-Augmented Generation (RAG) in oil and gas enterprise documents: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. Using a proprietary corpus, the study finds that structure-aware chunking outperforms the others in retrieval effectiveness, especially in top-K metrics, while being more computationally efficient. However, all methods show limited effectiveness on piping and instrumentation diagrams (P&IDs), highlighting the need for multimodal models in such specialized domains.

本文评估了四种用于石油和天然气企业文档的检索增强生成（RAG）的分块策略，包括固定大小滑动窗口、递归、基于断点的语义分块和结构感知分块。使用专有语料库进行的研究发现，结构感知分块在检索有效性方面表现更好，尤其是在top-K指标上，并且比语义或基线策略更具计算效率。然而，所有方法在管道和仪表图上的效果有限，这表明在视觉和空间编码文档中，未来的工作需要结合多模态模型来克服当前的限制。
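Of the four strategies compared, the fixed-size sliding window is the simplest to state precisely. The sketch below shows the general technique on a token list; the function name and the `size`/`overlap` parameters are illustrative assumptions, and the paper's actual chunk sizes and tokenizer are not given in the abstract.

```python
def sliding_window_chunks(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = "a b c d e f g h i j".split()
chunks = sliding_window_chunks(tokens, size=4, overlap=1)
```

Because the window ignores headings, tables, and sentence boundaries, a chunk can cut a table row or a clause in half, which is the weakness that structure-aware chunking is meant to avoid.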

Let it Snow! Animating 3D Gaussian Scenes with Dynamic Weather Effects via Physics-Guided Score Distillation

Authors: Gal Fiebelman, Hadar Averbuch-Elor, Sagie Benaim

Venue: CVPR 2026

First: 2025-04-07T17:51:21+00:00 · Latest: 2026-03-25T17:33:38+00:00

Comments: Accepted to CVPR 2026. Project webpage: https://galfiebelman.github.io/let-it-snow/

Abs · PDF · Code1 · Code2 · Project1

Abstract

3D Gaussian Splatting has recently enabled fast and photorealistic reconstruction of static 3D scenes. However, dynamic editing of such scenes remains a significant challenge. We introduce a novel framework, Physics-Guided Score Distillation, to address a fundamental conflict: physics simulation provides a strong motion prior that is insufficient for photorealism, while video-based Score Distillation Sampling (SDS) alone cannot generate coherent motion for complex, multi-particle scenarios. We resolve this through a unified optimization framework where physics simulation guides Score Distillation to jointly refine the motion prior for photorealism while simultaneously optimizing appearance. Specifically, we learn a neural dynamics model that predicts particle motion and appearance, optimized end-to-end via a combined loss integrating Video-SDS for photorealism with our physics-guidance prior. This allows for photorealistic refinements while ensuring the dynamics remain plausible. Our framework enables scene-wide dynamic weather effects, including snowfall, rainfall, fog, and sandstorms, with physically plausible motion. Experiments demonstrate our physics-guided approach significantly outperforms baselines, with ablations confirming this joint refinement is essential for generating coherent, high-fidelity dynamics.

中文标题/摘要

标题：让雪飘！通过物理引导得分蒸馏动画3D高斯场景及其动态天气效果

3D高斯泼溅（3D Gaussian Splatting）近年来使静态3D场景的快速和逼真重建成为可能。然而，此类场景的动态编辑仍然是一个重大挑战。我们提出了一种新颖的框架“物理引导得分蒸馏”，以解决一个基本冲突：物理模拟提供了强大的运动先验，但不足以实现逼真效果，而基于视频的得分蒸馏采样（SDS）单独使用则无法为复杂的多粒子场景生成连贯的运动。我们通过一个统一的优化框架解决这一问题，其中物理模拟引导得分蒸馏共同细化运动先验以实现逼真效果，同时优化外观。具体而言，我们学习了一个神经动力学模型，该模型预测粒子的运动和外观，并通过结合视频-SDS和我们的物理引导先验的联合损失进行端到端优化。这使得在保持动态合理的同时实现逼真的细化。我们的框架能够实现场景范围内的动态天气效果，包括降雪、降雨、雾和沙尘暴，具有物理上合理的运动。实验表明，我们的物理引导方法显著优于基线，消融实验也证实了这种联合细化对于生成连贯的高保真动态是必不可少的。

Summary / 总结

The paper addresses the challenge of dynamic editing of static 3D scenes reconstructed using 3D Gaussian Splatting. It introduces a Physics-Guided Score Distillation framework that combines physics simulation and video-based Score Distillation Sampling to achieve photorealistic and coherent motion. The framework learns a neural dynamics model to predict particle motion and appearance, optimizing both photorealism and physical plausibility. Experiments show that this approach significantly outperforms baselines in generating high-fidelity dynamic weather effects like snowfall and rainfall.

论文提出了一种名为Physics-Guided Score Distillation的框架，以解决3D高斯泼溅场景动态编辑的挑战。该框架结合了物理模拟和基于视频的Score Distillation Sampling，以同时优化运动和外观，在保证逼真性的同时保持物理上合理的动态。实验表明，该方法在生成连贯且高保真的动态天气效果（如降雪和降雨）方面优于基线方法。

The role of spatial context and multitask learning in the detection of organic and conventional farming systems based on Sentinel-2 time series

Authors: Jan Hemmerling, Marcel Schwieder, Philippe Rufin, Leon-Friedrich Thomas, Mirela Tulbure, Patrick Hostert, Stefan Erasmi

First: 2026-03-25T17:28:20+00:00 · Latest: 2026-03-25T17:28:20+00:00

Abs · PDF · Code1 · Code2

Abstract

Organic farming is a key element in achieving more sustainable agriculture. For a better understanding of the development and impact of organic farming, comprehensive, spatially explicit information is needed. This study presents an approach for the discrimination of organic and conventional farming systems using intra-annual Sentinel-2 time series. In addition, it examines two factors influencing this discrimination: the joint learning of crop type information in a concurrent task and the role of spatial context. A Vision Transformer model based on the Temporo-Spatial Vision Transformer (TSViT) architecture was used to construct a classification model for the two farming systems. The model was extended for simultaneous learning of the crop type, creating a multitask learning setting. By varying the patch size presented to the model, we tested the influence of spatial context on the classification accuracy of both tasks. We show that discrimination between organic and conventional farming systems using multispectral remote sensing data is feasible. However, classification performance varies substantially across crop types. For several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved. In contrast, other agricultural land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished, with F1 scores for the organic management class of 0.4 or lower. Joint learning of farming system and crop type provides only limited additional benefits over single-task learning. In contrast, incorporating wider spatial context improves the performance of both farming system and crop type classification. Overall, we demonstrate that a classification of agricultural farming systems is possible in a diverse agricultural region using multispectral remote sensing data.

中文标题/摘要

标题：基于Sentinel-2时序数据的有机与常规农业系统检测中空间上下文和多任务学习的作用

有机农业是实现更可持续农业的关键要素。为了更好地理解有机农业的发展及其影响，需要全面的空间显式信息。本研究提出了一种基于年内Sentinel-2时序数据区分有机和常规农业系统的方法。此外，还探讨了影响这种区分的两个因素：在并行任务中联合学习作物类型信息，以及空间上下文的作用。使用基于Temporo-Spatial Vision Transformer (TSViT) 架构的Vision Transformer模型构建了两种农业系统的分类模型。该模型被扩展为同时学习作物类型，形成了多任务学习设置。通过改变提供给模型的块大小，测试了空间上下文对两个任务分类精度的影响。结果显示，使用多光谱遥感数据区分有机和常规农业系统是可行的。然而，分类性能在不同作物类型之间差异很大。对于冬黑麦、冬小麦和冬燕麦等作物，可以实现0.8或更高的F1分数。相比之下，其他农业用地类型，如永久草地、果园、葡萄园和啤酒花，无法可靠区分，有机管理类别的F1分数为0.4或更低。与单任务学习相比，联合学习农业系统和作物类型仅带来有限的额外益处。相比之下，纳入更宽广的空间上下文可以提高农业系统和作物类型两项分类的性能。总体而言，我们证明了在多样化的农业地区使用多光谱遥感数据对农业系统进行分类是可能的。

Summary / 总结

This study aims to detect organic and conventional farming systems using intra-annual Sentinel-2 time series and a Temporo-Spatial Vision Transformer (TSViT) model. The research explores the impact of spatial context and multitask learning on classification accuracy. Results show that for several crops, such as winter rye, winter wheat, and winter oat, F1 scores of 0.8 or higher can be achieved, while other land use classes, such as permanent grassland, orchards, grapevines, and hops, cannot be reliably distinguished (F1 of 0.4 or lower for the organic class). Joint learning of farming system and crop type provides limited additional benefits over single-task learning, whereas incorporating wider spatial context improves the performance of both tasks.

该研究旨在利用Sentinel-2时序数据区分有机和常规农业系统。采用了一种基于时空视觉变换器（TSViT）架构的视觉变换器模型，并将其扩展为多任务学习设置，同时分类作物类型。通过改变输入模型的块大小，研究考察了空间上下文对分类准确率的影响。结果显示，虽然有机和常规农业系统的分类是可行的，但不同作物的表现差异显著。对于冬黑麦、冬小麦和冬燕麦等作物，F1分数达到0.8或更高，而对于永久草地、果园等用地类型，有机管理类别的F1分数为0.4或更低。多任务学习提供的额外好处有限，但纳入更广泛的空间上下文提高了作物类型和农业系统分类的性能。
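The per-class F1 scores quoted above combine precision and recall for one class, here the organic management class. The sketch below computes it from label lists; the labels and predictions are invented for illustration and do not come from the study.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented field-level labels for a single crop type.
y_true = ["organic", "organic", "conventional", "conventional", "organic"]
y_pred = ["organic", "conventional", "conventional", "organic", "organic"]
organic_f1 = f1_score(y_true, y_pred, positive="organic")
```

Reporting F1 for the organic class separately matters because organic fields are typically the minority, so overall accuracy can stay high even when that class is poorly detected.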

Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Authors: Yara Bahram, Mélodie Desbos, Mohammadhadi Shateri, Eric Granger

Venue: CVPR 2026

First: 2025-11-23T04:22:42+00:00 · Latest: 2026-03-25T17:25:17+00:00

Comments: Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher's domain. Thus, fast and high-quality generation for novel domains relies on two-stage pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and often degrade quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies DM distillation and adaptation. It couples two training signals: (i) a dual-domain distribution-matching distillation (DMD) objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We evaluate Uni-DAD on two comprehensive benchmarks for few-shot image generation (FSIG) and subject-driven personalization (SDP) using diffusion backbones. It delivers better or comparable quality to state-of-the-art (SoTA) adaptation methods even with less than 4 sampling steps, and often surpasses two-stage pipelines in quality and diversity. Code: https://github.com/yaramohamadi/uni-DAD.

中文标题/摘要

标题：Uni-DAD：统一扩散模型的蒸馏与适应一体化方法以实现少量步骤的少量样本图像生成

扩散模型（DMs）能够生成高质量的图像，但在适应新领域时采样仍然昂贵。蒸馏DMs速度快，但通常仍局限于其教师的领域。因此，快速且高质量的新领域生成依赖于两阶段管道：先适应后蒸馏或先蒸馏后适应。然而，这两种方法都增加了设计复杂性，并且常常会降低质量和多样性。我们引入了Uni-DAD，这是一种单阶段管道，统一了DM蒸馏和适应。它结合了两个训练信号：（i）一种双领域分布匹配蒸馏（DMD）目标，引导学生向源教师和目标教师的分布靠拢；（ii）一种多头生成对抗网络（GAN）损失，鼓励在多个特征尺度上提高目标的真实感。源领域蒸馏保留了多样化的源知识，而多头GAN稳定了训练并减少了过拟合，特别是在少量样本情况下。目标教师的加入促进了对结构上更遥远领域的适应。我们使用扩散模型作为骨干，在两个全面的少量样本图像生成（FSIG）和主题驱动个性化（SDP）基准上评估了Uni-DAD。即使少于4步采样，它也能提供更好的或可比的质量，并且在质量和多样性上经常超越两阶段管道。代码：https://github.com/yaramohamadi/uni-DAD.

Summary / 总结

Uni-DAD is a single-stage pipeline that unifies diffusion model distillation and adaptation. It combines a dual-domain distribution-matching distillation objective, which guides the student toward the distributions of a source teacher and a target teacher, with a multi-head generative adversarial network loss that stabilizes training and reduces overfitting. It achieves better or comparable quality to state-of-the-art adaptation methods with fewer than 4 sampling steps and often outperforms two-stage pipelines in quality and diversity on few-shot image generation and subject-driven personalization benchmarks.

Uni-DAD 是一个单阶段的管道，将扩散模型的蒸馏和适应统一起来，使用双域分布匹配蒸馏目标和多头生成对抗网络损失来引导学生模型向源域和目标域靠拢，同时稳定训练并减少过拟合。它在少于4步采样的情况下达到与最先进的方法相当或更好的质量，并且在少样本图像生成基准测试中，在质量和多样性方面经常优于两阶段管道。

CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition

Authors: Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

First: 2026-03-25T17:14:36+00:00 · Latest: 2026-03-25T17:14:36+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.

中文标题/摘要

标题：CliPPER：针对长时手术过程的视频-语言预训练框架以识别事件

视频-语言基础模型在广泛的任务中已被证明具有高度有效性。特别具有挑战性的领域是手术过程领域，其中标注数据稀缺，且往往需要精确的时间理解以完成复杂的下游任务。为应对这一挑战，我们引入了CliPPER（针对长时手术过程的视频-语言预训练框架以识别事件），一种新型的视频-语言预训练框架，该框架基于手术讲座视频进行训练。我们的方法旨在进行细粒度的时间视频-文本识别，并引入了多种新的预训练策略以提高长时手术视频中的多模态对齐。具体来说，我们提出了上下文视频-文本对比学习（VTC_CTX）和剪辑顺序预测（COP）预训练目标，两者都利用了时间上下文依赖性以增强局部视频理解。此外，我们引入了视频-文本匹配的循环一致性对齐，以在同一个手术视频内增强双向一致性并提高整体表示的一致性。此外，我们引入了更精细的对齐损失，帧-文本匹配（FTM），以提高视频帧与文本之间的对齐。因此，我们的模型在多个公开的手术基准测试中建立了新的最佳水平，包括零样本识别阶段、步骤、器械和三元组。源代码和预训练字幕可在https://github.com/CAMMA-public/CliPPER获取。

Summary / 总结

The research aims to address the challenge of precise temporal understanding in the intraoperative surgical procedure domain, where labeled data is scarce. CliPPER, a novel video-language pretraining framework, is introduced, which uses surgical lecture videos for training. The method includes Contextual Video-Text Contrastive Learning (VTC_CTX), Clip Order Prediction (COP), and Frame-Text Matching (FTM) to enhance multimodal alignment. The model achieves state-of-the-art performance in zero-shot recognition of phases, steps, instruments, and triplets in multiple public surgical benchmarks.

研究旨在解决手术过程中精确的时间理解挑战，该领域标注数据稀缺。作者引入了CliPPER，一种基于手术讲座视频的视频-语言预训练框架。该方法包括Contextual Video-Text Contrastive Learning (VTC_CTX)、Clip Order Prediction (COP)和Frame-Text Matching (FTM)等新颖的预训练策略，以增强多模态对齐。模型在多个公开的手术基准测试中实现了零样本识别阶段、步骤、器械和三元组的最新性能。

Navigating the Latent Space Dynamics of Neural Models

Authors: Marco Fumero, Luca Moschella, Emanuele Rodolà, Francesco Locatello

First: 2025-05-28T18:57:41+00:00 · Latest: 2026-03-25T17:13:51+00:00

Abs · PDF · Code1 · Code2

Abstract

Neural networks transform high-dimensional data into compact, structured representations, often modeled as elements of a lower-dimensional latent space. In this paper, we present an alternative interpretation of neural models as dynamical systems acting on the latent manifold. Specifically, we show that autoencoder models implicitly define a latent vector field on the manifold, derived by iteratively applying the encoding-decoding map, without any additional training. We observe that standard training procedures introduce inductive biases that lead to the emergence of attractor points within this vector field. Drawing on this insight, we propose to leverage the vector field as a representation for the network, providing a novel tool to analyze the properties of the model and the data. This representation makes it possible to: (i) analyze the generalization and memorization regimes of neural models, even throughout training; (ii) extract prior knowledge encoded in the network's parameters from the attractors, without requiring any input data; (iii) identify out-of-distribution samples from their trajectories in the vector field. We further validate our approach on vision foundation models, showcasing the applicability and effectiveness of our method in real-world scenarios.

中文标题/摘要

标题：神经模型潜在空间动力学的导航

神经网络将高维数据转换为紧凑的结构化表示，通常建模为低维潜在空间的元素。在本文中，我们提出了一种替代解释，即将神经模型视为作用于潜在流形的动力系统。具体而言，我们表明自编码模型通过迭代应用编码-解码映射隐式定义了流形上的潜在向量场，而无需任何额外训练。我们观察到，标准训练过程引入了归纳偏置，导致该向量场中出现吸引子。基于这一见解，我们提出利用向量场作为网络的表示，提供了一种分析模型和数据属性的新工具。这种表示使我们能够：(i) 分析神经模型的泛化与记忆状态，包括在整个训练过程中；(ii) 无需任何输入数据即可从吸引子中提取网络参数中编码的先验知识；(iii) 通过样本在向量场中的轨迹识别分布外样本。我们进一步在视觉基础模型上验证了我们的方法，展示了我们方法在实际场景中的适用性和有效性。

Summary / 总结

This paper interprets neural networks as dynamical systems on a latent manifold, showing that autoencoders implicitly define a latent vector field through iterative encoding-decoding, without additional training. The study shows that inductive biases introduced by standard training give rise to attractor points in this field, and proposes using the vector field to analyze generalization and memorization, extract prior knowledge from the attractors without input data, and identify out-of-distribution samples from their trajectories. Experiments on vision foundation models demonstrate the method's effectiveness in real-world scenarios.

本文将神经模型视为作用于潜在流形的动力系统，表明自编码器通过迭代编码-解码映射隐式定义了潜在流形上的向量场，并揭示了由标准训练引入的归纳偏置所产生的吸引子。研究提出使用该向量场来分析模型属性、提取先验知识以及识别分布外样本。在视觉基础模型上的实验表明了该方法在实际场景中的有效性和适用性。
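
The latent vector field described in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: the field is taken here as the residual of one round trip through the encoding-decoding map, and iterating that map follows a trajectory until it settles on an attractor. `toy_step` is a hypothetical stand-in (a small contractive tanh map) for a trained autoencoder's round trip, and the function names are illustrative only.

```python
import numpy as np

def latent_vector_field(step, z):
    """Residual of one round trip: V(z) = step(z) - z."""
    z = np.asarray(z, dtype=float)
    return step(z) - z

def find_attractor(step, z0, tol=1e-10, max_iter=10_000):
    """Iterate the map from z0 until the update is smaller than tol."""
    z = np.asarray(z0, dtype=float)
    for n in range(max_iter):
        z_next = step(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, n + 1
        z = z_next
    return z, max_iter

# Hypothetical stand-in for a trained autoencoder's round trip.
# The weights are chosen so the map is a contraction with one fixed point.
W = np.array([[0.5, 0.2], [-0.1, 0.4]])
b = np.array([0.3, -0.2])

def toy_step(z):
    return np.tanh(W @ z + b)

attractor, n_iter = find_attractor(toy_step, z0=[2.0, -3.0])
print("attractor:", attractor, "reached after", n_iter, "iterations")
print("field norm at attractor:", np.linalg.norm(latent_vector_field(toy_step, attractor)))
```

With a real model, `step` would wrap the trained encoder and decoder, and the number of iterations a sample needs to reach an attractor is one quantity that could be inspected when looking for out-of-distribution inputs.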

UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience

Authors: Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang

First: 2026-03-25T17:10:29+00:00 · Latest: 2026-03-25T17:10:29+00:00

Comments: Code and models are available at https://github.com/ui-voyager/UI-Voyager

Abs · PDF · Code1 · Code2 · Code3

Abstract

Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.

中文标题/摘要

标题：UI-Voyager：一种通过失败经验自我进化的GUI代理

随着多模态大型语言模型（MLLM）的发展，自主移动GUI代理引起了越来越多的关注。然而，现有的方法仍然在从失败轨迹中学习和稀疏奖励下长时间GUI任务的模糊信用分配方面效率低下。为此，我们提出了一种新颖的两阶段自我进化的移动GUI代理UI-Voyager。在第一阶段，我们使用拒绝微调（RFT），这使得数据和模型在完全自主的循环中持续共同进化。第二阶段引入了组相对自我蒸馏（GRSD），它识别出组展开中的关键分叉点，并从成功的轨迹中构建密集的步骤级监督，以纠正失败的轨迹。在AndroidWorld上的广泛实验表明，我们的4B模型实现了81.0%的Pass@1成功率，优于众多最近的基线，并超过了人类水平的性能。消融和案例研究进一步验证了GRSD的有效性。我们的方法代表了向高效、自我进化的高性能移动GUI自动化迈出的重要一步，无需昂贵的手动数据注释。

Summary / 总结

UI-Voyager is a self-evolving mobile GUI agent that addresses the inefficiency of learning from failed trajectories and ambiguous credit assignment. It uses a two-stage approach: the first stage employs Rejection Fine-Tuning for continuous co-evolution of data and models, and the second stage uses Group Relative Self-Distillation to identify critical points and construct dense supervision from successful trajectories to correct failures. The 4B model achieves an 81.0% Pass@1 success rate on AndroidWorld, surpassing recent baselines and human-level performance.

UI-Voyager 是一种自我进化的移动GUI代理，旨在解决从失败轨迹学习效率低下和信用分配模糊的问题。它采用两阶段方法：第一阶段使用拒绝微调实现数据和模型的连续共进化，第二阶段使用组相对自我蒸馏来识别关键点并从成功的轨迹中构建密集的监督以纠正失败。4B模型在AndroidWorld上的Pass@1成功率达到了81.0%，超越了最近的基线和人类水平的性能。

Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

Authors: Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer

First: 2026-03-25T17:04:43+00:00 · Latest: 2026-03-25T17:04:43+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.

中文标题/摘要

标题：跨模态原型对齐与混合以实现无需训练的少样本分类

视觉-语言模型（VLMs）如CLIP是通过使文本和图像配对对齐来训练的。为了提高基于CLIP的少样本图像分类性能，最近的研究发现，除了文本嵌入外，训练集中的图像嵌入也是重要信息来源之一。在本文中，我们研究了直接将图像和文本原型混合用于少样本分类的影响，并从偏差-方差角度进行了分析。我们展示了混合原型类似于收缩估计器。尽管混合原型可以提高分类性能，但图像原型仍然以实例特定的背景或上下文信息的形式添加了一些噪声。为了仅捕获与给定分类任务相关的图像空间信息，我们提出将图像原型投影到语义文本嵌入空间的主要方向上，以获得一个文本对齐的语义图像子空间。这些文本对齐的图像原型，当与文本嵌入混合时，进一步提高了分类性能。然而，对于CLIP中跨模态对齐较差的下游数据集，语义对齐可能不是最优的。我们展示了可以通过建模类协方差来利用图像子空间的各向异性。我们证明了结合一个文本对齐的混合原型分类器和一个图像特定的LDA分类器在少样本分类基准上优于现有方法。

Summary / 总结

This work investigates the impact of directly mixing image and text prototypes for few-shot classification using vision-language models like CLIP. The authors propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace, which further improves classification performance. The method outperforms existing approaches across various few-shot classification benchmarks.

这项工作研究了在基于CLIP等视觉-语言模型的少样本分类中直接混合图像原型和文本原型的影响。作者提出将图像原型投影到语义文本嵌入空间的主方向上，以获得文本对齐的语义图像子空间，这进一步提高了分类性能。该方法在多个少样本分类基准上优于现有方法。
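
A rough sketch of the text-aligned prototype mixing described in this entry, under stated assumptions rather than as the paper's method: the principal directions are taken from an SVD of the centered text embeddings, and the mixing weight `alpha`, the number of directions `k`, and all function names are hypothetical. The random arrays stand in for CLIP embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_aligned_mixed_prototypes(text_emb, image_protos, alpha=0.5, k=None):
    """Mix text prototypes with image prototypes projected onto the
    principal directions of the text embedding space.

    text_emb, image_protos: arrays of shape (num_classes, dim).
    alpha: weight on the text prototype (assumed hyperparameter).
    k: number of principal directions kept (None keeps all).
    """
    text_emb = l2_normalize(np.asarray(text_emb, dtype=float))
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions of the text embeddings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    projected = np.asarray(image_protos, dtype=float) @ basis.T @ basis
    mixed = alpha * text_emb + (1.0 - alpha) * l2_normalize(projected)
    return l2_normalize(mixed)

def classify(queries, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    return np.argmax(l2_normalize(queries) @ prototypes.T, axis=1)

# Toy data standing in for CLIP embeddings: 5 classes, 16 dimensions.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))
image_protos = text_emb + 0.5 * rng.normal(size=(5, 16))
protos = text_aligned_mixed_prototypes(text_emb, image_protos, alpha=0.6, k=4)
print(protos.shape)
```

Setting `alpha=1.0` recovers a text-only classifier. Smaller values pull the prototype toward the few-shot image mean, which is the sense in which mixing acts like a shrinkage estimator.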

DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation

Authors: Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt, Yinan Yu

First: 2026-03-22T22:39:32+00:00 · Latest: 2026-03-25T16:56:29+00:00

Comments: Accepted to AAMAS 2026 EA

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) have shown impressive capabilities in code generation. However, because most LLMs are trained on public domain corpora, directly applying them to real-world software development often yields low success rates, as these scenarios frequently require domain-specific knowledge. In particular, domain-specific tasks usually demand highly specialized solutions, which are often underrepresented or entirely absent in the training data of generic LLMs. To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain-adapted code through structured reasoning and targeted retrieval. A core component of DomAgent is DomRetriever, a novel retrieval module that emulates how humans learn domain-specific knowledge, by combining conceptual understanding with experiential examples. It dynamically integrates top-down knowledge-graph reasoning with bottom-up case-based reasoning, enabling iterative retrieval and synthesis of structured knowledge and representative cases to ensure contextual relevance and broad task coverage. DomRetriever can operate as part of DomAgent or independently with any LLM for flexible domain adaptation. We evaluate DomAgent on an open benchmark dataset in the data science domain (DS-1000) and further apply it to real-world truck software development tasks. Experimental results show that DomAgent significantly enhances domain-specific code generation, enabling small open-source models to close much of the performance gap with large proprietary LLMs in complex, real-world applications. The code is available at: https://github.com/Wangshuaiia/DomAgent.

中文标题/摘要

标题：DomAgent：利用知识图谱和案例推理进行领域特定代码生成

大型语言模型（LLMs）在代码生成方面展现了令人印象深刻的性能。然而，由于大多数LLMs都是基于公共领域语料库进行训练的，直接将它们应用于实际软件开发中往往成功率较低，因为这些场景通常需要领域特定的知识。特别是，领域特定的任务通常需要高度专业化的解决方案，而这些解决方案在通用LLMs的训练数据中往往代表性不足或完全不存在。为了解决这一挑战，我们提出了一种名为DomAgent的自主编码代理，通过结构化推理和有针对性的检索来弥合这一差距，使LLMs能够生成领域适应的代码。DomAgent的核心组件是DomRetriever，这是一种新颖的检索模块，通过结合概念理解与经验示例来模拟人类学习领域特定知识的方式。DomRetriever动态地将自上而下的知识图谱推理与自下而上的案例推理相结合，实现迭代检索和合成结构化知识及代表性案例，以确保上下文相关性和广泛的任务覆盖范围。DomRetriever可以作为DomAgent的一部分运行，也可以独立与任何LLM配合使用，以实现灵活的领域适应。我们使用数据科学领域的开放基准数据集（DS-1000）评估了DomAgent，并将其应用于实际的卡车软件开发任务。实验结果表明，DomAgent显著提高了领域特定代码生成的能力，使小型开源模型在复杂的真实世界应用中能够大幅缩小与大型专有LLMs之间的性能差距。代码可在 https://github.com/Wangshuaiia/DomAgent 获取。

Summary / 总结

DomAgent leverages knowledge graphs and case-based reasoning to enhance domain-specific code generation, addressing the limitations of large language models in real-world applications. Its core retrieval module, DomRetriever, combines top-down knowledge-graph reasoning with bottom-up case-based reasoning to ensure contextual relevance and broad task coverage. Experimental results on DS-1000 and real-world truck software development tasks show that DomAgent significantly improves domain-specific code generation, enabling small open-source models to close much of the performance gap with large proprietary LLMs in complex, real-world applications.

DomAgent 通过结合知识图谱推理和案例推理来增强领域特定的代码生成。它引入了DomRetriever，该模块将自上而下的知识图谱推理与自下而上的案例推理相结合，以生成上下文相关且广泛适用的代码。实验结果表明，DomAgent 显著提高了小开源模型的性能，使其在复杂的实际应用中与大型专有LLM的性能差距大大缩小。

Toward Physically Consistent Driving Video World Models under Challenging Trajectories

Authors: Jiawei Zhou, Zhenxin Zhu, Lingyi Du, Linye Lyu, Lijun Zhou, Zhanqian Wu, Hongcheng Luo, Zhuotao Tian, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Yu Li

First: 2026-03-25T16:47:39+00:00 · Latest: 2026-03-25T16:47:39+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Video generation models have shown strong potential as world models for autonomous driving simulation. However, existing approaches are primarily trained on real-world driving datasets, which mostly contain natural and safe driving scenarios. As a result, current models often fail when conditioned on challenging or counterfactual trajectories (such as imperfect trajectories generated by simulators or planning systems), producing videos with severe physical inconsistencies and artifacts. To address this limitation, we propose PhyGenesis, a world model designed to generate driving videos with high visual fidelity and strong physical consistency. Our framework consists of two key components: (1) a physical condition generator that transforms potentially invalid trajectory inputs into physically plausible conditions, and (2) a physics-enhanced video generator that produces high-fidelity multi-view driving videos under these conditions. To effectively train these components, we construct a large-scale, physics-rich heterogeneous dataset. Specifically, in addition to real-world driving videos, we generate diverse challenging driving scenarios using the CARLA simulator, from which we derive supervision signals that guide the model to learn physically grounded dynamics under extreme conditions. This challenging-trajectory learning strategy enables trajectory correction and promotes physically consistent video generation. Extensive experiments demonstrate that PhyGenesis consistently outperforms state-of-the-art methods, especially on challenging trajectories. Our project page is available at: https://wm-research.github.io/PhyGenesis/.

中文标题/摘要

标题：朝物理一致的驾驶视频世界模型方向发展，应对具有挑战性的轨迹

视频生成模型在自主驾驶模拟的世界模型方面显示出强大的潜力。然而，现有的方法主要是在包含自然和安全驾驶场景的真实世界驾驶数据集上进行训练。因此，当前的模型在处理具有挑战性或反事实轨迹（如模拟器或规划系统生成的不完美轨迹）时经常失败，生成的视频存在严重的物理不一致性和伪影。为了解决这一局限性，我们提出了PhyGenesis，这是一种设计用于生成具有高视觉保真度和强物理一致性的驾驶视频的世界模型。我们的框架包括两个关键组件：（1）一个物理条件生成器，将潜在无效的轨迹输入转换为物理上可接受的条件；（2）一个增强物理的视频生成器，在这些条件下生成高质量的多视角驾驶视频。为了有效训练这些组件，我们构建了一个大规模的、富含物理信息的异构数据集。具体来说，除了真实世界的驾驶视频，我们还使用CARLA模拟器生成了多种具有挑战性的驾驶场景，从中提取监督信号，指导模型在极端条件下学习物理上合理的动力学。这种具有挑战性轨迹的学习策略使轨迹校正成为可能，并促进了物理一致的视频生成。广泛的实验表明，PhyGenesis在具有挑战性的轨迹上始终优于最先进的方法。我们的项目页面可在以下网址访问：https://wm-research.github.io/PhyGenesis/。

Summary / 总结

The research aims to improve the physical consistency of driving video generation models for autonomous driving simulation. The method involves PhyGenesis, a world model with a physical condition generator and a physics-enhanced video generator. The model is trained on a large-scale dataset combining real-world driving videos and challenging scenarios generated by the CARLA simulator. The experiments show that PhyGenesis outperforms existing methods, particularly in generating physically consistent videos for challenging trajectories.

研究旨在提高自动驾驶模拟中驾驶视频生成模型的物理一致性。方法是使用PhyGenesis，该模型包含物理条件生成器和物理增强视频生成器。模型在结合真实驾驶视频和CARLA模拟器生成的挑战性场景的大规模数据集上进行训练。实验结果表明，PhyGenesis在生成挑战性轨迹的物理一致视频方面优于现有方法。

Towards Safe Learning-Based Non-Linear Model Predictive Control through Recurrent Neural Network Modeling

Authors: Mihaela-Larisa Clement, Mónika Farsang, Agnes Poks, Johannes Edelmann, Manfred Plöchl, Radu Grosu, Ezio Bartocci

First: 2026-03-25T16:43:11+00:00 · Latest: 2026-03-25T16:43:11+00:00

Abs · PDF · Code1 · Code2

Abstract

The practical deployment of nonlinear model predictive control (NMPC) is often limited by online computation: solving a nonlinear program at high control rates can be expensive on embedded hardware, especially when models are complex or horizons are long. Learning-based NMPC approximations shift this computation offline but typically demand large expert datasets and costly training. We propose Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences by sharing parameters across the prediction horizon. For deployment, we wrap the policy in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline across several benchmarks, Sequential-AMPC requires substantially fewer expert MPC rollouts and yields candidate sequences with higher feasibility rates and improved closed-loop safety. On high-dimensional systems, it also exhibits better learning dynamics and performance in fewer epochs while maintaining stable validation improvement where the feedforward baseline can stagnate.

中文标题/摘要

标题：通过循环神经网络建模实现安全的基于学习的非线性模型预测控制

非线性模型预测控制（NMPC）的实际部署往往受限于在线计算：在高控制率下求解非线性规划问题在嵌入式硬件上可能代价高昂，尤其是在模型复杂或预测期较长时。基于学习的NMPC近似将此计算移至离线，但通常需要大量专家数据集和昂贵的训练。我们提出了一种顺序AMPC策略，通过在预测期共享参数来生成MPC候选控制序列。在部署时，我们将策略包裹在一个增强安全性的在线评估和回退机制中，从而得到安全的顺序AMPC。与几个基准测试中的简单前馈策略基线相比，顺序AMPC需要的专家MPC滚动仿真次数显著减少，并且生成的候选序列具有更高的可行性率和改进的闭环安全性。在高维系统中，它还表现出更好的学习动力学和性能，在更少的迭代周期中保持稳定的验证改进，而前馈基线可能会停滞不前。

Summary / 总结

The research addresses the online computational cost of nonlinear model predictive control (NMPC) by proposing Sequential-AMPC, a sequential neural policy that generates MPC candidate control sequences while sharing parameters across the prediction horizon. For deployment, the policy is wrapped in a safety-augmented online evaluation and fallback mechanism, yielding Safe Sequential-AMPC. Compared to a naive feedforward policy baseline, Sequential-AMPC requires substantially fewer expert MPC rollouts and achieves higher feasibility rates and improved closed-loop safety. On high-dimensional systems it also shows better learning dynamics and performance in fewer epochs, with stable validation improvement where the feedforward baseline can stagnate.

论文提出了一种序贯AMPC方法，通过在预测时域内共享参数来减少在线计算，从而解决非线性模型预测控制（NMPC）的部署问题。该方法所需的专家MPC回放次数较少，并且在多个基准测试中实现了更高的可行率和改进的闭环安全性。在高维系统中，序贯AMPC展示了更好的学习动态和性能，并且在验证改进方面保持稳定，即使前馈基线停滞不前也是如此。

Project and Generate: Divergence-Free Neural Operators for Incompressible Flows

Authors: Xigui Li, Hongwei Zhang, Ruoxi Jiang, Deshu Chen, Chensen Lin, Limei Han, Yuan Qi, Xin Guo, Yuan Cheng

First: 2026-03-25T16:40:58+00:00 · Latest: 2026-03-25T16:40:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Learning-based models for fluid dynamics often operate in unconstrained function spaces, leading to physically inadmissible, unstable simulations. While penalty-based methods offer soft regularization, they provide no structural guarantees, resulting in spurious divergence and long-term collapse. In this work, we introduce a unified framework that enforces the incompressible continuity equation as a hard, intrinsic constraint for both deterministic and generative modeling. First, to project deterministic models onto the divergence-free subspace, we integrate a differentiable spectral Leray projection grounded in the Helmholtz-Hodge decomposition, which restricts the regression hypothesis space to physically admissible velocity fields. Second, to generate physically consistent distributions, we show that simply projecting model outputs is insufficient when the prior is incompatible. To address this, we construct a divergence-free Gaussian reference measure via a curl-based pushforward, ensuring the entire probability flow remains subspace-consistent by construction. Experiments on 2D Navier-Stokes equations demonstrate exact incompressibility up to discretization error and substantially improved stability and physical consistency.

中文标题/摘要

标题：投影与生成：面向不可压缩流的无散度神经算子

基于学习的流体动力学模型通常在无约束的函数空间中运行，导致物理上不可行且不稳定的模拟。虽然基于惩罚的方法提供软正则化，但它们没有结构保证，导致虚假散度和长期崩溃。在本文中，我们引入了一个统一框架，将不可压缩连续性方程作为硬性的内在约束应用于确定性建模和生成式建模。首先，为了将确定性模型投影到无散度子空间，我们整合了一个基于亥姆霍兹-霍奇分解的可微谱Leray投影，这将回归假设空间限制为物理上可行的速度场。其次，为了生成物理上一致的分布，我们证明了当先验不兼容时，仅对模型输出进行投影是不够的。为了解决这个问题，我们通过基于旋度的推前构造了一个无散度的高斯参考测度，确保整个概率流在构造上保持子空间一致性。在二维纳维-斯托克斯方程上的实验表明，该方法在离散误差范围内实现了精确的不可压缩性，并且显著提高了稳定性和物理一致性。

Summary / 总结

This work addresses the issue of physically inadmissible simulations in learning-based models for fluid dynamics by introducing a unified framework that enforces the incompressible continuity equation as a hard constraint. The method uses a differentiable spectral Leray projection to project deterministic models onto the divergence-free subspace and constructs a divergence-free Gaussian reference measure to ensure physically consistent distributions. Experiments show exact incompressibility up to discretization error and improved stability and physical consistency in simulations of 2D Navier-Stokes equations.

该研究通过引入一个框架来解决流体动力学习模型中物理不可行的模拟问题，该框架将不可压缩连续方程作为硬约束强制执行。它使用可微谱Leray投影将确定性模型投影到无散度子空间，并通过基于旋度的推前构造一个无散度的高斯参考测度，以确保整个概率流保持子空间一致性。实验结果显示了在离散误差范围内的精确无压缩性，并且显著提高了稳定性和物理一致性。
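
The spectral Leray projection mentioned in this entry can be sketched for a periodic 2D grid as follows. This is a generic NumPy illustration of the Helmholtz-Hodge idea, not the authors' differentiable implementation. It assumes a square 2π-periodic domain with arrays indexed as `[y, x]`, and it ignores the Nyquist-mode subtleties that arise for even grid sizes with non-band-limited input. All function names are illustrative.

```python
import numpy as np

def wavenumbers(shape):
    """Integer wavenumber grids (KX, KY) for a 2*pi-periodic domain."""
    ny, nx = shape
    kx = np.fft.fftfreq(nx, d=1.0 / nx)
    ky = np.fft.fftfreq(ny, d=1.0 / ny)
    return np.meshgrid(kx, ky)  # each of shape (ny, nx)

def leray_project(u, v):
    """Remove the gradient part of (u, v): u_hat <- u_hat - k (k . u_hat) / |k|^2."""
    KX, KY = wavenumbers(u.shape)
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0  # the mean mode has zero divergence; avoid dividing by zero
    u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
    k_dot_u = KX * u_hat + KY * v_hat
    u_hat = u_hat - KX * k_dot_u / k2
    v_hat = v_hat - KY * k_dot_u / k2
    return np.fft.ifft2(u_hat).real, np.fft.ifft2(v_hat).real

def spectral_divergence(u, v):
    """Divergence du/dx + dv/dy computed with spectral derivatives."""
    KX, KY = wavenumbers(u.shape)
    return np.fft.ifft2(1j * (KX * np.fft.fft2(u) + KY * np.fft.fft2(v))).real

# Test field: a divergence-free part plus the gradient of phi = cos(x) cos(y).
n = 32
x = 2.0 * np.pi * np.arange(n) / n
X, Y = np.meshgrid(x, x)
u_sol, v_sol = np.sin(Y), np.sin(X)            # divergence-free
u_grad = -np.sin(X) * np.cos(Y)                # d(phi)/dx
v_grad = -np.cos(X) * np.sin(Y)                # d(phi)/dy
pu, pv = leray_project(u_sol + u_grad, v_sol + v_grad)
print("max |div| before:", np.abs(spectral_divergence(u_sol + u_grad, v_sol + v_grad)).max())
print("max |div| after: ", np.abs(spectral_divergence(pu, pv)).max())
```

Because the projection is linear and built from FFTs, the same operation can in principle be placed as the final layer of a network so that every prediction lies in the divergence-free subspace.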

Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Authors: Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang

Venue: CVPR 2026

First: 2025-11-27T15:13:32+00:00 · Latest: 2026-03-25T16:36:15+00:00

Comments: Accepted by CVPR 2026; Project page: https://fast3dcache-agi.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.83% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).

中文标题/摘要

标题：Fast3Dcache：无需训练的3D几何合成加速

扩散模型在2D图像、视频和3D形状等模态上取得了令人印象深刻的生成质量，但由于迭代去噪过程，其推理的计算成本仍然很高。虽然最近的基于缓存的方法有效地重用了冗余计算以加速2D和视频生成，但直接将这些技术应用于3D扩散模型会严重破坏几何一致性。在3D合成中，即使缓存的隐特征中存在微小的数值误差，也会累积导致结构伪影和拓扑不一致。为克服这一限制，我们提出了一种无需训练的几何感知缓存框架Fast3Dcache，该框架在保持几何保真度的同时加速3D扩散推理。我们的方法引入了预测缓存调度约束（PCSC），根据体素稳定模式动态确定缓存配额，并引入了时空稳定性准则（SSC），基于速度大小和加速度准则选择稳定的特征进行重用。全面的实验表明，Fast3Dcache显著加速了推理，实现了高达27.12%的加速和54.83%的FLOPs减少，而几何质量的退化极小：以Chamfer距离衡量为2.48%，以F-Score衡量为1.95%。

Summary / 总结

Fast3Dcache is a training-free framework that accelerates 3D diffusion model inference by using a geometry-aware caching mechanism. It introduces a Predictive Caching Scheduler Constraint (PCSC) to set cache quotas dynamically according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse. Experiments demonstrate that Fast3Dcache can achieve up to a 27.12% speed-up and a 54.83% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).

Fast3Dcache 是一个无需训练的几何感知缓存框架，旨在加速 3D 扩散推理同时保持几何保真度。它引入了预测缓存调度约束 (PCSC) 来动态确定缓存配额，并使用时空稳定性准则 (SSC) 选择稳定的特征进行重用。实验表明，它可以实现高达 27.12% 的加速和 54.83% 的 FLOPs 减少，同时几何质量的退化很小（以 Chamfer 距离衡量为 2.48%，以 F-Score 衡量为 1.95%）。

Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models

Authors: Siqi Liu, Xinyang Li, Bochao Zou, Junbao Zhuo, Huimin Ma, Jiansheng Chen

Venue: CVPR 2026

First: 2026-03-25T16:24:50+00:00 · Latest: 2026-03-25T16:24:50+00:00

Comments: 20 pages, 7 figures, accepted at CVPR 2026, project page: see https://founce.github.io/VisionToM

Abs · PDF · Code1 · Code2 · Project1

Abstract

As large language models (LLMs) continue to advance, there is increasing interest in their ability to infer human mental states and demonstrate a human-like Theory of Mind (ToM). Most existing ToM evaluations, however, are centered on text-based inputs, while scenarios relying solely on visual information receive far less attention. This leaves a gap, since real-world human-AI interaction typically requires multimodal understanding. In addition, many current methods regard the model as a black box and rarely probe how its internal attention behaves in multiple-choice question answering (QA). The impact of LLM hallucinations on such tasks is also underexplored from an interpretability perspective. To address these issues, we introduce VisionToM, a vision-oriented intervention framework designed to strengthen task-aware reasoning. The core idea is to compute intervention vectors that align visual representations with the correct semantic targets, thereby steering the model's attention through different layers of visual features. This guidance reduces the model's reliance on spurious linguistic priors, leading to more reliable multimodal language model (MLLM) outputs and better QA performance. Experiments on the EgoToM benchmark (an egocentric, real-world video dataset for ToM with three multiple-choice QA settings) demonstrate that our method substantially improves the ToM abilities of MLLMs. Furthermore, results on an additional open-ended generation task show that VisionToM enables MLLMs to produce free-form explanations that more accurately capture agents' mental states, pushing machine-human collaboration toward greater alignment.

中文标题/摘要

标题：仅视频ToM：增强多模态大型语言模型的心智理论能力

随着大型语言模型（LLMs）的不断进步，人们对其推断人类心理状态和展示类似人类的心智理论（ToM）的能力越来越感兴趣。然而，现有的大多数ToM评估主要集中在基于文本的输入上，而仅依赖视觉信息的场景则受到较少的关注。这留下了一个缺口，因为现实世界中的人机交互通常需要多模态理解。此外，许多当前的方法将模型视为黑盒，很少探究其内部注意力在多项选择问答（QA）中的行为。从可解释性的角度来看，LLM幻觉对这些任务的影响也尚未得到充分探索。为了解决这些问题，我们引入了VisionToM，这是一种视觉导向的干预框架，旨在增强任务感知的推理能力。核心思想是计算干预向量，使视觉表示与正确的语义目标对齐，从而引导模型的注意力通过不同的视觉特征层。这种引导减少了模型对虚假语言先验的依赖，从而产生了更可靠的多模态语言模型（MLLM）输出，并提高了问答性能。在EgoToM基准（一个以自我为中心的真实世界ToM视频数据集，包含三个多项选择问答设置）上进行的实验表明，我们的方法显著提高了MLLM的ToM能力。此外，在额外的开放式生成任务上，结果表明VisionToM使MLLM能够生成更准确捕捉智能体心理状态的自由形式解释，推动了机器-人类合作的进一步对齐。

Summary / 总结

The research aims to enhance the Theory of Mind (ToM) capabilities of multimodal large language models (MLLMs) by addressing the limitations of existing text-based ToM evaluations and the lack of attention to visual-only scenarios. The method involves a VisionToM framework that computes intervention vectors to align visual representations with correct semantic targets, guiding the model's attention and reducing reliance on spurious linguistic priors. Experiments on the EgoToM benchmark show significant improvements in MLLM ToM abilities, and additional open-ended generation tasks demonstrate the model's ability to produce more accurate explanations of agents' mental states.

研究旨在通过专注于视频输入来增强多模态大型语言模型（MLLM）的心智理论（ToM）能力。方法是采用VisionToM框架，通过计算干预向量使视觉表示与正确的语义目标对齐，引导模型的注意力并减少对虚假语言先验的依赖。实验表明，VisionToM显著提升了MLLM在EgoToM基准上的ToM能力，且在额外的开放式生成任务中，MLLM能够产生更准确地捕捉智能体心理状态的自由形式解释，推动了机器与人类合作的更紧密对齐。

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

Authors: John Ray B. Martinez

First: 2026-03-25T16:22:53+00:00 · Latest: 2026-03-25T16:22:53+00:00

Comments: 17 pages, 6 figures. Preprint under review

Abs · PDF · Code1 · Code2

Abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.

中文标题/摘要

标题：多智能体一致性验证推理提高医学MCQA中的不确定性校准

校准不良的置信度分数是将AI部署到临床环境中的实际障碍。一个总是过于自信的模型无法为是否转交人工处理提供有用的信号。我们提出了一种多智能体框架，结合了领域特定的专科智能体与两阶段验证和S-分数加权融合，以提高医学多项选择题回答中的校准和区分度。四个专科智能体（呼吸科、心脏病学、神经病学、胃肠病学）使用Qwen2.5-7B-Instruct生成独立诊断。每个诊断都经过两阶段自我验证过程，衡量内部一致性并生成专科置信度分数（S-分数）。S-分数驱动加权融合策略，选择最终答案并校准报告的置信度。我们在四个实验设置中进行了评估，涵盖了MedQA-USMLE和MedMCQA的100题和250题高分歧子集。校准改进是主要发现，所有四个设置中ECE降低了49-74%，包括更难的MedMCQA基准，在绝对准确率受到知识密集型回忆需求限制时，这些收益仍然存在。在MedQA-250上，整个系统在59.2%的准确率下实现了ECE = 0.091（相对于单专科基线降低74.4%）和AUROC = 0.630（+0.056）。消融分析表明两阶段验证是主要的校准驱动因素，多智能体推理是主要的准确率驱动因素。这些结果表明基于一致性的验证在不同医学问题类型中产生了更可靠的不确定性估计，为安全关键的临床AI应用中的转交决策提供了实用的置信度信号。

Summary / 总结

The research addresses the issue of miscalibrated confidence scores in AI models used in clinical settings, presenting a multi-agent framework that combines specialist agents with a two-phase verification process and S-Score weighted fusion to improve calibration and discrimination in medical multiple-choice question answering. The system reduces Expected Calibration Error (ECE) by 49-74% across all four experimental settings, including the harder MedMCQA benchmark. On MedQA-250, the full system reaches ECE = 0.091, a 74.4% reduction relative to the single-specialist baseline.

研究针对临床应用中AI模型置信度分数校准不良的问题，提出了一种结合专科智能体、两阶段验证过程和S-Score加权融合的多智能体框架，以提高医学多项选择题回答中的校准和区分能力。在全部四个实验设置中，期望校准误差（ECE）降低了49-74%，在更难的MedMCQA基准上这一收益同样保持。在MedQA-250上，完整系统将ECE降低至0.091，相对于单专科基线降低了74.4%。
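
For readers unfamiliar with the metric reported in this entry, the following is a minimal sketch of Expected Calibration Error using the common equal-width binning scheme. The bin count and bin-edge convention are assumptions, since the entry does not state which variant the paper uses, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: predicted confidences in [0, 1].
    correct: 1 if the corresponding prediction was right, else 0.
    n_bins: number of equal-width bins (assumed, not taken from the paper).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

# An always-overconfident model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
print(round(overconfident, 3))
```

A model that reports 0.9 on every question but is right only half the time gets an ECE near 0.4. This is the kind of uninformative confidence signal the entry describes as useless for deferral.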

Composer 2 Technical Report

Authors: Cursor Research, Aaron Chan, Ahmed Shalaby, Alexander Wettig, Aman Sanger, Andrew Zhai, Anurag Ajay, Ashvin Nair, Charlie Snell, Chen Lu, Chen Shen, Emily Jia, Federico Cassano, Hanpeng Liu, Haoyu Chen, Henry Wildermuth, Jacob Jackson, Janet Li, Jediah Katz, Jiajun Yao, Joey Hejna, Josh Warner, Julius Vering, Kevin Frans, Lee Danilek, Less Wright, Lujing Cen, Luke Melas-Kyriazi, Michael Truell, Michiel de Jong, Naman Jain, Nate Schmidt, Nathan Wang, Niklas Muennighoff, Oleg Rybkin, Paul Loh, Phillip Kravtsov, Rishabh Yadav, Sahil Shah, Sam Kottler, Alexander M Rush, Shengtong Zhang, Shomil Jain, Sriram Sankar, Stefan Heule, Stuart H. Sul, Sualeh Asif, Victor Rong, Wanqi Zhu, William Lin, Yuchen Wu, Yuri Volkov, Yury Zemlyanskiy, Zack Holbrook, Zhiyuan Zhang

First: 2026-03-25T16:18:37+00:00 · Latest: 2026-03-25T16:18:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.

中文标题/摘要

标题：Composer 2 技术报告

Composer 2 是一种专门设计用于自主软件工程的模型。该模型展示了强大的长期规划和编码智能能力，同时保持了解决交互问题的高效性。模型的训练分为两个阶段：首先，继续预训练以提高模型的知识和潜在编码能力，然后通过更强的推理、准确的多步执行和长周期现实编码问题上的连贯性，进行大规模强化学习以提高端到端的编码性能。我们开发了基础设施，以支持在与部署模型相同的 Cursor 框架中进行训练，使用等效的工具和结构，并使用与实际问题高度匹配的环境。为了衡量模型在越来越困难的任务上的能力，我们引入了一个基准，该基准源自大型代码库中的实际软件工程问题，包括我们自己的代码库。Composer 2 是一个前沿级的编码模型，并展示了训练强大领域专用模型的过程。在我们的 CursorBench 评估中，该模型在准确性方面比之前的 Composer 模型有了重大改进（61.3）。在公开基准测试中，该模型在 Terminal-Bench 上得分为 61.7，在我们的框架中 SWE-bench Multilingual 得分为 73.7，与最先进的系统相当。

Summary / 总结

Composer 2 is a specialized model for agentic software engineering, trained in two phases: continued pretraining and large-scale reinforcement learning. It shows strong long-term planning and coding intelligence, achieving a significant improvement in accuracy (61.3) on CursorBench compared to previous models. On public benchmarks, it scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual, comparable to state-of-the-art systems.

Composer 2 是一种专门用于自主软件工程的模型，通过两个阶段训练：持续预训练和大规模强化学习。它在长期规划和多步执行方面表现出色。该模型在 CursorBench 上的准确率显著提高（61.3），并在公共基准测试如 Terminal-Bench（61.7）和 SWE-bench 多语言版本（73.7）中表现良好，与最先进的系统相当。

VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI

Authors: Daiqi Liu, Johannes Enk, Maureen Stone, Fangxu Xing, Tomás Arias-Vergara, Jerry L. Prince, Jana Hutter, Jonghye Woo, Andreas Maier, Paula Andrea Pérez-Toro

First: 2025-09-17T07:32:00+00:00 · Latest: 2026-03-25T16:10:44+00:00

Comments: Preprint submitted to MIDL short paper 2026

Abs · PDF · Code1 · Code2

Abstract

Accurate segmentation of articulatory structures in real-time MRI (rtMRI) remains challenging, as existing methods rely primarily on visual cues and overlook complementary information from synchronized speech signals. We propose VocSegMRI, a multimodal framework integrating video, audio, and phonological inputs via cross-attention fusion and a contrastive learning objective that improves cross-modal alignment and segmentation precision. Evaluated on USC-75 and further validated via zero-shot transfer on USC-TIMIT, VocSegMRI outperforms unimodal and multimodal baselines, with ablations confirming the contribution of each component.

中文标题/摘要

标题：VocSegMRI：实时MRI中的多模态学习精确声道分割

实时MRI（rtMRI）中发音结构的准确分割仍然具有挑战性，因为现有方法主要依赖视觉线索并忽视同步语音信号中的互补信息。我们提出VocSegMRI，这是一种通过交叉注意力融合和对比学习目标来整合视频、音频和音系输入的多模态框架，该框架提高了跨模态对齐和分割精度。在USC-75上进行评估，并通过在USC-TIMIT上的零样本迁移进一步验证，VocSegMRI优于单模态和多模态基线，消融实验确认了每个组件的贡献。

Summary / 总结

VocSegMRI is a multimodal framework that integrates video, audio, and phonological inputs to improve the accuracy of real-time MRI vocal tract segmentation. It uses cross-attention fusion and a contrastive learning objective to enhance cross-modal alignment. Experiments on USC-75 and USC-TIMIT datasets show that VocSegMRI outperforms both unimodal and multimodal baselines, with ablation studies confirming the effectiveness of each component.

VocSegMRI 是一个多模态框架，结合了视频、音频和音系输入以提高实时 MRI 声道分割的准确性。它使用交叉注意力融合和对比学习目标来增强跨模态对齐。在 USC-75 和 USC-TIMIT 数据集上的实验表明，VocSegMRI 在性能上优于单模态和多模态基线，且消融研究证实了每个组件的有效性。

OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Authors: Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong

First: 2026-03-25T16:08:18+00:00 · Latest: 2026-03-25T16:08:18+00:00

Comments: 32 pages, 22 figures. Project Page: https://omniweaving.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models. The codes and model will be made publicly available soon. Project Page: https://omniweaving.github.io.

中文标题/摘要

标题：OmniWeaving：朝着统一视频生成的自由形式组合与推理

虽然诸如Seedance-2.0之类的专有系统在全能视频生成方面取得了显著成功，但开源替代方案明显落后。大多数学术模型仍然高度碎片化，而现有的少数统一视频生成努力仍然难以在单一框架内无缝集成多种任务。为弥合这一差距，我们提出了OmniWeaving，这是一种具备强大多模态组合和推理指导能力的全能级视频生成模型。通过利用一个包含多种组合和推理增强场景的大规模预训练数据集，OmniWeaving 学习将交错的文本、多张图像和视频输入进行时间上的绑定，并作为智能代理推断复杂的用户意图以进行复杂的视频创作。此外，我们引入了IntelligentVBench，这是第一个全面基准，旨在严格评估高级智能统一视频生成。大量实验表明，OmniWeaving 在开源统一模型中达到了最先进的性能。代码和模型将很快公开。项目页面：https://omniweaving.github.io

Summary / 总结

OmniWeaving is designed to address the gap in unified video generation by integrating diverse tasks into a single framework. It uses a large-scale pretraining dataset to learn multimodal composition and reasoning, enabling it to temporally bind text, images, and video inputs and infer user intentions for complex video creation. Experiments show that OmniWeaving outperforms other open-source unified models in this domain. The codes and model will be publicly available soon.

OmniWeaving旨在通过将多种任务整合到一个框架中来解决统一视频生成的差距。它使用大规模预训练数据集来学习多模态组成和推理，使其能够将文本、图像和视频进行时间上的绑定。实验结果表明，OmniWeaving在其他开源统一模型中表现最佳。项目页面见https://omniweaving.github.io

Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning

Authors: Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang

Venue: CVPR 2026

First: 2026-03-23T15:03:47+00:00 · Latest: 2026-03-25T16:07:11+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.

中文标题/摘要

标题：通过多模态贝叶斯分布学习适应点云分析

多模态3D视觉-语言模型在多种3D任务上表现出强大的泛化能力，但在领域变化下其性能显著下降。这促使了对测试时自适应（TTA）的研究，使模型能够在测试时利用测试数据进行在线适应。现有TTA方法中，基于缓存的机制广泛采用，用于利用先前观察到的样本进行在线预测细化。然而，它们仅存储有限的历史信息，导致随着测试流的演变出现逐步的信息损失。此外，它们的预测logits是通过启发式方法融合的，导致适应不稳定。为解决这些局限性，我们提出了BayesMM，一种用于测试时点云分析的多模态贝叶斯分布学习框架。BayesMM将每个类别的文本先验和流式视觉特征建模为高斯分布：文本参数来自语义提示，而视觉参数则随着到达的样本在线更新。两种模态通过贝叶斯模型平均融合，根据后验证据自动调整它们的贡献，从而产生一个统一的预测，无需训练即可持续适应不断变化的测试时数据。在多个点云基准上的广泛实验表明，BayesMM在分布变化下保持了鲁棒性，平均提高了超过4%。

Summary / 总结

The research aims to improve the robustness of multimodal 3D vision-language models under domain shifts by proposing BayesMM, a framework for test-time point cloud analysis. BayesMM models textual and visual features as Gaussian distributions and fuses them via Bayesian model averaging, enabling continuous adaptation without training. Experiments show that BayesMM outperforms existing methods, achieving over 4% average improvement across multiple benchmarks.

论文提出了一种名为BayesMM的框架，用于点云分析的测试时分析，该框架利用了多模态贝叶斯分布学习。它将文本先验和流式视觉特征建模为高斯分布，并通过贝叶斯模型平均融合它们，以自适应地更新预测。实验表明，BayesMM 在分布变化下保持了鲁棒性，相比现有方法平均提高了超过 4%。
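
The fusion idea described in the BayesMM abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyBayesFusion`, the shared isotropic variance, the uniform class prior, the pseudo-label update, and the choice to treat the text prior as one pseudo-sample are all assumptions made here for illustration. Each modality scores a feature under per-class Gaussians, and the two class posteriors are averaged with weights proportional to each modality's evidence for that feature.

```python
import math

def log_gauss(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) at x."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)

def class_posterior(x, means, var):
    """Return (class posterior, log evidence) under a uniform class prior."""
    logs = [log_gauss(x, m, var) - math.log(len(means)) for m in means]
    top = max(logs)
    log_evidence = top + math.log(sum(math.exp(l - top) for l in logs))
    return [math.exp(l - log_evidence) for l in logs], log_evidence

class ToyBayesFusion:
    """Hypothetical two-modality fusion; an illustration, not BayesMM itself."""

    def __init__(self, text_means, var=1.0):
        self.text_means = [list(m) for m in text_means]    # fixed textual prior
        self.visual_means = [list(m) for m in text_means]  # starts at the prior
        self.counts = [1] * len(text_means)  # prior counts as one pseudo-sample
        self.var = var

    def predict(self, x):
        p_text, ev_text = class_posterior(x, self.text_means, self.var)
        p_vis, ev_vis = class_posterior(x, self.visual_means, self.var)
        # Model weights proportional to evidence (equal prior over modalities).
        top = max(ev_text, ev_vis)
        w_text = math.exp(ev_text - top)
        w_vis = math.exp(ev_vis - top)
        z = w_text + w_vis
        return [(w_text * a + w_vis * b) / z for a, b in zip(p_text, p_vis)]

    def update(self, x):
        """Pseudo-label x, then move that class's visual mean toward x."""
        probs = self.predict(x)
        c = max(range(len(probs)), key=probs.__getitem__)
        self.counts[c] += 1
        n = self.counts[c]
        self.visual_means[c] = [
            m + (xi - m) / n for m, xi in zip(self.visual_means[c], x)
        ]
        return c
```

Calling `update` on each arriving test feature adapts the visual means online without any gradient step, which is the training-free property the abstract emphasizes.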

Unleashing Vision-Language Semantics for Deepfake Video Detection

Authors: Jiawen Zhu, Yunqi Miao, Xueyi Zhang, Jiankang Deng, Guansong Pang

Venue: CVPR 2026

First: 2026-03-25T16:05:35+00:00 · Latest: 2026-03-25T16:05:35+00:00

Comments: 14 pages, 7 figures, accepted by CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent Deepfake Video Detection (DFD) studies have demonstrated that pre-trained Vision-Language Models (VLMs) such as CLIP exhibit strong generalization capabilities in detecting artifacts across different identities. However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength -- the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics to enhance the model's discriminability in deepfake detection. This work i) enhances the visual perception of the VLM through a ForgePerceiver, which acts as an independent learner to capture diverse, subtle forgery cues both granularly and holistically, while preserving the pretrained Vision-Language Alignment (VLA) knowledge, and ii) provides a complementary discriminative cue -- Identity-Aware VLA score, derived by coupling cross-modal semantics with the forgery cues learned by ForgePerceiver. Notably, the VLA score is augmented by an identity prior-informed text prompting to capture authenticity cues tailored to each identity, thereby enabling more discriminative cross-modal semantics. Comprehensive experiments on video DFD benchmarks, including classical face-swapping forgeries and recent full-face generation forgeries, demonstrate that our VLAForge substantially outperforms state-of-the-art methods at both frame and video levels. Code is available at https://github.com/mala-lab/VLAForge.

中文标题/摘要

标题：释放视觉-语言语义以增强深伪视频检测

近期的深伪视频检测（DFD）研究显示，预训练的视觉-语言模型（VLMs）如CLIP在检测不同身份的伪影方面表现出强大的泛化能力。然而，现有方法仅侧重于利用视觉特征，忽视了它们最显著的优势——嵌入在潜在空间中的丰富视觉-语言语义。我们提出了一种名为VLAForge的新颖DFD框架，旨在释放这种跨模态语义的潜力，以增强模型在深伪检测中的可区分性。这项工作i) 通过一个ForgePerceiver增强视觉感知，ForgePerceiver作为一个独立学习者，能够捕捉细微的伪造线索，既细致又全面，同时保留预训练的视觉-语言对齐（VLA）知识；ii) 提供了一个补充的可区分线索——身份感知VLA分数，通过将跨模态语义与ForgePerceiver学习到的伪造线索耦合而成。值得注意的是，VLA分数通过身份先验文本提示增强，以捕捉针对每个身份定制的真实线索，从而实现更强大的跨模态语义。在包括经典面部替换伪造和最近的全脸生成伪造在内的视频DFD基准测试中，我们的VLAForge在帧和视频级别上均显著优于现有最佳方法。代码可在https://github.com/mala-lab/VLAForge/获取。

Summary / 总结

This paper addresses the challenge of deepfake video detection by leveraging the rich vision-language semantics in pre-trained models like CLIP. It introduces VLAForge, a novel framework that enhances visual perception through a ForgePerceiver and provides an identity-aware VLA score. Experimental results show that VLAForge outperforms existing methods at both frame and video levels across different types of deepfakes, demonstrating the effectiveness of integrating cross-modal semantics for better detection accuracy.

本文提出了VLAForge，这是一种新颖的深伪视频检测框架，利用预训练的Vision-Language模型（VLM）的跨模态语义来增强辨别能力。该框架通过ForgePerceiver增强视觉感知，捕捉多样化的伪造线索，并引入了一个结合了身份特定文本提示的身份感知VLA分数。实验表明，VLAForge在帧和视频级别的检测中均优于现有方法。

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He

Venue: CVPR 2026

First: 2026-03-23T15:23:23+00:00 · Latest: 2026-03-25T16:02:02+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.

中文标题/摘要

标题：基于零空间投影的原理性引导以防御视觉语言模型的脱狱攻击

随着视觉语言模型（VLMs）在开放世界场景中的广泛应用，它们容易受到视觉脱狱攻击的诱导，生成有害内容，这严重威胁了模型的安全性和可信使用。最近的激活引导方法在推理过程中注入方向向量以诱导拒绝行为，并已显示出有效性。然而，一个引导向量可能会同时增强拒绝能力并导致过度拒绝，从而降低模型在良性输入上的性能。此外，由于缺乏理论可解释性，这些方法仍然存在有限的鲁棒性和有效性。为了更好地平衡安全性和实用性，我们提出了NullSteer，一种零空间投影激活防御框架。我们的方法通过线性变换在模型激活中构建拒绝方向：它在良性子空间内保持零扰动，同时动态诱导沿潜在有害方向的拒绝，从而理论上实现安全性增强而不损害模型的一般能力。广泛的实验表明，NullSteer在各种脱狱攻击下显著减少了有害输出（MiniGPT-4的平均ASR降低超过15%），同时在通用基准上保持与原始模型相当的性能。

Summary / 总结

The paper proposes NullSteer, a null-space projected activation defense framework for vision-language models to defend against visual jailbreak attacks. It constructs refusal directions within model activations to enhance safety without impairing the model's general capabilities. Experiments show NullSteer significantly reduces harmful outputs under various jailbreak attacks while maintaining comparable performance on general benchmarks.

论文针对视觉语言模型在视觉劫持攻击下生成有害内容的风险，提出了一种名为NullSteer的零空间投影激活防御框架，通过动态诱导对潜在有害方向的拒绝行为，同时保持对良性输入的性能。实验结果显示，NullSteer在各种劫持攻击下显著减少了有害输出（平均降低超过15％），同时在通用基准测试中保持了与原始模型相当的性能。
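
One way to read the "zero perturbation within the benign subspace" property from the NullSteer abstract is as a projection of the steering vector onto the orthogonal complement of that subspace. The sketch below shows only this generic linear-algebra step. It is an assumption about the mechanism, not the paper's construction, and the helper names (`orthonormal_basis`, `project_out`, `steer`) and the `strength` parameter are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(vector, basis):
    """Remove from `vector` every component lying in span(basis)."""
    w = list(vector)
    for b in basis:
        coeff = dot(w, b)
        w = [wi - coeff * bi for wi, bi in zip(w, b)]
    return w

def steer(activation, refusal_dir, benign_basis, strength=1.0):
    """Add the refusal direction after projecting it off the benign subspace."""
    safe_dir = project_out(refusal_dir, benign_basis)
    return [a + strength * s for a, s in zip(activation, safe_dir)]
```

Because the added vector is orthogonal to every benign basis vector, the activation's coordinates inside the benign subspace are unchanged by steering, whatever the strength.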

CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

Authors: Xiangru Jian, Shravan Nayak, Kevin Qinghong Lin, Aarash Feizi, Kaixin Li, Patrice Bechard, Spandana Gella, Sai Rajeswar

First: 2026-03-25T15:52:56+00:00 · Latest: 2026-03-25T15:52:56+00:00

Comments: Project Page: https://cua-suite.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows, yet progress toward general-purpose agents is bottlenecked by the scarcity of continuous, high-quality human demonstration videos. Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents. However, the largest existing open dataset, ScaleCUA, contains only 2 million screenshots, equating to less than 20 hours of video. To address this bottleneck, we introduce CUA-Suite, a large-scale ecosystem of expert video demonstrations and dense annotations for professional desktop computer-use agents. At its core is VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video. Unlike sparse datasets that capture only final click coordinates, these continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks. CUA-Suite further provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations. Preliminary evaluation reveals that current foundation action models struggle substantially with professional desktop applications (~60% task failure rate). Beyond evaluation, CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models. All data and models are publicly released.

中文标题/摘要

标题：CUA-Suite：大规模人工标注视频演示用于计算机使用代理

计算机使用代理（CUAs）在自动化复杂桌面工作流方面具有巨大潜力，但通用代理的发展受到高质量连续人工演示视频稀缺的限制。近期研究强调，连续视频而非稀疏截图是这些代理扩展的关键缺失要素。然而，现有最大的开放数据集ScaleCUA仅包含200万张截图，相当于不到20小时的视频。为解决这一瓶颈，我们引入了CUA-Suite，这是一个大规模的专家视频演示和密集注释生态系统，用于专业桌面计算机使用代理。其核心是VideoCUA，提供了约10,000个人工演示任务，覆盖87个不同应用程序，包括30 fps的连续屏幕录制、运动光标轨迹和多层推理注释，总计约55小时和600万帧专家视频。与仅捕捉最终点击坐标的稀疏数据集不同，这些连续视频流保留了人类交互的完整时间动态，形成了一种可以无损转换为现有代理框架所需格式的信息超集。CUA-Suite还提供了两个互补资源：UI-Vision，用于评估CUAs中定位和规划能力的严格基准，以及GroundCUA，一个包含56,000张标注截图和超过360万UI元素标注的大规模定位数据集。初步评估显示，当前的基础动作模型在专业桌面应用程序中表现不佳（任务失败率约60%）。除了评估，CUA-Suite丰富的多模态语料库还支持包括通用屏幕解析、连续空间控制、基于视频的奖励建模和视觉世界模型在内的新兴研究方向。所有数据和模型均已公开。

Summary / 总结

CUA-Suite addresses the scarcity of high-quality human demonstration videos for computer-use agents by introducing VideoCUA, which provides approximately 10,000 human-demonstrated tasks across 87 applications with continuous 30 fps screen recordings, cursor traces, and reasoning annotations, totaling approximately 55 hours and 6 million frames. Unlike sparse screenshot datasets, it preserves the full temporal dynamics of human interaction, which the authors argue is essential for scaling computer-use agents. CUA-Suite also includes UI-Vision, a benchmark for grounding and planning, and GroundCUA, a grounding dataset with 56K annotated screenshots. Preliminary evaluation shows that current foundation action models struggle with professional desktop applications (~60% task failure rate). All data and models are publicly released.

CUA-Suite通过引入VideoCUA解决了计算机使用代理中高质量人类演示视频稀缺的问题，VideoCUA提供了87个应用程序中的10,000个人类演示任务，包含连续的屏幕录制和注释，总计55小时和600万帧。该数据集通过捕捉人类交互的完整时间动态来支持更有效的代理开发。CUA-Suite还包含UI-Vision和GroundCUA，用于评估和改进CUAs中的定位和规划能力。初步评估显示，当前的基础动作模型在专业桌面应用程序中存在显著困难。该数据集支持屏幕解析、空间控制和视觉世界模型等新兴研究方向。所有数据和模型均公开发布。
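
The frame counts and durations quoted in the CUA-Suite abstract can be cross-checked with a one-line conversion. The sketch assumes that both figures are expressed at the 30 fps recording rate stated for VideoCUA. Applying that rate to ScaleCUA's screenshots is an assumption here, and `frames_to_hours` is a hypothetical helper name.

```python
def frames_to_hours(frames, fps=30.0):
    """Convert a frame count to hours of video at a fixed frame rate."""
    return frames / fps / 3600.0

# VideoCUA: about 6 million frames at 30 fps
videocua_hours = frames_to_hours(6_000_000)

# ScaleCUA: 2 million screenshots, treated as 30 fps frames (assumption)
scalecua_hours = frames_to_hours(2_000_000)

print(f"VideoCUA: {videocua_hours:.1f} h, ScaleCUA: {scalecua_hours:.1f} h")
```

Under that assumption the numbers are consistent with the abstract: roughly 55.6 hours for VideoCUA and roughly 18.5 hours, that is, under 20, for ScaleCUA.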

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Authors: Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang, Rui Ye, Yuzhi Zhang, Linfeng Zhang, Weinan E, Siheng Chen, Yanfeng Wang

First: 2026-01-15T13:52:04+00:00 · Latest: 2026-03-25T15:46:47+00:00

Comments: 25 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.

中文标题/摘要

标题：迈向超长周期自主科学：机器学习工程的认知积累

人工智能向自主科学的进展目前受到超长周期自主性的瓶颈限制，即在持续数天或数周的实验周期中维持战略连贯性和迭代修正的能力。虽然大型语言模型（LLMs）在短期推理方面表现出色，但在现实世界研究中高维度、延迟反馈的环境中，它们容易被执行细节压垮，无法将稀疏反馈整合为连贯的长期指导。在此，我们介绍了ML-Master 2.0，这是一种能够掌握超长周期机器学习工程（MLE）的自主代理，MLE是科学研究的一个代表性缩影。通过将上下文管理重新构想为认知积累的过程，我们的方法引入了层次认知缓存（HCC），这是一种受计算机系统启发的多级架构，能够随着时间的推移对经验进行结构化分化。通过动态提炼短暂执行轨迹为稳定知识和跨任务智慧，HCC使代理能够将即时执行与长期实验策略解耦，从而有效克服静态上下文窗口的扩展限制。在OpenAI的MLE-Bench下24小时预算的评估中，ML-Master 2.0实现了56.44%的最先进的奖牌率。我们的研究结果表明，超长周期自主性为能够自主探索超越人类先例复杂性的AI提供了一个可扩展的蓝图。

Summary / 总结

This paper addresses the challenge of ultra-long-horizon autonomy in artificial intelligence, focusing on the ability to maintain strategic coherence over extended experimental cycles. The authors introduce ML-Master 2.0, an autonomous agent that uses Hierarchical Cognitive Caching (HCC) to manage context and accumulate knowledge over time, enabling it to handle high-dimensional, delayed-feedback environments. Under 24-hour budgets on OpenAI's MLE-Bench, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%, demonstrating the potential for AI to achieve autonomous scientific discovery beyond human-precedent complexities.

本文解决了人工智能中长期自主性的挑战，即在长时间内保持战略一致性困难的问题。它引入了使用层次认知缓存（HCC）来管理上下文并随着时间积累知识的自主代理ML-Master 2.0，使其能够有效地执行机器学习工程任务。ML-Master 2.0 在OpenAI的MLE-Bench上实现了56.44%的最先进的奖牌率，展示了AI处理复杂、长期科学任务的自主性的潜力。
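
The abstract describes Hierarchical Cognitive Caching only at a high level, as tiers that distill transient execution traces into stable knowledge and cross-task wisdom. The toy sketch below shows what a three-tier memory of that general shape might look like. Everything in it is hypothetical: the `TieredMemory` class, the tier names, the overflow rule, and the caller-supplied `summarize` function (which in a real agent would presumably be an LLM call) are illustrative assumptions, not the ML-Master 2.0 design.

```python
from collections import deque

class TieredMemory:
    """Hypothetical three-tier memory; an illustration, not the paper's HCC."""

    def __init__(self, summarize, trace_capacity=4):
        self.summarize = summarize      # hypothetical distillation function
        self.trace = deque()            # tier 1: raw, transient execution trace
        self.trace_capacity = trace_capacity
        self.knowledge = []             # tier 2: distilled notes for this task
        self.wisdom = []                # tier 3: notes kept across tasks

    def record(self, event):
        """Append an event; distill the trace into one note when it overflows."""
        self.trace.append(event)
        if len(self.trace) > self.trace_capacity:
            self.knowledge.append(self.summarize(list(self.trace)))
            self.trace.clear()

    def finish_task(self):
        """Fold remaining trace and task notes into one cross-task note."""
        if self.trace:
            self.knowledge.append(self.summarize(list(self.trace)))
            self.trace.clear()
        if self.knowledge:
            self.wisdom.append(self.summarize(self.knowledge))
            self.knowledge = []

    def context(self):
        """What the agent would see: stable notes first, raw trace last."""
        return self.wisdom + self.knowledge + list(self.trace)
```

The point of the tiers is that the context returned to the agent stays bounded: raw events are replaced by progressively shorter summaries as they age.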
