arXiv Paper Digest

Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Authors: Hau-Shiang Shiu, Chin-Yang Lin, Zhixiang Wang, Chi-Wei Hsiao, Po-Fan Yu, Yu-Chih Chen, Yu-Lun Liu

First: 2025-12-29T18:59:57+00:00 · Latest: 2025-12-29T18:59:57+00:00

Comments: Project page: https://jamichss.github.io/stream-diffvsr-project-page/

Abstract

Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: https://jamichss.github.io/stream-diffvsr-project-page/

中文标题/摘要

标题：Stream-DiffVSR：基于自回归扩散的低延迟流式视频超分辨率

基于扩散的视频超分辨率（VSR）方法在感知质量上表现出色，但由于依赖未来帧和昂贵的多步去噪，它们在延迟敏感的设置中仍不实用。我们提出了Stream-DiffVSR，这是一种用于高效在线VSR的因果条件扩散框架。它仅基于过去帧操作，结合了四步精简去噪器以实现快速推理，以及一个自动回归时空引导（ARTG）模块，在潜空间去噪期间注入运动对齐的线索，并采用一个轻量级时空感知解码器，其中包含时空处理器模块（TPM），以增强细节和时间连贯性。Stream-DiffVSR 在 RTX4090 GPU 上处理 720p 帧仅需 0.328 秒，并显著优于先前的扩散基方法。与在线 SOTA TMP 相比，它在感知质量（LPIPS +0.095）上有所提升，同时将延迟降低了超过 130 倍。Stream-DiffVSR 达到了扩散基 VSR 最低的延迟记录，将初始延迟从超过 4600 秒降低到 0.328 秒，从而使其成为第一个适合低延迟在线部署的扩散 VSR 方法。项目页面：https://jamichss.github.io/stream-diffvsr-project-page/

Summary / 总结

Stream-DiffVSR is a causally conditioned diffusion framework for efficient online video super-resolution that operates strictly on past frames. It combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x, making it suitable for low-latency online deployment.

Stream-DiffVSR 是一种因果条件下的扩散框架，用于高效的在线视频超分辨率，解决了先前方法的延迟问题。它使用四步蒸馏去噪器进行快速推理，Auto-regressive Temporal Guidance 模块注入运动对齐的线索，并使用具有 Temporal Processor Module 的轻量级时空感知解码器。Stream-DiffVSR 在 RTX4090 GPU 上处理 720p 帧仅需 0.328 秒，显著优于先前方法在感知质量和延迟方面的表现，使其适用于低延迟的在线部署。

Training AI Co-Scientists Using Rubric Rewards

Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse

First: 2025-12-29T18:59:33+00:00 · Latest: 2025-12-29T18:59:33+00:00

Comments: 11 pages in the main paper, total 119 including sample outputs in the Appendix

Abstract

AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
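As a rough illustration of the rubric-based reward described in the abstract, the sketch below scores a plan by the fraction of rubric items a grader accepts. This is an assumption about the reward's shape, not the authors' implementation: `rubric_reward` and `keyword_grader` are hypothetical names, and the keyword matcher merely stands in for the frozen copy of the initial policy that acts as grader in the paper.

```python
from typing import Callable, List


def rubric_reward(plan: str, rubric: List[str],
                  grader: Callable[[str, str], bool]) -> float:
    """Return the fraction of rubric items the grader judges as satisfied."""
    if not rubric:
        return 0.0
    passed = sum(1 for item in rubric if grader(plan, item))
    return passed / len(rubric)


def keyword_grader(plan: str, item: str) -> bool:
    # Hypothetical stand-in for an LLM grader: an item passes
    # when its text appears verbatim (case-insensitive) in the plan.
    return item.lower() in plan.lower()


plan = "We evaluate on a held-out test set and run an ablation."
rubric = ["held-out test set", "ablation", "human study"]
reward = rubric_reward(plan, rubric, keyword_grader)
```

In a reinforcement learning loop this scalar would be the per-sample reward; the rubric gives the grader a narrower checking task than the generator's open-ended planning task.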

中文标题/摘要

标题：使用评分标准奖励训练AI合作者科学家

AI合作者科学家正在成为一种工具，帮助研究人员实现研究目标。这些AI合作者科学家的关键特征是能够根据目标和约束生成研究计划。该计划可用于头脑风暴，甚至在进一步完善后实施。然而，当前的语言模型在生成遵循所有约束和隐含要求的研究计划方面存在困难。在本研究中，我们探讨了如何利用大量现有的研究论文来训练能够生成更好研究计划的语言模型。我们通过自动从多个领域论文中提取研究目标和目标特定的评分标准，构建了一个可扩展且多样化的训练语料库。然后，我们通过自我评分的强化学习训练研究计划生成模型。在训练过程中，冻结的初始策略作为评分者，评分标准创建生成者-验证者差距，从而在无需外部人类监督的情况下实现改进。为了验证这种方法，我们在机器学习研究目标方面进行了为期225小时的人类专家研究，结果显示，对于70%的研究目标，专家更偏好我们微调的Qwen3-30B-A3B模型生成的研究计划，并且批准了84%的自动提取的目标特定评分标准。为了评估其通用性，我们还将该方法扩展到医学论文和arXiv预印本的研究目标，并使用前沿模型组成的陪审团进行评估。我们的微调带来了12-22%的相对改进，并且在跨领域泛化方面表现出显著效果，即使在执行反馈不可行的医学研究等问题设置中也证明了其有效性。这些发现共同证明了可扩展的自动化训练配方作为提高通用AI合作者科学家的一个步骤的潜力。

Summary / 总结

This study aims to improve the research plans generated by AI co-scientists by training on a scalable corpus built from existing research papers. The method automatically extracts research goals and goal-specific grading rubrics from these papers and trains models via reinforcement learning with self-grading, using a frozen copy of the initial policy as the grader. In a 225-hour study, human experts prefer plans from the finetuned Qwen3-30B-A3B model over the initial model for 70% of machine learning research goals and approve 84% of the extracted rubrics. The approach also generalizes to research goals from medical papers and new arXiv preprints, with 12-22% relative improvements as judged by a jury of frontier models.

该研究旨在通过使用研究论文和目标特定评分标准的可扩展语料库来增强AI合作者生成研究计划的能力。方法包括使用自我评分的强化学习，其中冻结的初始策略作为评分者。结果显示，微调后的Qwen3-30B-A3B模型在70%的机器学习研究目标中优于初始模型，并且获得了84%的评分标准的批准。该方法还在医学研究和新的arXiv预印本中展示了显著的跨领域泛化能力，实现了12-22%的相对改进，表明了可扩展的自动化训练方法在提升通用AI合作者方面的潜力。

Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

Authors: Shaocong Xu, Songlin Wei, Qizhe Wei, Zheng Geng, Hong Li, Licheng Shen, Qianpu Sun, Shu Han, Bin Ma, Bohan Li, Chongjie Ye, Yuhang Zheng, Nan Wang, Saining Zhang, Hao Zhao

First: 2025-12-29T18:59:24+00:00 · Latest: 2025-12-29T18:59:24+00:00

Comments: Project Page: https://daniellli.github.io/projects/DKT/; Code: https://github.com/Daniellli/DKT; Dataset: https://huggingface.co/datasets/Daniellesry/TransPhy3D

Abstract

Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.

中文标题/摘要

标题：扩散技术揭示透明性：重新利用视频扩散技术进行透明物体深度和法线估计

透明物体对感知系统来说一直非常难以处理：折射、反射和穿透破坏了立体视觉、ToF和纯判别单目深度背后的假设，导致空洞和时间上不稳定的估计。我们的关键观察是，现代视频扩散模型已经合成了令人信服的透明现象，表明它们已经内化了光学规则。我们构建了TransPhy3D，一个透明/反射场景的合成视频数据集：包含11000个序列，使用Blender/Cycles渲染。场景由一个精心挑选的类别丰富的静态资产库和形状丰富的程序化资产组成，这些资产与玻璃/塑料/金属材料配对。我们使用基于物理的光线追踪渲染RGB + 深度 + 法线，并使用OptiX降噪。从一个大型视频扩散模型开始，我们通过轻量级LoRA适配器学习了一个视频到视频的深度（和法线）翻译器。在训练过程中，我们在DiT主干中连接RGB和（有噪声的）深度潜变量，并在TransPhy3D和现有的帧级合成数据集上联合训练，从而为任意长度的输入视频生成时间上一致的预测。最终模型DKT在涉及透明性的实际和合成视频基准测试中实现了零样本最佳效果：ClearPose、DREDS（CatKnown/CatNovel）和TransPhy3D-Test。它在准确性和时间一致性方面优于强大的图像/视频基线，并且法线变体在ClearPose上设定了最佳的视频法线估计结果。紧凑的1.3B版本运行速度约为每帧0.17秒。集成到抓取堆栈中，DKT的深度提高了对透明、反射和漫反射表面的成功率，优于先前的估计器。这些结果共同支持一个更广泛的主张：“扩散技术了解透明性。”生成的视频先验可以高效且无标签地重新利用，以实现稳健的时间上一致的感知，用于具有挑战性的实际世界操作。

Summary / 总结

The paper addresses the challenge of estimating depth and normals for transparent objects, which are difficult for perception systems because refraction, reflection and transmission break the assumptions behind standard depth sensing. Observing that modern video diffusion models already synthesize convincing transparent phenomena, the authors build TransPhy3D, a synthetic video corpus of 11k sequences rendered with Blender/Cycles, and adapt a large video diffusion model into a video-to-video translator for depth and normals via lightweight LoRA adapters, co-training on TransPhy3D and existing frame-wise synthetic datasets. The resulting model, DKT, achieves zero-shot state-of-the-art results on ClearPose, DREDS and TransPhy3D-Test, improving accuracy and temporal consistency over strong image/video baselines. Integrated into a grasping stack, its depth also raises success rates across translucent, reflective and diffuse surfaces.

论文针对传统感知系统难以处理透明物体的深度和法线估计问题，这些问题由于折射和反射而变得复杂。它提出了一个合成视频数据集TransPhy3D，并训练了一个视频到视频的翻译器DKT，使用了大型视频扩散模型和轻量级LoRA适配器。DKT在真实和合成基准测试中取得了最先进的成果，并在各种表面类型的夹取堆栈中提高了成功率。

Eliciting Behaviors in Multi-Turn Conversations

Authors: Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert

First: 2025-12-29T18:57:10+00:00 · Latest: 2025-12-29T18:57:10+00:00

Abstract

Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.

中文标题/摘要

标题：多轮对话中行为诱引

在对话环境中从大型语言模型（LLMs）中识别特定且通常复杂的特定行为对于其评估至关重要。近期工作提出了新颖的技术来找到能够从目标模型中诱导特定行为的自然语言提示，但这些研究主要集中在单轮设置中。本文研究了行为诱引在多轮对话中的应用。我们首先提供了一个分析框架，将现有方法分为三类，基于其与目标模型的交互方式：仅使用先验知识的方法、使用离线交互的方法以及从在线交互中学习的方法。然后，我们引入了一种多轮在线方法的通用形式，统一了单轮和多轮诱引。我们评估了这三类方法在自动生成多轮测试案例方面的表现。我们通过分析查询预算，即与目标模型交互的次数，与成功率，即行为诱引输入的发现率之间的权衡，来研究这些方法的效率。我们发现，在三个任务中，仅需几千次查询，在线方法就能实现45/19/77%的平均成功率，而现有的多轮对话基准中的静态方法在这些任务中发现的失败案例很少甚至没有。我们的工作突显了行为诱引方法在多轮对话评估中的新应用，并强调了社区转向动态基准的必要性。

Summary / 总结

This work addresses the challenge of eliciting specific behaviors from large language models in multi-turn conversational settings. It introduces an analytical framework that categorizes existing methods into three families based on their interactions with the target model (prior knowledge only, offline interactions, and online interactions) and proposes a generalized multi-turn formulation of the online method. Evaluated on automatically generating multi-turn test cases, online methods reach average success rates of 45/19/77% across three tasks with just a few thousand queries, whereas static methods from existing multi-turn benchmarks find few or no failure cases, which argues for a move towards dynamic benchmarks.

该研究提出了一种分析框架，将现有方法分为三大类，基于其与目标模型的交互方式来解决在多轮对话中从大型语言模型（LLMs）中引出特定行为的挑战。研究引入了一种统一单轮和多轮引出的在线方法的多轮形式。实验结果显示，使用少量数千次查询，在线方法可以实现45/19/77%的成功率，远超现有基准中的静态方法，后者往往难以发现行为引出的输入。这项研究强调了在多轮对话评估中使用动态基准的重要性。

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Authors: Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng

First: 2025-10-09T17:58:07+00:00 · Latest: 2025-12-29T18:55:54+00:00

Abstract

Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
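The variance argument in the abstract can be illustrated on a toy integral. The sketch below is not GDPO itself: the integrand is a hypothetical stand-in for a per-timestep loss, and Gauss-Legendre quadrature is an assumed choice of deterministic integration scheme. It compares sampling both the time variable and the inner noise at random (double Monte Carlo) against fixing the time nodes deterministically and sampling only the inner noise.

```python
import numpy as np


def integrand(t, z):
    # Toy stand-in for a per-timestep loss: smooth in t, plus inner sampling noise z.
    return t ** 2 + 0.1 * z


def double_mc(rng, n):
    """Sample both the time t and the inner noise z at random."""
    t = rng.uniform(0.0, 1.0, size=n)
    z = rng.standard_normal(n)
    return float(integrand(t, z).mean())


def semi_deterministic(rng, n):
    """Fix t at Gauss-Legendre nodes on [0, 1]; only z stays random."""
    x, w = np.polynomial.legendre.leggauss(n)  # nodes and weights on [-1, 1]
    t, w = 0.5 * (x + 1.0), 0.5 * w            # rescale to [0, 1]
    z = rng.standard_normal(n)
    return float(np.sum(w * integrand(t, z)))


rng = np.random.default_rng(0)
budget = 4  # evaluations per estimate
mc = np.array([double_mc(rng, budget) for _ in range(2000)])
sd = np.array([semi_deterministic(rng, budget) for _ in range(2000)])
print("double MC variance:         ", mc.var())
print("semi-deterministic variance:", sd.var())
```

Under the same evaluation budget, the deterministic nodes remove the variance contributed by random time sampling, which is the effect the abstract attributes to its semi-deterministic schemes.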

中文标题/摘要

标题：通过组扩散策略优化提高扩散语言模型的推理能力

扩散语言模型（DLMs）能够进行并行、无序的生成，并通过迭代细化提供灵活的替代方案，以自回归大型语言模型（LLMs）为例。然而，将强化学习（RL）微调适应DLMs仍然是一个开放的挑战，因为难以计算似然性。先驱工作如diffu-GRPO通过一次去遮蔽估计了标记级别的似然性。尽管计算效率高，但这种方法严重有偏。更原则的基础在于序列级别的似然性，其中证据下界（ELBO）作为替代。尽管存在这种清晰的数学联系，但由于似然性评估成本高昂，基于ELBO的方法受到了限制。在本文中，我们重新审视了ELBO估计，并将其方差来源进行了分解。这种分解促使我们通过快速、确定性的积分近似来减少方差，沿着几个关键维度。基于这一见解，我们引入了组扩散策略优化（GDPO），这是一种新的针对DLMs的RL算法。GDPO利用简单的半确定性蒙特卡洛方案来缓解在常规双蒙特卡洛采样下的ELBO估计器的方差爆炸，从而在严格的评估预算下提供一个方差更低的估计器。实验上，GDPO在预训练检查点上实现了持续的收益，并在大多数数学、推理和编码基准上优于diffu-GRPO，这是当前最先进的基线之一。

Summary / 总结

This paper addresses the challenge of adapting reinforcement learning fine-tuning to diffusion language models (DLMs) by revisiting the estimation of sequence-level likelihoods. The authors introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm that uses semi-deterministic Monte Carlo schemes to reduce the variance of ELBO estimators, leading to consistent performance gains over pretrained checkpoints and outperforming diffu-GRPO on math, reasoning, and coding benchmarks.

本文解决了将强化学习微调适应扩散语言模型（DLMs）的挑战，通过重新审视序列级似然性的估计。它引入了组扩散策略优化（GDPO），使用半确定性蒙特卡洛方案来减少ELBO估计器的方差，在紧缩的评估预算下获得一个方差更低的估计器。GDPO在数学、推理和编码基准测试中优于diffu-GRPO和预训练检查点。

Bellman Calibration for V-Learning in Offline Reinforcement Learning

Authors: Lars van der Laan, Nathan Kallus

First: 2025-12-29T18:52:18+00:00 · Latest: 2025-12-29T18:52:18+00:00

Abstract

We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions, and critically, without requiring Bellman completeness or realizability.
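A minimal sketch of the histogram variant may help make the procedure concrete. It makes a simplifying assumption that is not in the abstract: the transitions are taken to be on-policy, so the plain target `r + gamma * g(v(s'))` replaces the doubly robust pseudo-outcome the authors use for off-policy data. The function name, the quantile binning, and the defaults are all hypothetical choices.

```python
import numpy as np


def iterated_bellman_calibration(v_s, v_next, rewards,
                                 gamma=0.9, n_bins=10, n_iters=200):
    """Histogram calibration of value predictions by repeated 1-D regression.

    v_s, v_next: the model's predictions at states s and next states s'.
    Returns the bin edges and the calibrated value assigned to each bin.
    """
    # Bin states by their predicted value (interior quantile edges).
    edges = np.quantile(v_s, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bin_s = np.digitize(v_s, edges)
    bin_next = np.digitize(v_next, edges)
    # Start from the average raw prediction in each bin.
    g = np.array([v_s[bin_s == b].mean() if np.any(bin_s == b) else 0.0
                  for b in range(n_bins)])
    for _ in range(n_iters):
        # Fitted Bellman target under the current calibrated map.
        targets = rewards + gamma * g[bin_next]
        # Regress targets onto the predictions: a per-bin mean.
        for b in range(n_bins):
            mask = bin_s == b
            if mask.any():
                g[b] = targets[mask].mean()
    return edges, g


# Toy check: reward 1 at every step, so the true value is 1 / (1 - 0.9) = 10
# regardless of how poor the raw predictions are.
rng = np.random.default_rng(0)
v_s = rng.uniform(size=1000)
v_next = rng.uniform(size=1000)
rewards = np.ones(1000)
edges, g = iterated_bellman_calibration(v_s, v_next, rewards)
```

Because the regression is onto a single scalar (the prediction), each iteration is a one-dimensional fitted value iteration step, which matches the description in the abstract.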

中文标题/摘要

标题：贝尔曼校准在离线强化学习中V学习的应用

我们引入了迭代贝尔曼校准，这是一种简单、模型无关、事后校准无限 horizon 马尔可夫决策过程中的离策价值预测的简单方法。贝尔曼校准要求具有相似长期回报状态的一步回报与目标策略下的贝尔曼方程一致。我们通过反复将拟合的贝尔曼目标回归到模型的预测，并使用双重稳健的伪结果来处理离策数据，将经典的直方图校准和等向性校准适应到动态、反事实的设置中。这产生了一维拟合值迭代方案，可以应用于任何价值估计器。我们的分析在弱假设下提供了校准和预测的有限样本保证，并且关键地，无需贝尔曼完备性或现实性假设。

Summary / 总结

The research introduces Iterated Bellman Calibration, a simple, model-agnostic post-hoc method for calibrating off-policy value predictions in infinite-horizon Markov decision processes. It adapts classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. The result is a one-dimensional fitted value iteration scheme applicable to any value estimator, with finite-sample guarantees for both calibration and prediction under weak assumptions and without requiring Bellman completeness or realizability.

研究引入了迭代贝尔曼校准方法，用于强化学习中的离策价值预测校准。该方法通过反复将拟合的贝尔曼目标回归到模型的预测上来确保具有相似长期回报的状态的一步回报一致。关键发现包括在弱假设下对校准和预测的有限样本保证，无需贝尔曼完备性或现实性。

Investigation of the Impact of Synthetic Training Data in the Industrial Application of Terminal Strip Object Detection

Authors: Nico Baumgart, Markus Lange-Hegermann, Mike Mücke

First: 2024-03-06T18:33:27+00:00 · Latest: 2025-12-29T18:31:43+00:00

Abstract

In industrial manufacturing, deploying deep learning models for visual inspection is mostly hindered by the high and often intractable cost of collecting and annotating large-scale training datasets. While image synthesis from 3D CAD models is a common solution, the individual techniques of domain and rendering randomization to create rich synthetic training datasets have been well studied mainly in simple domains. Hence, their effectiveness on complex industrial tasks with densely arranged and similar objects remains unclear. In this paper, we investigate the sim-to-real generalization performance of standard object detectors on the complex industrial application of terminal strip object detection, carefully combining randomization and domain knowledge. We describe step-by-step the creation of our image synthesis pipeline that achieves high realism with minimal implementation effort and explain how this approach could be transferred to other industrial settings. Moreover, we created a dataset comprising 30,000 synthetic images and 300 manually annotated real images of terminal strips, which is publicly available for reference and future research. To provide a baseline as a lower bound of the expectable performance in these challenging industrial parts detection tasks, we show the sim-to-real generalization performance of standard object detectors on our dataset based on a fully synthetic training. While all considered models behave similarly, the transformer-based DINO model achieves the best score with 98.40% mean average precision on the real test set, demonstrating that our pipeline enables high quality detections in complex industrial environments from existing CAD data and with a manageable image synthesis effort.

中文标题/摘要

标题：工业终端条目检测中合成训练数据影响的调查

在工业制造中，部署用于视觉检测的深度学习模型主要受到大规模训练数据集收集和标注的高昂且难以解决的成本阻碍。虽然从3D CAD模型生成图像是一种常见解决方案，但用于创建丰富合成训练数据集的领域和渲染随机化技术主要在简单领域中得到了充分研究。因此，它们在复杂工业任务中的有效性，特别是涉及密集排列和相似对象的任务，仍然不清楚。在本文中，我们调查了标准对象检测器在复杂工业应用中的从仿真到现实的泛化性能，结合了随机化和领域知识。我们详细描述了实现高度逼真效果的图像合成管道，并解释了该方法如何应用于其他工业环境。此外，我们创建了一个包含30,000张合成图像和300张手动标注的真实图像的终端条目数据集，该数据集可供参考和未来研究使用。为了提供一个基准，作为这些具有挑战性的工业部件检测任务中可预期性能的下限，我们展示了在我们的数据集上基于完全合成训练的标准对象检测器的从仿真到现实的泛化性能。尽管所有考虑的模型表现相似，基于DINO模型的变压器架构在真实测试集上的平均精度达到了98.40%，这表明我们的管道能够从现有CAD数据中在复杂工业环境中实现高质量的检测，并且所需的图像合成工作量是可以管理的。

Summary / 总结

This paper investigates the effectiveness of synthetic training data for terminal strip object detection in industrial settings. The authors build a high-realism image synthesis pipeline from 3D CAD models, combining domain knowledge with randomization techniques. They evaluate standard object detectors, trained only on synthetic data, on a dataset of 30,000 synthetic and 300 manually annotated real images. The transformer-based DINO model achieves the best score with 98.40% mean average precision on the real test set, indicating that synthetic data can enable high-quality object detection in complex industrial environments with a manageable image synthesis effort.

本文研究了合成训练数据在工业终端条形码物体检测中的有效性。作者使用3D CAD模型和领域知识创建了一个高逼真度的图像合成管道，并开发了一个包含30,000张合成和300张真实图像的数据集。他们发现基于变压器的DINO在真实测试集上的平均精度最高，达到98.40%，表明他们的方法可以在复杂工业环境中从现有CAD数据中实现高质量的物体检测，并且所需的图像合成工作量可管理。

Random Controlled Differential Equations

Authors: Francesco Piatti, Thomas Cass, William F. Turner

First: 2025-12-29T18:25:10+00:00 · Latest: 2025-12-29T18:25:10+00:00

Abstract

We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretization, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these models induce the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory. We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features.
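The reservoir idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's RF-CDE or R-RDE: it uses a plain Euler discretization of dh = Σ_c tanh(A_c h + b_c) dx_c with fixed random weights, and a ridge-regression readout. The function names, the tanh vector fields, and the scaling of the random weights are all hypothetical choices.

```python
import numpy as np


def random_cde_features(paths, hidden=64, seed=0):
    """Map paths of shape (batch, length, channels) to terminal hidden states.

    The random vector fields are drawn once and never trained.
    """
    rng = np.random.default_rng(seed)
    batch, length, channels = paths.shape
    A = rng.standard_normal((channels, hidden, hidden)) / np.sqrt(hidden)
    b = rng.standard_normal((channels, hidden))
    h0 = rng.standard_normal(hidden) / np.sqrt(hidden)
    h = np.tile(h0, (batch, 1))
    dx = np.diff(paths, axis=1)  # path increments, (batch, length - 1, channels)
    for k in range(length - 1):
        # One vector field per input channel: shape (batch, channels, hidden).
        fields = np.tanh(np.einsum("chj,bj->bch", A, h) + b)
        # Euler step driven by the path increment.
        h = h + np.einsum("bch,bc->bh", fields, dx[:, k, :])
    return h


def fit_ridge_readout(features, targets, reg=1e-3):
    """Closed-form ridge regression: the only trained component."""
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)


rng = np.random.default_rng(1)
paths = rng.standard_normal((8, 20, 3)).cumsum(axis=1)  # toy random-walk inputs
targets = paths[:, -1, 0]                               # toy regression target
feats = random_cde_features(paths)
weights = fit_ridge_readout(feats, targets)
```

Since the dynamics are driven only by increments of the path, a constant input leaves the hidden state at its initial value, which is a quick sanity check on the discretization.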

中文标题/摘要

标题：随机控制微分方程

我们提出了一种结合随机特征与控制微分方程（CDEs）的时间序列学习高效框架。在此方法中，大型随机参数化CDEs作为连续时间蓄水池，将输入路径映射到丰富的表示。仅训练一层线性读出层，从而得到快速、可扩展且具有强归纳偏置的模型。在此基础上，我们提出了两种变体：(i) 随机傅里叶CDEs（RF-CDEs）：这些方法在动态学之前使用随机傅里叶特征提升输入信号，提供了一种无核近似RBF增强序列模型的方法；(ii) 随机粗糙DEs（R-RDEs）：这些方法直接在粗糙路径输入上通过对数ODE离散化操作，使用对数符号捕捉高阶时间交互，同时保持稳定性和效率。我们证明，在无限宽度极限下，这些模型分别诱导RBF提升符号核和粗糙符号核，提供了一种随机特征蓄水池、连续时间深度架构和路径符号理论的统一视角。我们在这两种模型在一系列时间序列基准上进行了评估，展示了具有竞争力或最先进的性能。这些方法提供了一种实用的替代显式符号计算的方法，同时保留其归纳偏置并受益于随机特征的效率。

Summary / 总结

The research introduces a training-efficient framework combining random features with controlled differential equations (CDEs) for time-series learning. This approach uses large randomly parameterized CDEs as continuous-time reservoirs to generate rich representations, with only a linear readout layer trained, leading to fast and scalable models. The study proposes two variants: Random Fourier CDEs (RF-CDEs) and Random Rough DEs (R-RDEs), which offer kernel-free approximations and capture higher-order temporal interactions, respectively. Experimental results show competitive or state-of-the-art performance across various time-series benchmarks, providing an efficient alternative to explicit signature computations while maintaining inductive bias.

本文提出了一种结合随机特征与控制微分方程（CDE）的时间序列学习框架。该方法使用大规模随机参数化的CDE作为连续时间的蓄水池，生成丰富的表示，仅训练一层线性读出层，从而实现快速和可扩展的模型。提出了两种变体：随机傅里叶CDE（RF-CDE）和随机粗糙DE（R-RDE）。RF-CDE使用随机傅里叶特征来近似RBF增强序列模型，而R-RDE通过对数微分方程离散化直接操作粗糙路径输入，捕捉更高阶的时间交互。实验表明，在各种时间序列基准测试中表现出竞争性或最先进的性能，提供了一种实用的替代显式签名计算的方法，同时保持效率和归纳偏置。

RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Authors: Zhe Li, Cheng Chi, Yangyang Wei, Boan Zhu, Tao Huang, Zhenguo Sun, Yibo Peng, Pengwei Wang, Zhongyuan Wang, Fangzhou Liu, Chang Xu, Shanghang Zhang

First: 2025-12-29T17:59:19+00:00 · Latest: 2025-12-29T17:59:19+00:00

Abstract

Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror: it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the visual understanding and action gap.

中文标题/摘要

标题：RoboMirror：理解后再模仿以实现视频到类人行走

人类通过视觉观察学习行走，先理解视觉内容再模仿动作。然而，最先进的类人行走系统依赖于精心策划的运动捕捉轨迹或稀疏的文本命令，这在视觉理解和控制之间留下了关键差距。文本到动作的方法受到语义稀疏性和阶段管线错误的影响，而基于视频的方法仅进行机械姿态模仿，缺乏真正的视觉理解。我们提出了RoboMirror，这是第一个无需重新目标化的视频到行走框架，体现了“理解后再模仿”的理念。利用VLMs，它将原始第一人称/第三人称视频提炼为视觉运动意图，直接条件化扩散机制策略生成物理上合理且语义上对齐的行走，无需显式的姿态重建或重新目标化。广泛实验验证了RoboMirror的有效性，它通过第一人称视频实现远程存在感，将第三人称控制延迟降低了80%，并且在任务成功率上比基线高出3.7%。通过将类人控制重新构架为视频理解，我们弥合了视觉理解和动作之间的差距。

Summary / 总结

RoboMirror is a retargeting-free video-to-locomotion framework that uses vision-language models (VLMs) to distill raw egocentric and third-person videos into visual motion intents. These intents directly condition a diffusion-based policy that generates physically plausible, semantically aligned locomotion for humanoid robots without explicit pose reconstruction or retargeting. Experiments show that RoboMirror enables telepresence via egocentric videos, reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baseline methods.

RoboMirror 是一个视频到运动的框架，强调在模仿之前先进行视觉理解。它利用视觉语言模型（VLMs）解析原始的第一人称和第三人称视频，然后通过扩散模型直接生成物理上合理且语义对齐的运动，无需显式的姿态重建或重新定位。实验表明，RoboMirror 将第三人称控制延迟减少了 80%，并且相比基线方法的成功率提高了 3.7%，从而弥合了视觉理解和动作之间的差距。

Nested Browser-Use Learning for Agentic Information Seeking

Authors: Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, Yong Jiang

First: 2025-12-29T17:59:14+00:00 · Latest: 2025-12-29T17:59:14+00:00

Abstract

Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

中文标题/摘要

标题：代理信息搜索中的嵌套浏览器使用学习

信息搜索(IS)代理在广泛和深入的搜索任务中取得了强大的性能，但它们的工具使用主要局限于API级别的片段检索和基于URL的页面获取，限制了对通过实际浏览可获得的更丰富信息的访问。虽然全面的浏览器交互可以解锁更深层次的能力，但其精细的控制和冗长的页面内容返回引入了对ReAct风格函数调用代理而言的重大复杂性。为了弥合这一差距，我们提出了嵌套浏览器使用学习(NestBrowse)，它引入了一个最小且完整的浏览器操作框架，通过嵌套结构将交互控制与页面探索脱钩。这种设计简化了代理推理，同时使有效的深层网络信息获取成为可能。在具有挑战性的深层IS基准测试上的实验证明，NestBrowse在实践中提供了明显的益处。进一步的深入分析强调了其效率和灵活性。

Summary / 总结

The research aims to give information-seeking agents access to real browsing, which offers richer information than API-level snippet retrieval and URL-based page fetching. The proposed method, Nested Browser-Use Learning (NestBrowse), introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure, simplifying agentic reasoning for ReAct-style function-calling agents. Results on challenging deep information-seeking benchmarks show clear practical benefits, and further analyses underscore its efficiency and flexibility.

研究旨在通过使信息寻求代理能够更有效地使用浏览器来增强其能力，浏览器可以提供比简单API调用更丰富的信息。方法是采用Nested Browser-Use Learning (NestBrowse)，通过嵌套结构简化了交互控制和页面探索。关键发现表明，NestBrowse在深度信息寻求任务中表现出色，提供了实际应用中的明显优势。

OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

Authors: Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

First: 2025-12-29T17:59:05+00:00 · Latest: 2025-12-29T17:59:05+00:00

Comments: Website: https://kd-tao.github.io/OmniAgent/

Abstract

Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.

中文标题/摘要

标题：OmniAgent：基于音频引导的多模态音频视频理解主动感知代理

多模态大型语言模型在统一音频和视觉模态方面取得了显著进展；然而，它们往往缺乏细粒度的跨模态理解，并且难以实现多模态对齐。为了解决这些限制，我们引入了OmniAgent，这是一种完全基于音频的主动感知代理，能够动态协调专门的工具以实现更细粒度的视听推理。与依赖于僵硬的静态工作流和密集的帧字幕的先前工作不同，本文展示了从被动响应生成到主动多模态查询的范式转变。OmniAgent采用动态规划，自主地在需要时调用工具，战略性地将感知注意力集中在与任务相关的线索上。我们方法的核心是新颖的从粗到细的基于音频的感知范式，利用音频线索定位时间事件并指导后续推理。在三个音频视频理解基准上的广泛实证评估表明，OmniAgent达到了最先进的性能，比领先的开源和专有模型在准确率上高出10%-20%。

Summary / 总结

OmniAgent is an audio-guided active perception agent designed to enhance fine-grained audio-visual understanding by dynamically invoking specialized tools. Unlike prior work that relies on rigid, static workflows and dense frame captioning, it performs active multimodal inquiry rather than passive response generation. Experimental results show that OmniAgent outperforms leading open-source and proprietary models by 10% to 20% in accuracy across three audio-video understanding benchmarks.

针对全模态大型语言模型在细粒度跨模态理解和多模态对齐方面的局限性，该论文提出了OmniAgent，这是一种由音频引导的主动感知代理，能够动态规划并协调专门工具进行视听推理。实验评估表明，OmniAgent在三个音视频理解基准上的准确率比领先的开源和专有模型高出10%-20%。

Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception

Authors: Xiaoyu Li, Peidong Li, Xian Wu, Long Shi, Dedong Liu, Yitao Wu, Jiajia Fu, Dixiao Cui, Lijun Zhao, Lining Sun

Venue: AAAI 2026

First: 2025-12-29T17:48:56+00:00 · Latest: 2025-12-29T17:48:56+00:00

Comments: Accepted to AAAI 2026

Abstract

Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.

中文标题/摘要

标题：重新思考端到端3D感知的空间-时间对齐

空间-时间对齐对于自动驾驶（AD）中端到端（E2E）感知的时序建模至关重要，提供了有价值的结构和纹理先验信息。现有方法通常依赖注意力机制在帧间对齐物体，并用统一的显式物理模型（恒定速度等）简化运动模型。这些方法倾向于使用语义特征进行隐式对齐，挑战了传统感知范式中显式运动建模的重要性。然而，不同类别和帧间运动状态和物体特征的变化使得这种对齐效果不佳。为了解决这一问题，我们提出了HAT，这是一种空间-时间对齐模块，允许每个物体自适应地从多个假设中解码最优对齐提案，无需直接监督。具体而言，HAT首先利用多个显式运动模型生成历史实例的空间锚点和运动感知特征提案，然后通过嵌入在缓存对象查询中的语义和运动线索进行多假设解码，最终为目标帧提供最优对齐提案。在nuScenes上，HAT在各种基线中一致地提高了3D时序检测器和跟踪器的性能。当与DETR3D检测器配对时，它在测试集上实现了46.0%的AMOTA最佳跟踪结果。在基于对象的E2E AD方法中，HAT提高了感知准确性（+1.3% mAP，+3.1% AMOTA），并将碰撞率降低了32%。当语义被破坏（nuScenes-C）时，HAT对运动建模的增强使E2E AD中的感知和规划更加稳健。

Summary / 总结

The paper addresses the limitations of existing spatio-temporal alignment methods in end-to-end 3D perception for autonomous driving by proposing HAT, a module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses. HAT uses multiple explicit motion models to generate spatial anchors and motion-aware feature proposals, and then performs multi-hypothesis decoding to provide the best alignment proposal. On the nuScenes dataset, HAT improves 3D temporal detectors and trackers across diverse baselines and reaches state-of-the-art tracking results (46.0% AMOTA on the test set with the DETR3D detector). In an object-centric E2E AD method, it improves perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%, and it makes perception and planning more robust when semantics are corrupted (nuScenes-C).

论文针对自动驾驶中端到端3D感知中的时空对齐问题，现有方法通常使用简化运动模型和语义特征进行对齐。为改进这一问题，作者提出了HAT时空对齐模块，该模块允许每个对象从多个假设中自适应地解码最优对齐提案。HAT使用多个显式运动模型生成空间锚点和运动感知特征提案，并进行多假设解码以提供最佳对齐提案。在nuScenes数据集上，HAT增强了3D时空检测器和追踪器，实现了最先进的追踪结果，并在对象中心的端到端自动驾驶方法中提高了感知准确性和降低了碰撞率。
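
The first stage of HAT, generating one spatial anchor per explicit motion model, can be illustrated with a minimal sketch. Only "constant velocity" is named in the abstract; the static and constant-acceleration models, the `Track` fields, and the function names below are assumptions made for illustration. The learned multi-hypothesis decoding that chooses among the anchors is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical 2D state of a historical object instance."""
    x: float
    y: float
    vx: float
    vy: float
    ax: float = 0.0
    ay: float = 0.0


def static_model(tr: Track, dt: float) -> tuple:
    # Assumed hypothesis: the object does not move.
    return (tr.x, tr.y)


def constant_velocity(tr: Track, dt: float) -> tuple:
    # The one motion model named in the abstract.
    return (tr.x + tr.vx * dt, tr.y + tr.vy * dt)


def constant_acceleration(tr: Track, dt: float) -> tuple:
    # Assumed hypothesis: velocity changes at a constant rate.
    return (
        tr.x + tr.vx * dt + 0.5 * tr.ax * dt ** 2,
        tr.y + tr.vy * dt + 0.5 * tr.ay * dt ** 2,
    )


MOTION_MODELS = {
    "static": static_model,
    "constant_velocity": constant_velocity,
    "constant_acceleration": constant_acceleration,
}


def spatial_anchors(track: Track, dt: float) -> dict:
    """Return one predicted position (anchor) per motion hypothesis."""
    return {name: model(track, dt) for name, model in MOTION_MODELS.items()}


anchors = spatial_anchors(Track(x=0.0, y=0.0, vx=2.0, vy=0.0, ax=1.0), dt=0.5)
```

In this sketch every hypothesis is kept; according to the abstract, HAT lets each object decode its preferred proposal from such candidates without direct supervision.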

BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Authors: Iris Xu, Guangtao Zeng, Zexue He, Charles Jin, Aldo Pareja, Dan Gutfreund, Chuang Gan, Zhang-Wei Hong

First: 2025-12-29T17:41:11+00:00 · Latest: 2025-12-29T17:41:11+00:00

Abstract

Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow (interpreting issues, navigating large codebases, and implementing fixes) within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub-agents within a team. We address these challenges by formulating hierarchy discovery as a multi-armed bandit (MAB) problem, where each arm represents a candidate sub-agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub-agent designs under limited evaluation budgets. On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, featuring more recent and out-of-distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT-4 and Claude. These results demonstrate that automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon SWE tasks. Code is available at https://github.com/iamxjy/BOAD-SWE-Agent.

中文标题/摘要

标题：BOAD：通过多臂老虎机优化发现层次化的软件工程代理

大型语言模型（LLMs）展示了强大的推理和编程能力，但在处理长周期和分布外的实际软件工程（SWE）问题时却表现不佳。现有系统通常依赖单一代理来处理整个工作流程，包括解释问题、导航大型代码库和实施修复，都在一个推理链中完成。这种单一设计迫使模型保留无关上下文，导致虚假相关性和较差的泛化能力。受人类工程师如何分解复杂问题的启发，我们建议将SWE代理结构化为协调专门化子代理的协调者，这些子代理负责子任务，如定位、编辑和验证。挑战在于自动发现有效的层次结构：随着子代理数量的增加，搜索空间变得组合化，难以在团队内部为个别子代理分配信用。我们通过将层次结构发现形式化为多臂老虎机（MAB）问题来应对这些挑战，其中每个臂代表一个候选子代理，奖励衡量其与其他代理合作时的有用性。该框架称为代理设计的多臂老虎机优化（BOAD），在有限的评估预算下能够高效探索子代理设计。在SWE-bench-Verified上，BOAD优于单一代理和手动设计的多代理系统。在SWE-bench-Live上，包含更多近期和分布外的问题，我们的36B系统在评估时排名第二，超过了如GPT-4和Claude等更大模型。这些结果表明，自动发现的层次化多代理系统在处理具有挑战性的长周期SWE任务时显著提高了泛化能力。代码可在https://github.com/iamxjy/BOAD-SWE-Agent/ 获取。

Summary / 总结

The paper addresses the challenge of using large language models for real-world software engineering tasks, which are long-horizon and out of distribution. It proposes BOAD, a method that structures software engineering agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. By formulating hierarchy discovery as a multi-armed bandit problem, BOAD efficiently explores sub-agent designs under limited evaluation budgets. It outperforms single-agent and manually designed multi-agent systems on SWE-bench-Verified, and on SWE-bench-Live its 36B system ranked second on the leaderboard at the time of evaluation, demonstrating improved generalization.

研究旨在通过提出层次化的多智能体系统来提高大型语言模型在实际软件工程任务中的泛化能力。方法是将层次结构发现形式化为多臂老虎机（MAB）问题，从而自动发现有效的子智能体层次结构。实验结果表明，BOAD在SWE-bench-Verified上优于单智能体和手动设计的多智能体系统；在包含更近期、分布外问题的SWE-bench-Live上，其36B系统在评估时位列排行榜第二。
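
The bandit formulation described above, where each arm is a candidate sub-agent and the reward measures its helpfulness, can be illustrated with a standard UCB1 loop. This is a generic textbook sketch, not the BOAD implementation: the abstract does not say which bandit algorithm is used, and the `evaluate` function and `TRUE_HELPFULNESS` values are hypothetical stand-ins for running a sub-agent inside a team on SWE tasks.

```python
import math
import random


def ucb1_select(counts, totals, t, c=1.4):
    """Pick the arm with the highest UCB1 score; untried arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(evaluate, n_arms, budget, c=1.4):
    """Spend a fixed evaluation budget across candidate arms."""
    counts = [0] * n_arms      # number of evaluations per arm
    totals = [0.0] * n_arms    # summed reward per arm
    for t in range(1, budget + 1):
        arm = ucb1_select(counts, totals, t, c)
        counts[arm] += 1
        totals[arm] += evaluate(arm)
    return counts, totals


# Hypothetical stand-in for "helpfulness when collaborating with others":
# each candidate sub-agent succeeds with a fixed, hidden probability.
rng = random.Random(0)
TRUE_HELPFULNESS = [0.2, 0.5, 0.8]


def evaluate(arm):
    return 1.0 if rng.random() < TRUE_HELPFULNESS[arm] else 0.0


counts, totals = run_bandit(evaluate, n_arms=3, budget=300)
mean_reward = [totals[a] / counts[a] for a in range(3)]
```

The exploration bonus shrinks as an arm is evaluated more often, so a limited budget tends to concentrate on the candidates that currently look most helpful while still revisiting the others occasionally.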

Memorization in 3D Shape Generation: An Empirical Study

Authors: Shu Pu, Boya Zeng, Kaichen Zhou, Mengyu Wang, Zhuang Liu

First: 2025-12-29T17:39:21+00:00 · Latest: 2025-12-29T17:39:21+00:00

Abstract

Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at https://github.com/zlab-princeton/3d_mem.

中文标题/摘要

标题：3D形状生成中的记忆化：一项实证研究

生成模型在3D视觉中越来越多地用于合成新的形状，但尚不清楚它们的生成是否依赖于记忆训练形状。理解其记忆化有助于防止训练数据泄露并提高生成结果的多样性。在本文中，我们设计了一种评估框架来量化3D生成模型的记忆化，并研究了不同数据和建模设计对记忆化的影响。我们首先将该框架应用于量化现有方法的记忆化。接下来，通过使用一个潜在向量集（Vecset）扩散模型的受控实验，我们发现，在数据方面，记忆化取决于数据模态，并随着数据多样性和更精细的条件而增加；在建模方面，它在适度的指导规模下达到峰值，并可以通过更长的Vecset和简单的旋转增强来减轻。综上所述，我们的框架和分析为3D生成模型中的记忆化提供了实证理解，并建议了一些简单而有效的策略来减少它而不降低生成质量。我们的代码可在https://github.com/zlab-princeton/3d_mem获取。

Summary / 总结

This paper evaluates memorization in 3D generative models by designing an evaluation framework. It quantifies memorization in existing methods and conducts controlled experiments with a latent vector-set (Vecset) diffusion model. The study finds that memorization depends on data modality and increases with data diversity and finer-grained conditioning. On the modeling side, memorization peaks at a moderate guidance scale and can be reduced by longer Vecsets and simple rotation augmentation. These findings provide insights into reducing memorization without compromising generation quality.

本文研究了3D生成模型是否记忆训练数据，这对于防止数据泄露和提升生成多样性至关重要。作者开发了一个评估框架来衡量记忆现象，并将其应用于现有方法。通过使用一个潜在向量集扩散模型进行受控实验，他们发现记忆现象受数据模态和多样性的影响，并可以通过使用更长的向量集和简单的旋转增强来减轻。该研究提供了关于3D生成模型中记忆现象的实证见解，并提出了减轻记忆现象的实用策略。

Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning

Authors: Deniz Akdemir

First: 2025-12-29T17:21:44+00:00 · Latest: 2025-12-29T17:21:44+00:00

Abstract

Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm--Unsupervised Domain Adaptation (UDA)--enforces feature invariance, aligning source and target representations via symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs degraded sensors), strict invariance necessitates information destruction, causing "negative transfer" that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam's theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $δ(E_1, E_2)$, as a rigorous upper bound for transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r=0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.

Summary / 总结

The paper addresses the challenge of distribution shift in machine learning by proposing a decision-theoretic framework called Le Cam Distortion. This framework uses Le Cam's theory of statistical experiments to replace feature invariance with directional simulability, aiming to avoid information destruction and negative transfer. Experiments across genomics, vision, and reinforcement learning show that Le Cam Distortion achieves near-perfect frequency estimation, preserves source utility in image classification, and ensures safe policy transfer in reinforcement learning, outperforming invariance-based methods in safety-critical applications.

论文提出了一种基于Le Cam统计实验理论的决策框架Le Cam Distortion，用方向可模拟性替代了特征不变性。该框架通过缺陷距离量化了转移风险，并能够在不降低源端效用的情况下实现转移。实验结果显示，Le Cam Distortion在基因组学、视觉和强化学习领域实现了近乎完美的频率估计、保持了源端准确性，并确保了安全的策略转移，优于基于不变性的方法，避免了负向转移。
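
For reference, the classical definition of Le Cam's deficiency that the abstract's $δ(E_1, E_2)$ refers to can be written as follows. The notation is an assumption: $E_1 = (P_\theta)_{\theta \in \Theta}$ is read here as the source experiment and $E_2 = (Q_\theta)_{\theta \in \Theta}$ as the target, matching the abstract's description of a kernel that simulates the target from the source. The paper's exact conventions may differ.

```latex
% Deficiency of E_1 with respect to E_2: how well E_2 can be simulated from E_1.
% The infimum ranges over Markov kernels K from the sample space of E_1
% to the sample space of E_2.
\[
\delta(E_1, E_2) \;=\; \inf_{K} \, \sup_{\theta \in \Theta} \,
\bigl\| K P_\theta - Q_\theta \bigr\|_{\mathrm{TV}}
\]

% The symmetric Le Cam distance takes the worse of the two directions.
\[
\Delta(E_1, E_2) \;=\; \max\bigl\{ \delta(E_1, E_2),\ \delta(E_2, E_1) \bigr\}
\]
```

The deficiency is directional: $\delta(E_1, E_2) = 0$ means $E_2$ can be reproduced from $E_1$ by randomization, while $\delta(E_2, E_1)$ may still be large. In the classical theory, for a loss bounded in $[0, 1]$, any decision rule available in $E_2$ can be matched in $E_1$ with risk larger by at most $\delta(E_1, E_2)$ (up to a constant that depends on how the total variation norm is normalized), which is the sense in which a deficiency bounds transfer risk.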

ClassWise-CRF: Category-Specific Fusion for Enhanced Semantic Segmentation of Remote Sensing Imagery

Authors: Qinfeng Zhu, Yunxi Jiang, Lei Fan

First: 2025-04-30T10:19:21+00:00 · Latest: 2025-12-29T17:14:48+00:00

Comments: Accepted by Neural Networks

Abstract

We propose a result-level category-specific fusion architecture called ClassWise-CRF. This architecture employs a two-stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise-CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category-specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category-specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise-CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise-CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise-CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise-CRF.

中文标题/摘要

标题：ClassWise-CRF：特定类别融合架构以增强遥感影像语义分割

我们提出了一种结果级别的特定类别融合架构，称为ClassWise-CRF。该架构采用两阶段过程：首先，使用贪婪算法从候选网络池中选择在特定类别上表现良好的专家网络；其次，通过根据每个类别中的分割性能自适应加权这些选定网络的分割预测来整合这些网络的分割预测。受条件随机场(CRF)启发，ClassWise-CRF架构将多个网络的分割预测视为置信向量场。它利用验证集上的分割指标（如交并比）作为先验，并采用指数加权策略融合每个网络预测的类别特定置信分数。该融合方法动态调整每个网络在不同类别中的权重，实现类别特定优化。在此基础上，架构进一步使用CRF中的单元势和对势优化融合结果，以确保空间一致性与边界准确性。为了验证ClassWise-CRF的有效性，我们在两个遥感数据集LoveDA和Vaihingen上使用八种经典和先进的语义分割网络进行了实验。结果显示，ClassWise-CRF架构显著提高了分割性能：在LoveDA数据集上，验证集的平均交并比(mIoU)提高了1.00%，测试集提高了0.68%；在Vaihingen数据集上，验证集的mIoU提高了0.87%，测试集提高了0.91%。这些结果充分证明了ClassWise-CRF架构在遥感影像语义分割中的有效性和普适性。完整的代码可在https://github.com/zhuqinfeng1999/ClassWise-CRF获取。

Summary / 总结

The research proposes ClassWise-CRF, a category-specific fusion architecture for remote sensing image semantic segmentation. It selects expert networks for specific categories and integrates their predictions using adaptive weighting based on segmentation performance. Experiments on LoveDA and Vaihingen datasets show that ClassWise-CRF improves mean Intersection over Union (mIoU) by 1.00% and 0.68% on the validation and test sets of LoveDA, and by 0.87% and 0.91% on the validation and test sets of Vaihingen, respectively.

研究提出了一种名为ClassWise-CRF的类别特定融合架构，用于增强遥感图像的语义分割。该架构采用两阶段过程选择特定类别下的专家网络，并根据其性能动态加权其贡献。实验结果表明，ClassWise-CRF在LoveDA数据集上的验证集和测试集的平均交并比（mIoU）分别提高了1.00%和0.68%，在Vaihingen数据集上分别提高了0.87%和0.91%，证明了其在遥感图像语义分割中的有效性和普适性。
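
The category-specific fusion step can be sketched as follows, assuming one plausible form of the exponential weighting: each network's weight for a class is proportional to `exp(alpha * IoU)` on the validation set, normalized across networks. The formula, the `alpha` value, and all names are assumptions for illustration; the abstract does not give the exact expression, and the subsequent CRF refinement with unary and pairwise potentials is omitted.

```python
import math


def class_weights(val_iou, alpha=5.0):
    """val_iou[n][c]: validation IoU of network n on class c.

    Returns weights[n][c], normalized across networks for each class.
    """
    n_nets, n_cls = len(val_iou), len(val_iou[0])
    weights = [[0.0] * n_cls for _ in range(n_nets)]
    for c in range(n_cls):
        exps = [math.exp(alpha * val_iou[n][c]) for n in range(n_nets)]
        z = sum(exps)
        for n in range(n_nets):
            weights[n][c] = exps[n] / z
    return weights


def fuse_pixel(confidences, weights):
    """confidences[n][c]: confidence of network n for class c at one pixel.

    Returns (predicted class index, fused per-class scores).
    """
    n_nets, n_cls = len(confidences), len(confidences[0])
    fused = [
        sum(weights[n][c] * confidences[n][c] for n in range(n_nets))
        for c in range(n_cls)
    ]
    return fused.index(max(fused)), fused


# Two hypothetical networks: network 0 is strong on class 0, network 1 on class 1.
val_iou = [[0.9, 0.3], [0.4, 0.8]]
weights = class_weights(val_iou)

# At this pixel the networks disagree; the class-1 expert dominates class 1.
label, fused = fuse_pixel([[0.55, 0.45], [0.40, 0.60]], weights)
```

Because the weights are computed per class, a network can dominate the fused score for the categories where it validated well and contribute little elsewhere.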

OM4OV: Leveraging Ontology Matching for Ontology Versioning

Authors: Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

First: 2024-09-30T14:00:04+00:00 · Latest: 2025-12-29T17:05:52+00:00

Comments: 16 pages, 8 figures, 1 table

Abstract

Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise the OM4OV pipeline to offer more advanced OV support. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be reused for OV tasks, but without necessary extensions, the current OM4OV pipeline can produce skewed measurements, poor performance in detecting update entities, and limited explainability of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.

中文标题/摘要

标题：OM4OV：利用本体匹配进行本体版本管理

由于语义网的动态性质，版本控制对于管理广泛使用的本体中的更改是必要的。尽管本体版本管理（OV）作为有效本体管理的关键组成部分已有长期的认识，但许多方法仍将OV视为与本体匹配（OM）相似，并直接重用OM系统来执行OV任务。在本研究中，我们系统地分析了OM和OV之间的相似性和差异，并形式化了OM4OV管道，以提供更高级的OV支持。该管道在最先进的OM系统Agent-OM中实现和评估。实验结果表明，OM系统可以重用于OV任务，但如果没有必要的扩展，当前的OM4OV管道会产生失真的测量结果，检测更新实体的性能较差，并且对错误映射的解释性有限。为了解决这些问题，我们提出了一种优化方法，称为交叉引用（CR）机制，该机制基于现有的OM对齐来减少匹配候选的数量，并提高整体的OV性能。

Summary / 总结

This study addresses the need for version control in ontologies due to the dynamic nature of the Semantic Web. It analyzes the differences and similarities between ontology matching (OM) and ontology versioning (OV) and proposes an OM4OV pipeline to enhance OV support. The pipeline, implemented in Agent-OM, shows that while OM systems can be reused for OV tasks, they require extensions to avoid skewed measurements and improve performance in detecting update entities and explainability of false mappings. An optimization method called the cross-reference (CR) mechanism is proposed to address these issues by reducing the number of matching candidates and improving overall OV performance.

该研究针对语义网动态变化带来的本体版本控制需求，分析了本体匹配（OM）和本体版本管理（OV）之间的差异和相似之处，并提出了OM4OV管道，该管道在Agent-OM中实现。研究表明，虽然OM系统可以重用于OV任务，但若缺少必要的扩展，会产生测量偏差、检测更新实体的性能较差，且对错误映射的可解释性有限。为此，研究提出了一种称为交叉引用（CR）机制的优化方法，基于现有的OM对齐减少匹配候选数量，并提高OV的整体性能。

Learning to Refocus with Video Diffusion Models

Authors: SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Venue: SIGGRAPH Asia 2025

First: 2025-12-22T19:29:57+00:00 · Latest: 2025-12-29T17:04:36+00:00

Comments: Code and data are available at https://learn2refocus.github.io. SIGGRAPH Asia 2025, Dec. 2025

Abstract

Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at www.learn2refocus.github.io

中文标题/摘要

标题：学习使用视频扩散模型重新聚焦

对焦是摄影的基础，但自动对焦系统往往无法捕捉到预期的主体，用户经常希望在拍摄后调整对焦。我们提出了一种使用视频扩散模型进行现实后对焦的新方法。从单张失焦图像出发，我们的方法生成了一组感知上准确的焦距堆栈，表示为视频序列，支持交互式重新对焦并解锁一系列下游应用。我们发布了一个大规模的焦距堆栈数据集，以支持这项工作和未来的研究。我们的方法在感知质量和在具有挑战性的场景中的鲁棒性方面均优于现有方法，为日常摄影中的更高级对焦编辑能力铺平了道路。代码和数据可在www.learn2refocus.github.io获取

Summary / 总结

The paper introduces a method for realistic post-capture refocusing using video diffusion models. Starting from a single defocused image, the approach generates a perceptually accurate focal stack, allowing for interactive refocusing and supporting various downstream applications. The method outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, and a large-scale focal stack dataset is provided for research support.

该论文解决了摄影中拍摄后对焦的问题，自动对焦系统常常无法捕捉到预期的主体。作者提出了一种使用视频扩散模型的方法，可以从单张失焦图像生成感知上准确的焦距堆栈，实现交互式对焦，并支持多种下游应用。该方法在感知质量和鲁棒性方面均优于现有方法，展示了其在日常摄影中增强对焦编辑能力的潜力。

The Nonstationarity-Complexity Tradeoff in Return Prediction

Authors: Agostino Capponi, Chengpiao Huang, J. Antonio Sidaoui, Kaizheng Wang, Jiacheng Zou

First: 2025-12-29T16:49:19+00:00 · Latest: 2025-12-29T16:49:19+00:00

Abstract

We investigate machine learning models for stock return prediction in non-stationary environments, revealing a fundamental nonstationarity-complexity tradeoff: complex models reduce misspecification error but require longer training windows that introduce stronger non-stationarity. We resolve this tension with a novel model selection method that jointly optimizes model class and training window size using a tournament procedure that adaptively evaluates candidates on non-stationary validation data. Our theoretical analysis demonstrates that this approach balances misspecification error, estimation variance, and non-stationarity, performing close to the best model in hindsight. Applying our method to 17 industry portfolio returns, we consistently outperform standard rolling-window benchmarks, improving out-of-sample $R^2$ by 14-23% on average. During NBER-designated recessions, improvements are substantial: our method achieves positive $R^2$ during the Gulf War recession while benchmarks are negative, improves $R^2$ in absolute terms by at least 80bps during the 2001 recession, and delivers superior performance during the 2008 Financial Crisis. Economically, a trading strategy based on our selected model generates 31% higher cumulative returns averaged across the industries.

中文标题/摘要

标题：收益预测中的非平稳性-复杂性权衡

我们研究了在非平稳环境中股票收益预测的机器学习模型，揭示了一个基本的非平稳性-复杂性权衡：复杂的模型减少了模型设定误差，但需要更长的训练窗口，这引入了更强的非平稳性。我们通过一种新颖的模型选择方法解决了这一矛盾，该方法使用锦标赛程序联合优化模型类别和训练窗口大小，并在非平稳验证数据上自适应评估候选模型。我们的理论分析表明，这种方法平衡了模型设定误差、估计方差和非平稳性，接近于事后最佳模型的表现。将我们的方法应用于17个行业投资组合收益，我们始终优于标准滚动窗口基准，平均提高离样本$R^2$ 14-23%。在NBER指定的经济衰退期间，改进尤为显著：在海湾战争衰退期间，我们的方法实现了正的$R^2$，而基准为负；在2001年衰退期间，$R^2$绝对值至少提高了80bps，且在2008年金融危机期间表现更优。从经济角度看，基于我们选择的模型的交易策略在各行业中平均累计回报率高出31%。

Summary / 总结

This study explores machine learning models for stock return prediction in non-stationary environments, identifying a tradeoff between model complexity and non-stationarity. The authors propose a novel model selection method that optimizes both model class and training window size, using a tournament procedure to evaluate candidates on non-stationary validation data. This approach balances misspecification error, estimation variance, and non-stationarity, outperforming standard rolling-window benchmarks by 14-23% in out-of-sample $R^2$. During economic recessions, the method significantly improves performance, achieving positive $R^2$ during the Gulf War recession and substantial improvements during the 2001 and 2008 recessions. A trading strategy based on the selected model generates 31% higher cumulative returns across industries.

研究探讨了在非平稳环境下使用机器学习模型进行股票回报预测的问题，发现模型复杂性和非平稳性之间存在权衡。作者提出了一种新的模型选择方法，同时优化模型类别和训练窗口大小，通过锦标赛程序在非平稳验证数据上评估候选模型。该方法在17个行业投资组合上将样本外$R^2$平均提高14-23%，在经济衰退期间（海湾战争衰退、2001年衰退和2008年金融危机）改进尤为显著；基于所选模型的交易策略在各行业平均的累计回报高出31%。
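
The idea of choosing model class and training window jointly can be illustrated with a deliberately simplified sketch: an exhaustive search over (model, window) pairs scored on the most recent validation block. This is a plain grid search standing in for the paper's tournament procedure, whose details are not given in the abstract; the two toy model classes, the synthetic regime-change data, and the benchmark argument of `oos_r2` are all assumptions.

```python
def oos_r2(actual, predicted, benchmark):
    """Out-of-sample R^2 relative to a benchmark forecast (assumed convention)."""
    sse_model = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sse_bench = sum((a - b) ** 2 for a, b in zip(actual, benchmark))
    return 1.0 - sse_model / sse_bench


def fit_mean(xs, ys):
    """Low-complexity model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Higher-complexity model: one-predictor least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def select(xs, ys, fitters, windows, n_val):
    """Return (validation MSE, model name, window) of the best candidate."""
    train_x, train_y = xs[:-n_val], ys[:-n_val]
    val_x, val_y = xs[-n_val:], ys[-n_val:]
    best = None
    for name, fit in fitters.items():
        for w in windows:
            model = fit(train_x[-w:], train_y[-w:])
            mse = sum((model(x) - y) ** 2 for x, y in zip(val_x, val_y)) / n_val
            if best is None or mse < best[0]:
                best = (mse, name, w)
    return best


# Synthetic regime change: y = -x for the first 40 points, y = 2x afterwards.
xs = [float(i % 5) for i in range(60)]
ys = [(-x if i < 40 else 2.0 * x) for i, x in enumerate(xs)]

best = select(xs, ys, {"mean": fit_mean, "linear": fit_linear},
              windows=[10, 50], n_val=5)
```

On this toy data the long window mixes two regimes, so the linear model fitted on the short window wins; with noisier data a longer window or a simpler model could be preferable, which is the tradeoff the abstract describes.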

How Safe Are AI-Generated Patches? A Large-scale Study on Security Risks in LLM and Agentic Automated Program Repair on SWE-bench

Authors: Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee

First: 2025-06-30T21:10:19+00:00 · Latest: 2025-12-29T16:44:07+00:00

Abstract

Large language models (LLMs) and their agentic frameworks are increasingly adopted to perform development tasks such as automated program repair (APR). While prior work has identified security risks in LLM-generated code, most have focused on synthetic, simplified, or isolated tasks that lack the complexity of real-world program repair. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ GitHub issues. We evaluate patches proposed by developers, a standalone LLM (Llama 3.3 Instruct-70B), and three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb). Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which generating insecure patches is more likely. Our findings reveal that Llama introduces many new vulnerabilities, exhibiting unique patterns not found in developers' code. Agentic workflows also generate a number of vulnerabilities, particularly when given more autonomy. We find that vulnerabilities in LLM-generated patches are associated with distinctive code characteristics and are commonly observed in issues missing specific types of information. These results suggest that contextual factors play a critical role in the security of the generated patches and point toward the need for proactive risk assessment methods that account for both issue and code-level information.

中文标题/摘要

标题：AI生成补丁的安全性如何？基于SWE-bench的大规模LLM和代理自动化程序修复安全风险研究

大型语言模型（LLMs）及其代理框架越来越多地被用于执行开发任务，如自动化程序修复（APR）。尽管先前的工作已经识别出LLM生成代码中的安全风险，但大多数研究都集中在合成、简化或孤立的任务上，缺乏真实世界程序修复的复杂性。在本研究中，我们首次使用20,000多个GitHub问题对LLM生成的补丁进行了大规模的安全分析。我们评估了开发人员、独立LLM（Llama 3.3 Instruct-70B）和三个表现最佳的代理框架（OpenHands、AutoCodeRover、HoneyComb）提出的补丁。最后，我们分析了代码、问题和项目层面的各种因素，以了解生成不安全补丁的条件。我们的研究发现，Llama引入了许多新的漏洞，表现出不同于开发人员代码的独特模式。代理工作流程在获得更多自主权时也会生成大量漏洞。我们发现，LLM生成补丁中的漏洞与特定的代码特征相关，并且通常出现在缺少特定类型信息的问题中。这些结果表明，上下文因素在生成补丁的安全性中起着关键作用，并指出了需要考虑问题和代码层面信息的主动风险评估方法的必要性。

Summary / 总结

This study investigates the security risks in AI-generated patches using large language models (LLMs) and agentic frameworks on 20,000+ GitHub issues. It evaluates patches from developers, a standalone LLM (Llama 3.3 Instruct-70B), and three agentic frameworks (OpenHands, AutoCodeRover, HoneyComb). The standalone LLM introduces many new vulnerabilities with patterns not found in developers' code, and the agentic workflows also generate vulnerabilities, particularly when given more autonomy. The research finds that vulnerabilities in LLM-generated patches are associated with distinctive code characteristics and with issues missing specific types of information, highlighting the importance of contextual factors in patch security.

本研究使用大型语言模型（LLMs）和代理框架对来自20,000多个GitHub问题的AI生成补丁进行大规模安全分析。研究评估了开发人员、一个独立的LLM以及三个顶级代理框架生成的补丁，发现独立LLM会引入许多新的漏洞，而代理工作流在获得更多自主权时也会产生漏洞。研究发现，LLMs生成的补丁中的漏洞与特定的代码特征以及问题中缺失的信息有关，强调了确保补丁安全时考虑上下文因素的重要性。

Same or Not? Enhancing Visual Perception in Vision-Language Models

Authors: Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari

First: 2025-12-29T16:43:47+00:00 · Latest: 2025-12-29T16:43:47+00:00

Comments: Project webpage: https://glab-caltech.github.io/twin/

Abstract

Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/

中文标题/摘要

标题：同或不同？提升视觉语言模型的视觉感知能力

视觉语言模型（VLMs）在广泛的视觉理解方面表现出色，但仍然较为粗略，存在视觉偏见，并且忽略了一些细微的视觉细节。现有的训练语料库通过强调一般识别（“这是猫还是狗？”）而非精细的感知，强化了这一局限性。为了解决这一问题，我们引入了一个新的训练语料库和任务，旨在提升VLMs的感知能力。TWIN是一个包含561,000个图像对查询的大规模数据集，要求模型判断两个视觉相似的图像是否描绘同一个物体，鼓励关注细微的视觉线索。该数据集涵盖了各种日常物体在不同上下文、视角和外观下的广泛范围。在TWIN上微调VLMs在精细识别方面取得了显著进步，即使在未见过的领域如艺术、动物、植物和地标上也是如此。为了量化这些进步，我们引入了FGVQA，这是一个包含12,000个查询的基准套件，重新利用了多个领域中的精细识别和检索数据集。虽然现有的VLMs在FGVQA上表现不佳，但在TWIN上微调后，它们的性能提高了高达19.3%，而不会影响通用VQA基准的性能。最后，我们的TWIN数据集在对象注释方面具有可扩展性，我们的分析表明，规模是关键因素。我们设想TWIN可以作为开源VLM训练语料库的即插即用补充，推动未来模型感知精度的提升。项目网页：https://glab-caltech.github.io/twin/

Summary / 总结

This paper addresses the limitations of vision-language models (VLMs) in fine-grained perception by introducing TWIN, a new dataset of 561,000 image-pair queries that encourages models to distinguish subtle visual differences. Fine-tuning VLMs on TWIN improves their performance in fine-grained recognition across various domains, with up to 19.3% improvement on the FGVQA benchmark, without affecting general VQA performance. The dataset's scale is crucial for enhancing perceptual precision, and TWIN is designed to be integrated into VLM training to advance model accuracy. Project webpage: https://glab-caltech.github.io/twin/

该研究引入了TWIN，一个包含561,000个图像对查询的新数据集，旨在提升视觉语言模型（VLM）的细粒度视觉感知能力。通过让模型判断两个相似图像是否描绘同一物体，TWIN促使模型关注细微的视觉特征。在TWIN上微调VLM能够显著提高其在细粒度识别任务中的表现，甚至在艺术、动物、植物和地标等未见过的领域中，性能提升高达19.3%，同时不影响通用VQA性能。研究者还引入了FGVQA基准套件来评估这些改进，显示使用TWIN时有显著提升。TWIN的规模对于性能至关重要，并且该数据集设计为易于集成到VLM训练数据集中。

LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

Authors: Ethan Chern, Zhulin Hu, Bohao Tang, Jiadi Su, Steffi Chern, Zhijie Deng, Pengfei Liu

First: 2025-12-29T16:17:36+00:00 · Latest: 2025-12-29T16:17:36+00:00

Abstract

Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.

中文标题/摘要

标题：LiveTalk: 通过改进的在线策略蒸馏实现实时多模态交互视频扩散

通过扩散模型实时生成视频对于构建通用多模态交互AI系统至关重要。然而，通过迭代过程中的双向注意力同时对所有视频帧进行去噪会阻碍实时交互。虽然现有的蒸馏方法可以使模型自回归并减少采样步骤以缓解这一问题，但它们主要集中在文本到视频生成上，使得人机交互显得不自然且效率较低。本文旨在针对多模态上下文（包括文本、图像和音频）条件下的实时交互视频扩散，以弥合这一差距。鉴于观察到领先的在线策略蒸馏方法Self Forcing在多模态条件下的挑战（如视觉伪影、黑屏和质量下降），我们研究了一种改进的蒸馏配方，强调条件输入的质量以及在线策略优化的初始化和调度。在包括HDTF、AVSpeech和CelebV-HQ的多模态条件（音频、图像和文本）头像视频生成基准上，我们的蒸馏模型在与相似或更大规模的全步骤双向基线视觉质量相当的情况下，实现了20倍的推理成本和延迟降低。此外，我们将模型与音频语言模型和长视频推理技术Anchor-Heavy Identity Sinks结合，构建了LiveTalk实时多模态交互头像系统。在我们策划的多轮交互基准上的系统级评估显示，LiveTalk在多轮视频连贯性和内容质量方面优于Sora2和Veo3等最先进的模型，同时将响应延迟从1到2分钟缩短到实时生成，从而实现无缝的人机多模态交互。

Summary / 总结

This paper addresses the challenge of real-time video generation via diffusion models for interactive AI systems. It introduces an improved on-policy distillation method to enhance the quality of condition inputs and optimize the initialization and schedule for better performance. The distilled model achieves visual quality comparable to full-step baselines with 20 times less inference cost and latency, and integrates with audio language models and long-form video inference techniques to create LiveTalk, a real-time multimodal interactive avatar system. LiveTalk outperforms state-of-the-art models in multi-turn video coherence and content quality, with reduced response latency to real-time generation.

该论文通过改进在线策略蒸馏方法解决了实时多模态交互视频生成的挑战。作者专注于提高条件输入的质量并优化在线策略优化的初始化和时间表。他们的模型LiveTalk在20倍更低的推理成本和延迟下实现了与全步骤基线相当的视觉质量，从而在多模态上下文中实现实时的人机交互。

Predicting large scale cosmological structure evolution with generative adversarial network-based autoencoders

Authors: Marion Ullmo, Nabila Aghanim, Aurélien Decelle, Miguel Aragon-Calvo

First: 2024-03-04T16:17:43+00:00 · Latest: 2025-12-29T16:15:13+00:00

Comments: 13 pages, 11 figures

Abs · PDF · Code1 · Code2

Abstract

Predicting the nonlinear evolution of cosmic structure from initial conditions is typically approached using Lagrangian, particle-based methods. These techniques excel in terms of tracking individual trajectories, but they might not be suitable for applications where point-based information is unavailable or impractical. In this work, we explore an alternative, field-based approach using Eulerian inputs. Specifically, we developed an autoencoder architecture based on a generative adversarial network (GAN) and trained it to evolve density fields drawn from dark matter N-body simulations. We tested this method on both 2D and 3D data. We find that while predictions on 2D density maps perform well based on density alone, accurate 3D predictions require the inclusion of associated velocity fields. Our results demonstrate the potential of field-based representations to model cosmic structure evolution, offering a complementary path to Lagrangian methods in contexts where field-level data is more accessible.

中文标题/摘要

标题：使用生成对抗网络为基础的自编码器预测大规模宇宙结构演化

从初始条件预测非线性宇宙结构的演化通常使用拉格朗日、粒子基方法。这些技术在追踪单个轨迹方面表现出色，但在点基信息不可用或不切实际的应用中可能不太适用。在本工作中，我们探索了一种替代的场基方法，使用欧拉输入。具体来说，我们开发了一种基于生成对抗网络（GAN）的自编码器架构，并训练其演化来自暗物质N体模拟的密度场。我们在2D和3D数据上测试了该方法。我们发现，仅基于密度的2D密度图预测表现良好，而准确的3D预测需要包含相关的速度场。我们的结果表明，场基表示法在建模宇宙结构演化方面具有潜力，为在场级数据更易获取的背景下提供了一种与拉格朗日方法互补的路径。

Summary / 总结

This study aims to predict the nonlinear evolution of cosmic structure using a field-based approach with a GAN-based autoencoder, as an alternative to Lagrangian methods. The autoencoder was trained on density fields from dark matter N-body simulations and tested on 2D and 3D data. The results show that while 2D predictions are accurate based on density alone, 3D predictions require the inclusion of velocity fields for accuracy.

该研究旨在使用基于生成对抗网络（GAN）的自编码器，以场为基础的方法来预测宇宙结构的非线性演化，作为拉格朗日、粒子方法的替代方案。自编码器被训练在来自暗物质N体模拟的密度场数据上，并在2D和3D数据上进行了测试。结果表明，虽然基于密度的2D预测是准确的，但3D预测需要包含速度场才能准确。这表明场基方法在场级数据更易获取的背景下，有可能在建模宇宙结构演化方面发挥重要作用。

ProGuard: Towards Proactive Multimodal Safeguard

Authors: Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao

First: 2025-12-29T16:13:23+00:00 · Latest: 2025-12-29T16:13:23+00:00

Abs · PDF · Code1 · Code2

Abstract

The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.

中文标题/摘要

标题：ProGuard：面向主动多模态保护

生成模型的快速进化导致了持续出现的多模态安全风险，暴露了现有防御方法的局限性。为应对这些挑战，我们提出了ProGuard，这是一种视觉-语言主动保护，能够在无需传统反应式方法所需的模型调整的情况下，识别和描述离分布（OOD）的安全风险。我们首先构建了一个包含87,000个样本的模态平衡数据集，每个样本都标注了二元安全标签和在多层次多模态安全分类学下的风险类别，有效缓解了模态偏差，确保了文本、图像和图文输入的一致性审查。基于此数据集，我们通过强化学习（RL）训练我们的视觉-语言基础模型，以实现高效和简洁的推理。为了在受控环境中近似主动安全场景，我们进一步引入了离分布安全类别推理任务，并通过基于同义词库的相似性奖励来增强RL目标，鼓励模型为未见过的不安全类别生成简洁描述。实验结果表明，ProGuard在二元安全分类上的性能与闭源大型模型相当，在不安全内容分类上显著优于现有开源保护模型。最值得注意的是，ProGuard提供了强大的主动审查能力，将离分布风险检测提高了52.6%，离分布风险描述提高了64.8%。

Summary / 总结

ProGuard is designed to address the emerging multimodal safety risks by identifying and describing out-of-distribution (OOD) risks using a modality-balanced dataset of 87K samples. The model is trained through reinforcement learning to perform efficient and concise reasoning. ProGuard outperforms existing open-source guard models in unsafe content categorization and demonstrates a strong proactive moderation ability, improving OOD risk detection and description by 52.6% and 64.8%, respectively.

ProGuard旨在通过使用87K样本的模态平衡数据集来识别和描述生成模型中的分布外（OOD）风险。该模型通过强化学习进行训练，以实现高效和简洁的推理。ProGuard在不安全内容分类上优于现有的开源防护模型，并展示了强大的主动管理能力，OOD风险检测和描述分别提高了52.6%和64.8%。
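
The ProGuard abstract mentions a synonym-bank-based similarity reward but does not give its formula. The following is only a minimal sketch of the general idea, assuming a token-level Jaccard similarity and a reward equal to the best match against any synonym in the bank. The function names `jaccard` and `synonym_bank_reward` and the choice of similarity measure are illustrative assumptions, not the authors' method.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def synonym_bank_reward(description, synonym_bank):
    """Reward a generated category description by its best match in the bank.

    Hypothetical helper: returns 0.0 when the bank is empty.
    """
    return max((jaccard(description, s) for s in synonym_bank), default=0.0)
```

Taking the maximum over the bank means a description is rewarded for matching any accepted phrasing of the unseen category, rather than one fixed label.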

ThinkGen: Generalized Thinking for Visual Generation

Authors: Siyu Jiao, Yiheng Lin, Yujie Zhong, Qi She, Wei Zhou, Xiaohan Lan, Zilong Huang, Fei Yu, Yingchen Yu, Yunqing Zhao, Yao Zhao, Yunchao Wei

First: 2025-12-29T16:08:50+00:00 · Latest: 2025-12-29T16:08:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen

中文标题/摘要

标题：ThinkGen: 通用视觉生成的思考方法

多模态大型语言模型（MLLMs）的近期进展表明，链式思考（CoT）推理能够系统地解决复杂的理解任务。然而，其在生成任务中的扩展仍处于初级阶段，并受到特定场景机制的限制，这阻碍了其泛化和适应能力。在本文中，我们提出了ThinkGen，这是第一个通过明确利用MLLM的CoT推理来处理各种生成场景的思考驱动的视觉生成框架。ThinkGen采用解耦架构，包括一个预训练的MLLM和一个扩散变换器（DiT），MLLM根据用户意图生成定制化的指令，DiT则根据这些指令生成高质量的图像。我们还提出了一种分离的GRPO训练范式（SepGRPO），交替强化学习MLLM和DiT模块。这种灵活的设计使得跨多种数据集的联合训练成为可能，从而促进了广泛生成场景中的有效CoT推理。大量实验表明，ThinkGen在多个生成基准测试中实现了稳健的、最先进的性能。代码可在：https://github.com/jiaosiyuu/ThinkGen 获取

Summary / 总结

ThinkGen is a think-driven visual generation framework that uses the Chain-of-Thought (CoT) reasoning capability of Multimodal Large Language Models (MLLMs) to generate high-quality images across various scenarios. It consists of a pretrained MLLM that generates instructions based on user intent, and a Diffusion Transformer (DiT) that produces images guided by these instructions. ThinkGen employs a separable GRPO-based training paradigm to enable joint training across diverse datasets, enhancing CoT reasoning for generative tasks. Experimental results show that ThinkGen outperforms existing methods on multiple generation benchmarks.

ThinkGen 是一种基于链式思考的视觉生成框架，利用多模态大型语言模型进行图像生成，适用于多种场景。它包括一个预训练的 MLLM 生成基于用户意图的指令，以及一个扩散变换器生成高质量的图像。ThinkGen 采用 SepGRPO 训练范式，实现跨多种数据集的联合训练，增强链式思考能力。实验表明，ThinkGen 在多个生成基准上优于现有方法。
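
The ThinkGen abstract describes SepGRPO as alternating GRPO-based reinforcement learning between the MLLM and DiT modules. As a rough illustration, the sketch below shows the group-relative advantage normalization commonly associated with GRPO (each reward is centered and scaled within its sampling group) plus a trivial alternation schedule. The function names, the use of the population standard deviation, and the strict round-robin schedule are assumptions; the paper's actual training loop is not reproduced here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one sampling group.

    Assumption: population standard deviation; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def alternating_schedule(n_rounds, modules=("mllm", "dit")):
    """Hypothetical round-robin plan: which module is updated in each round."""
    return [modules[i % len(modules)] for i in range(n_rounds)]
```

When all rewards in a group are equal, every advantage is zero, so that group contributes no learning signal.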

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

Authors: Dimitra Maoutsa

First: 2025-12-29T16:06:08+00:00 · Latest: 2025-12-29T16:06:08+00:00

Comments: 12+50 pages, 6 figures; An earlier account of this work has previously appeared in arXiv:2301.08102 and arXiv:2304.00423 ; main methodology remains the same, this version includes additional numerical experiments and theory

Abs · PDF · Code1 · Code2

Abstract

How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system's invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

中文标题/摘要

标题：从几何学到动力学：基于几何约束的稀疏观测过阻尼朗之万动力学学习

当随机系统轨迹在时间上稀疏采样时，我们如何学习其动力学背后的规律？现有方法要么需要高频率的时间分辨观测，要么依赖仅适用于保守系统的几何论证，限制了它们能恢复的动力学范围。在这里，我们提出了一种新的框架，通过将推理重新表述为随机控制问题来弥合这两种视角。该方法利用几何驱动的路径增强，通过系统不变密度中的几何结构来重构可能的轨迹并推断出潜在的动力学，而不假设特定的参数模型。应用于过阻尼朗之万系统时，我们的方法即使在极度欠采样的数据中也能准确恢复随机动力学，优于现有方法在合成基准测试中的表现。这项工作展示了将几何归纳偏置纳入随机系统识别方法的有效性。

Summary / 总结

The research addresses the challenge of inferring the dynamics of stochastic systems from sparse temporal observations. It introduces a framework that combines geometric insights with stochastic control to reconstruct likely trajectories and infer underlying dynamics without assuming specific models. The method outperforms existing techniques in synthetic benchmarks, particularly for undersampled data from overdamped Langevin systems, by leveraging geometric constraints in the system's invariant density.

研究解决了从稀疏时间观测中推断随机系统动力学的挑战。它提出了一种结合几何洞察与随机控制的框架，以重构可能的轨迹并推断出潜在的动力学，而不假设特定模型。该方法在合成基准测试中优于现有技术，特别是在来自过阻尼朗之万系统的稀疏数据方面，通过利用系统不变密度中的几何约束来实现。
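
To make the setting of the abstract above concrete, the sketch below simulates an overdamped Langevin process, dX = f(X) dt + sigma dW, with the standard Euler-Maruyama scheme and then keeps only every k-th sample, which mimics observations that are sparse in time. For a conservative drift f = -U'(x), the invariant density of such a process is proportional to exp(-2U(x)/sigma^2), which is the kind of geometric information the abstract refers to. The double-well drift, the step size, and all function names are illustrative assumptions; the paper's inference method itself is not implemented here.

```python
import math
import random


def simulate_overdamped_langevin(drift, sigma, x0, dt, n_steps, seed=0):
    """Euler-Maruyama integration of dX = drift(X) dt + sigma dW in 1D."""
    rng = random.Random(seed)
    x = x0
    path = [x0]
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + drift(x) * dt + sigma * math.sqrt(dt) * noise
        path.append(x)
    return path


def subsample(path, every):
    """Keep every `every`-th state to mimic sparse-in-time observations."""
    return path[::every]


def double_well_drift(x):
    """Drift f(x) = -U'(x) for the assumed potential U(x) = (x^2 - 1)^2 / 4."""
    return x - x ** 3
```

For example, `subsample(simulate_overdamped_langevin(double_well_drift, 0.5, 0.0, 0.01, 1000), 100)` keeps 11 of the 1001 simulated states.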

RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Authors: Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke

First: 2025-12-29T16:05:38+00:00 · Latest: 2025-12-29T16:05:38+00:00

Abs · PDF · Code1 · Code2

Abstract

The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

中文标题/摘要

标题：RxnBench：用于评估大型语言模型在科学文献中理解化学反应能力的多模态基准

将多模态大型语言模型（MLLMs）集成到化学中有望彻底改变科学发现，但它们理解科学文献中密集的图形语言的能力尚未得到充分探索。在这里，我们介绍了RxnBench，这是一个多层次基准，旨在严格评估MLLMs在科学PDF中理解化学反应的能力。RxnBench 包含两个任务：单图QA（SF-QA），测试细粒度的视觉感知和机制推理，使用来自305个精心策划的反应方案的1,525个问题；以及全文QA（FD-QA），挑战模型从108篇文章中综合信息，需要跨模态整合文本、方案和表格。我们的评估表明，模型在提取显式文本方面表现出色，但在深入的化学逻辑和精确的结构识别方面存在关键能力差距。值得注意的是，具有推理时推理的模型显著优于标准架构，但没有一个在FD-QA上达到50%的准确率。这些发现强调了迫切需要领域特定的视觉编码器和更强的推理引擎，以推进自主人工智能化学家。

Summary / 总结

RxnBench is a benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to understand chemical reactions from scientific literature. It includes two tasks: Single-Figure QA (SF-QA) and Full-Document QA (FD-QA). SF-QA tests visual perception and mechanistic reasoning, while FD-QA requires models to integrate information from text, reaction schemes, and tables. The evaluation shows that while models perform well in extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning outperform standard architectures, but none achieve 50% accuracy on FD-QA, highlighting the need for domain-specific visual encoders and stronger reasoning engines.

RxnBench 是一个用于评估多模态大型语言模型在科学文献中理解化学反应能力的基准。它包括两个任务：单图问答和全文问答。评估结果显示，虽然这些模型可以提取显性文本，但在深入的化学逻辑和精确的结构识别方面存在困难。具有推理时推理能力的模型优于标准架构，但没有一个在全文问答任务中达到50%的准确率，这突显了需要更好的视觉编码器和推理引擎的必要性。

VL-RouterBench: A Benchmark for Vision-Language Model Routing

Authors: Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang

First: 2025-12-29T16:01:19+00:00 · Latest: 2025-12-29T16:01:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.

中文标题/摘要

标题：VL-RouterBench：视觉-语言模型路由基准

多模型路由已从工程技巧演变为必不可少的基础架构，但现有工作缺乏系统且可复现的基准来评估视觉-语言模型（VLMs）。我们提出了VL-RouterBench以系统地评估VLM路由系统的整体能力。基准以视觉-语言模型的原始推理和评分日志为基础，构建样本-模型对的质量和成本矩阵。在规模上，VL-RouterBench覆盖了3个任务组中的14个数据集，总计30,540个样本，包括15个开源模型和2个API模型，生成了519,180个样本-模型对，总输入输出标记量为34,494,977。评估协议联合测量平均准确率、平均成本和吞吐量，并通过归一化成本和准确率的调和平均值构建排名分数，以在不同的路由配置和成本预算下进行比较。在该基准上，我们评估了10种路由方法和基线，并观察到显著的可路由性提升，而当前最佳路由器仍与理想的Oracle存在明显差距，表明通过更精细的视觉线索和文本结构建模，路由架构仍有很大的改进空间。我们将开源完整的数据构建和评估工具链，以促进多模态路由研究中的可比性、可复现性和实际部署。

Summary / 总结

VL-RouterBench is designed to evaluate vision-language model routing systems by analyzing raw inference and scoring logs from 15 open-source and 2 API models across 14 datasets. It evaluates 10 routing methods and baselines, showing a significant improvement in routability but still falling short of the ideal Oracle. The benchmark covers 30,540 samples and 519,180 sample-model pairs, with a total input-output token volume of 34,494,977, and measures accuracy, cost, and throughput to rank router configurations effectively.

VL-RouterBench 通过分析来自15个开源和2个API模型的14个数据集的原始推理和评分日志，评估10种路由方法和基线，显示出显著的可路由性提升，但仍低于理想的Oracle。该基准覆盖了30,540个样本和519,180个样本-模型对，总输入-输出标记量为34,494,977，并通过衡量准确率、成本和吞吐量来有效排名路由配置。
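
The VL-RouterBench abstract says the ranking score is built from the harmonic mean of normalized cost and accuracy, but it does not state the normalization. The sketch below assumes min-max normalization of cost, flipped so that cheaper routers score closer to 1, and accuracy already in [0, 1]. The function names and the normalization are assumptions for illustration. The last lines check the arithmetic behind the reported number of sample-model pairs.

```python
def normalized_cost_score(cost, min_cost, max_cost):
    """Map cost to [0, 1], where 1 is the cheapest (assumed min-max scaling)."""
    if max_cost == min_cost:
        return 1.0
    return (max_cost - cost) / (max_cost - min_cost)


def rank_score(accuracy, cost, min_cost, max_cost):
    """Harmonic mean of accuracy and the normalized cost score."""
    cost_score = normalized_cost_score(cost, min_cost, max_cost)
    if accuracy + cost_score == 0:
        return 0.0
    return 2 * accuracy * cost_score / (accuracy + cost_score)


# Pair count from the abstract: samples times (open-source + API models).
samples = 30_540
models = 15 + 2
pairs = samples * models  # 519,180, matching the reported figure
```

A harmonic mean penalizes imbalance: a router that is very accurate but very expensive (or the reverse) scores lower than one that is moderately good on both axes.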

Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks

Authors: Toqeer Ali Syed, Mishal Ateeq Almutairi, Mahmoud Abdel Moaty

First: 2025-12-29T15:54:33+00:00 · Latest: 2025-12-29T15:54:33+00:00

Comments: Accepted as a conference paper at ICCA 2025, Bahrain, 21 to 23 December

Abs · PDF · Code1 · Code2

Abstract

Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent-to-agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. To mitigate these risks, this paper proposes a Cross-Agent Multimodal Provenance-Aware Defense Framework in which all prompts, whether user-generated or produced by upstream agents, are sanitized, and all outputs generated by an LLM are verified independently before being sent to downstream nodes. The framework contains a text sanitizer agent, a visual sanitizer agent, and an output validator agent, all coordinated by a provenance ledger that keeps metadata on modality, source, and trust level throughout the entire agent network. This architecture ensures that agent-to-agent communication abides by clear trust frames, such that injected instructions are not propagated down LangChain or GraphChain-style workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, cross-agent trust leakage is minimized, and agentic execution pathways become stable. The framework, which extends the concept of provenance tracking and validation to multi-agent orchestration, supports the establishment of secure, understandable, and reliable agentic AI systems.

中文标题/摘要

标题：迈向可信赖的代理AI：一种多模态防提示注入攻击框架

大型语言模型（LLMs）、视觉-语言模型（VLMs）以及新的代理AI系统（如LangChain和GraphChain）使得能够使用和在众多工具和代理之间进行推理、规划和对话的强大自主系统成为可能。然而，这种代理环境增加了多模态提示注入（PI）攻击的可能性，其中隐藏或恶意指令可能通过图传播，导致意外行为、政策违规或状态破坏。为了减轻这些风险，本文提出了一种跨代理多模态来源感知防御框架，该框架对所有提示（无论是用户生成的还是上游代理生成的）进行清理，并在发送给下游节点之前独立验证LLM生成的所有输出。该框架包括一个文本清理代理、视觉清理代理和输出验证代理，所有这些代理都由一个来源账本协调，该账本在整个代理网络中记录模态、来源和信任水平的元数据。这种架构确保了代理间的通信遵守清晰的信任框架，防止注入指令在LangChain或GraphChain风格的工作流中传播。实验评估表明，多模态注入检测准确性显著提高，跨代理信任泄露最小化，代理执行路径变得稳定。该框架将来源跟踪和验证的概念扩展到多代理编排，增强了安全、可理解且可靠的代理AI系统的建立。

Summary / 总结

This paper addresses the risk of multimodal prompt injection attacks in agentic AI systems, proposing a Cross-Agent Multimodal Provenance-Aware Defense Framework. The framework includes text and visual sanitizers and an output validator, all coordinated by a provenance ledger to ensure secure and trustworthy agent-to-agent communication. Experiments demonstrate improved detection accuracy and minimized trust leakage, contributing to more stable and secure agentic AI systems.

本文针对多模态提示注入攻击在代理AI系统中的风险，提出了一种跨代理多模态溯源感知防御框架。该框架包括文本和视觉净化器以及输出验证器，并由溯源日志协调，以确保元数据的完整性。实验结果表明，检测准确性显著提高，跨代理信任泄露减少，代理执行路径更加稳定。
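
The abstract above describes sanitizer agents coordinated by a provenance ledger that records modality, source, and trust level. The toy sketch below shows one way such a ledger could gate text messages before they reach downstream nodes. Every name in it (`LedgerEntry`, `ProvenanceLedger`, `looks_injected`, `sanitize_and_forward`), the trust labels, and the phrase-matching check are hypothetical placeholders; the paper's actual detectors and data model are not described in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    message_id: int
    modality: str   # e.g. "text" or "image"
    source: str     # e.g. "user" or an upstream agent name
    trust: str      # assumed labels: "sanitized" or "quarantined"


@dataclass
class ProvenanceLedger:
    entries: list = field(default_factory=list)

    def record(self, modality, source, trust):
        entry = LedgerEntry(len(self.entries), modality, source, trust)
        self.entries.append(entry)
        return entry


# Placeholder patterns; a real sanitizer would need a far stronger detector.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "disregard the system prompt")


def looks_injected(text):
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def sanitize_and_forward(ledger, text, source):
    """Log the message, then return it only if it was not flagged."""
    flagged = looks_injected(text)
    entry = ledger.record("text", source, "quarantined" if flagged else "sanitized")
    return (None if flagged else text), entry
```

Because flagged messages are logged but never returned, a downstream node only ever sees content whose ledger entry is marked as sanitized.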

Scaling Laws for Energy Efficiency of Local LLMs

Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús

First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-29T15:54:23+00:00

Abs · PDF · Code1 · Code2

Abstract

Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.

中文标题/摘要

标题：局部LLM能效的标度律

在边缘设备上部署局部大型语言模型和视觉-语言模型需要在准确性与受限的计算和能源预算之间进行权衡。尽管图形处理器主导了现代人工智能部署，但大多数消费级硬件（包括笔记本电脑、台式机、工业控制器和嵌入式系统）仍依赖于中央处理器。尽管如此，仅中央处理器的推理计算法则在局部语言和视觉-语言工作负载中的研究仍相对较少。我们系统地在两个广泛用于局部推理的中央处理器层级上对大型语言和视觉-语言模型进行了基准测试：一台搭载M2芯片的MacBook Pro，代表主流笔记本电脑级部署，以及一个Raspberry Pi 5，代表受限的、低功耗嵌入式设置。基于连续采样处理器和内存使用情况并结合面积-曲线积分的方法，我们描述了计算负载随输入文本长度对语言模型和随图像分辨率对视觉-语言模型的标度关系。我们发现了两条经验标度律：（1）语言模型推理的计算成本大约与标记长度成线性关系；（2）视觉-语言模型表现出预处理驱动的“分辨率拐点”，其中计算在内部分辨率限制以上保持恒定，在以下则急剧下降。除了这些标度律，我们还表明，基于量子启发的压缩可将处理器和内存使用量最多减少71.9%，能耗最多减少62%，同时保持或提高语义准确性。这些结果提供了局部语言和视觉-语言工作负载的多模态中央处理器仅计算法则的系统量化，并指出了模型压缩和输入分辨率预处理作为可持续边缘推理的有效、低成本杠杆。

Summary / 总结

This study explores the energy efficiency of deploying large language models and vision-language models on edge devices, focusing on central processing units. By benchmarking these models on a MacBook Pro M2 and a Raspberry Pi 5, the researchers identify two scaling laws: computational cost for language models scales linearly with token length, while vision-language models show a resolution knee where compute remains constant above a certain resolution and decreases below it. Additionally, quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while maintaining or improving semantic accuracy.

研究探讨了在边缘设备上部署大型语言模型和视觉语言模型时中央处理单元的能量效率。通过在MacBook Pro M2和Raspberry Pi 5上进行基准测试，研究人员发现两个缩放定律：语言模型的计算成本随词元长度线性增加，而视觉语言模型表现出预处理驱动的“分辨率拐点”，即在某个分辨率以上计算量保持不变，在以下则急剧下降。此外，量子启发式压缩可将处理器和内存使用量最多减少71.9%，能量消耗最多减少62%，同时保持或提高语义准确性。
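
The methodology in the abstract above combines continuous sampling of processor and memory usage with area-under-curve integration, and reports that language-model cost scales approximately linearly with token length. The sketch below shows both pieces in their simplest form: trapezoidal integration of a sampled usage trace, and an ordinary least-squares line through (token length, integrated cost) points. The function names are assumptions, and no measurements from the paper are reproduced.

```python
def area_under_curve(times, values):
    """Trapezoidal integral of sampled usage over time."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return total


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In use, each inference run would yield one integrated cost via `area_under_curve`, and `linear_fit` over runs at different token lengths would estimate the per-token cost as the slope.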

Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs

Authors: Sahil Kale, Antonio Luca Alfeo

First: 2025-12-29T15:41:13+00:00 · Latest: 2025-12-29T15:41:13+00:00

Comments: Accepted to ICPRAM 2026 in Marbella, Spain

Abs · PDF · Code1 · Code2

Abstract

Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.

中文标题/摘要

标题：谎言之测：知识图谱在LLM幻觉自检测中的稳健应用

幻觉，生成看似真实但实际上虚假的陈述，仍然是安全部署LLM的主要障碍。基于自检测方法的出色表现，我们探讨了使用结构化知识表示，即知识图谱，以提高幻觉自检测的效果。具体而言，我们提出了一种简单而强大的方法，通过(i) 将LLM响应转换为实体和关系的知识图谱，以及(ii) 利用这些图谱估计响应中包含幻觉的可能性，来丰富幻觉自检测。我们使用两个广泛使用的LLM，GPT-4o和Gemini-2.5-Flash，在两个幻觉检测数据集上评估了所提出的方法。为了支持更可靠的未来基准测试，其中一个数据集已被手动整理和增强，并作为本工作的次要成果发布。与标准自检测方法和SelfCheckGPT（一种最先进的方法）相比，我们的方法在准确性和F1分数上分别实现了高达16%和20%的相对改进。我们的结果表明，当原子事实以知识图谱形式呈现时，即使初始输出包含不准确信息，LLM也能更好地分析这些事实。这一低成本、模型无关的方法为更安全和可信赖的语言模型铺平了道路。

Summary / 总结

This paper addresses the challenge of hallucinations in LLMs by proposing a method that converts LLM responses into knowledge graphs to improve self-detection of false statements. The method uses these graphs to estimate the likelihood of hallucinations and achieves up to 16% relative improvement in accuracy and 20% in F1-score compared to existing methods. The approach is model-agnostic and low-cost, enhancing the analysis of atomic facts even when initial outputs are inaccurate, thus contributing to safer and more trustworthy language models.

该论文通过将LLM响应转换为知识图谱来解决虚假陈述问题，提出了一种改进自我检测的方法。该方法利用响应中的实体和关系来估计幻觉的可能性，相比现有方法，准确率最高可提高16%，F1分数提高20%。这种方法具有低成本且模型无关性，有助于使LLM更加安全和可信，通过更好地分析以知识图谱形式结构化的原子事实。
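
The abstract above describes two steps: converting a response into a knowledge graph of entities and relations, then using the graph to estimate how likely the response is to contain hallucinations. It does not say how that estimate is computed. The sketch below assumes the simplest possible aggregation, the fraction of triples that a checker does not support, and leaves the checker as a caller-supplied function (in the paper's setting this role would be played by the LLM itself). `Triple`, `hallucination_score`, and the aggregation rule are illustrative assumptions.

```python
from typing import Callable, NamedTuple, Sequence


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


def hallucination_score(
    triples: Sequence[Triple],
    is_supported: Callable[[Triple], bool],
) -> float:
    """Fraction of triples the checker rejects (assumed aggregation rule)."""
    if not triples:
        return 0.0
    unsupported = sum(1 for t in triples if not is_supported(t))
    return unsupported / len(triples)
```

Scoring triple by triple is what lets each atomic fact be checked on its own, which is the benefit the abstract attributes to the graph structure.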

Diffusion MRI with Machine Learning

Authors: Davood Karimi, Simon K. Warfield

First: 2024-01-01T13:03:35+00:00 · Latest: 2025-12-29T15:36:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Diffusion-weighted magnetic resonance imaging (dMRI) of the brain offers unique capabilities including noninvasive probing of tissue microstructure and structural connectivity. It is widely used for clinical assessment of disease and injury, and for neuroscience research. Analyzing the dMRI data to extract useful information for medical and scientific purposes can be challenging. The dMRI measurements may suffer from strong noise and artifacts, and may exhibit high inter-session and inter-scanner variability in the data, as well as inter-subject heterogeneity in brain structure. Moreover, the relationship between measurements and the phenomena of interest can be highly complex. Recent years have witnessed increasing use of machine learning methods for dMRI analysis. This manuscript aims to assess these efforts, with a focus on methods that have addressed data preprocessing and harmonization, microstructure mapping, tractography, and white matter tract analysis. We study the main findings, strengths, and weaknesses of the existing methods and suggest topics for future research. We find that machine learning may be exceptionally suited to tackle some of the difficult tasks in dMRI analysis. However, for this to happen, several shortcomings of existing methods and critical unresolved issues need to be addressed. There is a pressing need to improve evaluation practices, to increase the availability of rich training datasets and validation benchmarks, and to address concerns about model generalizability, reliability, and explainability.

中文标题/摘要

标题：扩散MRI与机器学习

扩散加权磁共振成像（dMRI）的脑部成像提供了独特的功能，包括无创探查组织微观结构和结构连接。它广泛用于临床疾病和损伤评估，以及神经科学研究。分析dMRI数据以提取对医疗和科学研究有用的信息具有挑战性。dMRI测量可能受到强烈噪声和伪影的影响，并且数据在不同会话和不同扫描器之间表现出高变异性，同时不同个体的脑结构也存在异质性。此外，测量与感兴趣现象之间的关系可能非常复杂。近年来，dMRI分析中使用机器学习方法的数量不断增加。本文旨在评估这些努力，重点关注解决数据预处理和标准化、微观结构映射、纤维追踪和白质纤维分析的方法。我们研究了现有方法的主要发现、优势和不足，并建议未来研究的主题。我们发现，机器学习可能特别适合解决dMRI分析中的某些困难任务。然而，为了实现这一点，需要解决现有方法的若干不足和关键未解决的问题。迫切需要改进评估实践，增加丰富的训练数据集和验证基准的可用性，以及提高模型的一般性、可靠性和可解释性。

Summary / 总结

The paper aims to assess the use of machine learning in diffusion MRI (dMRI) analysis, focusing on data preprocessing, microstructure mapping, tractography, and white matter tract analysis. It highlights the challenges in dMRI data, such as noise, artifacts, and variability, and discusses the strengths and weaknesses of existing machine learning methods. The study finds that machine learning can effectively handle complex tasks in dMRI analysis but emphasizes the need to address shortcomings and unresolved issues, including improving evaluation practices and enhancing model generalizability and explainability.

该论文探讨了机器学习在扩散MRI (dMRI) 分析中的应用，解决了噪声、伪影和数据变异性等挑战。它回顾了数据预处理、微结构映射、纤维追踪和白质纤维分析的方法，强调了这些方法的优点和不足。研究结论指出，机器学习可以有效处理复杂的dMRI任务，但强调需要改进评估实践、增加丰富的训练数据集和提高模型的可靠性和可解释性。

PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

Authors: Shengyi Hua, Jianfeng Wu, Tianle Shen, Kangzhe Hu, Zhongzhen Huang, Shujuan Ni, Zhihong Zhang, Yuan Li, Zhe Wang, Xiaofan Zhang

First: 2025-12-29T15:34:27+00:00 · Latest: 2025-12-29T15:34:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.

中文标题/摘要

标题：PathFound：一种激活证据寻求式病理诊断的智能体多模态模型

近期的病理性基础模型在视觉表示学习和多模态交互方面取得了显著进展。然而，大多数模型仍然依赖于静态推理范式，在这种范式中，全切片图像仅处理一次以生成预测，而不会在模糊诊断下进行重新评估或有针对性的证据获取。这与临床诊断工作流程形成对比，后者通过反复观察切片并进一步提出检查要求来细化假设。我们提出PathFound，一种旨在支持病理性诊断中证据寻求推理的代理多模态模型。PathFound 结合了病理性视觉基础模型、视觉语言模型和通过强化学习训练的推理模型的力量，通过初始诊断、证据寻求和最终决策阶段的进展来进行主动信息获取和诊断细化。在多个大型多模态模型中，采用这种策略始终提高了诊断准确性，表明在计算病理学中证据寻求工作流程的有效性。在这些模型中，PathFound 在多种临床场景中实现了最先进的诊断性能，并展示了发现细微特征（如核特征和局部侵袭）的强大潜力。

Summary / 总结

PathFound is an agentic multimodal model that supports evidence-seeking inference in pathological diagnosis. It combines pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning, and refines its diagnosis in three stages: initial diagnosis, evidence-seeking, and final decision. Adopting this evidence-seeking strategy consistently improves diagnostic accuracy across several large multimodal models. Among them, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and shows strong potential for discovering subtle details such as nuclear features and local invasions.

PathFound 是一种支持病理诊断中证据寻求推理的智能体多模态模型，它结合了病理视觉基础模型、视觉语言模型和通过强化学习训练的推理模型，按初始诊断、证据寻求和最终决策三个阶段逐步优化诊断。在多种大型多模态模型上采用这一策略均能提高诊断准确性，表明证据寻求工作流程在计算病理学中行之有效；其中 PathFound 在多种临床场景中取得了最先进的诊断性能，并展现出发现核特征、局部侵袭等细微特征的强大潜力。
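The three stages named in the abstract (initial diagnosis, evidence-seeking, final decision) can be sketched as a small control loop. This is only an illustration under stated assumptions, not PathFound's implementation: `Diagnosis`, `diagnose_with_evidence`, the confidence threshold, the round limit, and the toy callables are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Diagnosis:
    label: str
    confidence: float  # in [0, 1]


def diagnose_with_evidence(
    initial: Callable[[], Diagnosis],
    seek_evidence: Callable[[Diagnosis], str],
    refine: Callable[[Diagnosis, List[str]], Diagnosis],
    threshold: float = 0.9,
    max_rounds: int = 3,
) -> Diagnosis:
    """Hypothetical loop: initial diagnosis, evidence-seeking, final decision."""
    current = initial()                  # stage 1: a single pass over the slide
    evidence: List[str] = []
    for _ in range(max_rounds):          # stage 2: ask for evidence while ambiguous
        if current.confidence >= threshold:
            break
        evidence.append(seek_evidence(current))
        current = refine(current, evidence)
    return current                       # stage 3: final decision


# Toy stand-ins for the model components (assumptions, for illustration only).
def toy_initial() -> Diagnosis:
    return Diagnosis("suspicious", 0.55)


def toy_seek(d: Diagnosis) -> str:
    return f"high-magnification view for '{d.label}'"


def toy_refine(d: Diagnosis, evidence: List[str]) -> Diagnosis:
    return Diagnosis("carcinoma", min(1.0, d.confidence + 0.2 * len(evidence)))


result = diagnose_with_evidence(toy_initial, toy_seek, toy_refine)
```

A static single-pass pipeline corresponds to `max_rounds=0`, which returns the stage-1 output unchanged.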

Act2Goal: From World Model To General Goal-conditioned Policy

Authors: Pengfei Zhou, Liliang Chen, Shengcong Chen, Di Chen, Wenzhi Zhao, Rongjun Jin, Guanghui Ren, Jianlan Luo

First: 2025-12-29T15:28:42+00:00 · Latest: 2025-12-29T15:28:42+00:00

Abstract

Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: https://act2goal.github.io/

中文标题/摘要

标题：Act2Goal: 从世界模型到通用目标条件策略

以兼具表达力和精确性的方式指定机器人操作任务仍然是一个核心挑战。虽然视觉目标提供了一种紧凑且无歧义的任务规范，但现有的目标条件策略往往难以处理长时序操作，因为它们依赖于单步动作预测，而没有显式建模任务进展。我们提出了Act2Goal，这是一种通用的目标条件操作策略，将目标条件视觉世界模型与多尺度时间控制相结合。给定当前观测和视觉目标，世界模型生成一个合理的中间视觉状态序列，以捕捉长时序结构。为了将这个视觉计划转化为稳健的执行，我们引入了多尺度时间哈希（MSTH），它将想象的轨迹分解为用于细粒度闭环控制的密集近端帧，以及用于锚定全局任务一致性的稀疏远端帧。策略通过端到端的交叉注意力将这些表示与运动控制耦合，从而实现连贯的长时序行为，同时对局部干扰保持响应。Act2Goal 对新物体、新空间布局和新环境表现出很强的零样本泛化能力。我们进一步通过后见目标重标记结合基于LoRA的微调，实现了无奖励的在线适应，使系统无需外部监督即可快速自主改进。真实机器人实验表明，Act2Goal 在几分钟的自主交互内，将具有挑战性的分布外任务的成功率从30%提高到90%，验证了结合多尺度时间控制的目标条件世界模型能够为稳健的长时序操作提供必要的结构化指导。项目页面：https://act2goal.github.io/

Summary / 总结

Act2Goal is a general goal-conditioned manipulation policy that pairs a goal-conditioned visual world model with multi-scale temporal control. Given the current observation and a visual goal, the world model generates a sequence of intermediate visual states that captures long-horizon task structure. Multi-Scale Temporal Hashing (MSTH) then splits this imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy generalizes zero-shot to novel objects, spatial layouts, and environments, and reward-free online adaptation (hindsight goal relabeling with LoRA-based finetuning) raises real-robot success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction.

Act2Goal通过结合目标条件的视觉世界模型和多尺度时间控制来解决机器人操作任务的表达性和精确性问题。它生成一系列中间视觉状态以捕捉长期任务结构，并使用多尺度时间哈希进行稳健执行。实验表明，它具有强大的零样本泛化能力和通过无奖励在线适应快速改进的能力，显著将挑战任务的成功率从30%提高到90%。
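The split into dense proximal frames and sparse distal frames can be illustrated with a simple index-selection routine. The abstract does not give the actual MSTH sampling rule, so the scheme below (a consecutive near-term window plus evenly spaced far-term anchors ending at the goal frame) and the names `split_trajectory`, `n_dense`, and `n_sparse` are assumptions made for this sketch.

```python
from typing import List, Tuple


def split_trajectory(
    num_frames: int, n_dense: int = 4, n_sparse: int = 3
) -> Tuple[List[int], List[int]]:
    """Pick dense proximal and sparse distal frame indices from an imagined trajectory.

    Assumed scheme (not from the paper): the first `n_dense` frames are kept
    consecutively; the rest of the horizon is sampled at evenly spaced points,
    always including the final (goal) frame.
    """
    if num_frames <= 0:
        return [], []
    dense = list(range(min(n_dense, num_frames)))
    remaining = num_frames - len(dense)
    if remaining <= 0 or n_sparse <= 0:
        return dense, []
    start, last = len(dense), num_frames - 1
    k = min(n_sparse, remaining)
    # Walk backwards from the goal frame in equal steps over [start, last].
    sparse = sorted({last - round(i * (last - start) / k) for i in range(k)})
    return dense, sparse


dense, sparse = split_trajectory(num_frames=32)
```

With 32 imagined frames this keeps frames 0 to 3 for closed-loop control and frames 13, 22, and 31 as long-range anchors.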

AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang

First: 2025-12-29T15:26:25+00:00 · Latest: 2025-12-29T15:26:25+00:00

Abstract

Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.

中文标题/摘要

标题：AnyMS：基于布局引导和无需训练的多主题定制中自底向上注意力解耦

多主题定制旨在将多个用户指定的主题合成到一个连贯的图像中。为了解决主题缺失或冲突等问题，最近的工作引入了布局指导以提供明确的空间约束。然而，现有方法仍然难以平衡文本对齐、主题身份保留和布局控制这三个关键目标，而依赖额外训练进一步限制了其可扩展性和效率。在本文中，我们提出了一种名为AnyMS的新颖的无需训练框架，用于布局引导的多主题定制。AnyMS利用三种输入条件：文本提示、主题图像和布局约束，并引入了一种自底向上的双层注意力解耦机制，以在生成过程中协调它们的整合。具体而言，全局解耦将文本和视觉条件之间的跨注意力分离，以确保文本对齐。局部解耦将每个主题的注意力限制在其指定区域内，从而防止主题冲突，从而保证身份保留和布局控制。此外，AnyMS使用预训练的图像适配器来提取与扩散模型对齐的主题特定特征，从而去除主题学习或适配器调优的需要。大量实验表明，AnyMS达到了最先进的性能，支持复杂的组合，并可扩展到更多的主题。

Summary / 总结

AnyMS is a training-free framework for layout-guided multi-subject customization that balances text alignment, subject identity preservation, and layout control. It uses a bottom-up dual-level attention decoupling mechanism to integrate text prompts, subject images, and layout constraints. Global decoupling separates cross-attention between textual and visual conditions to ensure text alignment, while local decoupling confines each subject's attention to its designated area, preventing subject conflicts and preserving identity and layout. Pre-trained image adapters extract subject-specific features, removing the need for subject learning or adapter tuning. Experiments show that AnyMS achieves state-of-the-art performance, supports complex compositions, and scales to a larger number of subjects.

AnyMS 是一个无需训练的框架，用于布局引导的多主体定制，旨在兼顾文本对齐、主体身份保留和布局控制。它使用自底向上的双层注意力解耦机制，在生成过程中整合文本提示、主体图像和布局约束。全局解耦确保文本对齐，而局部解耦将每个主体的注意力限制在其指定区域内以防止主体冲突，从而保持身份和布局。实验表明，AnyMS 达到了最先进的性能，能够支持复杂的组合并扩展到更多主体。
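Local decoupling, as described above, confines each subject's attention to its layout region. A minimal way to picture this is a layout-derived mask applied before the softmax. The sketch below simplifies heavily and is not the AnyMS implementation: it uses one logit per subject instead of per-token attention, and `layout_mask`, `masked_softmax`, and the box format are hypothetical.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open, in latent-grid cells


def layout_mask(height: int, width: int, boxes: List[Box]) -> List[List[bool]]:
    """mask[p][s] is True when grid position p (row-major) lies inside subject s's box."""
    mask = []
    for y in range(height):
        for x in range(width):
            mask.append([x0 <= x < x1 and y0 <= y < y1 for (x0, y0, x1, y1) in boxes])
    return mask


def masked_softmax(logits: List[float], allowed: List[bool]) -> List[float]:
    """Softmax restricted to allowed entries; disallowed entries get weight 0."""
    kept = [v for v, a in zip(logits, allowed) if a]
    if not kept:
        return [0.0] * len(logits)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) if a else 0.0 for v, a in zip(logits, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


boxes = [(0, 0, 2, 4), (2, 0, 4, 4)]   # subject 0 on the left half, subject 1 on the right
mask = layout_mask(4, 4, boxes)
# Position (x=0, y=0) belongs to subject 0, so subject 1 is masked out
# even though its raw logit is larger.
weights = masked_softmax([1.0, 3.0], mask[0])
```

Because the mask zeroes out subjects whose box does not cover a position, a strong feature from one subject cannot leak into another subject's region, which is the conflict the abstract says local decoupling prevents.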

Timepoint-Specific Benchmarking of Deep Learning Models for Glioblastoma Follow-Up MRI

Authors: Wenhao Guo, Golrokh Mirzaei

Venue: Cancers, 2026, 18(1), 36

First: 2025-11-23T19:38:03+00:00 · Latest: 2025-12-29T15:25:19+00:00

Comments: 15 pages, 4 figures

Abstract

Differentiating true tumor progression (TP) from treatment-related pseudoprogression (PsP) in glioblastoma remains challenging, especially at early follow-up. We present the first stage-specific, cross-sectional benchmarking of deep learning models for follow-up MRI using the Burdenko GBM Progression cohort (n = 180). We analyze different post-RT scans independently to test whether architecture performance depends on time-point. Eleven representative DL families (CNNs, LSTMs, hybrids, transformers, and selective state-space models) were trained under a unified, QC-driven pipeline with patient-level cross-validation. Across both stages, accuracies were comparable (~0.70-0.74), but discrimination improved at the second follow-up, with F1 and AUC increasing for several models, indicating richer separability later in the care pathway. A Mamba+CNN hybrid consistently offered the best accuracy-efficiency trade-off, while transformer variants delivered competitive AUCs at substantially higher computational cost and lightweight CNNs were efficient but less reliable. Performance also showed sensitivity to batch size, underscoring the need for standardized training protocols. Notably, absolute discrimination remained modest overall, reflecting the intrinsic difficulty of TP vs. PsP and the dataset's size imbalance. These results establish a stage-aware benchmark and motivate future work incorporating longitudinal modeling, multi-sequence MRI, and larger multi-center cohorts.

中文标题/摘要

标题：胶质母细胞瘤随访MRI中基于时间点的深度学习模型基准测试

在胶质母细胞瘤中，区分真正的肿瘤进展（TP）与治疗相关的假性进展（PsP）仍然具有挑战性，在早期随访中尤其如此。我们利用 Burdenko GBM Progression 队列（n = 180），首次对用于随访MRI的深度学习模型进行了阶段特异性的横断面基准测试。我们对不同的放疗后（post-RT）扫描分别进行分析，以检验架构性能是否依赖于时间点。在统一的、由质量控制驱动的流程中，采用患者级交叉验证训练了11个具有代表性的深度学习模型家族（CNN、LSTM、混合模型、Transformer 和选择性状态空间模型）。两个阶段的准确率相当（约0.70-0.74），但第二次随访时的区分能力有所提升，多个模型的F1和AUC均有提高，表明在诊疗路径的后期可分性更强。Mamba+CNN混合模型始终提供最佳的准确率-效率权衡，Transformer 变体以显著更高的计算成本取得了有竞争力的AUC，而轻量级CNN虽然高效但可靠性较低。性能还对批次大小敏感，凸显了标准化训练协议的必要性。值得注意的是，总体的绝对区分能力仍然有限，反映了区分TP与PsP的固有难度以及数据集的规模不平衡。这些结果建立了一个阶段感知的基准，并推动未来工作引入纵向建模、多序列MRI和更大规模的多中心队列。

Summary / 总结

The study benchmarks deep learning models for distinguishing true tumor progression (TP) from treatment-related pseudoprogression (PsP) on glioblastoma follow-up MRI, using the Burdenko GBM Progression cohort (n = 180). Eleven representative model families were trained under a unified pipeline with patient-level cross-validation, with each post-RT time point analyzed independently. Accuracy was comparable across both stages (~0.70-0.74), while F1 and AUC improved for several models at the second follow-up. A Mamba+CNN hybrid gave the best accuracy-efficiency trade-off; transformer variants reached competitive AUCs at substantially higher computational cost, and lightweight CNNs were efficient but less reliable. Performance was sensitive to batch size, which underscores the need for standardized training protocols, and absolute discrimination remained modest, reflecting the intrinsic difficulty of the task and the dataset's size imbalance.

本研究对深度学习模型在胶质母细胞瘤随访MRI中区分真正肿瘤进展与治疗相关假性进展的能力进行了基准测试。研究在统一的流程下，针对不同时间点的放疗后扫描分别训练了11个模型家族。各阶段的准确率相近，但第二次随访时的区分能力（F1和AUC）有所提高，表明后期的可分性更好。Mamba+CNN混合模型在准确率与效率之间取得了最佳权衡，而 Transformer 变体以更高的计算成本取得了有竞争力的AUC。性能还受批次大小影响，凸显了标准化训练协议的必要性。总体而言，由于任务固有的难度和数据集不平衡，区分能力仍然有限。
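Patient-level cross-validation, mentioned in the abstract, means that every scan from a given patient falls on the same side of each split, so a model is never validated on a patient it saw during training. The paper's actual pipeline is not shown in this digest; the sketch below is a generic grouped split with hypothetical names (`patient_level_folds`, `scans`) and a simple largest-first balancing heuristic chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def patient_level_folds(
    patient_ids: Sequence[Hashable], n_folds: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, val_idx) pairs in which no patient appears on both sides."""
    by_patient: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    # Assign whole patients to folds, largest first, to keep fold sizes balanced.
    fold_members: List[List[int]] = [[] for _ in range(n_folds)]
    for pid in sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p))):
        smallest = min(range(n_folds), key=lambda f: len(fold_members[f]))
        fold_members[smallest].extend(by_patient[pid])
    folds = []
    for f in range(n_folds):
        val = sorted(fold_members[f])
        train = sorted(i for g in range(n_folds) if g != f for i in fold_members[g])
        folds.append((train, val))
    return folds


scans = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]   # one patient ID per scan (toy data)
folds = patient_level_folds(scans, n_folds=2)
```

In practice, scikit-learn's `GroupKFold` provides the same guarantee when the patient ID is passed as the `groups` argument.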

When Deepfake Detection Meets Graph Neural Network: A Unified and Lightweight Learning Framework

Authors: Haoyu Liu, Chaoyu Gong, Mengke He, Jiate Li, Kai Han, Siqiang Luo

Venue: KDD 2026

First: 2025-08-07T15:55:13+00:00 · Latest: 2025-12-29T15:24:22+00:00

Comments: Accepted to KDD 2026

Abstract

The proliferation of generative video models has made detecting AI-generated and manipulated videos an urgent challenge. Existing detection approaches often fail to generalize across diverse manipulation types due to their reliance on isolated spatial, temporal, or spectral information, and typically require large models to perform well. This paper introduces SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework that represents videos as structured graphs, enabling joint reasoning over spatial inconsistencies, temporal artifacts, and spectral distortions. SSTGNN incorporates learnable spectral filters and spatial-temporal differential modeling into a unified graph-based architecture, capturing subtle manipulation traces more effectively. Extensive experiments on diverse benchmark datasets demonstrate that SSTGNN not only achieves superior performance in both in-domain and cross-domain settings, but also offers strong efficiency and resource allocation. Remarkably, SSTGNN accomplishes these results with up to 42x fewer parameters than state-of-the-art models, making it highly lightweight and resource-friendly for real-world deployment.

中文标题/摘要

标题：当深度伪造检测遇到图神经网络：统一且轻量级的学习框架

生成视频模型的普及使得检测AI生成和篡改的视频成为迫切的挑战。现有检测方法往往由于依赖孤立的空间、时间和频谱信息而难以在多种篡改类型之间泛化，通常需要大型模型才能表现良好。本文介绍了一种轻量级的时空频谱图神经网络框架SSTGNN，该框架将视频表示为结构化的图，能够联合推理空间不一致、时间伪影和频谱失真。SSTGNN将可学习的频谱滤波器和时空差分建模结合到统一的图架构中，更有效地捕捉细微的篡改痕迹。在多种基准数据集上的广泛实验表明，SSTGNN不仅在领域内和跨领域设置中均表现出优越的性能，而且具有强大的效率和资源分配能力。令人惊讶的是，SSTGNN仅需比最先进的模型少42倍的参数，使其在实际部署中具有高度的轻量化和资源友好性。

Summary / 总结

This paper addresses the challenge of detecting AI-generated and manipulated videos by proposing SSTGNN, a lightweight Spatial-Spectral-Temporal Graph Neural Network framework. SSTGNN represents videos as structured graphs to jointly reason over spatial inconsistencies, temporal artifacts, and spectral distortions. The framework incorporates learnable spectral filters and spatial-temporal differential modeling, effectively capturing subtle manipulation traces. Experimental results show that SSTGNN outperforms state-of-the-art models in both in-domain and cross-domain settings while being highly efficient and resource-friendly, with up to 42 times fewer parameters.

本文提出了一种轻量级的时空频图神经网络框架SSTGNN，以应对检测AI生成和篡改视频的挑战。SSTGNN将视频表示为结构化的图，以联合推理空间不一致、时间伪影和频谱失真。该框架结合了可学习的频谱滤波器和时空差分建模，有效地捕捉细微的篡改痕迹。实验结果表明，SSTGNN在领域内和跨领域设置中均优于现有模型，同时具有高度的高效性和资源友好性，参数量最多减少42倍。
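Representing a video as a structured graph can be pictured with patches as nodes, spatial edges inside each frame, and temporal edges between frames. The abstract does not specify how SSTGNN builds its graph, so the construction below (4-neighbour spatial edges, same-patch temporal edges, and a frame-to-frame feature difference) is an assumption, and `build_video_graph` and `temporal_difference` are hypothetical names. The learnable spectral filters are left out of this sketch.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int, int]  # (frame, row, col)


def build_video_graph(frames: int, rows: int, cols: int) -> Dict[Node, List[Node]]:
    """Adjacency list linking each patch to its 4 spatial neighbours and to the
    same patch in the previous and next frame (an assumed construction)."""
    graph: Dict[Node, List[Node]] = {}
    for t in range(frames):
        for r in range(rows):
            for c in range(cols):
                nbrs: List[Node] = []
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # spatial edges
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        nbrs.append((t, rr, cc))
                for dt in (-1, 1):                                  # temporal edges
                    if 0 <= t + dt < frames:
                        nbrs.append((t + dt, r, c))
                graph[(t, r, c)] = nbrs
    return graph


def temporal_difference(features: Dict[Node, float], node: Node) -> float:
    """Feature change of a patch relative to the previous frame (0.0 for frame 0)."""
    t, r, c = node
    return features[node] - features[(t - 1, r, c)] if t > 0 else 0.0


graph = build_video_graph(frames=3, rows=2, cols=2)
```

A graph of this kind lets a message-passing layer compare a patch with both its spatial surroundings and its own past, which is one plausible reading of the joint spatial and temporal reasoning described above.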

Machine Unlearning using Forgetting Neural Networks

Authors: Amartya Hatua, Trung T. Nguyen, Filip Cano, Andrew H. Sung

First: 2024-10-29T02:52:26+00:00 · Latest: 2025-12-29T15:15:41+00:00

Comments: 12 Pages, Accepted at ICAART 2026 - 18th International Conference on Agents and Artificial Intelligence

Abstract

Modern computer systems store vast amounts of personal data, enabling advances in AI and ML but risking user privacy and trust. For privacy reasons, it is sometimes desired for an ML model to forget part of the data it was trained on. In this paper, we introduce a novel unlearning approach based on Forgetting Neural Networks (FNNs), a neuroscience-inspired architecture that explicitly encodes forgetting through multiplicative decay factors. While FNNs had previously been studied as a theoretical construct, we provide the first concrete implementation and demonstrate their effectiveness for targeted unlearning. We propose several variants with per-neuron forgetting factors, including rank-based assignments guided by activation levels, and evaluate them on MNIST and Fashion-MNIST benchmarks. Our method systematically removes information associated with forget sets while preserving performance on retained data. Membership inference attacks confirm the effectiveness of FNN-based unlearning in erasing information about the training data from the neural network. These results establish FNNs as a promising foundation for efficient and interpretable unlearning.

中文标题/摘要

标题：使用遗忘神经网络的机器遗忘

现代计算机系统存储了大量的个人数据，这虽然促进了人工智能和机器学习的发展，但也可能损害用户的隐私和信任。出于隐私原因，有时希望机器学习模型忘记其训练数据的一部分。在本文中，我们介绍了一种基于遗忘神经网络（FNNs）的新型机器遗忘方法。FNN 是一种受神经科学启发的架构，通过乘法衰减因子显式地编码遗忘。虽然FNNs此前曾作为理论构造被研究，但我们首次给出了具体实现，并证明了其在定向遗忘中的有效性。我们提出了几种采用逐神经元遗忘因子的变体，其中包括由激活水平引导的基于排名的分配方式，并在MNIST和Fashion-MNIST基准上进行了评估。我们的方法系统地移除了与遗忘集相关的信息，同时保持了在保留数据上的性能。成员推理攻击证实了基于FNN的机器遗忘能够有效地从神经网络中擦除训练数据的信息。这些结果表明，FNNs 是实现高效且可解释的机器遗忘的一个有前景的基础。

Summary / 总结

This paper addresses the challenge of enabling machine learning models to forget specific parts of the data they were trained on, which is crucial for privacy protection. It introduces Forgetting Neural Networks (FNNs), a neuroscience-inspired architecture that incorporates multiplicative decay factors to explicitly encode forgetting. The authors propose variants with per-neuron forgetting factors and evaluate their method on MNIST and Fashion-MNIST datasets, demonstrating that FNNs can systematically remove information associated with specific data while maintaining performance on retained data. Membership inference attacks confirm the effectiveness of FNN-based unlearning in erasing information from the neural network.

本文探讨了使机器学习模型能够忘记训练数据中特定部分的挑战，这对于保护隐私至关重要。文中引入了遗忘神经网络（FNNs），这是一种借鉴神经科学原理的架构，通过乘法衰减因子明确编码遗忘。作者提出了基于神经元的遗忘因子的多种变体，并在MNIST和Fashion-MNIST数据集上进行了评估，展示了FNNs可以系统地移除特定数据的信息，同时保持对保留数据的性能。成员推断攻击证实了基于FNN的遗忘在神经网络中有效擦除信息。
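The idea of per-neuron multiplicative decay factors with rank-based assignment can be sketched in a few lines. The exact functional form used by FNNs is not given in the abstract, so the exponential factor `exp(-rate * t)`, the linear rank-to-rate mapping, and the names `rank_based_rates` and `apply_forgetting` are assumptions made for this illustration.

```python
import math
from typing import List


def rank_based_rates(
    forget_activations: List[float], min_rate: float = 0.1, max_rate: float = 1.0
) -> List[float]:
    """Assign a decay rate per neuron by its activation rank on the forget set.

    Assumption: the most active neuron gets `max_rate`, the least active gets
    `min_rate`, with linear spacing in between.
    """
    n = len(forget_activations)
    if n == 1:
        return [max_rate]
    order = sorted(range(n), key=lambda i: forget_activations[i])  # ascending activation
    rates = [0.0] * n
    for rank, i in enumerate(order):
        rates[i] = min_rate + (max_rate - min_rate) * rank / (n - 1)
    return rates


def apply_forgetting(activations: List[float], rates: List[float], t: float) -> List[float]:
    """Scale each neuron's output by a multiplicative factor exp(-rate * t)."""
    return [a * math.exp(-r * t) for a, r in zip(activations, rates)]


rates = rank_based_rates([0.2, 0.9, 0.5])          # mean activations on the forget set (toy)
decayed = apply_forgetting([1.0, 1.0, 1.0], rates, t=2.0)
```

Neurons that respond most strongly to the forget set decay fastest, while weakly involved neurons are left mostly intact, which is one way to remove targeted information while limiting damage to retained data.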

Expressive Temporal Specifications for Reward Monitoring

Authors: Omar Adalat, Francesco Belardinelli

Venue: AAAI

First: 2025-11-16T22:28:30+00:00 · Latest: 2025-12-29T15:04:16+00:00

Comments: Accepted at AAAI-26

Abstract

Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards under long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic and only relies on a state labelling function, and naturally accommodates specifying non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.

Summary / 总结

This work addresses the challenge of specifying dense reward functions in Reinforcement Learning by using quantitative Linear Temporal Logic on finite traces. The method synthesizes reward monitors that emit a dense stream of rewards during training, guiding agents toward optimal behavior and mitigating the sparse rewards that arise under Boolean semantics. The framework is algorithm-agnostic, relies only on a state labelling function, and can express non-Markovian properties. Experiments show that the quantitative monitors consistently subsume Boolean monitors and, depending on the environment, outperform them in maximizing a quantitative measure of task completion and in reducing convergence time.

该研究通过利用有限轨迹上的定量线性时序逻辑来合成奖励监控器，以解决强化学习中密集奖励函数的指定难题。这些监控器在训练过程中提供精细化反馈，引导智能体达到最优行为，并在长期决策任务中提高任务完成度和收敛速度。实验结果表明，定量监控器在这些场景中优于布尔监控器。
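The contrast between Boolean and quantitative monitoring can be shown on a toy "eventually reach the goal" property. The sketch uses the max/min reading of temporal operators that is common in quantitative temporal logics; how the paper actually turns formula values into rewards is not stated in the abstract, so the difference-based shaping in `dense_rewards` and the `closeness` scores are assumptions.

```python
from typing import List


def eventually(values: List[float]) -> float:
    """Quantitative 'eventually' on a finite trace: the best value seen (max)."""
    return max(values)


def always(values: List[float]) -> float:
    """Quantitative 'always' on a finite trace: the worst value seen (min)."""
    return min(values)


def dense_rewards(values: List[float]) -> List[float]:
    """Reward at step t = increase in the 'eventually' value of the prefix ending at t.

    This difference-based shaping is an assumption for illustration; the
    rewards telescope to the 'eventually' value of the full trace.
    """
    rewards, best = [], 0.0
    for v in values:
        new_best = max(best, v)
        rewards.append(new_best - best)
        best = new_best
    return rewards


closeness = [0.0, 0.25, 0.25, 0.5, 1.0]   # hypothetical per-state score in [0, 1]
quantitative = dense_rewards(closeness)
boolean = dense_rewards([1.0 if v == 1.0 else 0.0 for v in closeness])
```

Here the quantitative monitor yields `[0.0, 0.25, 0.0, 0.25, 0.5]`, rewarding partial progress, while the Boolean monitor yields `[0.0, 0.0, 0.0, 0.0, 1.0]`, a single sparse reward at the end. Both sum to the same total, so the quantitative stream refines the Boolean signal rather than changing the objective in this toy case.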

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

Authors: Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

First: 2025-09-02T14:17:16+00:00 · Latest: 2025-12-29T15:03:06+00:00

Comments: 10 pages, 5 figures, 5 tables

Abstract

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.

中文标题/摘要

标题：序数自适应校正：一种面向数据的序数图像分类噪声标签校正方法

标记数据是训练计算机视觉任务的监督深度学习模型的基本组成部分。然而，标记过程，尤其是在序数图像分类中，由于类边界往往模糊不清，容易出错和产生噪声。这种标签噪声会显著降低机器学习模型的性能和可靠性。本文针对序数图像分类任务中的标签噪声检测和校正问题进行了研究。为此，提出了一种新颖的数据为中心的方法，称为序数自适应校正（ORDAC），用于噪声标签的自适应校正。该方法利用标签分布学习（LDL）的能力来建模序数标签中存在的固有模糊性和不确定性。在训练过程中，ORDAC 动态调整每个样本的标签分布的均值和标准差。该方法的目标是校正这些潜在的噪声样本，并充分利用整个训练数据集。通过在年龄估计（Adience）和疾病严重程度检测（糖尿病视网膜病变）基准数据集上进行各种非对称高斯噪声场景下的评估，证明了所提出方法的有效性。例如，在Adience数据集40%噪声的情况下，ORDAC_R将平均绝对误差从0.86降低到0.62，召回率从0.37提高到0.49。该方法还证明了其在纠正原始数据集中固有噪声方面的有效性。研究表明，使用标签分布进行自适应标签校正是在噪声数据存在的情况下增强序数分类模型的鲁棒性和准确性的有效策略。

Summary / 总结

This paper addresses the issue of label noise in ordinal image classification, proposing a data-centric method called ORDinal Adaptive Correction (ORDAC). ORDAC uses Label Distribution Learning (LDL) to dynamically adjust the mean and standard deviation of label distributions, aiming to correct noisy labels rather than discarding them. Experiments on age estimation and disease severity detection datasets showed significant improvements, with ORDAC_R reducing mean absolute error from 0.86 to 0.62 and increasing recall from 0.37 to 0.49 on the Adience dataset with 40% noise.

本文提出了一种数据为中心的方法ORDinal Adaptive Correction (ORDAC)，通过动态调整标签分布来纠正有序图像分类中的噪声标签。ORDAC 利用 Label Distribution Learning (LDL) 来处理有序标签中的固有模糊性。实验结果表明，ORDAC 及其扩展版本在年龄估计和疾病严重程度检测数据集上的表现显著提升，减少了平均绝对误差并提高了召回率，在各种噪声场景下表现优异。
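The mechanism described above, a per-sample label distribution whose mean and standard deviation are adjusted during training, can be sketched with a discretised Gaussian over the ordinal classes. ORDAC's actual update rule is not given in the abstract; the convex-combination step in `adapt_mean`, the fixed standard deviation, and all names below are assumptions for illustration only.

```python
import math
from typing import List


def label_distribution(mean: float, std: float, num_classes: int) -> List[float]:
    """Discretised Gaussian over ordinal classes 0..num_classes-1, normalised to sum to 1."""
    weights = [math.exp(-0.5 * ((k - mean) / std) ** 2) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def adapt_mean(mean: float, predicted: List[float], rate: float = 0.5) -> float:
    """Move the label mean toward the model's expected class.

    Hypothetical update rule: a convex combination with step size `rate`.
    """
    expected = sum(k * p for k, p in enumerate(predicted))
    return (1.0 - rate) * mean + rate * expected


noisy_label = 6.0                                    # annotated class, possibly wrong
model_probs = label_distribution(mean=4.0, std=1.0, num_classes=8)   # toy model output
corrected = adapt_mean(noisy_label, model_probs)
target = label_distribution(corrected, std=1.0, num_classes=8)
```

Instead of discarding the sample, the soft target shifts part of the way from the annotated class toward the model's belief, so the sample stays in the training set with a less harmful label.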