arXiv 论文速递

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee

First: 2026-01-20T18:59:56+00:00 · Latest: 2026-01-20T18:59:56+00:00

Comments: Project page: https://cvlab-kaist.github.io/VideoMaMa/

Abstract

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present the Video Mask-to-Matte Model (VideoMaMa), which converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

中文标题/摘要

标题：VideoMaMa：基于生成先验的遮罩引导视频抠像

将视频抠像模型推广到真实世界视频中仍然是一个重大挑战，因为标注数据稀缺。为了解决这一问题，我们提出了Video Mask-to-Matte Model（VideoMaMa），该模型通过利用预训练的视频扩散模型，将粗略的分割掩码转换为像素级准确的alpha抠像。尽管VideoMaMa仅在合成数据上进行训练，但它在真实世界片段上的零样本泛化能力仍然很强。在此基础上，我们开发了一个可扩展的伪标签流水线，用于大规模视频抠像，并构建了Matting Anything in Video（MA-V）数据集，该数据集为超过50,000个真实世界视频提供了高质量的抠像注释，这些视频涵盖了多种场景和动作。为了验证该数据集的有效性，我们在MA-V上微调了SAM2模型，得到SAM2-Matte，其在野外视频上的鲁棒性优于在现有抠像数据集上训练的同一模型。这些发现强调了大规模伪标签视频抠像的重要性，并展示了生成先验和可访问的分割提示如何推动视频抠像研究的可扩展进展。

Summary / 总结

VideoMaMa converts coarse segmentation masks into pixel-accurate alpha mattes by leveraging pretrained video diffusion models, addressing the difficulty of generalizing video matting to real-world videos. Although trained solely on synthetic data, it shows strong zero-shot generalization to real-world footage. The authors use it in a pseudo-labeling pipeline to build the MA-V dataset, which provides matting annotations for more than 50K real-world videos. Fine-tuning SAM2 on MA-V yields SAM2-Matte, which is more robust on in-the-wild videos than the same model trained on existing matting datasets.

VideoMaMa是一种利用预训练视频扩散模型将粗略分割掩码转换为像素级精确alpha抠像的模型，尽管仅在合成数据上训练，仍在真实世界视频上展现出强大的零样本泛化能力。基于这一能力，作者构建了可扩展的伪标签流水线，并建立了包含超过50K真实世界视频的MA-V数据集，这些视频覆盖了多种场景和动作。在MA-V上微调SAM2模型得到SAM2-Matte，其在真实场景视频上的鲁棒性优于在现有抠像数据集上训练的同一模型，突显了大规模伪标签视频抠像的重要性，以及生成先验和易于获得的分割线索在推动视频抠像研究中的作用。
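
To make the difference between a coarse mask and an alpha matte concrete, the sketch below applies the standard compositing equation I = alpha * F + (1 - alpha) * B to a single row of pixels. This is general matting background, not VideoMaMa's method; the pixel values, the 0.5 threshold, and the `composite` helper are illustrative assumptions.

```python
def composite(alpha, fg, bg):
    """Blend one foreground and one background value with the standard
    compositing equation I = alpha * F + (1 - alpha) * B."""
    return alpha * fg + (1.0 - alpha) * bg


# Hypothetical row of pixels crossing a soft edge (e.g. hair): white
# foreground over black background.
foreground = [1.0, 1.0, 1.0, 1.0]
background = [0.0, 0.0, 0.0, 0.0]

# An alpha matte stores fractional coverage per pixel.
soft_alpha = [1.0, 0.75, 0.25, 0.0]
# A coarse segmentation mask only stores inside/outside (assumed threshold 0.5).
hard_mask = [1.0 if a >= 0.5 else 0.0 for a in soft_alpha]

with_matte = [composite(a, f, b) for a, f, b in zip(soft_alpha, foreground, background)]
with_mask = [composite(a, f, b) for a, f, b in zip(hard_mask, foreground, background)]

print(with_matte)  # smooth falloff across the edge
print(with_mask)   # hard step at the edge
```

The binary mask turns the soft edge into a hard step, which is the kind of detail a mask-to-matte model is meant to recover.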

Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

Authors: Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, Anpei Chen

First: 2026-01-20T18:59:48+00:00 · Latest: 2026-01-20T18:59:48+00:00

Comments: Project page: https://motion3-to-4.github.io/. Code: https://github.com/Inception3D/Motion324

Abstract

We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at https://motion3-to-4.github.io/.

中文标题/摘要

标题：Motion 3-to-4：面向4D合成的3D运动重建

我们提出了Motion 3-to-4，这是一种前馈框架，可以从单目视频和可选的3D参考网格中合成高质量的4D动态对象。尽管最近的进步显著提高了2D、视频和3D内容的生成，但由于训练数据有限和从单目视角恢复几何和运动的固有不确定性，4D合成仍然具有挑战性。Motion 3-to-4通过将4D合成分解为静态3D形状生成和运动重建来应对这些挑战。使用一个标准参考网格，我们的模型学习了一个紧凑的运动潜在表示，并预测每帧顶点轨迹以恢复完整且时间上一致的几何结构。可扩展的逐帧变换器进一步增强了对序列长度变化的鲁棒性。在标准基准和具有准确地面真值几何的新数据集上的评估表明，Motion 3-to-4在保真度和空间一致性方面优于先前的工作。项目页面可在https://motion3-to-4.github.io/获取。

Summary / 总结

Motion 3-to-4 is a feed-forward framework that synthesizes high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. It addresses the challenges of 4D synthesis by decomposing it into static 3D shape generation and motion reconstruction. The model uses a canonical reference mesh to learn a compact motion latent representation and predict per-frame vertex trajectories, ensuring temporally coherent geometry. Evaluations show that Motion 3-to-4 outperforms previous methods in terms of fidelity and spatial consistency.

Motion 3-to-4 是一个前馈框架，可以从单个单目视频和可选的 3D 参考网格中合成高质量的 4D 动态对象。通过将 4D 合成分解为静态 3D 形状生成和运动重建，该模型使用一个标准的参考网格来学习紧凑的运动潜在表示，并预测每一帧的顶点轨迹，以确保时空一致性。评估结果显示，Motion 3-to-4 在保真度和空间一致性方面优于先前的方法。

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

First: 2026-01-20T18:58:32+00:00 · Latest: 2026-01-20T18:58:32+00:00

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.

中文标题/摘要

标题：LightOnOCR：面向最先进OCR的10亿参数端到端多语言视觉-语言模型

我们提出了LightOnOCR-2-1B，这是一种1B参数的端到端多语言视觉-语言模型，能够将文档图像（例如PDF）转换为干净、自然排序的文本，而无需脆弱的OCR管道。该模型在大规模高质量的蒸馏混合数据集上进行训练，该数据集涵盖了扫描文档、法语文档和科学PDF。LightOnOCR-2在OlmOCR-Bench上达到了最先进的结果，同时比之前表现最好的模型小9倍，并且速度更快。我们进一步扩展了输出格式，预测嵌入图像的标准化边界框，在预训练中通过恢复策略引入定位，并使用基于IoU的奖励进行RLVR细化。最后，我们通过检查点平均和任务算术合并提高了鲁棒性。我们以Apache 2.0许可证发布模型检查点，并公开发布了数据集和LightOnOCR-bbox-bench评估。

Summary / 总结

LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model that converts document images into clean, naturally ordered text. It achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. The output format is extended to predict normalized bounding boxes for embedded images, with localization refined by RLVR using IoU-based rewards. Model checkpoints are released under Apache 2.0, and the dataset and the LightOnOCR-bbox-bench evaluation are released under their respective licenses.

LightOnOCR-2-1B 是一个 10 亿参数的端到端多语言视觉-语言模型，能够将文档图像转换为干净的文本，在OlmOCR-Bench上达到了最先进的结果，同时比此前表现最好的模型小 9 倍且速度明显更快。该模型还扩展了输出格式以预测嵌入图像的边界框，并使用 RLVR 进行细化。模型检查点在 Apache 2.0 许可下发布，数据集和评估基准也已公开发布。
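
The abstract mentions IoU-based rewards for predicted bounding boxes without giving the exact reward shaping. The sketch below shows only the underlying intersection-over-union computation for boxes in normalized coordinates. The `(x0, y0, x1, y1)` layout and the idea of using the IoU value directly as the reward are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x0, y0, x1, y1) with coordinates normalized to [0, 1]
    (an assumed layout; the paper's exact format is not given here).
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


# Hypothetical predicted and reference boxes for one embedded image.
predicted = (0.0, 0.0, 0.5, 1.0)
reference = (0.25, 0.0, 0.75, 1.0)
reward = iou(predicted, reference)
print(reward)
```

A reward of this form is 1.0 for a perfect box and 0.0 for a disjoint one, which makes it automatically checkable against reference annotations.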

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Authors: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao

First: 2026-01-20T18:58:11+00:00 · Latest: 2026-01-20T18:58:11+00:00

Comments: Github Page: https://pangzecheung.github.io/OmniTransfer/

Abstract

Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

中文标题/摘要

标题：OmniTransfer：一站式时空视频转移框架

视频比图像或文本传达更多信息，能够捕捉空间和时间动态。然而，大多数现有的视频定制方法依赖于参考图像或特定任务的时间先验，未能充分利用视频中固有的丰富时空信息，从而限制了视频生成的灵活性和泛化能力。为了解决这些限制，我们提出了OmniTransfer，这是一种统一的时空视频转移框架。它利用跨帧的多视角信息来增强外观一致性，并利用时间线索实现精细的时间控制。为了统一各种视频转移任务，OmniTransfer 包含三个关键设计：任务感知位置偏差，能够自适应地利用参考视频信息以提高时间对齐或外观一致性；参考解耦因果学习，将参考和目标分支分离，以实现精确的参考转移并提高效率；以及任务自适应多模态对齐，使用多模态语义指导动态区分和解决不同的任务。广泛的实验表明，OmniTransfer 在外观（身份和风格）和时间转移（摄像机运动和视频效果）方面优于现有方法，同时在不使用姿态的情况下达到姿态引导方法在运动转移方面的效果，从而建立了一种新的灵活、高保真视频生成范式。

Summary / 总结

OmniTransfer is a unified framework for spatio-temporal video transfer that enhances appearance consistency and enables fine-grained temporal control. It incorporates three key designs: Task-aware Positional Bias, Reference-decoupled Causal Learning, and Task-adaptive Multimodal Alignment. Experimental results show that OmniTransfer outperforms existing methods in appearance and temporal transfer tasks, while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

OmniTransfer 是一个统一的时空视频转移框架，通过利用多视图信息和时间线索来解决现有方法的局限性。它包括任务感知位置偏差、参考解耦因果学习和任务自适应多模态对齐，以增强外观一致性、实现精细的时间控制并统一各种视频转移任务。实验结果表明，OmniTransfer 在外观和时间转移方面优于现有方法，同时在不使用姿态的情况下达到姿态引导方法在运动转移方面的效果，从而建立了一个新的灵活、高保真视频生成范式。

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu

First: 2026-01-20T18:54:31+00:00 · Latest: 2026-01-20T18:54:31+00:00

Comments: 11 pages, 6 figures, 4 tables

Abstract

Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.

中文标题/摘要

标题：Jet-RL：通过统一训练与回放精度流实现同策略FP8强化学习

强化学习（RL）对于提升大型语言模型（LLMs）的复杂推理能力至关重要。然而，现有的RL训练管道在计算效率和资源消耗方面存在瓶颈，回放阶段占总训练时间的70%以上。使用FP8精度的量化RL训练提供了一种有前景的方法来缓解这一瓶颈。一种常用策略是在回放中使用FP8精度，而在训练中保留BF16精度。在本文中，我们首次全面研究了FP8 RL训练，并证明了广泛使用的BF16训练+FP8回放策略在长时回放和具有挑战性的任务中会遭受严重的训练不稳定性和灾难性的准确率崩溃。我们的分析表明，这些失败源于该方法的离策略（off-policy）特性，这在训练和推理之间引入了显著的数值不匹配。受这些观察的启发，我们提出了Jet-RL，这是一种FP8 RL训练框架，能够实现稳健和稳定的RL优化。关键思想是采用统一的FP8精度流，用于训练和回放，从而最小化数值差异并消除低效的跨步骤校准需求。广泛的实验验证了Jet-RL的有效性：我们的方法在回放阶段可实现高达33%的加速，在训练阶段可实现高达41%的加速，相对于BF16训练的整体加速率为16%，同时在所有设置中保持稳定的收敛，并且几乎不降低准确率。

Summary / 总结

This work addresses the computational inefficiency of reinforcement learning (RL) training pipelines by proposing Jet-RL, an FP8 RL training framework. It unifies the precision flow for both training and rollout, reducing numerical discrepancies and improving stability. Experiments show that Jet-RL achieves up to 41% speedup in training, 33% speedup in rollout, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence and negligible accuracy degradation. The study also highlights the instability of the common BF16-training + FP8-rollout approach under long-horizon rollouts and challenging tasks.

该研究提出Jet-RL框架，通过统一使用FP8精度进行训练和回放，解决强化学习（RL）训练管道的计算效率低下问题。研究发现，常见的BF16训练+FP8回放策略会导致训练不稳定和精度崩溃。Jet-RL方法减少了数值差异，提高了稳定性。实验结果显示，在训练、回放和整体性能上分别实现了最高41%、33%和16%的加速，且精度损失可以忽略不计。
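
The off-policy argument can be illustrated with a toy example: if the rollout samples from a policy computed at one precision and training scores the same action at another, the probability ratio between the two drifts away from 1. The sketch below is a rough illustration only. `round_mantissa` is a crude stand-in that keeps a fixed number of significant bits and is not a faithful FP8 or BF16 implementation, and the logits are made up.

```python
import math


def round_mantissa(x, bits):
    """Keep roughly `bits` significant binary digits of x.

    Crude stand-in for low-precision storage (assumption): it ignores
    exponent range, saturation, and subnormals of real FP8/BF16 formats.
    """
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent
    scale = 2 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def policy(logits, bits):
    """Action probabilities after rounding the logits to `bits` bits."""
    return softmax([round_mantissa(v, bits) for v in logits])


logits = [2.37, 1.91, -0.42, 0.08]  # hypothetical next-token logits
action = 1                          # index of the sampled token

# Mixed precision: rollout at ~4 significant bits (FP8-like),
# training at ~8 significant bits (BF16-like).
mixed_ratio = policy(logits, 8)[action] / policy(logits, 4)[action]

# Unified precision: training and rollout use the same rounding.
unified_ratio = policy(logits, 4)[action] / policy(logits, 4)[action]

print(mixed_ratio, unified_ratio)
```

With unified precision the ratio is exactly 1, so the sampled data stays on-policy with respect to the model being trained; with mixed precision it does not.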

APEX-Agents

Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski

First: 2026-01-20T18:53:44+00:00 · Latest: 2026-01-20T18:53:44+00:00

Abstract

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

中文标题/摘要

标题：APEX-Agents

我们介绍了代理人工智能生产力指数（APEX-Agents），这是一个基准测试，用于评估AI代理是否能够执行由投资银行分析师、管理咨询顾问和公司律师创建的长期跨应用任务。APEX-Agents 要求代理在包含文件和工具的现实工作环境中导航。我们使用 Pass@1 测试了八种代理以确定排行榜。Gemini 3 Flash（思考=高）获得最高分为 24.0%，其次是 GPT-5.2（思考=高）、Claude Opus 4.5（思考=高）和 Gemini 3 Pro（思考=高）。我们开源了包含 480 个提示、评分标准、黄金输出、文件和元数据的 APEX-Agents 基准测试。我们还开源了我们的代理执行和评估基础设施 Archipelago。
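
The abstract reports Pass@1 but does not spell out how it is computed. One common convention, borrowed from code-generation evaluation, is the unbiased pass@k estimator averaged over tasks; the sketch below assumes that convention. The per-task attempt counts are invented for illustration and are not results from the benchmark.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n attempts, c of them passing.

    Equals 1 - C(n - c, k) / C(n, k); for k = 1 this reduces to c / n.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (attempts, passes) per task for one agent.
task_results = [(8, 2), (8, 0), (8, 8), (8, 4)]

per_task = [pass_at_k(n, c, 1) for n, c in task_results]
benchmark_score = sum(per_task) / len(per_task)
print(per_task, benchmark_score)
```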

Spatiotemporal Wildfire Prediction and Reinforcement Learning for Helitack Suppression

Authors: Shaurya Mathur, Shreyas Bellary Manjunath, Nitin Kulkarni, Alina Vereshchaka

Venue: ICMLA 2025

First: 2026-01-20T18:50:12+00:00 · Latest: 2026-01-20T18:50:12+00:00

Comments: 6 pages, 5 figures (two of them in tables), Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/

Abstract

Wildfires are growing in frequency and intensity, devastating ecosystems and communities while causing billions of dollars in suppression costs and economic damage annually in the U.S. Traditional wildfire management is mostly reactive, addressing fires only after they are detected. We introduce FireCastRL, a proactive artificial intelligence (AI) framework that combines wildfire forecasting with intelligent suppression strategies. Our framework first uses a deep spatiotemporal model to predict wildfire ignition. For high-risk predictions, we deploy a pre-trained reinforcement learning (RL) agent to execute real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework generates a threat assessment report to help emergency responders optimize resource allocation and planning. In addition, we are publicly releasing a large-scale, spatiotemporal dataset containing 9.5 million samples of environmental variables for wildfire prediction. Our work demonstrates how deep learning and RL can be combined to support both forecasting and tactical wildfire response. More details can be found at https://sites.google.com/view/firecastrl.

中文标题/摘要

标题：时空野火预测与强化学习在直升机灭火中的应用

野火的频率和强度正在增加，破坏生态系统和社区，每年在美国造成数十亿美元的扑灭成本和经济损失。传统的野火管理主要是被动的，只在野火被检测到后才进行应对。我们引入了FireCastRL，这是一种主动的人工智能框架，将野火预测与智能灭火策略相结合。该框架首先使用深度时空模型预测野火的点火。对于高风险预测，我们部署一个预训练的强化学习（RL）代理，在物理信息的3D模拟中实时执行直升机单位的灭火战术。该框架生成威胁评估报告，以帮助应急响应者优化资源分配和规划。此外，我们还公开发布了一个包含950万条环境变量样本的大规模时空数据集，用于野火预测。我们的工作展示了深度学习和RL如何结合以支持预测和战术野火应对。更多细节请参见https://sites.google.com/view/firecastrl。

Summary / 总结

The work addresses the increasing frequency and intensity of wildfires with FireCastRL, a proactive AI framework that combines wildfire forecasting with suppression planning. A deep spatiotemporal model predicts wildfire ignition; for high-risk predictions, a pre-trained reinforcement learning agent executes real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework produces a threat assessment report to help emergency responders with resource allocation and planning, and the authors release a large-scale spatiotemporal dataset of 9.5 million samples of environmental variables for wildfire prediction.

该研究通过名为FireCastRL的主动式AI框架来应对野火频率和强度的增加。该框架将时空预测模型与强化学习相结合：先预测野火起火，再针对高风险预测在物理信息3D模拟中由预训练的强化学习代理实时执行直升机灭火战术。框架会生成威胁评估报告，帮助应急响应人员优化资源分配和规划；作者还公开发布了一个包含950万条环境变量样本的大规模时空数据集，用于野火预测。研究展示了深度学习和强化学习在支持野火预报和战术响应方面的潜力。

Q-learning with Adjoint Matching

Authors: Qiyang Li, Sergey Levine

First: 2026-01-20T18:45:34+00:00 · Latest: 2026-01-20T18:45:34+00:00

Comments: 32 pages, 8 figures, 7 tables

Abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

中文标题/摘要

标题：伴随匹配的Q学习

我们提出了一种新颖的基于时差的强化学习（RL）算法——伴随匹配的Q学习（QAM），该算法解决了连续动作RL中的长期挑战：针对参数化Q函数优化表达性强的扩散或流匹配策略的有效优化。有效的优化需要利用评论者的一阶信息，但由于直接通过其多步去噪过程进行梯度优化的反向传播在数值上不稳定，因此对流或扩散策略进行梯度优化具有挑战性。现有方法要么仅使用价值而丢弃梯度信息，要么依赖牺牲策略表达性或偏置学习策略的近似方法。QAM通过利用生成建模中最近提出的技术——伴随匹配，绕过了这两个挑战，将评论者的动作梯度转换为逐步的目标函数，从而避免了不稳定的反向传播，同时在最优状态下提供了一个无偏且表达性强的策略。结合评论者学习的时差备份，QAM在硬的稀疏奖励任务中的一系列离线和离线到在线RL任务中都优于先前的方法。

Summary / 总结

Q-learning with Adjoint Matching (QAM) is a novel TD-based reinforcement learning algorithm designed to optimize expressive diffusion or flow-matching policies efficiently. It addresses the challenge of numerically unstable gradient-based optimization by using adjoint matching to transform the critic's action gradient, avoiding backpropagation through the multi-step denoising process. QAM consistently outperforms previous methods on hard, sparse reward tasks in both offline and offline-to-online reinforcement learning scenarios.

Q-learning with Adjoint Matching (QAM) 是一种新型的基于 TD 的强化学习算法，旨在高效优化表达性强的扩散或流匹配策略。它利用伴随匹配对评论者（critic）的动作梯度进行变换，从而避免通过多步去噪过程进行反向传播，解决了基于梯度的优化在数值上不稳定的问题。QAM 在困难的稀疏奖励任务上，于离线和离线到在线强化学习两种设置中均优于先前的方法。

KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Authors: Egor Cherepanov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov

First: 2026-01-20T18:44:28+00:00 · Latest: 2026-01-20T18:44:28+00:00

Comments: 38 pages, 44 figures, 3 tables

Abstract

Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: https://avanturist322.github.io/KAGEBench/.

中文标题/摘要

标题：KAGE-Bench：快速已知轴视觉泛化评估框架用于强化学习

基于像素的强化学习代理在纯视觉分布变化下常常失败，即使潜在动力学和奖励未变，但现有基准将多种变化源混杂在一起，妨碍了系统的分析。我们引入了KAGE-Env，这是一个JAX原生的2D平台游戏，将观察过程分解为独立可控的视觉轴，而底层控制问题保持不变。通过构造，改变视觉轴仅通过像素策略的状态条件动作分布影响性能，提供了一个清晰的视觉泛化抽象。基于此环境，我们定义了KAGE-Bench，一个包含六个已知轴套件的基准，共有34个训练-评估配置对，以隔离个体视觉变化。使用标准的PPO-CNN基线，我们观察到强烈的轴依赖性失败，背景和光度变化通常导致失败，而代理外观变化相对无害。某些变化在保持前向运动的同时破坏任务完成，表明仅凭回报可能掩盖泛化失败。最后，完全向量化JAX实现使单个GPU每秒可达到3300万环境步骤，从而实现视觉因素的快速和可重复扫描。代码：https://avanturist322.github.io/KAGEBench/。

Summary / 总结

The research aims to evaluate visual generalization in reinforcement learning agents by isolating different visual factors. KAGE-Env, a JAX-native 2D platformer, is introduced to factorize the observation process into independently controllable visual axes. KAGE-Bench, a benchmark of six known-axis suites, is defined to isolate individual visual shifts. Experiments using a PPO-CNN baseline show strong axis-dependent failures, particularly with background and photometric shifts, while agent-appearance shifts are less problematic. The study also demonstrates that return alone can mask generalization failures and that the fully vectorized JAX implementation allows for fast and reproducible experiments.

KAGE-Bench 引入了 KAGE-Env，这是一种 2D 平台游戏，可以分离视觉因素并保持控制问题不变，以评估强化学习中的视觉泛化能力。使用 PPO-CNN 基线，研究发现背景和光度变化会导致显著失败，而代理外观变化的影响较小。该基准可以隔离个体视觉变化，并且能够以单个 GPU 每秒 3300 万环境步骤的速度进行快速且可重复的实验。

Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment

Authors: Punit Kumar, Vaibhav Saran, Divyesh Patel, Nitin Kulkarni, Alina Vereshchaka

Venue: ICMLA 2025

First: 2026-01-20T18:41:44+00:00 · Latest: 2026-01-20T18:41:44+00:00

Comments: 8 pages, 6 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/

Abstract

Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble of models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.

中文标题/摘要

标题：基于注意力的离线强化学习和聚类以实现可解释的脓毒症治疗

脓毒症仍然是重症监护病房中导致死亡的主要原因之一，及时和准确的治疗决策可以显著影响患者结果。在本工作中，我们提出了一种可解释的决策支持框架。我们的系统整合了四个核心组件：(1) 一种基于聚类的分层模块，在重症监护病房入院时将患者分为低、中、高风险组，使用聚类和统计验证；(2) 一种利用变分自编码器（VAE）和扩散模型的合成数据增强管道，以丰富如液体或血管加压素给药等代表性不足的轨迹；(3) 一种使用优势加权回归（AWR）训练的离线强化学习（RL）代理，配备轻量级注意力编码器，并由集成模型支持，以提供保守、安全的治疗建议；(4) 一种由多模态大型语言模型（LLM）驱动的推理生成模块，生成基于临床背景和检索专家知识的自然语言解释。在MIMIC-III和eICU数据集上评估，我们的方法在提供给临床医生可解释和稳健的策略建议的同时，实现了高治疗准确性。

Summary / 总结

This study aims to improve sepsis treatment in intensive care units by developing an interpretable decision support framework. The framework includes a clustering module for patient stratification, a synthetic data augmentation pipeline, an offline reinforcement learning agent, and a rationale generation module. The offline RL agent uses AWR with an attention encoder and ensemble models to provide conservative and safety-aware treatment recommendations. The approach achieves high treatment accuracy and offers interpretable policy recommendations, as evaluated on MIMIC-III and eICU datasets.

该研究旨在通过开发一个可解释的决策支持框架来改善重症监护病房中的脓毒症治疗。框架包括患者分层模块、合成数据增强管道、离线强化学习代理和推理生成模块。离线RL代理使用AWR和注意力编码器以及集成模型来提供保守和安全的治疗建议。该方法在MIMIC-III和eICU数据集上的评估显示其具有高治疗准确性，并提供可解释的策略建议。
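
For readers unfamiliar with Advantage Weighted Regression, the sketch below shows its core idea in isolation: logged actions are imitated with a weight of exp(advantage / beta), so better-than-expected actions count more. This is a generic AWR-style loss, not the paper's implementation. The temperature `beta`, the clipping value `max_weight`, and the example numbers are assumptions, and the attention encoder and ensemble are not modeled.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights exp(A / beta), clipped for stability.

    beta and max_weight are illustrative defaults, not values from the paper.
    """
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_policy_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the logged actions."""
    weights = awr_weights(advantages, beta, max_weight)
    weighted = [w * lp for w, lp in zip(weights, log_probs)]
    return -sum(weighted) / len(weighted)


# Hypothetical batch: log-probabilities the current policy assigns to the
# logged treatment actions, and advantage estimates for those actions.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
advantages = [0.0, 1.5, -2.0]

print(awr_weights(advantages))
print(awr_policy_loss(log_probs, advantages))
```

Because the update only reweights actions that already appear in the logged data, the learned policy stays close to observed clinical behavior, which is one reason this family of methods is used in offline settings.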

AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning

Authors: Ran Gong, Xiaohan Zhang, Jinghuan Shang, Maria Vittoria Minniti, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, Karl Schmeckpeper

First: 2025-12-19T17:55:48+00:00 · Latest: 2026-01-20T18:25:48+00:00

Comments: 28 pages, 25 figures. The first four authors contributed equally

Abstract

Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at https://anytask.rai-inst.com .

中文标题/摘要

标题：AnyTask：一种自动化的任务和数据生成框架，用于推进仿真到现实的策略学习

通用型机器人学习仍然受到数据的限制：在现实世界中收集大规模、多样性和高质量的交互数据成本高昂。虽然仿真已成为扩展数据收集规模的一种有前景的方法，但相关的任务，包括仿真任务设计、任务感知场景生成、专家演示合成以及仿真到现实的转移，仍然需要大量的人工努力。我们提出了AnyTask，这是一种自动化框架，结合了大规模并行GPU仿真和基础模型来设计多样化的操作任务并合成机器人数据。我们介绍了三个AnyTask代理以生成尽可能多任务的专家演示：1) ViPR，一种具有VLM在环并行精化的新型任务和运动规划代理；2) ViPR-Eureka，一种基于生成密集奖励和LLM引导接触采样的强化学习代理；3) ViPR-RL，一种结合规划和学习的混合方法，仅使用稀疏奖励即可生成高质量的演示。我们在生成的数据上训练行为克隆策略，在仿真中验证它们，并直接部署到真实机器人硬件上。这些策略泛化到新的物体姿态，在一系列真实世界的拾取放置、抽屉打开、接触丰富的推拉和长时操作任务中平均成功率达到了44%。我们的项目网站为https://anytask.rai-inst.com。

Summary / 总结

AnyTask is an automated framework that uses GPU simulation and foundation models to generate diverse manipulation tasks and robot data. It includes three agents: ViPR for task and motion planning, ViPR-Eureka for reinforcement learning with generated rewards, and ViPR-RL for hybrid planning and learning. The framework trains behavior cloning policies on generated data, validates them in simulation, and deploys them on real robots, achieving 44% average success across various manipulation tasks in real-world scenarios.

AnyTask 是一个自动化框架，旨在生成多样化的操作任务和机器人数据，以促进从仿真到现实的策略学习。该框架利用大规模并行 GPU 仿真和基础模型通过三个代理（ViPR、ViPR-Eureka 和 ViPR-RL）生成专家演示。该框架在生成数据上训练行为克隆策略，在仿真中验证，并直接部署到真实机器人硬件上。这些策略在各种实际操作任务中实现了 44% 的平均成功率，展示了对新物体姿态的显著泛化能力。

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Authors: Matthew Y. R. Yang, Hao Bai, Ian Wu, Gene Yang, Amrith Setlur, Aviral Kumar

First: 2026-01-20T18:15:38+00:00 · Latest: 2026-01-20T18:15:38+00:00

Abstract

Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.

中文标题/摘要

标题：InT：自我提议的干预措施在LLM推理中实现信用分配

结果奖励强化学习（RL）已被证明能够有效提升大型语言模型（LLMs）的推理能力。然而，标准RL仅在最终答案层面分配信用，当结果错误时会惩罚整个推理过程，当结果正确时则均匀强化所有步骤。因此，在失败的推理过程中，正确的中间步骤可能会被抑制，而在成功的推理过程中，虚假步骤可能会被强化。我们将这种失败模式称为信用分配问题。虽然自然的补救措施是训练一个过程奖励模型，但准确优化此类模型以识别纠正的推理步骤仍然具有挑战性。我们引入了干预训练（InT），这是一种训练范式，在这种范式中，模型通过提出简短、针对性的修正来对自己的推理过程进行细粒度的信用分配，从而引导轨迹向更高的奖励方向发展。利用数学推理数据集中通常可用的参考解决方案，并利用验证模型生成的解决方案比从头开始生成正确的解决方案更容易的事实，模型识别出其推理中的第一个错误，并提出一个单一步骤的干预措施，将轨迹引导向正确的解决方案。然后，我们对包含错误点及其干预措施的策略进行监督微调（SFT），将错误定位到导致失败的具体步骤。我们展示了这种模型作为RL训练的良好初始化的效果。在对InT和后续的RL微调后，我们在IMO-AnswerBench上将准确率提高了近14%，超过了4B参数的基础模型，优于更大的开源模型如gpt-oss-20b。

Summary / 总结

The paper addresses the issue of credit assignment in reinforcement learning for large language models (LLMs), where credit is only assigned at the final outcome level. To tackle this, the authors propose Intervention Training (InT), a method where the model self-corrects its reasoning steps by proposing short interventions to steer towards higher rewards. This approach improves accuracy by nearly 14% on the IMO-AnswerBench dataset compared to a 4B-parameter base model, outperforming larger models like gpt-oss-20b.

论文针对大型语言模型（LLMs）在强化学习中的信用分配问题，即信用仅在最终结果层面进行分配。为此，作者提出了干预训练（InT）方法，该方法使模型自我纠正其推理步骤，通过提出短干预措施引导向更高奖励的方向。这种方法在IMO-AnswerBench数据集上的准确率提高了近14%，超过了4B参数的基础模型，也优于更大的开源模型如gpt-oss-20b。
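
The SFT data construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the reasoning trace is already split into steps, and that the index of the first error and the intervention text were produced elsewhere (in the paper, by the model itself with the help of a reference solution). The names `SFTExample`, `build_int_example`, `first_error_idx`, and `intervention` are hypothetical, and whether the erroneous step itself is dropped from the prefix is an assumption made here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SFTExample:
    prompt: str   # problem statement
    target: str   # on-policy prefix followed by the proposed intervention


def build_int_example(problem: str, steps: List[str],
                      first_error_idx: int, intervention: str) -> SFTExample:
    """Keep the rollout up to (not including) the first erroneous step,
    then append the single-step intervention in its place (assumption)."""
    if not 0 <= first_error_idx < len(steps):
        raise ValueError("first_error_idx must index a step in the trace")
    prefix = steps[:first_error_idx]          # steps judged correct
    target = "\n".join(prefix + [intervention])
    return SFTExample(prompt=problem, target=target)


# Toy trace: the second step contains the first error.
example = build_int_example(
    problem="Solve x^2 - 5x + 6 = 0.",
    steps=["Factor as (x - 2)(x - 3) = 0.", "So x = 2 or x = -3."],
    first_error_idx=1,
    intervention="Setting each factor to zero gives x = 2 or x = 3.",
)
print(example.target)
```

Training only on the prefix plus the correction is what localizes the learning signal to the step that caused the failure, rather than spreading it over the whole trace.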

Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting

Authors: Nitin Kulkarni, Akhil Devarashetti, Charlie Cluss, Livio Forte, Dan Buckmaster, Philip Schneider, Chunming Qiao, Alina Vereshchaka

Venue: ICMLA 2025

First: 2026-01-20T18:13:03+00:00 · Latest: 2026-01-20T18:13:03+00:00

Comments: 8 pages, 9 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/

Abstract

Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components (the DISK feature extractor and the attention-based LightGlue matcher) to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.

中文标题/摘要

标题：基于高斯点云绘制的车辆底盘 Rig-Aware 3D 重建

检查二手车底盘是一项劳动密集型任务，需要检查员在每辆车下蹲或爬行以彻底检查。此外，网上买家很少能看到底盘照片。我们提出了一种端到端的流水线，利用三摄像头装置在车辆行驶时拍摄底盘视频，并生成交互式底盘3D模型。3D模型使检查员和客户能够旋转、缩放和切片底盘，以在几秒钟内检测锈蚀、泄漏或碰撞损伤，从而提高工作场所安全性和买家信心。我们的主要贡献是一种针对宽视角镜头失真和低视差场景的Rig-Aware结构从运动（SfM）流水线。我们的方法通过集成精确的相机校准、同步视频流和来自相机装置的强几何先验来克服宽视角镜头失真和低视差场景的挑战。我们使用受约束的匹配策略、学习组件、DISK特征提取器和基于注意力的LightGlue匹配器生成高质量的稀疏点云，这些点云通常无法通过标准SfM流水线获得。这些点云为高斯点云绘制过程提供种子，生成实时渲染的逼真底盘模型。我们的实验和消融研究证明，我们的设计选择对于实现最先进的质量至关重要。

Summary / 总结

This paper presents an end-to-end pipeline for 3D reconstruction of vehicle undercarriages using a three-camera rig to capture videos as the vehicle drives over it. The pipeline addresses wide-angle lens distortion and low-parallax scenes through precise camera calibration, synchronized video streams, and strong geometric priors. Key findings show that the method generates high-quality sparse point clouds using a constrained matching strategy and attention-based LightGlue matcher, which are then used in Gaussian splatting to produce photorealistic 3D models that can be interactively rotated and sliced, improving workplace safety and buyer confidence. Ablation studies confirm the importance of these design choices for achieving state-of-the-art quality.

论文提出了一种使用三摄像头拍摄车辆行驶过底盘视频的端到端管道，以进行3D重建。该方法通过精确的相机校准、同步视频流和强大的几何先验来解决广角镜头失真和低视差场景的问题。关键实验结果表明，受限匹配策略和DISK及LightGlue的使用提高了稀疏点云的质量，生成了实时渲染的逼真底盘模型，有助于提高工作场所安全性和买家信心。

Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

Authors: Rotem Gatenyo, Ohad Fried

First: 2026-01-20T18:12:55+00:00 · Latest: 2026-01-20T18:12:55+00:00

Abstract

We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.

中文标题/摘要

标题：复制-转换-粘贴：由视觉语言和几何约束引导的零样本对象对齐

我们研究了使用描述两个给定网格空间关系的文本提示来进行零样本3D对齐的问题——这是内容创作和场景组装中的一个基本能力。早期的方法主要依赖于几何对齐过程，而最近的工作则利用预训练的2D扩散模型来建模语言条件下的对象间空间关系。相比之下，我们直接在测试时优化相对姿态，通过可微渲染器使用CLIP驱动的梯度更新平移、旋转和各向同性缩放，而无需训练新的模型。我们的框架通过几何感知的目标增强语言监督：一种软迭代最近点（ICP）项的变体以鼓励表面附着，以及一个穿透损失以防止相互穿插。分阶段的时间表随着时间的推移加强接触约束，而相机控制则将优化集中在交互区域。为了进行评估，我们整理了一个包含多种类别和关系的基准，并与基线进行比较。我们的方法优于所有替代方案，产生了语义上忠实且物理上合理的对齐。

Summary / 总结

The research aims to achieve zero-shot 3D alignment of two given meshes using a text prompt, which is crucial for content creation and scene assembly. The method directly optimizes the relative pose at test time using CLIP-driven gradients via a differentiable renderer, incorporating geometry-aware objectives such as a soft-ICP term and a penetration loss. The approach outperforms existing methods by providing semantically faithful and physically plausible alignments. Evaluation is conducted on a curated benchmark with diverse categories and relations, demonstrating superior performance compared to baselines.

研究旨在通过文本提示实现两个给定网格的零样本3D对齐，这对于内容创作和场景组装至关重要。方法直接在测试时优化相对姿态，使用CLIP驱动的梯度并通过可微渲染器，结合几何感知的目标，如软ICP项和穿透损失。该方法通过提供语义上忠实且物理上合理的对齐优于现有方法。在包含多种类别和关系的基准上进行评估，显示出优于基线的性能。

GeLoc3r: Enhancing Relative Camera Pose Regression with Geometric Consistency Regularization

Authors: Jingxing Li, Yongjae Lee, Deliang Fan

First: 2025-09-27T01:21:38+00:00 · Latest: 2026-01-20T18:07:41+00:00

Abstract

The prior method ReLoc3R achieves breakthrough performance with fast 25ms inference and state-of-the-art regression accuracy, yet our analysis reveals subtle geometric inconsistencies in its internal representations that prevent it from reaching the precision ceiling of correspondence-based methods like MASt3R (which requires 300ms per pair). In this work, we present GeLoc3r, a novel approach to relative camera pose estimation that enhances pose regression methods through Geometric Consistency Regularization (GCR). GeLoc3r overcomes the speed-accuracy dilemma by training regression networks to produce geometrically consistent poses without inference-time geometric computation. During training, GeLoc3r leverages ground-truth depth to generate dense 3D-2D correspondences, weights them using a FusionTransformer that learns correspondence importance, and computes geometrically consistent poses via weighted RANSAC. This creates a consistency loss that transfers geometric knowledge into the regression network. Unlike the FAR method, which requires both regression and geometric solving at inference, GeLoc3r only uses the enhanced regression head at test time, maintaining ReLoc3R's fast speed and approaching MASt3R's high accuracy. On challenging benchmarks, GeLoc3r consistently outperforms ReLoc3R, achieving significant improvements including 40.45% vs. 34.85% AUC@5° on the CO3Dv2 dataset (16% relative improvement), 68.66% vs. 66.70% AUC@5° on RealEstate10K, and 50.45% vs. 49.60% on MegaDepth1500. By teaching geometric consistency during training rather than enforcing it at inference, GeLoc3r represents a paradigm shift in how neural networks learn camera geometry, achieving both the speed of regression and the geometric understanding of correspondence methods.

中文标题/摘要

标题：GeLoc3r：通过几何一致性正则化增强相对相机姿态回归

Prior ReLoc3R 以快速 25ms 推断和最先进的回归精度实现了突破性性能，但我们的分析揭示了其内部表示中的细微几何不一致性，这阻碍了达到基于对应关系方法（如 MASt3R 所需每对 300ms 的时间）的精度上限。在此项工作中，我们提出了 GeLoc3r，这是一种新颖的相对相机姿态估计方法，通过几何一致性正则化（GCR）增强姿态回归方法。GeLoc3r 通过在训练过程中训练回归网络生成几何上一致的姿态，从而克服了速度与精度之间的矛盾，而无需在推断时进行几何计算。在训练过程中，GeLoc3r 利用地面真实深度生成密集的 3D-2D 对应关系，使用 FusionTransformer 学习对应关系的重要性进行加权，并通过加权 RANSAC 计算几何上一致的姿态。这创建了一致性损失，将几何知识转移到回归网络中。与 FAR 方法在推断时需要同时进行回归和几何求解不同，GeLoc3r 在测试时仅使用增强的回归头，保持 ReLoc3R 的快速速度并接近 MASt3R 的高精度。在具有挑战性的基准测试中，GeLoc3r 一致地优于 ReLoc3R，分别在 CO3Dv2 数据集上实现了 40.45% 对比 34.85% 的 AUC@5°（相对改进 16%），在 RealEstate10K 上实现了 68.66% 对比 66.70% 的 AUC@5°，以及在 MegaDepth1500 上实现了 50.45% 对比 49.60%。通过在训练期间而不是在推断时强制几何一致性来教导几何一致性，GeLoc3r 代表了神经网络学习相机几何学的新范式，实现了回归的速度和对应关系方法的几何理解。

Summary / 总结

GeLoc3r enhances relative camera pose regression by introducing Geometric Consistency Regularization (GCR) during training. The regression network learns to produce geometrically consistent poses without any geometric computation at inference time, so it keeps ReLoc3R's fast speed while improving accuracy. On benchmarks, GeLoc3r outperforms ReLoc3R in AUC@5° on CO3Dv2 (40.45% vs. 34.85%), RealEstate10K (68.66% vs. 66.70%), and MegaDepth1500 (50.45% vs. 49.60%), which corresponds to relative improvements of about 16%, 2.9%, and 1.7% respectively.

GeLoc3r 通过在训练过程中引入几何一致性正则化（GCR）来提升相对相机姿态回归。该方法在不需在推理时进行几何计算的情况下生成几何上一致的姿态，从而保持快速速度并提高准确性。在基准测试中，GeLoc3r 在 CO3Dv2、RealEstate10K 和 MegaDepth1500 数据集上的 AUC@5° 表现优于 ReLoc3R，分别取得了约 16%、2.9% 和 1.7% 的相对改进。
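
The relative improvements can be checked directly from the AUC@5° pairs quoted in the abstract. The snippet below is a small arithmetic sketch; the function name `relative_improvement` and the `results` dictionary layout are illustrative choices, while the numbers themselves come from the abstract.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# (GeLoc3r, ReLoc3R) AUC@5° values as reported in the abstract.
results = {
    "CO3Dv2": (40.45, 34.85),
    "RealEstate10K": (68.66, 66.70),
    "MegaDepth1500": (50.45, 49.60),
}

for name, (geloc3r, reloc3r) in results.items():
    gain = relative_improvement(geloc3r, reloc3r)
    print(f"{name}: {gain:.2f}% relative improvement")
```

Note that a relative improvement divides the absolute gap by the baseline score, so the same absolute gain counts for less on a benchmark where the baseline is already high.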

DiffusionAgent: Navigating Expert Models for Agentic Image Generation

Authors: Jie Qin, Jie Wu, Weifeng Chen, Yueming Lyu

First: 2024-01-18T15:30:58+00:00 · Latest: 2026-01-20T18:02:51+00:00

Abstract

In the accelerating era of human-instructed visual content creation, diffusion models have demonstrated remarkable generative potential. Yet their deployment is constrained by a dual bottleneck: semantic ambiguity in diverse prompts and the narrow specialization of individual models. A single diffusion architecture struggles to maintain optimal performance across heterogeneous prompts, while conventional "parse-then-call" pipelines artificially separate semantic understanding from generative execution. To bridge this gap, we introduce DiffusionAgent, a unified, language-model-driven agent that casts the entire "prompt comprehension-expert routing-image synthesis" loop into an agentic framework. Our contributions are three-fold: (1) a tree-of-thought-powered expert navigator that performs fine-grained semantic parsing and zero-shot matching to the most suitable diffusion model via an extensible prior-knowledge tree; (2) an advantage database updated with human-in-the-loop feedback, continually aligning model-selection policy with human aesthetic and semantic preferences; and (3) a fully decoupled agent architecture that activates the optimal generative path for open-domain prompts without retraining or fine-tuning any expert. Extensive experiments show that DiffusionAgent retains high generation quality while significantly broadening prompt coverage, establishing a new performance and generality benchmark for multi-domain image synthesis. The code is available at https://github.com/DiffusionAgent/DiffusionAgent

中文标题/摘要

标题：DiffusionAgent：导航专家模型进行有能动性的图像生成

在加速发展的由人类指导的视觉内容创作时代，扩散模型展现了显著的生成潜力。然而，它们的部署受到双重瓶颈的限制：多变提示中的语义模糊性和单个模型的狭窄专业化。单一的扩散架构难以在异构提示下保持最佳性能，而传统的“解析-调用”管道人为地将语义理解与生成执行分离。为弥合这一差距，我们引入了DiffusionAgent，这是一种统一的语言模型驱动代理，将整个“提示理解-专家导航-图像合成”循环转化为一个有能动性的框架。我们的贡献包括三个方面：(1) 一种基于思维树的专家导航器，进行精细的语义解析和零样本匹配，通过可扩展的先验知识树选择最合适的扩散模型；(2) 一个不断更新的人类在环反馈数据库，持续调整模型选择策略以符合人类的审美和语义偏好；(3) 一个完全解耦的代理架构，无需重新训练或微调任何专家即可激活开放领域提示的最佳生成路径。大量实验表明，DiffusionAgent 在保持高质量生成的同时，显著扩展了提示覆盖范围，为多领域图像合成建立了新的性能和通用性基准。代码可在 https://github.com/DiffusionAgent/DiffusionAgent 获取

Summary / 总结

DiffusionAgent addresses the limitations of diffusion models in handling diverse prompts and narrow specialization by introducing a unified language-model-driven agent. This agent combines semantic parsing and zero-shot matching with an extensible prior-knowledge tree to navigate to the most suitable diffusion model. Additionally, it uses an advantage database updated with human feedback to align model selection with human preferences. Experiments demonstrate that DiffusionAgent maintains high generation quality and significantly broadens prompt coverage, setting a new benchmark for multi-domain image synthesis.

DiffusionAgent旨在解决扩散模型在处理多样提示和专门任务时的局限性。它引入了一个统一的代理，结合了语义解析和专家模型路由，使用树思考方法和可扩展的先验知识树。该系统还包括一个通过人类反馈更新的优势数据库，以使模型选择与美学和语义偏好保持一致。实验结果表明，DiffusionAgent保持了高质量的生成效果，并显著扩展了提示覆盖范围，为多领域图像合成设定了新的基准。

Differentiated Pickup Point Offering for Emission Reduction in Last-Mile Delivery

Authors: Albina Galiullina, Wouter van Heeswijk, Tom van Woensel

First: 2026-01-20T18:00:42+00:00 · Latest: 2026-01-20T18:00:42+00:00

Abstract

Pickup points are widely recognized as a sustainable alternative to home delivery, as consolidating orders at pickup locations can shorten delivery routes and improve first-attempt success rates. However, these benefits may be negated when customers drive to pick up their orders. This study proposes a Differentiated Pickup Point Offering (DPO) policy that aims to jointly reduce emissions from delivery truck routes and customer travel. Under DPO, each arriving customer is offered a single recommended pickup point, rather than an unrestricted choice among all locations, while retaining the option of home delivery. We study this problem in a dynamic and stochastic setting, where the pickup point offered to each customer depends on previously realized customer locations and delivery choices. To design effective DPO policies, we adopt a reinforcement learning-based approach that accounts for spatial relationships between customers and pickup points and their implications for future route consolidation. Computational experiments show that differentiated pickup point offerings can substantially reduce total carbon emissions. The proposed policies reduce total emissions by up to 9% relative to home-only delivery and by 2% on average compared with alternative policies, including unrestricted pickup point choice and nearest pickup point assignment. Differentiated offerings are particularly effective in dense urban settings with many pickup points and short inter-location distances. Moreover, explicitly accounting for the dynamic nature of customer arrivals and choices is especially important when customers are less inclined to choose pickup point delivery over home delivery.

中文标题/摘要

标题：最后一公里配送中差异化取货点提供以减少排放

取货点被广泛认为是可持续的替代家庭送货的选择，因为将订单集中到取货地点可以缩短配送路线并提高首次送货成功率。然而，当客户开车去取货时，这些好处可能会被抵消。本研究提出了一种差异化取货点提供（DPO）政策，旨在同时减少配送卡车路线和客户行程的排放。在DPO下，每个到达的客户只能被推荐一个取货点，而不是在所有地点中自由选择，同时保留家庭送货的选项。我们研究了在动态和随机的环境中这一问题，其中每个客户被提供的取货点取决于之前已实现的客户位置和配送选择。为了设计有效的DPO政策，我们采用了一种基于强化学习的方法，该方法考虑了客户和取货点之间的空间关系及其对未来路线合并的影响。计算实验表明，差异化取货点提供可以显著减少总碳排放。所提出的政策相对于仅家庭送货减少了高达9%的总排放，平均而言，与包括不限制取货点选择和最近取货点分配在内的其他政策相比，减少了2%的排放。差异化提供特别适用于取货点众多且地点间距离较短的密集城市环境。此外，明确考虑客户到达和选择的动态性质在客户更倾向于选择家庭送货而非取货点送货时尤为重要。

Summary / 总结

This study addresses the challenge of reducing emissions in last-mile delivery by proposing a Differentiated Pickup Point Offering (DPO) policy. The policy offers each customer a single recommended pickup point while keeping home delivery available, balancing the route-consolidation benefits of pickup points against the emissions from customers traveling to them. Policies are learned with reinforcement learning in a dynamic, stochastic setting. Computational experiments show that DPO reduces total carbon emissions by up to 9% relative to home-only delivery and by 2% on average compared with alternative policies such as unrestricted pickup point choice and nearest pickup point assignment. DPO is particularly effective in dense urban areas with many pickup points and short distances between them.

本研究提出了一种差异化取货点提供（DPO）政策，旨在减少最后一英里配送中的排放。该政策为每位顾客推荐一个取货点，平衡了货物集中带来的好处和避免不必要的顾客出行的需求。计算实验表明，DPO 可以显著减少总碳排放，与仅在家交付相比最多可减少9%，与其它政策相比平均减少2%。DPO 在密集的城市环境中尤其有效，这些环境中有很多取货点且各点之间的距离较短。

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Authors: Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott

Venue: NeurIPS 2025

First: 2025-04-21T18:12:46+00:00 · Latest: 2026-01-20T17:55:29+00:00

Comments: 37 pages, 19 figures, NeurIPS 2025

Abstract

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity with attention scores. These results imply KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to the other token-eviction methods.

中文标题/摘要

标题：KeyDiff：基于键相似性的KV缓存淘汰方法以应对资源受限环境中的长上下文LLM推理

我们证明，在LLM推理过程中，几何上独特的键往往具有较高的注意力分数。基于这一现象，我们提出了KeyDiff，一种仅基于键相似性的无需训练的KV缓存淘汰方法。与其它KV缓存淘汰方法不同，KeyDiff可以在严格的资源限制下处理任意长的提示，并高效生成响应。我们通过将键多样性与注意力分数联系起来，为KeyDiff提供了理论基础。这些结果表明，KeyDiff可以有效地识别出需要保留的最重要令牌。值得注意的是，KeyDiff不依赖于注意力分数，允许使用优化的注意力机制如FlashAttention。在严格的内存限制下，我们通过在LongBench上观察到Llama 3.1-8B和Llama 3.2-3B模型家族的性能差距小于0.04%，以及使用8K缓存预算（约23%的KV缓存减少）从非淘汰基线中获得的效果，证明了KeyDiff的有效性。我们还观察到，在Deepseek-R1-Distill-Llama-8B模型上，KeyDiff在Math500推理基准测试中的性能接近基线，并将端到端推理延迟降低了高达30%，相比其他令牌淘汰方法。

Summary / 总结

KeyDiff is a training-free KV cache eviction method based solely on key similarity, designed for long-context LLM inference in resource-constrained environments. Because it does not rely on attention scores, it is compatible with optimized attention mechanisms like FlashAttention. On LongBench, KeyDiff shows a performance gap of less than 0.04% from the non-evicting baseline for Llama 3.1-8B and Llama 3.2-3B with an 8K cache budget (about 23% KV cache reduction). It also maintains near baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and reduces end-to-end inference latency by up to 30% compared to other token-eviction methods.

KeyDiff 是一种基于键相似性的训练-free KV 缓存淘汰方法，适用于资源受限环境下的长上下文 LLM 推理。它不依赖于注意力分数，能够高效地识别并保留最重要的令牌，允许使用优化的注意力机制。KeyDiff 在 LongBench 上对 Llama 3.1-8B 和 Llama 3.2-3B 的非淘汰基线具有不到 0.04% 的性能差距，KV 缓存减少了约 23%。它还在 Math500 推理基准测试中接近基线性能，并将端到端推理延迟最多减少 30%。
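
To make the idea of eviction "based solely on key similarity" concrete, here is a minimal sketch in plain Python. It is not the authors' algorithm: scoring distinctiveness as low cosine similarity to the mean key vector is an assumption made for illustration, and the names `select_keys_to_keep` and `budget` are hypothetical. The point it shows is that the retention decision uses only the cached keys and never touches attention scores.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_keys_to_keep(keys: List[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` keys least similar to the mean key.

    Assumption: 'distinctive' means low cosine similarity to the average
    key vector. No attention scores are used.
    """
    if budget >= len(keys):
        return list(range(len(keys)))
    dim = len(keys[0])
    mean = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    scores = [_cosine(k, mean) for k in keys]
    # Lowest similarity first; ties broken by position for determinism.
    ranked = sorted(range(len(keys)), key=lambda i: (scores[i], i))
    return sorted(ranked[:budget])


# Toy cache of four 2-D keys; the last one points in a different direction.
keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [-0.2, 1.0]]
kept = select_keys_to_keep(keys, budget=2)
print(kept)
```

Since the score depends only on the keys, a fused attention kernel that never materializes the attention matrix can still be used for the forward pass.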

Toward Efficient Agents: Memory, Tool learning, and Planning

Authors: Xiaofang Yang, Lijun Li, Heng Zhou, Tong Zhu, Xiaoye Qu, Yuchen Fan, Qianshan Wei, Rui Ye, Li Kang, Yiran Qin, Zhiqiang Kou, Daizong Liu, Qi Li, Ning Ding, Siheng Chen, Jing Shao

First: 2026-01-20T17:51:56+00:00 · Latest: 2026-01-20T17:51:56+00:00

Comments: 35 pages, 200 references

Abstract

Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.

中文标题/摘要

标题：向高效智能体迈进：记忆、工具学习与规划

近年来，人们越来越关注将大型语言模型扩展为智能系统。尽管智能体的有效性不断提高，但对其效率的关注却往往不足，而效率对于实际部署至关重要。因此，本文从智能体的三个核心组件——记忆、工具学习和规划——出发，考虑延迟、令牌、步骤等成本，旨在全面研究智能系统本身的效率。我们回顾了多种不同的近期方法，尽管实现方式不同，但经常共享一些高层次的原则，包括但不限于通过压缩和管理限制上下文，设计强化学习奖励以减少工具调用，以及采用受控搜索机制以提高效率，我们对此进行了详细讨论。因此，我们从两种互补的方式定义效率：在固定成本预算下比较有效性，以及在相似有效性水平下比较成本。这种权衡也可以通过效率和成本之间的帕累托前沿来观察。从这个角度来看，我们还通过总结这些组件的评估协议并汇总基准和方法论研究中常见的效率指标，来考察效率导向的基准。此外，我们讨论了关键挑战和未来方向，旨在提供有价值的见解。

Summary / 总结

This paper investigates the efficiency of agentic systems by focusing on memory, tool learning, and planning. It reviews recent approaches that aim to reduce costs such as latency and tokens, and discusses how to balance effectiveness and cost through Pareto frontiers. The study characterizes efficiency in two ways: comparing effectiveness under a fixed cost budget and comparing cost at a similar level of effectiveness, and it summarizes evaluation protocols and efficiency metrics for these components.

本文探讨了通过关注记忆、工具学习和规划来提高代理系统的效率，考虑了延迟和令牌等因素。研究回顾了通过上下文压缩、最小化工具调用和采用受控搜索机制等方法来提高效率的各种方法。通过在固定成本预算下比较有效性以及在相似有效性水平下的成本来表征效率，并讨论了提高代理系统效率的关键挑战和未来方向。
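
The Pareto-frontier view of the effectiveness/cost trade-off mentioned in the abstract can be illustrated with a short sketch. The agent names, token costs, and accuracy values below are invented for the example and do not come from the survey; `pareto_frontier` is a hypothetical helper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, cost, effectiveness)


def pareto_frontier(points: List[Point]) -> List[str]:
    """Names of configurations not dominated by any other.

    A point is dominated if another has cost <= and effectiveness >=,
    with at least one of the two inequalities strict.
    """
    frontier = []
    for name, cost, eff in points:
        dominated = any(
            (c <= cost and e >= eff) and (c < cost or e > eff)
            for _, c, e in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical agents: cost in tokens per task, effectiveness as accuracy.
agents = [
    ("A", 1000.0, 0.60),   # cheap, moderate accuracy
    ("B", 3000.0, 0.72),   # costlier, more accurate
    ("C", 3500.0, 0.70),   # costs more than B and is less accurate
    ("D", 8000.0, 0.80),   # most accurate, most expensive
]
print(pareto_frontier(agents))
```

Both efficiency definitions in the abstract are slices of this picture: fixing a cost budget and comparing effectiveness reads the frontier vertically, while fixing an effectiveness level and comparing cost reads it horizontally.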

The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

Authors: Sangmitra Madhusudan, Kaige Chen, Ali Emami

First: 2025-10-23T13:30:40+00:00 · Latest: 2026-01-20T17:46:36+00:00

Comments: 9 pages (excluding references), accepted to EACL 2026 Main Conference

Abstract

When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.

中文标题/摘要

标题：追猫的狗难住了模型：测量语言模型何时放弃结构转而使用捷径

当语言模型正确解析“the cat that the dog chased meowed”时，它们是在分析句法结构还是仅仅熟悉狗追猫的情况？尽管进行了广泛的基准测试，但我们缺乏区分结构理解与语义模式匹配的方法。我们引入了CenterBench数据集，包含9,720个关于中心嵌套句子（如“the cat [that the dog chased] meowed”）的阅读理解问题，其中相对从句递归嵌套，从简单到复杂地创建了处理需求。每个句子都有一个句法上相同但语义上不合理的对应句子（例如，邮差开药，医生送信），并有六个测试表面理解、句法依赖性和因果推理的问题。测试六种模型发现，可能句子与不可能句子之间的性能差距随着复杂性的增加而系统性地扩大，模型的中位数差距高达26.8个百分点，量化了它们何时放弃结构分析转而使用语义关联。值得注意的是，语义合理性在关于结果行为的问题上损害了性能，因为遵循因果关系比语义连贯性更重要。推理模型提高了准确性，但它们的推理过程显示了语义捷径、过度思考和拒绝回答。与随复杂性系统性扩大的合理性优势不同，人类在语义效果上表现出变化。CenterBench提供了第一个框架，用于识别模型何时从结构分析转向模式匹配。

Summary / 总结

The study aims to distinguish whether language models understand syntax or rely on semantic patterns by introducing CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences, each paired with a syntactically identical but semantically implausible counterpart. Testing six models, the research finds that performance gaps between plausible and implausible sentences widen systematically with nesting complexity, with median gaps of up to 26.8 percentage points. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models are more accurate, but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, humans show variable semantic effects. CenterBench offers a framework to identify when models shift from structural analysis to pattern matching.

研究旨在通过引入包含9,720个理解问题的CenterBench数据集，区分语言模型是理解语法还是依赖语义模式。测试六种模型后发现，随着复杂性的增加，可能句子和不合逻辑句子之间的性能差距会增大，模型之间的差异最高可达26.8个百分点。值得注意的是，语义合理性在关于结果行为的问题上会负面影响表现，因为因果关系比语义连贯性更重要。研究量化了模型何时放弃结构分析转而使用语义捷径。与模型不同，人类在语义效果上表现出变化。CenterBench提供了一个框架，用于识别模型何时从结构分析转向模式匹配。
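
The "median gap" statistic can be sketched as follows: for each nesting depth, take the accuracy difference between plausible and implausible sentences per model, then the median across models. All accuracy values and model names below are invented for illustration and are not results from the paper; `median_gap_by_depth` is a hypothetical helper.

```python
from statistics import median
from typing import Dict, List

# Hypothetical accuracies (%) per model, indexed by nesting depth 1..3.
plausible: Dict[str, List[float]] = {
    "model_a": [95.0, 90.0, 80.0],
    "model_b": [92.0, 85.0, 70.0],
    "model_c": [97.0, 93.0, 88.0],
}
implausible: Dict[str, List[float]] = {
    "model_a": [93.0, 80.0, 60.0],
    "model_b": [90.0, 72.0, 45.0],
    "model_c": [96.0, 88.0, 75.0],
}


def median_gap_by_depth(plaus: Dict[str, List[float]],
                        implaus: Dict[str, List[float]]) -> List[float]:
    """Median over models of (plausible - implausible) accuracy per depth."""
    depths = len(next(iter(plaus.values())))
    return [
        median(plaus[m][d] - implaus[m][d] for m in plaus)
        for d in range(depths)
    ]


print(median_gap_by_depth(plausible, implausible))
```

A gap that grows with depth, as in this toy data, is the pattern the abstract interprets as models falling back on semantic associations once the structure becomes hard to track.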

IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

Authors: Liang Shi, Wei Li, Kevin M Beussman, Lin Chen, Yun Fu

First: 2026-01-20T17:45:24+00:00 · Latest: 2026-01-20T17:45:24+00:00

Abstract

Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., those where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with persons, faces, pets, and general objects as the target instances.

中文标题/摘要

标题：IIR-VLM：大型视觉语言模型的上下文实例级识别

实例级识别（ILR）涉及区分单个实例，其中人员再识别是一个突出的例子。尽管现代VLMs在视觉感知方面表现出色，但我们发现它们在ILR上的表现令人不满意，通常远逊于专门的ILR模型。这一限制阻碍了许多VLMs的实际应用，例如，在有效视觉理解中识别熟悉的人和物体至关重要。现有解决方案通常使用特定实例的数据集一次学习识别单个实例，这不仅需要大量数据收集和训练成本，而且难以进行精细区分。在本文中，我们提出了一种增强的IIR-VLM，一种用于上下文实例级识别的VLM。我们整合了预训练的ILR专家模型作为辅助视觉编码器，以提供专门的特征来学习多样化的实例，从而使VLMs能够以一次学习的方式在上下文中学习新实例。此外，IIR-VLM利用这些知识进行实例感知的视觉理解。我们在现有的实例个性化基准上验证了IIR-VLM的有效性。最后，我们在一个具有挑战性的新基准上展示了其优越的ILR性能，该基准评估了不同难度和多样类别的ILR能力，其中人员、面部、宠物和通用物体是任务中的实例。

Summary / 总结

The research aims to enhance the instance-level recognition (ILR) capabilities of large vision-language models (VLMs) by addressing their poor performance compared to domain-specific models. The method involves integrating pre-trained ILR expert models as auxiliary encoders to provide specialized features for learning diverse instances, allowing VLMs to recognize new instances in a one-shot manner. Key findings show improved ILR performance on existing benchmarks and a new challenging benchmark, demonstrating the model's effectiveness in fine-grained discrimination across various categories.

研究旨在通过解决大型视觉语言模型（VLM）在实例级识别（ILR）方面的不足，提高其ILR能力，这些模型的表现远逊于领域特定模型。提出的IIR-VLM将预训练的ILR专家模型作为辅助编码器，使VLM能够以单次学习的方式学习新实例。该方法在现有基准和一个新挑战基准上展示了优越的ILR性能，该基准评估了ILR能力在不同难度和多种类别下的表现。

AlphaMapleSAT: An MCTS-based Cube-and-Conquer SAT Solver for Hard Combinatorial Problems

Authors: Piyush Jha, Zhengyu Li, Zhengyang Lu, Raymond Zeng, Curtis Bright, Vijay Ganesh

First: 2024-01-24T19:37:10+00:00 · Latest: 2026-01-20T17:44:29+00:00

Comments: Added more experiments

Abstract

This paper introduces AlphaMapleSAT, a Cube-and-Conquer (CnC) parallel SAT solver that integrates Monte Carlo Tree Search (MCTS) with deductive feedback to efficiently solve challenging combinatorial SAT problems. Traditional lookahead cubing methods, used by solvers such as March, limit their search depth to reduce overhead, often resulting in suboptimal partitions. By contrast, AlphaMapleSAT performs a deeper MCTS search guided by deductive rewards from SAT solvers. This approach enables informed exploration of the cubing space while keeping cubing costs low. We demonstrate the efficacy of our technique via extensive evaluations against the widely used and established March cubing solver on three well-known challenging combinatorial benchmarks, including the minimum Kochen-Specker (KS) problem from quantum mechanics, the Murty-Simon Conjecture, and the Ramsey problems from extremal graph theory. We compare AlphaMapleSAT against March using different types of conquering solvers such as SAT Modulo Symmetries (SMS) and SAT+CAS, both built on top of the CaDiCaL SAT solver. We show that in all cases, there is a speedup in elapsed real time (wall clock time) ranging from 1.61x to 7.57x on a 128 core machine for the above-mentioned problems. We also perform cube-level and parallel scaling analysis over 32, 64, and 128 cores, which shows that AlphaMapleSAT outperforms March on all these settings. Our results show that a deductively guided MCTS search technique for cubing in CnC solvers can significantly outperform March on hard combinatorial problems.

中文标题/摘要

标题：AlphaMapleSAT：一种基于MCTS的Cube-and-Conquer SAT求解器，用于解决棘手的组合问题

本文介绍了AlphaMapleSAT，这是一种结合了蒙特卡洛树搜索（MCTS）和演绎反馈的Cube-and-Conquer（CnC）并行SAT求解器，用于高效解决具有挑战性的组合SAT问题。传统的前瞻立方方法，如March求解器所使用的方法，限制搜索深度以减少开销，通常导致子优化的分区。相比之下，AlphaMapleSAT通过SAT求解器提供的演绎奖励进行更深层次的MCTS搜索。这种方法使立方空间的探索更加明智，同时保持立方成本较低。我们通过在三个著名的具有挑战性的组合基准测试上与广泛使用的March立方求解器进行广泛的比较评估，展示了我们技术的有效性，包括量子力学中的最小Kochen-Specker（KS）问题、Murty-Simon猜想和极值图论中的Ramsey问题。我们使用基于CaDiCaL SAT求解器的SAT Modulo Symmetries（SMS）和SAT+CAS两种征服求解器与March进行比较。结果显示，在上述问题上，AlphaMapleSAT在所有情况下相对于March都具有1.61x到7.57x的实时时间（墙钟时间）加速。我们还进行了立方级别和并行扩展分析，使用32、64和128个核心，结果显示AlphaMapleSAT在所有这些设置上都优于March。我们的结果表明，在CnC求解器中使用演绎指导的MCTS搜索技术可以显著优于March，特别是在棘手的组合问题上。

Summary / 总结

AlphaMapleSAT is a Cube-and-Conquer SAT solver that integrates Monte Carlo Tree Search with deductive feedback to solve hard combinatorial problems more efficiently than traditional lookahead cubing methods. It demonstrates a speedup ranging from 1.61x to 7.57x on a 128-core machine compared to the March cubing solver across various benchmarks, including the minimum Kochen-Specker problem, the Murty-Simon Conjecture, and Ramsey problems. The technique shows significant performance improvements in both cube-level and parallel scaling analyses.

AlphaMapleSAT 是一种结合了蒙特卡洛树搜索 (MCTS) 和演绎反馈的 Cube-and-Conquer (CnC) SAT 求解器，能够比传统的前瞻立方方法更高效地求解困难的组合问题。它在保持立方成本较低的同时更有效地探索立方空间。在三个具有挑战性的基准测试（最小 Kochen-Specker 问题、Murty-Simon 猜想和极值图论中的 Ramsey 问题）上的广泛评估表明，在 128 核机器上，AlphaMapleSAT 相对于 March 的实际运行时间加速比为 1.61 倍到 7.57 倍。立方级别和并行扩展分析也表明 AlphaMapleSAT 在不同核心数设置下均优于 March。
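
To make the Cube-and-Conquer idea concrete: the cubing phase produces a binary split tree, and each leaf corresponds to one cube (a conjunction of literals) that a conquering solver handles independently. The sketch below only enumerates cubes from a given tree. The tree shape, the variable numbers and the `cubes` helper are illustrative assumptions; how AlphaMapleSAT chooses splitting variables via MCTS is not reproduced here.

```python
from typing import Iterator, Optional, Tuple

# A split-tree node is (variable, subtree_if_true, subtree_if_false).
# None marks a leaf, i.e. a point where cubing stops.
Tree = Optional[Tuple[int, "Tree", "Tree"]]


def cubes(tree: Tree, prefix: Tuple[int, ...] = ()) -> Iterator[Tuple[int, ...]]:
    """Yield one cube per leaf, as a tuple of DIMACS-style signed literals."""
    if tree is None:
        yield prefix
        return
    var, if_true, if_false = tree
    yield from cubes(if_true, prefix + (var,))
    yield from cubes(if_false, prefix + (-var,))


# Hypothetical split: branch on x3 first, then on x7 only where x3 is true.
example_tree: Tree = (3, (7, None, None), None)
example_cubes = list(cubes(example_tree))
```

Each resulting cube would be handed to a separate solver instance as a set of assumptions, which is what makes the conquer phase parallel.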

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

Authors: Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Hendrik P. A. Lensch, Nassir Navab, Federico Tombari

First: 2024-12-13T16:01:19+00:00 · Latest: 2026-01-20T17:27:32+00:00

Comments: 13 pages, 8 figures. Project page: supergseg.github.io

Abstract

3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While its vanilla representation is mainly designed for view synthesis, recent works extended it to scene understanding with language features. However, storing additional high-dimensional features per Gaussian for semantic information is memory-intensive, which limits their ability to segment and interpret challenging scenes. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware hierarchical scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural 3D Gaussians to learn geometry, instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of Super-Gaussians. Super-Gaussians facilitate the lifting and distillation of 2D language features into 3D space. They enable hierarchical scene understanding with high-dimensional language feature rendering at moderate GPU memory costs. Extensive experiments demonstrate that SuperGSeg achieves remarkable performance on both open-vocabulary object selection and semantic segmentation tasks.

Summary / 总结

SuperGSeg introduces a novel approach for 3D segmentation using structured Super-Gaussians to facilitate hierarchical scene representation. It leverages neural 3D Gaussians to learn geometry and segmentation features from multi-view images, and then distills 2D language features into 3D space through Super-Gaussians. The method achieves significant performance on open-vocabulary object selection and semantic segmentation tasks with moderate GPU memory usage.

SuperGSeg提出了一种使用结构化Super-Gaussians高效将2D语言特征提升到3D空间的新方法，通过分离分割和语言场提炼。该方法利用多视图图像和现成的2D掩码学习几何、实例和层次分割特征。实验表明，SuperGSeg在开放词汇对象选择和语义分割任务上取得了显著性能，同时使用了适度的GPU内存。

Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Authors: Víctor Yeste, Paolo Rosso

First: 2026-01-20T17:25:33+00:00 · Latest: 2026-01-20T17:25:33+00:00

Comments: Code: https://github.com/VictorMYeste/human-value-detection, 37 pages, 4 figures

Abstract

We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.

Summary / 总结

This study focuses on identifying the 19 values in the Schwartz motivational continuum from out-of-context sentences, which are challenging due to sparse moral cues and class imbalance. The research first addresses a binary task of detecting any moral presence and then compares a hierarchy-based model with a direct multi-label classifier. While the hierarchy does not outperform the direct approach, lightweight signals and small ensembles improve performance. A soft-vote supervised ensemble achieves a macro-F1 score of 0.332, surpassing previous English-only baselines. The study suggests that carefully tuned supervised encoders are effective and efficient under resource constraints, and richer value structures could further enhance results.

研究聚焦于从孤立句子中识别Schwartz动机连续体中的19种价值观，由于稀疏的道德线索和类别不平衡，这一任务极具挑战性。研究首先处理了一个二元任务，即检测任何价值观的存在，表明可以从单个句子中学习。然后比较了基于存在门控的层次结构与直接多标签分类器，发现层次结构没有显著优势。研究还对指令调优的大语言模型进行了基准测试，并构建了集成模型，发现软投票的监督集成模型达到了最佳的宏F1分数，超过了之前的基线。研究结果表明，在此情境下，轻量级信号和小型集成比层次门控更具有效性。
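
As an illustration of the soft-vote ensemble and the macro-F1 metric mentioned in the abstract, the sketch below averages per-label probabilities from several multi-label classifiers and thresholds the mean. The function names, the 0.5 threshold and the toy probabilities are assumptions for illustration, not the authors' configuration.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows are sentences, columns are value labels


def soft_vote(prob_sets: Sequence[Matrix], threshold: float = 0.5) -> List[List[int]]:
    """Average per-label probabilities across models, then threshold the mean."""
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_labels = len(prob_sets[0][0])
    preds = []
    for i in range(n_samples):
        row = []
        for j in range(n_labels):
            mean_p = sum(model[i][j] for model in prob_sets) / n_models
            row.append(1 if mean_p >= threshold else 0)
        preds.append(row)
    return preds


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1; a label with no support and no predictions scores 0."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy probabilities from two hypothetical models: 2 sentences, 2 labels.
model_a = [[0.9, 0.2], [0.4, 0.7]]
model_b = [[0.7, 0.4], [0.2, 0.5]]
ensemble_preds = soft_vote([model_a, model_b])
```

Because macro-F1 weights every label equally, rare values count as much as frequent ones, which is why it is a demanding metric under severe class imbalance.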

Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance

Authors: Qianli Ma, Chang Guo, Zhiheng Tian, Siyu Wang, Jipeng Xiao, Yuanhao Yue, Zhipeng Zhang

First: 2026-01-20T17:23:51+00:00 · Latest: 2026-01-20T17:23:51+00:00

Abstract

Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce RebuttalAgent, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous and on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed RebuttalBench and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.

中文标题/摘要

标题：Paper2Rebuttal: 一种透明作者回应辅助的多智能体框架

撰写有效的反驳是一项高风险的任务，不仅需要语言流畅性，还需要精确对齐审稿人意图和手稿细节。当前解决方案通常将此视为直接的文本生成问题，容易出现幻觉、遗漏批评和缺乏可验证的依据。为解决这些局限性，我们引入了**RebuttalAgent**，这是第一个将反驳生成重新定义为以证据为中心的规划任务的多智能体框架。我们的系统将复杂的反馈分解为基本的关注点，并通过合成压缩摘要与高保真文本来动态构建混合上下文，同时整合一个自主的按需外部搜索模块以解决需要外部文献的问题。通过在起草前生成可检查的回应计划，**RebuttalAgent** 确保每个论点都明确地锚定在内部或外部证据中。我们在提出的**RebuttalBench**上验证了我们的方法，并证明我们的管道在覆盖面、忠实度和战略连贯性方面优于强大的基线，为同行评审过程提供了一个透明和可控的助手。代码将被发布。

Summary / 总结

The paper addresses the challenges of writing effective rebuttals by introducing RebuttalAgent, a multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. It decomposes feedback into atomic concerns and dynamically constructs hybrid contexts using compressed summaries and an external search module. Experiments show that RebuttalAgent outperforms strong baselines in coverage, faithfulness, and strategic coherence, providing a transparent and controllable assistant for the peer review process.

论文通过引入RebuttalAgent框架，解决生成有效反驳的挑战，该框架将任务重新定义为以证据为中心的规划过程。它将反馈分解为基本问题，并动态构建混合上下文，结合压缩摘要和外部搜索模块。系统确保每个论点都基于证据，实验结果表明，RebuttalAgent在覆盖面、忠实性和战略连贯性方面优于强基线，提供了一个透明且可控的同行评审助手。

WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring

Authors: Reza Riahi Samani, Alfredo Nunez, Bart De Schutter

First: 2025-07-17T10:14:20+00:00 · Latest: 2026-01-20T17:19:43+00:00

Comments: Under review at the journal Engineering Applications of Artificial Intelligence

Abstract

This paper presents a deep learning framework for analyzing on-board vibration response signals in infrastructure health monitoring. The proposed WaveletInception-BiGRU network uses a Learnable Wavelet Packet Transform (LWPT) for early spectral feature extraction, followed by one-dimensional Inception-Residual Network (1D Inception-ResNet) modules for multi-scale, high-level feature learning. Bidirectional Gated Recurrent Unit (BiGRU) modules then integrate temporal dependencies and incorporate operational conditions, such as the measurement speed. This approach enables effective analysis of vibration signals recorded at varying speeds, eliminating the need for explicit signal preprocessing. The sequential estimation head further leverages bidirectional temporal information to produce an accurate, localized assessment of infrastructure health. Ultimately, the framework generates high-resolution health profiles spatially mapped to the physical layout of the infrastructure. Case studies involving track stiffness regression and transition zone classification using real-world measurements demonstrate that the proposed framework significantly outperforms state-of-the-art methods, underscoring its potential for accurate, localized, and automated on-board infrastructure health monitoring.

Summary / 总结

This paper introduces a deep learning framework for infrastructure health monitoring using on-board vibration signals. The WaveletInception-BiGRU network employs a Learnable Wavelet Packet Transform for spectral feature extraction, followed by 1D Inception-ResNet modules for multi-scale feature learning. Bidirectional Gated Recurrent Units then integrate temporal dependencies and operational conditions. The framework generates high-resolution health profiles and outperforms existing methods in track stiffness regression and transition zone classification.

本文提出了一种基于振动信号的深度学习框架，用于车载基础设施健康监测。WaveletInception-BiGRU网络采用可学习小波包变换进行早期频谱特征提取，随后通过1D Inception-ResNet模块进行多尺度特征学习。双向门控循环单元则整合了时间依赖性和操作条件。该框架能够有效分析不同速度下的振动信号，无需显式预处理，并生成高分辨率的健康状况图谱。案例研究显示，该方法在轨道刚度回归和过渡区分类中优于现有方法。

ASBA: A-line State Space Model and B-line Attention for Sparse Optical Doppler Tomography Reconstruction

Authors: Zhenghong Li, Wensheng Cheng, Congwu Du, Yingtian Pan, Zhaozheng Yin, Haibin Ling

First: 2026-01-20T17:17:02+00:00 · Latest: 2026-01-20T17:17:02+00:00

Comments: 17 pages, 11 figures

Abstract

Optical Doppler Tomography (ODT) is an emerging blood flow analysis technique. A 2D ODT image (B-scan) is generated by sequentially acquiring 1D depth-resolved raw A-scans (A-line) along the lateral axis (B-line), followed by Doppler phase-subtraction analysis. To ensure high-fidelity B-scan images, current practices rely on dense sampling, which prolongs scanning time, increases storage demands, and limits the capture of rapid blood flow dynamics. Recent studies have explored sparse sampling of raw A-scans to alleviate these limitations, but their effectiveness is hindered by the conservative sampling rates and the uniform modeling of flow and background signals. In this study, we introduce a novel blood flow-aware network, named ASBA (A-line ROI State space model and B-line phase Attention), to reconstruct ODT images from highly sparsely sampled raw A-scans. Specifically, we propose an A-line ROI state space model to extract sparsely distributed flow features along the A-line, and a B-line phase attention to capture long-range flow signals along each B-line based on phase difference. Moreover, we introduce a flow-aware weighted loss function that encourages the network to prioritize the accurate reconstruction of flow signals. Extensive experiments on real animal data demonstrate that the proposed approach clearly outperforms existing state-of-the-art reconstruction methods.

中文标题/摘要

标题：ASBA：用于稀疏光学多普勒断层成像重建的A线状态空间模型与B线注意力

光学多普勒断层成像（ODT）是一种新兴的血液流分析技术。通过沿横向轴（B线）顺序获取1D深度解析原始A扫描（A线），并进行多普勒相位减法分析，生成2D ODT图像（B扫描）。为了确保高保真B扫描图像，当前做法依赖密集采样，这延长了扫描时间，增加了存储需求，并限制了快速血液流动力学的捕捉。最近的研究探索了原始A扫描的稀疏采样以缓解这些限制，但其效果受限于保守的采样率和对流和背景信号的统一建模。在本研究中，我们提出了一种名为ASBA（沿A线的区域兴趣状态空间模型和沿B线的相位注意力）的新血液流感知网络，以从高度稀疏采样的原始A扫描中重建ODT图像。具体而言，我们提出了一种沿A线的区域兴趣状态空间模型来提取沿A线分布的稀疏流特征，并提出了一种基于相位差的沿B线的相位注意力来捕捉每个B线上的长程流信号。此外，我们引入了一种流感知加权损失函数，以鼓励网络优先重建流信号。在真实动物数据上的广泛实验表明，所提出的方法明显优于现有的最先进的重建方法。

Summary / 总结

The study aims to improve the reconstruction of Optical Doppler Tomography (ODT) images by addressing the limitations of dense sampling, such as prolonged scanning time and increased storage demands. The proposed ASBA method uses an A-line ROI state space model to extract flow features and a B-line phase attention mechanism to capture long-range flow signals. The approach also includes a flow-aware weighted loss function to prioritize the accurate reconstruction of flow signals. Experiments on real animal data show that ASBA outperforms existing methods in reconstructing ODT images from sparsely sampled raw A-scans.

研究旨在通过使用稀疏采样的原始A扫描来提高光学多普勒断层成像（ODT）图像的重建质量，从而减少扫描时间和存储需求。提出的ASBA网络包括一个A线ROI状态空间模型用于提取流信号特征，以及一个B线相位注意力机制用于捕捉沿每个B线的长程流信号。实验表明，ASBA在从稀疏数据重建ODT图像方面优于现有方法。
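
The abstract mentions Doppler phase-subtraction between A-lines without spelling it out. A common formulation, assumed here rather than taken from the paper, computes the phase of the product of one complex A-line with the conjugate of its lateral neighbour at each depth sample. The function name and the toy data are hypothetical.

```python
import cmath
from typing import List


def doppler_phase(a_lines: List[List[complex]]) -> List[List[float]]:
    """Phase difference between each pair of adjacent complex A-lines, per depth sample.

    a_lines[n][z] is the complex signal of A-line n at depth index z.
    The result has one row fewer than the input.
    """
    return [
        [cmath.phase(nxt * cur.conjugate()) for cur, nxt in zip(line, next_line)]
        for line, next_line in zip(a_lines, a_lines[1:])
    ]


# Toy B-scan: two A-lines, two depth samples, each rotating by a quarter turn.
toy_scan = [[1 + 0j, 1j], [1j, -1 + 0j]]
toy_phase = doppler_phase(toy_scan)
```

With sparse sampling, neighbouring A-lines are farther apart, so this pairwise phase estimate becomes less reliable; that is the gap a learned reconstruction is meant to close.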

TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection

Authors: Xinqi Xiong, Prakrut Patel, Qingyuan Fan, Amisha Wadhwa, Sarathy Selvam, Xiao Guo, Luchao Qi, Xiaoming Liu, Roni Sengupta

First: 2025-05-30T17:59:08+00:00 · Latest: 2026-01-20T17:13:06+00:00

Comments: WACV2026

Abstract

The rapid advancement of talking-head deepfake generation fueled by advanced generative models has elevated the realism of synthetic videos to a level that poses substantial risks in domains such as media, politics, and finance. However, current benchmarks for deepfake talking-head detection fail to reflect this progress, relying on outdated generators and offering limited insight into model robustness and generalization. We introduce TalkingHeadBench, a comprehensive multi-model multi-generator benchmark and curated dataset designed to evaluate the performance of state-of-the-art detectors on the most advanced generators. Our dataset includes deepfakes synthesized by leading academic and commercial models and features carefully constructed protocols to assess generalization under distribution shifts in identity and generator characteristics. We benchmark a diverse set of existing detection methods, including CNNs, vision transformers, and temporal models, and analyze their robustness and generalization capabilities. In addition, we provide error analysis using Grad-CAM visualizations to expose common failure modes and detector biases. TalkingHeadBench is hosted on https://huggingface.co/datasets/luchaoqi/TalkingHeadBench with open access to all data splits and protocols. Our benchmark aims to accelerate research towards more robust and generalizable detection models in the face of rapidly evolving generative techniques.

中文标题/摘要

标题：TalkingHeadBench：说话人头部深度伪造检测的多模态基准与分析

随着先进生成模型的快速发展，对话头深度伪造生成技术的逼真度达到了前所未有的水平，这在媒体、政治和金融等领域带来了重大风险。然而，当前的深度伪造对话头检测基准未能反映这一进展，依赖于过时的生成器，提供的模型鲁棒性和泛化能力的见解有限。我们提出了TalkingHeadBench，这是一个全面的多模型多生成器基准和数据集，旨在评估最先进的检测器在最先进生成器上的性能。我们的数据集包括由领先学术和商业模型合成的深度伪造，并包含精心构建的协议，以评估身份和生成器特征分布变化下的泛化能力。我们对现有的多种检测方法进行了基准测试，包括CNN、视觉变换器和时序模型，并分析了它们的鲁棒性和泛化能力。此外，我们还提供了Grad-CAM可视化错误分析，以揭示常见的失败模式和检测器偏差。TalkingHeadBench托管在https://huggingface.co/datasets/luchaoqi/TalkingHeadBench，所有数据分割和协议均开放访问。我们的基准旨在加速研究，以应对快速发展的生成技术，推动更鲁棒和泛化的检测模型的发展。

Summary / 总结

The research aims to address the growing threat of realistic talking-head deepfakes by developing a new benchmark, TalkingHeadBench, which includes deepfakes generated by advanced models. The method involves creating a comprehensive dataset with diverse deepfakes and protocols to evaluate the robustness and generalization of existing detection methods. Key findings show that current detectors struggle with advanced deepfakes, highlighting the need for more robust models. Error analysis using Grad-CAM visualizations reveals common failure modes and biases in these detectors.

TalkingHeadBench 是一个全面的基准和数据集，用于评估深伪唇动检测方法。该基准旨在应对深伪日益逼真的问题，包含了由领先学术和商业模型生成的深伪。通过精心设计的协议，它评估了包括 CNN、视觉变压器和时序模型在内的各种检测方法的鲁棒性和泛化能力。关键发现表明，当前方法在处理身份和生成器特征分布变化方面存在局限性，并且错误分析揭示了常见的失败模式和检测偏见。

One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

Authors: Yitong Dong, Qi Zhang, Minchao Jiang, Zhiqiang Wu, Qingnan Fan, Ying Feng, Huaqi Zhang, Hujun Bao, Guofeng Zhang

First: 2026-01-20T17:11:55+00:00 · Latest: 2026-01-20T17:11:55+00:00

Abstract

We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

中文标题/摘要

标题：一次性精炼器：通过一步扩散增强前馈新颖视图合成

我们提出了一种新颖框架，用于从稀疏图像中生成高保真新颖视图（NVS），解决了基于视觉变换器（ViT）骨干的近期前馈3D高斯点积（3DGS）方法的关键限制。虽然基于ViT的流水线提供了强大的几何先验，但由于计算成本，它们往往受限于低分辨率输入。此外，现有的生成增强方法通常对3D缺乏感知，导致不同视图之间结构不一致，尤其是在未见区域。为克服这些挑战，我们设计了双域细节感知模块，该模块能够在不被ViT骨干限制的情况下处理高分辨率图像，并赋予高斯分布额外特征以存储高频细节。我们开发了一种特征引导的扩散网络，在恢复过程中可以保留高频细节。我们引入了一种统一的训练策略，使基于ViT的几何骨干和基于扩散的精炼模块能够联合优化。实验表明，我们的方法可以在多个数据集中保持优越的生成质量。

Summary / 总结

The research aims to enhance the quality of novel view synthesis (NVS) from sparse images by addressing limitations in feed-forward 3D Gaussian Splatting (3DGS) methods based on Vision Transformer (ViT) backbones. The proposed framework includes a Dual-Domain Detail Perception Module and a feature-guided diffusion network, which together allow handling high-resolution images and preserving high-frequency details. Experiments show that the method can maintain superior generation quality across multiple datasets.

本文提出了一种新颖的框架，用于从稀疏图像中进行高保真新视角合成（NVS），解决了近期基于Vision Transformer（ViT）骨干的前馈3D高斯泼溅（3DGS）方法的局限性。作者提出了一种双域细节感知模块来处理高分辨率图像，并开发了一种特征引导的扩散网络，在恢复过程中保留高频细节。通过统一的训练策略联合优化了基于ViT的几何骨干和基于扩散的精炼模块。实验表明，所提出的方法在多个数据集上保持了优越的生成质量。

Dynamics of Agentic Loops in Large Language Models: A Geometric Theory of Trajectories

Authors: Nicolas Tacheny

First: 2025-12-11T07:06:14+00:00 · Latest: 2026-01-20T16:35:18+00:00

Abstract

Agentic systems built on large language models operate through recursive feedback loops, where each output becomes the next input. Yet the geometric behavior of these agentic loops (whether they converge, diverge, or exhibit more complex dynamics) remains poorly understood. This paper introduces a geometric framework for analyzing agentic trajectories in semantic embedding space, treating iterative transformations as discrete dynamical systems. We distinguish the artifact space, where linguistic transformations occur, from the embedding space, where geometric measurements are performed. Because cosine similarity is biased by embedding anisotropy, we introduce an isotonic calibration that eliminates systematic bias and aligns similarities with human semantic judgments while preserving high local stability. This enables rigorous measurement of trajectories, clusters and attractors. Through controlled experiments on singular agentic loops, we identify two fundamental regimes. A contractive rewriting loop converges toward a stable attractor with decreasing dispersion, while an exploratory summarize and negate loop produces unbounded divergence with no cluster formation. These regimes display qualitatively distinct geometric signatures of contraction and expansion. Our results show that prompt design directly governs the dynamical regime of an agentic loop, enabling systematic control of convergence, divergence and trajectory structure in iterative LLM transformations.

Summary / 总结

This paper explores the geometric behavior of agentic loops in large language models, which operate through recursive feedback. It introduces a geometric framework to analyze these loops in semantic embedding space, distinguishing between artifact and embedding spaces. The study identifies two fundamental regimes: contractive loops converge to a stable attractor, while exploratory loops diverge without forming clusters. These regimes are characterized by distinct geometric signatures of contraction and expansion, showing that prompt design directly influences the dynamical regime of agentic loops.

该论文探讨了大型语言模型中通过递归反馈运行的代理循环的几何行为。引入了一种几何框架来分析这些循环在语义嵌入空间中的轨迹，并区分了工件（artifact）空间和嵌入空间。关键发现包括两种基本模式：收缩式循环向稳定吸引子收敛，而探索式循环发散且不形成簇。提示设计直接决定循环的动力学模式，从而可以控制收敛和发散。
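
The isotonic calibration mentioned in the abstract can be illustrated with the pool-adjacent-violators algorithm, which fits a non-decreasing map from raw cosine similarity to a reference score. This is a generic sketch of isotonic regression, not the paper's calibration procedure. The function name and the similarity and judgment pairs are made up for illustration.

```python
from typing import List, Sequence


def isotonic_fit(y: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit to y, where y is already ordered by its x value."""
    blocks: List[List[float]] = []  # each block stores [sum of values, count]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Hypothetical (cosine similarity, human judgment) pairs, sorted by similarity.
pairs = sorted([(0.62, 0.1), (0.70, 0.4), (0.75, 0.3), (0.90, 0.9)])
calibrated = isotonic_fit([score for _, score in pairs])
```

The fit preserves the ordering of similarities while removing local inversions, which is the property that makes it suitable for correcting a systematically biased similarity scale.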

Towards Fast Coarse-graining and Equation Discovery with Foundation Inference Models

Authors: Manuel Hinz, Maximilian Mauel, Patrick Seifner, David Berghaus, Kostadin Cvejoski, Ramses J. Sanchez

First: 2025-10-14T15:17:23+00:00 · Latest: 2026-01-20T16:34:49+00:00

Abstract

High-dimensional recordings of dynamical processes are often characterized by a much smaller set of effective variables, evolving on low-dimensional manifolds. Identifying these latent dynamics requires solving two intertwined problems: discovering appropriate coarse-grained variables and simultaneously fitting the governing equations. Most machine learning approaches tackle these tasks jointly by training autoencoders together with models that enforce dynamical consistency. We propose to decouple the two problems by leveraging the recently introduced Foundation Inference Models (FIMs). FIMs are pretrained models that estimate the infinitesimal generators of dynamical systems (e.g., the drift and diffusion of a stochastic differential equation) in zero-shot mode. By amortizing the inference of the dynamics through a FIM with frozen weights, and training only the encoder-decoder map, we define a simple, simulation-consistent loss that stabilizes representation learning. A proof of concept on a stochastic double-well system with semicircle diffusion, embedded into synthetic video data, illustrates the potential of this approach for fast and reusable coarse-graining pipelines.

Summary / 总结

The research aims to develop a fast and efficient method for coarse-graining high-dimensional dynamical processes and discovering their governing equations. The authors propose using Foundation Inference Models (FIMs) to decouple the discovery of coarse-grained variables from the fitting of governing equations. By leveraging pretrained FIMs to estimate the dynamics and training only the encoder-decoder map, they achieve a simulation-consistent loss that stabilizes representation learning. The method is demonstrated on a stochastic double-well system, showing potential for fast and reusable coarse-graining pipelines.

研究旨在开发一种快速高效的方法来粗粒化高维动力过程并发现其支配方程。作者提出使用基础推理模型（FIMs）将粗粒化变量的发现与支配方程的拟合任务分离。通过利用预训练的FIMs估计动力学，并仅训练编码-解码映射，他们实现了与模拟一致的损失，从而稳定表示学习。该方法在随机双阱系统上进行了演示，展示了快速和可重用粗粒化管道的潜力。
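
For readers unfamiliar with the test system, a stochastic double-well process can be simulated with the Euler-Maruyama scheme. The sketch below uses the drift x - x^3 and a constant diffusion coefficient. Both are simplifying assumptions: the paper's semicircle diffusion and its video embedding are not reproduced, and the function name and parameter values are hypothetical.

```python
import math
import random
from typing import List


def simulate_double_well(
    x0: float, dt: float, n_steps: int, sigma: float = 0.5, seed: int = 0
) -> List[float]:
    """Euler-Maruyama path of dX = (X - X^3) dt + sigma dW, with stable wells at x = -1 and x = +1."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        drift = x - x ** 3
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x + drift * dt + noise)
    return path


# With the noise switched off, the path relaxes into the nearest well.
noiseless = simulate_double_well(0.5, dt=0.01, n_steps=2000, sigma=0.0)
noisy = simulate_double_well(0.5, dt=0.01, n_steps=2000)
```

A coarse-graining pipeline of the kind described would see only high-dimensional observations of such a path and would have to recover the one-dimensional variable and its drift and diffusion.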

Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Laser Powder Bed Fusion

Authors: R. Sharma, M. Raissi, Y. B. Guo

First: 2025-06-25T15:25:01+00:00 · Latest: 2026-01-20T16:30:50+00:00

Comments: Further investigation revealed that the current version reflects an incomplete formulation and limited validation of the proposed method. We have since developed a substantially revised and extended study with updated assumptions and results, and therefore withdraw this version to prevent citation of superseded findings

Abstract

Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process prediction due to the lasting issue of high computation cost using traditional numerical methods such as finite element analysis (FEA). This study presents an efficient modeling framework termed FEA-Regulated Physics-Informed Neural Network (FEA-PINN) to accelerate the thermal field prediction in a LPBF process while maintaining the FEA accuracy. A novel dynamic material updating strategy is developed to capture the dynamic phase change of powder-liquid-solid in the PINN model. The PINN model incorporates temperature-dependent material properties and phase change behavior using the apparent heat capacity method. While the PINN model demonstrates high accuracy with a small training data and enables generalization of new process parameters via transfer learning, it faces the challenge of high computation cost in time-dependent problems due to the residual accumulation. To overcome this issue, the FEA-PINN framework integrates corrective FEA simulations during inference to enforce physical consistency and reduce error drift. A comparative analysis shows that FEA-PINN achieves equivalent accuracy to FEA while significantly reducing computational cost. The framework has been validated using the benchmark FEA data and demonstrated through single-track scanning in LPBF.

中文标题/摘要

标题：基于有限元分析调节的物理信息机器学习用于激光粉末床融合仿真加速

激光粉末床融合（LPBF）的高效仿真对于过程预测至关重要，因为传统数值方法如有限元分析（FEA）带来的高昂计算成本持续存在。本研究提出了一种称为FEA调节的物理信息神经网络（FEA-PINN）的高效建模框架，以加速LPBF过程中的热场预测，同时保持FEA的准确性。开发了一种新的动态材料更新策略，以捕捉PINN模型中的粉末-液体-固体动态相变。PINN模型使用显热容方法结合温度依赖的材料属性和相变行为。尽管PINN模型具有小训练数据的高精度，并且通过迁移学习能够泛化新的工艺参数，但它在时间依赖问题中面临高计算成本的挑战，因为残差积累。为克服这一问题，FEA-PINN框架在推理过程中结合了校正的FEA仿真，以确保物理一致性并减少误差漂移。比较分析表明，FEA-PINN在计算成本显著降低的情况下实现了与FEA相当的精度。该框架已使用基准FEA数据进行了验证，并通过LPBF中的单道扫描进行了演示。

Summary / 总结

This study introduces an efficient modeling framework called FEA-Regulated Physics-Informed Neural Network (FEA-PINN) to accelerate thermal field prediction in Laser Powder Bed Fusion (LPBF) processes while maintaining FEA accuracy. The framework uses a dynamic material updating strategy and the apparent heat capacity method to capture phase changes. Although the PINN model is accurate with minimal training data and supports transfer learning, it suffers from high computational costs in time-dependent problems due to residual accumulation. The FEA-PINN framework integrates corrective FEA simulations during inference to ensure physical consistency and reduce error drift, achieving equivalent accuracy to FEA while significantly reducing computational cost. The framework has been validated using benchmark FEA data and demonstrated in single-track scanning in LPBF.

研究旨在在保持精度的同时加速激光粉末床融合（LPBF）的热场预测。引入了FEA调节的物理信息神经网络（FEA-PINN）框架，该框架使用动态材料更新策略和表观热容法来捕捉相变。PINN模型在少量训练数据下表现出高精度，并支持迁移学习，但由于残差累积，它在时间依赖问题中面临高计算成本。为解决这一问题，框架在推理过程中整合了校正性的FEA仿真，以确保物理一致性并减少误差漂移。比较分析表明，FEA-PINN达到了与FEA相当的精度，同时显著降低了计算成本。该框架已使用基准FEA数据进行验证，并通过LPBF单道扫描进行了演示。
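
The apparent heat capacity method named in the abstract folds latent heat into an effective specific heat over the melting range. The sketch below uses the common form c_app(T) = c_p(T) + L / (T_liquidus - T_solidus) inside the mushy zone, with linear mixing of the solid and liquid values. This form, the function name and all material values are assumptions, not the authors' model.

```python
def apparent_heat_capacity(
    temperature: float,
    cp_solid: float,
    cp_liquid: float,
    latent_heat: float,
    t_solidus: float,
    t_liquidus: float,
) -> float:
    """Effective specific heat in J/(kg K), with latent heat spread uniformly over the melting range."""
    if temperature < t_solidus:
        return cp_solid
    if temperature > t_liquidus:
        return cp_liquid
    melt_range = t_liquidus - t_solidus
    liquid_fraction = (temperature - t_solidus) / melt_range
    cp_mix = (1.0 - liquid_fraction) * cp_solid + liquid_fraction * cp_liquid
    return cp_mix + latent_heat / melt_range


# Hypothetical alloy: melting between 1500 K and 1600 K, latent heat 2.5e5 J/kg.
cp_mushy = apparent_heat_capacity(1550.0, 500.0, 500.0, 2.5e5, 1500.0, 1600.0)
```

The sharp rise of the effective heat capacity inside the melting range is what makes the thermal problem stiff, and it is one reason a purely learned surrogate can drift over long time horizons.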

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Authors: Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, Kai Chen

First: 2026-01-20T16:30:07+00:00 · Latest: 2026-01-20T16:30:07+00:00

Comments: GitHub: https://github.com/ZGC-EmbodyAI/TwinBrainVLA

Abstract

Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM retaining universal semantic understanding and a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that simultaneously achieve high-level semantic understanding and low-level physical dexterity.

中文标题/摘要

标题：TwinBrainVLA：通过非对称混合Transformer释放通用VLM在具身任务中的潜力

标准的视觉-语言-动作（VLA）模型通常会针对机器人控制明确微调一个统一的视觉-语言模型（VLM）主干。然而，这种方法在保持高层次的通用语义理解与学习低层次的精细传感器运动技能之间造成了关键的紧张关系，经常导致模型对开放世界能力的“灾难性遗忘”。为了解决这一冲突，我们提出了TwinBrainVLA，这是一种新颖的架构，它协调了一个保留通用语义理解的通用VLM和一个专注于体感 proprioception 的专门VLM，以实现联合机器人控制。TwinBrainVLA 通过一种新颖的不对称混合变换器（AsyMoT）机制，将一个冻结的“左脑”，保留了稳健的通用视觉推理能力，与一个专门用于体感感知的可训练“右脑”相结合。这种设计允许右脑动态查询冻结的左脑的语义知识，并将其与体感状态融合，为流动匹配动作专家生成精确的连续控制提供丰富的条件。在 SimplerEnv 和 RoboCasa 基准测试中的广泛实验表明，TwinBrainVLA 在操纵性能上优于最先进的基线模型，同时明确保留了预训练VLM的全面视觉理解能力，为同时实现高层次语义理解和低层次物理灵巧性的通用机器人提供了有希望的方向。

Summary / 总结

TwinBrainVLA addresses the challenge of maintaining both high-level semantic understanding and low-level sensorimotor skills in robotic control by introducing a dual-VLM architecture. It uses a frozen 'Left Brain' for general visual reasoning and a trainable 'Right Brain' specialized for embodied perception, connected via an Asymmetric Mixture-of-Transformers mechanism. This design allows the Right Brain to query semantic knowledge from the Left Brain and generate precise controls, demonstrating superior manipulation performance compared to existing methods while preserving comprehensive visual understanding.

TwinBrainVLA通过引入双VLM架构解决了保持高阶语义理解和低阶传感器运动技能之间的挑战。它使用一个冻结的‘左脑’进行通用视觉推理和一个专门训练的‘右脑’进行体感感知，并通过不对称混合变换器机制连接。这种设计允许右脑从左脑查询语义知识并生成精确控制，实验结果表明其在SimplerEnv和RoboCasa基准测试中表现出色，同时保留了全面的视觉理解能力。

GIC-DLC: Differentiable Logic Circuits for Hardware-Friendly Grayscale Image Compression

Authors: Till Aczel, David F. Jenny, Simon Bührer, Andreas Plesner, Antonio Di Maio, Roger Wattenhofer

First: 2026-01-20T16:29:23+00:00 · Latest: 2026-01-20T16:29:23+00:00

Abstract

Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.
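
The abstract states only that lookup tables are trained. As a rough illustration of how a lookup table can be made differentiable, the sketch below relaxes a 2-input Boolean table into a multilinear interpolation over soft inputs, with one trainable logit per table entry. The names `soft_lut2` and `harden`, the 2-input table size, and the sigmoid parameterization are assumptions made for illustration, not the authors' design.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_lut2(a, b, logits):
    """Differentiable 2-input lookup table (hypothetical relaxation).

    a, b   : soft input bits in [0, 1]
    logits : four trainable values for table entries (a, b) = 00, 01, 10, 11
    Returns the multilinear interpolation of the four soft table entries.
    """
    t00, t01, t10, t11 = (sigmoid(l) for l in logits)
    return ((1 - a) * (1 - b) * t00 + (1 - a) * b * t01
            + a * (1 - b) * t10 + a * b * t11)


def harden(logits):
    """Discretize trained logits into a Boolean truth table for deployment."""
    return [1 if l > 0 else 0 for l in logits]


# Logits that have (hypothetically) converged to an XOR gate.
xor_logits = [-20.0, 20.0, 20.0, -20.0]
print(harden(xor_logits))
print(round(soft_lut2(1.0, 0.0, xor_logits), 3))
```

During training the soft table is differentiable in both the inputs and the logits. After training, `harden` yields a plain truth table that can be evaluated with Boolean operations only, which is the property the abstract associates with low energy use.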

中文标题/摘要

标题：GIC-DLC：面向硬件的可微逻辑电路灰度图像压缩

神经图像编解码器在压缩比方面超过了传统的手工设计方法，如PNG或JPEG-XL，但通常会带来显著的计算开销，限制了它们在智能手机、相机和无人机等能量受限设备上的部署。我们提出了灰度图像压缩的可微逻辑电路（GIC-DLC），这是一种硬件感知的编解码器，其中我们训练查找表以结合神经网络的灵活性和布尔操作的效率。在灰度基准数据集上的实验表明，GIC-DLC在压缩效率上优于传统编解码器，同时允许大幅减少能耗和延迟。这些结果表明，学习压缩可以是硬件友好的，为边缘设备上的低功耗图像压缩提供了有希望的方向。

Summary / 总结

The research aims to address the computational overhead of neural image codecs, which limits their deployment on energy-constrained devices. The authors propose GIC-DLC, which combines the flexibility of neural networks with the efficiency of Boolean operations by training lookup tables. Experiments show that GIC-DLC outperforms traditional codecs in compression efficiency and reduces energy consumption and latency.

研究旨在通过提出GIC-DLC，一种硬件感知的编解码器，来解决神经图像编解码器的计算开销问题，该编解码器结合了神经网络的灵活性和布尔操作的效率。方法包括训练查找表以实现更高的压缩比并减少能耗和延迟。实验表明，GIC-DLC在灰度基准数据集上的压缩效率和能耗方面优于传统编解码器，适用于边缘设备。

The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

Authors: Renmiao Chen, Yida Lu, Shiyao Cui, Xuan Ouyang, Victor Shea-Jay Huang, Shumin Zhang, Chengwei Pan, Han Qiu, Minlie Huang

First: 2026-01-20T16:24:18+00:00 · Latest: 2026-01-20T16:24:18+00:00

Comments: 15 pages, 5 figures. Introduces MIR-SafetyBench (2,676 instances; 9 multi-image relations). *Equal contribution; †Corresponding author. Code/data: https://github.com/thu-coai/MIR-SafetyBench

Abstract

As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at https://github.com/thu-coai/MIR-SafetyBench.
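
The attention-entropy signal mentioned above can be illustrated with the standard Shannon entropy of an attention distribution. The abstract does not say which layers, heads, or tokens the authors aggregate over, so the sketch below is a generic definition under that assumption, and the function names are hypothetical.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores over the attended tokens;
    they are renormalized here so they sum to 1.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_attention_entropy(rows):
    """Average entropy over several attention rows (e.g. heads or queries)."""
    return sum(attention_entropy(r) for r in rows) / len(rows)


uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]    # attention concentrated on one token
print(attention_entropy(uniform) > attention_entropy(peaked))
```

Lower entropy means attention is concentrated on fewer tokens, which is consistent with the abstract's reading that unsafe generations may over-focus on solving the task.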

中文标题/摘要

标题：聪明的副作用：多图像推理中的安全风险

随着多模态大型语言模型（MLLMs）获得更强的推理能力以处理复杂的多图像指令，这一进步可能会带来新的安全风险。我们通过引入MIR-SafetyBench，这是第一个专注于多图像推理安全性的基准测试，来研究这一问题，该基准测试包含2,676个实例，涵盖9种多图像关系的分类。我们在19个MLLM上的广泛评估揭示了一个令人担忧的趋势：具有更高级多图像推理能力的模型在MIR-SafetyBench上更容易受到攻击。除了攻击成功率外，我们发现许多被标记为安全的响应往往是肤浅的，通常是由误解或回避、非承诺的回答驱动的。我们还观察到，不安全的生成在平均上比安全的生成具有更低的注意力熵。这一内部特征表明模型可能过度专注于任务解决，而忽视了安全约束。我们的代码和数据可在https://github.com/thu-coai/MIR-SafetyBench获取。

Summary / 总结

This study investigates the safety risks that come with stronger multi-image reasoning in MLLMs by introducing MIR-SafetyBench, a benchmark of 2,676 instances spanning 9 multi-image relations. Evaluations of 19 MLLMs show that models with more advanced multi-image reasoning can be more vulnerable on the benchmark, and that many responses labeled safe are superficial, stemming from misunderstanding or evasive replies. Unsafe generations also show lower attention entropy on average than safe ones, suggesting that models may over-focus on task solving at the expense of safety constraints.

研究通过引入包含2,676个实例的MIR-SafetyBench基准，探讨了MLLMs在高级多图像推理能力方面的安全风险。对19个MLLMs的评估显示，更先进的推理模型更容易出现问题，许多被认为是安全的响应往往是表面的或回避性的。不安全的生成通常具有较低的注意力熵，这表明模型可能过度专注于任务解决而忽视了安全约束。

Adaptive Riemannian Graph Neural Networks

Authors: Xudong Wang, Chris Ding, Tongxin Li, Jicong Fan

Venue: AAAI

First: 2025-08-04T16:55:02+00:00 · Latest: 2026-01-20T16:23:35+00:00

Comments: Accepted in The Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), Main Technical Track

Abstract

Graph data often exhibits complex geometric heterogeneity, where structures with varying local curvature, such as tree-like hierarchies and dense communities, coexist within a single network. Existing geometric GNNs, which embed graphs into single fixed-curvature manifolds or discrete product spaces, struggle to capture this diversity. We introduce Adaptive Riemannian Graph Neural Networks (ARGNN), a novel framework that learns a continuous and anisotropic Riemannian metric tensor field over the graph. It allows each node to determine its optimal local geometry, enabling the model to fluidly adapt to the graph's structural landscape. Our core innovation is an efficient parameterization of the node-wise metric tensor, specializing to a learnable diagonal form that captures directional geometric information while maintaining computational tractability. To ensure geometric regularity and stable training, we integrate a Ricci flow-inspired regularization that smooths the learned manifold. Theoretically, we establish the rigorous geometric evolution convergence guarantee for ARGNN and provide a continuous generalization that unifies prior fixed or mixed-curvature GNNs. Empirically, our method demonstrates superior performance on both homophilic and heterophilic benchmark datasets with the ability to capture diverse structures adaptively. Moreover, the learned geometries both offer interpretable insights into the underlying graph structure and empirically corroborate our theoretical analysis.
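
To make the "learnable diagonal form" of the node-wise metric tensor concrete, the sketch below computes a local distance under a positive diagonal metric attached to one node. It assumes a softplus parameterization to keep the metric entries positive and a first-order (tangent-space) distance. Both choices, and all function names, are illustrative assumptions rather than the ARGNN formulation.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)), always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diagonal_metric(raw_params, eps=1e-6):
    """Map unconstrained per-dimension parameters to positive metric entries."""
    return [softplus(r) + eps for r in raw_params]


def local_distance(x_i, x_j, metric_i):
    """Distance from node i to node j under node i's diagonal metric.

    Computes sqrt(sum_d g_d * (x_i[d] - x_j[d])**2), i.e. a per-dimension
    rescaling of the Euclidean distance.
    """
    return math.sqrt(sum(g * (a - b) ** 2
                         for g, a, b in zip(metric_i, x_i, x_j)))


# With the identity metric this reduces to the Euclidean distance.
print(local_distance([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))
# An anisotropic metric stretches the first dimension only.
print(local_distance([0.0, 0.0], [1.0, 0.0], [4.0, 1.0]))
```

Because each node carries its own diagonal entries, the cost per node grows linearly with the embedding dimension, which matches the tractability argument in the abstract.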

中文标题/摘要

标题：自适应黎曼图神经网络

图数据通常表现出复杂的几何异质性，其中具有不同局部曲率的结构，如树状层次结构和密集社区，在同一网络中共存。现有的几何GNN将图嵌入到单一固定曲率流形或离散乘积空间中，难以捕捉这种多样性。我们提出了自适应黎曼图神经网络（ARGNN），这是一种新颖的框架，可以在图上学习连续且各向异性的黎曼度量张量场。它允许每个节点确定其最优局部几何结构，使模型能够灵活地适应图的结构景观。我们的核心创新是一种节点级度量张量的有效参数化，专门化为可学习的对角形式，以捕捉方向几何信息同时保持计算上的可处理性。为了确保几何连续性和稳定的训练，我们引入了一种基于Ricci流的正则化方法，以平滑学习到的流形。理论上，我们为ARGNN建立了严格的几何演化收敛保证，并提供了一种连续泛化，统一了先前的固定或混合曲率GNN。实验上，我们的方法在同质性和异质性基准数据集上均表现出优越性能，能够适应性地捕捉各种结构。此外，学习到的几何结构提供了对底层图结构的可解释洞察，并且实证上支持我们的理论分析。

Summary / 总结

ARGNN is designed to address the limitations of existing geometric GNNs in handling graphs with varying local curvature. It introduces a framework that learns a continuous and anisotropic Riemannian metric tensor field, allowing nodes to determine their optimal local geometry. The method uses an efficient parameterization of the metric tensor and integrates a Ricci flow-inspired regularization to ensure geometric regularity. Empirically, ARGNN outperforms existing methods on various benchmark datasets and provides interpretable insights into the graph structure.

ARGNN旨在解决现有几何GNN在处理具有不同局部曲率的图时的局限性。它学习一个连续且各向异性的黎曼度量张量场，使每个节点能够确定其最优局部几何结构，适应图的结构景观。该方法在基准数据集上表现出色，并提供了对图结构的可解释洞察。

Riemannian Liquid Spatio-Temporal Graph Network

Authors: Liangsi Lu, Jingchao Wang, Zhaorong Dai, Hanqian Liu, Yang Shi

Venue: The Web Conference 2026

First: 2026-01-20T16:09:05+00:00 · Latest: 2026-01-20T16:09:05+00:00

Comments: This paper has been accepted to The Web Conference 2026

Abstract

Liquid Time-Constant networks (LTCs), a type of continuous-time graph neural network, excel at modeling irregularly-sampled dynamics but are fundamentally confined to Euclidean space. This limitation introduces significant geometric distortion when representing real-world graphs with inherent non-Euclidean structures (e.g., hierarchies and cycles), degrading representation quality. To overcome this limitation, we introduce the Riemannian Liquid Spatio-Temporal Graph Network (RLSTG), a framework that unifies continuous-time liquid dynamics with the geometric inductive biases of Riemannian manifolds. RLSTG models graph evolution through an Ordinary Differential Equation (ODE) formulated directly on a curved manifold, enabling it to faithfully capture the intrinsic geometry of both structurally static and dynamic spatio-temporal graphs. Moreover, we provide rigorous theoretical guarantees for RLSTG, extending stability theorems of LTCs to the Riemannian domain and quantifying its expressive power via state trajectory analysis. Extensive experiments on real-world benchmarks demonstrate that, by combining advanced temporal dynamics with a Riemannian spatial representation, RLSTG achieves superior performance on graphs with complex structures. Project Page: https://rlstg.github.io

中文标题/摘要

标题：黎曼流时空图网络

液态时间常数网络（LTCs），一种连续时间图神经网络，擅长建模不规则采样的动力学，但本质上局限于欧几里得空间。这一限制在表示具有固有非欧几里得结构（例如层次结构和循环）的真实世界图时引入了显著的几何失真，降低了表示质量。为克服这一限制，我们引入了黎曼流时空图网络（RLSTG），这是一种将连续时间液态动力学与黎曼流形的几何归纳偏置统一起来的框架。RLSTG 通过直接在弯曲流形上形成的常微分方程（ODE）建模图的演变，使其能够忠实捕捉结构静态和动态时空图的内在几何结构。此外，我们为 RLSTG 提供了严格的理论保证，将 LTCs 的稳定性定理扩展到黎曼域，并通过状态轨迹分析量化其表达能力。在现实世界的基准测试上的广泛实验表明，通过结合先进的时序动力学和黎曼空间表示，RLSTG 在复杂结构的图上实现了更好的性能。项目页面：https://rlstg.github.io

Summary / 总结

The research aims to improve representation quality for graphs with non-Euclidean structures by removing the geometric distortion that Liquid Time-Constant networks (LTCs) incur in Euclidean space. It introduces the Riemannian Liquid Spatio-Temporal Graph Network (RLSTG), which models graph evolution with an Ordinary Differential Equation (ODE) formulated directly on a curved manifold, and extends LTC stability theorems to the Riemannian domain. Experiments on real-world benchmarks show that RLSTG achieves superior performance on graphs with complex structures.

研究旨在通过引入Riemannian Liquid Spatio-Temporal Graph Network (RLSTG)解决使用Liquid Time-Constant网络（LTCs）建模真实世界图时的几何失真问题。RLSTG通过在曲面上使用常微分方程（ODE）来建模图的演变，使其能够捕捉静态和动态图的内在几何结构。实验表明，通过结合先进的时序动态和Riemannian空间表示，RLSTG在复杂结构的图上表现优于LTCs。

Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns

Authors: George Mihaila

First: 2026-01-20T16:06:34+00:00 · Latest: 2026-01-20T16:06:34+00:00

Abstract

Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformer self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015; Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a; Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.

中文标题/摘要

标题：学习解释：基于变压器注意力模式的监督词元归因

可解释人工智能（XAI）在医疗保健、法律系统和金融服务等高风险应用中变得至关重要，因为不透明性阻碍了信任和问责。变压器的自注意力机制已被证明对模型可解释性有价值，注意力权重成功地用于理解模型的关注点和行为（Xu等，2015）；（Wiegreffe和Pinter，2019）。然而，现有的基于注意力的解释方法依赖于手动定义的聚合策略和固定的归因规则（Abnar和Zuidema，2020a）；（Chefer等，2021），而模型无偏的方法（LIME，SHAP）将模型视为黑盒，并通过输入扰动产生显著的计算成本。我们引入了解释网络（ExpNet），这是一种轻量级的神经网络，它学习从变压器注意力模式到词元级重要性评分的显式映射。与先前的方法不同，ExpNet 自动发现最佳的注意力特征组合，而不是依赖于预定义的规则。我们在一个具有挑战性的跨任务设置中评估了ExpNet，并将其与涵盖四种方法学家族的模型无偏方法和基于注意力的技术进行了基准测试。

Summary / 总结

This paper addresses the need for explainable AI in high-stakes applications by introducing Explanation Network (ExpNet), a lightweight neural network that learns to map transformer attention patterns to token-level importance scores. Unlike existing attention-based methods that rely on manually defined aggregation strategies, or model-agnostic methods that treat the model as a black box, ExpNet automatically discovers optimal attention feature combinations. The study evaluates ExpNet in a cross-task setting and benchmarks it against model-agnostic and attention-based techniques spanning four methodological families.

本文针对高风险应用中对可解释AI的需求，提出了Explanation Network (ExpNet)，这是一种轻量级神经网络，能够将变压器的注意力模式映射到标记级别的重要性评分。与依赖手动聚合策略或视模型为黑盒的方法不同，ExpNet能够自动发现最优的注意力特征组合。研究在跨任务设置中评估了ExpNet，并将其与各种模型无关和注意力基方法进行了比较，展示了其在提供可解释性解释方面的有效性。

PMCE: Probabilistic Multi-Granularity Semantics with Caption-Guided Enhancement for Few-Shot Learning

Authors: Jiaying Wu, Can Gao, Jinglu Hu, Hui Li, Xiaofeng Cao, Jingcai Guo

First: 2026-01-20T16:06:23+00:00 · Latest: 2026-01-20T16:06:23+00:00

Abstract

Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at https://anonymous.4open.science/r/PMCE-275D
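
The abstract describes fusing a category-specific prior with support set prototypes "via a simple MAP update" without giving the formula. One common form, assumed here, is the conjugate Gaussian update in which the prior mean acts like `kappa` pseudo-samples. The softmax weighting of retrieved base classes, the `kappa` and `temperature` parameters, and the function names are all illustrative assumptions.

```python
import math


def aggregate_prior(base_means, similarities, temperature=1.0):
    """Similarity-weighted average of retrieved base-class means.

    Weights are a softmax over class-name similarities (an assumption).
    """
    m = max(similarities)
    exps = [math.exp((s - m) / temperature) for s in similarities]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(base_means[0])
    return [sum(w * mu[d] for w, mu in zip(weights, base_means))
            for d in range(dim)]


def map_prototype(support_features, prior_mean, kappa=1.0):
    """MAP estimate of a class mean under a Gaussian prior.

    posterior = (kappa * prior_mean + n * sample_mean) / (kappa + n)
    """
    n = len(support_features)
    dim = len(prior_mean)
    sample_mean = [sum(f[d] for f in support_features) / n
                   for d in range(dim)]
    return [(kappa * prior_mean[d] + n * sample_mean[d]) / (kappa + n)
            for d in range(dim)]


# Two retrieved base classes with equal similarity, one support sample.
prior = aggregate_prior([[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0])
proto = map_prototype([[3.0, 1.0]], prior, kappa=1.0)
print(prior, proto)
```

With a single support sample the prior contributes as much as the data when `kappa = 1`. As the number of shots grows, the estimate moves toward the plain support mean, which is the behavior one would want from a prior meant to correct prototypes estimated from scarce data.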

Summary / 总结

PMCE is a probabilistic few-shot learning framework that combines multi-granularity semantics with caption-guided enhancement, improving both support prototypes and query features. It builds a nonparametric knowledge bank of visual statistics and CLIP-encoded class name embeddings for the base classes, retrieves the most relevant base classes for each novel category by class-name similarity, and fuses the aggregated statistics with support set prototypes via a MAP update. A lightweight enhancer, fed by captions from a frozen BLIP captioner, refines prototypes and query features with a consistency regularization. Experiments on four benchmarks show consistent improvements over strong baselines, with up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting.

PMCE 通过结合多粒度语义和基于描述的增强来解决少样本学习的限制。它构建了一个非参数化的知识库，使用视觉统计和CLIP编码的类名嵌入。在测试时，它检索相关的基类并将它们的统计信息与支持集原型融合。此外，一个冻结的BLIP描述器和一个基于基类训练的轻量级增强器在诱导协议下优化支持和查询特征，从而提高性能。PMCE 在 MiniImageNet 的 1 射设置中相对于最强的语义竞争对手实现了 7.71% 的绝对收益。

Diffusion-Guided Backdoor Attacks in Real-World Reinforcement Learning

Authors: Tairan Huang, Qingqing Ye, Yulin Jin, Jiawei Lian, Yi Wang, Haibo Hu

First: 2026-01-20T16:03:51+00:00 · Latest: 2026-01-20T16:03:51+00:00

Abstract

Backdoor attacks embed hidden malicious behaviors in reinforcement learning (RL) policies and activate them using triggers at test time. Most existing attacks are validated only in simulation, while their effectiveness in real-world robotic systems remains unclear. In physical deployment, safety-constrained control pipelines such as velocity limiting, action smoothing, and collision avoidance suppress abnormal actions, causing strong attenuation of conventional backdoor attacks. We study this previously overlooked problem and propose a diffusion-guided backdoor attack framework (DGBA) for real-world RL. We design small printable visual patch triggers placed on the floor and generate them using a conditional diffusion model that produces diverse patch appearances under real-world visual variations. We treat the robot control stack as a black-box system. We further introduce an advantage-based poisoning strategy that injects triggers only at decision-critical training states. We evaluate our method on a TurtleBot3 mobile robot and demonstrate reliable activation of targeted attacks while preserving normal task performance. Demo videos and code are available in the supplementary material.

中文标题/摘要

标题：面向现实世界的强化学习的扩散引导后门攻击

后门攻击将隐藏的恶意行为嵌入强化学习(RL)策略中，并在测试时通过触发器激活。大多数现有的攻击仅在仿真中得到验证，而它们在现实世界机器人系统中的有效性尚不清楚。在实际部署中，如速度限制、动作平滑和碰撞避免等安全约束控制管道会抑制异常动作，导致传统后门攻击的强烈衰减。我们研究了这一之前被忽视的问题，并提出了一种面向现实世界的RL的扩散引导后门攻击框架(DGBA)。我们设计了小型可打印的视觉贴片触发器放置在地板上，并使用条件扩散模型生成在现实世界视觉变化下具有多样外观的贴片。我们将机器人控制堆栈视为黑盒系统。我们进一步引入了一种基于优势的污染策略，在决策关键的训练状态下仅注入触发器。我们在TurtleBot3移动机器人上评估了该方法，并展示了在保持正常任务性能的同时可靠激活目标攻击。演示视频和代码可在补充材料中获取。

Summary / 总结

The study addresses the effectiveness of backdoor attacks in real-world reinforcement learning (RL) systems, where conventional attacks are often suppressed by safety mechanisms. It introduces a diffusion-guided backdoor attack framework (DGBA) that uses small printable visual patches as triggers. The method generates diverse patch appearances using a conditional diffusion model and injects triggers only in critical training states. Evaluation on a TurtleBot3 robot shows that the attack can reliably activate malicious behaviors without degrading normal task performance.

该论文探讨了在实际部署的强化学习（RL）系统中，传统后门攻击由于安全限制往往无效的问题。作者提出了一种基于扩散的后门攻击框架（DGBA），使用放置在地面上的小型视觉触发器，并通过条件扩散模型生成适应现实世界视觉变化的触发器。通过将机器人控制堆栈视为黑盒系统，并仅在关键训练状态下注入触发器，该方法确保了攻击的有效激活而不影响正常任务性能。在TurtleBot3移动机器人上的评估证实了所提出方法的有效性。