arXiv Paper Digest

SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang

First: 2025-12-31T18:59:57+00:00 · Latest: 2025-12-31T18:59:57+00:00

Comments: Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot

Abstract

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot

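Title / Abstract (translated from Chinese)

Title: SpaceTimePilot: Generative Rendering of Dynamic Scenes with Space-Time Disentanglement

We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently change the camera viewpoint and the motion sequence during generation, re-rendering the scene for continuous and arbitrary exploration across space and time. To this end, we introduce an effective animation time-embedding mechanism into the diffusion process, which allows explicit control of the output video's motion sequence relative to the source video. Because no existing dataset provides paired videos of the same dynamic scene with continuous temporal variation, we propose a simple yet effective temporal-warping training scheme that uses existing multi-view datasets to simulate temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further improve the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows the camera to be changed from the first frame, and CamxTime, the first synthetic space-time full-coverage rendering dataset, which provides fully free space-time video trajectories within a scene. Joint training with the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on real-world and synthetic data, showing clear space-time disentanglement and strong results compared with prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot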

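Summary

SpaceTimePilot is a video diffusion model that separates space and time for controllable generative rendering. Given a monocular video, it can independently modify the camera viewpoint and motion sequence, enabling continuous exploration across space and time. The model uses an animation time-embedding mechanism in the diffusion process to control the output video's motion sequence. A temporal-warping training scheme and the CamxTime dataset are used to achieve robust space-time disentanglement and precise dual control, with strong results relative to prior work on both real-world and synthetic data.

SpaceTimePilot is a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, it can independently change the camera viewpoint and the motion sequence, enabling continuous exploration across space and time. The model uses an animation time-embedding mechanism and a temporal-warping training scheme to achieve robust space-time disentanglement. In addition, an improved camera-conditioning mechanism and the CamxTime dataset further strengthen temporal control. Experiments show clear space-time disentanglement and strong performance compared with prior methods.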
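The temporal-warping idea in the SpaceTimePilot abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a warp is a resampling of the source clip's frame indices under a few simple time maps (reverse, slow motion, freeze). The function `warp_indices` and its mode names are hypothetical and are not taken from the paper or its code release.

```python
def warp_indices(num_frames, mode, factor=2, freeze_at=0):
    """Return, for each output frame, the source frame index to display.

    Hypothetical helper: the warp modes below are illustrative choices,
    not the set used by SpaceTimePilot.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    last = num_frames - 1
    if mode == "identity":
        return list(range(num_frames))
    if mode == "reverse":
        # Play the source motion backwards.
        return [last - t for t in range(num_frames)]
    if mode == "slow":
        # Hold each source frame for `factor` output frames.
        return [min(t // factor, last) for t in range(num_frames)]
    if mode == "freeze":
        # Motion stops at one source frame for the whole output clip.
        return [min(max(freeze_at, 0), last)] * num_frames
    raise ValueError(f"unknown mode: {mode}")
```

Pairing a clip from one camera with a warped index sequence applied to another camera of the same multi-view capture would give a source/target pair that differs in both viewpoint and timing, which is the kind of supervision the abstract describes.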

GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

Authors: Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu

First: 2025-12-31T18:59:55+00:00 · Latest: 2025-12-31T18:59:55+00:00

Comments: Project page: https://yichuanh.github.io/GaMO/

Abstract

Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/

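Title / Abstract (translated from Chinese)

Title: GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

In recent years, 3D reconstruction has made remarkable progress in capturing high-quality scenes from dense multi-view imagery, but it struggles when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been applied to address this challenge. The latest diffusion-based methods generate novel views to augment the training data, yielding substantial improvements that surpass earlier regularization and prior-based methods. Despite this progress, we find three key limitations in these state-of-the-art methods: insufficient coverage beyond the periphery of known views, geometric inconsistency across generated views, and computationally expensive pipelines. We propose GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Rather than generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency and provides broader scene coverage. Our method applies multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner, with no training required. Extensive experiments on Replica and ScanNet++ show that GaMO achieves state-of-the-art reconstruction quality with 3, 6, and 9 input views, outperforming prior methods in both PSNR and LPIPS, while running 25 times faster than state-of-the-art diffusion-based methods, with a processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/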

Summary

The research aims to address the limitations of sparse-view 3D reconstruction by introducing GaMO, a framework that expands the field of view from existing camera poses. It uses multi-view conditioning and geometry-aware denoising strategies to generate broader scene coverage while preserving geometric consistency. Experiments show that GaMO outperforms previous methods in PSNR and LPIPS across 3, 6, and 9 input views and achieves a 25-fold speedup over state-of-the-art diffusion-based methods, with processing times under 10 minutes.

The research aims to address the limitations of sparse-view 3D reconstruction by proposing GaMO, which expands the field of view from existing camera poses to improve scene coverage and geometric consistency. The method requires no training and uses multi-view conditioning and geometry-aware denoising, achieving better reconstruction quality than existing methods and substantially faster processing.
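To see why outpainting from an existing pose widens coverage without touching the pose itself, here is a small sketch under an ideal pinhole-camera assumption. Enlarging the image canvas while keeping the focal length fixed only increases the field of view and shifts the principal point by the padding. The helper names are hypothetical and do not come from the GaMO code.

```python
import math


def horizontal_fov_deg(width_px, focal_px):
    """Horizontal field of view of an ideal pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(width_px / (2.0 * focal_px)))


def outpainted_intrinsics(width_px, height_px, focal_px, cx, cy, scale):
    """Intrinsics of a canvas enlarged by `scale` around the original image.

    Assumption: the camera pose and focal length stay fixed, and the
    original image sits centred inside the larger canvas.
    """
    if scale < 1.0:
        raise ValueError("scale must be >= 1 for outpainting")
    new_w = round(width_px * scale)
    new_h = round(height_px * scale)
    pad_x = (new_w - width_px) / 2.0
    pad_y = (new_h - height_px) / 2.0
    # Same focal length, principal point shifted by the padding.
    return new_w, new_h, focal_px, cx + pad_x, cy + pad_y
```

For a 640-pixel-wide image with a 500-pixel focal length, a 1.5x canvas raises the horizontal field of view from about 65 degrees to about 88 degrees while the extrinsics stay exactly as they were.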

Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

Authors: Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang

First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00

Comments: Project page: https://edit3r.github.io/edit3r/

Abstract

We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

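Title / Abstract (translated from Chinese)

Title: Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods that require per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast, photorealistic rendering without optimization or pose estimation. The key challenge in training such a model is the lack of multi-view-consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that produces reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference time, our model effectively handles images edited by 2D methods such as InstructPix2Pix, even though it never saw such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split that contains 20 diverse scenes, 4 edit types, and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r outperforms recent baselines in semantic alignment and 3D consistency while running at significantly higher inference speed, making it promising for real-time 3D editing applications.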

Summary

Edit3r is a feed-forward framework that reconstructs and edits 3D scenes from unposed, view-inconsistent images in a single pass. It directly predicts instruction-aligned 3D edits without requiring per-scene optimization or pose estimation, enabling fast and photorealistic rendering. Key to training this model is addressing the lack of multi-view consistent edited images through a SAM2-based recoloring strategy and an asymmetric input strategy. Experimental results demonstrate that Edit3r achieves better semantic alignment and 3D consistency compared to recent baselines, while operating at higher inference speed, making it promising for real-time 3D editing applications.

Edit3r is a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent images, without optimization or pose estimation. It directly predicts instruction-aligned 3D edits, avoiding per-scene optimization. The key ingredients are SAM2-based recoloring, which produces cross-view-consistent supervision, and an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views. Edit3r effectively handles 2D edits such as those from InstructPix2Pix and outperforms recent baselines in semantic alignment and 3D consistency, with faster inference that suits real-time 3D editing applications.

Scaling Open-Ended Reasoning to Predict the Future

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

First: 2025-12-31T18:59:51+00:00 · Latest: 2025-12-31T18:59:51+00:00

Comments: 45 pages

Abstract

High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

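Title / Abstract (translated from Chinese)

Title: Scaling Open-Ended Reasoning to Predict the Future

High-stakes decision making involves reasoning about an uncertain future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up the training data, we synthesize new forecasting questions from global events reported in daily news, using a fully automated, carefully designed curation recipe. We train Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus for both data generation and retrieval. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we have our final forecasting system, we run held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, is on par with much larger proprietary models, and our training improves the accuracy, calibration, and consistency of its predictions. We find that the calibration improvements from forecasting training generalize across popular benchmarks. We open-source all models, code, and data to make research on language model forecasting broadly accessible.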

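Summary

This work aims to enhance language models' ability to reason about open-ended forecasting questions for high-stakes decision-making. The authors synthesize forecasting questions from daily news and train Qwen3 thinking models on a dataset called OpenForesight. They use an offline news corpus for data generation and retrieval to avoid future information leakage. The resulting model, OpenForecaster 8B, matches much larger proprietary models, and the training improves the accuracy, calibration, and consistency of its predictions. The calibration improvements generalize across popular benchmarks.

This study aims to strengthen the open-ended forecasting ability of language models by synthesizing questions from daily news. Qwen3 models are trained on the OpenForesight dataset, using an offline news corpus to avoid leakage of future information. The study shows that retrieval and an improved reward function for reinforcement learning both improve performance. OpenForecaster 8B shows gains in accuracy, calibration, and consistency and is on par with much larger proprietary models. The calibration improvements generalize across multiple benchmarks, and all resources are open-sourced to make this research more broadly accessible.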
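Since the abstract reports calibration alongside accuracy, a short sketch of two standard calibration measures may help: the Brier score and the expected calibration error (ECE). It assumes each forecast comes with a stated confidence in [0, 1] and a flag saying whether it turned out correct. How the paper actually scores open-ended answers is not specified here, so this is only a generic illustration.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if not confidences:
        raise ValueError("need at least one forecast")
    pairs = zip(confidences, correct)
    return sum((p - float(c)) ** 2 for p, c in pairs) / len(confidences)


def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if not confidences:
        raise ValueError("need at least one forecast")
    bins = [[] for _ in range(num_bins)]
    for p, c in zip(confidences, correct):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, float(c)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece
```

A forecaster who says 90% on two questions and gets one right has an ECE of about 0.4, while one who says 50% on the same two questions is perfectly calibrated on them, even though both have the same accuracy.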

FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Authors: Dian Shao, Mingfei Shi, Like Liu

Venue: AAAI 2026

First: 2025-12-31T18:59:12+00:00 · Latest: 2025-12-31T18:59:12+00:00

Comments: Accepted by AAAI 2026

Abstract

Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.

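Title / Abstract (translated from Chinese)

Title: FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a major challenge, especially in real-world scenarios where online pose estimation often leaves large amounts of data missing. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structure, losing the subtle motion cues that are crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for fine-grained action recognition under temporal corruption. FineTec first restores a base skeleton sequence from the corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups according to motion variance, and generates two augmented skeleton sequences through targeted perturbation. These sequences, together with the base sequence, are processed by a physics-driven estimation module that uses Lagrangian dynamics to estimate joint accelerations. Finally, the fused skeleton position sequence and the fused acceleration sequence are fed jointly into a GCN-based action recognition head. Extensive experiments on coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods at various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets are available at https://smartdianlab.github.io/projects-FineTec/.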

Summary

FineTec is a unified framework for fine-grained action recognition under temporal corruption. It first restores a base skeleton sequence using context-aware completion and temporal masking, then decomposes the skeleton into dynamic and static subgroups and generates augmented sequences through targeted perturbation. A physics-driven estimation module estimates joint accelerations using Lagrangian dynamics, and the fused position and acceleration sequences are fed into a GCN-based action recognition head. Experiments show that FineTec outperforms previous methods, achieving top-1 accuracies of 89.1% and 78.1% on Gym99-severe and Gym288-severe settings, respectively.

FineTec is a unified framework for fine-grained action recognition under temporal corruption. It first restores a base skeleton sequence with context-aware completion, then decomposes the skeleton into dynamic and static subgroups and generates augmented sequences through targeted perturbation. A physics-driven estimation module estimates joint accelerations, and the fused sequences are fed into a GCN-based action recognition head. Experiments show that FineTec achieves top-1 accuracies of 89.1% and 78.1% on the Gym99-severe and Gym288-severe settings, respectively, outperforming previous methods.
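Two steps of the FineTec pipeline lend themselves to a small illustration: splitting joints into dynamic and static groups by motion variance, and attaching an acceleration signal to each joint. The sketch below is a simplification under stated assumptions. The threshold is arbitrary, and the acceleration is a plain second-order finite difference, which is a stand-in for, not a reproduction of, the Lagrangian-dynamics estimator the abstract describes. All function names are hypothetical.

```python
from statistics import pvariance


def motion_variance(track):
    """Sum of per-axis variances of one joint's positions over time."""
    return sum(pvariance(axis) for axis in zip(*track))


def split_dynamic_static(joint_tracks, threshold):
    """Partition joint names by whether their motion variance exceeds a threshold."""
    dynamic, static = [], []
    for name, track in joint_tracks.items():
        group = dynamic if motion_variance(track) > threshold else static
        group.append(name)
    return dynamic, static


def finite_difference_acceleration(track):
    """Central second difference x[t+1] - 2*x[t] + x[t-1] per axis (unit time step).

    Stand-in only: the paper estimates accelerations with Lagrangian dynamics.
    """
    return [
        tuple(nxt - 2 * cur + prv for prv, cur, nxt in zip(p, c, n))
        for p, c, n in zip(track, track[1:], track[2:])
    ]
```

With a hand joint moving along x as 0, 1, 4, 9 and a motionless hip, the split puts the hand in the dynamic group, and the finite difference recovers the hand's constant acceleration of 2 units per frame squared.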

From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Authors: Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu

First: 2025-12-31T18:58:30+00:00 · Latest: 2025-12-31T18:58:30+00:00

Comments: Project Page https://hjrphoebus.github.io/X-Dub

Abstract

Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

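Title / Abstract (translated from Chinese)

Title: From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but it is fundamentally limited by the lack of ideal training data: paired videos in which only the lip movements differ while all other visual conditions are identical. Existing methods work around this with a mask-based inpainting paradigm, in which incomplete visual conditioning forces the model to generate the missing content and synchronize the lips at the same time, leading to visual artifacts, identity drift, and poor synchronization. In this paper, we propose a novel self-bootstrapping framework that recasts visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our method uses a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained end-to-end on these pairs, using the complete, aligned input video frames to focus solely on precise, audio-driven lip modification. This complete, frame-aligned input conditioning gives the editor a rich visual context, including complete identity cues, scene interactions, and continuous spatiotemporal dynamics. With this rich context, our method achieves highly accurate lip sync, faithful identity preservation, and strong robustness in challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component for disentangling conflicting editing objectives across diffusion timesteps, which stabilizes training and improves lip synchronization and visual fidelity. In addition, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.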

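Summary

This work addresses the challenge of audio-driven visual dubbing by proposing a self-bootstrapping framework that transforms the task from an ill-posed inpainting problem into a well-conditioned video-to-video editing problem. The framework uses a Diffusion Transformer to generate ideal training data, which are then used to train an audio-driven editor. This approach enables precise lip synchronization, faithful identity preservation, and robust performance in challenging scenarios. The method also introduces a timestep-adaptive multi-phase learning strategy to improve training stability and visual fidelity.

This paper proposes a self-bootstrapping framework that turns audio-driven visual dubbing from an inpainting problem into a video-to-video editing problem. The framework uses a Diffusion Transformer to generate ideal training data, altering the lip movements in real videos to create aligned video pairs. An audio-driven editor trained on these pairs then focuses on precise lip modification. This approach provides rich visual context, resulting in accurate lip sync, faithful identity preservation, and robust performance in challenging scenarios. A timestep-adaptive multi-phase learning strategy is also proposed to stabilize training and improve lip sync and visual fidelity, along with a new benchmark dataset, ContextDubBench, for robust evaluation across a variety of practical application scenarios.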

Many Minds from One Model: Bayesian Transformers for Population Intelligence

Authors: Diji Yang, Yi Zhang

First: 2025-12-31T18:56:02+00:00 · Latest: 2025-12-31T18:56:02+00:00

Abstract

Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.

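Title / Abstract (translated from Chinese)

Title: Many Minds from One Model: Bayesian Transformers for Population Intelligence

Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Inspired by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which turn a standard large language model into a Bayesian Transformer model that supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, which induces a distribution over model behavior without training a full Bayesian neural network. Sampling from this proxy yields a set of model instances that behave differently while retaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, ensuring temporal consistency across tokens. B-Trans enables population-level decision-making, in which aggregating predictions across sampled individuals significantly enhances exploration. In experiments on zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels, B-Trans effectively leverages the wisdom of crowds, delivering better semantic diversity and better task performance than deterministic baselines.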

Summary

The paper proposes Population Bayesian Transformers (B-Trans) to address the limitation of modern transformers being single-minded systems. B-Trans transforms a standard Large Language Model into a Bayesian model, allowing for sampling diverse yet coherent model instances from a single set of pre-trained weights. The method introduces a Bayesian-motivated posterior proxy by treating normalization layer offsets as stochastic variables, which enables sampling model instances with diverse behaviors while maintaining general competence. Experiments show that B-Trans enhances exploration and task performance, particularly in zero-shot generation and reinforcement learning tasks, by leveraging the wisdom of crowds and achieving better semantic diversity compared to deterministic baselines.

The paper proposes Population Bayesian Transformers (B-Trans) to address the limitation that modern transformers are single-minded systems. B-Trans converts a standard large language model into a Bayesian model, allowing diverse yet coherent model instances to be sampled from a single set of pre-trained weights. By treating normalization-layer offsets as stochastic variables, the method introduces a Bayesian-motivated posterior proxy that induces a distribution over model behavior without the cost of training a full Bayesian neural network. Experiments show that B-Trans improves semantic diversity through population-level decision-making and outperforms deterministic baselines in zero-shot generation, Reinforcement Learning with Verifiable Rewards, and RL without explicit labels.
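The core mechanism, a Gaussian over normalization-layer offsets that is sampled once per sequence and then held fixed, can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: a single layer-norm with per-dimension mean `mu` and standard deviation `sigma` for its bias, no attention or MLP, and hypothetical function names. It does not reproduce how B-Trans fits the variational parameters.

```python
import math
import random


def layer_norm(x, gain, bias, eps=1e-5):
    """Standard layer normalization of one token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [g * (v - mean) / scale + b for v, g, b in zip(x, gain, bias)]


def sample_bias(mu, sigma, rng):
    """Draw one bias vector from an independent Gaussian per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def run_sequence(tokens, gain, mu, sigma, seed):
    """Sample the bias once, then reuse it for every token in the sequence."""
    rng = random.Random(seed)
    bias = sample_bias(mu, sigma, rng)  # frozen at the sequence level
    return [layer_norm(token, gain, bias) for token in tokens], bias
```

Each seed plays the role of one sampled "individual": calling `run_sequence` with several seeds on the same input gives a small population whose outputs can then be aggregated, while any single call stays self-consistent across token positions.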

Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings

Authors: Tianzhi He, Farrokh Jazizadeh

First: 2025-12-31T18:51:19+00:00 · Latest: 2025-12-31T18:51:19+00:00

Abstract

This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.

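Title / Abstract (translated from Chinese)

Title: Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings

This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents that facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction). Together they form a closed feedback loop that captures, analyzes, and interprets energy data in order to respond intelligently to user queries and manage connected appliances. By drawing on the autonomous data analytics capabilities of LLMs, the BEMS AI agent aims to provide context-aware insights into energy consumption, cost prediction, and device scheduling, addressing limitations of existing energy management systems. The prototype was evaluated with 120 user queries across four distinct real-world residential energy datasets, using evaluation metrics that include latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated with ANOVA tests. The results showed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), whereas the more complex cost estimation tasks, at 49% accuracy, pointed to areas for improvement. This benchmarking study is a step toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, and it highlights the trade-off between response accuracy and computational efficiency.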

Summary

This study introduces a conceptual framework and prototype for LLM-based BEMS AI agents that facilitate context-aware energy management through natural language interaction. The framework consists of perception, central control, and action modules, forming a closed loop to analyze and interpret energy data. Performance was evaluated using 120 user queries across four residential energy datasets, with results showing response accuracy of 86% in device control, 97% in memory-related tasks, 74% in scheduling and automation, and 77% in energy analysis. Cost estimation tasks showed lower accuracy at 49%. The study highlights the potential of LLMs in energy management while identifying areas for improvement.

The study proposes a conceptual framework and prototype for LLM-based BEMS AI agents that enhance context-aware energy management in smart buildings through natural language interaction. The framework comprises perception, central control, and action modules, forming a closed loop that analyzes energy data and responds to user queries. The evaluation used 120 user queries and four real-world residential datasets; the agent performed well in device control (86%), memory tasks (97%), and energy analysis (77%), but accuracy on cost estimation tasks was lower (49%).

Generative Classifiers Avoid Shortcut Solutions

Authors: Alexander C. Li, Ananya Kumar, Deepak Pathak

Venue: ICLR 2025

First: 2025-12-31T18:31:46+00:00 · Latest: 2025-12-31T18:31:46+00:00

Comments: ICLR 2025. Code: https://github.com/alexlioralexli/generative-classifiers

Abstract

Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.

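Title / Abstract (translated from Chinese)

Title: Generative Classifiers Avoid Shortcut Solutions

Discriminative approaches to classification often learn shortcuts that hold only in-distribution and fail under even minor distribution shift. This failure mode stems from overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this problem by modeling all features rather than mainly the spurious ones. Generative classifiers are simple to train and need no specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution-shift benchmarks and reduce the impact of spurious correlations in realistic applications such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers and the data properties that determine when generative classifiers outperform discriminative ones.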

Summary

The paper investigates the issue of shortcut solutions in discriminative classifiers, which can fail under distribution shifts. It proposes using generative classifiers that model all features, both core and spurious, to avoid this problem. Experiments on image and text benchmarks show that generative classifiers, particularly those based on diffusion and autoregressive models, outperform discriminative classifiers and reduce the impact of spurious correlations. Additionally, the study analyzes a Gaussian toy setting to understand the conditions under which generative classifiers excel over discriminative ones.

The paper examines the problem that discriminative classifiers learn shortcut solutions that fail under distribution shift. It proposes using generative classifiers, which model all features, to avoid this problem. Experiments show that generative classifiers based on diffusion and autoregressive models outperform discriminative classifiers across a range of benchmarks and reduce the impact of spurious correlations on realistic datasets such as medical and satellite imagery.
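The decision rule behind a generative classifier is short enough to show directly: fit a model of p(x | y) per class, then predict the class that maximizes log p(x | y) + log p(y). The sketch below uses diagonal Gaussians as the class-conditional model, in the spirit of the abstract's Gaussian toy setting. The paper's own classifiers use diffusion and autoregressive models instead, and the function names and the small variance floor here are illustrative assumptions.

```python
import math
from collections import defaultdict


def fit(features, labels):
    """Fit a diagonal Gaussian and a class prior for each label."""
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        dims = list(zip(*rows))
        means = [sum(d) / len(d) for d in dims]
        # Small floor keeps the variance strictly positive (an assumption).
        variances = [
            sum((v - m) ** 2 for v in d) / len(d) + 1e-6
            for d, m in zip(dims, means)
        ]
        model[y] = (math.log(len(rows) / len(labels)), means, variances)
    return model


def log_joint(x, params):
    """log p(y) + log p(x | y) under the fitted diagonal Gaussian."""
    log_prior, means, variances = params
    log_likelihood = sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )
    return log_prior + log_likelihood


def predict(model, x):
    """Return the label with the highest joint log-probability."""
    return max(model, key=lambda y: log_joint(x, model[y]))
```

Because every feature dimension contributes to the likelihood, a single dimension that happens to separate the training labels cannot dominate the decision on its own, which is the intuition the abstract gives for robustness to spurious features.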

Plan Verification for LLM-Based Embodied Task Completion Agents

Authors: Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur

First: 2025-09-02T19:06:56+00:00 · Latest: 2025-12-31T18:31:30+00:00

Abstract

Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.

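Title / Abstract (translated from Chinese)

Title: Plan Verification for LLM-Based Embodied Task Completion Agents

Large language model (LLM) based task plans and the corresponding human demonstrations for embodied AI can be noisy, containing unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, producing progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, which lets it generalize broadly across error types, including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences needing at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than eliminating them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we offer a scalable path to higher-quality training data for imitation learning in embodied AI.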

Summary

The research addresses the issue of noisy task plans generated by large language models (LLMs) for embodied AI, which can include unnecessary actions, redundancies, and logical errors. It introduces an iterative verification framework where a Judge LLM critiques action sequences and a Planner LLM refines them, resulting in cleaner and more spatially coherent trajectories. The method, which uses natural language prompting, achieves high recall and precision across four state-of-the-art LLMs and converges quickly, improving both temporal efficiency and spatial action organization without collapsing human error-recovery patterns.

The research addresses noise in the task plans that large language models (LLMs) generate for embodied AI, such as unnecessary actions and logical errors. It proposes an iterative verification framework in which a Judge LLM and a Planner LLM refine action sequences, producing cleaner and more spatially coherent trajectories. The method uses natural language prompting and shows high precision and recall on the TEACh dataset, with fast convergence and improved temporal efficiency and spatial action organization.
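The Judge/Planner loop described above has a simple control flow, sketched here with plain callables standing in for the two LLMs. Everything below is hypothetical scaffolding: `toy_judge` and `toy_planner` are rule-based stand-ins so the loop can run without any model, whereas the paper's framework drives both roles with natural language prompts. The `precision_recall` helper shows how flagged actions could be compared against manual annotations.

```python
def refine_plan(plan, judge, planner, max_iters=3):
    """Alternate critique and revision until the judge raises no issues.

    Returns the final plan and the number of revision rounds applied.
    """
    for iteration in range(max_iters):
        critiques = judge(plan)
        if not critiques:
            return plan, iteration
        plan = planner(plan, critiques)
    return plan, max_iters


def precision_recall(flagged, truly_bad):
    """Compare flagged action indices with manually annotated bad ones."""
    flagged, truly_bad = set(flagged), set(truly_bad)
    hits = len(flagged & truly_bad)
    precision = hits / len(flagged) if flagged else 1.0
    recall = hits / len(truly_bad) if truly_bad else 1.0
    return precision, recall


def toy_judge(plan):
    """Stand-in judge: flag any action that repeats the one before it."""
    return [i for i in range(1, len(plan)) if plan[i] == plan[i - 1]]


def toy_planner(plan, critiques):
    """Stand-in planner: drop the flagged actions."""
    drop = set(critiques)
    return [action for i, action in enumerate(plan) if i not in drop]
```

The `max_iters=3` default mirrors the abstract's observation that most sequences converge within three iterations; in practice the cap would be a tunable setting.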

Towards Generalisable Foundation Models for Brain MRI

Authors: Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander

First: 2025-10-27T15:19:46+00:00 · Latest: 2025-12-31T18:26:04+00:00

Abstract

Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.

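Title / abstract (translated from Chinese)

Title: Towards Generalisable Foundation Models for Brain MRI

Foundation models in artificial intelligence (AI) are changing medical imaging by learning general-purpose features from large-scale unlabeled datasets. In this work we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, going beyond the conventional single-slice paradigm. It supports both single-modal and multimodal inputs and handles a range of downstream tasks, including disease detection and image segmentation, while generalising across imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines in label-scarce and multi-contrast settings. By integrating information from multiple 3D MRI modalities (e.g., T1, T2, FLAIR), it improves diagnostic accuracy and reduces reliance on extensive expert annotation. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with considerable potential for clinical deployment and research innovation.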

Summary / 总结

The research aims to develop a generalisable foundation model for brain MRI by extending DINO-v2, a vision transformer, to handle 3D brain anatomy. BrainFound incorporates volumetric information from sequential MRI slices and supports both single- and multimodal inputs, enhancing its applicability to various downstream tasks. Key findings show that BrainFound outperforms existing self-supervised pretraining strategies and supervised baselines, especially in label-scarce and multi-contrast settings, improving diagnostic accuracy and reducing dependency on expert annotations. This makes BrainFound a scalable solution for 3D neuroimaging pipelines.

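The study develops BrainFound, a generalisable foundation model, by extending DINO-v2 to handle 3D brain anatomy. BrainFound integrates volumetric information from sequential MRI slices and supports single-modal and multimodal inputs for tasks such as disease detection and image segmentation. Experiments show that it outperforms existing self-supervised pretraining strategies and supervised baselines in label-scarce and multi-contrast settings, improving diagnostic accuracy and reducing dependence on expert annotation. This makes BrainFound a practical solution for 3D neuroimaging pipelines and clinical deployment.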

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Authors: Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier

Venue: NeurIPS 2025

First: 2025-12-31T18:21:52+00:00 · Latest: 2025-12-31T18:21:52+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.

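Title / abstract (translated from Chinese)

Title: ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Binary choices, as commonly used in reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength matters for decision-making under uncertainty and for the generalization of preference models, yet it is hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but they are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank the responses to pairwise comparisons by their inferred preference strength. To control for systematic variation, we compare signals only locally, within carefully constructed strata. This allows robust learning of utility differences that are consistent with the strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength from locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.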

Summary / 总结

ResponseRank is a method that addresses the challenge of learning preference strength from noisy signals by ranking responses based on relative differences in proxy signals. It improves sample efficiency and robustness across various tasks, including synthetic preference learning, language modeling, and RL control tasks. The method uses locally valid relative strength signals to learn utility differences and introduces the Pearson Distance Correlation (PDC) as a metric to isolate cardinal utility learning from ordinal accuracy.

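ResponseRank is a method for learning preference strength from noisy signals, aimed at a challenge in reinforcement learning from human feedback. It ranks responses by relative differences in proxy signals and controls for systematic variation by comparing signals only within locally constructed strata. Key findings include improved sample efficiency and robustness across tasks such as synthetic preference learning, language modeling, and RL control.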
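
The idea of comparing proxy signals only within strata can be illustrated with a small sketch. It is not the ResponseRank implementation: the record layout, the choice of annotator identity as the stratum, and the assumption that a shorter response time indicates a stronger preference are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List

def rank_within_strata(comparisons: List[dict]) -> Dict[str, List[str]]:
    """Group comparisons by stratum, then order each group from strongest to
    weakest inferred preference (assumed: shorter response time = stronger)."""
    strata = defaultdict(list)
    for comparison in comparisons:
        strata[comparison["stratum"]].append(comparison)
    return {
        key: [c["id"] for c in sorted(group, key=lambda c: c["response_time"])]
        for key, group in strata.items()
    }

comparisons = [
    {"id": "a", "stratum": "annotator_1", "response_time": 2.0},
    {"id": "b", "stratum": "annotator_1", "response_time": 0.8},
    {"id": "c", "stratum": "annotator_2", "response_time": 9.0},
    {"id": "d", "stratum": "annotator_2", "response_time": 5.5},
]
rankings = rank_within_strata(comparisons)
```

Comparison "d" (5.5 s) is never ranked against "a" (2.0 s), because a slow annotator's quick answer and a fast annotator's slow answer are not directly comparable. Only the within-stratum orderings would feed a downstream ranking loss.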

FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

Authors: Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai

First: 2025-12-31T17:57:45+00:00 · Latest: 2025-12-31T17:57:45+00:00

Abs · PDF · Code1 · Code2

Abstract

We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.

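Title / abstract (translated from Chinese)

Title: FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the lack of geometric consistency in earlier flow-based methods in order to achieve accurate and robust tracking and mapping. Our core idea is to connect flow estimation with geometric reasoning by using guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, so that depth and pose inference stay consistent across different keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. We also introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing reliable from uncertain regions, closing the loop between matching and optimization. Extensive experiments show that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality on several challenging datasets while running in real time at 18 FPS, indicating strong generalization across scenarios and practical applicability.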

Summary / 总结

FoundationSLAM is a learning-based monocular dense SLAM system that improves upon previous flow-based approaches by incorporating geometric consistency. It uses a Hybrid Flow Network to produce geometry-aware correspondences and a Bi-Consistent Bundle Adjustment Layer to enforce global consistency. Additionally, it includes a Reliability-Aware Refinement mechanism to dynamically adapt the flow update process. Experiments show that FoundationSLAM provides superior trajectory accuracy and dense reconstruction quality, running at real-time speeds of 18 FPS across various datasets.

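FoundationSLAM is a learning-based monocular dense SLAM system that improves on earlier flow-based methods by incorporating geometric consistency. It uses a Hybrid Flow Network to generate geometry-aware correspondences and a Bi-Consistent Bundle Adjustment Layer to strengthen global consistency. It also introduces a Reliability-Aware Refinement mechanism that dynamically adjusts the flow update process. Experiments show high trajectory accuracy and dense reconstruction quality across datasets, with real-time operation at 18 FPS.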

Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Authors: Zhenyu Cui, Jiahuan Zhou, Yuxin Peng

First: 2025-12-31T17:50:05+00:00 · Latest: 2025-12-31T17:50:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Lifelong person Re-IDentification (L-ReID) uses sequentially collected data to continuously train and update a ReID model, with the goal of good overall performance on all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods typically re-extract features for all historical gallery images after each update, a step known as "re-indexing". However, historical gallery images often cannot be stored directly because of data privacy concerns, and re-indexing a large-scale gallery is costly. Without re-indexing, query features extracted by the updated model are matched against gallery features extracted by earlier models, and this incompatibility greatly impairs re-identification performance. To tackle this issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: it requires continual learning that balances new and old knowledge across diverse streaming data, and it requires the features output by the new and old models to be compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates the gallery features extracted by the old model so that L-ReID can be performed efficiently and compatibly. We verify the proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that it achieves leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

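Title / abstract (translated from Chinese)

Title: Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

Lifelong person re-identification (L-ReID) uses sequentially collected data to continuously train and update a re-identification model, focusing on overall performance across all data. Its main challenge is to avoid catastrophic forgetting of old knowledge while training on new data. Existing L-ReID methods usually re-extract features for all historical gallery images after each update, which is known as "re-indexing". However, because of data privacy concerns and the high cost of re-indexing large-scale galleries, historical gallery images often cannot be stored directly. As a result, the query features extracted by the updated model are inevitably incompatible with the gallery features extracted by the models before the update, which severely degrades re-identification performance. To solve this problem, this paper focuses on a new task, Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: it requires continual learning that balances new and old knowledge in diverse streaming data, and it requires the features output by the new and old models to be mutually compatible. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that continuously updates the gallery features extracted by the old model in a compatible way, enabling efficient L-ReID. We validate Bi-C2R through theoretical analysis and extensive experiments on multiple benchmarks, and the results show that it achieves leading performance on both the newly introduced RFL-ReID task and the traditional L-ReID task.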

Summary / 总结

This paper addresses the challenge of lifelong person re-identification (L-ReID) by proposing a new task called Re-index Free Lifelong person Re-identification (RFL-ReID), which aims to perform re-identification without re-indexing historical gallery images. To tackle the issue of incompatible retrieval, the authors introduce a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that updates the gallery features extracted by the old model to ensure compatibility with features from the updated model. Experiments on multiple benchmarks show that Bi-C2R achieves leading performance on both RFL-ReID and traditional L-ReID tasks.

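This paper proposes a new task, Re-index Free Lifelong person Re-IDentification (RFL-ReID), which performs lifelong re-identification without re-indexing historical gallery images. To address catastrophic forgetting and incompatible retrieval, the authors propose the Bidirectional Continuous Compatible Representation (Bi-C2R) framework, which continually updates the gallery features extracted by the old model so that they remain compatible with features from the updated model. Experiments show that Bi-C2R achieves leading performance on both RFL-ReID and traditional L-ReID tasks.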

Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

Authors: Seunghoon Paik, Kangjie Zhou, Matus Telgarsky, Ryan J. Tibshirani

First: 2025-12-31T17:49:37+00:00 · Latest: 2025-12-31T17:49:37+00:00

Comments: 47 pages, 3 figures (7 subfigures)

Abs · PDF · Code1 · Code2

Abstract

We introduce "basic inequalities" for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. While related inequalities appear in the literature, we isolate and highlight a specific form and develop it as a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound translates the number of iterations into an effective regularization coefficient in the loss function. We demonstrate this framework through analyses of training dynamics and prediction risk bounds. In addition to revisiting and refining known results on gradient descent, we provide new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.

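Title / abstract (translated from Chinese)

Title: Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis

We introduce "basic inequalities" for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. Although related inequalities exist in the literature, we isolate and emphasize one specific form and develop it into a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound converts the number of iterations into an effective regularization coefficient in the loss function. We demonstrate the framework through analyses of training dynamics and prediction risk bounds. Besides revisiting and refining known results on gradient descent, we give new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.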

Summary / 总结

The paper introduces basic inequalities for first-order optimization algorithms, which provide a framework for connecting implicit and explicit regularization. The inequalities upper bound the difference between the objective function value at the current iterate and a reference point in terms of step sizes and distances between iterates. This framework is applied to analyze training dynamics and prediction risks, offering new results for mirror descent, gradient descent, and exponentiated gradient descent, as well as randomized predictors. Experiments on generalized linear models support the theoretical findings.

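The paper introduces basic inequalities for first-order iterative optimization algorithms, linking implicit and explicit regularization. The inequalities upper bound the gap in objective value between the current iterate and any reference point in terms of step sizes and distances. The framework is applied to training dynamics and prediction risk, giving new results for gradient descent, mirror descent, exponentiated gradient descent, and randomized predictors. Experiments on generalized linear models illustrate and supplement the theory.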
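
The shape of such a bound can be checked numerically. The sketch below uses the textbook inequality for constant-step gradient descent on a convex, L-smooth objective with step size η ≤ 1/L, namely f(θ_T) − f(z) ≤ (‖θ_0 − z‖² − ‖θ_T − z‖²) / (2ηT), where ηT is the accumulated step size. That form is an assumption chosen for illustration and may differ from the exact statement in the paper. The quadratic objective and all function names are hypothetical.

```python
def f(theta, a, c):
    # Convex quadratic: 0.5 * sum_i a_i * (theta_i - c_i)^2, smoothness L = max(a).
    return 0.5 * sum(ai * (t - ci) ** 2 for ai, t, ci in zip(a, theta, c))

def grad(theta, a, c):
    return [ai * (t - ci) for ai, t, ci in zip(a, theta, c)]

def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def gradient_descent(theta0, a, c, eta, steps):
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta, a, c)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

def basic_inequality_gap(theta0, z, a, c, eta, steps):
    """Return RHS - LHS of the textbook bound; non-negative when the bound holds."""
    theta_T = gradient_descent(theta0, a, c, eta, steps)
    lhs = f(theta_T, a, c) - f(z, a, c)
    rhs = (sq_dist(theta0, z) - sq_dist(theta_T, z)) / (2 * eta * steps)
    return rhs - lhs

a, c = [1.0, 4.0], [1.0, -2.0]          # curvatures and minimizer; L = 4
reference_points = [[0.5, -1.0], [1.0, -2.0], [3.0, 3.0], [0.0, 0.0]]
gaps = [basic_inequality_gap([0.0, 0.0], z, a, c, eta=0.25, steps=5)
        for z in reference_points]
```

The bound holds for every reference point z, not only the minimizer. The right-hand side has the same form as a ridge-type penalty with coefficient 1/(2ηT), which is one way to read the abstract's remark that the number of iterations acts like a regularization coefficient.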

PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes

Authors: Luca Collorone, Mert Kiray, Indro Spinelli, Fabio Galasso, Benjamin Busam

First: 2025-12-31T17:32:31+00:00 · Latest: 2025-12-31T17:32:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, and they require slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation via collision-aware, physics-based manipulation of arbitrary, multi-material objects. Finally, PhysTalk is training-free and computationally lightweight, which makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.

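Title / abstract (translated from Chinese)

Title: PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes

Realistic visual simulations are everywhere, but creating them requires computing time, rendering, and expert animation knowledge. Generating open-vocabulary visual effects from text input is a promising solution that can unlock great creative potential. However, current pipelines lack physical realism and effective language interfaces, and they need slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation through collision-aware, physics-based manipulation of arbitrary multi-material objects. Finally, PhysTalk is training-free and computationally lightweight, which makes 4D animation broadly accessible and moves these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-based pipeline.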

Summary / 总结

PhysTalk is a framework that translates user prompts into real-time, physics-based 4D animations using a 3D Gaussian Splatting (3DGS) scene as input. It leverages a large language model to generate executable code that modifies 3DGS parameters through lightweight proxies and particle dynamics, enabling interactive and collision-aware manipulation of multi-material objects. This approach avoids the need for time-consuming mesh extraction and shifts the workflow from a 'render and wait' paradigm to an interactive dialogue, making 4D animation more accessible.

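PhysTalk is a framework that turns user prompts into real-time, physics-based 4D animations, taking a 3D Gaussian Splatting (3DGS) scene as input. It uses a large language model to generate executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics, enabling interactive, collision-aware manipulation of multi-material objects. The approach avoids time-consuming mesh extraction and shifts the workflow from "render and wait" to an interactive dialogue, making 4D animation more accessible.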

DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

First: 2025-12-31T17:31:29+00:00 · Latest: 2025-12-31T17:31:29+00:00

Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)

Abs · PDF · Code1 · Code2

Abstract

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

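Title / abstract (translated from Chinese)

Title: DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Vision-language models (VLMs) are increasingly used as the central reasoning module of embodied agents. Existing benchmarks evaluate them under ideal, well-lit conditions, but round-the-clock (24/7) operation demands good performance under many kinds of visual degradation, including low light at night or in dark environments, a core requirement that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating perceptual primitives relevant to embodied question answering (EQA) under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, which makes attributable robustness analysis possible. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise, followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals the limitations of VLMs when operating under these visual conditions. Our code and benchmark dataset will be released upon acceptance.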

Summary / 总结

DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in low-light indoor environments, addressing the underexplored challenge of robust 24/7 operation. It uses a controlled degradation process to simulate low-light conditions and evaluates the models' perceptual abilities. Key findings show that state-of-the-art VLMs struggle with question answering under these conditions, highlighting their limitations in low-light scenarios.

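DarkEQA is a benchmark for evaluating vision-language models in low-light indoor environments, targeting a gap in their readiness for 24/7 operation. It degrades egocentric observations in a physically faithful way to isolate perception limits. Key findings show that state-of-the-art VLMs perform poorly in low light, highlighting their limitations in practical use. The benchmark includes a rendering pipeline that simulates physical illumination drop and sensor noise, and it will be released upon acceptance.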
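
The idea of degrading an image in linear space rather than on display-encoded values can be sketched as below. This is a simplified illustration under stated assumptions, not the DarkEQA pipeline: a plain power-law gamma of 2.2 stands in for the ISP, the noise is a Gaussian approximation with signal-dependent variance, and the parameter names and default values are hypothetical.

```python
import random

GAMMA = 2.2  # simple power-law stand-in for a real ISP tone curve (assumption)

def simulate_low_light(pixels, light_scale, shot_gain=0.01, read_sigma=0.002, seed=0):
    """Darken display-encoded pixel values in [0, 1] via a linear-space model."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        linear = p ** GAMMA                    # undo display gamma -> linear intensity
        dimmed = linear * light_scale          # illumination drop applied in linear space
        # Signal-dependent (shot-like) variance plus constant read-noise variance.
        sigma = (shot_gain * dimmed + read_sigma ** 2) ** 0.5
        noisy = dimmed + rng.gauss(0.0, sigma)
        clipped = min(max(noisy, 0.0), 1.0)    # sensor values cannot leave [0, 1]
        out.append(clipped ** (1.0 / GAMMA))   # re-encode for display
    return out

row = [0.2, 0.5, 0.9]
dark_row = simulate_low_light(row, light_scale=0.05)
noise_free = simulate_low_light([0.5], light_scale=0.1, shot_gain=0.0, read_sigma=0.0)
```

Applying the scale before re-encoding matters: with the noise switched off, a pixel value p maps to p · s^(1/γ) rather than p · s, so the displayed image darkens less than a naive multiplication would suggest, while the noise becomes relatively stronger in dark regions.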

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-31T17:30:11+00:00

Abs · PDF · Code1 · Code2

Abstract

While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

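Title / abstract (translated from Chinese)

Title: DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Although vision-language models (VLMs) perform well across multimodal tasks, their choice of vision encoder has a fundamental weakness: the low-level features lack the robust structural and spatial information needed for document understanding and web agents. To close this gap, we introduce DAVE, a vision encoder built for VLMs and tailored to these tasks. Our training pipeline is designed to use abundant unlabeled data and so avoid costly large-scale annotation of document and web images. We begin with self-supervised pretraining on unlabeled images, followed by supervised autoregressive pretraining in which the model learns tasks such as parsing and localization from a small amount of high-quality data. In the supervised stage, we adopt two strategies to improve the encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) we introduce a novel model-merging scheme that combines encoders trained with different text decoders, to ensure broad compatibility with different web agentic architectures; (ii) we use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document- and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach and establish DAVE as a strong vision encoder for document and web applications.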

Summary / 总结

DAVE is a vision encoder designed for VLMs and tailored to document understanding and web agent tasks. It is trained with self-supervised pretraining on abundant unlabeled images followed by supervised autoregressive pretraining on limited, high-quality data. In the supervised stage, it merges encoders trained with different text decoders and uses ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with its own document- and web-specific representations. Experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the approach and establish DAVE as a strong vision encoder for these applications.

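DAVE is a vision encoder designed for VLMs to strengthen document understanding and web agent tasks. It is first pretrained with self-supervision on unlabeled data and then with supervised autoregressive pretraining on limited, high-quality data. DAVE combines a model-merging scheme with ensemble training to improve compatibility and performance. Experiments on document tasks, VQA, web localization, and agent-based benchmarks validate its effectiveness as a robust vision encoder for these applications.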

SymSeqBench: a unified framework for the generation and analysis of rule-based symbolic sequences and datasets

Authors: Barna Zajzon, Younes Bouhadjar, Maxime Fabre, Felix Schmidt, Noah Ostendorf, Emre Neftci, Abigail Morrison, Renato Duarte

First: 2025-12-31T17:18:26+00:00 · Latest: 2025-12-31T17:18:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Sequential structure is a key feature of multiple domains of natural cognition and behavior, such as language, movement and decision-making. Likewise, it is also a central property of tasks to which we would like to apply artificial intelligence. It is therefore of great importance to develop frameworks that allow us to evaluate sequence learning and processing in a domain agnostic fashion, whilst simultaneously providing a link to formal theories of computation and computability. To address this need, we introduce two complementary software tools: SymSeq, designed to rigorously generate and analyze structured symbolic sequences, and SeqBench, a comprehensive benchmark suite of rule-based sequence processing tasks to evaluate the performance of artificial learning systems in cognitively relevant domains. In combination, SymSeqBench offers versatility in investigating sequential structure across diverse knowledge domains, including experimental psycholinguistics, cognitive psychology, behavioral analysis, neuromorphic computing and artificial intelligence. Due to its basis in Formal Language Theory (FLT), SymSeqBench provides researchers in multiple domains with a convenient and practical way to apply the concepts of FLT to conceptualize and standardize their experiments, thus advancing our understanding of cognition and behavior through shared computational frameworks and formalisms. The tool is modular, openly available and accessible to the research community.

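Title / abstract (translated from Chinese)

Title: SymSeqBench: a unified framework for the generation and analysis of rule-based symbolic sequences and datasets

Sequential structure is a key feature of many domains of natural cognition and behavior, such as language, movement, and decision-making. It is likewise a central property of the tasks to which we would like to apply artificial intelligence. It is therefore important to develop frameworks that evaluate sequence learning and processing in a domain-agnostic way while also linking to formal theories of computation and computability. To meet this need, we introduce two complementary software tools: SymSeq, for rigorously generating and analyzing structured symbolic sequences, and SeqBench, a comprehensive benchmark suite of rule-based sequence processing tasks for evaluating artificial learning systems in cognitively relevant domains. Used together, SymSeqBench offers flexibility in studying sequential structure across knowledge domains including experimental psycholinguistics, cognitive psychology, behavioral analysis, neuromorphic computing, and artificial intelligence. Because it is grounded in Formal Language Theory (FLT), SymSeqBench gives researchers in many fields a convenient and practical way to apply FLT concepts to the design and standardization of their experiments, advancing our understanding of cognition and behavior through shared computational frameworks and formalisms. The tool is modular, openly available, and accessible to the research community.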

Summary / 总结

The research introduces SymSeqBench, a unified framework for generating and analyzing structured symbolic sequences and datasets. It consists of SymSeq for rigorous sequence generation and analysis, and SeqBench as a benchmark suite for evaluating artificial learning systems. Key findings include the framework's versatility across various domains such as psycholinguistics, cognitive psychology, and artificial intelligence, and its ability to apply Formal Language Theory concepts to standardize experiments and advance understanding of cognition and behavior through shared computational frameworks.

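The study introduces SymSeqBench, a unified framework for generating and analyzing structured symbolic sequences and datasets. It comprises SymSeq, for rigorous sequence generation and analysis, and SeqBench, a benchmark suite for evaluating artificial learning systems. Key points include the framework's versatility across fields such as psycholinguistics, cognitive psychology, and artificial intelligence, and its use of Formal Language Theory concepts to standardize experiments, advancing the understanding of cognition and behavior through shared computational frameworks and formalisms.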
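
Rule-based symbolic sequences of the kind this entry discusses can be illustrated with a small finite-state (regular) grammar. The sketch below does not use the SymSeq or SeqBench API: the grammar, state names, and function names are hypothetical and only show how a formal-language rule set can both generate sequences and decide membership.

```python
import random

# Hypothetical finite-state grammar: state -> list of (symbol, next_state).
# A next_state of None means the sequence may end after emitting the symbol.
GRAMMAR = {
    "S0": [("A", "S1"), ("B", "S2")],
    "S1": [("C", "S1"), ("D", None)],
    "S2": [("E", "S1"), ("D", None)],
}

def generate(grammar, start="S0", rng=None, max_len=50):
    """Random walk through the grammar; may be cut off at max_len before stopping."""
    rng = rng or random.Random()
    state, symbols = start, []
    while state is not None and len(symbols) < max_len:
        symbol, state = rng.choice(grammar[state])
        symbols.append(symbol)
    return "".join(symbols)

def accepts(grammar, sequence, start="S0"):
    """Return True if the sequence can be produced by the grammar and ends at a stop."""
    states = {start}
    for symbol in sequence:
        states = {nxt
                  for s in states if s is not None
                  for sym, nxt in grammar[s] if sym == symbol}
        if not states:
            return False
    return None in states

sample = generate(GRAMMAR, rng=random.Random(0))
```

Pairing a generator with an acceptor is what makes such sequences useful as benchmarks: a model's outputs can be scored exactly against the rule set rather than against a finite list of examples.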

Distribution-Dependent Rates for Multi-Distribution Learning

Authors: Rafael Hanashiro, Patrick Jaillet

First: 2023-12-20T15:50:16+00:00 · Latest: 2025-12-31T17:05:43+00:00

Abs · PDF · Code1 · Code2

Abstract

To address the needs of modeling uncertainty in sensitive machine learning applications, the setup of distributionally robust optimization (DRO) seeks good performance uniformly across a variety of tasks. The recent multi-distribution learning (MDL) framework tackles this objective in a dynamic interaction with the environment, where the learner has sampling access to each target distribution. Drawing inspiration from the field of pure-exploration multi-armed bandits, we provide distribution-dependent guarantees in the MDL regime, that scale with suboptimality gaps and result in superior dependence on the sample size when compared to the existing distribution-independent analyses. We investigate two non-adaptive strategies, uniform and non-uniform exploration, and present non-asymptotic regret bounds using novel tools from empirical process theory. Furthermore, we devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature. We also conduct a small synthetic experiment illustrating the comparative strengths of each strategy.

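Title / abstract (translated from Chinese)

Title: Distribution-Dependent Rates for Multi-Distribution Learning

To meet the need for modeling uncertainty in sensitive machine learning applications, the distributionally robust optimization (DRO) setting seeks uniformly good performance across a variety of tasks. The recent multi-distribution learning (MDL) framework pursues this goal through dynamic interaction with the environment, in which the learner has sampling access to each target distribution. Drawing on pure-exploration multi-armed bandits, we provide distribution-dependent guarantees in the MDL setting that scale with the suboptimality gaps and depend more favorably on the sample size than existing distribution-independent analyses. We study two non-adaptive strategies, uniform and non-uniform exploration, and give non-asymptotic regret bounds using new tools from empirical process theory. In addition, we design an adaptive optimistic algorithm, LCB-DR, that shows improved dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature. We also run a small synthetic experiment to illustrate the comparative strengths of each strategy.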

Summary / 总结

This paper studies the multi-distribution learning (MDL) framework, in which a learner with sampling access to each target distribution seeks good performance uniformly across them. The authors provide distribution-dependent guarantees that scale with suboptimality gaps and depend more favorably on the sample size than existing distribution-independent analyses. They analyze two non-adaptive strategies, uniform and non-uniform exploration, and introduce an adaptive optimistic algorithm, LCB-DR, with improved dependence on the gaps. A small synthetic experiment illustrates the comparative strengths of each strategy.

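The paper studies the multi-distribution learning (MDL) framework for learning under uncertainty. The authors provide distribution-dependent guarantees that scale with the suboptimality gaps and depend more favorably on the sample size than existing methods. They examine two non-adaptive strategies, uniform and non-uniform exploration, and introduce an adaptive optimistic algorithm, LCB-DR, whose behavior is illustrated in synthetic-data experiments.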
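
The uniform-exploration baseline mentioned in this entry can be sketched for a finite hypothesis set: draw the same number of samples from every distribution, estimate each hypothesis's risk on each distribution, and return the hypothesis with the smallest worst-case estimate. This is a toy illustration and not the paper's algorithm or LCB-DR. Losses are simulated as Bernoulli draws from a hypothetical table of true risks, and each (hypothesis, distribution) pair is sampled independently for simplicity.

```python
import random

def uniform_exploration(true_loss, n_per_dist, seed=0):
    """Pick the hypothesis minimizing the maximum empirical risk over distributions.

    true_loss[h][k] is the (hypothetical) expected 0/1 loss of hypothesis h on
    distribution k; it is used only to simulate sampled losses.
    """
    rng = random.Random(seed)
    n_hyp, n_dist = len(true_loss), len(true_loss[0])
    worst_case = []
    for h in range(n_hyp):
        risks = []
        for k in range(n_dist):
            # Same budget for every distribution: this is the "uniform" part.
            losses = [1 if rng.random() < true_loss[h][k] else 0
                      for _ in range(n_per_dist)]
            risks.append(sum(losses) / n_per_dist)
        worst_case.append(max(risks))
    best = min(range(n_hyp), key=worst_case.__getitem__)
    return best, worst_case

# Worst-case true risks are 0.9, 0.5, and 0.8, so hypothesis 1 is min-max optimal.
table = [[0.1, 0.9], [0.4, 0.5], [0.8, 0.2]]
best, worst_case = uniform_exploration(table, n_per_dist=2000)
```

When the gap between the best and second-best worst-case risks is large, as in this table, far fewer samples would suffice. That is the intuition behind gap-dependent rates and behind adaptive schemes that stop sampling clearly suboptimal hypotheses early.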

ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou

First: 2025-12-31T16:51:14+00:00 · Latest: 2025-12-31T16:51:14+00:00

Comments: 17 pages, 15 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x, y), which rules out free-form, closed-loop trajectories (e.g., dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model to act as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g., Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.

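Title / abstract (translated from Chinese)

Title: ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

Building intelligent agents capable of dexterous manipulation is essential for human-like automation in robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x, y), which rules out free-form, closed-loop trajectories (such as dragging a progress bar) that need continuous, real-time perception and adjustment. In this work we develop ShowUI-$π$, the first flow-based generative model to act as a GUI dexterous hand, with the following designs: (i) unified discrete-continuous actions, which integrate discrete clicks and continuous drags in one shared model so that it adapts flexibly to different interaction modes; (ii) flow-based action generation for drag modeling, in which a lightweight action expert predicts incremental cursor adjustments from continuous visual observations, ensuring smooth and stable trajectories; (iii) drag training data and a benchmark: we manually collect and synthesize 20,000 drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro) and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing the drag capabilities of GUI agents. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g., Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). By contrast, ShowUI-$π$ reaches 26.98 with only 450 million parameters, which underlines both the difficulty of the task and the effectiveness of our approach. We hope this work moves GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi/.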

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

Authors: Gérard Ben Arous, Murat A. Erdogdu, Nuri Mert Vural, Denny Wu

Venue: NeurIPS 2025

First: 2025-08-05T17:57:56+00:00 · Latest: 2025-12-31T16:43:30+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $f_*(\boldsymbol{x}) \propto \sum_{j=1}^{r} λ_j σ\left(\langle \boldsymbol{θ}_j, \boldsymbol{x}\rangle\right)$ with $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, $σ$ is the 2nd Hermite polynomial, and $\{\boldsymbol{θ}_j\}_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^{β}$ for $β \in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients $λ_j \asymp j^{-α}$ for $α \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.

Summary / 总结

This study investigates the optimization and sample complexity of training a two-layer neural network with quadratic activation in high dimensions. The target function is a weighted sum of second-Hermite (quadratic) features along orthonormal signal directions, with non-negative second-layer coefficients that follow a power-law decay. The research presents a sharp analysis of the stochastic gradient descent (SGD) dynamics, in both the population limit and the online finite-sample setting, and derives scaling laws for the prediction risk, emphasizing the power-law dependencies on optimization time, sample size, and model width. The analysis combines a precise characterization of the matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.

研究探讨了在高维空间中使用二次激活函数的两层神经网络的优化和样本复杂性。网络使用随机梯度下降（SGD）训练特定的二次函数生成的数据，其中信号方向正交。研究推导了预测风险的缩放律，强调了优化时间、样本大小和模型宽度的依赖关系。关键发现包括SGD动力学的精确分析和无限维有效动力学的收敛保证。
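
The data model in the abstract above can be made concrete with a small sketch. It assumes the normalized probabilists' Hermite polynomial $(z^2-1)/\sqrt{2}$ for $σ$ and rescales the coefficients to unit norm so that the target has unit second moment; the abstract only states proportionality, so both choices, and the helper names `he2`, `make_teacher`, and `teacher`, are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def he2(z):
    """Normalized 2nd probabilists' Hermite polynomial (assumed normalization)."""
    return (z ** 2 - 1.0) / np.sqrt(2.0)

def make_teacher(d, beta, alpha, rng):
    """Draw r ~ d^beta orthonormal directions and power-law coefficients."""
    r = max(1, int(round(d ** beta)))
    # QR of a Gaussian matrix gives r orthonormal columns in R^d.
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))
    theta = q.T                                   # shape (r, d), orthonormal rows
    lam = np.arange(1, r + 1, dtype=float) ** (-alpha)
    lam /= np.linalg.norm(lam)                    # assumed scaling: E[f(x)^2] = 1
    return theta, lam

def teacher(x, theta, lam):
    """f_*(x) = sum_j lam_j * He2(<theta_j, x>) for a batch x of shape (n, d)."""
    return he2(x @ theta.T) @ lam

rng = np.random.default_rng(0)
theta, lam = make_teacher(d=16, beta=0.5, alpha=1.0, rng=rng)
x = rng.standard_normal((100_000, 16))            # x ~ N(0, I_d)
y = teacher(x, theta, lam)
print("empirical E[f^2]:", float(np.mean(y ** 2)))
```

Because the directions are orthonormal, the projections $\langle \boldsymbol{θ}_j, \boldsymbol{x}\rangle$ are independent standard Gaussians, which is why the unit-norm coefficients give a target with second moment close to one.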

AMAP Agentic Planning Technical Report

Authors: Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Tucheng Lin, Xin Li, Ning Guo

First: 2025-12-31T16:39:09+00:00 · Latest: 2025-12-31T16:39:09+00:00

Abstract

We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries with a filter ratio of 1:10,000, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.

中文标题/摘要

标题：AMAP代理规划技术报告

我们介绍了STAgent，这是一种针对时空理解定制的代理大型语言模型，旨在解决诸如受约束的兴趣点发现和行程规划等复杂任务。STAgent是一种专门模型，能够在时空场景中与十个不同的工具进行交互，使其能够在复杂推理过程中探索、验证和细化中间步骤。值得注意的是，STAgent有效地保留了其通用能力。我们通过三个关键贡献赋予STAgent这些能力：(1) 一个稳定的工具环境，支持超过十个领域特定工具，实现异步部署和训练；(2) 一个分层数据整理框架，能够从海量数据中识别高质量数据，筛选出高质量查询的比例为1:10,000，强调多样性和难度；(3) 一个级联训练方案，从种子SFT阶段开始，作为衡量查询难度的守护者，随后是针对高确定性查询的SFT微调阶段，最终利用低确定性数据的RL阶段。通过使用Qwen3-30B-A3B初始化以建立强大的SFT基础并利用样本难度的见解，STAgent在TravelBench上表现出色，同时在广泛的一般基准测试中保持其通用能力，从而证明了我们提出的代理模型的有效性。

Summary / 总结

STAgent is an agentic large language model designed for spatio-temporal understanding and complex task solving. It interacts with ten tools to handle tasks such as constrained point-of-interest discovery and itinerary planning. Its capabilities come from a stable tool environment, a hierarchical data curation framework, and a cascaded training recipe (seed SFT, a second SFT stage on high-certainty queries, then RL on low-certainty data). Together these let STAgent achieve promising performance on TravelBench while preserving its general capabilities across a wide range of general benchmarks.

STAgent 是一个用于时空理解及复杂任务解决的代理型大型语言模型，能够与十个时空工具交互，实现推理过程中步骤的探索、验证和优化。STAgent 通过稳定工具环境、分层数据收集框架和级联训练方法实现这一目标。该模型在 TravelBench 上表现出色，并在多种通用基准测试中保持了通用能力，展示了所提代理模型的有效性。

MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control

Authors: Yongwei Zhang, Yuanzhe Xing, Quan Quan, Zhikun She

First: 2025-12-31T16:36:44+00:00 · Latest: 2025-12-31T16:36:44+00:00

Abstract

Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $λ$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.

中文标题/摘要

标题：MSACL：多步演员-评论家学习与李亚普诺夫证书相结合的指数稳定控制

在无模型强化学习（RL）中实现可证明的稳定性仍然是一个挑战，特别是在探索与严格安全之间的平衡。本文介绍了MSACL框架，该框架通过多步李亚普诺夫证书学习将指数稳定性理论与最大熵RL相结合。与依赖复杂奖励工程的方法不同，MSACL利用离策略多步数据来学习满足理论稳定性条件的李亚普诺夫证书。通过引入指数稳定性标签（ESL）和λ加权聚合机制，该框架有效地在多步学习中平衡了偏差-方差权衡。通过稳定性意识的优势函数指导策略优化，确保学习的策略促进快速的李亚普诺夫下降。我们在六个基准测试中评估了MSACL，包括稳定化和非线性跟踪任务，证明了其在最先进的基于李亚普诺夫的RL算法中的优越性。MSACL在简单奖励下实现了指数稳定性并快速收敛，同时对不确定性具有显著的鲁棒性，并且能够泛化到未见过的轨迹。敏感性分析确定了多步时滞n=20为多种系统中的稳健默认值。通过将李亚普诺夫理论与离策略演员-评论家框架相结合，MSACL为验证性安全的学习控制奠定了基础。源代码和基准环境将公开提供。

Summary / 总结

MSACL is a framework that combines exponential stability theory with maximum entropy reinforcement learning to achieve provable stability in model-free RL. It uses multi-step Lyapunov certificate learning and introduces Exponential Stability Labels to balance bias and variance. MSACL outperforms state-of-the-art Lyapunov-based RL algorithms in six benchmarks, showing exponential stability, rapid convergence, and robustness to uncertainties. The multi-step horizon of 20 is found to be a robust default setting.

MSACL 是一种结合了指数稳定性理论和最大熵强化学习的框架，以在无模型的 RL 中实现可证明的稳定性。它使用多步李亚普诺夫证书学习，并引入了指数稳定性标签来平衡偏差和方差。MSACL 在六个基准测试中优于最先进的基于李亚普诺夫的 RL 算法，展示了指数稳定性、快速收敛以及对不确定性的较强鲁棒性。多步时域 n=20 被发现是一个稳健的默认设置。

A Geometric Theory of Cognition

Authors: Laha Ale

First: 2025-12-13T07:39:53+00:00 · Latest: 2025-12-31T16:33:03+00:00

Abstract

Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects--rapid intuitive responses and slower deliberative reasoning--emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.

中文标题/摘要

标题：认知的几何理论

人类的认知涵盖了感知、记忆、直觉判断、推理、行动选择和社会推理，但这些能力通常通过不同的计算理论来解释。在这里，我们提出了一种统一的数学框架，在这种框架中，各种认知过程源自单一的几何原理。我们将认知状态表示为一个流形上的点，该流形上装备了由表示约束、计算成本和认知变量之间结构关系的学习黎曼度量。一个标量认知势能结合了预测准确性、结构简约性、任务效用和规范或逻辑要求。认知表现为这种势能的黎曼梯度流，提供了一种普遍的动力学定律，从中可以产生广泛的心理学现象。经典的双重过程效应——快速的直觉反应和较慢的推理——自然地源自由度量引起的各向异性，从而产生内在的时间尺度分离和几何相变，而无需引入模块化或混合架构。我们推导出这些状态的分析条件，并通过模拟经典认知任务的行为特征来证明它们。这些结果共同为认知建立了一个几何基础，并为开发更通用和类人的人工智能系统提供了指导原则。

Summary / 总结

This paper aims to unify various cognitive processes under a single geometric framework. The authors represent cognitive states as points on a differentiable manifold with a learned Riemannian metric, and define a scalar cognitive potential that combines predictive accuracy and other factors. Cognition is described as the Riemannian gradient flow of this potential, leading to the emergence of classical dual-process effects without modular architectures. The study demonstrates these effects through simulations of cognitive tasks, providing a new foundation for understanding and developing AI systems that mimic human cognition more closely.

本文旨在通过单一的几何框架统一各种认知过程。作者将认知状态表示为具有学习到的黎曼度量的流形上的点，并定义了一个结合预测准确性等要素的标量认知势能。认知被描述为这种势能的黎曼梯度流，从而自然地产生了经典的心理学二过程效应，而无需模块化或混合架构。研究通过认知任务的模拟展示了这些效应，为理解和开发更接近人类认知的AI系统提供了新的基础。
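
As a rough illustration of the dynamical law described above, the following sketch integrates a Riemannian gradient flow $\dot{x} = -G(x)^{-1}\nabla Φ(x)$ with explicit Euler steps. The quadratic potential, the fixed diagonal metric, and the function name `riemannian_gradient_flow` are assumptions chosen for the demonstration; the paper's metric is learned and its potential combines several terms not modeled here.

```python
import numpy as np

def riemannian_gradient_flow(grad_phi, metric, x0, dt=0.01, steps=500):
    """Euler integration of dx/dt = -G(x)^{-1} grad Phi(x)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        # Solve G(x) v = grad Phi(x) instead of forming the inverse explicitly.
        v = np.linalg.solve(metric(x), grad_phi(x))
        x = x - dt * v
        path.append(x.copy())
    return np.array(path)

# Hypothetical toy setup: isotropic quadratic potential, anisotropic metric.
goal = np.array([1.0, 1.0])
grad_phi = lambda x: x - goal                    # Phi(x) = 0.5 * ||x - goal||^2
metric = lambda x: np.diag([1.0, 100.0])         # second coordinate is "costly"

path = riemannian_gradient_flow(grad_phi, metric, x0=[0.0, 0.0])
print("final state:", path[-1])
```

Although the potential treats both coordinates identically, the metric makes the first coordinate relax about a hundred times faster than the second, which is the kind of metric-induced time-scale separation the abstract associates with fast intuitive and slow deliberative responses.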

VIPER: Process-aware Evaluation for Generative Video Reasoning

Authors: Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu

First: 2025-12-31T16:31:59+00:00 · Latest: 2025-12-31T16:31:59+00:00

Comments: Work in progress

Abstract

Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

中文标题/摘要

标题：VIPER：基于过程的生成视频推理评估

近期在视频生成方面的突破展示了新兴的能力，称为连续帧推理（CoF），模型通过生成连续帧来解决复杂任务。尽管这些模型在生成视频推理（GVR）方面显示出潜力，但现有的评估框架通常依赖于单帧评估，这可能导致结果作弊，即模型通过错误的过程得出正确的结论。为了解决这个问题，我们提出了一种基于过程的评估范式。我们引入了VIPER，这是一个涵盖16项任务的综合基准，涉及时间、结构、符号、空间、物理和规划推理。此外，我们提出了过程-结果一致性（POC@r）这一新指标，该指标利用VLM作为评判者并采用分层评分标准，评估中间步骤的有效性和最终结果。我们的实验表明，最先进的视频模型在POC@1.0上的表现仅约为20%，显示出显著的结果作弊。我们进一步探讨了测试时缩放和采样鲁棒性的影响，突显了当前视频生成与真正泛化的视觉推理之间存在的巨大差距。我们的基准将公开发布。

Summary / 总结

The paper introduces VIPER, a process-aware evaluation framework for generative video reasoning, addressing the issue of outcome-hacking in existing evaluation methods. It proposes a new metric, Process-outcome Consistency (POC@r), which evaluates both the validity of intermediate steps and the final result using a hierarchical rubric. Experiments show that state-of-the-art video models achieve only about 20% POC@1.0, indicating significant outcome-hacking and a large gap in true visual reasoning capabilities. The benchmark includes 16 tasks covering various reasoning types and will be publicly released.

研究旨在通过提出过程感知评估范式来解决生成视频推理（GVR）模型中的结果作弊问题。VIPER是一个新的基准，评估模型在涉及多种推理类型的16个任务上的表现，并引入了POC@r指标，该指标评估中间步骤的有效性和最终结果。实验结果显示，最先进的模型在POC@1.0上的得分仅为约20%，表明存在显著的结果作弊。研究还探讨了测试时缩放和采样鲁棒性的影响，揭示了当前视频生成与真正的视觉推理能力之间的差距。

ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

Authors: Xinran Gong, Gorkem Durak, Halil Ertugrul Aktas, Vedat Cicek, Jinkui Hao, Ulas Bagci, Nilay S. Shah, Bo Zhou

First: 2025-12-31T16:29:05+00:00 · Latest: 2025-12-31T16:29:05+00:00

Comments: 21 pages, 8 figures

Abstract

Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is often affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, but this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

中文标题/摘要

标题：ProDM：基于合成现实的属性感知渐进扩散模型用于非门控胸部CT冠状动脉钙化运动校正

冠状动脉钙化(CAC)评分来自胸部CT是一种成熟的工具，用于分层和细化临床心血管疾病风险评估。CAC量化依赖于钙化病灶的准确勾勒，但常常受到心脏和呼吸运动引入的伪影影响。心电图门控心脏CT显著减少了运动伪影，但由于门控要求和缺乏保险覆盖，其在人群筛查和常规成像中的应用受到限制。尽管从非门控胸部CT中识别偶然的CAC越来越多地被认为是一种可行的替代方案，因为它提供了更易于获取和广泛可用的替代方案，但该模态受限于更严重的运动伪影。我们提出了ProDM（属性感知渐进校正扩散模型），这是一种生成扩散框架，可以从非门控CT中恢复无运动的钙化病灶。ProDM 引入了三个关键组件：(1) 一种CAC运动模拟数据引擎，直接从心脏门控CT中合成具有多种运动轨迹的现实非门控获取，从而实现无需配对数据的监督训练；(2) 一种属性感知学习策略，通过可微分的钙化一致性损失整合钙化特定先验，以保留病灶完整性；(3) 一种渐进校正方案，在扩散步骤中逐步减少伪影，以增强稳定性和钙化准确性。在真实患者数据集上的实验表明，与几个基线相比，ProDM 显著提高了CAC评分准确性、空间病灶保真度和风险分层性能。在真实非门控扫描上的读者研究进一步证实，ProDM 抑制了运动伪影并提高了临床可用性。这些发现突显了渐进、属性感知框架在常规胸部CT成像中可靠CAC量化中的潜力。

Summary / 总结

ProDM is a generative diffusion framework designed to correct motion artifacts in non-gated chest CT scans for coronary artery calcium (CAC) scoring. It includes a motion simulation engine, a property-aware learning strategy, and a progressive correction scheme. Experiments show that ProDM enhances CAC scoring accuracy and lesion fidelity, outperforming baseline methods and improving clinical usability.

ProDM 是一种生成扩散框架，旨在通过纠正非门控胸部 CT 扫描中的运动伪影来实现准确的冠状动脉钙化 (CAC) 评分。它包括一个运动模拟引擎、一种钙化特定的学习策略以及一个逐步纠正方案。实验表明，ProDM 提高了 CAC 评分的准确性和病灶空间保真度，优于基线方法，并改善了临床适用性。

RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

Authors: Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang, Jian Xu, Bo Zheng

First: 2025-12-31T16:09:08+00:00 · Latest: 2025-12-31T16:09:08+00:00

Abstract

Search relevance plays a central role in web e-commerce. While large language models (LLMs) have shown significant results on relevance tasks, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics across the industry. To address this limitation, we propose the Rule-Aware benchmark with Image for Relevance assessment (RAIR), a Chinese dataset derived from real-world scenarios. RAIR establishes a standardized framework for relevance assessment and provides a set of universal rules, which together form the foundation for standardized evaluation. Additionally, RAIR analyzes the essential capabilities required of current relevance models and introduces a comprehensive dataset consisting of three subsets: (1) a general subset with industry-balanced sampling to evaluate fundamental model competencies; (2) a long-tail hard subset focusing on challenging cases to assess performance limits; and (3) a visual salience subset for evaluating multimodal understanding capabilities. We conducted experiments on RAIR using 14 open- and closed-source models. The results demonstrate that RAIR presents sufficient challenges even for GPT-5, which achieved the best performance. RAIR data are now available, serving as an industry benchmark for relevance assessment while providing new insights into general LLM and Visual Language Model (VLM) evaluation.

中文标题/摘要

标题：RAIR：一种规则意识基准，统一长尾和视觉显著性子集，用于电子商务相关性评估

搜索相关性在网页电子商务中起着核心作用。虽然大型语言模型（LLMs）在相关性任务上取得了显著成果，但现有基准缺乏足够的复杂性，无法进行全面模型评估，导致行业内缺乏标准化的相关性评估指标。为解决这一局限，我们提出了规则意识基准与图像相关性评估（RAIR），这是一个源自现实场景的中文数据集。RAIR 建立了一个标准化的相关性评估框架，并提供了一套通用规则，为标准化评估奠定了基础。此外，RAIR 分析了当前相关性模型所需的关键能力，并引入了一个综合数据集，包括三个子集：（1）一个行业平衡采样的通用子集，用于评估基本模型能力；（2）一个长尾难题子集，专注于具有挑战性的案例以评估性能极限；（3）一个视觉显著性子集，用于评估多模态理解能力。我们在 RAIR 上使用了 14 个开源和闭源模型进行了实验。结果表明，即使对于表现最佳的 GPT-5，RAIR 也提供了足够的挑战。RAIR 数据现已可用，作为相关性评估的行业基准，同时为通用大语言模型（LLM）和视觉语言模型（VLM）评估提供了新的见解。

Summary / 总结

The paper introduces RAIR, a benchmark for e-commerce search relevance assessment, addressing the lack of complexity in existing benchmarks. It consists of three subsets: a general subset, a long-tail hard subset, and a visual salience subset. Experiments on 14 models, including GPT-5, show that RAIR presents significant challenges, even for advanced models. The dataset is now available as an industry benchmark.

该论文提出了RAIR基准数据集，用于电子商务搜索相关性评估，解决了现有基准缺乏复杂性的问题。RAIR包括三个子集：一个通用子集用于基本模型评估，一个长尾难题子集用于挑战性案例评估，以及一个视觉显著性子集用于评估多模态理解能力。对14个模型（包括GPT-5）的实验显示，RAIR提出了显著挑战，即使是高级模型也不例外。该数据集现已可用，作为行业基准并提供了对通用LLM和视觉语言模型评估的新见解。

Iterative Deployment Improves Planning Skills in LLMs

Authors: Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal

First: 2025-12-31T16:03:14+00:00 · Latest: 2025-12-31T16:03:14+00:00

Abstract

We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

中文标题/摘要

标题：迭代部署提升大语言模型的规划技能

我们展示了对大型语言模型（LLM）进行迭代部署，每个模型都在用户精心挑选的数据上进行微调，可以显著改变模型的性质。通过在各种规划领域进行测试，我们观察到规划技能有了显著提高，后期模型通过发现更长的规划方案展示了潜在的泛化能力。我们还提供了理论分析，表明迭代部署实际上在外循环中实现了强化学习（RL）训练（即，不是作为有意图的模型训练的一部分），并隐含了一个奖励函数。与RL的联系有两个重要含义：首先，对于AI安全领域而言，由于反复部署所隐含的奖励函数没有明确定义，可能会对未来的模型部署产生意想不到的影响。其次，这里突出的机制可以被视为一种替代的训练方案，依赖于数据的挑选而非明确的奖励。

Summary / 总结

The research aims to improve the planning skills of large language models (LLMs) through iterative deployment. Each model is fine-tuned on data curated by users from the previous model's deployment. This process leads to significant improvements in planning skills, with later models able to discover much longer plans than the initial models. Theoretical analysis shows that this iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop, with an implicit reward function. This has implications for AI safety and suggests an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

研究旨在通过迭代部署来提升大型语言模型（LLMs）的规划能力。每个模型都会在用户从上一模型部署中筛选的数据上进行微调。这一过程显著提升了规划技能，后期模型能够发现比初始模型更长的计划。理论分析表明，这种迭代部署实际上在外循环中实现了强化学习（RL）训练，具有隐含的奖励函数。这在AI安全领域具有重要意义，并且表明可以依赖数据筛选而非明确的奖励来实现RL训练的一种替代方案。
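
The link between curated redeployment and an implicit reward can be illustrated with a deliberately idealized sketch. It assumes a tabular model over three hypothetical output types, users who keep each output with a fixed probability, and fine-tuning that exactly fits the kept data in the infinite-sample limit; under those assumptions one deployment round multiplies each probability by its keep rate and renormalizes. None of the numbers or names (`deploy_and_refit`, `keep`) come from the paper.

```python
import numpy as np

def deploy_and_refit(p, keep_prob):
    """One round: sample from p, keep with keep_prob, refit by maximum likelihood.

    In the infinite-sample limit the refit model is p * keep_prob, renormalized.
    """
    weighted = p * keep_prob
    return weighted / weighted.sum()

# Hypothetical output types: short, medium, and long valid plans.
p = np.array([0.7, 0.2, 0.1])
keep = np.array([0.1, 0.5, 0.9])      # implicit reward: chance a user keeps each

history = [p]
for _ in range(5):
    p = deploy_and_refit(p, keep)
    history.append(p)

expected_reward = [float(h @ keep) for h in history]
print("final distribution:", np.round(p, 4))
print("expected keep rate per generation:", np.round(expected_reward, 4))
```

No reward is ever written down, yet the expected keep rate rises with every generation and mass shifts toward the outputs users retain most often, which is the sense in which curation acts like an outer-loop reward signal.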

PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects

Authors: Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng

First: 2025-12-28T15:52:58+00:00 · Latest: 2025-12-31T15:59:13+00:00

Abstract

Six-degree-of-freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance when objects move at high speed. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed motion scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed motion scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

Summary / 总结

PoseStreamer is a multi-modal framework for 6DoF pose estimation of unseen moving objects, addressing the limitations of standard RGB cameras in high-speed and low-light scenarios. It integrates an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter to improve temporal consistency, 3D center recall, and geometric refinement along camera rays, respectively. The authors also introduce MoCapCube6D, a multi-modal dataset for benchmarking under rapid motion. Experiments show that PoseStreamer outperforms existing methods in high-speed motion scenarios and demonstrates strong generalizability for unseen objects.

PoseStreamer 是一个用于未见过的移动物体的 6DoF 姿态估计的多模态框架，解决了标准 RGB 摄像头在高速和低光照场景下的局限性。它结合了自适应姿态记忆队列、对象中心的 2D 跟踪器和射线姿态滤波器。实验表明，PoseStreamer 在高速移动场景中表现出色，并且对于未见过的物体具有很强的通用性。

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim

First: 2025-02-20T18:01:41+00:00 · Latest: 2025-12-31T15:43:05+00:00

Comments: Accepted and to appear in IJCNLP-AACL 2025

Abstract

Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

中文标题/摘要

标题：ReVision：一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM

随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口，高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型（VLMs）支持多模态交互，通常依赖于基于云的处理，这引发了（1）视觉隐私问题，即传输敏感的视觉数据到服务器，以及（2）其有限的实时、设备端可用性问题。本文探讨了视觉指令重写这一新颖的方法，即将多模态指令转换为纯文本命令，允许轻量级设备端指令重写VLM（参数量250M）与现有对话AI系统的无缝集成，增强视觉数据隐私。为此，我们提供了一个涵盖14个领域的超过39,000个示例的数据集，并开发了一个紧凑的VLM，该模型在图像字幕数据集上进行预训练，并针对指令重写进行了微调。实验结果通过NLG指标（如BLEU、METEOR和ROUGE）以及语义解析分析评估，表明即使是最小量化版本的模型（存储占用<500MB）也能实现有效的指令重写，从而实现以隐私为中心的多模态AI应用。

Summary / 总结

This paper addresses the need for efficient and privacy-preserving multimodal interaction by introducing ReVision, a dataset and baseline vision-language model for visual instruction rewriting. The model transforms multimodal instructions into text-only commands, enhancing privacy and on-device usability. The dataset includes over 39,000 examples across 14 domains, and the model, pretrained on image captioning and fine-tuned for instruction rewriting, achieves effective instruction rewriting with a quantized version having less than 500MB storage footprint, as evidenced by NLG metrics and semantic parsing analysis.

本文通过引入ReVision数据集和视觉指令重写基线模型，解决高效且隐私保护的多模态交互需求。该模型将多模态指令转换为纯文本命令，增强隐私性和设备端使用性。数据集包含超过39,000个跨14个领域的示例，模型在图像字幕数据集上预训练，并针对指令重写进行微调，即使在小于500MB存储占用量的量化版本中也能实现有效的指令重写，通过NLG指标和语义解析分析进行评估。

HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition

Authors: Wang Lu, Yao Zhu, Jindong Wang

Venue: KDD 2026

First: 2025-12-11T16:52:50+00:00 · Latest: 2025-12-31T15:41:01+00:00

Comments: Accepted by KDD 2026

Abstract

Sensor-based human activity recognition (HAR) mines activity patterns from the time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights on the effectiveness of these algorithms. For instance, is OOD necessary to HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, with the hope to facilitate the research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.

中文标题/摘要

标题：HAROOD：基于传感器的人体活动识别离分布泛化基准

基于传感器的人体活动识别（HAR）从时间序列传感数据中挖掘活动模式。在现实场景中，个体、设备、环境和时间的变化对同一活动引入了显著的分布变化。最近的努力试图通过应用或适应现有的离分布（OOD）算法来解决这一挑战，但仅限于某些分布变化场景（例如跨设备或跨位置），缺乏对这些算法有效性的全面了解。例如，HAR是否需要离分布？哪种OOD算法表现最佳？在本文中，我们通过提出HAROOD，一个全面的HAR离分布基准来填补这一空白。我们定义了4种离分布场景：跨个体、跨位置、跨数据集和跨时间，并构建了一个涵盖6个数据集、16种比较方法（使用CNN和Transformer架构实现）和两种模型选择协议的测试平台。然后，我们进行了广泛的实验并提出了几个未来研究的发现，例如没有一种方法始终优于其他方法，突显了显著的进步机会。我们的代码库高度模块化，易于扩展以添加新数据集、算法、比较和分析，以促进基于离分布的HAR研究。我们的实现已发布并可在https://github.com/AIFrontierLab/HAROOD/找到。

Summary / 总结

The research aims to evaluate the effectiveness of out-of-distribution (OOD) algorithms in sensor-based human activity recognition (HAR) across various distribution shifts. The study introduces HAROOD, a comprehensive benchmark that defines four OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time. Through extensive experiments with 16 comparative methods, the study finds that no single method consistently outperforms others, indicating significant room for improvement in OOD-based HAR research.

研究旨在评估分布外（OOD）算法在基于传感器的人体活动识别（HAR）中的有效性，涵盖四种分布偏移场景：跨个体、跨位置、跨数据集和跨时间。该研究提出了HAROOD，一个全面的基准，评估了16种使用CNN和Transformer架构的比较方法，并指出没有一种方法在所有场景中都表现最佳，表明在基于OOD的HAR研究中存在巨大的改进空间。
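
To make the cross-person scenario concrete, here is a minimal sketch of a leave-one-person-out split, where every window recorded from the held-out person goes to the test set. The function name `cross_person_splits` and the toy subject IDs are hypothetical, and the benchmark's own protocol may group several people into one domain rather than holding out a single person.

```python
import numpy as np

def cross_person_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each person."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        # Training uses every window that does not belong to the held-out person.
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy example: five sensor windows recorded from three people.
subjects = [0, 0, 1, 1, 2]
splits = list(cross_person_splits(subjects))
for person, train_idx, test_idx in splits:
    print(f"held-out person {person}: train={train_idx}, test={test_idx}")
```

Splitting by person rather than by random window keeps each individual's data entirely on one side of the split, so the test score reflects generalization to an unseen person instead of interpolation between that person's neighboring windows.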

Are First-Order Diffusion Samplers Really Slower? A Fast Forward-Value Approach

Authors: Yuchen Jiao, Na Li, Changxiao Cai, Gen Li

First: 2025-12-31T15:35:53+00:00 · Latest: 2025-12-31T15:35:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Higher-order ODE solvers have become a standard tool for accelerating diffusion probabilistic model (DPM) sampling, motivating the widespread view that first-order methods are inherently slower and that increasing discretization order is the primary path to faster generation. This paper challenges this belief and revisits acceleration from a complementary angle: beyond solver order, the placement of DPM evaluations along the reverse-time dynamics can substantially affect sampling accuracy in the low-neural function evaluation (NFE) regime. We propose a novel training-free, first-order sampler whose leading discretization error has the opposite sign to that of DDIM. Algorithmically, the method approximates the forward-value evaluation via a cheap one-step lookahead predictor. We provide theoretical guarantees showing that the resulting sampler provably approximates the ideal forward-value trajectory while retaining first-order convergence. Empirically, across standard image generation benchmarks (CIFAR-10, ImageNet, FFHQ, and LSUN), the proposed sampler consistently improves sample quality under the same NFE budget and can be competitive with, and sometimes outperform, state-of-the-art higher-order samplers. Overall, the results suggest that the placement of DPM evaluations provides an additional and largely independent design angle for accelerating diffusion sampling.

中文标题/摘要

标题：一阶扩散采样器真的更慢吗？一种快速的前向值方法

高阶ODE求解器已成为加速扩散概率模型(DPM)采样的标准工具，这促使人们普遍认为一阶方法本质上更慢，并且提高离散化阶数是加快生成的主要途径。本文挑战了这一观点，并从互补的角度重新审视加速：除了求解器阶数外，DPM评估在反向时间动力学中的位置会在低神经网络评估次数(NFE)区间显著影响采样精度。我们提出了一种新的无需训练的一阶采样器，其主要离散化误差与DDIM相反。从算法上讲，该方法通过廉价的一步前瞻预测器近似前向值评估。我们提供了理论保证，表明该采样器可以证明地逼近理想的前向值轨迹，同时保持一阶收敛性。实验上，在标准图像生成基准(CIFAR-10、ImageNet、FFHQ和LSUN)上，所提出的采样器在相同的NFE预算下始终能提高样本质量，并且可以与最先进的高阶采样器竞争，有时甚至可以超越它们。总体而言，结果表明，DPM评估的位置提供了另一个独立的设计角度，可以加速扩散采样。

Summary / 总结

This paper challenges the belief that first-order diffusion samplers are inherently slower than higher-order methods. Instead, it focuses on the placement of DPM evaluations along the reverse-time dynamics to improve sampling accuracy. The authors propose a novel first-order sampler that approximates the forward-value evaluation via a cheap one-step lookahead predictor, providing theoretical guarantees for its accuracy and first-order convergence. Empirically, the sampler consistently improves sample quality across various benchmarks and can match or outperform state-of-the-art higher-order samplers under the same neural function evaluation budget.

本文挑战了一阶扩散采样器比高阶方法更慢的传统观点。提出了一种新型的一阶采样器，通过廉价的一步前瞻预测器近似前向值评估，并提供了理论保证，证明该采样器可以近似理想的前向值轨迹并保持一阶收敛性。实验结果显示，该采样器在各种基准测试中以相同的神经函数评估预算提高了样本质量，并且在某些情况下可以超越最先进的高阶采样器。
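
The opposite-sign claim in the abstract can be illustrated on a toy problem. The sketch below is not the paper's sampler and does not implement DDIM: it assumes a scalar linear ODE dx/dt = lam * x, uses explicit Euler as a stand-in for a first-order scheme that evaluates the dynamics at the current point, and approximates a forward-value (end-of-step) evaluation with a one-step lookahead. All function names are hypothetical. In this toy version the lookahead costs a second function evaluation per step; the abstract does not say how the paper keeps its predictor cheap.

```python
import math


def exact_step(x, h, lam):
    """Exact solution of dx/dt = lam * x after a step of size h."""
    return x * math.exp(h * lam)


def explicit_euler_step(x, h, lam):
    """First-order step that evaluates the dynamics at the current point."""
    return x + h * lam * x


def lookahead_step(x, h, lam):
    """First-order step that evaluates the dynamics at a predicted end point.

    The end point is approximated by a one-step lookahead (an Euler predictor),
    so the update mimics a forward-value evaluation without solving an
    implicit equation.
    """
    x_pred = x + h * lam * x       # cheap lookahead predictor
    return x + h * lam * x_pred    # dynamics evaluated at the predicted point


def one_step_errors(x=1.0, h=0.1, lam=-1.0):
    """Return (explicit error, lookahead error) relative to the exact step."""
    exact = exact_step(x, h, lam)
    return (explicit_euler_step(x, h, lam) - exact,
            lookahead_step(x, h, lam) - exact)
```

For this linear ODE the explicit step multiplies x by 1 + h·lam and the lookahead step by 1 + h·lam + (h·lam)², while the exact factor is 1 + h·lam + (h·lam)²/2 + …, so the leading errors are −(h·lam)²/2 and +(h·lam)²/2 respectively: equal in magnitude and opposite in sign.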

Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

Authors: Bartłomiej Olber, Jakub Winter, Paweł Wawrzyński, Andrii Gamalii, Daniel Górniak, Marcin Łojek, Robert Nowak, Krystian Radlak

First: 2025-12-31T15:26:09+00:00 · Latest: 2025-12-31T15:26:09+00:00

Abs · PDF · Code1 · Code2

Abstract

3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains: for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain, provided they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.

中文标题/摘要

标题：半监督多样性感知领域适应性调整用于3D物体检测

3D物体检测器是自动驾驶车辆感知系统中的基本组件。尽管这些检测器在标准的自动驾驶基准测试中表现出色，但在不同领域中却难以泛化——例如，一个在美国训练的模型可能在亚洲或欧洲等地区表现不佳。本文提出了一种基于神经元激活模式的激光雷达领域适应方法，表明通过正确选择一小部分具有代表性和多样性的目标域样本进行标注，可以实现最先进的性能。所提出的方法需要非常小的标注预算，并且当与受持续学习启发的后训练技术结合使用时，可以防止权重从原始模型中漂移。实证评估表明，所提出的领域适应方法优于线性探测和最先进的领域适应技术。

Summary / 总结

This paper addresses the challenge of 3D object detectors failing to generalize across different domains. It introduces a semi-supervised diversity-aware domain adaptation method that uses neuron activation patterns to select a small, representative, and diverse subset of samples from the target domain for annotation. The method requires minimal annotation effort and incorporates post-training techniques to prevent weight drift. Experimental results show that this approach outperforms both linear probing and existing domain adaptation techniques.

该论文解决了3D物体检测器在不同领域间难以泛化的挑战。它提出了一种基于神经元激活模式的新型半监督多样性感知领域适应方法，通过仅标注目标域中少量的代表性且多样化的样本，实现了最先进的性能，同时减少了标注工作量。该方法还结合了防止模型权重漂移的后训练技术。实验结果表明，该方法优于线性探测和现有的领域适应技术。
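
One way to picture the selection of a small, diverse subset for annotation is greedy farthest-point selection over binarized neuron activation patterns. The abstract does not describe the actual selection rule, so everything below is an assumption: the binary encoding of activations, the Hamming distance, the greedy strategy, and the names `hamming` and `select_diverse` are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def select_diverse(patterns, budget):
    """Greedily pick `budget` sample indices whose patterns are far apart.

    Starts from sample 0, then repeatedly adds the sample whose minimum
    Hamming distance to the already selected samples is largest.
    Ties are resolved in favor of the lowest index.
    """
    if not patterns or budget <= 0:
        return []
    selected = [0]
    while len(selected) < min(budget, len(patterns)):
        candidates = [i for i in range(len(patterns)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: min(hamming(patterns[i], patterns[j]) for j in selected),
        )
        selected.append(best)
    return selected


# Hypothetical activation patterns of four unlabeled target-domain samples.
patterns = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
chosen = select_diverse(patterns, budget=3)
```

With these four patterns the near-duplicate sample 1 is the one left out, which is the behavior a diversity-aware annotation budget is meant to produce.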

Frequent subgraph-based persistent homology for graph classification

Authors: Xinyang Chen, Amaël Broustet, Guoting Chen

First: 2025-12-31T15:21:15+00:00 · Latest: 2025-12-31T15:21:15+00:00

Comments: Preprint. 18 pages, 10 figures

Abs · PDF · Code1 · Code2

Abstract

Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.

中文标题/摘要

标题：基于频繁子图的持久同调图分类

持久同调（PH）最近已成为提取拓扑特征的强大工具。将PH集成到机器学习和深度学习模型中增强了拓扑感知和可解释性。然而，大多数图上的PH方法依赖于有限的过滤集合，如基于度或基于权重的过滤，这忽略了更丰富的特征（例如数据集中反复出现的信息），从而限制了表达能力。在本文中，我们提出了一种新的图过滤，称为频繁子图过滤（FSF），它源自频繁子图并产生稳定且信息丰富的基于频率的持久同调（FPH）特征。我们研究了FSF的理论性质并提供了证明和实验验证。除了持久同调本身，我们还介绍了两种图分类方法：基于FPH的机器学习模型（FPH-ML）和将FPH与图神经网络结合的混合框架（FPH-GNNs），以增强拓扑感知的图表示学习。我们的框架将频繁子图挖掘与拓扑数据分析相结合，提供了拓扑感知特征提取的新视角。实验结果表明，与基于核和基于度的过滤方法相比，FPH-ML实现了具有竞争力或更优的准确性。当集成到图神经网络中时，FPH在基准测试中带来了0.4%到21%的相对性能提升，并在GCN和GIN骨干网络上提高了高达8.2个百分点。

Summary / 总结

This paper introduces Frequent Subgraph Filtration (FSF) for graph classification, which derives persistent homology features from frequent subgraphs, providing richer and more stable features than traditional degree or weight-based filtrations. The authors propose FPH-ML and FPH-GNNs, which integrate these features into machine learning and graph neural networks, respectively. Experimental results show that FPH-ML achieves competitive or superior accuracy compared to kernel-based and degree-based methods, and FPH-GNNs yield up to 8.2 percentage point improvements over GCN and GIN backbones.

该研究提出了用于图分类的频繁子图过滤（FSF）方法，生成稳定且丰富的基于频率的持久同调（FPH）特征，增强拓扑感知和可解释性。实验结果表明，FPH-ML在与核基和度基过滤方法相比时，具有竞争力或更优的准确性。将FPH与图神经网络（FPH-GNNs）结合使用，性能提升高达8.2个百分点，超过GCN和GIN基线模型。
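
To make the role of a graph filtration concrete, the sketch below computes 0-dimensional persistence pairs for a graph under a generic vertex-valued sublevel filtration, using union-find and the elder rule. It is not FSF: the abstract does not specify how filtration values are derived from frequent subgraphs, so the vertex values here are arbitrary placeholders, and the function name `zero_dim_persistence` is hypothetical.

```python
import math


def zero_dim_persistence(vertex_values, edges):
    """0-dimensional persistence pairs of a graph under a vertex filtration.

    A vertex enters at its own value; an edge enters at the larger value of
    its two endpoints. When an edge merges two components, the younger one
    (larger birth value) dies at the edge's value (elder rule). Pairs with
    zero persistence are dropped; surviving components get death = inf.
    """
    parent = {v: v for v in vertex_values}
    birth = dict(vertex_values)  # birth value stored at each component root

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def edge_value(edge):
        return max(vertex_values[edge[0]], vertex_values[edge[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_value):
        t = edge_value((u, v))
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # edge closes a cycle; no change in components
        if birth[root_u] > birth[root_v]:
            root_u, root_v = root_v, root_u  # root_u is now the elder
        if birth[root_v] < t:
            pairs.append((birth[root_v], t))
        parent[root_v] = root_u

    roots = {find(v) for v in vertex_values}
    pairs.extend((birth[r], math.inf) for r in roots)
    return sorted(pairs)


# Path graph a - b - c with placeholder filtration values.
diagram = zero_dim_persistence({"a": 0, "b": 2, "c": 1}, [("a", "b"), ("b", "c")])
```

Here the component born at `c` (value 1) merges into the older component born at `a` when vertex `b` and its edges appear at value 2, giving the finite pair (1, 2) and one essential class (0, inf). Swapping in different vertex values changes the diagram, which is why the choice of filtration determines what the topological features capture.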

AI-Driven Cloud Resource Optimization for Multi-Cluster Environments

Authors: Vinoth Punniyamoorthy, Akash Kumar Agarwal, Bikesh Kumar, Abhirup Mazumder, Kabilan Kannan, Sumit Saha

First: 2025-12-31T15:15:46+00:00 · Latest: 2025-12-31T15:15:46+00:00

Abs · PDF · Code1 · Code2

Abstract

Modern cloud-native systems increasingly rely on multi-cluster deployments to support scalability, resilience, and geographic distribution. However, existing resource management approaches remain largely reactive and cluster-centric, limiting their ability to optimize system-wide behavior under dynamic workloads. These limitations result in inefficient resource utilization, delayed adaptation, and increased operational overhead across distributed environments. This paper presents an AI-driven framework for adaptive resource optimization in multi-cluster cloud systems. The proposed approach integrates predictive learning, policy-aware decision-making, and continuous feedback to enable proactive and coordinated resource management across clusters. By analyzing cross-cluster telemetry and historical execution patterns, the framework dynamically adjusts resource allocation to balance performance, cost, and reliability objectives. A prototype implementation demonstrates improved resource efficiency, faster stabilization during workload fluctuations, and reduced performance variability compared to conventional reactive approaches. The results highlight the effectiveness of intelligent, self-adaptive infrastructure management as a key enabler for scalable and resilient cloud platforms.

中文标题/摘要

标题：多集群环境中的AI驱动云资源优化

现代云原生系统越来越多地依赖多集群部署以支持可扩展性、弹性和地理分布。然而，现有的资源管理方法仍然主要具有反应性和集群中心化的特点，限制了它们在动态工作负载下优化系统级行为的能力。这些限制导致资源利用效率低下、适应延迟和分布式环境中的操作开销增加。本文提出了一种用于多集群云系统自适应资源优化的AI驱动框架。所提出的方法结合了预测学习、策略感知决策和持续反馈，以实现跨集群的主动和协调资源管理。通过分析跨集群遥测数据和历史执行模式，该框架动态调整资源分配以平衡性能、成本和可靠性目标。原型实现表明，与传统的反应性方法相比，该方法在资源效率、工作负载波动期间更快的稳定性和降低的性能变异性方面具有优势。结果突显了智能、自适应基础设施管理作为可扩展和弹性云平台的关键使能器的有效性。

Summary / 总结

This paper addresses the inefficiencies in resource management for multi-cluster cloud systems by proposing an AI-driven framework. The framework uses predictive learning and policy-aware decision-making to optimize resource allocation across clusters, balancing performance, cost, and reliability. Experimental results show improved resource efficiency, faster stabilization during workload fluctuations, and reduced performance variability compared to traditional reactive methods.

本文提出了一种基于AI的框架来解决多集群云系统中的资源管理效率问题。该框架利用预测学习、策略感知决策和持续反馈来跨集群优化资源分配。实验结果表明，与传统的反应式方法相比，该框架能够提高资源效率、加快工作负载波动期间的稳定速度，并减少性能波动。

FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

Authors: Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao

Venue: AAAI

First: 2025-12-31T15:00:03+00:00 · Latest: 2025-12-31T15:00:03+00:00

Comments: Accepted by AAAI-26 Main Track

Abs · PDF · Code1 · Code2

Abstract

We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

中文标题/摘要

标题：FinMMDocR：基于情景意识、文档理解和多步计算的金融多模态推理基准

我们介绍了FinMMDocR，这是一个新的双语多模态基准，用于评估多模态大语言模型（MLLMs）在现实世界金融数值推理中的表现。与现有基准相比，我们的工作带来了三项重大进步。(1) 情景意识：1200个专家标注的问题中有57.9%包含12种隐含的金融情景（例如，投资组合管理），挑战模型进行基于假设的专家级推理；(2) 文档理解：837份中英文文档涵盖9种类型（例如，公司研究报告），平均每份50.8页，包含丰富的视觉元素，显著超越现有基准在金融文档的广度和深度上；(3) 多步计算：问题平均需要11步推理（5.3步提取+5.7步计算步骤），其中65.0%需要跨页证据（平均2.4页）。表现最好的MLLM仅达到58.0%的准确率，不同的检索增强生成（RAG）方法在该任务上表现出显著的性能差异。我们期望FinMMDocR能够推动MLLMs和增强推理方法在现实世界复杂多模态推理任务中的改进。

Summary / 总结

FinMMDocR is a new bilingual multimodal benchmark for evaluating MLLMs on financial numerical reasoning. It emphasizes scenario awareness, document understanding, and multi-step computation, challenging models with 12 types of implicit financial scenarios, 837 long, visually rich documents, and problems that require 11 reasoning steps on average. The best MLLM achieves only 58.0% accuracy, highlighting significant room for improvement in MLLMs for complex multimodal reasoning tasks.

FinMMDocR 是一个新的双语多模态基准，用于评估 MLLMs 在实际金融推理中的表现，包含 1,200 个专家标注的问题，其中 57.9% 涉及 12 种类型的金融场景，837 份文档涵盖 9 种类型且包含丰富的视觉元素，平均需要 11 步推理。最佳 MLLM 的准确率仅为 58.0%，突显了在复杂任务中改进 MLLMs 和推理增强方法的需求。

PRISM: A hierarchical multiscale approach for time series forecasting

Authors: Zihao Chen, Alexandre Andre, Wenrui Ma, Ian Knight, Sergey Shuvaev, Eva Dyer

First: 2025-12-31T14:51:12+00:00 · Latest: 2025-12-31T14:51:12+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Forecasting is critical in areas such as finance, biology, and healthcare. Despite the progress in the field, making accurate forecasts remains challenging because real-world time series contain global trends, local fine-grained structure, and features on multiple scales in between. Here, we present a new forecasting method, PRISM (Partitioned Representation for Iterative Sequence Modeling), that addresses this challenge through a learnable tree-based partitioning of the signal. At the root of the tree, a global representation captures coarse trends in the signal, while recursive splits reveal increasingly localized views of the signal. At each level of the tree, data are projected onto a time-frequency basis (e.g., wavelets or exponential moving averages) to extract scale-specific features, which are then aggregated across the hierarchy. This design allows the model to jointly capture global structure and local dynamics of the signal, enabling accurate forecasting. Experiments across benchmark datasets show that our method outperforms state-of-the-art methods for forecasting. Overall, these results demonstrate that our hierarchical approach provides a lightweight and flexible framework for forecasting multivariate time series. The code is available at https://github.com/nerdslab/prism.

中文标题/摘要

标题：PRISM：一种分层多尺度方法用于时间序列预测

预测在金融、生物学和医疗保健等领域至关重要。尽管该领域取得了进展，但准确预测仍然具有挑战性，因为实际世界的时间序列包含全局趋势、局部精细结构以及介于两者之间的多种尺度特征。为此，我们提出了一种新的预测方法——PRISM（Partitioned Representation for Iterative Sequence Modeling），该方法通过可学习的树状分割信号来应对这一挑战。树的根节点捕获信号中的粗略趋势，而递归分割则揭示信号的越来越局部化的视图。在树的每一层，数据被投影到时间-频率基底（例如小波或指数移动平均）上，以提取特定尺度的特征，然后在层次结构中进行聚合。这种设计使模型能够同时捕捉信号的全局结构和局部动态，从而实现准确的预测。在基准数据集上的实验表明，我们的方法优于最先进的预测方法。总体而言，这些结果表明，我们的分层方法为预测多元时间序列提供了一个轻量级且灵活的框架。代码可在https://github.com/nerdslab/prism/ 获取。

Summary / 总结

The research addresses the challenge of accurately forecasting time series data, which are crucial in finance, biology, and healthcare. PRISM, a hierarchical multiscale approach, partitions the signal using a learnable tree structure to capture both global trends and local dynamics. Experiments show that PRISM outperforms existing methods on benchmark datasets, demonstrating its effectiveness in forecasting multivariate time series.

PRISM 是一种分层多尺度方法，用于时间序列预测，旨在捕捉全局趋势和局部细粒度结构。它通过可学习的树状分割来揭示信号的越来越局部化的视图，每一层树都提取特定尺度的特征。实验表明，PRISM 在基准数据集上的表现优于最先进的方法，证明了其在预测多变量时间序列方面的有效性。
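
The tree-based, multiscale idea described in the abstract can be sketched in a few lines. This is a simplified illustration, not the PRISM implementation: it assumes fixed midpoint splits where the paper uses a learnable partitioning, uses the final value of an exponential moving average as the only per-node feature, and aggregates by plain concatenation. The names `ema` and `tree_features` are hypothetical.

```python
def ema(values, alpha=0.5):
    """Final value of an exponential moving average over a non-empty sequence."""
    state = float(values[0])
    for x in values[1:]:
        state = alpha * x + (1 - alpha) * state
    return state


def tree_features(signal, depth):
    """Collect one EMA feature per node of a binary partition tree.

    The root summarizes the whole signal (coarse trend); each recursive
    split summarizes a more localized segment. Features are concatenated
    in pre-order: node, then left subtree, then right subtree.
    """
    features = [ema(signal)]
    if depth > 0 and len(signal) >= 2:
        mid = len(signal) // 2
        features += tree_features(signal[:mid], depth - 1)
        features += tree_features(signal[mid:], depth - 1)
    return features


# A step signal: the root sees a blend, the two leaves see each plateau.
features = tree_features([0, 0, 4, 4], depth=1)
```

For the step signal the root feature is 3.0 while the two children report 0.0 and 4.0, showing how the hierarchy keeps both the coarse summary and the local structure that a single global feature would blur.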

Large Multimodal Models for Low-Resource Languages: A Survey

Authors: Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

First: 2025-02-08T13:29:44+00:00 · Latest: 2025-12-31T14:45:06+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

中文标题/摘要

标题：大规模多模态模型在低资源语言中的应用：综述

在本文综述中，我们系统地分析了用于适应低资源语言（LR）的大规模多模态模型（LMMs）的技术，涵盖了从视觉增强和数据生成到跨模态转移和融合策略的方法。通过对96种LR语言的117项研究进行全面分析，我们识别出研究人员在应对数据和计算资源有限的挑战时的关键模式。我们将工作分为资源导向和方法导向两类，并进一步细分为相关子类别。我们按性能和效率比较了方法导向类别的贡献，讨论了代表性研究的优点和局限性。我们发现，视觉信息通常在LR环境中作为提高模型性能的关键桥梁，尽管在幻觉抑制和计算效率等方面仍存在重大挑战。总之，我们为研究人员提供了对当前方法和剩余挑战的清晰理解，使LMMs更易于被LR（未充分研究）语言的使用者使用。我们还通过以下链接提供了开源仓库：https://github.com/marianlupascu/LMM4LRL-Survey。

Summary / 总结

This survey analyzes techniques for adapting large multimodal models to low-resource languages, examining visual enhancement, data creation, cross-modal transfer, and fusion strategies. Based on a review of 117 studies across 96 low-resource languages, the authors identify key patterns in addressing data and resource limitations. They find that visual information is crucial for improving model performance but highlight ongoing challenges such as hallucination mitigation and computational efficiency. The survey provides a comprehensive understanding of current approaches and remaining challenges in making large multimodal models accessible to speakers of low-resource languages.

该综述分析了将大型多模态模型应用于低资源语言的技术，研究了117项针对96种语言的研究。它识别了在应对数据和资源限制方面的关键模式，将贡献分为资源导向和方法导向两类。综述发现，视觉信息对于提高模型性能至关重要，但在幻觉抑制和计算效率等方面仍存在挑战。它为研究人员提供了当前方法和挑战的见解，以使多模态模型更适用于低资源语言的使用者。

Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing

Authors: Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, Jakub Winter

First: 2025-12-31T14:43:48+00:00 · Latest: 2025-12-31T14:43:48+00:00

Abs · PDF · Code1 · Code2

Abstract

This report presents the design and implementation of a semi-automated data annotation pipeline developed within the DARTS project, whose goal is to create a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. Manual annotation of such heterogeneous data is both costly and time-consuming. To address this challenge, the proposed solution adopts a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce annotation cost and duration. The system automatically generates initial annotations, enables iterative model retraining, and incorporates data anonymization and domain adaptation techniques. At its core, the tool relies on 3D object detection algorithms to produce preliminary annotations. Overall, the developed tools and methodology result in substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities. The solution directly supports the DARTS project by accelerating the preparation of a large annotated dataset in the project's standardized format, strengthening the technological base for autonomous vehicle research in Poland.

中文标题/摘要

标题：多传感器数据集在自动驾驶车辆测试中的半自动化数据标注

本报告介绍了在DARTS项目中设计和实现的一种半自动化数据标注流水线，其目标是在波兰条件下创建一个大规模的多模态驾驶场景数据集。手动标注这种异构数据既昂贵又耗时。为应对这一挑战，所提出的解决方案采用了一种人工在环的方法，结合人工智能与人类专业知识，以降低标注成本和时间。系统自动生成初始标注，支持迭代模型重新训练，并采用数据匿名化和领域适应技术。该工具的核心依赖于3D物体检测算法以生成初步标注。总体而言，开发的工具和方法在确保不同传感器模态之间一致性和高质量标注的同时，节省了大量时间。该解决方案直接支持DARTS项目，通过加速准备符合项目标准化格式的大规模标注数据集，加强了波兰自动驾驶车辆研究的技术基础。

Summary / 总结

The research aims to address the high cost and time-consuming nature of manually annotating large-scale, multimodal driving scenario datasets for autonomous vehicle testing. It introduces a semi-automated data annotation pipeline that leverages 3D object detection algorithms to generate initial annotations, allowing for iterative model retraining and data anonymization. Key findings include substantial time savings and consistent, high-quality annotations across different sensor modalities, which significantly support the DARTS project's goals.

研究旨在解决手动标注异构数据在自动驾驶车辆测试中的高成本和耗时问题。提出了一种半自动数据标注流水线，结合了AI和人类专业知识。该系统使用3D物体检测算法生成初步标注，允许模型迭代训练，并包括数据匿名化和领域适应技术。关键发现表明，这带来了显著的时间节省，并确保了不同传感器模态下的一致性和高质量标注，支持DARTS项目准备大规模标注数据集的目标。
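
A human-in-the-loop annotation step of the kind described above can be pictured as a triage over detector outputs. The report's abstract does not state how preliminary annotations are routed to annotators, so the confidence threshold, the `score` field, and the function name `triage` below are all hypothetical.

```python
def triage(detections, accept_threshold=0.8):
    """Split preliminary detections into auto-accepted and human-review sets.

    Each detection is a dict with at least a "score" key holding the
    detector's confidence. Detections at or above the threshold are kept
    as pre-annotations; the rest are queued for a human annotator.
    """
    auto_accepted, needs_review = [], []
    for det in detections:
        if det["score"] >= accept_threshold:
            auto_accepted.append(det)
        else:
            needs_review.append(det)
    return auto_accepted, needs_review


# Hypothetical 3D detector outputs for one lidar frame.
frame = [
    {"label": "car", "score": 0.95},
    {"label": "pedestrian", "score": 0.55},
    {"label": "cyclist", "score": 0.80},
]
accepted, review_queue = triage(frame)
```

In an iterative setup, the corrected items from the review queue would be added to the training set before the detector is retrained, so fewer detections should fall below the threshold in later rounds.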