SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Authors: Zhening Huang, Hyeonho Jeong, Xuelin Chen, Yulia Gryaditskaya, Tuanfeng Y. Wang, Joan Lasenby, Chun-Hao Huang
First: 2025-12-31T18:59:57+00:00 · Latest: 2025-12-31T18:59:57+00:00
Comments: Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
Abstract
We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: https://zheninghuang.github.io/Space-Time-Pilot/ Code: https://github.com/ZheningHuang/spacetimepilot
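The temporal-warping idea in the abstract can be illustrated by re-indexing the frames of one clip so that the output plays a different motion sequence than the source. This is a minimal sketch, not the authors' scheme: the function names and the warp modes (`reverse`, `slow`, `freeze`) are assumptions chosen for illustration, since the abstract does not list which warps are used.

```python
def warp_time_indices(num_frames, mode, factor=2):
    """Return, for each output frame, the index of the source frame to show.

    The modes are hypothetical examples of temporal differences.
    """
    indices = list(range(num_frames))
    if mode == "identity":
        return indices
    if mode == "reverse":
        return indices[::-1]
    if mode == "slow":
        # Each source frame is held for `factor` output frames.
        return [i // factor for i in indices]
    if mode == "freeze":
        # Every output frame shows the middle source frame.
        return [num_frames // 2] * num_frames
    raise ValueError(f"unknown warp mode: {mode}")


def apply_warp(frames, indices):
    """Build the warped clip by looking up source frames."""
    return [frames[i] for i in indices]


if __name__ == "__main__":
    clip = ["f0", "f1", "f2", "f3", "f4", "f5"]
    for mode in ("identity", "reverse", "slow", "freeze"):
        print(mode, apply_warp(clip, warp_time_indices(len(clip), mode)))
```

In a training setup of this kind, the index list could also serve as the time-conditioning signal, so that the model sees both the warped target and the mapping that produced it.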
中文标题/摘要
标题:SpaceTimePilot:时空分离的动态场景生成渲染
我们提出了SpaceTimePilot,一种视频扩散模型,能够分离空间和时间以实现可控的生成渲染。给定单目视频,SpaceTimePilot可以在生成过程中独立改变摄像机视角和运动序列,重新渲染场景,实现空间和时间上的连续和任意探索。为此,我们在扩散过程中引入了有效的动画时间嵌入机制,允许对输出视频的运动序列进行显式控制,相对于源视频。由于现有数据集中没有提供同一动态场景的配对视频以连续的时间变化,我们提出了一种简单而有效的时空扭曲训练方案,利用现有的多视角数据集模拟时间差异。该策略有效地监督模型学习时间控制并实现稳健的时空分离。为了进一步提高双重控制的精度,我们引入了两个额外组件:改进的摄像机条件机制,允许从第一帧开始改变摄像机,以及CamxTime,第一个合成时空全覆盖渲染数据集,提供了场景内的完全自由时空视频轨迹。在时空扭曲方案和CamxTime数据集上的联合训练产生了更精确的时间控制。我们在真实世界和合成数据上评估了SpaceTimePilot,展示了清晰的时空分离和与先前工作相比的强劲结果。
Summary / 总结
SpaceTimePilot is a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, it can independently alter the camera viewpoint and motion sequence, enabling continuous exploration across space and time. The model uses an animation time-embedding mechanism and a temporal-warping training scheme to achieve robust space-time disentanglement. Two additional components, an improved camera-conditioning mechanism and the synthetic CamxTime dataset, further improve the precision of joint camera and time control. Evaluations on real-world and synthetic data show clear space-time disentanglement and strong performance compared to previous methods.
SpaceTimePilot 是一种视频扩散模型,能够解耦空间和时间,实现可控的生成渲染。给定单目视频,它可以独立改变摄像机视角和运动序列,对场景进行连续的空间和时间探索。模型使用了动画时间嵌入机制和时间扭曲训练方案,以实现稳健的空间-时间解耦。此外,还引入了改进的摄像机条件机制和 CamxTime 数据集,进一步增强了控制能力。实验结果显示了清晰的空间-时间解耦和与先前方法相比的强劲性能。
GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction
Authors: Yi-Chuan Huang, Hao-Jen Chien, Chin-Yang Lin, Ying-Huan Chen, Yu-Lun Liu
First: 2025-12-31T18:59:55+00:00 · Latest: 2025-12-31T18:59:55+00:00
Comments: Project page: https://yichuanh.github.io/GaMO/
Abstract
Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: https://yichuanh.github.io/GaMO/
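Expanding the field of view from an existing camera pose, as the abstract describes, has a simple pinhole-camera reading: the canvas grows while the focal length and pose stay fixed, so only the image size and principal point change. The sketch below shows that bookkeeping under standard pinhole assumptions. The function names are hypothetical, and nothing here reflects GaMO's actual denoising pipeline.

```python
import math


def horizontal_fov_deg(width, fx):
    """Horizontal field of view of a pinhole camera with a centred principal point."""
    return math.degrees(2 * math.atan(width / (2 * fx)))


def outpaint_intrinsics(width, height, fx, fy, cx, cy, pad_x, pad_y):
    """Intrinsics of the enlarged canvas after padding each side.

    The focal lengths are unchanged because the camera itself does not move
    or zoom; the principal point shifts by the padding on the left and top.
    """
    return {
        "width": width + 2 * pad_x,
        "height": height + 2 * pad_y,
        "fx": fx,
        "fy": fy,
        "cx": cx + pad_x,
        "cy": cy + pad_y,
    }


if __name__ == "__main__":
    new = outpaint_intrinsics(100, 80, 50.0, 50.0, 50.0, 40.0, pad_x=50, pad_y=20)
    print(horizontal_fov_deg(100, 50.0), horizontal_fov_deg(new["width"], new["fx"]))
```

Because the original pixels keep their ray directions in the enlarged canvas, the known region stays consistent with the input view by construction.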
中文标题/摘要
标题:GaMO:基于几何的多视图扩散外延法用于稀疏视图三维重建
近年来,三维重建取得了显著进展,能够在密集多视角图像中实现高质量场景捕获,但在输入视角有限时却面临挑战。各种方法,包括正则化技术、语义先验和几何约束,已被实施以应对这一挑战。最新的基于扩散的方法通过生成新的视角来增强训练数据,从而在生成新相机姿态的新视图方面取得了显著改进,超越了早期的正则化和基于先验的方法。尽管取得了这些进展,我们仍发现这些最先进的方法存在三个关键局限性:超出已知视图边缘的覆盖不足、生成视图之间的几何不一致以及计算成本高昂的管道。我们提出了GaMO(几何感知多视图外延器),一种通过多视图外延重新构建稀疏视图的框架。与生成新视角不同,GaMO 从现有相机姿态扩展视场,这本身就能保持几何一致性并提供更广泛的场景覆盖。我们的方法以零样本方式采用多视图条件和几何感知去噪策略,无需训练。在Replica和ScanNet++上的广泛实验表明,GaMO 在3、6和9个输入视图下的重建质量达到最先进的水平,在PSNR和LPIPS方面优于先前方法,同时比最先进的基于扩散的方法快25倍,处理时间不到10分钟。项目页面:https://yichuanh.github.io/GaMO/
Summary / 总结
The research aims to address the limitations of sparse-view 3D reconstruction, particularly the inadequacy of coverage and geometric inconsistencies in existing methods. GaMO, a geometry-aware multi-view outpainting framework, is introduced to expand the field of view from existing camera poses, ensuring geometric consistency and broader scene coverage. The method achieves state-of-the-art reconstruction quality with a significant speedup over previous diffusion-based approaches, outperforming prior methods in PSNR and LPIPS across 3, 6, and 9 input views, while processing times are under 10 minutes.
研究旨在通过提出GaMO框架解决稀疏视角3D重建的局限性,该框架通过多视角出画方式从现有摄像机姿态扩展视野。这种方法避免生成新视角,从而保持几何一致性并提供更广泛的场景覆盖。GaMO在3、6和9个输入视图下均优于先前方法,在PSNR和LPIPS方面表现出色,处理速度比最先进的扩散基方法快25倍,处理时间少于10分钟。
Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Authors: Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang
First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00
Comments: Project page: https://edit3r.github.io/edit3r/
Abstract
We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
中文标题/摘要
标题:Edit3r:从稀疏未对齐图像即时编辑3D场景
我们提出了Edit3r,这是一种单次通过框架,可以从未对齐、视角不一致、指令编辑过的图像中重建和编辑3D场景。与需要逐场景优化的先前方法不同,Edit3r直接预测指令对齐的3D编辑,从而实现快速且逼真的渲染,无需优化或姿态估计。训练此类模型的关键挑战在于缺乏多视角一致的编辑图像作为监督。我们通过(i)基于SAM2的重新着色策略生成可靠的、跨视角一致的监督,以及(ii)不对称输入策略,将重新着色的参考视图与原始辅助视图配对,鼓励网络融合和对齐不同的观察结果来解决这一问题。在推理时,我们的模型能够有效处理由2D方法(如InstructPix2Pix)编辑的图像,尽管在训练过程中并未接触到此类编辑。为了进行大规模的定量评估,我们引入了DL3DV-Edit-Bench基准,该基准基于DL3DV测试分割,包含20个多样化的场景、4种编辑类型和总共100次编辑。全面的定量和定性结果表明,Edit3r在语义对齐和3D一致性方面优于最近的基线方法,同时具有显著更高的推理速度,使其在实时3D编辑应用中具有前景。
Summary / 总结
Edit3r is a feed-forward framework that reconstructs and edits 3D scenes from unposed, view-inconsistent images in a single pass. It directly predicts instruction-aligned 3D edits without requiring per-scene optimization, enabling fast and photorealistic rendering. Key to training this model is addressing the lack of multi-view consistent edited images for supervision, achieved through a SAM2-based recoloring strategy and an asymmetric input strategy. Edit3r effectively handles edits made by 2D methods like InstructPix2Pix and outperforms recent baselines in terms of semantic alignment and 3D consistency, while operating at higher inference speed, making it suitable for real-time 3D editing applications.
Edit3r 是一个无需优化或姿态估计即可从不一致视角的未对齐图像中重建和编辑 3D 场景的前馈框架。它使用 SAM2 基础的重新着色策略生成可靠的监督,并使用不对称输入策略鼓励网络融合和对齐不同的观测。该模型能够处理 2D 编辑且运行速度快,实现了比最近基线更好的语义对齐和 3D 一致性。全面的基准测试表明,它适用于实时 3D 编辑应用。
Scaling Open-Ended Reasoning to Predict the Future
Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
First: 2025-12-31T18:59:51+00:00 · Latest: 2025-12-31T18:59:51+00:00
Comments: 45 pages
Abstract
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing on questions from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find that calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
中文标题/摘要
标题:将开放性推理扩展以预测未来
高风险决策涉及对未来不确定性的推理。在本研究中,我们训练语言模型对开放性预测问题进行预测。为了扩大训练数据,我们从每日新闻中报道的全球事件中自动合成新的预测问题,采用完全自动化的精心编纂配方。我们在OpenForesight数据集上训练Qwen3思考模型。为了防止训练和评估期间出现未来信息泄露,我们在数据生成和检索中使用离线新闻语料库。在一小部分验证集的指导下,我们展示了检索的好处以及强化学习(RL)中改进的奖励函数。一旦我们获得最终的预测系统,我们将在2025年5月至8月之间进行保留测试。我们的专门模型OpenForecaster 8B与更大的专有模型相当,我们的训练提高了预测的准确性、校准性和一致性。我们发现预测训练带来的校准改进在流行基准上具有普遍性。我们开源了所有模型、代码和数据,以使语言模型预测研究广泛可及。
Summary / 总结
This work aims to enhance language models for open-ended forecasting by synthesizing questions from daily news. The Qwen3 models are trained on the OpenForesight dataset, using an offline news corpus to avoid future information leakage. The study shows that retrieval and an improved reward function enhance the model's performance, leading to better accuracy, calibration, and consistency in predictions. The OpenForecaster 8B model matches larger proprietary models in these aspects, and the research materials are open-sourced for broader access to forecasting research.
该研究旨在增强语言模型在开放性预测问题上的能力,特别是在高风险决策场景中的应用。方法是通过从每日新闻报告中合成新的预测问题,并在名为OpenForesight的数据集上训练Qwen3模型。模型OpenForecaster 8B在预测准确性、校准性和一致性方面优于更大规模的专有模型。校准改进在流行基准测试中也得到了验证,并且研究已开源以促进更广泛的语言模型预测研究的可访问性。
FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion
Authors: Dian Shao, Mingfei Shi, Like Liu
Venue: AAAI 2026
First: 2025-12-31T18:59:12+00:00 · Latest: 2025-12-31T18:59:12+00:00
Comments: Accepted by AAAI 2026
Abstract
Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets can be found at https://smartdianlab.github.io/projects-FineTec/.
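The split of a skeleton region into dynamic and static subgroups "based on motion variance" can be sketched as follows. This is an illustrative reading only: the variance measure (summed per-axis variance of a joint's position over time), the median threshold, and all function names are assumptions, since the abstract does not specify them.

```python
from statistics import median, pvariance


def motion_variance(sequence, joint):
    """Sum of per-axis position variance of one joint over time.

    `sequence` is a list of frames; each frame is a list of (x, y, z) tuples,
    one per joint.
    """
    per_axis = list(zip(*(frame[joint] for frame in sequence)))
    return sum(pvariance(axis) for axis in per_axis)


def split_dynamic_static(sequence, region):
    """Partition the joints of one region by comparing variance to the median."""
    scores = {joint: motion_variance(sequence, joint) for joint in region}
    threshold = median(scores.values())  # hypothetical threshold choice
    dynamic = [j for j in region if scores[j] > threshold]
    static = [j for j in region if scores[j] <= threshold]
    return dynamic, static


if __name__ == "__main__":
    # Joint 0 stays still; joint 1 moves along x.
    seq = [[(0.0, 0.0, 0.0), (float(t), 0.0, 0.0)] for t in range(3)]
    print(split_dynamic_static(seq, region=[0, 1]))
```

A targeted perturbation could then be applied to one subgroup at a time to produce the two augmented sequences the abstract mentions, though how the paper does this is not described here.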
中文标题/摘要
标题:FineTec: 通过骨架分解和序列完成在时间腐蚀下的细粒度动作识别
从时间上被破坏的骨架序列中识别细粒度动作仍然是一个重大挑战,尤其是在现实场景中,实时姿态估计经常会产生大量缺失数据。现有方法往往难以准确恢复时间动态和细粒度的空间结构,导致丢失了区分相似动作的关键微动线索。为了解决这个问题,我们提出了一种名为FineTec的统一框架,用于在时间腐蚀下的细粒度动作识别。FineTec首先使用具有多样时间掩码的上下文感知完成从受损输入中恢复基础骨架序列。接着,一个基于骨架的空间分解模块将骨架划分为五个语义区域,并根据运动方差进一步划分为动态和静态子组,通过目标扰动生成两个增强的骨架序列。这些序列与基础序列一起通过一个基于拉格朗日动力学的物理驱动估计模块进行处理,该模块利用拉格朗日动力学估计关节加速度。最后,融合后的骨架位置序列和融合后的加速度序列一起输入到基于GCN的动作识别头部。在粗粒度(NTU-60, NTU-120)和细粒度(Gym99, Gym288)基准上的广泛实验表明,FineTec在各种时间腐蚀水平下显著优于先前的方法。具体而言,FineTec在具有挑战性的Gym99-严重和Gym288-严重设置中分别实现了89.1%和78.1%的顶级准确率,展示了其鲁棒性和泛化能力。代码和数据集可在https://smartdianlab.github.io/projects-FineTec/找到。
Summary / 总结
FineTec is a unified framework for fine-grained action recognition under temporal corruption. It first restores a base skeleton sequence using context-aware completion, then decomposes the skeleton into dynamic and static subgroups and generates augmented sequences. A physics-driven estimation module estimates joint accelerations, and these sequences are fed into a GCN-based action recognition head. Experiments show that FineTec outperforms previous methods, achieving top-1 accuracies of 89.1% and 78.1% on Gym99-severe and Gym288-severe settings, respectively.
FineTec 是一种统一框架,用于处理时间腐蚀下的细粒度动作识别,解决了现实场景中缺失数据的挑战。它首先使用上下文感知的完成方法恢复基础骨架序列,然后将骨架分解为动态和静态区域并生成增强序列。一个基于物理驱动的估计模块估计关节加速度,序列被输入到基于GCN的动作识别头部。实验表明,FineTec 在 Gym99-severe 和 Gym288-severe 设置中分别实现了 89.1% 和 78.1% 的 top-1 准确率,优于之前的方法。
From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
Authors: Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
First: 2025-12-31T18:58:30+00:00 · Latest: 2025-12-31T18:58:30+00:00
Comments: Project Page https://hjrphoebus.github.io/X-Dub
Abstract
Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
中文标题/摘要
标题:从修复到编辑:一种基于上下文的视觉配音自强化框架
基于音频的视觉配音旨在使视频的唇部动作与新语音同步,但根本上受到理想训练数据的挑战:仅唇部动作不同而其他所有视觉条件都相同的配对视频。现有方法通过基于掩码的修复范式绕过了这一问题,不完整的视觉条件迫使模型同时生成缺失的内容并同步唇部,导致视觉伪影、身份漂移和同步不良。在本文中,我们提出了一种新颖的自强化框架,将视觉配音重新定义为一个从病态的修复任务到一个良好的视频到视频编辑问题。我们的方法首先使用扩散变换器作为数据生成器,合成理想的训练数据:每个真实样本的唇部修改的伴生视频,形成视觉对齐的视频对。然后,基于扩散变换器的音频驱动编辑器在这些对上端到端训练,利用完整的对齐输入视频帧专注于精确的音频驱动唇部修改。这种完整的、帧对齐的输入条件为编辑器提供了丰富的视觉上下文,提供了完整的身份线索、场景交互和连续的空间-时间动态。利用这种丰富的上下文,我们的方法能够实现高度准确的唇部同步、忠实的身份保留和对复杂野外场景的出色鲁棒性。我们还引入了一种时间步长自适应多阶段学习策略,作为必要组件以在扩散时间步长中分离相互冲突的编辑目标,从而促进稳定训练并提高唇部同步和视觉保真度。此外,我们提出了ContextDubBench,这是一个全面的基准数据集,用于在多样且具有挑战性的实际应用场景中进行稳健评估。
Summary / 总结
This paper addresses the challenge of audio-driven visual dubbing by proposing a self-bootstrapping framework that transforms the task from an inpainting problem to a video-to-video editing problem. The framework uses a Diffusion Transformer to generate ideal training data, which are then used to train an audio-driven editor. This approach provides a rich visual context, leading to accurate lip synchronization, faithful identity preservation, and robust performance in challenging scenarios. The method also introduces a timestep-adaptive multi-phase learning strategy to improve training stability and visual fidelity.
本文提出了一种自强化框架,将音频驱动的视觉配音问题从一个病态的图像填补任务转变为一个条件良好的视频到视频编辑问题。该框架使用扩散变换器生成理想的训练数据,然后在这些配对上训练一个音频驱动的编辑器。该方法实现了高度准确的唇部同步、忠实的身份保留以及对具有挑战性的场景的鲁棒性。此外,还引入了时间步长自适应多阶段学习策略和ContextDubBench,这是一个用于在各种挑战性实际应用场景中进行稳健评估的基准数据集。
Many Minds from One Model: Bayesian Transformers for Population Intelligence
Authors: Diji Yang, Yi Zhang
First: 2025-12-31T18:56:02+00:00 · Latest: 2025-12-31T18:56:02+00:00
Abstract
Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model to support sampling diverse yet coherent model instances from a single set of pre-trained weights.
B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.
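The mechanism described above, Gaussian noise on the bias-like offsets of normalization layers that is sampled once and then frozen for the whole sequence, can be sketched on a single layer-norm. This is a toy illustration under stated assumptions: the plain-Python `layer_norm`, the per-dimension `mu` and `sigma` parameters, and the seed-per-individual convention are all hypothetical and not taken from the paper's code.

```python
import math
import random


def layer_norm(x, gain, bias, eps=1e-5):
    """Standard layer normalization over one token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def sample_bias_offset(mu, sigma, rng):
    """Draw one offset per dimension from a diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def run_sequence(tokens, gain, bias, mu, sigma, seed):
    """Process all tokens of one sequence with a single frozen offset.

    The offset is sampled once, before the loop, so every token in the
    sequence sees the same sampled "individual".
    """
    rng = random.Random(seed)
    offset = sample_bias_offset(mu, sigma, rng)
    noisy_bias = [b + o for b, o in zip(bias, offset)]
    outputs = [layer_norm(x, gain, noisy_bias) for x in tokens]
    return outputs, offset


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
    ones, zeros = [1.0] * 3, [0.0] * 3
    for seed in (0, 1):
        print(run_sequence(tokens, ones, zeros, zeros, [0.1] * 3, seed)[1])
```

Different seeds give different individuals, and their predictions could then be aggregated (for example by voting) for population-level decisions.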
中文标题/摘要
标题:一个模型多个心智:基于贝叶斯的变换器实现群体智能
尽管现代变换器规模庞大且取得巨大成功,但它们几乎无一例外地被训练为单一目标系统:优化产生一组确定性的参数,代表对数据的单一功能假设。受智能源自多个心智的想法启发,我们提出了群体贝叶斯变换器(B-Trans),将其转换为贝叶斯变换器模型,以支持从一组预训练权重中采样多样且连贯的模型实例。
B-Trans 通过将归一化层中的偏置类似偏移视为具有高斯变分近似的随机变量,引入了一个基于贝叶斯的后验代理,从而在不训练完整的贝叶斯神经网络的情况下诱导模型行为的分布。从这个代理中采样产生一组具有不同行为但保持一般能力的模型实例。为了在每次生成中保持连贯性,我们在序列级别冻结采样的噪声,确保时间上的一致性。B-Trans 允许群体级别的决策,其中跨采样个体汇总预测显著增强了探索性。在零样本生成、具有可验证奖励的强化学习(RLVR)以及无需显式标签的强化学习实验中,B-Trans 有效地利用了群体的智慧,提供了更好的语义多样性,同时在任务性能上优于确定性基线。
Summary / 总结
The paper proposes Population Bayesian Transformers (B-Trans) to address the limitation of single-minded transformers by introducing a Bayesian approach to generate diverse yet coherent model instances from a single set of pre-trained weights. B-Trans treats bias-like offsets in normalization layers as stochastic variables and uses a Gaussian variational approximation to induce a distribution over model behavior. Experiments show that B-Trans enhances semantic diversity and task performance in zero-shot generation, Reinforcement Learning with Verifiable Rewards, and reinforcement learning without explicit labels, outperforming deterministic baselines.
论文提出了Population Bayesian Transformers(B-Trans),通过引入贝叶斯方法从单个预训练权重集中生成多样且一致的模型实例来解决单向思维变压器的局限性。B-Trans 将归一化层中的偏置项视为随机变量,并使用高斯变分近似来诱导模型行为的分布。实验表明,B-Trans 通过群体级决策增强语义多样性和任务性能,优于确定性基线,在零样本生成、具有验证奖励的强化学习以及无显式标签的强化学习中表现更优。
Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings
Authors: Tianzhi He, Farrokh Jazizadeh
First: 2025-12-31T18:51:19+00:00 · Latest: 2025-12-31T18:51:19+00:00
Abstract
This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.
中文标题/摘要
标题:面向人类中心的智能建筑能源管理系统的大语言模型(LLM)基智能代理
本研究提出了一种概念框架和原型评估,用于通过自然语言交互促进智能建筑中上下文感知能源管理的大型语言模型(LLM)基建筑能源管理系统(BEMS)智能代理。所提出的框架包括三个模块:感知(传感)、中央控制(大脑)和行动(执行和用户交互),形成一个闭环反馈回路,捕捉、分析和解释能源数据,以智能响应用户查询并管理连接的电器。通过利用LLM的自主数据分析能力,BEMS智能代理旨在提供有关能源消耗、成本预测和设备调度的上下文感知见解,从而解决现有能源管理系统中的局限性。原型的性能使用120个用户查询和四个不同的真实住宅能源数据集以及不同的评估指标(包括延迟、功能、能力、准确性和成本效益)进行了评估。通过ANOVA测试展示了该框架的普适性。结果表明,通过设备控制(86%)、记忆相关任务(97%)、调度和自动化(74%)和能源分析(77%)的响应准确性衡量,表现出有希望的性能,而更复杂的成本估算任务则指出了改进的领域,准确率为49%。这项基准研究朝着正式评估LLM基BEMS智能代理并确定未来研究方向的方向迈进,强调了响应准确性和计算效率之间的权衡。
Summary / 总结
This study introduces a conceptual framework and prototype for LLM-based BEMS AI agents to enhance context-aware energy management in smart buildings via natural language interaction. The framework includes perception, central control, and action modules, forming a closed loop for energy data analysis and response to user queries. Performance was evaluated using 120 user queries and various metrics, showing promising results in device control (86%), memory tasks (97%), scheduling (74%), and energy analysis (77%), though cost estimation accuracy was lower at 49%. ANOVA tests demonstrated the framework's generalizability.
该研究提出了一种基于LLM的BEMS AI代理的概念框架和原型,通过自然语言交互实现智能建筑中的上下文感知能源管理。框架包括感知、中央控制和行动模块,形成一个闭环的数据分析和用户查询响应流程。性能评估使用了四个住宅能源数据集中的120个用户查询,结果显示在设备控制上的准确率为86%,在记忆相关任务上的准确率为97%,在调度和自动化上的准确率为74%,在能源分析上的准确率为77%,但在成本估算上的准确率仅为49%。
Generative Classifiers Avoid Shortcut Solutions
Authors: Alexander C. Li, Ananya Kumar, Deepak Pathak
Venue: ICLR 2025
First: 2025-12-31T18:31:46+00:00 · Latest: 2025-12-31T18:31:46+00:00
Comments: ICLR 2025. Code: https://github.com/alexlioralexli/generative-classifiers
Abstract
Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.
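A generative classifier picks the label that maximizes log p(x | y) + log p(y), using one class-conditional model per label. In the spirit of the Gaussian toy setting mentioned above, the sketch below fits a diagonal Gaussian per class (that is, Gaussian naive Bayes), which is a much simpler stand-in for the diffusion-based and autoregressive models the paper evaluates. The function names and the small variance floor are assumptions.

```python
import math
from collections import defaultdict


def fit(features, labels, var_floor=1e-6):
    """Fit a log-prior plus per-dimension mean and variance for each class."""
    by_class = defaultdict(list)
    for x, label in zip(features, labels):
        by_class[label].append(x)
    model = {}
    for label, rows in by_class.items():
        dims = list(zip(*rows))
        means = [sum(d) / len(d) for d in dims]
        variances = [sum((v - m) ** 2 for v in d) / len(d) + var_floor
                     for d, m in zip(dims, means)]
        model[label] = (math.log(len(rows) / len(features)), means, variances)
    return model


def log_joint(model, label, x):
    """log p(x | y) + log p(y) under the diagonal Gaussian for class y."""
    log_prior, means, variances = model[label]
    log_lik = sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
                  for xi, m, v in zip(x, means, variances))
    return log_prior + log_lik


def predict(model, x):
    """Bayes-rule prediction: the class with the highest joint log-probability."""
    return max(model, key=lambda label: log_joint(model, label, x))


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
    y = [0, 0, 1, 1]
    model = fit(X, y)
    print(predict(model, [0.1, 0.0]), predict(model, [2.9, 3.0]))
```

Because the likelihood term scores every input dimension, a feature cannot be ignored just because another feature already separates the training labels, which is the intuition behind modeling all features rather than mainly the spurious ones.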
中文标题/摘要
标题:生成式分类器避免捷径解决方案
分类中的判别方法往往学习在分布内有效的捷径,但在轻微分布变化下会失效。这种失败模式源于对与标签偶然相关的特征的过度依赖。我们展示了生成式分类器,通过使用条件生成模型来建模所有特征,而非主要建模偶然特征,可以避免这一问题。生成式分类器易于训练,无需特殊增强、强正则化、额外超参数或避免特定偶然相关性的知识。我们发现基于扩散和自回归的生成式分类器在五个标准图像和文本分布变化基准测试中达到最佳性能,并在医疗或卫星数据集等实际应用中减少了偶然相关性的影响。最后,我们仔细分析了一个高斯玩具设置,以理解生成式分类器的归纳偏置,以及哪些数据属性决定了生成式分类器何时优于判别式分类器。
Summary / 总结
The paper addresses the issue of discriminative classifiers learning spurious correlations that fail under distribution shifts. It proposes using generative classifiers that model all features, avoiding shortcut solutions. Experiments show that generative classifiers, particularly those based on diffusion and autoregressive models, outperform discriminative classifiers on various benchmarks and in real-world applications like medical and satellite data. The study also provides insights into the conditions under which generative classifiers are more effective than discriminative ones.
论文探讨了判别分类器学习在分布变化时会失效的虚假相关性的问题。它提出使用能够建模所有特征的生成分类器,避免学习捷径。实验表明,基于扩散和自回归模型的生成分类器在各种基准测试和医疗、卫星数据等实际应用中优于判别分类器。研究还分析了生成分类器在什么条件下比判别分类器更有效。
Plan Verification for LLM-Based Embodied Task Completion Agents
Authors: Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
First: 2025-09-02T19:06:56+00:00 · Latest: 2025-12-31T18:31:30+00:00
Abstract
Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
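The iterative Judge/Planner loop can be sketched with plain callables standing in for the two LLMs. Everything concrete here is an assumption for illustration: the `refine_plan` helper, the toy judge that flags immediately repeated actions, and the toy planner that deletes them. In the paper both roles are played by prompted LLMs, and the critique is natural language rather than a list of indices.

```python
def refine_plan(plan, judge, planner, max_iters=3):
    """Alternate critique and revision until the judge has no complaints.

    Returns the final plan and the number of revision rounds used.
    """
    for iteration in range(max_iters):
        critique = judge(plan)
        if not critique:
            return plan, iteration
        plan = planner(plan, critique)
    return plan, max_iters


def toy_judge(plan):
    """Stand-in judge: flag indices of actions that repeat the previous one."""
    return [i for i in range(1, len(plan)) if plan[i] == plan[i - 1]]


def toy_planner(plan, critique):
    """Stand-in planner: drop the flagged actions."""
    flagged = set(critique)
    return [action for i, action in enumerate(plan) if i not in flagged]


if __name__ == "__main__":
    noisy = ["goto sink", "goto sink", "pickup cup", "pickup cup", "place cup"]
    print(refine_plan(noisy, toy_judge, toy_planner))
```

The `max_iters=3` default mirrors the reported observation that most sequences need at most three iterations. A real judge would also need to leave legitimate error-recovery steps in place, which this toy version does not attempt.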
中文标题/摘要
标题:基于LLM的具身任务完成代理计划验证
基于大型语言模型(LLM)的任务计划及其对应的具身AI人类示范可能包含噪声,存在不必要的动作、冗余导航和逻辑错误,这些都会降低策略质量。我们提出了一种迭代验证框架,在该框架中,一个法官LLM批评动作序列,一个规划LLM应用修订,从而逐步产生更清洁且更具空间连贯性的轨迹。与基于规则的方法不同,我们的方法依赖于自然语言提示,能够广泛泛化到各种错误类型,包括无关动作、矛盾和缺失步骤。在TEACh具身AI数据集中手动标注的动作集上,我们的框架在四个最先进的LLM(GPT o4-mini、DeepSeek-R1、Gemini 2.5、LLaMA 4 Scout)上实现了高达90%的召回率和100%的精确率。精炼循环收敛迅速,96.5%的序列最多需要三轮迭代,同时提高了时间效率和空间动作组织。至关重要的是,该方法保留了人类错误恢复模式,而不是将其消除,支持未来关于稳健纠正行为的工作。通过将计划验证确立为空间规划和动作精炼的可靠LLM能力,我们为具身AI中的模仿学习提供了可扩展的高质量训练数据路径。
Summary / 总结
This paper addresses the issue of noisy task plans generated by large language models (LLMs) for embodied AI, proposing an iterative verification framework involving a Judge LLM and a Planner LLM. The framework improves the quality of action sequences by critiquing and refining them, leading to cleaner and more spatially coherent trajectories. Experiments on the TEACh dataset show that the framework achieves high precision and recall across various LLMs, with most sequences converging within three iterations, enhancing both temporal efficiency and spatial action organization without collapsing human error-recovery patterns.
本文针对大型语言模型(LLM)生成的体态AI任务计划中存在的噪声问题,提出了一种迭代验证框架,该框架包括一个裁判LLM和一个规划LLM。该框架通过批判和改进行动序列,提高了行动序列的质量,使其更加清洁且具有更好的空间一致性。实验结果表明,该框架在TEACh数据集上实现了高精度和高召回率,大多数序列在三轮迭代内收敛,同时提高了时间效率和空间行动组织,而不会破坏人类的错误恢复模式。
Towards Generalisable Foundation Models for Brain MRI
Authors: Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander
First: 2025-10-27T15:19:46+00:00 · Latest: 2025-12-31T18:26:04+00:00
Abstract
Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.
中文标题/摘要
标题:通用脑MRI基础模型的研究
人工智能(AI)中的基础模型正在通过从大规模未标记数据集学习通用特征来改变医学成像。在本研究中,我们介绍了BrainFound,这是一种用于脑MRI的自监督基础模型,通过扩展最初设计用于2D自然图像的DINO-v2视觉变换器构建而成。BrainFound将DINO-v2适应于建模完整的3D脑解剖结构,通过纳入序列MRI切片的体素信息,超越了传统的单切片范式。它支持单模态和多模态输入,能够支持一系列下游任务,包括疾病检测和图像分割,同时在不同的成像协议和临床场景中具有泛化能力。我们展示了BrainFound在标签稀缺和多对比度设置中始终优于现有的自监督预训练策略和监督基线。通过整合多种3D MRI模态(如T1、T2、FLAIR)的信息,它提高了诊断准确性并减少了对大量专家注释的依赖。这种灵活性使BrainFound成为3D神经成像管道的可扩展和实用解决方案,具有在临床部署和研究创新方面的巨大潜力。
Summary / 总结
The research aims to develop a generalisable foundation model for brain MRI by extending DINO-v2, a vision transformer, to handle 3D brain anatomy. BrainFound incorporates volumetric information from sequential MRI slices and supports both single- and multimodal inputs, enabling various downstream tasks. Experimental results show that BrainFound outperforms existing self-supervised pretraining strategies and supervised baselines, especially in label-scarce and multi-contrast settings, enhancing diagnostic accuracy and reducing dependency on expert annotations. This makes BrainFound a practical solution for 3D neuroimaging pipelines and clinical deployment.
研究旨在通过将DINO-v2扩展到处理3D脑部解剖结构,开发一种通用的基础模型BrainFound。BrainFound整合了来自连续MRI切片的体素信息,并支持单模态和多模态输入,能够执行广泛的下游任务。实验结果表明,BrainFound在标签稀缺和多对比度设置中优于现有的自监督预训练策略和监督基线,提高了诊断准确性并减少了对大量专家注释的依赖。
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
Authors: Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier
Venue: NeurIPS 2025
First: 2025-12-31T18:21:52+00:00 · Latest: 2025-12-31T18:21:52+00:00
Comments: NeurIPS 2025
Abstract
Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
中文标题/摘要
标题:ResponseRank:通过偏好强度学习实现数据高效奖励建模
二元选择,如强化学习从人类反馈(RLHF)中常用的方式,只能传达偏好的方向。一个人可能选择苹果而不是橙子,香蕉而不是葡萄,但哪种偏好更强烈?强度对于在不确定性下做决策和偏好模型的泛化至关重要,但很难可靠地测量。元数据如响应时间及注释者间的一致性可以作为强度的代理,但往往噪声较大且混杂。我们提出ResponseRank以应对来自噪声强度信号的学习挑战。我们的方法使用代理信号的相对差异来对成对比较的响应进行排序,以推断其偏好强度。为了控制系统性变化,我们仅在精心构建的层内局部比较信号。这使得我们可以稳健地学习与强度推导出的排名一致的效用差异,同时对强度信号的假设最少。我们的贡献包括三个方面:(1) ResponseRank,一种新颖的方法,通过利用局部有效的相对强度信号稳健地学习偏好强度;(2) 在合成偏好学习(使用模拟响应时间)、语言建模(使用注释者一致性)和RL控制任务(使用模拟回合回报)等多样任务中,证明了改进的样本效率和稳健性;(3) Pearson距离相关性(PDC),一种新颖的度量标准,能够隔离序数准确性与基数效用学习。
Summary / 总结
ResponseRank is a method for learning preference strength from noisy signals in reinforcement learning from human feedback. It ranks responses based on inferred preference strength using relative differences in proxy signals and controls for systemic variation within strata. The method demonstrates improved sample efficiency and robustness across various tasks, including synthetic preference learning, language modeling, and RL control tasks.
ResponseRank 是一种从人类反馈中的嘈杂信号中学习偏好强度的方法。它通过在精心构建的层内使用代理信号的相对差异来对响应进行排序,以推断偏好强度。该方法在合成偏好学习、语言建模和RL控制任务等多种任务中展示了改进的样本效率和鲁棒性。
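The stratified ranking idea in the ResponseRank abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of annotator identity as the stratum, the use of response time as the proxy signal, and the assumption that a shorter response time indicates a stronger preference are all hypothetical choices made for illustration. The point is only that proxy signals are compared within a stratum and never across strata.

```python
from collections import defaultdict

def rank_within_strata(comparisons):
    """Group comparisons by stratum, then order each group from strongest to
    weakest inferred preference.

    Assumption (illustrative only): a shorter response time signals a
    stronger preference.
    """
    by_stratum = defaultdict(list)
    for c in comparisons:
        by_stratum[c["stratum"]].append(c)
    return {
        stratum: [c["id"] for c in sorted(group, key=lambda c: c["response_time"])]
        for stratum, group in by_stratum.items()
    }

# Hypothetical data: one stratum per annotator, so a slow annotator's times
# are never compared against a fast annotator's times.
comparisons = [
    {"id": "a", "stratum": "annotator_1", "response_time": 2.0},
    {"id": "b", "stratum": "annotator_1", "response_time": 0.5},
    {"id": "c", "stratum": "annotator_2", "response_time": 9.0},
    {"id": "d", "stratum": "annotator_2", "response_time": 4.0},
]
rankings = rank_within_strata(comparisons)
```

A reward model could then be trained so that its predicted utility differences respect each per-stratum ordering; how the paper does this is not specified in the abstract.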
FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Authors: Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
First: 2025-12-31T17:57:45+00:00 · Latest: 2025-12-31T17:57:45+00:00
Abstract
We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
中文标题/摘要
标题:FoundationSLAM:释放深度基础模型的潜力以实现端到端密集视觉SLAM
我们提出了FoundationSLAM,这是一种基于学习的单目密集SLAM系统,旨在解决先前基于流动的方法中缺乏几何一致性的问题,以实现准确和稳健的跟踪和建图。我们的核心思想是通过利用基础深度模型的指导,将流动估计与几何推理相结合。为此,我们首先开发了一种混合流动网络,该网络生成几何感知的对应关系,使不同关键帧之间的深度和姿态推断保持一致。为了确保全局一致性,我们提出了一种双向一致束调整层,该层在多视图约束下联合优化关键帧姿态和深度。此外,我们引入了一种可靠性感知精炼机制,该机制通过区分可靠和不确定区域动态调整流动更新过程,形成匹配与优化之间的闭环。广泛的实验表明,FoundationSLAM在多个具有挑战性的数据集上实现了卓越的轨迹精度和密集重建质量,同时以每秒18帧的速度实时运行,展示了我们方法在各种场景下的强大泛化能力和实际应用潜力。
Summary / 总结
FoundationSLAM is a learning-based monocular dense SLAM system that improves the accuracy and robustness of tracking and mapping by integrating depth information from foundation models. It uses a Hybrid Flow Network to generate geometry-aware correspondences and a Bi-Consistent Bundle Adjustment Layer to enforce global consistency. Additionally, it includes a Reliability-Aware Refinement mechanism to dynamically adjust the flow update process. Experiments show that FoundationSLAM outperforms previous methods in trajectory accuracy and dense reconstruction quality across various datasets, while maintaining real-time performance at 18 FPS.
FoundationSLAM 是一种基于学习的单目密集 SLAM 系统,通过整合来自基础深度模型的指导来改进之前的流基方法。它使用 Hybrid Flow Network 生成几何感知的对应关系,并使用 Bi-Consistent Bundle Adjustment Layer 来确保全局一致性。此外,它还引入了 Reliability-Aware Refinement 机制,以动态适应流更新过程。实验表明,FoundationSLAM 在多个数据集上实现了高轨迹准确性和密集重建质量,并以每秒 18 帧的速度实时运行。
Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification
Authors: Zhenyu Cui, Jiahuan Zhou, Yuxin Peng
First: 2025-12-31T17:50:05+00:00 · Latest: 2025-12-31T17:50:05+00:00
Abstract
Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery data often cannot be stored directly because of data privacy concerns, and re-indexing large-scale gallery images is costly. As a result, query features extracted by the updated model must be matched against gallery features extracted by earlier models, and this incompatibility greatly impairs re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. RFL-ReID is therefore more challenging than L-ReID: it requires continuous learning, balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.
中文标题/摘要
标题:Bi-C2R:双向持续兼容表示以实现无需再索引的终身行人重识别
终身行人重识别(L-ReID)利用按顺序收集的数据进行持续训练和更新重识别模型,重点关注所有数据的整体性能。其主要挑战是在训练新数据时避免旧知识的灾难性遗忘问题。现有L-ReID方法通常在每次更新后重新提取所有历史画廊图像的新特征进行推理,这被称为“再索引”。然而,由于数据隐私问题,历史画廊数据通常无法直接保存,且大规模画廊图像的再索引成本很高。因此,不可避免地导致更新模型提取的查询特征与更新前模型提取的画廊特征之间的不兼容检索,严重影响重识别性能。为了解决上述问题,本文关注一个新任务,称为无需再索引的终身行人重识别(RFL-ReID),该任务要求在不重新索引历史画廊图像的情况下进行终身行人重识别。因此,RFL-ReID 比 L-ReID 更具挑战性,需要在多样化的流式数据中持续学习并平衡新旧知识,使新旧模型输出的特征相互兼容。为此,我们提出了一种双向持续兼容表示(Bi-C2R)框架,以兼容的方式持续更新旧模型提取的画廊特征,实现高效的L-ReID。我们通过理论分析和在多个基准上的广泛实验验证了我们提出的Bi-C2R方法,结果表明,该方法在引入的RFL-ReID任务和传统L-ReID任务上均能实现领先性能。
Summary / 总结
The paper addresses the challenge of lifelong person re-identification (L-ReID) by proposing a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which aims to perform re-identification without re-indexing historical gallery images. To solve this, the authors introduce a Bidirectional Continuous Compatible Representation (Bi-C2R) framework that updates the gallery features extracted by the old model to ensure compatibility with features from the updated model. The method is evaluated on multiple benchmarks and shows superior performance in both RFL-ReID and traditional L-ReID tasks.
该论文通过提出一个新的任务——无需重新索引历史图库图像的终身行人重识别(RFL-ReID),来解决终身行人重识别(L-ReID)中的挑战。为了解决检索不兼容的问题,作者引入了一个双向连续兼容表示(Bi-C2R)框架,该框架更新由旧模型提取的图库特征,以确保与更新模型提取的特征兼容。在多个基准上的实验表明,Bi-C2R在RFL-ReID和传统L-ReID任务上均取得了领先性能。
Basic Inequalities for First-Order Optimization with Applications to Statistical Risk Analysis
Authors: Seunghoon Paik, Kangjie Zhou, Matus Telgarsky, Ryan J. Tibshirani
First: 2025-12-31T17:49:37+00:00 · Latest: 2025-12-31T17:49:37+00:00
Comments: 47 pages, 3 figures (7 subfigures)
Abstract
We introduce \textit{basic inequalities} for first-order iterative optimization algorithms, forming a simple and versatile framework that connects implicit and explicit regularization. While related inequalities appear in the literature, we isolate and highlight a specific form and develop it as a well-rounded tool for statistical analysis. Let $f$ denote the objective function to be optimized. Given a first-order iterative algorithm initialized at $θ_0$ with current iterate $θ_T$, the basic inequality upper bounds $f(θ_T)-f(z)$ for any reference point $z$ in terms of the accumulated step sizes and the distances between $θ_0$, $θ_T$, and $z$. The bound translates the number of iterations into an effective regularization coefficient in the loss function. We demonstrate this framework through analyses of training dynamics and prediction risk bounds. In addition to revisiting and refining known results on gradient descent, we provide new results for mirror descent with Bregman divergence projection, for generalized linear models trained by gradient descent and exponentiated gradient descent, and for randomized predictors. We illustrate and supplement these theoretical findings with experiments on generalized linear models.
中文标题/摘要
标题:一阶优化的基本不等式及其在统计风险分析中的应用
我们引入了一阶迭代优化算法的\textit{基本不等式},形成一个简单且多功能的框架,将隐式和显式正则化联系起来。虽然文献中存在相关的不等式,但我们隔离并强调了一种特定形式,并将其发展为统计分析中一个完善的工具。令$f$表示要优化的目标函数。给定一个初始于$θ_0$的一阶迭代算法,当前迭代为$θ_T$,基本不等式以累积步长和$θ_0$、$θ_T$与$z$之间的距离为界,上界$f(θ_T)-f(z)$对于任何参考点$z$。该界将迭代次数转化为损失函数中的有效正则化系数。我们通过训练动力学分析和预测风险界展示了这一框架。除了重新审视和改进已知的梯度下降结果外,我们还提供了镜像下降与Bregman散度投影、广义线性模型通过梯度下降和指数梯度下降训练、以及随机预测的新结果。我们通过广义线性模型的实验展示了并补充了这些理论发现。
Summary / 总结
This paper introduces basic inequalities for first-order iterative optimization algorithms, which connect implicit and explicit regularization. The inequalities upper bound the difference between the objective function value at the current iterate and a reference point in terms of step sizes and distances between iterates. This framework is applied to analyze training dynamics and prediction risk bounds, providing new results for mirror descent, generalized linear models, and randomized predictors, and refining known results on gradient descent.
本文引入了用于一阶迭代优化算法的基本不等式,将隐式和显式正则化联系起来。这些不等式以步长和迭代点之间的距离为依据,上界当前迭代点的目标函数值与参考点之间的差异。该框架被应用于分析训练动态和预测风险界,提供了镜像下降、广义线性模型和随机化预测的新结果,并改进了梯度下降已知结果。
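To make the shape of such a bound concrete, the sketch below numerically checks a standard textbook inequality for gradient descent on a convex, L-smooth objective with constant step size eta <= 1/L: f(theta_T) - f(z) <= (||theta_0 - z||^2 - ||theta_T - z||^2) / (2 * eta * T) for any reference point z. This is offered as a familiar instance of the kind of statement the abstract describes, not as the paper's exact theorem; the least-squares objective, problem sizes, and random reference points are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 5, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def f(theta):
    """Least-squares objective (convex and smooth)."""
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta):
    return X.T @ (X @ theta - y) / n

# Smoothness constant: largest eigenvalue of the Hessian X^T X / n.
L = np.linalg.eigvalsh(X.T @ X / n).max()
eta = 1.0 / L

theta0 = np.zeros(d)
theta = theta0.copy()
for _ in range(T):
    theta = theta - eta * grad(theta)

def bound_holds(z, tol=1e-10):
    """Check f(theta_T) - f(z) <= (|theta_0 - z|^2 - |theta_T - z|^2) / (2 eta T)."""
    lhs = f(theta) - f(z)
    rhs = (np.sum((theta0 - z) ** 2) - np.sum((theta - z) ** 2)) / (2 * eta * T)
    return lhs <= rhs + tol

# Arbitrary (hypothetical) reference points, including the least-squares solution.
references = [rng.normal(size=d) for _ in range(5)]
references.append(np.linalg.lstsq(X, y, rcond=None)[0])
all_hold = all(bound_holds(z) for z in references)
```

Rearranged, the inequality reads f(theta_T) + ||theta_T - z||^2 / (2 eta T) <= f(z) + ||theta_0 - z||^2 / (2 eta T), so 1 / (eta T) behaves like a ridge-type regularization coefficient, which is one way to read the abstract's remark that the number of iterations translates into an effective regularization coefficient.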
PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes
Authors: Luca Collorone, Mert Kiray, Indro Spinelli, Fabio Galasso, Benjamin Busam
First: 2025-12-31T17:32:31+00:00 · Latest: 2025-12-31T17:32:31+00:00
Abstract
Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, requiring slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation via collision-aware, physics-based manipulation of arbitrary, multi-material objects. Finally, PhysTalk is training-free and computationally lightweight: this makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.
中文标题/摘要
标题:PhysTalk: 3D 高斯场景中的语言驱动实时物理
逼真的视觉模拟无处不在,但其创建需要计算时间、渲染和专家动画知识。基于开放词汇的视觉效果生成从文本输入中脱颖而出,成为一种有望释放巨大创意潜力的解决方案。然而,当前的工作流程缺乏物理真实性和有效的语言界面,需要缓慢的离线优化。相比之下,PhysTalk 将 3D 高斯点绘(3DGS)场景作为输入,并将任意用户提示翻译成基于实时物理的 4D 交互式动画。一个大型语言模型(LLM)生成可执行代码,直接通过轻量级代理和粒子动力学修改 3DGS 参数。值得注意的是,PhysTalk 是第一个直接将 3DGS 与物理模拟器结合而无需依赖耗时的网格提取的框架。尽管保持开放词汇,这种设计使得通过碰撞感知的基于物理的任意多材料对象操作实现交互式 3D 高斯动画成为可能。最后,PhysTalk 是无训练的且计算量轻:这使得 4D 动画广泛可及,并将这些工作流程从“渲染等待”范式转向与现代、基于物理的管道进行互动对话。
DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2025-12-31T17:31:29+00:00
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
中文标题/摘要
标题:DarkEQA:在低光室内环境中评估视觉语言模型的实体化问答能力
视觉语言模型(VLMs)越来越多地被用作实体代理的核心推理模块。现有的基准测试在理想的、光线充足的条件下评估它们的能力,但全天候运行的需求则要求在广泛的视觉退化条件下表现出色,包括夜间或黑暗环境中的低光条件——这一核心需求已被很大程度上忽视。为应对这一未被充分探索的挑战,我们提出了DarkEQA,这是一个开源基准,用于在多级低光条件下评估与实体化问答(EQA)相关的感知基本能力。DarkEQA通过在受控退化条件下从第一人称观察中进行问答评估,隔离了感知瓶颈,使可归因的鲁棒性分析成为可能。DarkEQA的一个关键设计特点是其物理保真度:视觉退化在线性RAW空间中建模,模拟基于物理的照明下降和传感器噪声,随后通过ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强(LLIE)模型展示了DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的操作限制。我们的代码和基准数据集将在接受后发布。
Summary / 总结
DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in low-light indoor environments, addressing the underexplored challenge of robust 24/7 operation. It isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled, multi-level low-light degradations. The degradations are modeled in linear RAW space, simulating a physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. Evaluating a range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models systematically reveals the VLMs' limitations under these conditions. The code and benchmark dataset are to be released upon acceptance.
DarkEQA 是一个基准,旨在评估 Vision-Language 模型在低光室内环境中的性能,解决其在 24/7 运行中的鲁棒性不足问题。它通过线性 RAW 空间中的受控降级过程模拟低光条件,使感知限制得以分析。主要发现表明,当前的 VLMs 在这些具有挑战性的视觉条件下进行问答时表现不佳。
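The degradation model described above (illumination drop and sensor noise applied in linear space, then rendering back to a display image) can be sketched roughly as follows. This is a simplified illustration, not the DarkEQA pipeline: the gamma-2.2 approximation of the sRGB curve, the Poisson-plus-Gaussian noise model, and the parameter names `exposure`, `full_well`, and `read_noise` are all assumptions.

```python
import numpy as np

def simulate_low_light(srgb, exposure=0.1, full_well=1000.0, read_noise=2.0, seed=0):
    """Darken an image in (approximately) linear space and add sensor noise.

    srgb: float array with values in [0, 1].
    exposure: fraction of the original light that reaches the sensor.
    full_well, read_noise: hypothetical sensor parameters, in electrons.
    """
    rng = np.random.default_rng(seed)
    linear = np.clip(srgb, 0.0, 1.0) ** 2.2             # approximate inverse display gamma
    electrons = linear * exposure * full_well           # less light means fewer photo-electrons
    noisy = rng.poisson(electrons).astype(np.float64)   # signal-dependent shot noise
    noisy += rng.normal(0.0, read_noise, size=noisy.shape)  # signal-independent read noise
    linear_out = np.clip(noisy / full_well, 0.0, 1.0)
    return linear_out ** (1.0 / 2.2)                    # stand-in for the ISP rendering step

# Usage on a hypothetical uniform gray image.
image = np.full((8, 8, 3), 0.8)
dark = simulate_low_light(image, exposure=0.1)
```

Applying noise before the gamma step matters: shot noise scales with the photon count, so adding it to an already gamma-encoded image would misrepresent how noise appears in dark regions.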
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-31T17:30:11+00:00
Abstract
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
中文标题/摘要
标题:DAVE:一种面向文档理解和网络代理的VLM视觉编码器
尽管视觉语言模型(VLMs)在多模态任务中表现出色,但它们所选择的视觉编码器存在根本性弱点:其低级特征缺乏文档理解和网络代理所需的关键结构和空间信息。为弥补这一差距,我们提出了DAVE,一种专为VLMs设计并针对这些任务定制的视觉编码器。我们的训练管道旨在利用大量未标注数据,以避免对文档和网络图像进行昂贵的大规模注释的需求。我们首先在未标注图像上进行自我监督预训练,然后在监督自回归预训练阶段,模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内,我们采用了两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐:(i) 我们引入了一种新颖的模型合并方案,将使用不同文本解码器训练的编码器结合在一起,以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练,将预训练的一般编码器(如SigLIP2)的特征与我们自己的文档和网络特定表示融合在一起。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性,确立了DAVE作为文档和网络应用的强大视觉编码器的地位。
Summary / 总结
DAVE is a vision encoder designed for VLMs to enhance document understanding and web agent tasks by incorporating self-supervised and supervised pretraining methods. It leverages abundant unlabeled data and combines encoders trained with different text decoders and ensemble training to improve compatibility and performance. Experiments show DAVE outperforms existing models on document tasks, VQAs, web localization, and agent-based benchmarks, making it a robust vision encoder for these applications.
DAVE 是一种为 VLMs 设计的视觉编码器,旨在通过引入稳健的结构和空间信息来提升文档理解和网页代理任务。它采用无监督预训练在未标记数据上,随后进行监督自回归预训练,使用有限的高质量数据。DAVE 包含模型合并方案和集成训练,以增强兼容性和性能。实验表明,DAVE 在文档任务、VQA、网页定位和基于代理的基准测试中优于现有方法。
SymSeqBench: a unified framework for the generation and analysis of rule-based symbolic sequences and datasets
Authors: Barna Zajzon, Younes Bouhadjar, Maxime Fabre, Felix Schmidt, Noah Ostendorf, Emre Neftci, Abigail Morrison, Renato Duarte
First: 2025-12-31T17:18:26+00:00 · Latest: 2025-12-31T17:18:26+00:00
Abstract
Sequential structure is a key feature of multiple domains of natural cognition and behavior, such as language, movement and decision-making. Likewise, it is also a central property of tasks to which we would like to apply artificial intelligence. It is therefore of great importance to develop frameworks that allow us to evaluate sequence learning and processing in a domain agnostic fashion, whilst simultaneously providing a link to formal theories of computation and computability. To address this need, we introduce two complementary software tools: SymSeq, designed to rigorously generate and analyze structured symbolic sequences, and SeqBench, a comprehensive benchmark suite of rule-based sequence processing tasks to evaluate the performance of artificial learning systems in cognitively relevant domains. In combination, SymSeqBench offers versatility in investigating sequential structure across diverse knowledge domains, including experimental psycholinguistics, cognitive psychology, behavioral analysis, neuromorphic computing and artificial intelligence. Due to its basis in Formal Language Theory (FLT), SymSeqBench provides researchers in multiple domains with a convenient and practical way to apply the concepts of FLT to conceptualize and standardize their experiments, thus advancing our understanding of cognition and behavior through shared computational frameworks and formalisms. The tool is modular, openly available and accessible to the research community.
中文标题/摘要
标题:SymSeqBench:基于规则的符号序列及其数据集生成与分析的统一框架
序列结构是自然认知和行为多个领域的关键特征,如语言、运动和决策。同样,这也是我们希望应用于人工智能任务的核心属性。因此,开发一种在不同领域中评估序列学习和处理的框架,同时与计算和可计算性的形式理论建立联系,是非常重要的。为满足这一需求,我们引入了两个互补的软件工具:SymSeq,用于严格生成和分析结构化的符号序列;SeqBench,一个基于规则的序列处理任务综合基准套件,用于评估人工学习系统在认知相关领域的性能。结合使用,SymSeqBench 提供了跨不同知识领域的序列结构研究的灵活性,包括实验心理语言学、认知心理学、行为分析、神经形态计算和人工智能。由于其基于形式语言理论(FLT),SymSeqBench 为多个领域的研究人员提供了一种方便且实用的方法,将形式语言理论的概念应用于实验设计和标准化,从而通过共享的计算框架和形式化方法推进我们对认知和行为的理解。该工具是模块化的,公开可用,并且对研究界开放。
Summary / 总结
SymSeqBench is a unified framework for generating and analyzing structured symbolic sequences and datasets, addressing the need for evaluating sequence learning and processing across various domains. It consists of SymSeq for rigorous sequence generation and analysis, and SeqBench as a benchmark suite for artificial learning systems. Key findings include the versatility of SymSeqBench in diverse fields such as psycholinguistics, cognitive psychology, and artificial intelligence, and its ability to apply Formal Language Theory concepts to standardize experiments and advance cognitive understanding through shared computational frameworks.
SymSeqBench 是一个统一框架,用于生成和分析结构化的符号序列和数据集,旨在跨多个领域评估序列学习和处理的需求。它包括 SymSeq 用于严谨的序列生成和分析,以及 SeqBench 作为人工学习系统的基准测试套件。关键发现包括 SymSeqBench 在语言学、认知心理学和人工智能等领域的灵活性,以及它能够通过共享的计算框架和形式化应用形式语言理论概念来标准化实验,从而推动对认知和行为的理解。
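As a rough illustration of rule-based symbolic sequence generation grounded in Formal Language Theory, the sketch below samples strings from a small regular grammar and checks membership. It does not use the SymSeq or SeqBench API, which the abstract does not describe; the grammar, state names, and helper functions are hypothetical.

```python
import random

# Hypothetical regular grammar: state -> list of (symbol, next_state).
# A next_state of None means the sequence ends. The language is: a c* d | b e
GRAMMAR = {
    "S0": [("a", "S1"), ("b", "S2")],
    "S1": [("c", "S1"), ("d", None)],
    "S2": [("e", None)],
}

def generate(grammar, start="S0", rng=None, max_len=50):
    """Sample one sequence by a random walk over the grammar's transitions."""
    if rng is None:
        rng = random.Random()
    state, symbols = start, []
    while state is not None and len(symbols) < max_len:
        symbol, state = rng.choice(grammar[state])
        symbols.append(symbol)
    return "".join(symbols)

def accepts(grammar, sequence, start="S0"):
    """Return True if the sequence is a complete string of the grammar.

    Assumes the grammar is deterministic (one transition per symbol per state).
    """
    state = start
    for ch in sequence:
        if state is None:
            return False
        targets = [nxt for sym, nxt in grammar[state] if sym == ch]
        if not targets:
            return False
        state = targets[0]
    return state is None

samples = [generate(GRAMMAR, rng=random.Random(seed)) for seed in range(20)]
```

Grammatical samples from `generate` and ungrammatical strings rejected by `accepts` are the two ingredients a sequence-learning benchmark of this kind typically needs: positive examples and a ground-truth validity check.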
Distribution-Dependent Rates for Multi-Distribution Learning
Authors: Rafael Hanashiro, Patrick Jaillet
First: 2023-12-20T15:50:16+00:00 · Latest: 2025-12-31T17:05:43+00:00
Abstract
To address the needs of modeling uncertainty in sensitive machine learning applications, the setup of distributionally robust optimization (DRO) seeks good performance uniformly across a variety of tasks. The recent multi-distribution learning (MDL) framework tackles this objective in a dynamic interaction with the environment, where the learner has sampling access to each target distribution. Drawing inspiration from the field of pure-exploration multi-armed bandits, we provide distribution-dependent guarantees in the MDL regime, that scale with suboptimality gaps and result in superior dependence on the sample size when compared to the existing distribution-independent analyses. We investigate two non-adaptive strategies, uniform and non-uniform exploration, and present non-asymptotic regret bounds using novel tools from empirical process theory. Furthermore, we devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature. We also conduct a small synthetic experiment illustrating the comparative strengths of each strategy.
中文标题/摘要
标题:多分布学习的分布依赖速率
为应对敏感机器学习应用中建模不确定性的需求,分布鲁棒优化(DRO)设置旨在实现各种任务上的统一良好性能。最近的多分布学习(MDL)框架通过与环境的动态交互来实现这一目标,其中学习者可以对每个目标分布进行采样访问。受到纯探索多臂bandit领域的启发,我们在MDL框架下提供了依赖分布的保证,这些保证与次优差距成比例,并且在样本量依赖性上优于现有的独立于分布的分析。我们研究了两种非自适应策略,均匀探索和非均匀探索,并使用经验过程理论中的新工具给出了非渐近的遗憾界。此外,我们设计了一种自适应乐观算法LCB-DR,展示了对差距的增强依赖性,类似于多臂bandit文献中均匀分配和乐观分配之间的对比。我们还进行了一个小规模的合成实验,以说明每种策略的比较优势。
Summary / 总结
This paper addresses the need for modeling uncertainty in machine learning applications by proposing distribution-dependent guarantees in the multi-distribution learning (MDL) framework. The authors provide non-asymptotic regret bounds for two non-adaptive strategies, uniform and non-uniform exploration, and introduce an adaptive optimistic algorithm, LCB-DR, which shows enhanced dependence on the suboptimality gaps. The resulting guarantees scale with these gaps and have better dependence on the sample size than existing distribution-independent analyses.
论文旨在通过在多分布学习(MDL)框架中使用分布鲁棒优化(DRO),提高机器学习模型在不确定环境中的性能。它提供了与次优差距成比例的分布依赖性保证,并为两种非自适应策略——均匀探索和非均匀探索——提供了非渐近后悔界。此外,还提出了一种自适应乐观算法LCB-DR,该算法在大差距场景中表现出更好的性能。实验表明,在合成设置中,LCB-DR优于非自适应策略。
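The uniform exploration strategy mentioned above can be sketched in a toy setting: split the sampling budget evenly across the target distributions, estimate each hypothesis's risk on each, and return the hypothesis with the smallest worst-case empirical risk. The finite hypothesis class, the Bernoulli losses, and the function names are assumptions made for illustration; this is neither the paper's analysis setting nor its LCB-DR algorithm.

```python
import numpy as np

def uniform_exploration(sample_loss, n_hyp, n_dist, budget, rng):
    """Spend budget // n_dist samples on every distribution, then pick the
    hypothesis that minimizes the maximum empirical risk across distributions."""
    per_dist = budget // n_dist
    risk = np.zeros((n_hyp, n_dist))
    for k in range(n_dist):
        # Each draw returns the loss of every hypothesis on one sample from distribution k.
        draws = np.stack([sample_loss(k, rng) for _ in range(per_dist)])
        risk[:, k] = draws.mean(axis=0)
    best = int(np.argmin(risk.max(axis=1)))
    return best, risk

# Hypothetical ground truth: rows are hypotheses, columns are distributions.
TRUE_RISK = np.array([
    [0.1, 0.9],   # good on distribution 0, poor on distribution 1
    [0.4, 0.5],   # balanced: smallest worst-case risk
    [0.8, 0.2],
])

def bernoulli_loss(k, rng):
    """Simulated 0/1 losses of all hypotheses on one sample from distribution k."""
    return (rng.random(TRUE_RISK.shape[0]) < TRUE_RISK[:, k]).astype(float)

best, empirical_risk = uniform_exploration(
    bernoulli_loss, n_hyp=3, n_dist=2, budget=2000, rng=np.random.default_rng(0)
)
```

An adaptive scheme would instead concentrate samples on the distributions and hypotheses that are still hard to tell apart, which is where gap-dependent guarantees become relevant.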
ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
First: 2025-12-31T16:51:14+00:00 · Latest: 2025-12-31T16:51:14+00:00
Comments: 17 pages, 15 figures
Abstract
Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which precludes free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model to serve as a dexterous hand for GUIs, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
中文标题/摘要
标题:ShowUI-$π$: 基于流的生成模型作为GUI灵巧之手
构建能够进行灵巧操作的智能代理对于实现机器人和数字环境中的类人自动化至关重要。然而,现有的GUI代理依赖于离散的点击预测(x,y),这禁止了自由形式、闭环轨迹(例如拖动进度条)的实现,这些轨迹需要连续的、实时的感知和调整。在本工作中,我们开发了ShowUI-$π$,这是第一个基于流的生成模型作为GUI灵巧之手,其设计包括:(i) 统一的离散-连续动作,将离散点击和连续拖动整合到一个共享模型中,以适应多种交互模式;(ii) 基于流的动作生成,用于拖动建模,通过轻量级的动作专家从连续的视觉观察中预测增量光标调整,确保平滑和稳定的轨迹;(iii) 拖动训练数据和基准,我们手动收集并合成了跨越五个领域(例如PowerPoint,Adobe Premiere Pro)的20,000条拖动轨迹,并引入了ScreenDrag基准,该基准具有全面的在线和离线评估协议,用于评估GUI代理的拖动能力。我们的实验表明,专有的GUI代理在ScreenDrag上仍然存在困难(例如Operator得分为13.27,而最好的Gemini-2.5-CUA达到22.18)。相比之下,ShowUI-$π$仅使用4.5亿参数就达到了26.98,这突显了任务的难度和我们方法的有效性。我们希望这项工作能够推动GUI代理向数字世界中的类人灵巧控制发展。代码可在https://github.com/showlab/showui-pi/获取。
Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws
Authors: Gérard Ben Arous, Murat A. Erdogdu, Nuri Mert Vural, Denny Wu
Venue: NeurIPS 2025
First: 2025-08-05T17:57:56+00:00 · Latest: 2025-12-31T16:43:30+00:00
Comments: NeurIPS 2025
Abstract
We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $f_*(\boldsymbol{x}) \propto \sum_{j=1}^{r} λ_j σ\left(\langle \boldsymbol{θ}_j, \boldsymbol{x}\rangle\right)$, $\boldsymbol{x} \sim N(0,\boldsymbol{I}_d)$, $σ$ is the 2nd Hermite polynomial, and $\lbrace\boldsymbol{θ}_j\rbrace_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^β$ for $β\in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients $λ_j\asymp j^{-α}$ for $α\geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.
Summary / 总结
This study investigates the optimization and sample complexity of training a two-layer neural network with quadratic activation in high dimensions, where the data is generated as a sum of orthonormal signal directions weighted by coefficients that follow a power-law decay. The research presents a detailed analysis of the stochastic gradient descent (SGD) dynamics and derives scaling laws for the prediction risk, emphasizing the dependencies on optimization time, sample size, and model width. The analysis combines a precise characterization of the matrix Riccati differential equation with novel matrix monotonicity arguments to ensure convergence guarantees for the infinite-dimensional effective dynamics.
该研究探讨了高维下具有二次激活函数的两层神经网络的优化和样本复杂性。研究集中在宽模型区域,并考虑了第二层系数的幂律衰减。主要发现包括对SGD动态的精确分析以及预测风险的缩放定律,强调了优化时间、样本量和模型宽度的影响。
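The data model in the abstract can be written out as a short sampling routine. This is a sketch under stated assumptions: the second Hermite polynomial is taken in its unnormalized probabilists' form He_2(z) = z^2 - 1, the proportionality constant in f_* is dropped, the coefficients are set exactly to j^(-alpha) rather than merely of that order, and the function name and default sizes are hypothetical.

```python
import numpy as np

def sample_teacher(d=64, r=8, alpha=1.0, n=1000, seed=0):
    """Draw (x, y) pairs with y = sum_j lambda_j * He_2(<theta_j, x>)."""
    rng = np.random.default_rng(seed)
    # Orthonormal signal directions: Q factor of a random d x r Gaussian matrix.
    theta, _ = np.linalg.qr(rng.normal(size=(d, r)))
    lam = np.arange(1, r + 1, dtype=float) ** (-alpha)  # power-law decay lambda_j = j^(-alpha)
    x = rng.normal(size=(n, d))                         # x ~ N(0, I_d)
    z = x @ theta                                       # projections, shape (n, r)
    y = (z ** 2 - 1.0) @ lam                            # He_2(z) = z^2 - 1, weighted sum over j
    return x, y, theta, lam

x, y, theta, lam = sample_teacher()
```

Because the directions are orthonormal and x is standard Gaussian, the projections are independent standard normals, so each He_2 term has mean zero and the targets are centered.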
AMAP Agentic Planning Technical Report
Authors: Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Tucheng Lin, Xin Li, Ning Guo
First: 2025-12-31T16:39:09+00:00 · Latest: 2025-12-31T16:39:09+00:00
Abstract
We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries with a filter ratio of 1:10,000, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and a final RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.
中文标题/摘要
标题:AMAP自主规划技术报告
我们介绍了STAgent,这是一种针对时空理解定制的自主大型语言模型,旨在解决诸如受限兴趣点发现和行程规划等复杂任务。STAgent是一种专门模型,能够在时空场景中与十个不同的工具进行交互,使其能够在复杂推理过程中探索、验证和细化中间步骤。值得注意的是,STAgent有效地保留了其通用能力。我们通过三个关键贡献赋予STAgent这些能力:(1) 一个稳定的工具环境,支持超过十个领域特定工具,实现异步部署和训练;(2) 一个分层数据整理框架,能够从海量数据中识别高质量数据,筛选出高质量查询的比例为1:10,000,强调多样性和难度;以及(3) 一个级联训练方案,从种子SFT阶段开始,作为衡量查询难度的守护者,随后是针对高确定性查询进行微调的第二个SFT阶段,最终利用低确定性数据的强化学习阶段。通过使用Qwen3-30B-A3B初始化以建立强大的SFT基础并利用样本难度的见解,STAgent在TravelBench上表现出色,同时在广泛的一般基准测试中保持其通用能力,从而证明了我们提出的自主模型的有效性。
Summary / 总结
STAgent is an agentic large language model designed for spatio-temporal understanding and complex task solving, such as constrained point-of-interest discovery and itinerary planning. It interacts with ten tools to explore, verify, and refine intermediate steps during complex reasoning. STAgent achieves this through a stable tool environment, a hierarchical data curation framework, and a cascaded training recipe (two SFT stages followed by an RL stage). Initialized from Qwen3-30B-A3B, the model shows promising performance on TravelBench while maintaining its general capabilities across general benchmarks.
STAgent 是一个针对时空理解及复杂任务解决(如受限兴趣点发现和行程规划)的代理型大型语言模型。它通过与十个工具交互来探索、验证和细化推理过程中的步骤。STAgent 通过稳定工具环境、分层数据收集框架和级联训练配方实现这一目标。STAgent 在 TravelBench 上表现出色,并且在各种基准测试中保持了通用能力,这表明所提出模型的有效性。
MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control
Authors: Yongwei Zhang, Yuanzhe Xing, Quan Quan, Zhikun She
First: 2025-12-31T16:36:44+00:00 · Latest: 2025-12-31T16:36:44+00:00
Abstract
Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $λ$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
中文标题/摘要
标题:MSACL:多步演员-评论家学习与李雅普诺夫证书相结合的指数稳定控制
在无模型的强化学习(RL)中实现可证明的稳定性仍然是一个挑战,特别是在探索与严格安全之间的平衡。本文介绍了MSACL框架,该框架通过多步李雅普诺夫证书学习将指数稳定性理论与最大熵RL相结合。与依赖于复杂奖励工程的方法不同,MSACL利用离策略多步数据来学习满足理论稳定性条件的李雅普诺夫证书。通过引入指数稳定性标签(ESL)和$λ$加权聚合机制,该框架有效地平衡了多步学习中的偏差-方差权衡。通过稳定性意识的优势函数指导策略优化,确保学习到的策略促进快速的李雅普诺夫下降。我们在六个基准测试中评估了MSACL,包括稳定化和非线性跟踪任务,展示了其在最先进的基于李雅普诺夫的RL算法中的优越性。MSACL在简单奖励下实现了指数稳定性并快速收敛,同时对不确定性具有显著的鲁棒性,并且能够泛化到未见过的轨迹。敏感性分析确定多步时间 horizon $n=20$ 为多种系统中的稳健默认值。通过将李雅普诺夫理论与离策略演员-评论家框架相结合,MSACL为验证安全的学习控制奠定了基础。源代码和基准环境将公开提供。
Summary / 总结
MSACL is a framework that combines exponential stability theory with maximum entropy reinforcement learning to achieve provable stability in model-free RL. It uses multi-step Lyapunov certificate learning and introduces Exponential Stability Labels to balance bias and variance. MSACL outperforms state-of-the-art Lyapunov-based RL algorithms in six benchmarks, showing exponential stability, rapid convergence, and robustness to uncertainties. The multi-step horizon of 20 is found to be a robust default across different systems.
MSACL 是一种结合了指数稳定性理论和最大熵强化学习的框架,旨在实现无模型自由RL中的可证明稳定性。它利用离策略多步数据学习李亚普诺夫证书,并引入了指数稳定性标签和 $λ$ 加权聚合机制来平衡偏差和方差。MSACL 在六个基准测试中表现出色,展示了指数稳定性、快速收敛以及对不确定性较强的鲁棒性。多步时滞 $n=20$ 被发现适用于不同系统。
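The abstract mentions a $λ$-weighted aggregation over multi-step data but does not spell out the weights. As a minimal sketch, assuming the familiar truncated λ-return weighting (an assumption, not necessarily what MSACL uses), the n-step estimates could be combined as follows. The function names `lambda_weights` and `aggregate` are hypothetical.

```python
def lambda_weights(n, lam):
    """Truncated lambda-return weights for n multi-step estimates.

    Steps k = 1..n-1 get (1 - lam) * lam**(k - 1); the last step gets
    lam**(n - 1), so the weights always sum to 1.
    NOTE: this weighting is an assumption, not taken from the MSACL paper.
    """
    weights = [(1 - lam) * lam ** (k - 1) for k in range(1, n)]
    weights.append(lam ** (n - 1))
    return weights


def aggregate(estimates, lam):
    """Combine per-horizon estimates (index 0 = 1-step) into one value."""
    weights = lambda_weights(len(estimates), lam)
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical usage with the default horizon n = 20 reported in the abstract.
weights_n20 = lambda_weights(20, 0.9)
```

Under this weighting, a small λ leans on short-horizon estimates (lower variance, more bias), while λ close to 1 leans on the longest horizon, which is one way to read the bias-variance trade-off the abstract describes.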
A Geometric Theory of Cognition
Authors: Laha Ale
First: 2025-12-13T07:39:53+00:00 · Latest: 2025-12-31T16:33:03+00:00
Abstract
Human cognition spans perception, memory, intuitive judgment, deliberative reasoning, action selection, and social inference, yet these capacities are often explained through distinct computational theories. Here we present a unified mathematical framework in which diverse cognitive processes emerge from a single geometric principle. We represent the cognitive state as a point on a differentiable manifold endowed with a learned Riemannian metric that encodes representational constraints, computational costs, and structural relations among cognitive variables. A scalar cognitive potential combines predictive accuracy, structural parsimony, task utility, and normative or logical requirements. Cognition unfolds as the Riemannian gradient flow of this potential, providing a universal dynamical law from which a broad range of psychological phenomena arise. Classical dual-process effects--rapid intuitive responses and slower deliberative reasoning--emerge naturally from metric-induced anisotropies that generate intrinsic time-scale separations and geometric phase transitions, without invoking modular or hybrid architectures. We derive analytical conditions for these regimes and demonstrate their behavioural signatures through simulations of canonical cognitive tasks. Together, these results establish a geometric foundation for cognition and suggest guiding principles for the development of more general and human-like artificial intelligence systems.
中文标题/摘要
标题:认知的几何理论
人类认知涵盖了感知、记忆、直觉判断、推理、行动选择和社会推理,但这些能力通常通过不同的计算理论来解释。在这里,我们提出了一种统一的数学框架,其中多种认知过程源自单一的几何原理。我们将认知状态表示为一个流形上的点,该流形上装备了一个学习到的黎曼度量,编码了表征约束、计算成本和认知变量之间的结构关系。一个标量认知势能结合了预测准确性、结构简约性、任务效用和规范或逻辑要求。认知表现为这种势能的黎曼梯度流,提供了一个普遍的动力学定律,从中可以产生广泛的心理现象。经典的双重过程效应——快速的直觉反应和较慢的推理——自然地源自度量诱导的各向异性,产生内在的时间尺度分离和几何相变,而无需引入模块化或混合架构。我们推导了这些状态的分析条件,并通过模拟经典认知任务的行为特征来证明它们。这些结果为认知建立了一个几何基础,并为开发更通用和类人的人工智能系统提供了指导原则。
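The core dynamical law in the abstract, Riemannian gradient flow of a potential, can be illustrated with a toy sketch. Assuming a constant diagonal metric $G$ and a quadratic potential $V(x) = \tfrac{1}{2}\lVert x \rVert^2$ (both chosen here purely for illustration, not taken from the paper), the flow $\dot{x} = -G^{-1}\nabla V(x)$ already shows how an anisotropic metric separates fast and slow time scales.

```python
def riemannian_gradient_flow(x0, grad_v, metric_diag, dt=0.01, steps=500):
    """Euler-integrate dx/dt = -G^{-1} grad V(x) for a constant diagonal metric G.

    metric_diag holds the diagonal entries of G; a large entry makes motion
    along that coordinate "expensive" and therefore slow.
    """
    x = list(x0)
    trajectory = [tuple(x)]
    for _ in range(steps):
        g = grad_v(x)
        x = [xi - dt * gi / mi for xi, gi, mi in zip(x, g, metric_diag)]
        trajectory.append(tuple(x))
    return trajectory


def quadratic_grad(x):
    """Gradient of the illustrative potential V(x) = 0.5 * ||x||^2."""
    return list(x)


# Hypothetical metric: coordinate 0 is cheap (fast), coordinate 1 is costly (slow).
traj = riemannian_gradient_flow([1.0, 1.0], quadratic_grad, metric_diag=[1.0, 100.0])
fast_coord, slow_coord = traj[-1]
```

After the same integration time, the cheap coordinate has almost fully relaxed while the costly one has barely moved. That is a simple analogue of the rapid-intuitive versus slow-deliberative split the abstract attributes to metric-induced anisotropy.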
VIPER: Process-aware Evaluation for Generative Video Reasoning
Authors: Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu
First: 2025-12-31T16:31:59+00:00 · Latest: 2025-12-31T16:31:59+00:00
Comments: Work in progress
Abstract
Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
中文标题/摘要
标题:VIPER:基于过程的生成视频推理评估
近期在视频生成方面的突破展示了新兴的能力,称为连续帧推理(CoF),模型通过生成连续帧来解决复杂任务。尽管这些模型在生成视频推理(GVR)方面显示出潜力,但现有的评估框架通常依赖于单帧评估,这可能导致结果作弊,即模型通过错误的过程得出正确的结论。为了解决这一问题,我们提出了一种基于过程的评估范式。我们引入了VIPER,这是一个涵盖16项任务的综合基准,涉及时间、结构、符号、空间、物理和规划推理。此外,我们提出了过程-结果一致性(POC@r)这一新指标,该指标利用VLM作为评判者并采用分层评分标准,评估中间步骤的有效性和最终结果。我们的实验表明,最先进的视频模型在POC@1.0上的表现仅约为20%,显示出显著的结果作弊。我们进一步探讨了测试时缩放和采样鲁棒性的影响,突显了当前视频生成与真正泛化的视觉推理之间存在的巨大差距。我们的基准将公开发布。
Summary / 总结
The paper addresses the issue of outcome-hacking in Generative Video Reasoning (GVR) models by proposing a process-aware evaluation paradigm. VIPER, a comprehensive benchmark, evaluates 16 tasks across various reasoning types, and introduces POC@r, a new metric that assesses both intermediate steps and final results using a hierarchical rubric. Experiments show that state-of-the-art models achieve only about 20% POC@1.0, indicating significant outcome-hacking. The study also highlights the gap between current video generation and true generalized visual reasoning.
研究旨在通过提出过程感知评估范式来解决生成视频推理(GVR)模型中的结果作弊问题。VIPER是一个新的基准,评估了涵盖多种推理类型的16个任务,并引入了POC@r指标,该指标评估中间步骤的有效性和最终结果。实验表明,最先进的模型在POC@1.0上的得分仅为约20%,表明存在显著的结果作弊。研究还指出了当前视频生成与真正的泛化视觉推理之间的差距。
ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT
Authors: Xinran Gong, Gorkem Durak, Halil Ertugrul Aktas, Vedat Cicek, Jinkui Hao, Ulas Bagci, Nilay S. Shah, Bo Zhou
First: 2025-12-31T16:29:05+00:00 · Latest: 2025-12-31T16:29:05+00:00
Comments: 21 pages, 8 figures
Abstract
Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is often affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Identification of incidental CAC from non-gated chest CT is increasingly considered because it offers an accessible and widely available alternative, but this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.
中文标题/摘要
标题:ProDM:基于合成现实的属性感知渐进扩散模型在非门控胸部CT冠状动脉钙化运动校正中的应用
从胸部CT中获取冠状动脉钙化(CAC)评分是用于分层和细化临床心血管疾病风险估计的成熟工具。CAC量化依赖于钙化病灶的准确勾勒,但常常受到心脏和呼吸运动引入的伪影的影响。心电图门控心脏CT显著减少了运动伪影,但由于门控要求和缺乏保险覆盖,其在人群筛查和常规成像中的应用受到限制。尽管从非门控胸部CT中识别偶然的CAC越来越被认为是可获得且广泛可用的替代方案,但该方法受限于更严重的运动伪影。我们提出了ProDM(属性感知渐进校正扩散模型),这是一种生成扩散框架,可以从非门控CT中恢复无运动的钙化病灶。ProDM 引入了三个关键组件:(1)一种CAC运动模拟数据引擎,直接从心脏门控CT中合成具有多种运动轨迹的现实非门控获取,从而实现无需配对数据的监督训练;(2)一种属性感知的学习策略,通过可微分的钙化一致性损失整合钙化特定的先验知识,以保持病灶完整性;(3)一种渐进校正方案,在扩散步骤中逐步减少伪影,以增强稳定性和钙化准确性。在真实患者数据集上的实验表明,与几个基线相比,ProDM 显著提高了CAC评分准确性、空间病灶保真度和风险分层性能。在真实非门控扫描上的读者研究进一步证实,ProDM 抑制了运动伪影并提高了临床可用性。这些发现突显了渐进、属性感知框架在常规胸部CT成像中可靠CAC量化中的潜力。
Summary / 总结
ProDM is a generative diffusion framework designed to correct motion artifacts in non-gated chest CT scans for accurate coronary artery calcium (CAC) scoring. It includes a motion simulation engine, a calcium-specific learning strategy, and a progressive correction scheme. ProDM significantly improves CAC scoring accuracy and lesion fidelity compared to baselines, and reader studies confirm its effectiveness in reducing motion artifacts and enhancing clinical usability.
ProDM 是一种生成扩散框架,旨在通过纠正非门控胸部 CT 扫描中的运动伪影来实现准确的冠状动脉钙化 (CAC) 评分。它包括一个运动模拟引擎、一种钙化特定的学习策略和一种渐进修正方案。实验表明,ProDM 提高了 CAC 评分的准确性和病灶保真度,优于基线方法,并改善了临床适用性。
RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Authors: Chenji Lu, Zhuo Chen, Hui Zhao, Zhenyi Wang, Pengjie Wang, Jian Xu, Bo Zheng
First: 2025-12-31T16:09:08+00:00 · Latest: 2025-12-31T16:09:08+00:00
Abstract
Search relevance plays a central role in web e-commerce. While large language models (LLMs) have shown significant results on relevance tasks, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics across the industry. To address this limitation, we propose the Rule-Aware benchmark with Image for Relevance assessment (RAIR), a Chinese dataset derived from real-world scenarios. RAIR establishes a standardized framework for relevance assessment and provides a set of universal rules, which form the foundation for standardized evaluation. Additionally, RAIR analyzes the essential capabilities required of current relevance models and introduces a comprehensive dataset consisting of three subsets: (1) a general subset with industry-balanced sampling to evaluate fundamental model competencies; (2) a long-tail hard subset focusing on challenging cases to assess performance limits; and (3) a visual salience subset for evaluating multimodal understanding capabilities. We conducted experiments on RAIR using 14 open- and closed-source models. The results demonstrate that RAIR presents sufficient challenges even for GPT-5, which achieved the best performance. RAIR data are now available, serving as an industry benchmark for relevance assessment while providing new insights into general LLM and Visual Language Model (VLM) evaluation.
中文标题/摘要
标题:RAIR:一种结合长尾和视觉显著性子集的规则意识基准,用于电子商务相关性评估
搜索相关性在网页电子商务中起着核心作用。虽然大型语言模型(LLMs)在相关性任务上取得了显著成果,但现有基准缺乏足够的复杂性,无法进行全面的模型评估,导致行业内缺乏标准化的相关性评估指标。为解决这一局限,我们提出了规则意识基准与图像相关性评估(RAIR),这是一个源自现实场景的中文数据集。RAIR 建立了一个标准化的相关性评估框架,并提供了一套通用规则,为标准化评估奠定了基础。此外,RAIR 分析了当前相关性模型所需的关键能力,并引入了一个综合数据集,包括三个子集:(1)一个行业平衡采样的通用子集,用于评估基本模型能力;(2)一个长尾难题子集,专注于具有挑战性的案例以评估性能极限;(3)一个视觉显著性子集,用于评估多模态理解能力。我们在 RAIR 上使用了 14 个开源和闭源模型进行了实验。结果表明,即使对于表现最佳的 GPT-5,RAIR 也提出了足够的挑战。RAIR 数据现已可用,作为相关性评估的行业基准,同时为通用大语言模型(LLM)和视觉语言模型(VLM)的评估提供了新的见解。
Summary / 总结
The paper introduces RAIR, a benchmark for e-commerce search relevance assessment, addressing the lack of complexity in existing benchmarks. It includes three subsets: a general subset for fundamental model competencies, a long-tail hard subset for challenging cases, and a visual salience subset for multimodal understanding. Experiments on 14 models, including GPT-5, show that RAIR provides significant challenges, even for the most advanced models.
论文提出了RAIR基准数据集,用于电商搜索相关性评估,解决了现有基准缺乏复杂性的问题。RAIR包含三个子集:通用子集、长尾难题子集和视觉显著性子集。对14个模型(包括GPT-5)的实验显示,RAIR提供了显著的挑战,即使是最先进的模型也不例外。该数据集现已可用,作为行业基准并提供了对通用大模型和视觉语言模型评估的新见解。
Iterative Deployment Improves Planning Skills in LLMs
Authors: Augusto B. Corrêa, Yoav Gelberg, Luckeciano C. Melo, Ilia Shumailov, André G. Pereira, Yarin Gal
First: 2025-12-31T16:03:14+00:00 · Latest: 2025-12-31T16:03:14+00:00
Abstract
We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer loop (i.e., not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications. First, for the field of AI safety, the reward function entailed by repeated deployment is not defined explicitly and could have unexpected implications for the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
中文标题/摘要
标题:迭代部署提升大语言模型的规划技能
我们展示了对大型语言模型(LLM)进行迭代部署,每个模型都在用户精心挑选的数据上进行微调(这些数据来自前一个模型的部署),可以显著改变模型的性质。通过在各种规划领域进行测试,我们观察到规划技能有了显著提高,后期模型通过发现比初始模型更长的规划方案展示了涌现性泛化。我们还提供了理论分析,表明迭代部署实际上在外循环中实现了强化学习(RL)训练(即,不是作为有意图的模型训练的一部分),并隐含了一个奖励函数。与RL的联系有两个重要含义:首先,对于AI安全领域而言,由于反复部署所隐含的奖励函数没有明确定义,可能会对未来的模型部署产生意想不到的影响。其次,这里突出的机制可以被视为一种替代的训练方案,依赖于数据的挑选而非明确的奖励。
Summary / 总结
The research aims to enhance the planning skills of large language models (LLMs) through iterative deployment. Each model is fine-tuned on data curated by users from the previous model's deployment, leading to significant improvements in planning abilities. Later models demonstrate emergent generalization by generating much longer plans than the initial models. Theoretical analysis suggests that this iterative deployment mechanism effectively implements reinforcement learning (RL) training in an outer-loop manner, with an implicit reward function. This has implications for AI safety and offers an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.
研究展示了通过迭代部署大型语言模型,每个模型在用户根据前一个模型部署的数据进行微调后,显著提升了模型的规划能力。后期模型展示了更好的泛化能力,能够发现比初始模型更长的计划。理论分析表明,这种迭代部署机制实际上在外循环中实现了强化学习,带有隐含的奖励函数,这对人工智能安全有重要影响,并提供了一种不同于显式强化学习的数据策展替代训练机制。
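The claim that repeated deploy, curate, and fine-tune cycles act like RL with an implicit reward can be made concrete with a toy sketch. It assumes an idealized setting that is not taken from the paper: a policy over three hypothetical plan types, a fixed per-type probability that users keep a generated plan (`accept_prob`), and fine-tuning that exactly fits the curated data (the infinite-sample limit of maximum likelihood).

```python
def deploy_and_curate(policy, accept_prob):
    """One deployment generation in the idealized infinite-sample limit.

    Samples are drawn from `policy`, users keep each sample with probability
    accept_prob[action], and the next model fits the kept data exactly, so
    the new policy is proportional to policy * accept_prob.
    """
    weighted = {a: p * accept_prob[a] for a, p in policy.items()}
    total = sum(weighted.values())
    return {a: w / total for a, w in weighted.items()}


# Hypothetical initial model: mostly produces short plans.
policy = {"short_plan": 0.70, "medium_plan": 0.25, "long_plan": 0.05}
# Hypothetical curation behaviour: users keep long, valid plans more often.
accept_prob = {"short_plan": 0.2, "medium_plan": 0.5, "long_plan": 0.9}

history = [policy]
for _ in range(5):
    policy = deploy_and_curate(policy, accept_prob)
    history.append(policy)
```

No reward is ever written down, yet the acceptance probabilities play the role of one: each generation reweights the policy toward whatever curation favors. This illustrates why an implicit reward of this kind may be hard to audit.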
PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects
Authors: Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
First: 2025-12-28T15:52:58+00:00 · Latest: 2025-12-31T15:59:13+00:00
Abstract
Six-degree-of-freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in scenarios with fast-moving objects. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically for high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
中文标题/摘要
标题:PoseStreamer:一种针对未见移动物体的6DoF姿态估计多模态框架
六自由度(6DoF)姿态估计对于新型物体是一个关键任务,在计算机视觉中,但在高速和低光照场景中,标准RGB相机由于运动模糊而表现不佳。尽管事件相机由于其高时间分辨率提供了有希望的解决方案,但当前的6DoF姿态估计方法在高速移动物体场景中通常表现不佳。为了解决这一差距,我们提出PoseStreamer,一种专门针对高速移动场景的鲁棒多模态6DoF姿态估计框架。我们的方法整合了三个核心组件:自适应姿态记忆队列,利用历史方向线索以实现时间一致性;对象为中心的二维跟踪器,提供强大的二维先验以增强三维中心召回率;以及沿相机射线的几何细化的射线姿态滤波器。此外,我们引入了MoCapCube6D,这是一种新型多模态数据集,用于在快速运动下评估性能。广泛的实验表明,PoseStreamer不仅在高速移动场景中实现了更高的准确性,而且作为无模板框架,具有很强的泛化能力,适用于未见的移动物体。
Summary / 总结
PoseStreamer is a multi-modal framework for 6DoF pose estimation of unseen moving objects, addressing the limitations of standard RGB cameras in high-speed and low-light scenarios. It integrates an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter to improve temporal consistency, 3D center recall (via strong 2D priors), and geometric refinement along camera rays, respectively. The paper also introduces MoCapCube6D, a multi-modal dataset for benchmarking under rapid motion. Experiments show that PoseStreamer achieves superior accuracy in high-speed moving scenarios and generalizes well to unseen objects as a template-free framework.
PoseStreamer 是一个用于未见过的移动物体 6DoF 姿态估计的多模态框架,旨在解决标准 RGB 相机在高速和低光场景下的局限性。该框架结合了自适应姿态记忆队列、对象中心的 2D 跟踪器和沿相机射线的几何细化滤波器,以增强时间一致性、2D 到 3D 的转换和几何细化。实验表明,PoseStreamer 在高速移动场景中表现出色,并且对于未见过的物体具有很强的通用性。
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
First: 2025-02-20T18:01:41+00:00 · Latest: 2025-12-31T15:43:05+00:00
Comments: Accepted and to appear in IJCNLP-AACL 2025
Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
中文标题/摘要
标题:ReVision:一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM
随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口,高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型(VLMs)支持多模态交互,通常依赖于基于云的处理,这引发了关于(1)视觉隐私问题,即传输敏感的视觉数据到服务器,以及(2)其有限的实时、设备端可用性的问题。本文探讨了视觉指令重写这一新颖的方法,即将多模态指令转换为纯文本命令,允许轻量级设备端指令重写VLM(参数量250M)与现有对话AI系统的无缝集成,增强视觉数据隐私。为此,我们提供了一个涵盖14个领域的超过39,000个示例的数据集,并开发了一个紧凑的VLM,该模型在图像字幕数据集上进行预训练,并针对指令重写进行了微调。实验结果通过NLG指标(如BLEU、METEOR和ROUGE)以及语义解析分析评估,表明即使是最小量化版本的模型(存储占用量<500MB)也能实现有效的指令重写,从而实现以隐私为中心的多模态AI应用。
Summary / 总结
This paper addresses the need for efficient and privacy-preserving multimodal interaction by proposing Visual Instruction Rewriting, which transforms visual instructions into text-only commands. The authors present a dataset of over 39,000 examples and develop a compact vision-language model (250M parameters) pretrained on image captioning and fine-tuned for instruction rewriting. Experimental results show that even a quantized version of the model can achieve effective instruction rewriting, enhancing privacy in multimodal AI applications.
本文针对现代AR和VR技术中高效且保护隐私的多模态交互需求,提出了ReVision数据集和视觉语言模型,用于视觉指令重写,将多模态指令转换为纯文本命令。该模型基于图像描述预训练,并针对指令重写进行了微调,可以在设备上运行,增强隐私性和实时可用性。实验结果表明,即使量化后的模型也能有效重写指令,展示了其在隐私导向的多模态AI应用中的潜力。
HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition
Authors: Wang Lu, Yao Zhu, Jindong Wang
Venue: KDD 2026
First: 2025-12-11T16:52:50+00:00 · Latest: 2025-12-31T15:41:01+00:00
Comments: Accepted by KDD 2026
Abstract
Sensor-based human activity recognition (HAR) mines activity patterns from time-series sensory data. In realistic scenarios, variations across individuals, devices, environments, and time introduce significant distributional shifts for the same activities. Recent efforts attempt to solve this challenge by applying or adapting existing out-of-distribution (OOD) algorithms, but only in certain distribution shift scenarios (e.g., cross-device or cross-position), lacking comprehensive insights into the effectiveness of these algorithms. For instance, is OOD necessary for HAR? Which OOD algorithm performs the best? In this paper, we fill this gap by proposing HAROOD, a comprehensive benchmark for HAR in OOD settings. We define 4 OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time, and build a testbed covering 6 datasets, 16 comparative methods (implemented with CNN-based and Transformer-based architectures), and two model selection protocols. Then, we conduct extensive experiments and present several findings for future research, e.g., no single method consistently outperforms others, highlighting substantial opportunity for advancement. Our codebase is highly modular and easy to extend for new datasets, algorithms, comparisons, and analysis, in the hope of facilitating research in OOD-based HAR. Our implementation is released and can be found at https://github.com/AIFrontierLab/HAROOD.
中文标题/摘要
标题:HAROOD:基于传感器的人体活动识别离分布泛化基准
基于传感器的人体活动识别(HAR)从时间序列传感数据中挖掘活动模式。在现实场景中,个体、设备、环境和时间的变化会对同一活动引入显著的分布变化。最近的努力试图通过应用或适应现有的离分布(OOD)算法来解决这一挑战,但仅限于某些分布变化场景(例如,跨设备或跨位置),缺乏对这些算法有效性的全面了解。例如,HAR是否需要离分布?哪种离分布算法表现最佳?在本文中,我们通过提出HAROOD,一个全面的离分布HAR基准来填补这一空白。我们定义了4种离分布场景:跨个体、跨位置、跨数据集和跨时间,并构建了一个涵盖6个数据集、16种比较方法(使用CNN和Transformer架构实现)和两种模型选择协议的测试平台。然后,我们进行了广泛的实验,并提出了若干未来研究的发现,例如,没有单一方法始终优于其他方法,突显了巨大的改进机会。我们的代码库高度模块化,易于扩展以添加新数据集、算法、比较和分析,以促进基于离分布的HAR研究。我们的实现已发布,可从https://github.com/AIFrontierLab/HAROOD获取。
Summary / 总结
HAROOD is a benchmark for evaluating out-of-distribution generalization in sensor-based human activity recognition (HAR). It defines four OOD scenarios: cross-person, cross-position, cross-dataset, and cross-time. The study tests 16 comparative methods using CNN and Transformer architectures across six datasets and two model selection protocols. Key findings include no single method consistently outperforming others, indicating significant room for improvement in OOD generalization for HAR.
该论文提出了HAROOD基准,用于评估传感器基于的人类活动识别中的离分布泛化能力。定义了四种离分布场景:跨个体、跨位置、跨数据集和跨时间,并使用CNN和Transformer架构在六个数据集上评估了16种比较方法。研究发现,没有一种方法能够始终优于其他方法,表明在HAR的离分布泛化方面存在巨大的改进空间。
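To make the cross-person scenario concrete, the following minimal sketch holds out entire subjects so that no person appears in both training and test data. The record layout (dicts with `subject`, `signal`, and `label` keys) and the function name are assumptions for illustration and are not taken from the HAROOD codebase.

```python
def cross_person_split(samples, held_out_subjects):
    """Split records so that held-out subjects appear only in the test set."""
    held_out = set(held_out_subjects)
    train = [s for s in samples if s["subject"] not in held_out]
    test = [s for s in samples if s["subject"] in held_out]
    return train, test


# Hypothetical toy records: one sensor window per entry.
samples = [
    {"subject": "p1", "signal": [0.1, 0.2], "label": "walk"},
    {"subject": "p1", "signal": [0.9, 0.8], "label": "run"},
    {"subject": "p2", "signal": [0.2, 0.1], "label": "walk"},
    {"subject": "p3", "signal": [0.7, 0.9], "label": "run"},
]
train_set, test_set = cross_person_split(samples, held_out_subjects=["p3"])
```

The same pattern would extend to the other scenarios by grouping on a different key (sensor position, source dataset, or recording time) instead of the subject.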
Are First-Order Diffusion Samplers Really Slower? A Fast Forward-Value Approach
Authors: Yuchen Jiao, Na Li, Changxiao Cai, Gen Li
First: 2025-12-31T15:35:53+00:00 · Latest: 2025-12-31T15:35:53+00:00
Abstract
Higher-order ODE solvers have become a standard tool for accelerating diffusion probabilistic model (DPM) sampling, motivating the widespread view that first-order methods are inherently slower and that increasing discretization order is the primary path to faster generation. This paper challenges this belief and revisits acceleration from a complementary angle: beyond solver order, the placement of DPM evaluations along the reverse-time dynamics can substantially affect sampling accuracy in the low-neural function evaluation (NFE) regime.
We propose a novel training-free, first-order sampler whose leading discretization error has the opposite sign to that of DDIM. Algorithmically, the method approximates the forward-value evaluation via a cheap one-step lookahead predictor. We provide theoretical guarantees showing that the resulting sampler provably approximates the ideal forward-value trajectory while retaining first-order convergence. Empirically, across standard image generation benchmarks (CIFAR-10, ImageNet, FFHQ, and LSUN), the proposed sampler consistently improves sample quality under the same NFE budget and can be competitive with, and sometimes outperform, state-of-the-art higher-order samplers. Overall, the results suggest that the placement of DPM evaluations provides an additional and largely independent design angle for accelerating diffusion sampling.
中文标题/摘要
标题:一阶扩散采样器真的更慢吗?一种快速的前向值方法
高阶ODE求解器已成为加速扩散概率模型(DPM)采样的标准工具,这促使人们普遍认为一阶方法本质上更慢,并且提高离散化阶数是实现更快生成的主要途径。本文挑战了这一观点,并从互补的角度重新审视加速:除了求解器阶数之外,DPM评估在反向时间动力学中的位置会在低神经网络评估次数(NFE)区间内显著影响采样精度。
我们提出了一种新的无需训练的一阶采样器,其主要离散化误差与DDIM相反。从算法上讲,该方法通过廉价的一步前瞻预测器近似前向值评估。我们提供了理论保证,表明该采样器能够证明逼近理想的前向值轨迹,同时保持一阶收敛性。实验上,在标准图像生成基准(CIFAR-10、ImageNet、FFHQ和LSUN)上,所提出的采样器在相同的NFE预算下始终能提高样本质量,并且有时可以与最先进的高阶采样器竞争。总体而言,结果表明,DPM评估的位置提供了加速扩散采样的另一个独立设计角度。
Summary / 总结
This paper challenges the belief that first-order diffusion samplers are inherently slower than higher-order methods. It proposes a novel first-order sampler that approximates forward-value evaluation via a cheap one-step lookahead predictor, showing theoretical guarantees of first-order convergence and empirical improvements in sample quality across various benchmarks. The results suggest that the placement of DPM evaluations can provide an additional design angle for accelerating diffusion sampling.
本文挑战了第一阶扩散采样器比高阶方法更慢的传统观点。提出了一种新型的第一阶采样器,通过廉价的一步前瞻预测来近似前向值评估,实现了第一阶收敛,并提供了理论保证。在标准图像生成基准上的实验表明,该采样器在相同的神经函数评估预算下始终能提高样本质量,并且有时能超越最先进的高阶采样器。
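For reference, the first-order baseline that the abstract compares against, deterministic DDIM, can be written in a few lines. This sketch shows only the standard DDIM update (with no added noise) on scalars. It is not the paper's forward-value sampler, whose lookahead predictor the abstract does not specify. The argument names are illustrative.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from noise level t to the previous level.

    alpha_bar_* are cumulative signal coefficients in (0, 1]; eps_pred is the
    network's noise prediction at x_t (supplied directly here, no model).
    """
    # Estimate of the clean sample implied by the noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) level.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Sanity check: with the exact noise, the step lands on the exact noised sample.
x0, eps = 2.0, -0.5
a_t, a_prev = 0.5, 0.8
x_t = math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)
```

Each call uses one network evaluation at the current point. The abstract's argument is that where this evaluation is placed along the reverse-time trajectory matters in addition to the order of the update.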
Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection
Authors: Bartłomiej Olber, Jakub Winter, Paweł Wawrzyński, Andrii Gamalii, Daniel Górniak, Marcin Łojek, Robert Nowak, Krystian Radlak
First: 2025-12-31T15:26:09+00:00 · Latest: 2025-12-31T15:26:09+00:00
Abstract
3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain, provided they are correctly selected. The proposed approach requires a very small annotation budget and, when combined with post-training techniques inspired by continual learning, prevents weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.
中文标题/摘要
标题:半监督多样性感知领域适应性调整方法在3D物体检测中的应用
3D物体检测器是自主驾驶感知系统中的基本组件。尽管这些检测器在标准的自主驾驶基准测试中表现出色,但在不同领域中的泛化能力却常常不足——例如,一个在美国训练的模型可能在亚洲或欧洲地区表现不佳。本文提出了一种基于神经元激活模式的激光雷达领域适应方法,表明通过正确选择一小部分具有代表性和多样性的目标域样本进行标注,可以实现最先进的性能。所提出的方法需要非常小的标注预算,并且当与受持续学习启发的后训练技术结合使用时,可以防止权重从原始模型中漂移。实证评估表明,所提出的领域适应方法优于线性探针和最先进的领域适应技术。
Summary / 总结
This paper addresses the challenge of domain adaptation in 3D object detection for autonomous vehicles. It introduces a semi-supervised method that uses neuron activation patterns to adapt models trained in one domain to another, requiring minimal annotations from the target domain. The method combines these annotations with post-training techniques to prevent model drift, achieving superior performance compared to linear probing and other state-of-the-art techniques without extensive labeling efforts.
该论文针对自动驾驶中3D物体检测的领域适应问题,提出了一种半监督方法,利用神经元激活模式来适应从一个领域到另一个领域的模型,仅需少量目标领域标注样本。该方法结合了后续训练技巧以防止模型漂移,实验结果表明其性能优于线性探针和其他最先进的领域适应技术,且无需大量标注工作。
Frequent subgraph-based persistent homology for graph classification
Authors: Xinyang Chen, Amaël Broustet, Guoting Chen
First: 2025-12-31T15:21:15+00:00 · Latest: 2025-12-31T15:21:15+00:00
Comments: Preprint. 18 pages, 10 figures
Abstract
Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.
中文标题/摘要
标题:基于频繁子图的持久同调法用于图分类
持久同调(PH)最近已成为提取拓扑特征的强大工具。将PH集成到机器学习和深度学习模型中增强了拓扑意识和可解释性。然而,大多数图上的PH方法依赖于有限的滤波器集,如度数基或权重基滤波器,这忽略了数据集中反复出现的信息,从而限制了表达能力。在本文中,我们提出了一种新的图滤波器,称为频繁子图滤波器(FSF),该滤波器源自频繁子图并产生稳定且信息丰富的基于频率的持久同调(FPH)特征。我们研究了FSF的理论性质并提供了证明和实验验证。除了持久同调本身,我们还介绍了两种图分类方法:基于FPH的机器学习模型(FPH-ML)和将FPH与图神经网络(FPH-GNNs)结合的混合框架,以增强拓扑感知的图表示学习。我们的框架将频繁子图挖掘与拓扑数据分析相结合,提供了拓扑感知特征提取的新视角。实验结果表明,FPH-ML在与核基和度数基滤波器方法相比时,实现了竞争力或更优的准确性。当集成到图神经网络中时,FPH在基准测试中相对性能提高了0.4%到21%,并在GCN和GIN骨干网络上提高了最高8.2个百分点。
Summary / 总结
This work introduces Frequent Subgraph Filtration (FSF), a graph filtration derived from frequent subgraphs that yields frequency-based persistent homology (FPH) features and captures recurring patterns that degree-based or weight-based filtrations overlook. FPH is used in two graph classification approaches: a machine learning model (FPH-ML) and a hybrid framework with graph neural networks (FPH-GNNs). Experiments show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods, and that integrating FPH into GNNs yields relative performance gains of 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones.
该研究提出了用于图的持久同调(PH)的频繁子图过滤(FSF)方法,以捕捉重复的子图信息,增强拓扑感知特征提取。该方法被集成到机器学习(FPH-ML)和图神经网络(FPH-GNNs)中进行图分类。实验表明,FPH-ML在与核基和度基方法的比较中表现出更优或相当的准确性,而FPH-GNNs在GCN和GIN骨干网络上的性能提升高达8.2个百分点。
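The FSF results quote both relative gains (in percent) and absolute gains (in percentage points), which are easy to conflate. The short sketch below shows how the two differ. The accuracy values are hypothetical and are not taken from the paper.

```python
def absolute_gain_pp(baseline_acc: float, new_acc: float) -> float:
    """Absolute gain in percentage points; both accuracies are given in percent."""
    return new_acc - baseline_acc


def relative_gain_percent(baseline_acc: float, new_acc: float) -> float:
    """Relative gain in percent of the baseline accuracy."""
    return 100.0 * (new_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies for illustration only (not results from the paper).
baseline, with_fph = 70.0, 73.5
print(absolute_gain_pp(baseline, with_fph))       # 3.5 percentage points
print(relative_gain_percent(baseline, with_fph))  # 5.0 percent relative gain
```

The same absolute gain corresponds to a larger relative gain when the baseline is lower, which is why the two figures reported for FPH-GNNs need not come from the same benchmark.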
AI-Driven Cloud Resource Optimization for Multi-Cluster Environments
Authors: Vinoth Punniyamoorthy, Akash Kumar Agarwal, Bikesh Kumar, Abhirup Mazumder, Kabilan Kannan, Sumit Saha
First: 2025-12-31T15:15:46+00:00 · Latest: 2025-12-31T15:15:46+00:00
Abstract
Modern cloud-native systems increasingly rely on multi-cluster deployments to support scalability, resilience, and geographic distribution. However, existing resource management approaches remain largely reactive and cluster-centric, limiting their ability to optimize system-wide behavior under dynamic workloads. These limitations result in inefficient resource utilization, delayed adaptation, and increased operational overhead across distributed environments. This paper presents an AI-driven framework for adaptive resource optimization in multi-cluster cloud systems. The proposed approach integrates predictive learning, policy-aware decision-making, and continuous feedback to enable proactive and coordinated resource management across clusters. By analyzing cross-cluster telemetry and historical execution patterns, the framework dynamically adjusts resource allocation to balance performance, cost, and reliability objectives. A prototype implementation demonstrates improved resource efficiency, faster stabilization during workload fluctuations, and reduced performance variability compared to conventional reactive approaches. The results highlight the effectiveness of intelligent, self-adaptive infrastructure management as a key enabler for scalable and resilient cloud platforms.
中文标题/摘要
标题:多集群环境中的AI驱动云资源优化
现代云原生系统越来越多地依赖多集群部署以支持可扩展性、弹性和地理分布。然而,现有的资源管理方法仍然主要具有反应性和集群中心化的特点,限制了它们在动态工作负载下优化系统级行为的能力。这些限制导致资源利用效率低下、适应延迟和分布式环境中的操作开销增加。本文提出了一种用于多集群云系统中自适应资源优化的AI驱动框架。所提出的方法将预测学习、策略感知决策和持续反馈相结合,以实现跨集群的主动和协调资源管理。通过分析跨集群遥测数据和历史执行模式,该框架动态调整资源分配以平衡性能、成本和可靠性目标。原型实现表明,与传统的反应性方法相比,该方法在资源效率、工作负载波动期间的快速稳定性和性能变异性方面有所改进。结果突显了智能、自适应基础设施管理作为可扩展和弹性云平台的关键使能器的有效性。
Summary / 总结
This paper addresses reactive, cluster-centric resource management in multi-cluster cloud systems by proposing an AI-driven framework. The framework integrates predictive learning, policy-aware decision-making, and continuous feedback to enable proactive, coordinated resource management across clusters. A prototype implementation shows improved resource efficiency, faster stabilization during workload fluctuations, and reduced performance variability compared to conventional reactive approaches.
研究旨在通过提出一种基于AI的框架来解决多集群云系统中的资源管理效率问题。该框架利用预测学习、策略感知决策和持续反馈来跨集群优化资源分配。实验结果表明,与传统反应式方法相比,该框架能够提高资源效率、更快地在工作负载变化时稳定系统,并减少性能波动。
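The abstract of the multi-cluster resource optimization paper names three ingredients (predictive learning, policy-aware decisions, continuous feedback) without giving details. The following is a minimal sketch of how such a loop could look for a single cluster. The exponential-smoothing forecaster, the `Policy` fields, and all function names are assumptions made for illustration, not the authors' design.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Policy:
    """Hypothetical per-cluster policy limits (illustrative names)."""
    min_replicas: int
    max_replicas: int
    capacity_per_replica: float  # load units one replica can serve
    headroom: float = 0.2        # extra capacity kept as a safety margin


def forecast_next(history: Sequence[float], alpha: float = 0.5) -> float:
    """Predict the next load with simple exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def feedback_bias(previous_bias: float, predicted: float,
                  observed: float, gain: float = 0.5) -> float:
    """Shift the bias toward the most recent forecast error."""
    return previous_bias + gain * (observed - predicted)


def plan_replicas(history: Sequence[float], policy: Policy,
                  bias: float = 0.0) -> int:
    """Turn a bias-corrected forecast into a replica count within policy bounds."""
    predicted = forecast_next(history) + bias
    needed = math.ceil(predicted * (1.0 + policy.headroom)
                       / policy.capacity_per_replica)
    return max(policy.min_replicas, min(policy.max_replicas, needed))


policy = Policy(min_replicas=1, max_replicas=10, capacity_per_replica=50.0)
print(plan_replicas([100.0, 100.0, 100.0], policy))  # 3
```

A coordinated multi-cluster version would run this per cluster and then rebalance across clusters; that step is omitted here because the abstract does not describe it.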
FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation
Authors: Zichen Tang, Haihong E, Rongjin Li, Jiacheng Liu, Linwei Jia, Zhuodi Hao, Zhongjun Yang, Yuanze Li, Haolin Tian, Xinyi Hu, Peizhi Zhao, Yuan Liu, Zhengyu Wang, Xianghe Wang, Yiling Huang, Xueyuan Lin, Ruofei Bai, Zijian Xie, Qian Huang, Ruining Cao, Haocheng Gao
Venue: AAAI
First: 2025-12-31T15:00:03+00:00 · Latest: 2025-12-31T15:00:03+00:00
Comments: Accepted by AAAI-26 Main Track
Abstract
We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.
中文标题/摘要
标题:FinMMDocR:基于情景意识、文档理解和多步计算的金融多模态推理基准
我们介绍了FinMMDocR,这是一种新的双语多模态基准,用于评估多模态大型语言模型(MLLMs)在现实世界金融数值推理中的表现。与现有基准相比,我们的工作带来了三项重大进步。(1) 情景意识:1200个专家标注的问题中有57.9%包含12种隐含的金融情景(例如,投资组合管理),挑战模型进行基于假设的专家级推理;(2) 文档理解:837份中英文文档涵盖9种类型(例如,公司研究报告),平均每份50.8页,包含丰富的视觉元素,显著超越现有基准在金融文档的广度和深度上;(3) 多步计算:问题平均需要11步推理(5.3步提取+5.7步计算步骤),其中65.0%需要跨页证据(平均2.4页)。表现最好的MLLM仅达到58.0%的准确率,不同的检索增强生成(RAG)方法在该任务上表现出显著的性能差异。我们期望FinMMDocR能够推动MLLMs和增强推理方法在复杂多模态推理任务中的改进。
Summary / 总结
FinMMDocR is a new bilingual multimodal benchmark for evaluating MLLMs on real-world financial numerical reasoning, featuring scenario awareness, document understanding, and multi-step computation. It comprises 1,200 expert-annotated problems, 57.9% of which involve implicit financial scenarios, and 837 Chinese/English documents across 9 types that average 50.8 pages with rich visual elements. Problems require 11 reasoning steps on average, and 65.0% need cross-page evidence. The best MLLM achieves only 58.0% accuracy, highlighting the room for improvement in MLLMs on complex financial reasoning tasks.
FinMMDocR 是一个新颖的双语多模态基准,用于评估 MLLMs 在现实世界金融推理中的表现,包含情景意识、文档理解和多步计算。它包括 1,200 个专家标注的问题,其中 57.9% 涉及隐含的金融情景,837 份文档涵盖 9 种类型,平均 50.8 页,并且多步推理任务平均需要 11 步。最佳 MLLM 的准确率仅为 58.0%,突显了在复杂多模态推理任务中改进 MLLMs 的必要性。
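The headline statistics in the FinMMDocR abstract can be turned into approximate absolute counts with a few lines of arithmetic. The sketch below uses only the figures quoted in the abstract. Because the percentages and averages are rounded in the source, the derived counts are estimates, not reported values.

```python
# Figures quoted in the FinMMDocR abstract.
total_problems = 1200
scenario_share = 0.579       # problems with implicit financial scenarios
cross_page_share = 0.650     # problems requiring cross-page evidence
num_documents = 837
avg_pages_per_document = 50.8
extraction_steps, calculation_steps = 5.3, 5.7

# Derived estimates (approximate, since the source figures are rounded).
scenario_problems = round(total_problems * scenario_share)
cross_page_problems = round(total_problems * cross_page_share)
total_pages = round(num_documents * avg_pages_per_document)
avg_steps = round(extraction_steps + calculation_steps, 1)

print(scenario_problems)    # about 695 problems
print(cross_page_problems)  # about 780 problems
print(total_pages)          # about 42,520 pages
print(avg_steps)            # 11.0 steps, matching the stated average
```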
PRISM: A hierarchical multiscale approach for time series forecasting
Authors: Zihao Chen, Alexandre Andre, Wenrui Ma, Ian Knight, Sergey Shuvaev, Eva Dyer
First: 2025-12-31T14:51:12+00:00 · Latest: 2025-12-31T14:51:12+00:00
Abstract
Forecasting is critical in areas such as finance, biology, and healthcare. Despite the progress in the field, making accurate forecasts remains challenging because real-world time series contain global trends, local fine-grained structure, and features on multiple scales in between. Here, we present a new forecasting method, PRISM (Partitioned Representation for Iterative Sequence Modeling), that addresses this challenge through a learnable tree-based partitioning of the signal. At the root of the tree, a global representation captures coarse trends in the signal, while recursive splits reveal increasingly localized views of the signal. At each level of the tree, data are projected onto a time-frequency basis (e.g., wavelets or exponential moving averages) to extract scale-specific features, which are then aggregated across the hierarchy. This design allows the model to jointly capture global structure and local dynamics of the signal, enabling accurate forecasting. Experiments across benchmark datasets show that our method outperforms state-of-the-art methods for forecasting. Overall, these results demonstrate that our hierarchical approach provides a lightweight and flexible framework for forecasting multivariate time series. The code is available at https://github.com/nerdslab/prism.
中文标题/摘要
标题:PRISM:一种分层多尺度方法用于时间序列预测
预测在金融、生物学和医疗保健等领域至关重要。尽管该领域取得了进展,但准确预测仍然具有挑战性,因为真实世界的时间序列包含全局趋势、局部精细结构以及介于两者之间的多种尺度特征。为此,我们提出了一种新的预测方法——PRISM(Partitioned Representation for Iterative Sequence Modeling),该方法通过可学习的树状分割信号来应对这一挑战。树的根节点捕获信号中的粗略趋势,而递归分割则揭示信号的越来越局部化的视图。在树的每一层,数据被投影到时间-频率基底(例如小波或指数移动平均)上,以提取特定尺度的特征,然后在层次结构中进行聚合。这种设计使模型能够同时捕捉信号的全局结构和局部动态,从而实现准确的预测。在基准数据集上的实验表明,我们的方法在预测方面优于最先进的方法。总体而言,这些结果表明,我们的分层方法为多变量时间序列预测提供了一个轻量级且灵活的框架。代码可在https://github.com/nerdslab/prism/ 获取。
Summary / 总结
The paper introduces PRISM, a hierarchical multiscale approach for time series forecasting. It addresses the challenge of capturing both global trends and local fine-grained structures by using a learnable tree-based partitioning of the signal. Experiments show that PRISM outperforms state-of-the-art methods on benchmark datasets, demonstrating its effectiveness in forecasting multivariate time series.
PRISM 是一种用于时间序列预测的分层多尺度方法,旨在捕捉真实世界时间序列中的全局趋势和局部精细结构。它通过可学习的树状分割来揭示信号的越来越局部化的视图,并将数据投影到时频基础上以提取特定尺度的特征。实验结果显示,PRISM 在各种基准数据集上的预测准确性优于现有最先进的方法。
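The tree-based idea in the PRISM abstract (a global view at the root, recursive splits for local views, scale-specific features per node, aggregation across the hierarchy) can be illustrated with a small sketch. This is not the authors' implementation: it assumes fixed midpoint splits instead of learnable partitions, uses the final value of an exponential moving average as the only feature, and aggregates by plain concatenation. All function names are hypothetical.

```python
from typing import List, Sequence


def ema_last(segment: Sequence[float], alpha: float) -> float:
    """Final value of an exponential moving average over the segment."""
    level = segment[0]
    for value in segment[1:]:
        level = alpha * value + (1.0 - alpha) * level
    return level


def hierarchical_features(signal: Sequence[float], depth: int,
                          alphas: Sequence[float] = (0.1, 0.5)) -> List[float]:
    """Concatenate EMA features from every node of a binary partition tree.

    Level 0 is the whole signal (global view); level d holds 2**d segments.
    """
    if len(signal) < 2 ** depth:
        raise ValueError("signal too short for the requested depth")
    features: List[float] = []
    segments = [list(signal)]
    for level in range(depth + 1):
        for segment in segments:
            features.extend(ema_last(segment, a) for a in alphas)
        if level < depth:
            # Fixed midpoint split (assumption; the paper learns the partition).
            halves = []
            for segment in segments:
                mid = len(segment) // 2
                halves.extend([segment[:mid], segment[mid:]])
            segments = halves
    return features


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
features = hierarchical_features(signal, depth=2)
print(len(features))  # 14: 7 tree nodes times 2 smoothing factors
```

A forecasting head would consume the concatenated features; that part is left out because the abstract does not specify it.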
Large Multimodal Models for Low-Resource Languages: A Survey
Authors: Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
First: 2025-02-08T13:29:44+00:00 · Latest: 2025-12-31T14:45:06+00:00
Abstract
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
中文标题/摘要
标题:低资源语言的大规模多模态模型:综述
在本综述中,我们系统地分析了用于适应低资源语言(LR)的大规模多模态模型(LMMs)的技术,考察了从视觉增强和数据生成到跨模态转移和融合策略的各种方法。通过对96种LR语言的117项研究进行全面分析,我们识别出研究人员在应对数据和计算资源有限的挑战时的关键模式。我们将工作分为资源导向和方法导向两类,并进一步细分为相关子类别。我们按性能和效率比较了方法导向类别的贡献,讨论了代表性研究的优势和局限性。我们发现,视觉信息通常在LR环境中提高模型性能方面起到了关键桥梁作用,但在幻觉抑制和计算效率等方面仍面临重大挑战。总之,我们为研究人员提供了对当前使LMMs更适用于LR(未充分研究)语言的方法和剩余挑战的清晰理解。我们还通过以下链接提供了开源仓库:https://github.com/marianlupascu/LMM4LRL-Survey。
Summary / 总结
This survey analyzes techniques for adapting large multimodal models (LMMs) to low-resource languages, examining visual enhancement, data creation, cross-modal transfer, and fusion strategies. Through a comprehensive analysis of 117 studies across 96 low-resource languages, the authors identify key patterns and categorize works into resource-oriented and method-oriented contributions, comparing the latter in terms of performance and efficiency. They find that visual information often serves as a crucial bridge for improving model performance in low-resource settings, while significant challenges remain in areas such as hallucination mitigation and computational efficiency. The survey gives researchers a clear picture of current approaches and remaining challenges in making LMMs accessible to speakers of low-resource languages.
本文综述了将大型多模态模型应用于低资源语言的技术,研究了视觉增强、数据创建、跨模态转移和融合策略。通过对96种低资源语言的117项研究进行全面分析,作者识别了在应对数据和资源限制方面的关键模式。他们发现视觉信息显著提高了模型性能,但指出在幻觉抑制和计算效率等方面仍存在挑战。综述为研究人员提供了当前方法和剩余挑战的清晰理解,使大型多模态模型更适用于低资源语言的使用者。
Semi-Automated Data Annotation in Multisensor Datasets for Autonomous Vehicle Testing
Authors: Andrii Gamalii, Daniel Górniak, Robert Nowak, Bartłomiej Olber, Krystian Radlak, Jakub Winter
First: 2025-12-31T14:43:48+00:00 · Latest: 2025-12-31T14:43:48+00:00
Abstract
This report presents the design and implementation of a semi-automated data annotation pipeline developed within the DARTS project, whose goal is to create a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. Manual annotation of such heterogeneous data is both costly and time-consuming. To address this challenge, the proposed solution adopts a human-in-the-loop approach that combines artificial intelligence with human expertise to reduce annotation cost and duration. The system automatically generates initial annotations, enables iterative model retraining, and incorporates data anonymization and domain adaptation techniques. At its core, the tool relies on 3D object detection algorithms to produce preliminary annotations. Overall, the developed tools and methodology result in substantial time savings while ensuring consistent, high-quality annotations across different sensor modalities. The solution directly supports the DARTS project by accelerating the preparation of a large annotated dataset in the project's standardized format, strengthening the technological base for autonomous vehicle research in Poland.
中文标题/摘要
标题:多传感器数据集在自动驾驶车辆测试中的半自动化数据标注
本报告介绍了DARTS项目中开发的半自动化数据标注流水线的设计与实现,其目标是在波兰条件下创建一个大规模的多模态驾驶场景数据集。手动标注此类异构数据既昂贵又耗时。为应对这一挑战,所提出的解决方案采用了一种人工在环的方法,结合人工智能与人类专业知识,以降低标注成本和时间。系统自动生成初始标注,支持迭代模型重新训练,并采用数据匿名化和领域适应技术。该工具的核心依赖于3D物体检测算法以生成初步标注。总体而言,开发的工具和方法在确保不同传感器模态之间一致性和高质量标注的同时,节省了大量时间。该解决方案直接支持DARTS项目,通过加速准备符合项目标准化格式的大规模标注数据集,加强了波兰自动驾驶车辆研究的技术基础。
Summary / 总结
This report describes a semi-automated data annotation pipeline designed to reduce the cost and time of annotating a large-scale, multimodal dataset of driving scenarios recorded in Polish conditions. The pipeline takes a human-in-the-loop approach: it automatically generates initial annotations, supports iterative model retraining, and incorporates data anonymization and domain adaptation techniques. Preliminary annotations come from 3D object detection algorithms, which yields substantial time savings while maintaining consistent, high-quality annotations across sensor modalities. The tool supports the DARTS project by accelerating the preparation of a large annotated dataset in the project's standardized format for autonomous vehicle research.
研究旨在解决手动标注异构数据在自动驾驶车辆测试中的高成本和时间消耗问题。开发了一种半自动数据标注流水线,结合AI和人类专业知识生成初始标注,实现模型迭代训练,并采用数据匿名化和领域适应技术。该系统使用3D物体检测算法生成初步标注,从而实现显著的时间节省和不同传感器模态的一致、高质量标注,支持DARTS项目通过加速准备大规模标注数据集来加速自动驾驶车辆研究。
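The human-in-the-loop idea in the DARTS annotation report can be sketched as a triage step over detector output: the model proposes annotations and a human reviews the uncertain ones. The report does not say how proposals are routed to annotators, so the confidence threshold, the `Detection` fields, and the function name below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Detection:
    """Hypothetical pre-annotation produced by a 3D object detector."""
    frame_id: int
    label: str
    score: float  # detector confidence in [0, 1]


def triage(detections: Sequence[Detection],
           review_threshold: float = 0.8
           ) -> Tuple[List[Detection], List[Detection]]:
    """Split pre-annotations into auto-accepted and human-review queues."""
    accepted: List[Detection] = []
    needs_review: List[Detection] = []
    for det in detections:
        if det.score >= review_threshold:
            accepted.append(det)
        else:
            needs_review.append(det)
    return accepted, needs_review


proposals = [
    Detection(frame_id=0, label="car", score=0.95),
    Detection(frame_id=0, label="pedestrian", score=0.55),
    Detection(frame_id=1, label="cyclist", score=0.80),
]
accepted, needs_review = triage(proposals)
print(len(accepted), len(needs_review))  # 2 1
```

In an iterative setup, the human-corrected items would be added to the training set before the detector is retrained, which is the retraining loop the report mentions.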