arXiv 论文速递

2025-12-23 03:27
Snapshot: 20251223_0327
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Authors: Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo
First: 2025-12-19T18:59:57+00:00 · Latest: 2025-12-19T18:59:57+00:00
Comments: Project Page: https://jshilong.github.io/PS-VAE-PAGE/
Abstract
Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
中文标题/摘要
标题:语义和重建都重要:使表示编码器准备好进行文本到图像生成和编辑
现代潜在扩散模型(LDMs)通常在主要优化像素级重建的低级变分自编码器(VAE)潜在空间中运行。为了统一视觉生成和理解,一种新兴趋势是采用表示编码器的高维特征作为生成潜在变量。然而,我们实证地识别出这一范式中的两个根本障碍:(1)判别特征空间缺乏紧凑的正则化,使扩散模型容易产生偏离流形的潜在变量,导致不准确的对象结构;(2)编码器固有的弱像素级重建阻碍了生成器学习准确的细粒度几何和纹理。在本文中,我们提出了一种系统框架来适应理解导向的编码器特征以用于生成任务。我们引入了语义像素重建目标来正则化潜在空间,使语义信息和细粒度细节能够被压缩成一个高度紧凑的表示(96通道,16x16空间下采样)。此设计确保潜在空间保持语义丰富,并实现最先进的图像重建,同时保持足够的紧凑性以实现准确的生成。利用此表示,我们设计了一个统一的文本到图像(T2I)和图像编辑模型。与各种特征空间进行基准测试,我们证明我们的方法在重建、收敛速度和T2I及编辑任务中的性能上都取得了最先进的成果,验证了表示编码器可以被有效适应为稳健的生成组件。
Summary / 总结
This paper addresses the limitations of using high-dimensional features from representation encoders for text-to-image generation and editing. It proposes a systematic framework that introduces a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a compact 96-channel representation. The approach achieves state-of-the-art image reconstruction, faster convergence, and significant performance gains in T2I and editing tasks, validating the effectiveness of representation encoders in generative tasks.
本文解决了使用来自表示编码器的高维特征进行文本到图像生成和编辑的局限性。它提出了一种系统框架,引入了语义像素重建目标来正则化潜在空间,使语义信息和细粒度细节能够被压缩到一个紧凑的96通道表示中。该方法实现了最先进的图像重建、更快的收敛速度,并在文本到图像和编辑任务中取得了显著的性能提升,验证了表示编码器在生成任务中的有效性。
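To make the stated compactness concrete, the following is a small arithmetic sketch. It assumes that "16x16 spatial downsampling" means a factor of 16 along each side, and it uses a hypothetical 256x256 RGB input that is not taken from the paper; the function names are illustrative only.

```python
def latent_shape(height, width, channels=96, downsample=16):
    """Return (channels, h, w) of the latent for an image of the given size.

    Assumes a downsampling factor of 16 per spatial side (our reading of the abstract).
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return channels, height // downsample, width // downsample


def compression_ratio(height, width, channels=96, downsample=16, image_channels=3):
    """Ratio of raw pixel values to latent values."""
    c, h, w = latent_shape(height, width, channels, downsample)
    return (image_channels * height * width) / (c * h * w)


# Hypothetical 256x256 RGB input (not a setting reported in the abstract).
shape = latent_shape(256, 256)
ratio = compression_ratio(256, 256)
```

Under these assumptions the ratio does not depend on the image size: each 16x16 RGB patch (768 values) maps to 96 latent values.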
Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
Authors: Ananta R. Bhattarai, Helge Rhodin
First: 2025-12-19T18:59:56+00:00 · Latest: 2025-12-19T18:59:56+00:00
Abstract
Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over the DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
中文标题/摘要
标题:任何内容的重新深度化:通过自监督重新照明的测试时深度细化
单目深度估计仍然具有挑战性,因为最近的基础模型,如深度一切V2(DA-V2),难以处理与训练分布相差甚远的真实世界图像。我们提出了重新深度一切(Re-Depth Anything),这是一种测试时的自监督框架,通过将DA-V2与大规模2D扩散模型的强大先验融合,来弥合这一领域差距。我们的方法通过重新照明预测的深度图并在输入上进行增强,直接在输入图像上进行无标签细化。这种重新合成方法通过利用形状从阴影(SfS)线索,在新的生成性上下文中利用分数蒸馏采样(SDS)来替代经典的光度重建。为了防止优化崩溃,我们的框架采用了一种有针对性的优化策略:我们冻结编码器,只更新中间嵌入,并微调解码器。在多种基准测试中,重新深度一切在深度准确性和现实性方面相对于DA-V2取得了显著的提升,展示了通过增强几何推理来实现自监督的新途径。
Dexterous World Models
Authors: Byungjun Kim, Taeksoo Kim, Junyoung Lee, Hanbyul Joo
First: 2025-12-19T18:59:51+00:00 · Latest: 2025-12-19T18:59:51+00:00
Comments: Project Page: snuvclab.github.io/dwm
Abstract
Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.
中文标题/摘要
标题:灵巧的世界模型
近期在三维重建方面的进展使得从日常环境中创建逼真的数字孪生变得容易。然而,当前的数字孪生仍然主要保持静态,仅限于导航和视图合成,缺乏具身互动性。为弥合这一差距,我们引入了灵巧的世界模型(DWM),这是一种基于场景-动作条件的视频扩散框架,用于建模灵巧的人类动作如何引起静态3D场景的动态变化。 给定一个静态3D场景渲染和第一人称手部运动序列,DWM生成时间上连贯的视频,描绘可能的人-场景互动。我们的方法通过(1)遵循指定相机轨迹的静态场景渲染来确保空间一致性,以及(2)包含几何和运动线索的第一人称手部网格渲染来直接建模动作条件下的动力学。为了训练DWM,我们构建了一个混合交互视频数据集。合成的第一人称交互提供了关节运动和操作学习的完全对齐监督,而固定相机的现实世界视频则提供了多样且真实的物体动力学。 实验表明,DWM能够实现真实且物理上合理的互动,如抓取、开启和移动物体,同时保持相机和场景的一致性。该框架代表了基于视频扩散的交互数字孪生的第一步,并能够从第一人称动作中实现具身模拟。
Summary / 总结
Dexterous World Model (DWM) is a video diffusion framework that models dynamic changes in static 3D scenes based on human actions. Given a static 3D scene and an egocentric hand motion sequence, DWM generates coherent videos of plausible human-scene interactions. The model conditions video generation on static scene renderings and egocentric hand mesh renderings to ensure spatial and action consistency. Experiments show that DWM can realistically simulate actions like grasping, opening, and moving objects while maintaining scene and camera consistency.
研究旨在通过引入 Dexterous World Model (DWM) 框架弥合静态 3D 场景重建与具身互动之间的差距,DWM 是一个基于视频扩散的框架,能够模拟人类动作对 3D 场景动态变化的影响。DWM 通过静态场景渲染和第一人称手部网格渲染进行条件化,确保空间和时间的一致性。实验表明,DWM 可以生成逼真且物理上合理的交互,如抓取和移动物体,同时保持场景和相机的一致性。
Adversarial Robustness of Vision in Open Foundation Models
Authors: Jonathon Fox, William J Buchanan, Pavlos Papadopoulos
Venue: IEEE Access, 2025
First: 2025-12-19T18:59:16+00:00 · Latest: 2025-12-19T18:59:16+00:00
Abstract
With the increase in deep learning, it becomes increasingly difficult to understand the model in which AI systems can identify objects. Thus, an adversary could aim to modify an image by adding unseen elements, which will confuse the AI in its recognition of an entity. This paper thus investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. These are tested for untargeted PGD (Projected Gradient Descent) against the visual input modality, and empirically evaluated on the Visual Question Answering (VQA) v2 dataset subset. The results of these adversarial attacks are then quantified using the standard VQA accuracy metric. This evaluation is then compared with the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
中文标题/摘要
标题:开放基础模型中的视觉对抗鲁棒性
随着深度学习的发展,理解AI系统如何识别物体变得越来越困难。因此,攻击者可以尝试通过添加难以察觉的元素修改图像,从而混淆AI的识别。本文研究了LLaVA-1.5-13B和Meta的Llama 3.2 Vision-8B-2的对抗鲁棒性。这些模型在无目标PGD(投影梯度下降)攻击下对视觉输入进行了测试,并在VQA v2数据集子集上进行了经验评估。然后使用标准的VQA准确率度量这些对抗攻击的结果,并比较LLaVA和Llama 3.2 Vision的准确率下降幅度。一个关键发现是,尽管在该设置下Llama 3.2 Vision的基线准确率较低,但在较高扰动水平下,其性能下降幅度小于LLaVA。总体而言,这些发现证实了视觉模态是降低当代开放权重VLMs性能的有效攻击向量,包括Meta的Llama 3.2 Vision。此外,它们还表明,对抗鲁棒性并不一定直接与标准基准性能相关,可能受到底层架构和训练因素的影响。
Summary / 总结
This paper investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2 against untargeted PGD attacks on the visual input modality using the VQA v2 dataset. The results show that Llama 3.2 Vision, although having lower baseline accuracy, demonstrated a smaller performance drop under attack, especially at higher perturbation levels, compared to LLaVA. This indicates that the vision modality can be a critical attack vector for degrading the performance of contemporary open-weight VLMs, and adversarial robustness does not necessarily correlate with standard benchmark performance.
该研究使用VQA v2数据集,考察了LLaVA-1.5-13B和Meta的Llama 3.2 Vision-8B-2在视觉输入模态上对无目标PGD攻击的对抗鲁棒性。结果显示,尽管Llama 3.2 Vision的基线准确率较低,但在较高扰动水平下,其性能下降幅度小于LLaVA。这表明对抗鲁棒性并不一定与标准基准性能直接相关,可能受到架构和训练因素的影响。
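As an illustration of the attack family named above, here is a minimal sketch of untargeted PGD under an L-infinity budget. A toy logistic model stands in for the vision-language models, and the values of `eps`, `step`, and `iters` are placeholders, not settings from the paper.

```python
import math


def loss_and_grad(x, w, b, y):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12))
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for a logistic model
    return loss, grad


def pgd_untargeted(x0, w, b, y, eps=0.1, step=0.02, iters=20):
    """Untargeted PGD: ascend the loss of the true label, then project back."""
    x = list(x0)
    for _ in range(iters):
        _, g = loss_and_grad(x, w, b, y)
        # Signed-gradient ascent step on the loss.
        x = [xi + step * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        # Project onto the L-infinity ball of radius eps around the clean input.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
        # Keep values in the valid pixel range [0, 1].
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


# Toy input with three "pixels" (hypothetical values).
x0 = [0.5, 0.5, 0.5]
w, b, y = [2.0, -1.0, 0.5], 0.0, 1
x_adv = pgd_untargeted(x0, w, b, y)
clean_loss, _ = loss_and_grad(x0, w, b, y)
adv_loss, _ = loss_and_grad(x_adv, w, b, y)
```

The attack is untargeted because it only pushes the loss of the correct label upward; it does not steer the model toward any particular wrong answer.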
When Reasoning Meets Its Laws
Authors: Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang
First: 2025-12-19T18:59:11+00:00 · Latest: 2025-12-19T18:59:11+00:00
Abstract
Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/
中文标题/摘要
标题:当推理遇到其法则
尽管大型推理模型(LRMs)表现出色,但其推理行为往往令人费解,导致推理能力不佳。为了理论化所需的推理行为,本文提出了推理法则(LoRe),这是一种统一框架,用于描述LRMs中的内在推理模式。我们首先提出了计算法则,假设推理计算应与问题复杂性成线性关系。除了计算,我们还通过补充准确性法则扩展了LoRe。由于在实践中难以量化问题复杂性,我们通过法则的两个属性——单调性和组合性——来检验这些假设。因此,我们引入了LoRe-Bench,这是一个基准,系统地测量大型推理模型的这两个可操作属性。评估结果显示,大多数推理模型表现出合理的单调性,但缺乏组合性。为此,我们开发了一种有效的微调方法,以确保计算法则的组合性。广泛的实证研究表明,更好地遵守计算法则在多个基准上持续提高了推理性能,并揭示了属性和法则之间的协同效应。项目页面:https://lore-project.github.io/
Summary / 总结
This paper addresses counterintuitive reasoning behaviors in Large Reasoning Models (LRMs) by proposing the Laws of Reasoning (LoRe), a framework that formalizes desired reasoning patterns. The compute law hypothesizes that reasoning compute should scale linearly with question complexity, and a supplementary accuracy law complements it. Because question complexity is hard to quantify in practice, the authors test the laws through two tractable properties, monotonicity and compositionality, and introduce LoRe-Bench to measure them. The study finds that most reasoning models show reasonable monotonicity but lack compositionality. To address this, the authors propose a finetuning approach that enforces compute-law compositionality, which improves reasoning performance across multiple benchmarks.
本文通过提出推理法则(LoRe)框架来解决大型推理模型(LRMs)的反直觉推理行为问题,LoRe旨在形式化期望的推理模式。计算法则提出推理计算应与问题复杂性成线性关系,同时引入了准确性法则来补充这一点。由于问题复杂性在实践中难以量化,作者通过单调性和组合性这两个可操作属性来检验这些法则,并开发了LoRe-Bench来测量这些属性。研究发现,大多数推理模型表现出合理的单调性但缺乏组合性。为解决这一问题,作者提出了一种微调方法来强制执行计算法则的组合性,从而在多个基准测试中提高了推理性能。
Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Authors: Vongani H. Maluleke, Kie Horiuchi, Lea Wilken, Evonne Ng, Jitendra Malik, Angjoo Kanazawa
First: 2025-12-19T18:59:02+00:00 · Latest: 2025-12-19T18:59:02+00:00
Abstract
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of v. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: https://von31.github.io/MAGNet/
中文标题/摘要
标题:多智能体交互序列建模的扩散驱动
理解与生成多人群体互动是一个基本挑战,对机器人技术和社交计算具有广泛影响。尽管人类在群体中自然协调,但由于长时间跨度、强烈的智能体间依赖性和变化的群体规模,建模此类互动仍然困难。现有的运动生成方法大多针对特定任务,无法泛化到灵活的多智能体生成。我们引入了MAGNet(多智能体扩散驱动变换器),这是一种统一的自回归扩散框架,通过灵活的条件和采样支持广泛的交互任务。MAGNet在单一模型中执行二元预测、伙伴填充和完整的多智能体运动生成,并能够自回归生成超长序列,跨越数百个v。基于扩散驱动,我们引入了关键修改,在自回归去噪过程中明确建模智能体间的耦合,从而在智能体之间实现连贯的协调。因此,MAGNet既能捕捉紧密同步的活动(如舞蹈、拳击),也能捕捉松散结构的社会互动。我们的方法在二元基准测试中与专门方法表现相当,并自然扩展到涉及三人或更多互动人员的多人(polyadic)场景,这得益于一种可扩展的架构,该架构对智能体的数量不敏感。我们建议读者参阅补充视频,其中最能体现生成互动的时间动态和空间协调性。项目页面:https://von31.github.io/MAGNet/
Summary / 总结
The research aims to model and generate multi-agent interactions, which are crucial for robotics and social computing. The method introduces MAGNet, a unified autoregressive diffusion framework that supports various interaction tasks through flexible conditioning and sampling. Key findings include MAGNet's ability to capture both synchronized and loosely structured interactions, perform well on dyadic benchmarks, and naturally extend to polyadic scenarios involving three or more agents, all enabled by a scalable architecture.
该研究引入了MAGNet,一种统一的自回归扩散框架,用于多智能体运动生成,解决了长时间跨度和智能体间依赖性的挑战。它支持多种交互任务,并能生成超长序列。该方法基于扩散驱动(Diffusion Forcing),在自回归去噪过程中显式建模智能体间的耦合,从而在智能体之间实现协调。MAGNet既能捕捉紧密同步的活动,也能捕捉松散结构的社会互动;其在二元基准测试中表现与专门方法相当,并且能够自然地扩展到涉及三人或更多人的多人(polyadic)场景。
Distributionally Robust Imitation Learning: Layered Control Architecture for Certifiable Autonomy
Authors: Aditya Gahlawat, Ahmed Aboudonia, Sandeep Banik, Naira Hovakimyan, Nikolai Matni, Aaron D. Ames, Gioele Zardini, Alberto Speranzon
First: 2025-12-19T18:58:11+00:00 · Latest: 2025-12-19T18:58:11+00:00
Comments: 18 pages, 5 figures
Abstract
Imitation learning (IL) enables autonomous behavior by learning from expert demonstrations. While more sample-efficient than comparative alternatives like reinforcement learning, IL is sensitive to compounding errors induced by distribution shifts. There are two significant sources of distribution shifts when using IL-based feedback laws on systems: distribution shifts caused by policy error and distribution shifts due to exogenous disturbances and endogenous model errors due to lack of learning. Our previously developed approaches, Taylor Series Imitation Learning (TaSIL) and $\mathcal{L}_1$-Distributionally Robust Adaptive Control ($\mathcal{L}_1$-DRAC), address the challenge of distribution shifts in complementary ways. While TaSIL offers robustness against policy error-induced distribution shifts, $\mathcal{L}_1$-DRAC offers robustness against distribution shifts due to aleatoric and epistemic uncertainties. To enable certifiable IL for learned and/or uncertain dynamical systems, we formulate the Distributionally Robust Imitation Policy (DRIP) architecture, a Layered Control Architecture (LCA) that integrates TaSIL and $\mathcal{L}_1$-DRAC. By judiciously designing individual layer-centric input and output requirements, we show how we can guarantee certificates for the entire control pipeline. Our solution paves the path for designing fully certifiable autonomy pipelines, by integrating learning-based components, such as perception, with certifiable model-based decision-making through the proposed LCA approach.
中文标题/摘要
标题:分布鲁棒模仿学习:具有可认证自主性的分层控制架构
模仿学习(IL)通过从专家演示中学习来实现自主行为。尽管与强化学习等替代方法相比,IL更具样本效率,但其对分布偏移引起的累积误差敏感。当使用基于IL的反馈律在系统上运行时,存在两种重要的分布偏移来源:由策略误差引起的分布偏移和由外生干扰和内生模型误差(由于缺乏学习)引起的分布偏移。我们之前开发的方法,泰勒级数模仿学习(TaSIL)和$\mathcal{L}_1$分布鲁棒自适应控制($\mathcal{L}_1$-DRAC),以互补的方式解决了分布偏移的挑战。虽然TaSIL提供了对由策略误差引起的分布偏移的鲁棒性,但$\mathcal{L}_1$-DRAC提供了对由 aleatoric 和 epistemic 不确定性引起的分布偏移的鲁棒性。为了实现对学习和/或不确定动力学系统的可认证IL,我们提出了分布鲁棒模仿策略(DRIP)架构,这是一种分层控制架构(LCA),将TaSIL和$\mathcal{L}_1$-DRAC集成在一起。通过精心设计每个层的输入和输出要求,我们展示了如何保证整个控制管道的证书。我们的解决方案铺平了通过提出的LCA方法将基于学习的组件(如感知)与可认证模型驱动的决策相结合来设计完全可认证自主性管道的道路。
RadarGen: Automotive Radar Point Cloud Generation from Cameras
Authors: Tomer Borreda, Fangqiang Ding, Sanja Fidler, Shengyu Huang, Or Litany
First: 2025-12-19T18:57:33+00:00 · Latest: 2025-12-19T18:57:33+00:00
Comments: Project page: https://radargen.github.io/
Abstract
We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
中文标题/摘要
标题:RadarGen:从多视角摄像头图像生成汽车雷达点云
我们提出了RadarGen,一种从多视角摄像头图像合成现实汽车雷达点云的扩散模型。RadarGen 通过将雷达测量以鸟瞰图形式表示,结合雷达截面(RCS)和多普勒属性,将高效的图像-潜空间扩散模型适应到雷达领域。一个轻量级的恢复步骤从生成的地图中重建点云。为了更好地与视觉场景对齐,RadarGen 结合了从预训练基础模型中提取的BEV对齐的深度、语义和运动线索,这些线索指导随机生成过程向物理上合理的雷达模式发展。基于图像的条件使得该方法原则上与现有的视觉数据集和仿真框架兼容,为多模态生成仿真提供了可扩展的方向。在大规模驾驶数据上的评估表明,RadarGen 捕捉了特征雷达测量分布,并减少了与在真实数据上训练的感知模型之间的差距,标志着向跨传感模态的统一生成仿真迈进了一步。
Summary / 总结
RadarGen is a diffusion model that synthesizes realistic automotive radar point clouds from multi-view camera imagery. It represents radar measurements in bird's-eye-view form, incorporating depth, semantic, and motion cues to guide the generation process. Evaluations show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, demonstrating its potential for multimodal generative simulation across sensing modalities.
RadarGen是一种从多视角相机图像生成真实汽车雷达点云的扩散模型。它使用鸟瞰图表示法来编码空间结构、雷达截面(RCS)和多普勒属性,并结合深度、语义和运动线索来引导生成过程。评估结果显示,RadarGen能够捕捉到典型的雷达测量分布,并减少与基于真实数据训练的感知模型之间的差距,展示了其在跨传感模态的生成模拟中的潜力。
SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars
Authors: Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo
First: 2025-07-02T17:49:52+00:00 · Latest: 2025-12-19T18:39:57+00:00
Comments: 29 pages, 8 figures, 6 tables. Accepted for publication in ApJ. Comments welcome
Abstract
In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy. Our code SpecCLIP is publicly available at https://github.com/Xiaosheng-Zhao/SpecCLIP
中文标题/摘要
标题:SpecCLIP:为恒星光谱测量对齐和翻译
近年来,大规模语言模型(LLMs)通过大规模数据集和大规模参数化改变了自然语言理解。受此成功的启发,我们提出了SpecCLIP,一种基础模型框架,将LLM启发的方法扩展到恒星光谱分析。恒星光谱类似于结构化语言,编码了丰富的物理和化学信息。通过在大规模光谱数据集上训练基础模型,我们的目标是学习稳健且信息丰富的嵌入,以支持各种下游应用。作为概念验证,SpecCLIP 包括在两种光谱类型——LAMOST 低分辨率和Gaia XP 上进行预训练,然后使用适应不同仪器关联光谱的CLIP(对比语言-图像预训练)框架进行对比对齐。这种对齐通过最大化嵌入和输入光谱之间的互信息来辅助光谱特定信息的保持和翻译。结果是一个跨光谱框架,能够实现仪器间的内在校准和灵活应用。我们证明,通过在中等大小的标记数据集上微调这些模型,可以提高恒星参数估计和化学丰度确定等任务的适应性。SpecCLIP 还通过与外部调查数据的参数估计准确性进行基准测试,提高了参数估计的准确性和精确度。此外,其相似性搜索和跨光谱预测能力为异常检测提供了潜在可能性。我们的结果表明,通过光谱感知解码器增强的对比训练基础模型可以推进精确恒星光谱学。我们的代码SpecCLIP 已在 https://github.com/Xiaosheng-Zhao/SpecCLIP 公开。
Summary / 总结
SpecCLIP is a foundation model framework that extends LLM methodologies to stellar spectral analysis, aiming to learn robust embeddings for diverse applications. It involves pre-training on LAMOST and Gaia XP spectra, followed by contrastive alignment using the CLIP framework. This approach enables cross-spectrum translation and intrinsic calibration, improving accuracy in stellar-parameter estimation and chemical-abundance determination.
SpecCLIP 是一种将 LLM 方法扩展到恒星光谱分析的基础模型框架,旨在学习适用于多种应用的稳健嵌入。该方法包括对 LAMOST 和 Gaia XP 光谱进行预训练,然后使用 CLIP 框架进行对比对齐。这种方法实现了光谱间的跨谱转换和内在校准,提高了恒星参数估计和化学丰度确定的准确性。
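The contrastive alignment step can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive loss over paired embeddings. This is a generic formulation, not SpecCLIP's implementation; the temperature value and the toy two-dimensional embeddings are placeholders, with the two lists standing in for embeddings of the same stars from two instruments.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for a stable log-sum-exp
        lse = top + math.log(sum(math.exp(v - top) for v in row))
        total += lse - row[i]
    return total / len(logits)


def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss: pair i in emb_a should match pair i in emb_b."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    logits = [[sum(x * y for x, y in zip(va, vb)) / temperature for vb in b] for va in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy embeddings of two objects seen by two instruments (hypothetical values).
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = clip_style_loss(emb_a, emb_b)
shuffled_loss = clip_style_loss(emb_a, emb_b[::-1])
```

Correctly paired embeddings give a much lower loss than mismatched ones, which is the signal that pulls the two embedding spaces together during training.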
Regularized Random Fourier Features and Finite Element Reconstruction for Operator Learning in Sobolev Space
Authors: Xinyue Yu, Hayden Schaeffer
First: 2025-12-19T18:36:24+00:00 · Latest: 2025-12-19T18:36:24+00:00
Abstract
Operator learning is a data-driven approximation of mappings between infinite-dimensional function spaces, such as the solution operators of partial differential equations. Kernel-based operator learning can offer accurate, theoretically justified approximations that require less training than standard methods. However, they can become computationally prohibitive for large training sets and can be sensitive to noise. We propose a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method uses random features drawn from multivariate Student's $t$ distributions, together with frequency-weighted Tikhonov regularization that suppresses high-frequency noise. We establish high-probability bounds on the extreme singular values of the associated random feature matrix and show that when the number of features $N$ scales like $m \log m$ with the number of training samples $m$, the system is well-conditioned, which yields estimation and generalization guarantees. Detailed numerical experiments on benchmark PDE problems, including advection, Burgers', Darcy flow, Helmholtz, Navier-Stokes, and structural mechanics, demonstrate that RRFF and RRFF-FEM are robust to noise and achieve improved performance with reduced training time compared to the unregularized random feature model, while maintaining competitive accuracy relative to kernel and neural operator tests.
Summary / 总结
The paper addresses the computational cost and noise sensitivity of kernel-based operator learning, which approximates mappings between infinite-dimensional function spaces. It introduces a regularized random Fourier feature (RRFF) approach, coupled with a finite element reconstruction map (RRFF-FEM), for learning operators from noisy data. The method draws random features from multivariate Student's $t$ distributions and applies frequency-weighted Tikhonov regularization to suppress high-frequency noise. Experiments on benchmark PDE problems show that RRFF and RRFF-FEM are robust to noise and improve performance with reduced training time relative to the unregularized random feature model, while remaining competitive in accuracy with kernel and neural operator methods.
论文旨在解决基于核的算子学习在无限维函数空间中的计算挑战和对噪声的敏感性。提出了一种正则化随机傅里叶特征(RRFF)方法,并结合有限元重构映射(RRFF-FEM),以从噪声数据中学习算子。该方法从多元Student's $t$ 分布中抽取随机特征,并通过频率加权Tikhonov正则化抑制高频噪声。在各种偏微分方程问题上的实验表明,RRFF和RRFF-FEM对噪声具有鲁棒性,与未正则化的随机特征模型相比性能更好、训练时间更短,同时在准确性上与核方法和神经算子方法相比保持竞争力。
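The ingredients described above can be sketched on a scalar regression problem. This is a simplified illustration, not the paper's operator-learning pipeline: the cosine feature map, the penalty weight `(1 + |omega|^2) ** power`, and all numeric settings are assumptions chosen for the example, and the finite element reconstruction step is omitted.

```python
import numpy as np


def sample_student_t(rng, n_features, dim, dof=3.0, scale=5.0):
    """Draw frequencies from a multivariate Student's t (Gaussian / sqrt(chi2/dof))."""
    gauss = rng.standard_normal((n_features, dim))
    chi2 = rng.chisquare(dof, size=(n_features, 1))
    return scale * gauss / np.sqrt(chi2 / dof)


def features(x, omega, phase):
    """Random Fourier feature map cos(x . omega + phase), shape (samples, features)."""
    return np.cos(x @ omega.T + phase)


def fit_rrff(x, y, omega, phase, lam=1e-3, power=1.0):
    """Solve a Tikhonov problem whose penalty grows with each feature's frequency."""
    a = features(x, omega, phase)
    # Assumed weight form: larger |omega| means a larger penalty on that coefficient.
    weights = (1.0 + np.sum(omega**2, axis=1)) ** power
    gram = a.T @ a + lam * np.diag(weights)
    return np.linalg.solve(gram, a.T @ y)


rng = np.random.default_rng(0)
m, n_feat = 200, 100
x = rng.uniform(0.0, 1.0, size=(m, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(m)  # noisy samples

omega = sample_student_t(rng, n_feat, dim=1)
phase = rng.uniform(0.0, 2 * np.pi, size=n_feat)
coef = fit_rrff(x, y, omega, phase)
train_mse = float(np.mean((features(x, omega, phase) @ coef - y) ** 2))
```

Because the heavy-tailed Student's $t$ draws occasionally produce very large frequencies, weighting the penalty by frequency damps exactly those features that would otherwise fit the noise.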
Weighted Stochastic Differential Equation to Implement Wasserstein-Fisher-Rao Gradient Flow
Authors: Herlock Rahimi
First: 2025-12-19T18:31:27+00:00 · Latest: 2025-12-19T18:31:27+00:00
Comments: 26 pages, 1 figure
Abstract
Score-based diffusion models currently constitute the state of the art in continuous generative modeling. These methods are typically formulated via overdamped or underdamped Ornstein--Uhlenbeck-type stochastic differential equations, in which sampling is driven by a combination of deterministic drift and Brownian diffusion, resulting in continuous particle trajectories in the ambient space. While such dynamics enjoy exponential convergence guarantees for strongly log-concave target distributions, it is well known that their mixing rates deteriorate exponentially in the presence of nonconvex or multimodal landscapes, such as double-well potentials. Since many practical generative modeling tasks involve highly non-log-concave target distributions, considerable recent effort has been devoted to developing sampling schemes that improve exploration beyond classical diffusion dynamics. A promising line of work leverages tools from information geometry to augment diffusion-based samplers with controlled mass reweighting mechanisms. This perspective leads naturally to Wasserstein--Fisher--Rao (WFR) geometries, which couple transport in the sample space with vertical (reaction) dynamics on the space of probability measures. In this work, we formulate such reweighting mechanisms through the introduction of explicit correction terms and show how they can be implemented via weighted stochastic differential equations using the Feynman--Kac representation. Our study provides a preliminary but rigorous investigation of WFR-based sampling dynamics, and aims to clarify their geometric and operator-theoretic structure as a foundation for future theoretical and algorithmic developments.
中文标题/摘要
标题:加权随机微分方程实现 Wasserstein-Fisher-Rao 梯度流
基于分数的扩散模型目前是连续生成建模的前沿技术。这些方法通常通过过阻尼或欠阻尼的 Ornstein--Uhlenbeck 类型随机微分方程来表述,其中采样由确定性漂移和布朗扩散的组合驱动,从而在环境空间中产生连续的粒子轨迹。虽然此类动力学对于强对数凹目标分布享有指数收敛保证,但在非凸或多重模态景观(如双井势)存在的情况下,它们的混合率会呈指数级下降。由于许多实际的生成建模任务涉及高度非对数凹的目标分布,因此最近投入了大量努力来开发超越经典扩散动力学的采样方案。 一种有前景的研究方向是利用信息几何工具,通过受控的质量重加权机制来增强基于扩散的采样器。这种视角自然地引出了 Wasserstein--Fisher--Rao (WFR) 几何,它将样本空间中的传输与概率测度空间上的垂直(反应)动力学耦合在一起。在本文中,我们通过引入显式的修正项来形式化这种重加权机制,并展示了如何通过使用费曼--卡茨表示法的加权随机微分方程来实现它们。我们的研究对基于 WFR 的采样动力学进行了初步但严谨的研究,并旨在澄清它们的几何和算子理论结构,作为未来理论和算法发展的基础。
Summary / 总结
This paper addresses the limitations of score-based diffusion samplers on non-log-concave target distributions, where mixing rates deteriorate in nonconvex or multimodal landscapes. It uses Wasserstein--Fisher--Rao (WFR) geometries to add controlled mass reweighting to diffusion-based samplers, formulating the reweighting through explicit correction terms and implementing it via weighted stochastic differential equations using the Feynman--Kac representation. The study is a preliminary but rigorous investigation that aims to clarify the geometric and operator-theoretic structure of WFR-based sampling dynamics as a foundation for future theoretical and algorithmic developments.
该论文通过提出使用加权随机微分方程实现 Wasserstein-Fisher-Rao 梯度流的新方法,解决了现有基于分数的扩散模型在处理非对数凹目标分布时的局限性。方法引入了显式的加权修正项来重新加权样本,使其能够更有效地探索复杂的景观。主要贡献包括通过加权 SDE 形式化这些重新加权机制,并阐明它们的几何和算子理论结构,为未来的理论和算法发展奠定基础。
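For orientation, the two ingredients named above can be written out in a standard textbook form. This is a sketch under stated assumptions and not necessarily the formulation used in the paper: the target is taken as $\pi \propto e^{-V}$, the energy is the KL divergence, and the symbols $b$, $c$, $u_t$, and $\varphi$ are introduced here for illustration.

```latex
% Probability-preserving WFR gradient flow of KL(\rho \| \pi):
% a transport (Wasserstein) term plus a reaction (Fisher--Rao) term.
\begin{equation}
\partial_t \rho_t
  = \nabla \cdot \Big( \rho_t \, \nabla \log \frac{\rho_t}{\pi} \Big)
  - \rho_t \Big( \log \frac{\rho_t}{\pi}
      - \int \rho_t \log \frac{\rho_t}{\pi} \, \mathrm{d}x \Big)
\end{equation}

% Feynman--Kac weighting: for the SDE  dX_t = b(X_t) dt + \sqrt{2} dW_t
% with weight  w_t = \exp\big( \int_0^t c(X_s) ds \big),
% the weighted density u_t defined by
%   \int \varphi(x) u_t(x) dx = E[ w_t \varphi(X_t) ]
% satisfies a Fokker--Planck equation with an added reaction term.
\begin{equation}
\partial_t u_t = -\nabla \cdot ( b \, u_t ) + \Delta u_t + c \, u_t
\end{equation}
```

With $b = -\nabla V$, the transport term of the first equation equals the Fokker--Planck part of the second, so the reaction term is what the particle weights have to reproduce. In the WFR flow that rate depends on $\rho_t$ itself, which makes the weighted dynamics nonlinear.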
Learning vertical coordinates via automatic differentiation of a dynamical core
Authors: Tim Whittaker, Seth Taylor, Elsa Cardoso-Bihlo, Alejandro Di Luca, Alex Bihlo
First: 2025-12-19T18:31:07+00:00 · Latest: 2025-12-19T18:31:07+00:00
Abstract
Terrain-following coordinates in atmospheric models often imprint their grid structure onto the solution, particularly over steep topography, where distorted coordinate layers can generate spurious horizontal and vertical motion. Standard formulations, such as hybrid or SLEVE coordinates, mitigate these errors by using analytic decay functions controlled by heuristic scale parameters that are typically tuned by hand and fixed a priori. In this work, we propose a framework to define a parametric vertical coordinate system as a learnable component within a differentiable dynamical core. We develop an end-to-end differentiable numerical solver for the two-dimensional non-hydrostatic Euler equations on an Arakawa C-grid, and introduce a NEUral Vertical Enhancement (NEUVE) terrain-following coordinate based on an integral transformed neural network that guarantees monotonicity. A key feature of our approach is the use of automatic differentiation to compute exact geometric metric terms, thereby eliminating truncation errors associated with finite-difference coordinate derivatives. By coupling simulation errors through the time integration to the parameterization, our formulation finds a grid structure optimized for both the underlying physics and numerics. Using several standard tests, we demonstrate that these learned coordinates reduce the mean squared error by a factor of 1.4 to 2 in non-linear statistical benchmarks, and eliminate spurious vertical velocity striations over steep topography.
中文标题/摘要
标题:通过动力核的自动微分学习垂直坐标
大气模型中的地形跟随坐标经常将其网格结构印刻到解中,特别是在陡峭地形上,扭曲的坐标层会产生虚假的水平和垂直运动。标准形式(如混合坐标或SLEVE坐标)使用解析衰减函数来减轻这些误差,这些函数由启发式尺度参数控制,而这些参数通常由人工调优并预先固定。在本文中,我们提出了一种框架,将参数化的垂直坐标系统定义为可微分动力核中的可学习组件。我们开发了一个端到端可微分的数值求解器,用于在Arakawa C网格上求解二维非静力欧拉方程,并引入了一种基于积分变换神经网络的NEUral Vertical Enhancement (NEUVE) 地形跟随坐标,该坐标保证单调性。我们方法的关键特征是使用自动微分来计算精确的几何度量项,从而消除与有限差分坐标导数相关的截断误差。通过时间积分将模拟误差与参数化耦合,我们的方法找到了同时针对底层物理和数值方法优化的网格结构。通过若干标准测试,我们证明这些学习到的坐标在非线性统计基准中将均方误差降低到原来的1/1.4至1/2,并消除了陡峭地形上的虚假垂直速度条带。
Summary / 总结
This study addresses the issue of distorted grid structures in terrain-following coordinates used in atmospheric models, which can introduce spurious motions. The authors propose a learnable vertical coordinate system within a differentiable dynamical core using automatic differentiation to compute exact geometric metric terms. They demonstrate that their NEUral Vertical Enhancement (NEUVE) coordinate system reduces mean squared errors by a factor of 1.4 to 2 in non-linear statistical benchmarks and eliminates spurious vertical velocity striations over steep topography.
该研究解决了地形跟随坐标在大气模型中导致的网格结构失真问题,这可能会引起虚假的运动。作者提出了一种框架,其中参数化的垂直坐标系统作为可微动力核心的一部分通过自动微分学习。他们基于积分变换的神经网络开发了NEUral Vertical Enhancement (NEUVE) 坐标,并证明这种方法在非线性基准测试中将均方误差降低了1.4到2倍,并消除了陡峭地形上的虚假垂直速度条带。
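The monotonicity guarantee mentioned in the abstract can be illustrated with a generic construction: integrating a strictly positive function always yields a strictly increasing map. The sketch below is a minimal illustration of that idea only, not the NEUVE architecture. The tiny network, its weights, the trapezoidal quadrature, and the names `positive_density` and `monotone_levels` are all assumptions made for this example.

```python
import math


def softplus(x):
    """Smooth function that is strictly positive for every real x."""
    return math.log1p(math.exp(x))


def positive_density(zeta, weights):
    """Hypothetical one-hidden-layer network with a softplus output (> 0)."""
    hidden = [math.tanh(w * zeta + b) for w, b in weights["hidden"]]
    pre = sum(v * h for v, h in zip(weights["out"], hidden)) + weights["bias"]
    return softplus(pre)


def monotone_levels(n_levels, terrain_height, model_top, weights):
    """Map a uniform grid in [0, 1] to heights between terrain and model top.

    Heights come from the normalised cumulative integral of a positive
    density, so they are strictly increasing for any choice of weights.
    """
    zetas = [k / (n_levels - 1) for k in range(n_levels)]
    dens = [positive_density(z, weights) for z in zetas]
    cum = [0.0]
    for k in range(1, n_levels):
        # Trapezoidal rule; each increment is positive because dens > 0.
        cum.append(cum[-1] + 0.5 * (dens[k - 1] + dens[k]) * (zetas[k] - zetas[k - 1]))
    depth = model_top - terrain_height
    return [terrain_height + depth * c / cum[-1] for c in cum]


# Illustrative weights only; in a learned coordinate these would be trained.
weights = {"hidden": [(2.0, -1.0), (-3.0, 0.5)], "out": [1.5, -0.7], "bias": 0.1}
levels = monotone_levels(6, 500.0, 20000.0, weights)
```

Because the positivity is built into the parameterization, no constraint or penalty term is needed during training to keep coordinate layers from crossing.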
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Authors: Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, Elvis Nava
First: 2025-12-17T18:47:31+00:00 · Latest: 2025-12-19T18:30:30+00:00
Comments: Revised Introduction, Related Work, and Appendix. Additional minor notational and grammatical fixes
Abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
中文标题/摘要
标题:mimic-video:超越VLAs的通用机器人控制的视频动作模型
当前用于机器人操作的视觉-语言-动作模型(VLAs)建立在视觉-语言骨干之上,这些骨干在大规模但彼此割裂的静态网络数据上预训练。因此,尽管提高了语义泛化能力,策略仍需仅从机器人轨迹中隐式推断复杂的物理动态和时间依赖性。这种依赖造成了不可持续的数据负担,需要持续、大规模地收集专家数据来弥补固有物理理解的缺失。我们认为,虽然视觉-语言预训练有效地捕捉了语义先验,但它对物理因果关系视而不见。更有效的范式是利用视频在预训练期间同时捕捉语义和视觉动力学,从而只留下低级控制这一剩余任务。为此,我们引入了mimic-video,这是一种新颖的视频动作模型(VAM),它将一个预训练的互联网规模视频模型与基于流匹配的动作解码器配对,该解码器以视频模型的潜在表示为条件。解码器充当逆动力学模型(IDM),从视频空间动作计划的潜在表示生成低级机器人动作。我们的广泛评估表明,我们的方法在模拟和真实世界机器人操作任务上达到了最先进的性能,与传统VLA架构相比,样本效率提高了10倍,收敛速度提高了2倍。
Summary / 总结
The research addresses the limitations of Vision-Language-Action Models (VLAs) in robotic manipulation by using video to capture both semantics and visual dynamics during pretraining. The proposed mimic-video model pairs a pretrained Internet-scale video model with a flow matching-based action decoder that is conditioned on the video model's latent representations and acts as an inverse dynamics model, generating low-level robot actions. Experiments report state-of-the-art performance on simulated and real-world manipulation tasks, with 10x better sample efficiency and 2x faster convergence than traditional VLA architectures.
研究旨在通过利用视频数据在预训练过程中同时捕捉语义和视觉动力学,解决视觉-语言-动作模型(VLAs)在机器人操作中的局限性。提出的mimic-video模型使用预训练的视频模型与基于流匹配的动作解码器配对,生成低级机器人动作。实验结果表明,该方法在模拟和真实世界任务中均优于传统VLA架构,实现了最先进的性能,样本效率提高了10倍,收敛速度提高了2倍。
Visually Prompted Benchmarks Are Surprisingly Fragile
Authors: Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, Angjoo Kanazawa
First: 2025-12-19T18:26:58+00:00 · Latest: 2025-12-19T18:26:58+00:00
Abstract
A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at https://lisadunlap.github.io/vpbench/.
中文标题/摘要
标题:视觉提示基准测试出人意料地脆弱
在评估VLMs时的一个关键挑战是测试模型独立于其文本先验分析视觉内容的能力。最近的基准测试(如BLINK)通过视觉提示来考察视觉感知,其中关于视觉内容的问题与问题所指的坐标配对,并且坐标在图像本身中明确标记。尽管这些基准测试是VLM评估的重要组成部分,但我们发现现有模型对视觉提示中看似无关的细节出奇地脆弱:仅将视觉标记从红色改为蓝色就可以完全改变排行榜上模型的排名。通过在两个视觉提示任务上评估九个常用的开源和闭源VLM,我们展示了基准设置中的细节(包括视觉标记设计和数据集规模)对模型性能和排行榜排名有显著影响。这些效应甚至可以被利用来使较弱的模型超过较强的模型;例如,略微增大视觉标记的尺寸就会使开源的InternVL3-8B与规模大得多的专有模型(如Gemini 2.5 Pro)持平甚至更优。我们还展示了在基准测试中经常被忽略的低级推理选择,如API调用中的JPEG压缩级别,也会导致模型排名变化。这些细节对视觉提示基准测试的影响远大于对传统语义VLM评估的影响。为了缓解这种不稳定性,我们整理现有数据集创建了VPBench,这是一个包含16种视觉标记变体的更大规模的视觉提示基准测试。VPBench和额外的分析工具发布在https://lisadunlap.github.io/vpbench/。
Summary / 总结
The study shows that visually prompted benchmarks for VLMs are fragile: changing a seemingly irrelevant detail, such as the color of a visual marker from red to blue, can completely reorder models on a leaderboard. Nine open- and closed-source VLMs were evaluated on two visually prompted tasks, showing that visual marker design, dataset size, and low-level inference choices such as JPEG compression levels in API calls significantly influence model performance and rankings. To mitigate this instability, the authors introduce VPBench, a larger benchmark with 16 visual marker variants.
研究旨在评估视觉语言模型(VLMs)在独立于文本偏见分析视觉内容方面的稳健性。通过使用BLINK等涉及视觉提示的基准测试,研究展示了VLMs对视觉标记的小变化(如颜色)极其敏感,这些变化可以显著改变模型的排名。研究发现,基准测试设置的细节,包括视觉标记设计和数据集大小,对模型性能和排名有重大影响,甚至可以让较弱的模型在某些条件下超越更强的模型。
Adaptive Focus Memory for Language Models
Authors: Christopher Cruz
First: 2025-11-16T17:52:32+00:00 · Latest: 2025-12-19T18:24:09+00:00
Abstract
Large language models (LLMs) are increasingly deployed in multi-turn dialogue settings, yet their behavior remains bottlenecked by naive history management strategies. Replaying the full conversation at every turn is simple but costly, while recency-based truncation or static summarization often causes early, high-impact user constraints to drift out of effective context. As a result, models may retain text without reliably applying it when it matters. We present Adaptive Focus Memory (AFM), a lightweight context management system that dynamically assigns each past message one of three fidelity levels: Full, Compressed, or Placeholder, based on semantic relevance, temporal decay, and importance classification. AFM packs messages chronologically under a fixed token budget, preserving critical constraints at high fidelity while allowing low-importance context to degrade gracefully. We evaluate AFM on two multi-turn dialogue benchmarks designed to stress long-horizon constraint preservation: a safety-critical travel scenario involving a user with a severe peanut allergy, and a policy-critical tax compliance scenario involving an illegal evasion request. Under strict grading that requires both explicit constraint recall and appropriately conditioned generation, AFM succeeds in 83.3 percent of allergy runs where all baseline strategies fail, and preserves correct refusal behavior on the tax benchmark. These results demonstrate that effective dialogue memory requires more than retaining prior text. Selectively allocating fidelity across past messages enables reliable constraint preservation under bounded context growth, without modifying model weights or introducing external retrieval infrastructure. We release an open-source implementation of AFM compatible with OpenAI-style chat APIs to support reproducible research and practical deployment.
中文标题/摘要
标题:语言模型的自适应焦点记忆
大型语言模型(LLMs)越来越多地在多轮对话环境中部署,但其行为仍受限于简单的历史管理策略。每次轮次重新播放整个对话虽然简单但代价高昂,而基于近期性的截断或静态总结往往会导致早期、高影响的用户限制过早地脱离有效语境。因此,模型可能会保留文本但无法可靠地在关键时刻应用这些文本。 我们提出了自适应焦点记忆(AFM),这是一种轻量级的上下文管理系统,能够根据语义相关性、时间衰减和重要性分类动态地为每条过去的消息分配三种保真度级别之一:全保真、压缩或占位符。AFM 在固定令牌预算下按时间顺序打包消息,保持关键约束的高保真度,同时允许低重要性上下文平滑降级。 我们在两个旨在测试长期约束保留能力的多轮对话基准测试中评估了AFM:一个涉及严重花生过敏用户的安全关键旅行场景,另一个涉及非法逃税请求的政策关键税务合规场景。在严格的评分标准下,要求同时明确回忆约束并适当条件生成,AFM 在过敏场景中的成功率为83.3%,而所有基线策略均失败;在税务基准测试中,AFM 保持了正确的拒绝行为。 这些结果表明,有效的对话记忆不仅仅需要保留先前的文本。在有限的上下文增长范围内,有选择地分配保真度到过去的每条消息能够实现可靠的约束保留,无需修改模型权重或引入外部检索基础设施。我们发布了与OpenAI风格聊天API兼容的AFM开源实现,以支持可重复研究和实际部署。
Summary / 总结
The paper introduces Adaptive Focus Memory (AFM), a context management system for large language models in multi-turn dialogue settings. AFM dynamically assigns each past message a fidelity level (Full, Compressed, or Placeholder) based on semantic relevance, temporal decay, and importance classification, then packs messages chronologically under a fixed token budget. Under strict grading on a safety-critical allergy scenario and a policy-critical tax compliance scenario, AFM succeeded in 83.3% of allergy runs, where all baseline strategies failed, and preserved correct refusal behavior on the tax benchmark.
论文介绍了一种名为Adaptive Focus Memory (AFM)的上下文管理系统,用于大型语言模型在多轮对话中的应用。AFM根据语义相关性、时间衰减和重要性动态地将每个过去的消息分配为全保真、压缩或占位符级别。在安全性和政策关键场景的评估中,AFM在83.3%的过敏场景中成功地保留了关键约束,并在税务合规场景中正确拒绝了非法请求,优于基线策略。
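The fidelity-allocation idea in the AFM abstract (three levels, chronological packing, fixed token budget) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released AFM implementation. The scoring rule, the thresholds `hi` and `lo`, the half-life, the whitespace token counter, the truncation-based `compress`, and all function names are hypothetical stand-ins.

```python
from dataclasses import dataclass

DOWNGRADE = {"full": "compressed", "compressed": "placeholder"}


@dataclass
class Message:
    turn: int
    text: str
    relevance: float  # assumed: precomputed semantic relevance in [0, 1]
    critical: bool    # assumed: output of an importance classifier


def count_tokens(text):
    """Whitespace word count, standing in for a real tokenizer."""
    return len(text.split())


def compress(text, max_words=8):
    """Truncation, standing in for a real summarizer."""
    words = text.split()
    return text if len(words) <= max_words else " ".join(words[:max_words]) + " ..."


def score(msg, current_turn, half_life=4.0):
    """Relevance with exponential temporal decay; critical messages never decay."""
    decay = 0.5 ** ((current_turn - msg.turn) / half_life)
    return 1.0 if msg.critical else msg.relevance * decay


def render(msg, level):
    if level == "full":
        return msg.text
    if level == "compressed":
        return compress(msg.text)
    return f"[turn {msg.turn} omitted]"


def pack(messages, current_turn, budget, hi=0.6, lo=0.2):
    """Assign a fidelity level per message, then downgrade until the budget fits."""
    scores = {m.turn: score(m, current_turn) for m in messages}
    levels = {
        m.turn: "full" if scores[m.turn] >= hi
        else "compressed" if scores[m.turn] >= lo
        else "placeholder"
        for m in messages
    }

    def total():
        return sum(count_tokens(render(m, levels[m.turn])) for m in messages)

    # Downgrade the lowest-scoring messages first; critical ones go last.
    for m in sorted(messages, key=lambda m: scores[m.turn]):
        while total() > budget and levels[m.turn] != "placeholder":
            levels[m.turn] = DOWNGRADE[levels[m.turn]]

    # Output stays in chronological order regardless of downgrade order.
    return [(m.turn, levels[m.turn], render(m, levels[m.turn]))
            for m in sorted(messages, key=lambda m: m.turn)]


history = [
    Message(0, "I have a severe peanut allergy so never suggest dishes with peanuts", 0.9, True),
    Message(1, "What is the weather like in Bangkok during the rainy season", 0.3, False),
    Message(5, "Please book a hotel near the old town with a pool", 0.8, False),
]
packed = pack(history, current_turn=6, budget=24)
```

In this toy run the early allergy constraint stays at full fidelity while older small talk degrades first, which is the behavior the abstract contrasts with recency-based truncation.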
Deep Gaussian Process Proximal Policy Optimization
Authors: Matthijs van der Lende, Juan Cardenas-Cartagena
First: 2025-11-22T23:13:04+00:00 · Latest: 2025-12-19T18:23:00+00:00
Comments: Withdrawn by the authors as the manuscript is not yet complete; no updated version is available at this time
Abstract
Uncertainty estimation for Reinforcement Learning (RL) is a critical component in control tasks where agents must balance safe exploration and efficient learning. While deep neural networks have enabled breakthroughs in RL, they often lack calibrated uncertainty estimates. We introduce Deep Gaussian Process Proximal Policy Optimization (GPPO), a scalable, model-free actor-critic algorithm that leverages Deep Gaussian Processes (DGPs) to approximate both the policy and value function. GPPO maintains competitive performance with respect to Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
中文标题/摘要
标题:深度高斯过程近端策略优化
强化学习(RL)中的不确定性估计是控制任务中的关键组成部分,其中智能体必须在安全探索和高效学习之间取得平衡。尽管深度神经网络在RL中取得了突破,但它们通常缺乏校准的不确定性估计。我们引入了深度高斯过程近端策略优化(GPPO),这是一种可扩展的、无模型的演员-评论家算法,利用深度高斯过程(DGPs)来近似策略和价值函数。GPPO在标准的高维连续控制基准测试中保持了与近端策略优化相当的性能,同时提供了校准良好的不确定性估计,可以指导更安全和更有效的探索。
Summary / 总结
The research aims to improve uncertainty estimation in Reinforcement Learning (RL) for control tasks, where agents must balance safe exploration and efficient learning. The method, Deep Gaussian Process Proximal Policy Optimization (GPPO), is a model-free actor-critic algorithm that uses Deep Gaussian Processes to approximate both the policy and the value function. GPPO is reported to remain competitive with Proximal Policy Optimization on standard high-dimensional continuous control benchmarks while providing well-calibrated uncertainty estimates that can inform safer and more effective exploration.
研究旨在提高强化学习(RL)中控制任务中的不确定性估计,这些任务需要在安全探索和高效学习之间取得平衡。方法是引入了基于深度高斯过程的近端策略优化(GPPO),使用深度高斯过程来近似策略和价值函数,提供准确的不确定性估计。主要实验发现是,GPPO在标准基准测试上的性能与近端策略优化相当,同时提供了更好的不确定性估计,可以提高探索的安全性和效率。
Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning
Authors: Simon Frieder, Jonas Bayer, Sam Looi, Jacob Loader, Julius Berner, Katherine M. Collins, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Cameron Freer, Thomas Lukasiewicz, Timothy Gowers
First: 2024-12-19T18:55:17+00:00 · Latest: 2025-12-19T18:17:28+00:00
Comments: 59 pages
Abstract
The datasets and benchmarks commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings and misdirections. These range from a restricted scope of mathematical complexity to limited fidelity in capturing aspects beyond the final, written proof (e.g. motivating the proof, or representing the thought processes leading to a proof). These issues are compounded by a dynamic reminiscent of Goodhart's law: as benchmark performance becomes the primary target for model development, the benchmarks themselves become less reliable indicators of genuine mathematical capability. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or "thought partners"), necessitates a course correction both in the design of mathematical datasets and the evaluation criteria of the models' mathematical ability. In particular, it is necessary for benchmarks to move beyond the existing result-based datasets that map theorem statements directly to proofs, and instead focus on datasets that translate the richer facets of mathematical research practice into data that LLMs can learn from. This includes benchmarks that supervise the proving process and the proof discovery process itself, and we advocate for mathematical dataset developers to consider the concept of "motivated proof", introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal, alleviating some of the mentioned limitations.
中文标题/摘要
标题:数学副驾的数据:呈现机器学习证明的更好方式
用于训练和评估基于AI的数学副驾(主要是大型语言模型)的数学能力的数据集和基准存在诸多局限性和误导性。这些问题包括数学复杂度范围受限以及未能充分捕捉到最终书面证明之外的内容(例如证明动机或证明思路)。随着基准性能成为模型开发的主要目标,基准本身变得不再可靠地反映真正的数学能力。我们系统地探讨了这些局限性,并认为增强大型语言模型的能力,或任何未来基于AI的数学助手(副驾或“思想伙伴”)的能力,需要在数学数据集的设计和模型数学能力评估标准上进行调整。特别是,基准需要超越现有的结果导向的数据集,这些数据集直接将定理陈述映射到证明,而是转向能够将数学研究实践的更丰富方面转化为LLM可以学习的数据的基准。这包括监督证明过程和证明发现过程本身的基准,我们建议数学数据集开发者考虑G. 波利亚1949年提出的“动机证明”概念,这可以作为提供更好证明学习信号的数据集蓝图,缓解上述提到的一些局限性。
Summary / 总结
The paper addresses the limitations of current datasets and benchmarks used to evaluate AI-based mathematical copilots, highlighting issues such as restricted mathematical complexity and lack of representation of thought processes. It proposes moving beyond result-based datasets to include benchmarks that capture the proving process and proof discovery, advocating for the concept of 'motivated proof' to improve learning signals for large language models.
论文探讨了当前用于训练AI数学协作者的数据集和基准的局限性,这些局限性包括范围狭窄以及未能捕捉证明过程的全部细节。它建议超越基于结果的数据集,转向监督证明过程和证明发现过程的基准,并提倡采用波利亚在1949年提出的‘动机证明’概念,以增强大型语言模型的学习信号。主要发现包括需要基准反映更广泛的数学研究实践,这可以提高AI协作者的真实数学能力。
Towards Human-Guided, Data-Centric LLM Co-Pilots
Authors: Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar
First: 2025-01-17T17:51:22+00:00 · Latest: 2025-12-19T18:08:16+00:00
Comments: Saveliev, Liu & Seedat contributed equally
Abstract
Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.
中文标题/摘要
标题:迈向由人指导、以数据为中心的LLM副驾
机器学习(ML)有潜力革新各个领域,但其应用往往受阻于领域专家的需求与将这些需求转化为稳健有效的ML工具之间的脱节。尽管基于LLM的副驾(co-pilot)近来取得进展,旨在让非技术背景的领域专家也能使用ML,但这些系统仍主要关注以模型为中心的方面,而忽视了关键的以数据为中心的挑战。在复杂的真实世界环境中,这一局限尤为突出,因为原始数据通常包含缺失值、标签噪声以及需要定制处理的领域特定细节等复杂问题。为了解决这一问题,我们引入了CliMB-DC,这是一种由人指导、以数据为中心的LLM副驾框架,它将先进的以数据为中心的工具与LLM驱动的推理相结合,以实现稳健、上下文感知的数据处理。CliMB-DC的核心是一种新颖的多智能体推理系统,它将负责动态规划和适应的战略协调器与负责精确执行的专门工作智能体结合起来。随后,通过人在回路的方法系统地融入领域专业知识来指导推理过程。为了指导开发,我们形式化了副驾必须应对的关键以数据为中心的挑战的分类体系。之后,为了覆盖该分类体系的各个维度,我们将最先进的以数据为中心的工具集成到一个可扩展的开源架构中,便于研究社区添加新工具。在实证方面,我们使用真实的医疗保健数据集展示了CliMB-DC将未整理的数据集转换为ML就绪格式的能力,在处理以数据为中心的挑战方面显著优于现有的副驾基线。CliMB-DC有望使来自医疗保健、金融、社会科学等不同领域的领域专家能够积极参与利用ML推动现实世界的影响。
Summary / 总结
This paper addresses the gap between domain experts' needs and the limitations of existing model-centric LLM co-pilots by introducing CliMB-DC, a human-guided, data-centric framework. It combines advanced data-centric tools with LLM-driven reasoning to handle complex data issues like missing values and label noise. Empirical results show that CliMB-DC outperforms existing co-pilots in transforming uncurated healthcare datasets into ML-ready formats, demonstrating its potential to empower domain experts across various fields to drive real-world ML impact.
本文通过引入CliMB-DC,一种结合了LLM驱动推理和高级数据驱动工具的人类引导型数据中心化框架,解决了ML辅助工具的缺口问题。该系统使用一个多代理推理系统,包含一个战略协调器和一个专门的工作代理,以处理复杂的数据问题,如缺失值和标签噪声。实验证明,CliMB-DC在将未整理的医疗保健数据集转换为可用于机器学习的格式方面优于现有的辅助工具基线,展示了其实用性。该框架旨在使来自各个领域的领域专家能够更有效地利用机器学习。
AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning
Authors: Ran Gong, Xiaohan Zhang, Jinghuan Shang, Maria Vittoria Minniti, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, Karl Schmeckpeper
First: 2025-12-19T17:55:48+00:00 · Latest: 2025-12-19T17:55:48+00:00
Comments: 28 pages, 25 figures. The first four authors contributed equally
Abstract
Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at https://anytask.rai-inst.com .
中文标题/摘要
标题:AnyTask:一种自动化任务和数据生成框架,用于推进从仿真到现实的策略学习
通用机器人学习仍然受到数据的限制:在现实世界中收集大规模、多样性和高质量的交互数据成本高昂。虽然仿真已成为扩展数据收集规模的一种有前途的方法,但相关的任务,包括仿真任务设计、任务感知场景生成、专家演示合成以及仿真到现实的转移,仍然需要大量的人力投入。我们提出了AnyTask,这是一种将大规模并行GPU仿真与基础模型相结合的自动化框架,用于设计多样化的操作任务并合成机器人数据。我们介绍了三个用于生成尽可能多任务的专家演示的AnyTask代理:1) ViPR,一种具有VLM在环并行精化的新型任务和运动规划代理;2) ViPR-Eureka,一种基于生成密集奖励和LLM引导接触采样的强化学习代理;3) ViPR-RL,一种结合规划和学习的混合方法,仅使用稀疏奖励即可生成高质量的演示。我们在生成的数据上训练行为克隆策略,在仿真中验证它们,并直接部署到真实的机器人硬件上。这些策略能够泛化到新的物体姿态,在一系列真实世界的拾取放置、抽屉打开、接触丰富的推拉以及长时操作任务中平均成功率达到了44%。我们的项目网站为https://anytask.rai-inst.com 。
Summary / 总结
AnyTask is an automated framework that uses GPU simulation and foundation models to generate diverse manipulation tasks and robot data. It includes three agents: ViPR for task and motion planning, ViPR-Eureka for reinforcement learning with generated rewards, and ViPR-RL for hybrid planning and learning. The framework trains behavior cloning policies on generated data, validates them in simulation, and deploys them on real robots, achieving 44% average success across various manipulation tasks in the real world.
AnyTask 是一个自动化框架,利用 GPU 模拟和基础模型生成多样化的操作任务和机器人数据。该框架包含三个代理:ViPR、ViPR-Eureka 和 ViPR-RL,它们生成解决各种任务的专家演示。该框架在生成的数据上训练行为克隆策略,在模拟中验证,并直接部署到真实机器人硬件上。策略在一系列真实世界的拾取放置、抽屉打开、接触丰富的推拉和长时操作任务中实现了 44% 的平均成功率,展示了显著的模拟到现实策略学习进展。
InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
Authors: Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad
First: 2025-12-19T17:52:43+00:00 · Latest: 2025-12-19T17:52:43+00:00
Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. Proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. Codebase is available at GitHub.
中文标题/摘要
标题:InfSplign: 文本到图像扩散模型推理时的空间对齐
文本到图像(T2I)扩散模型能够生成高质量的图像,但往往无法捕捉到文本提示中指定的空间关系。这一限制可以追溯到两个因素:训练数据中缺乏精细的空间监督以及文本嵌入无法编码空间语义。我们提出了一种无需训练的推理时方法InfSplign,通过在每个去噪步骤中使用复合损失调整噪声来改善空间对齐。所提出的损失利用从主干解码器提取的不同级别的交叉注意力图来强制执行准确的对象放置和采样期间的对象平衡。该方法轻量级、即插即用,并且与任何扩散主干兼容。我们在VISOR和T2I-CompBench上的全面评估表明,InfSplign建立了新的最先进的水平(据我们所知),在最强的现有推理时基线方法上实现了显著的性能提升,并且甚至优于基于微调的方法。代码库可在GitHub上获得。
Summary / 总结
InfSplign is a training-free, inference-time method that improves the spatial alignment of text-to-image diffusion models by adjusting the noise through a compound loss at every denoising step. The loss uses cross-attention maps extracted from the backbone decoder at different levels to enforce accurate object placement and balanced object presence. Experiments on VISOR and T2I-CompBench show that InfSplign outperforms the strongest existing inference-time baselines and even surpasses fine-tuning-based methods, which the authors describe as a new state of the art to the best of their knowledge. The method is lightweight, plug-and-play, and described as compatible with any diffusion backbone.
InfSplign 是一种无需训练的方法,在推理时通过在每个去噪步骤中调整噪声来增强文本到图像扩散模型的空间对齐。它使用交叉注意力图来确保准确的对象放置和对象存在的平衡,从而在 VISOR 和 T2I-CompBench 基准上实现了对现有推理时间基线的显著性能提升,甚至超过了基于微调的方法。
Planning as Descent: Goal-Conditioned Latent Trajectory Synthesis in Learned Energy Landscapes
Authors: Carlos Vélez García, Miguel Cazorla, Jorge Pomares
First: 2025-12-19T17:49:13+00:00 · Latest: 2025-12-19T17:49:13+00:00
Abstract
We present Planning as Descent (PaD), a framework for offline goal-conditioned reinforcement learning that grounds trajectory synthesis in verification. Instead of learning a policy or explicit planner, PaD learns a goal-conditioned energy function over entire latent trajectories, assigning low energy to feasible, goal-consistent futures. Planning is realized as gradient-based refinement in this energy landscape, using identical computation during training and inference to reduce train-test mismatch common in decoupled modeling pipelines. PaD is trained via self-supervised hindsight goal relabeling, shaping the energy landscape around the planning dynamics. At inference, multiple trajectory candidates are refined under different temporal hypotheses, and low-energy plans balancing feasibility and efficiency are selected. We evaluate PaD on OGBench cube manipulation tasks. When trained on narrow expert demonstrations, PaD achieves state-of-the-art 95% success, strongly outperforming prior methods that peak at 68%. Remarkably, training on noisy, suboptimal data further improves success and plan efficiency, highlighting the benefits of verification-driven planning. Our results suggest learning to evaluate and refine trajectories provides a robust alternative to direct policy learning for offline, reward-free planning.
中文标题/摘要
标题:规划即下降:在学习的能量景观中条件化潜轨迹合成
我们提出了规划即下降(PaD),一种基于验证的离线条件化强化学习框架。PaD 不学习策略或显式规划器,而是学习一个条件化的能量函数,覆盖整个潜轨迹,将低能量赋予可行且目标一致的未来。规划通过能量景观中的梯度优化实现,在训练和推理期间使用相同的计算,减少解耦建模管道中常见的训练-测试不匹配问题。 PaD 通过自我监督的后见之明目标重新标记进行训练,塑造能量景观以适应规划动力学。在推理时,多个轨迹候选者在不同的时间假设下进行优化,选择平衡可行性和效率的低能量计划。 我们在 OGBench 立方体操作任务上评估了 PaD。当在狭窄的专家演示上训练时,PaD 达到了最先进的 95% 成功率,显著优于峰值为 68% 的先前方法。令人惊讶的是,使用嘈杂的、次优的数据进行训练进一步提高了成功率和计划效率,突显了验证驱动规划的好处。我们的结果表明,学习评估和优化轨迹为离线、无奖励规划提供了一种稳健的替代方案。
Summary / 总结
Planning as Descent (PaD) is a framework for offline goal-conditioned reinforcement learning that learns a goal-conditioned energy function over entire latent trajectories instead of a policy or explicit planner. PaD is trained using self-supervised hindsight goal relabeling to shape the energy landscape, and at inference it refines multiple trajectory candidates by gradient descent and selects low-energy plans. On OGBench cube manipulation tasks, PaD trained on narrow expert demonstrations achieves 95% success, compared with prior methods that peak at 68%, and training on noisy, suboptimal data further improves success and plan efficiency.
Planning as Descent (PaD) 是一种用于离线目标条件强化学习的框架,它通过学习目标条件的能量函数来引导规划。PaD 使用自监督的回溯目标重新标记进行训练,并在推理时通过能量景观进行梯度优化。PaD 在 OGBench 立方体操作任务中达到了最先进的 95% 成功率,超过了之前的算法,并且即使在训练时使用噪声数据也能表现出更好的性能。
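Planning as gradient-based refinement in an energy landscape can be sketched with a toy example. In PaD the energy is learned and defined over latent trajectories; the version below is an assumption-heavy stand-in that uses a hand-written quadratic energy (a smoothness term plus a goal term) on a one-dimensional trajectory with an analytic gradient. The names `energy`, `energy_grad`, and `refine`, along with the weight `lam`, step size, and step count, are hypothetical.

```python
def energy(traj, goal, lam=5.0):
    """Toy energy: low for smooth trajectories that end near the goal."""
    smooth = sum((b - a) ** 2 for a, b in zip(traj, traj[1:]))
    return smooth + lam * (traj[-1] - goal) ** 2


def energy_grad(traj, goal, lam=5.0):
    """Analytic gradient of `energy` with respect to every state in traj."""
    grad = [0.0] * len(traj)
    for t in range(1, len(traj)):
        d = 2.0 * (traj[t] - traj[t - 1])
        grad[t] += d
        grad[t - 1] -= d
    grad[-1] += 2.0 * lam * (traj[-1] - goal)
    grad[0] = 0.0  # the start state is observed, so it is held fixed
    return grad


def refine(traj, goal, steps=2000, lr=0.05):
    """Plain gradient descent on the whole trajectory at once."""
    traj = list(traj)
    for _ in range(steps):
        grad = energy_grad(traj, goal)
        traj = [x - lr * g for x, g in zip(traj, grad)]
    return traj


initial = [0.0] * 6          # candidate plan: stay at the start state
plan = refine(initial, goal=1.0)
```

The refined plan moves steadily from the start toward the goal. Repeating the procedure for several trajectory lengths and keeping the lowest-energy result would mirror, very loosely, the selection over temporal hypotheses described in the abstract.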
ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges
Authors: Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar
First: 2025-12-19T17:44:40+00:00 · Latest: 2025-12-19T17:44:40+00:00
Comments: https://github.com/rajpurkarlab/ReX-MLE
Abstract
Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, R&D-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
中文标题/摘要
标题:ReX-MLE:医疗成像挑战的自主代理基准
基于大型语言模型(LLMs)的自主编码代理现在可以解决许多通用软件和机器学习任务,但在解决复杂、领域特定的科学问题方面仍然无效。医疗成像是一个特别具有挑战性的领域,需要长时间的训练周期、高维数据处理以及专门的预处理和验证管道,而现有的代理基准未能充分衡量这些能力。为了解决这一差距,我们引入了ReX-MLE,这是一个包含20个挑战的基准,这些挑战源自涵盖多种成像模态和任务类型的高影响力医疗成像竞赛。与之前的ML代理基准不同,ReX-MLE评估了完整的端到端工作流程,要求代理在现实的计算和时间约束下独立管理数据预处理、模型训练和提交。我们用不同的LLM后端(GPT-5、Gemini、Claude)评估了最先进的代理(AIDE、ML-Master、R&D-Agent),观察到性能差距非常严重:大多数提交的排名在人类专家的0百分位。失败的原因在于领域知识和工程限制。ReX-MLE揭示了这些瓶颈,并为开发领域意识自主AI系统提供了基础。
Summary / 总结
The paper introduces ReX-MLE, a benchmark for autonomous coding agents in medical imaging, a domain whose demands are not fully measured by existing agent benchmarks. It consists of 20 challenges derived from high-impact medical imaging competitions and evaluates end-to-end workflows, including data preprocessing, model training, and submission, under realistic compute and time constraints. With different LLM backends (GPT-5, Gemini, Claude), state-of-the-art agents (AIDE, ML-Master, R&D-Agent) perform poorly: most submissions rank in the 0th percentile compared to human experts, with failures stemming from domain-knowledge and engineering limitations.
ReX-MLE 是一个针对医学影像领域的自主代理基准,通过评估完整的端到端工作流程来弥补现有基准的不足。它包含来自高影响力医学影像竞赛的20个挑战。评估最先进的代理如AIDE、ML-Master和R&D-Agent时,研究发现存在显著的性能差距,大多数提交结果在人类专家的0百分位,原因是领域知识和工程限制。该基准旨在揭示这些瓶颈并推动更具有领域意识的自主AI系统的开发。
Human Mesh Modeling for Anny Body
Authors: Romain Brégier, Guénolé Fiche, Laura Bravo-Sánchez, Thomas Lucas, Matthieu Armando, Philippe Weinzaepfel, Grégory Rogez, Fabien Baradel
First: 2025-11-05T16:10:02+00:00 · Latest: 2025-12-19T17:42:14+00:00
Comments: We release our model and code at https://github.com/naver/anny
Abstract
Parametric body models provide the structural basis for many human-centric tasks, yet existing models often rely on costly 3D scans and learned shape spaces that are proprietary and demographically narrow. We introduce Anny, a simple, fully differentiable, and scan-free human body model grounded in anthropometric knowledge from the MakeHuman community. Anny defines a continuous, interpretable shape space, where phenotype parameters (e.g. gender, age, height, weight) control blendshapes spanning a wide range of human forms--across ages (from infants to elders), body types, and proportions. Calibrated using WHO population statistics, it provides realistic and demographically grounded human shape variation within a single unified model. Thanks to its openness and semantic control, Anny serves as a versatile foundation for 3D human modeling--supporting millimeter-accurate scan fitting, controlled synthetic data generation, and Human Mesh Recovery (HMR). We further introduce Anny-One, a collection of 800k photorealistic images generated with Anny, showing that despite its simplicity, HMR models trained with Anny can match the performance of those trained with scan-based body models. The Anny body model and its code are released under the Apache 2.0 license, making Anny an accessible foundation for human-centric 3D modeling.
中文标题/摘要
标题:Anny 人体网格建模
参数化人体模型为许多以人为中心的任务提供了结构基础,但现有模型往往依赖于昂贵的3D扫描和专有的、人口统计学上狭窄的学习形状空间。我们介绍了Anny,一个简单、完全可微分且无需扫描的人体模型,基于MakeHuman社区的人体测量学知识。Anny定义了一个连续、可解释的形状空间,其中表型参数(如性别、年龄、身高、体重)控制着混合形状,覆盖不同年龄段(从婴儿到老人)、体型和比例的广泛人体形态。通过使用WHO的人口统计数据进行校准,它在单一统一模型中提供了真实且有人口统计学依据的人体形状变化。得益于其开放性和语义控制,Anny成为3D人体建模的多功能基础,可支持毫米级精度的扫描拟合、受控合成数据生成以及人体网格恢复(HMR)。我们还介绍了Anny-One,一个使用Anny生成的80万张逼真图像的集合,表明尽管Anny很简单,使用它训练的HMR模型可以与使用基于扫描的人体模型训练的模型性能相当。Anny人体模型及其代码在Apache 2.0许可证下发布,使Anny成为以人为中心的3D建模的易于获取的基础。
Summary / 总结
The research aims to develop a simple, open, and anthropometrically grounded human body model that can support a wide range of applications in 3D human modeling. The method involves creating Anny, a fully differentiable human body model using anthropometric knowledge from the MakeHuman community, which defines a continuous shape space controlled by phenotype parameters. Key findings include Anny's ability to provide realistic human shape variation, support millimeter-accurate scan fitting, and generate photorealistic images, demonstrating its effectiveness in Human Mesh Recovery (HMR) models. Anny and its code are released under the Apache 2.0 license, making it accessible for further research and development.
研究旨在提供一种低成本且开源的人体模型,能够表示各种人体形态。引入了Anny,这是一种完全可微的人体模型,利用了人体测量学知识,并使用WHO的人口统计数据进行了校准。该模型允许实现现实且基于人口统计的人体形状变化,并支持各种应用,如扫描拟合、合成数据生成和人体网格恢复。实验表明,使用Anny训练的HMR模型在性能上与基于扫描的人体模型训练的模型相当,尽管其结构简单。
Step-GUI Technical Report
Authors: Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yifan Sui, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zihan Yan, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang
First: 2025-12-17T13:26:30+00:00 · Latest: 2025-12-19T17:36:21+00:00
Comments: 41 pages, 26 figures
Abstract
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
中文标题/摘要
标题:Step-GUI技术报告
多模态大型语言模型的最新进展为GUI自动化带来了前所未有的机会。然而,一个基本挑战仍然存在:如何高效地获取高质量的训练数据并保持注释可靠性?我们引入了一种由校准步骤奖励系统驱动的自我进化的训练管道,该系统通过轨迹级校准将模型生成的轨迹转化为可靠的训练信号,实现了超过90%的注释准确率,同时成本降低了10-100倍。利用该管道,我们介绍了Step-GUI这一系列模型(4B/8B),它们在GUI性能上达到了最先进的水平(8B:80.2% AndroidWorld,48.5% OSWorld,62.6% ScreenShot-Pro),同时保持了强大的通用能力。随着GUI代理能力的提升,实际部署需要在异构设备上标准化接口,同时保护用户隐私。为此,我们提出了GUI-MCP,这是第一个用于GUI自动化的模型上下文协议,具有分层架构,结合了低级原子操作和高级任务委托给本地专家模型,从而实现高隐私执行,敏感数据保留在设备上。最后,为了评估代理是否能够处理真实的日常使用,我们引入了AndroidDaily,这是一个基于真实移动使用模式的基准,包含3146个静态动作和235个端到端任务,覆盖高频日常场景(8B:静态89.91%,端到端52.50%)。我们的工作推进了实用GUI代理的发展,并展示了其在日常数字交互中实际部署的强大潜力。
Summary / 总结
The research aims to develop efficient and reliable methods for training GUI automation models. It introduces a self-evolving training pipeline using a Calibrated Step Reward System to generate high-quality training data at a lower cost, achieving over 90% annotation accuracy. The Step-GUI models, developed using this pipeline, show state-of-the-art performance in GUI tasks while maintaining robust general capabilities. Additionally, the work proposes GUI-MCP, a Model Context Protocol for GUI automation, which enhances privacy by keeping sensitive data on-device. The AndroidDaily benchmark evaluates the models' ability to handle real-world mobile usage scenarios, demonstrating strong potential for practical deployment.
研究旨在解决高效获取高质量GUI自动化训练数据的同时保持注释可靠性的挑战。它引入了一种自进化训练管道,使用校准步骤奖励系统,将模型生成的轨迹转化为可靠的训练信号,实现超过90%的注释准确率,并大幅降低成本。研究还展示了Step-GUI这一系列模型(4B/8B),在GUI任务中超越现有方法,同时保持强大的通用能力。此外,提出了GUI-MCP模型上下文协议,结合低级原子操作和高级任务委托到本地专家模型,增强隐私性。研究还引入了AndroidDaily基准,基于实际移动使用模式,评估GUI代理在日常生活中的实用性。
RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
Authors: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
First: 2025-07-29T20:35:35+00:00 · Latest: 2025-12-19T17:35:31+00:00
Abstract
Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish, Chinese, Korean, and Japanese) show that RLfR consistently outperforms strong MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation, and also achieves superior performance under LLM-based judge evaluations.
中文标题/摘要
标题:基于教师模型精炼的强化学习:面向机器翻译的渐进式模仿学习
机器翻译(MT)的偏好学习方法,如直接偏好优化(DPO),已经显示出强大的增益,但通常依赖于大量精心策划的偏好三元组,并且往往难以在调优领域之外进行泛化。我们提出了基于教师模型精炼的强化学习(RLfR),它用由冻结教师生成的、以actor为条件的同策略(on-policy)精炼结果来替代静态三元组。在每一步中,actor采样候选翻译,教师对每个草稿进行最小的局部编辑,actor则通过复合奖励得到强化以缩小差距;该奖励结合了衡量词汇和结构保真度的缩放负编辑距离与衡量语义充分性的COMET。这种形式化提供了一种稳定且模型感知的学习信号,而无需显式的偏好数据集。在FLORES-200(英语到德语、西班牙语、汉语、韩语和日语)上的实验表明,RLfR 一致地优于强大的MT-SFT、DPO和固定参考RL基线,提高了语义质量和实体保留,并且在基于LLM的评判下也实现了更好的性能。
Summary / 总结
The research aims to improve machine translation by addressing the limitations of existing preference-learning methods, which often require large, curated datasets and struggle with generalization. The proposed method, Reinforcement Learning from Teacher-Model Refinement (RLfR), uses on-policy, actor-conditioned refinements produced by a frozen teacher to generate a stable learning signal. Experiments on FLORES-200 show that RLfR outperforms strong baselines in terms of semantic quality and entity preservation, and performs well under LLM-based evaluations.
论文针对机器翻译中偏好学习方法的局限性,如需要大量且精心策划的偏好数据集以及其有限的泛化能力,提出了RLfR方法,用冻结教师针对actor草稿生成的同策略(on-policy)精炼结果替换静态三元组,并通过结合词汇和结构保真度(缩放负编辑距离)与语义充分性(COMET)的复合奖励来强化actor。实验结果显示,RLfR在FLORES-200上优于强基线,提升了语义质量和实体保留,并在基于LLM的评估中表现出色。
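The composite reward described in this entry can be illustrated with a minimal sketch. It assumes a token-level Levenshtein distance, a length normalization, and weights `alpha` and `beta`, none of which are specified in the abstract; the COMET score is taken as a precomputed float rather than computed here, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            # deletion, insertion, substitution (free when items match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def composite_reward(draft, refinement, semantic_score, alpha=1.0, beta=1.0):
    """Scaled negative edit distance to the teacher's refinement plus a semantic score.

    `semantic_score` stands in for a COMET score computed elsewhere.
    The normalization and the weights alpha/beta are illustrative assumptions.
    """
    d = draft.split()
    r = refinement.split()
    norm = max(len(d), len(r), 1)  # avoid division by zero on empty strings
    edit_term = -levenshtein(d, r) / norm
    return alpha * edit_term + beta * semantic_score
```

A draft that already matches the teacher's refinement pays no edit penalty, so its reward reduces to the weighted semantic score; each token the teacher had to change lowers the reward.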
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Authors: Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald
First: 2025-12-19T17:22:35+00:00 · Latest: 2025-12-19T17:22:35+00:00
Abstract
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians' centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
中文标题/摘要
标题:Chorus:面向整体式3D高斯场景编码的多教师预训练
虽然3DGS已经作为一种高保真场景表示出现,但直接从其基元中编码丰富的通用特征仍然未被充分探索。我们通过引入Chorus来填补这一空白:这是一种多教师预训练框架,通过从2D基础模型中蒸馏互补信号,学习一个整体式的前馈3D高斯泼溅(3DGS)场景编码器。Chorus使用共享的3D编码器和教师特定的投影器,从语言对齐、通用和对象感知的教师中学习,鼓励形成一个共享的嵌入空间,捕捉从高层语义到细粒度结构的信号。我们在广泛的任务上评估了Chorus:开放词汇语义和实例分割、线性和解码器探针,以及数据高效监督。除了3DGS,我们还通过预训练一个仅使用高斯中心、颜色和估计法线作为输入的变体,在若干仅支持点云的基准上测试了Chorus。有趣的是,这个编码器表现出强大的迁移性能,并在训练场景数量减少39.9倍的情况下优于点云基线。最后,我们提出了一种渲染-蒸馏(render-and-distill)适应方法,以促进域外微调。我们的代码和模型将在发表后发布。
Summary / 总结
Chorus is a multi-teacher pretraining framework for encoding rich, general-purpose features directly from 3D Gaussian Splatting primitives, an area that remains under-explored. It uses a shared 3D encoder and teacher-specific projectors to distill from language-aligned, generalist, and object-aware 2D teachers, yielding a holistic scene encoder that captures signals from high-level semantics to fine-grained structure. Chorus is evaluated on open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision. A variant pretrained on only Gaussian centers, colors, and estimated normals transfers to point-cloud-only benchmarks and outperforms the point-cloud baseline while using 39.9 times fewer training scenes.
Chorus 是一个多教师预训练框架,旨在直接从 3D Gaussian Splatting (3DGS) 原始数据中编码丰富的通用特征。它使用共享的 3D 编码器和特定于教师的投影器,从不同类型的教师(如语言对齐、通用和对象感知模型)中学习互补信号。Chorus 在语义和实例分割、线性探针和解码器探针等多种任务中进行了评估,并显示出强大的迁移性能,同时使用了显著较少的训练场景,优于点云基线。
The Diffusion Duality
Authors: Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Venue: ICML 2025
First: 2025-06-12T16:55:35+00:00 · Latest: 2025-12-19T17:14:07+00:00
Comments: ICML 2025. We provide the code at: https://github.com/s-sahoo/duo [v3] includes improved theory, clearer presentation, and a new future work section
Abstract
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/duo
中文标题/摘要
标题:扩散二重性
均匀状态离散扩散模型由于其固有的自我纠正能力,有望实现快速文本生成。然而,它们通常被自回归模型和掩码扩散模型超越。在本工作中,我们通过利用一个关键见解来缩小这一性能差距:均匀状态扩散过程自然地源自底层的高斯扩散。我们的方法Duo将高斯扩散中的强大技术迁移过来,以改进训练和采样。首先,我们引入了一种由高斯过程指导的课程学习策略,通过减少方差将训练速度翻倍。使用课程学习训练的模型在7个基准中的3个上,零样本困惑度超越了自回归模型。其次,我们提出了离散一致性蒸馏,它将一致性蒸馏从连续设定适配到离散设定。该算法将采样速度提高两个数量级,从而解锁了扩散语言模型的少步生成。我们已在项目页面提供了代码、模型检查点和视频教程:http://s-sahoo.github.io/duo
Summary / 总结
The research aims to improve the performance of uniform-state discrete diffusion models by bridging them with Gaussian diffusion techniques. The method, named Duo, introduces a curriculum learning strategy and discrete consistency distillation to enhance training and sampling efficiency. Duo models outperform autoregressive models in zero-shot perplexity on three out of seven benchmarks and enable two orders of magnitude faster sampling in diffusion language models.
研究旨在通过将均匀状态离散扩散模型与高斯扩散技术相结合来提升其性能。方法名为Duo,引入了基于高斯过程的课程学习策略和离散一致性蒸馏,以提高训练和采样效率。Duo模型在七个基准中的三个上优于自回归模型,并使扩散语言模型的采样速度提高了两个数量级。
Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation
Authors: Binh Vu
First: 2025-12-19T17:01:03+00:00 · Latest: 2025-12-19T17:01:03+00:00
Abstract
The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.
中文标题/摘要
标题:智能知识挖掘框架:连接AI分析与可信赖保存
前所未有的数字数据激增在所有数据密集型领域中带来了访问、整合和价值创造的重大挑战。有价值的信息经常被封装在不同的系统、非结构化的文档和异构格式中,形成了阻碍高效利用和协作决策的孤岛。本文介绍了智能知识挖掘框架(IKMF),这是一个全面的概念模型,旨在弥合动态AI驱动分析与可信赖长期保存之间的关键差距。该框架提出了一种双流架构:一个水平的挖掘过程,系统地将原始数据转换为语义丰富、机器可操作的知识,以及一个并行的可信赖归档流,确保这些资产的完整、来源和计算再现性。通过定义这种共生关系的蓝图,本文为将静态存储库转变为促进从生产者到消费者流动可操作智能的活生态系统提供了基础模型。本文概述了驱动框架研究和开发的动机、问题陈述和关键研究问题,介绍了其基础科学方法,并详细描述了其概念设计和建模。
Summary / 总结
The paper introduces the Intelligent Knowledge Mining Framework (IKMF) to address the challenges of accessing and integrating digital data across data-intensive sectors. The framework proposes a dual-stream architecture: a horizontal Mining Process transforms raw data into semantically rich, machine-actionable knowledge, while a parallel Trustworthy Archiving Stream ensures the integrity, provenance, and computational reproducibility of these assets. The contribution is conceptual: the paper sets out the motivation, problem statement, research questions, methodology, and design of the framework, offering a blueprint for turning static repositories into living data ecosystems.
论文提出了智能知识挖掘框架(IKMF),以解决跨数据密集型领域访问和整合数字数据的挑战。该框架提出了一种双流架构,通过水平挖掘过程将原始数据转换为语义丰富、机器可操作的知识,并通过并行的可信归档流确保这些资产的完整性、来源可追溯性和计算可再现性。论文的贡献是概念性的:阐述了框架的动机、问题陈述、研究问题、方法论和设计,为将静态存储库转变为动态数据生态系统提供了蓝图。
Calibratable Disambiguation Loss for Multi-Instance Partial-Label Learning
Authors: Wei Tang, Yin-Fang Yang, Weijia Zhang, Min-Ling Zhang
First: 2025-12-19T16:58:31+00:00 · Latest: 2025-12-19T16:58:31+00:00
Abstract
Multi-instance partial-label learning (MIPL) is a weakly supervised framework that extends the principles of multi-instance learning (MIL) and partial-label learning (PLL) to address the challenges of inexact supervision in both instance and label spaces. However, existing MIPL approaches often suffer from poor calibration, undermining classifier reliability. In this work, we propose a plug-and-play calibratable disambiguation loss (CDL) that simultaneously improves classification accuracy and calibration performance. The loss has two instantiations: the first one calibrates predictions based on probabilities from the candidate label set, while the second one integrates probabilities from both candidate and non-candidate label sets. The proposed CDL can be seamlessly incorporated into existing MIPL and PLL frameworks. We provide a theoretical analysis that establishes the lower bound and regularization properties of CDL, demonstrating its superiority over conventional disambiguation losses. Experimental results on benchmark and real-world datasets confirm that our CDL significantly enhances both classification and calibration performance.
中文标题/摘要
标题:用于多实例部分标签学习的可校准消歧损失
多实例部分标签学习(MIPL)是一种弱监督框架,将多实例学习(MIL)和部分标签学习(PLL)的原则扩展到解决实例和标签空间中的不精确监督挑战。然而,现有的MIPL方法往往校准效果不佳,影响分类器的可靠性。本文提出了一种即插即用的可校准消歧损失(CDL),同时提高分类准确性和校准性能。该损失有两个实例化:第一个基于候选标签集的概率校准预测,而第二个则结合了候选标签集和非候选标签集的概率。提出的CDL可以无缝集成到现有的MIPL和PLL框架中。我们提供了理论分析,建立了CDL的下界和正则化性质,证明了其相对于传统消歧损失的优越性。基准数据集和真实世界数据集上的实验结果证实,我们的CDL显著提高了分类和校准性能。
Summary / 总结
This work addresses poor calibration in multi-instance partial-label learning (MIPL) by proposing a plug-and-play calibratable disambiguation loss (CDL). The loss has two instantiations: one calibrates predictions using probabilities from the candidate label set, and the other integrates probabilities from both candidate and non-candidate label sets. A theoretical analysis establishes the lower bound and regularization properties of CDL, and experiments on benchmark and real-world datasets show that it improves both classification and calibration performance over conventional disambiguation losses.
研究旨在通过提出可校准的消歧损失(CDL)来解决多实例部分标签学习(MIPL)中的校准问题。该方法包括两种CDL的实例化,可以同时提高分类准确性和校准性能。理论分析表明,CDL具有下界和正则化特性,优于传统的消歧损失。基准和真实世界数据集上的实验结果表明,CDL显著提高了分类和校准性能。
HGQ: High Granularity Quantization for Real-time Neural Networks on FPGAs
Authors: Chang Sun, Zhiqiang Que, Thea K. Årrestad, Vladimir Loncar, Jennifer Ngadiuba, Wayne Luk, Maria Spiropulu
First: 2024-05-01T17:18:46+00:00 · Latest: 2025-12-19T16:57:39+00:00
Abstract
Neural networks with sub-microsecond inference latency are required by many critical applications. Targeting such applications deployed on FPGAs, we present High Granularity Quantization (HGQ), a quantization-aware training framework that optimizes parameter bit-widths through gradient descent. Unlike conventional methods, HGQ determines the optimal bit-width for each parameter independently, making it suitable for hardware platforms supporting heterogeneous arbitrary precision arithmetic. In our experiments, HGQ shows superior performance compared to existing network compression methods, achieving orders of magnitude reduction in resource consumption and latency while maintaining the accuracy on several benchmark tasks. These improvements enable the deployment of complex models previously infeasible due to resource or latency constraints. HGQ is open-source and is used for developing next-generation trigger systems at the CERN ATLAS and CMS experiments for particle physics, enabling the use of advanced machine learning models for real-time data selection with sub-microsecond latency.
中文标题/摘要
标题:HGQ:面向FPGA上实时神经网络的高粒度量化
许多关键应用需要具有亚微秒推理延迟的神经网络。针对部署在FPGA上的此类应用,我们提出了高粒度量化(HGQ),这是一种量化感知训练框架,通过梯度下降优化参数位宽。与传统方法不同,HGQ 独立确定每个参数的最佳位宽,使其适用于支持异构任意精度算术的硬件平台。在我们的实验中,HGQ 表现出优于现有网络压缩方法的性能,在多个基准任务上保持准确性的同时,将资源消耗和延迟降低了数个数量级。这些改进使得此前因资源或延迟限制而不可行的复杂模型得以部署。HGQ 是开源的,并被用于为CERN的ATLAS和CMS粒子物理实验开发下一代触发系统,使先进的机器学习模型能够以亚微秒延迟进行实时数据选择。
Summary / 总结
The research aims to develop a framework for real-time neural network inference with sub-microsecond latency on FPGAs. High Granularity Quantization (HGQ) is introduced, a quantization-aware training method that optimizes each parameter's bit-width independently. HGQ outperforms existing methods by significantly reducing resource consumption and latency while maintaining accuracy. This enables the deployment of complex models that were previously impractical due to resource or latency constraints. The method is open-source and used in particle physics experiments at CERN for real-time data selection.
研究旨在开发一种框架,使神经网络在FPGA上实现亚微秒级的实时推理。提出了高粒度量化(HGQ)方法,这是一种量化感知训练方法,能够独立优化每个参数的位宽。HGQ在减少资源消耗和延迟的同时保持了准确性,显著优于现有方法,使得复杂模型的部署成为可能,这些模型因资源或延迟限制而此前不可行。该方法是开源的,并被用于CERN的ATLAS和CMS实验中的粒子物理数据实时筛选。
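To make the idea of per-parameter bit-widths concrete, here is a minimal sketch, not the HGQ library's API. It assumes a simple fixed-point scheme in which each weight gets its own number of fractional bits, and it uses the sum of those bits as a crude resource proxy. The bit-widths are set by hand here, whereas HGQ learns them by gradient descent during training; the function names are hypothetical.

```python
def quantize_fixed_point(weights, frac_bits):
    """Round each weight to its own fixed-point grid with step 2**-f.

    `frac_bits[i]` is the number of fractional bits assigned to `weights[i]`
    (an illustrative assumption about the number format).
    """
    quantized = []
    for w, f in zip(weights, frac_bits):
        step = 2.0 ** -f
        quantized.append(round(w / step) * step)
    return quantized


def total_bits(frac_bits):
    """Crude resource proxy: total fractional bits spent across all parameters."""
    return sum(frac_bits)


# The same value quantized at three different precisions.
weights = [0.3, 0.3, 0.3]
bits = [0, 1, 3]
print(quantize_fixed_point(weights, bits))  # [0.0, 0.5, 0.25]
print(total_bits(bits))                     # 4
```

Giving a parameter zero bits rounds it to an integer (here, 0), which is how a fine-grained scheme of this kind can spend precision only where it matters for accuracy.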
Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras
Authors: Ami Pandat, Punna Rajasekhar, G. Aravamuthan, Gopika Vinod, Rohit Shukla
First: 2025-12-19T16:54:43+00:00 · Latest: 2025-12-19T16:54:43+00:00
Abstract
Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly for long distances. Various stereo-camera based 3D localization methods are available but are limited to few hundreds of meters' range. This is majorly due to the limitation of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for modeling a suitable distortion model that can be used for localizing the objects at longer distances. It is well known that neural networks can be a better alternative to model a highly complex non-linear lens distortion function; on contrary, it is observed that a direct application of neural networks to distortion models fails to converge to estimate the camera parameters. To resolve this, a hybrid approach is presented in this paper where the conventional distortion models are initially extended to incorporate higher-order terms and then enhanced using neural network based residual correction model. This hybrid approach has substantially improved long-range localization performance and is capable of estimating the 3D position of objects at distances up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and are plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of proposed framework, offering a practical solution to calibrate CCTV cameras for long-range photogrammetry applications.
中文标题/摘要
标题:基于学习混合失真模型的长距离深度估计方法在CCTV摄像头中的应用
准确的相机模型对于摄影测量应用如三维建模和物体定位至关重要,特别是在长距离范围内。各种基于立体相机的三维定位方法可用,但其范围仅限于几百米。这主要是由于假设的失真模型无法处理相机镜头中存在的非线性。本文提出了一种建模合适失真模型的框架,可用于在更长距离内定位物体。众所周知,神经网络可以更好地用于建模高度复杂的非线性镜头失真函数;然而,直接将神经网络应用于失真模型无法收敛以估计相机参数。为了解决这一问题,本文提出了一种混合方法,其中传统的失真模型首先扩展以包含更高阶项,然后通过基于神经网络的残差校正模型进行增强。这种混合方法在长距离定位性能方面有了显著改进,并能够估计到5公里距离的物体3D位置。估计的3D坐标被转换为GIS坐标并在GIS地图上进行可视化。实验验证表明所提出框架的稳健性和有效性,提供了一种实际解决方案,用于校准用于长距离摄影测量应用的CCTV摄像头。
Summary / 总结
This paper addresses the need for accurate camera models in photogrammetry applications, especially for long-range 3D localization. It proposes a hybrid approach combining conventional distortion models with neural network-based residual correction to improve long-range depth estimation. The method can estimate 3D positions of objects up to 5 kilometers away, and the results are visualized on a GIS map. Experiments validate the robustness and effectiveness of the proposed framework for calibrating CCTV cameras in long-range photogrammetry.
论文针对传统立体相机方法在长距离下精度不足的问题,提出了一种结合传统畸变模型和基于神经网络的残差校正的混合方法,以估计相机参数。该方法显著提高了长距离定位性能,能够在5公里范围内准确估计3D位置,并将结果可视化在GIS地图上。
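The hybrid structure described in this entry, a conventional distortion model extended with higher-order terms plus a learned residual correction, can be sketched as follows. The sketch assumes a purely radial polynomial model on normalized image coordinates, which the abstract does not specify, and it represents the neural residual as an arbitrary callable rather than a trained network. All names are hypothetical.

```python
def radial_distort(x, y, coeffs):
    """Apply x_d = x * (1 + k1*r^2 + k2*r^4 + ...) with any number of terms.

    `coeffs` holds k1, k2, ...; supplying more coefficients adds higher-order terms.
    """
    r2 = x * x + y * y
    factor = 1.0
    power = r2
    for k in coeffs:
        factor += k * power
        power *= r2
    return x * factor, y * factor


def hybrid_distort(x, y, coeffs, residual):
    """Polynomial distortion plus a residual correction.

    `residual(x, y)` returns (dx, dy); in the paper this role is played by a
    neural network, here it is any callable supplied by the caller.
    """
    xd, yd = radial_distort(x, y, coeffs)
    dx, dy = residual(x, y)
    return xd + dx, yd + dy


# Usage with a hand-written stand-in for the learned residual.
print(hybrid_distort(1.0, 0.0, [0.5, 0.25], lambda x, y: (0.25, -0.5)))  # (2.0, -0.5)
```

Keeping the polynomial part explicit means the residual only has to model what the parametric model misses, which is one plausible reading of why the hybrid converges where a direct neural model does not.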
UrbanDIFF: A Denoising Diffusion Model for Spatial Gap Filling of Urban Land Surface Temperature Under Dense Cloud Cover
Authors: Arya Chavoshi, Hassan Dashtian, Naveen Sudharsan, Dev Niyogi
First: 2025-12-19T16:51:29+00:00 · Latest: 2025-12-19T16:51:29+00:00
Abstract
Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep learning based spatial models deteriorate rapidly with increasing missingness. Recent advances in denoising diffusion based image inpainting models have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with revealed cloud free pixels through a supervised pixel guided refinement step during inference. UrbanDIFF is trained and evaluated using NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
中文标题/摘要
标题:UrbanDIFF:一种用于密集云层覆盖下城市地表温度空间缺失填补的去噪扩散模型
基于卫星的地表温度(LST)产品对于地表城市热岛(SUHI)监测至关重要,因为它们可以提供大都市区域的一致网格覆盖。然而,云层污染经常遮挡LST观测,限制了其在连续SUHI分析中的应用。大多数现有的LST重建方法依赖于多时相信息或多传感器数据融合,所需的辅助观测在持续云层覆盖下可能不可用或不可靠。纯空间缺失填补方法提供了一种替代方案,但传统的统计方法在缺失区域较大或空间连续时性能退化,而许多基于深度学习的空间模型随着缺失比例增加迅速退化。最近基于去噪扩散的图像修复模型的进展表明,它们在高缺失比例下具有更好的鲁棒性,这促使它们被用于空间LST重建。在本文中,我们介绍了UrbanDIFF,这是一种纯空间的去噪扩散模型,用于重建受云污染的城市LST影像。该模型以静态的城市结构信息(包括建成表面数据和数字高程模型)为条件,并在推理期间通过监督式像素引导细化步骤,强制与已揭示的无云像素严格一致。UrbanDIFF使用2002年至2025年期间美国七个主要大都市地区的NASA MODIS Terra LST数据进行训练和评估。使用覆盖率为20%至85%的合成云掩模进行的实验表明,UrbanDIFF始终优于插值基线,在密集云遮挡下尤为明显:在85%云覆盖下,SSIM为0.89,RMSE为1.2 K,R2为0.84,并且随着云密度增加,其性能退化更慢。
Summary / 总结
UrbanDIFF is a purely spatial denoising diffusion model designed to fill gaps in urban land surface temperature (LST) imagery under dense cloud cover. It is conditioned on static urban structure information and uses a supervised pixel-guided refinement step during inference to enforce consistency with revealed cloud-free pixels. Experiments with synthetic cloud masks covering 20 to 85 percent of the image show that UrbanDIFF consistently outperforms an interpolation baseline, especially under dense cloud, reaching SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84 at 85 percent cloud coverage.
UrbanDIFF 是一种去噪扩散模型,旨在填补在浓云覆盖下城市地表温度(LST)数据的空间缺失。它利用静态的城市结构信息,并在推断过程中通过监督像素引导的细化步骤确保与已揭示的无云像素的一致性。实验表明,UrbanDIFF 在浓云覆盖下优于插值方法,特别是在 85% 云覆盖率时,SSIM 为 0.89,RMSE 为 1.2 K,R2 为 0.84。
LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence
Authors: Yohanes Yudhi Adikusuma, Qixing Huang, Ying He
First: 2025-12-19T16:50:52+00:00 · Latest: 2025-12-19T16:50:52+00:00
Abstract
Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDFs) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300$\times$ compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000$\times$ speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.
中文标题/摘要
标题:LiteGE:用于高效测地线计算和非等距形状对应的轻量级测地嵌入
在3D视觉和几何处理中,计算3D曲面上的测地距离是许多任务的基础,与形状对应等任务有着深刻的联系。最近的基于学习的方法虽然表现强大,但依赖于大型的3D骨干网络,导致高内存使用和延迟,这限制了它们在交互式或资源受限环境中的应用。我们提出了LiteGE,这是一种轻量级的方法,通过在信息性体素上对无符号距离场(UDF)样本应用PCA来构建紧凑的、类别感知的形状描述符。该描述符计算高效,并消除了对高容量网络的需求。LiteGE 在稀疏点云上保持鲁棒性,支持输入点数低至300个的场景,而先前的方法在此失败。大量实验表明,与现有神经网络方法相比,LiteGE 可将内存使用和推理时间减少多达300倍。此外,通过利用测地距离与形状对应之间的内在关系,LiteGE 使形状匹配变得快速而准确。我们的方法相比最先进的基于网格的方法实现了高达1000倍的速度提升,同时在非等距形状对上保持了相当的准确性,包括对点云输入的评估。
Summary / 总结
LiteGE is a lightweight method for computing geodesic distances on 3D surfaces. It constructs compact, category-aware shape descriptors by applying PCA to UDF samples at informative voxels, which removes the need for high-capacity networks. Compared with existing neural approaches, it reduces memory usage and inference time by up to 300x, and it stays robust on sparse point clouds with as few as 300 points. For shape matching, it achieves up to 1000x speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs.
LiteGE 是一种轻量级方法,用于在 3D 表面计算测地线距离,通过在重要体素上对 UDF 样本应用 PCA 构建紧凑的形状描述符。该方法将内存使用和推理时间减少多达 300 倍,相比现有神经方法,并且能够实现快速且准确的形状匹配,相比最先进的基于网格的方法,速度提高多达 1000 倍。
On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness
Authors: Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
First: 2025-08-13T13:47:34+00:00 · Latest: 2025-12-19T16:47:41+00:00
Abstract
Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remains poorly understood. Most existing analysis characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noisy robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.
中文标题/摘要
标题:CLIP 纹理-形状偏见动态演变及其与人类对齐和模型鲁棒性的关系
对比语言-图像模型如CLIP展示了卓越的泛化能力。然而,它们在训练过程中内部视觉表示如何演变以及这种演变如何与人类感知相关的问题仍然知之甚少。现有大多数分析仅针对完全训练好的模型,而表征偏见和感知对齐的动态变化则很少被探索。在本研究中,我们对CLIP模型在整个训练过程中进行了逐轮次(epoch)的分析,重点关注纹理-形状偏见的演变、与人类感知判断的对齐以及对图像噪声的敏感性。通过涵盖低级图像质量评估、中级感知相似性、显著性对应和噪声鲁棒性的多个感知基准,我们发现了一种一致的、与训练阶段相关的表征转变。早期训练阶段表现出强烈的纹理偏见、与低级人类感知度量的较高对齐以及对高斯噪声扰动更高的敏感性。随着训练的进行,这种纹理偏见逐渐减少,更倾向于基于形状的表示,同时噪声鲁棒性提高,低级感知对齐下降。重要的是,这些动态在多个CLIP模型规模中一致出现,表明这一现象并非特定于某种架构规模。我们的研究结果提供了关于感知对齐、特征偏见和鲁棒性如何在多模态模型训练中共同演变的实证描述。这项工作揭示了早期低级感知对齐与后期鲁棒性之间的系统性权衡,为视觉-语言模型的表示动力学及其与人类视觉处理的关系提供了新的见解。
Summary / 总结
This study analyzes the evolution of CLIP models during training, focusing on the development of texture-shape bias, alignment with human perception, and robustness to image noise. The research reveals that early training stages show strong texture bias and high alignment with low-level perceptual measures, but as training progresses, the models shift towards shape-based representations, improving robustness to noise and decreasing low-level perceptual alignment.
研究分析了CLIP模型在训练过程中纹理-形状偏见和知觉对齐的变化,使用了多个知觉基准。早期训练阶段表现出强烈的纹理偏见和对噪声的敏感性,而后期阶段则表现出更多基于形状的表示和增强的鲁棒性。这些动态在不同模型规模中是一致的,表明早期的知觉对齐与后期的鲁棒性之间存在一般性的权衡。
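The abstract does not say how texture-shape bias is scored. A common convention on cue-conflict images (where the shape and the texture come from different classes) is to report the fraction of shape decisions among all trials decided for either cue. The sketch below assumes that convention; the function name and the tuple layout are illustrative, not taken from the paper.

```python
from collections import Counter


def shape_bias(predictions):
    """Fraction of shape decisions among trials decided for shape or texture.

    predictions: iterable of (predicted, shape_label, texture_label) tuples,
    one per cue-conflict image. This layout is a hypothetical convenience.
    """
    counts = Counter()
    for predicted, shape_label, texture_label in predictions:
        if shape_label == texture_label:
            continue  # not a cue-conflict image, carries no bias signal
        if predicted == shape_label:
            counts["shape"] += 1
        elif predicted == texture_label:
            counts["texture"] += 1
        # predictions matching neither cue are ignored
    decided = counts["shape"] + counts["texture"]
    if decided == 0:
        raise ValueError("no trial was decided for either cue")
    return counts["shape"] / decided


trials = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("dog", "cat", "elephant"),       # neither cue, ignored
    ("cat", "cat", "elephant"),       # shape decision
]
bias = shape_bias(trials)
```

Evaluating this score on checkpoints saved after each epoch would give the kind of training curve the abstract describes, with values near 0 indicating texture-driven and values near 1 indicating shape-driven decisions.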
MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
Authors: Saikat Roy, Yannick Kirchhoff, Constantin Ulrich, Maximillian Rokuss, Tassilo Wald, Fabian Isensee, Klaus Maier-Hein
First: 2025-12-19T16:45:23+00:00 · Latest: 2025-12-19T16:45:23+00:00
Abstract
Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from-scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full fine-tuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D medical image segmentation. Our code and pretrained models are made available with the official nnUNet repository at: https://www.github.com/MIC-DKFZ/nnUNet
中文标题/摘要
标题:MedNeXt-v2: 为医学图像分割中的大规模监督表示学习扩展3D ConvNeXts
大规模监督预训练正在迅速重塑3D医学图像分割。然而,现有努力主要集中在增加数据集规模上,而忽视了骨干网络在大规模下是否是一个有效的表示学习者的问题。在本文中,我们通过重新审视基于ConvNeXt的体素分割架构并引入MedNeXt-v2,一种复合扩展的3D ConvNeXt,利用改进的微架构和数据扩展来实现最先进的性能来填补这一空白。首先,我们展示了常规用于大规模预训练管道中的骨干网络往往是次优的。随后,我们在扩展之前进行了全面的骨干网络基准测试,并证明了从头开始的更强性能可靠地预测了预训练后的更强下游性能。根据这些发现,我们引入了3D全局响应归一化模块,并通过深度、宽度和上下文扩展来改进我们的架构,以实现有效的表示学习。我们在18000个CT体积上预训练了MedNeXt-v2,并在六个具有挑战性的CT和MR基准(144种结构)上进行微调,展示了相对于七个公开发布的预训练模型的一致改进。除了改进之外,我们对这些模型的基准测试还揭示了更强的骨干网络在相似数据上表现更好,表示扩展在病理分割中具有不成比例的好处,以及模态特定的预训练在完全微调后几乎没有益处。总之,我们的结果确立了MedNeXt-v2作为3D医学图像分割中大规模监督表示学习的强骨干。我们的代码和预训练模型已与官方nnUNet仓库一起提供:https://www.github.com/MIC-DKFZ/nnUNet
Summary / 总结
This work addresses the gap in large-scale supervised pretraining for 3D medical image segmentation by revisiting ConvNeXt-based architectures and introducing MedNeXt-v2. The authors show that commonly used backbones are often suboptimal and that stronger from-scratch performance predicts better downstream performance. MedNeXt-v2 incorporates a 3D Global Response Normalization module and uses depth, width, and context scaling to enhance representation learning. Pretraining MedNeXt-v2 on 18k CT volumes, the model demonstrates state-of-the-art performance across six challenging benchmarks, showing consistent gains over seven publicly released pretrained models.
该研究通过重新审视ConvNeXt基架构并引入MedNeXt-v2,填补了大规模监督预训练在3D医学图像分割中的空白。作者表明,常用的骨干网络往往不够优化,并且从零开始的更强性能预测了更好的下游性能。MedNeXt-v2引入了3D全局响应归一化模块,并使用深度、宽度和上下文缩放来增强表示学习。在18k CT数据集上预训练MedNeXt-v2后,该模型在六个具有挑战性的基准测试中表现出最先进的性能,显示出比七个公开发布的预训练模型的一致改进。
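The abstract names a 3D Global Response Normalization (GRN) module without spelling it out. The sketch below follows the GRN formulation from ConvNeXt V2 and simply aggregates over all three spatial axes; whether MedNeXt-v2 uses exactly this form is an assumption. Because GRN only needs a per-channel L2 norm over the spatial positions, each channel is passed as a flat list of D*H*W voxel activations to keep the example dependency-free.

```python
import math


def grn_3d(x, gamma, beta, eps=1e-6):
    """Global Response Normalization over a 3D feature map (sketch).

    x:     list of C channels, each a flat list of D*H*W activations.
    gamma: per-channel learnable scale (list of C floats).
    beta:  per-channel learnable bias (list of C floats).
    """
    # 1) Global aggregation: L2 norm of each channel over all voxels.
    norms = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # 2) Divisive normalization: each norm relative to the mean norm.
    mean_norm = sum(norms) / len(norms)
    out = []
    for channel, g, b, n in zip(x, gamma, beta, norms):
        scale = n / (mean_norm + eps)
        # 3) Calibration with a residual connection back to the input.
        out.append([g * (v * scale) + b + v for v in channel])
    return out


# Two channels on a 2x2x2 volume, flattened to 8 voxels each.
features = [[1.0] * 8, [2.0] * 8]
normalized = grn_3d(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` initialized to zero, as in ConvNeXt V2, the module starts as an identity mapping, so it can be added to an existing block without disturbing its initial behavior.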
Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image
Authors: Simon Giebenhain, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Zhe Chen, Matthias Nießner
Venue: www
First: 2025-12-19T16:44:32+00:00 · Latest: 2025-12-19T16:44:32+00:00
Comments: Project website: https://simongiebenhain.github.io/Pix2NPHM/ , Video: https://www.youtube.com/watch?v=MgpEJC5p1Ts
Abstract
Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3D morphable models (3DMMs) to facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters, given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates; geometric fidelity can also be improved by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.
中文标题/摘要
标题:Pix2NPHM:从单张图像学习回归NPHM重建
神经参数头部模型(NPHMs)是相对于基于网格的3D可变形模型(3DMMs)的一项最新进展,能够表达高保真几何细节。然而,由于其底层隐空间表达能力很强,将NPHMs拟合到视觉输入中非常具有挑战性。为此,我们提出了一种名为Pix2NPHM的视觉变换器(ViT)网络,该网络可以直接从单张图像中回归NPHM参数。与现有方法相比,神经参数空间使我们的方法能够重建更可识别的面部几何结构和准确的面部表情。为了实现广泛的泛化,我们利用特定领域的ViTs作为骨干网络,这些网络在几何预测任务上进行了预训练。我们使用混合数据训练Pix2NPHM,包括总计超过10万个NPHM配准的3D数据,这些数据可以在SDF空间中提供直接监督,以及大规模的2D视频数据集,其中法线估计作为伪真值几何。Pix2NPHM不仅允许以交互帧率进行3D重建,还可以通过后续在推理时针对估计的表面法线和规范点图进行优化来提高几何保真度。因此,我们实现了前所未有的面部重建质量,可以在野外数据上大规模运行。
Summary / 总结
Pix2NPHM is a vision transformer network that directly regresses Neural Parametric Head Model (NPHM) parameters from a single image, addressing the challenges of fitting NPHMs to visual inputs. The method uses domain-specific pretrained ViTs and a mixture of 3D and 2D data for training, enabling high-fidelity 3D reconstructions with accurate facial geometry and expressions. It can achieve interactive frame rates and further improve geometric fidelity through inference-time optimization, resulting in superior face reconstruction quality on in-the-wild data.
Pix2NPHM通过提出一个视觉变换器网络直接从单张图像中回归神经参数化头部模型(NPHM)参数来解决将NPHM拟合到图像的挑战。该方法使用特定领域的ViT骨干网络并在几何预测任务上进行预训练,并结合3D和2D数据集进行训练,实现了高保真的3D重建,具有准确的面部几何和表情。通过后续使用表面法线和标准点图进行优化,进一步提高了重建的几何保真度,从而在大规模野外数据上实现了前所未有的面部重建质量,并能在交互帧率下运行。
Can You Hear Me Now? A Benchmark for Long-Range Graph Propagation
Authors: Luca Miglior, Matteo Tolloso, Alessio Gravina, Davide Bacciu
First: 2025-12-19T16:34:27+00:00 · Latest: 2025-12-19T16:34:27+00:00
Abstract
Effectively capturing long-range interactions remains a fundamental yet unresolved challenge in graph neural network (GNN) research, critical for applications across diverse fields of science. To systematically address this, we introduce ECHO (Evaluating Communication over long HOps), a novel benchmark specifically designed to rigorously assess the capabilities of GNNs in handling very long-range graph propagation. ECHO includes three synthetic graph tasks, namely single-source shortest paths, node eccentricity, and graph diameter, each constructed over diverse and structurally challenging topologies intentionally designed to introduce significant information bottlenecks. ECHO also includes two real-world datasets, ECHO-Charge and ECHO-Energy, which define chemically grounded benchmarks for predicting atomic partial charges and molecular total energies, respectively, with reference computations obtained at the density functional theory (DFT) level. Both tasks inherently depend on capturing complex long-range molecular interactions. Our extensive benchmarking of popular GNN architectures reveals clear performance gaps, emphasizing the difficulty of true long-range propagation and highlighting design choices capable of overcoming inherent limitations. ECHO thereby sets a new standard for evaluating long-range information propagation, also providing a compelling example for its need in AI for science.
中文标题/摘要
标题:你能听到我吗?长距离图传播基准
有效地捕捉长距离相互作用仍然是图神经网络(GNN)研究中的一个基本但未解决的挑战,对于跨多个科学领域的应用至关重要。为系统地解决这一问题,我们引入了ECHO(评估长跳通信),这是一种新型基准,专门设计用于严格评估GNN在处理非常长距离图传播方面的能力。ECHO 包含三个合成图任务,分别是单源最短路径、节点离心率和图直径,每个任务都构建在多样且结构上具有挑战性的拓扑上,这些拓扑被有意设计为引入显著的信息瓶颈。ECHO 还包括两个真实世界数据集,ECHO-Charge 和 ECHO-Energy,它们分别定义了基于化学的基准,用于预测原子部分电荷和分子总能量,参考计算在密度泛函理论(DFT)水平上获得。两个任务都内在地依赖于捕捉复杂的长距离分子相互作用。我们对流行的GNN架构的广泛基准测试揭示了明显的性能差距,强调了真正长距离传播的难度,并突显了能够克服固有限制的设计选择。ECHO 因此为评估长距离信息传播设定了新的标准,同时也为这种评估在科学智能(AI for Science)中的必要性提供了有力的示例。
Summary / 总结
The paper introduces ECHO, a benchmark designed to evaluate GNNs' ability to handle long-range graph propagation. It includes both synthetic and real-world datasets to test GNNs on tasks that require capturing complex long-range interactions. The benchmarking of popular GNN architectures reveals significant performance gaps, indicating the challenges in true long-range propagation and highlighting the need for better design choices to address these limitations.
论文提出了ECHO基准,旨在评估图神经网络处理长距离交互的能力。该基准包含合成和真实世界的数据集,用于测试GNN在需要捕获复杂长距离分子交互的任务上的表现。对流行GNN架构的基准测试显示了显著的性能差距,表明真正的长距离传播的难度,并强调了需要更好的设计选择来克服这些限制。
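The three synthetic ECHO targets (single-source shortest paths, node eccentricity, graph diameter) are classical graph quantities, and seeing how they are computed makes clear why they demand long-range propagation. The sketch below assumes an unweighted, connected, undirected graph given as an adjacency dictionary; it illustrates the target definitions only and is not the benchmark's own label-generation code.

```python
from collections import deque


def shortest_paths(adj, source):
    """Hop distance from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(adj, node):
    """Largest hop distance from node to any other node."""
    return max(shortest_paths(adj, node).values())


def diameter(adj):
    """Largest eccentricity over all nodes."""
    return max(eccentricity(adj, n) for n in adj)


# Path graph 0 - 1 - 2 - 3 - 4: information must cross every edge.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
dist_from_0 = shortest_paths(path_graph, 0)
ecc_of_2 = eccentricity(path_graph, 2)
diam = diameter(path_graph)
```

A message-passing GNN with k layers can only move information k hops, so predicting any of these targets on a graph whose diameter exceeds k requires either more layers or an architecture with longer-range communication.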
AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
Authors: Zhihan Zhou, Daqian Shi, Rui Song, Lida Shi, Xiaolei Diao, Hao Xu
First: 2025-12-19T16:28:57+00:00 · Latest: 2025-12-19T16:28:57+00:00
Abstract
Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
中文标题/摘要
标题:AncientBench:针对出土和传世中文语料的全面评估
理解古代文献在考古学和中国历史与文明理解中起着重要作用。大型语言模型的快速发展需要能够评估其对古代文字理解能力的基准。现有的中文基准主要针对现代中文和古代中文传世文献,但出土文献部分未被涵盖。为满足这一需求,我们提出了AncientBench,旨在评估古代文字的理解能力,特别是在出土文献场景中的理解能力。AncientBench分为四个维度,对应古代文字理解的四种能力:字形理解、读音理解、意义理解以及语境理解。基准还包含十个任务,包括部首、声旁、同音词、填空、翻译等,提供了一个全面的评估框架。我们召集了考古研究人员进行实验评估,提出了一个古代模型作为基线,并对当前表现最好的大型语言模型进行了广泛的实验。实验结果揭示了大型语言模型在古代文本场景中的巨大潜力以及与人类的差距。我们的研究旨在促进大型语言模型在考古学和古代汉语领域的开发和应用。
Summary / 总结
AncientBench is designed to evaluate the comprehension of ancient Chinese characters, especially for excavated documents, which are not covered by existing benchmarks. It includes four dimensions and ten tasks to assess glyph, pronunciation, meaning, and contextual comprehension. Experiments with current best-performing large language models show their potential in ancient textual scenarios but also highlight the gap with human understanding. This benchmark aims to advance the use of large language models in archaeology and ancient Chinese language research.
AncientBench 旨在评估古代汉字的理解能力,尤其是针对出土文献。它包含四个维度:字形、读音、意义和上下文理解,共有十个任务。该基准测试了领先的大型语言模型,并与古代模型进行了比较,展示了模型在古代文本场景中的潜力,但也突显了与人类理解之间的差距。这项研究旨在推动大型语言模型在考古学和古代汉语语言研究中的发展和应用。
PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Authors: Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
First: 2025-08-14T10:03:47+00:00 · Latest: 2025-12-19T16:27:34+00:00
Abstract
Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust of decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, LLM-Judge, semantic similarity, etc.) while balancing computational costs, pushing a new paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
中文标题/摘要
标题:PASS:概率代理超网络采样以实现可解释和自适应的胸部X光推理
现有的工具增强代理系统在现实世界中受到以下限制:(i) 黑盒推理步骤削弱了决策制定的信任并带来安全风险,(ii) 贫乏的多模态整合,这对医疗保健任务至关重要,以及(iii) 刚性且计算效率低的代理管道。我们引入了PASS(概率代理超网络采样),这是第一个在胸部X光(CXR)推理中解决这些挑战的多模态框架。PASS 适应性地在多工具图上采样代理工作流,生成带有可解释概率的决策路径。鉴于复杂的CXR推理任务和多模态医疗数据,PASS 利用其在代理超网络上学习的任务条件分布。因此,它在每个超网络层选择最合适的工具,提供带有概率注释的轨迹以供事后审计,并直接增强医疗AI的安全性。PASS 还不断将关键发现压缩到不断发展的个性化记忆中,同时动态决定是否加深其推理路径或调用早期退出以提高效率。为了优化平衡性能和成本的帕累托前沿,我们设计了一种新颖的三阶段训练程序,包括专家知识预热、对比路径排名和成本感知强化学习。为了促进严格的评估,我们引入了CAB-E,这是一个全面的多步骤、安全关键、自由形式CXR推理基准。跨多个基准的实验验证了PASS在多个指标(如准确性、LLM-Judge、语义相似性等)上显著优于强基线,同时平衡计算成本,推动了可解释、自适应和多模态医疗代理系统的新范式转变。
Summary / 总结
PASS addresses the limitations of existing tool-augmented agentic systems in healthcare by introducing a multimodal framework that enhances interpretability and adaptability in Chest X-Ray reasoning. It adaptively samples workflows over a multi-tool graph, providing probability-annotated decision paths. PASS optimizes performance and computational costs through a three-stage training procedure and continuously compresses findings into a personalized memory. Experiments show that PASS outperforms strong baselines in accuracy and other metrics while balancing computational efficiency.
PASS通过引入一个多模态框架来解决现有医疗工具增强型代理系统在胸部X光推理中的局限性,增强可解释性和适应性。它通过多工具图自适应采样工作流,提供带有概率注释的决策路径。PASS通过三阶段训练程序优化性能和计算成本,并不断将发现压缩到个性化记忆中。实验表明,PASS在准确性和其他指标上优于强基线,同时平衡计算效率。
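The core idea of sampling one tool per supernet layer and recording the probability of the resulting path can be illustrated with a toy example. Everything below is hypothetical: the tool names, the fixed per-layer probabilities, and the `EXIT` token stand in for PASS's learned, task-conditioned distribution and early-exit decision, which the abstract does not specify.

```python
import math
import random

EXIT = "EXIT"  # hypothetical early-exit action


def sample_workflow(layers, rng):
    """Sample one tool per layer and return (path, path_probability).

    layers: list of dicts mapping a tool name to its unnormalized weight.
    rng:    random.Random instance, passed in for reproducibility.
    """
    path, log_prob = [], 0.0
    for options in layers:
        names = list(options)
        weights = [options[name] for name in names]
        choice = rng.choices(names, weights=weights, k=1)[0]
        # Accumulate the log-probability of the chosen step.
        log_prob += math.log(options[choice] / sum(weights))
        path.append(choice)
        if choice == EXIT:
            break  # stop early instead of deepening the reasoning path
    return path, math.exp(log_prob)


toy_supernet = [
    {"segmenter": 0.7, "classifier": 0.3},
    {"report_writer": 0.5, EXIT: 0.5},
    {"vqa_model": 1.0},
]
path, probability = sample_workflow(toy_supernet, random.Random(0))
```

The returned probability is what makes a trajectory auditable after the fact: a low-probability path flags a decision the policy itself considered unlikely.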
The Generation Phases of Flow Matching: a Denoising Perspective
Authors: Anne Gagneux, Ségolène Martin, Rémi Gribonval, Mathurin Massias
First: 2025-10-28T16:42:53+00:00 · Latest: 2025-12-19T16:21:05+00:00
Abstract
Flow matching has achieved remarkable success, yet the factors influencing the quality of its generation process remain poorly understood. In this work, we adopt a denoising perspective and design a framework to empirically probe the generation process. Laying down the formal connections between flow matching models and denoisers, we provide a common ground to compare their performances on generation and denoising. This enables the design of principled and controlled perturbations to influence sample generation: noise and drift. This leads to new insights on the distinct dynamical phases of the generative process, enabling us to precisely characterize at which stage of the generative process denoisers succeed or fail and why this matters.
中文标题/摘要
标题:流匹配的生成阶段:一种去噪视角
流匹配已经取得了显著的成功,但其生成过程质量的影响因素仍然知之甚少。在本文中,我们从去噪的角度出发,设计了一个框架来实证探究生成过程。我们建立了流匹配模型与去噪器之间的形式联系,为比较它们在生成和去噪方面的性能提供了一个共同的基础。这使得我们可以设计出有原则的和可控的扰动来影响样本生成:噪声和漂移。这为我们提供了关于生成过程不同动态阶段的新见解,使我们能够精确地描述去噪器在生成过程中的成功或失败阶段及其原因。
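The formal connection between flow matching models and denoisers mentioned in the abstract can be made concrete under one common convention. The derivation below assumes the linear interpolation path with Gaussian source noise; the paper's own parameterization may differ, and the symbols $v^\star$ and $D^\star$ are introduced here for illustration.

```latex
% Assumed path: x_0 \sim \mathcal{N}(0, I) is noise, x_1 \sim p_{\text{data}} is data.
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1).

% Optimal velocity field and optimal (MMSE) denoiser at time t:
v^\star(x, t) = \mathbb{E}\left[\, x_1 - x_0 \mid x_t = x \,\right],
\qquad
D^\star(x, t) = \mathbb{E}\left[\, x_1 \mid x_t = x \,\right].

% Solving the path for x_0 gives x_0 = \frac{x_t - t\, x_1}{1 - t}, hence
x_1 - x_0 = \frac{x_1 - x_t}{1 - t}
\quad\Longrightarrow\quad
v^\star(x, t) = \frac{D^\star(x, t) - x}{1 - t}.
```

Under this convention a velocity network and a denoiser carry the same information at every time step, which is what allows the two to be compared on both generation and denoising.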