arXiv Paper Digest

2025-11-27 03:24
Snapshot: 20251127_0324
RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
Authors: Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao
First: 2025-11-25T18:59:55+00:00 · Latest: 2025-11-25T18:59:55+00:00
Abstract
Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
中文标题/摘要
标题:RubricRL:简单通用的文本到图像生成奖励设计
强化学习(RL)最近已成为一种有前途的方法,用于使文本到图像生成模型与人类偏好对齐。然而,一个关键挑战在于设计有效的且可解释的奖励。现有方法通常依赖于复合度量(例如,CLIP、OCR和现实度得分)的固定权重或从人类偏好模型中提炼出的单一标量奖励,这可能会限制可解释性和灵活性。我们提出了RubricRL,这是一种基于评分表的简单且通用的奖励设计框架,提供了更高的可解释性、可组合性和用户控制。RubricRL 不使用黑盒标量信号,而是为每个提示动态构建一个结构化的评分表——一个细粒度视觉标准的可分解检查表,如对象正确性、属性准确性、OCR保真度和现实度,以适应输入文本。每个标准由多模态裁判(例如,o4-mini)独立评估,并且提示自适应加权机制强调最相关的维度。这种设计不仅为策略优化(例如,GRPO或PPO)生成了可解释且模块化的监督信号,还使用户能够直接调整奖励或惩罚哪些方面。与自回归文本到图像模型的实验表明,RubricRL 提高了提示忠实度、视觉细节和通用性,同时为跨文本到图像架构的可解释RL对齐提供了灵活且可扩展的基础。
Summary / 总结
RubricRL is a reinforcement learning framework designed to improve text-to-image generation by using a structured rubric for reward design. This method offers greater interpretability and flexibility compared to existing approaches by dynamically constructing a checklist of visual criteria tailored to each prompt. Experiments show that RubricRL enhances prompt faithfulness, visual detail, and generalizability, providing a modular and user-adjustable framework for policy optimization in text-to-image generation models.
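RubricRL is a reinforcement learning framework that aims to improve text-to-image generation by supplying interpretable and flexible rewards. For each prompt it builds a structured rubric of fine-grained visual criteria, such as object correctness, and each criterion is evaluated independently. The approach improves prompt faithfulness, visual detail, and generalization, and lets users adjust which aspects are rewarded. Experiments with an autoregressive text-to-image model show gains on these dimensions, and the framework offers a flexible foundation for RL alignment in text-to-image generation.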
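The rubric idea described above can be illustrated with a minimal sketch. It assumes each criterion receives a judge score in [0, 1] and that the scalar reward is a weight-normalized sum; the `Criterion` class, the weights, and the scores are hypothetical, and the abstract does not specify the exact aggregation RubricRL uses.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str      # e.g. "object correctness"
    weight: float  # hypothetical prompt-adaptive importance


def rubric_reward(criteria, scores):
    """Weighted average of per-criterion judge scores (each assumed in [0, 1])."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight


# Hypothetical rubric for a single prompt.
rubric = [
    Criterion("object correctness", 2.0),
    Criterion("attribute accuracy", 1.0),
    Criterion("OCR fidelity", 1.0),
]
# Hypothetical scores, standing in for a multimodal judge's output.
judge_scores = {
    "object correctness": 1.0,
    "attribute accuracy": 0.5,
    "OCR fidelity": 0.0,
}
reward = rubric_reward(rubric, judge_scores)
print(f"reward = {reward:.3f}")
```

Setting a criterion's weight to zero removes it from the reward, which is one simple way to model the user control over rewarded aspects that the abstract mentions.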
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
Authors: Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal
First: 2025-11-25T18:59:53+00:00 · Latest: 2025-11-25T18:59:53+00:00
Abstract
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
中文标题/摘要
标题:MedROV:跨多种医学影像模态的实时开放词汇检测
传统医学影像中的对象检测模型在封闭集范式下运行,限制了它们检测新标签对象的能力。开放词汇对象检测(OVOD)解决了这一限制,但由于数据集稀缺和文本-图像对齐较弱,医学影像领域对此研究较少。为弥合这一差距,我们引入了MedROV,这是首个用于医学影像的实时开放词汇检测模型。为了实现开放词汇学习,我们构建了一个大规模数据集Omnis,包含九种成像模态下的60万检测样本,并引入了一种伪标签策略来处理多源数据集中的缺失注释。此外,我们通过引入大型预训练基础模型的知识来增强泛化能力。通过利用对比学习和跨模态表示,MedROV 有效地检测了已知和新型结构。实验结果表明,MedROV 在医学图像检测中优于之前的最先进的基础模型,平均绝对改进为40 mAP50,并且在超过封闭集检测器3 mAP50的同时,运行速度达到70 FPS,从而在医学检测中设立了新的基准。我们的源代码、数据集和训练模型可在 https://github.com/toobatehreem/MedROV 获取。
Summary / 总结
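MedROV is a real-time open-vocabulary detection model for medical imaging that addresses the limitations of traditional closed-set detectors. It is trained on Omnis, a large-scale dataset of 600K detection samples across nine imaging modalities, uses a pseudo-labeling strategy to handle annotations missing from multi-source datasets, and incorporates knowledge from a large pre-trained foundation model to improve generalization. MedROV outperforms the previous state-of-the-art foundation model for medical image detection by an average of 40 mAP50 and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS.
In short, MedROV targets real-time open-vocabulary detection in medical imaging, where traditional closed-set models cannot detect novel labels. It introduces the large-scale Omnis dataset together with a pseudo-labeling strategy for multi-source annotations, and draws on a large pre-trained model for better generalization. In experiments it scores 40 points higher in mAP50 than the previous state-of-the-art model and more than 3 points higher than closed-set detectors, at 70 frames per second.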
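Open-vocabulary detection is commonly implemented by comparing a region embedding with text embeddings of candidate labels. The sketch below shows that generic matching step with cosine similarity; the embeddings are toy vectors and the function names are hypothetical, and it is not claimed to be MedROV's actual scoring head.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the region."""
    return max(
        text_embeddings,
        key=lambda label: cosine(region_embedding, text_embeddings[label]),
    )


# Toy text embeddings; a real system would obtain these from a text encoder.
text_embeddings = {
    "tumor": [1.0, 0.0, 0.0],
    "nodule": [0.0, 1.0, 0.0],
    "fracture": [0.0, 0.0, 1.0],
}
region = [0.1, 0.9, 0.2]  # toy embedding of one detected region
label = classify_region(region, text_embeddings)
print(label)
```

Because labels enter only through their text embeddings, a new category can be added at inference time by adding one more entry to the dictionary.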
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
First: 2025-11-25T18:59:46+00:00 · Latest: 2025-11-25T18:59:46+00:00
Comments: Project Page: https://infinity-rope.github.io/
Abstract
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
中文标题/摘要
标题:Infinity-RoPE:动作可控的无限视频生成源自自回归自我展开
当前的自回归视频扩散模型受到三个核心瓶颈的限制:(i) 基本模型的3D旋转位置嵌入(3D-RoPE)施加的有限时间窗口;(ii) 在长时间展开过程中保持精细动作控制的响应速度缓慢;(iii) 无法在单个生成流中实现不连续的电影转换。我们引入了$\infty$-RoPE,这是一种统一的推理时框架,通过三个相互关联的组件解决了所有三个限制:块相对RoPE、KV刷新和RoPE剪切。块相对RoPE将时间编码重新表述为移动的局部参考框架,其中每个新生成的潜在块相对于基模型的最大帧窗口旋转,而较早的块则向后旋转以保持相对时间几何结构。这种相对表述消除了固定的时间位置,使视频生成能够远远超出基位置限制。为了在不重新编码的情况下获得精细的动作控制,KV刷新通过保留全局汇和最后一个生成的潜在帧来更新KV缓存,从而确保立即的提示响应。最后,RoPE剪切引入了时间RoPE坐标中的受控不连续性,使单个连续展开中能够实现多剪辑场景过渡。这些组件共同确立了$\infty$-RoPE作为无限时间、可控和电影风格视频扩散的无训练基础。全面的实验表明,$\infty$-RoPE在总体VBench评分上始终优于之前的自回归模型。
Summary / 总结
The paper addresses the limitations of current autoregressive video diffusion models by introducing $\infty$-RoPE, which overcomes three core bottlenecks: finite temporal horizon, slow prompt responsiveness, and inability to achieve discontinuous transitions. $\infty$-RoPE uses Block-Relativistic RoPE to enable continuous video generation beyond the base model's temporal limits, KV Flush to maintain fine-grained action control, and RoPE Cut to allow for cinematic scene transitions. Experiments show that $\infty$-RoPE outperforms previous models in overall VBench scores.
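By introducing $\infty$-RoPE, the paper addresses three core bottlenecks of current autoregressive video diffusion models: a finite temporal horizon, slow prompt responsiveness, and the inability to produce cinematic transitions. $\infty$-RoPE uses Block-Relativistic RoPE to generate continuous video beyond the base model's temporal horizon, KV Flush to maintain fine-grained action control, and RoPE Cut to introduce controlled discontinuities for scene transitions. Experiments show that $\infty$-RoPE outperforms previous models in overall VBench scores.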
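The moving reference frame described in the abstract can be sketched at the level of temporal position indices. This is an illustrative reading, not the authors' implementation: the newest block is pinned just below the maximum horizon, older blocks are shifted backward, and blocks that would fall below index 0 are simply skipped (an assumption). `kv_flush` assumes the first cache entry is the global sink.

```python
def block_relative_positions(num_cached_blocks, block_size, max_horizon):
    """Temporal indices per cached block, oldest first.

    The newest block always occupies the last `block_size` indices below
    `max_horizon`; each older block sits one block further back, so relative
    offsets between blocks stay fixed as generation proceeds.
    """
    positions = []
    for age in range(num_cached_blocks - 1, -1, -1):  # age 0 = newest block
        end = max_horizon - age * block_size
        start = end - block_size
        if start < 0:
            continue  # assumption: blocks older than the horizon get no index
        positions.append(list(range(start, end)))
    return positions


def kv_flush(cache):
    """Keep only the first entry (assumed global sink) and the last latent frame."""
    if len(cache) <= 2:
        return list(cache)
    return [cache[0], cache[-1]]


print(block_relative_positions(num_cached_blocks=3, block_size=3, max_horizon=21))
print(kv_flush(["sink", "f1", "f2", "f3"]))
```

However many blocks have been generated, no index ever exceeds the base horizon, which is the property the abstract credits for generation beyond the base positional limits.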
Cloud4D: Estimating Cloud Properties at a High Spatial and Temporal Resolution
Authors: Jacob Lin, Edward Gryspeerdt, Ronald Clark
Venue: NeurIPS 2025 Spotlight
First: 2025-11-24T18:59:37+00:00 · Latest: 2025-11-25T18:59:46+00:00
Comments: NeurIPS 2025 Spotlight, project page: https://cloud4d.jacob-lin.com/
Abstract
There has been great progress in improving numerical weather prediction and climate models using machine learning. However, most global models act at a kilometer-scale, making it challenging to model individual clouds and factors such as extreme precipitation, wind gusts, turbulence, and surface irradiance. Therefore, there is a need to move towards higher-resolution models, which in turn require high-resolution real-world observations that current instruments struggle to obtain. We present Cloud4D, the first learning-based framework that reconstructs a physically consistent, four-dimensional cloud state using only synchronized ground-based cameras. Leveraging a homography-guided 2D-to-3D transformer, Cloud4D infers the full 3D distribution of liquid water content at 25 m spatial and 5 s temporal resolution. By tracking the 3D liquid water content retrievals over time, Cloud4D additionally estimates horizontal wind vectors. Across a two-month deployment comprising six skyward cameras, our system delivers an order-of-magnitude improvement in space-time resolution relative to state-of-the-art satellite measurements, while retaining single-digit relative error ($<10\%$) against collocated radar measurements. Code and data are available on our project page https://cloud4d.jacob-lin.com/.
中文标题/摘要
标题:Cloud4D:高空间和时间分辨率估算云属性
在使用机器学习改进数值天气预报和气候模型方面取得了巨大进展。然而,大多数全球模型以千米尺度运行,难以模拟单个云和极端降水、风速、湍流和地表辐射等因子。因此,需要转向更高分辨率的模型,这反过来又需要当前仪器难以获取的高分辨率现实世界观测数据。我们提出了Cloud4D,这是第一个仅使用同步地面相机重建物理一致的四维云状态的学习框架。利用同化指导的2D到3D变换器,Cloud4D推断出25米空间和5秒时间分辨率的全3D液态水含量分布。通过跟踪3D液态水含量的时空演变,Cloud4D还估计了水平风向量。在为期两个月的部署中,包含六个天空方向的相机,我们的系统在空间-时间分辨率上比最先进的卫星测量提高了数量级,同时保留了与共置雷达测量的单位误差(<10%)。代码和数据可在我们的项目页面https://cloud4d.jacob-lin.com/上获取。
Summary / 总结
Cloud4D is a learning-based framework that reconstructs a four-dimensional cloud state from synchronized ground-based cameras at high spatial (25 m) and temporal (5 s) resolution. It infers the 3D distribution of liquid water content and estimates horizontal wind vectors. Relative to state-of-the-art satellite measurements it improves space-time resolution by an order of magnitude, while keeping relative error below 10% against collocated radar measurements.
The reconstructed cloud state is physically consistent, and the wind estimates come from tracking the 3D liquid water content retrievals over time. The reported results are from a two-month deployment of six skyward cameras.
Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Authors: Tahira Kazimi, Connor Dunlop, Pinar Yanardag
First: 2025-11-25T18:59:45+00:00 · Latest: 2025-11-25T18:59:45+00:00
Comments: Project webpage: https://diverse-video.github.io/
Abstract
While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
中文标题/摘要
标题:基于行列式点过程引导策略优化的多样化视频生成
尽管最近的文本到视频(T2V)扩散模型在质量和提示对齐方面取得了令人印象深刻的成果,但在从单个文本提示生成多个视频时,它们通常会产生低多样性的输出。我们通过将其表述为集合级策略优化问题来应对这一挑战,目标是训练一个能够覆盖给定提示的多样化可能结果的策略。为此,我们引入了DPP-GRPO,这是一种结合行列式点过程(DPPs)和组相对策略优化(GRPO)理论的新框架,以在多样化生成中施加显式的奖励。我们的目标通过DPP减少冗余样本的回报,同时通过GRPO在候选集上提供组别反馈,将多样性转化为一个显式的信号。我们的框架是即插即用且模型无关的,能够在视觉外观、摄像机运动和场景结构方面促进多样化生成,而不牺牲提示保真度或感知质量。我们在WAN和CogVideoX上实现我们的方法,并展示了我们的方法在VBench、VideoScore和人类偏好研究等最先进的基准测试中一致提高了视频的多样性。此外,我们还发布了我们的代码和一个包含30,000个多样化提示的新基准数据集,以支持未来的研究。
Summary / 总结
This paper tackles the low diversity of outputs that text-to-video models produce when several videos are sampled from one prompt. It proposes DPP-GRPO, a framework that combines Determinantal Point Processes and Group Relative Policy Optimization: the DPP term imposes diminishing returns on redundant samples, and GRPO supplies groupwise feedback over candidate sets. Experiments on WAN and CogVideoX show consistent improvements in video diversity on VBench, VideoScore, and human preference studies without compromising prompt fidelity or perceptual quality.
The problem is formulated as a set-level policy optimization task, so the policy is rewarded for covering the range of plausible outcomes for a prompt rather than for any single sample. The framework is plug-and-play and model-agnostic.
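Two standard ingredients named in the abstract can be sketched independently: a DPP-style log-determinant diversity score, which drops sharply when a set contains near-duplicate items, and GRPO-style group-relative advantages. The feature vectors, jitter value, and function names are hypothetical, and how the paper combines the two signals is not specified here.

```python
import math
from statistics import mean, pstdev


def similarity_kernel(features, jitter=1e-6):
    """Gram matrix of unit-normalized features, with a small diagonal jitter."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        unit.append([x / norm for x in f])
    n = len(unit)
    return [
        [sum(a * b for a, b in zip(unit[i], unit[j])) + (jitter if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))


def diversity_score(features):
    """Higher when the feature vectors point in different directions."""
    return log_det(similarity_kernel(features))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


diverse = diversity_score([[1.0, 0.0], [0.0, 1.0]])
redundant = diversity_score([[1.0, 0.0], [1.0, 0.01]])
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(diverse > redundant, advantages)
```

The log-determinant has the diminishing-returns behavior the abstract describes: adding a sample that is nearly parallel to one already in the set barely increases the volume spanned by the set.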
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
First: 2025-11-25T18:59:45+00:00 · Latest: 2025-11-25T18:59:45+00:00
Comments: Tech report. Project page: https://nvlabs.github.io/LocateAnything3D/
Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
中文标题/摘要
标题:LocateAnything3D:基于视线链的视觉-语言3D检测
为了在世界中行动,模型必须命名它所看到的并知道自己在3D中的位置。今天的视觉-语言模型(VLM)在开放的2D描述和定位方面表现出色,但多对象3D检测仍然缺乏于VLM工具箱中。我们提出了LocateAnything3D,这是一种VLM原生的方法,将3D检测视为下一个标记预测问题。关键在于一个简短明确的视线链(CoS)序列,这反映了人类如何从图像中推理:先在2D中找到一个物体,然后推断其距离、大小和姿态。解码器首先发出2D检测作为视觉思维链,然后按照容易到困难的课程预测3D框:在对象之间,从近到远的顺序减少了早期的模糊性并匹配了以自我为中心的实用性;在每个对象内部,从相机中心、尺寸和旋转分解按稳定性和可学习性排列信息。这种VLM原生的接口保留了开放词汇和视觉提示的能力,而无需专门的头部。在具有挑战性的Omni3D基准测试中,我们的模型达到了最先进的结果,3D AP得分为49.89,即使基线提供了真实2D框,绝对改进也超过了前最佳值15.51。它还以强大的鲁棒性在零样本下推广到未见过的类别。通过将3D检测转化为一个有纪律的下一个标记问题,LocateAnything3D为模型提供了一个感知3D的实际基础。
Summary / 总结
LocateAnything3D addresses the gap in 3D detection capabilities of vision-language models by framing 3D detection as a next-token prediction task. It uses a Chain-of-Sight (CoS) sequence to guide the model from 2D object detection to 3D box prediction, following a human-like reasoning process. On the Omni3D benchmark the model achieves 49.89 AP_3D, surpassing the previous best by 15.51 points even when that baseline is given ground-truth 2D boxes. It also demonstrates strong zero-shot generalization to new categories.
The work aims to give vision-language models multi-object 3D detection, a capability they currently lack. The model first emits 2D detections and then predicts 3D boxes under an easy-to-hard curriculum, ordering objects from near to far and factoring each box into center, dimensions, and rotation.
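The ordering described in the abstract (2D detections first, then 3D boxes from near to far, each factored into center, dimensions, and rotation) can be sketched as a target-sequence builder. The `Object3D` fields, the tuple layout, and the use of Euclidean distance from the camera are assumptions for illustration; the actual token format is not given in the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    label: str
    box2d: tuple   # (x1, y1, x2, y2) in pixels
    center: tuple  # (x, y, z) in camera coordinates
    dims: tuple    # (width, height, length)
    yaw: float     # rotation about the vertical axis, in radians


def chain_of_sight(objects):
    """Build a near-to-far sequence: all 2D detections, then all 3D boxes."""
    ordered = sorted(objects, key=lambda o: math.hypot(*o.center))
    sequence = [("2d", o.label, o.box2d) for o in ordered]
    # Within each object: center first, then dimensions, then rotation.
    sequence += [("3d", o.label, o.center, o.dims, o.yaw) for o in ordered]
    return sequence


scene = [
    Object3D("car", (300, 200, 420, 260), (2.0, 0.5, 20.0), (1.8, 1.5, 4.2), 0.1),
    Object3D("person", (100, 150, 160, 330), (-1.0, 0.2, 5.0), (0.6, 1.7, 0.5), 1.2),
]
sequence = chain_of_sight(scene)
for step in sequence:
    print(step)
```

In this toy scene the person is closer to the camera than the car, so it appears first in both the 2D and the 3D part of the sequence.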
3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding
Authors: Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li
First: 2025-11-25T18:59:34+00:00 · Latest: 2025-11-25T18:59:34+00:00
Comments: 3D-aware Multi-task Learning, Cross-view Correlations, Code will be available at https://github.com/WeiHongLee/CrossView3DMTL
Abstract
This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
中文标题/摘要
标题:基于3D感知的多任务学习与跨视图相关性在密集场景理解中的应用
本文解决了训练单个网络同时执行多个密集预测任务(如分割和深度估计)即多任务学习(MTL)的挑战。当前的方法主要在2D图像空间中捕捉任务间的相关性,往往导致缺乏3D感知的无序特征。我们认为,3D感知对于建模对于全面场景理解至关重要的任务间相关性至关重要。我们提出通过整合跨视图相关性,即成本体,作为MTL网络中的几何一致性来解决这个问题。具体而言,我们引入了一个轻量级的跨视图模块(CvM),在任务间共享,用于在视图间交换信息并捕捉跨视图相关性,与MTL编码器的特征结合以进行多任务预测。该模块具有架构无关性,可以应用于单视图和多视图数据。在NYUv2和PASCAL-Context上的广泛结果表明,我们的方法有效地将几何一致性注入到现有的MTL方法中,以提高性能。
Summary / 总结
This paper proposes a 3D-aware multi-task learning approach that integrates cross-view correlations, in the form of a cost volume, to improve dense scene understanding tasks such as segmentation and depth estimation. It introduces a lightweight Cross-view Module (CvM), shared across tasks, that exchanges information across views and captures geometric consistency. Experiments on NYUv2 and PASCAL-Context show that adding the module improves the performance of existing MTL methods.
The module is architecture-agnostic and applies to both single-view and multi-view data, which distinguishes the approach from methods that capture cross-task relations only in the 2D image space.
PixelDiT: Pixel Diffusion Transformers for Image Generation
Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
First: 2025-11-25T18:59:25+00:00 · Latest: 2025-11-25T18:59:25+00:00
Abstract
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
中文标题/摘要
标题:PixelDiT: 图像生成的像素扩散变换器
潜在空间建模一直是扩散变换器(DiTs)的标准方法。然而,它依赖于一个两阶段的流水线,其中预训练的自编码器引入了有损重构,导致误差累积并妨碍联合优化。为了解决这些问题,我们提出了PixelDiT,这是一种单阶段、端到端的模型,消除了自编码器的需要,并直接在像素空间中学习扩散过程。PixelDiT 采用了一种由双层设计形成的完全基于变换器的架构:一个块级的DiT捕捉全局语义,一个像素级的DiT细化纹理细节,使像素空间中的扩散模型训练更加高效,同时保留了精细的细节。我们的分析表明,有效的像素级标记建模对于像素扩散的成功至关重要。PixelDiT 在 ImageNet 256x256 上实现了 1.61 的 FID,大幅超越了现有的像素生成模型。我们进一步将 PixelDiT 扩展到文本到图像生成,并在像素空间中对其进行预训练,分辨率为 1024x1024。它在 GenEval 上达到了 0.74,在 DPG-bench 上达到了 83.5,接近最佳潜在扩散模型。
Summary / 总结
PixelDiT is a single-stage, end-to-end model that learns the diffusion process directly in pixel space, avoiding the lossy reconstruction of a pretrained autoencoder. It uses a dual-level transformer architecture: a patch-level DiT captures global semantics and a pixel-level DiT refines texture details, which keeps training efficient while preserving fine detail. PixelDiT reaches 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models, and in text-to-image generation it scores 0.74 on GenEval and 83.5 on DPG-bench.
These text-to-image results come from a model pretrained at 1024x1024 resolution in pixel space, and they approach those of the best latent diffusion models.
Vision-Language Memory for Spatial Reasoning
Authors: Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang
First: 2025-11-25T18:59:02+00:00 · Latest: 2025-11-25T18:59:02+00:00
Abstract
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
中文标题/摘要
标题:视觉-语言记忆在空间推理中的应用
空间推理是智能机器人的一项关键能力,但当前的视觉-语言模型(VLMs)在基于视频的空间推理方面仍无法达到人类水平的性能。这一差距主要源于两个挑战:语义-几何的不匹配,导致无法实现一致的三维理解,以及缺乏持久记忆来保留三维表示和理解。为了解决这些限制,我们提出了VLM$^2$,这是一种具有持久记忆的视觉-语言模型,用于基于视图一致的三维感知表示的空间推理,完全从二维视频中获得。具体来说,为了增强长时推理,我们引入了一个双记忆模块,包括一个工作记忆,作为滑动窗口以关注即时上下文,以及一个情景记忆,用于整合和存储关键的长期信息。这种设计使得空间推理在固定计算成本下高效且具有长时性。在多个基准上的广泛实验表明,VLM$^2$在仅视频模型中达到了最先进的性能,显著推进了视觉-空间智能的前沿。
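The dual-memory design in the abstract (a sliding-window working memory plus an episodic memory with bounded size) can be sketched with a simple data structure. The importance score and the rule of evicting the least important episodic entry are hypothetical stand-ins; the abstract does not describe how VLM$^2$ consolidates information.

```python
from collections import deque


class DualMemory:
    """Sliding-window working memory plus a capacity-bounded episodic memory."""

    def __init__(self, window_size, episodic_capacity):
        self.working = deque(maxlen=window_size)
        self.episodic = []  # entries are (importance, step, feature)
        self.capacity = episodic_capacity
        self.step = 0

    def observe(self, feature, importance):
        # A frame about to leave the window is consolidated into episodic memory.
        if len(self.working) == self.working.maxlen:
            self._consolidate(self.working[0])
        self.working.append((importance, self.step, feature))
        self.step += 1

    def _consolidate(self, entry):
        self.episodic.append(entry)
        if len(self.episodic) > self.capacity:
            # Hypothetical rule: drop the least important stored entry.
            self.episodic.remove(min(self.episodic, key=lambda e: e[0]))

    def context(self):
        """Features visible to the model: long-term entries, then recent frames."""
        return [f for _, _, f in self.episodic] + [f for _, _, f in self.working]


memory = DualMemory(window_size=2, episodic_capacity=3)
for i in range(10):
    memory.observe(f"frame{i}", importance=float(i))
print(memory.context())
```

The context never holds more than `window_size + episodic_capacity` entries regardless of video length, which mirrors the fixed computational cost claimed in the abstract.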
Concept-Aware Batch Sampling Improves Language-Image Pretraining
Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
First: 2025-11-25T18:58:07+00:00 · Latest: 2025-11-25T18:58:07+00:00
Comments: Tech Report
Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
中文标题/摘要
标题:概念感知批量采样提高语言-图像预训练
视觉-语言模型应该训练在什么数据上?为回答这个问题,许多数据整理工作集中在数据集的质量上。然而,这些现有方法大多(i)离线的,即从一组预定义的过滤标准生成静态数据集,(ii)概念无关的,即使用模型基础过滤器引入额外的数据偏差。在本文中,我们超越了这些离线的概念无关方法,提倡更灵活的任务适应在线概念基础整理。我们的第一个贡献是DataConcept,一个包含1.28亿网络抓取的图像-文本对的集合,这些对被细粒度地标记有关其概念组成的信息。基于DataConcept,我们引入了概念感知批量采样(CABS),一种简单而有效的批量采样框架,可以根据特定的目标分布实时构建批量。我们提出了两种变体:(i)多样性最大化(CABS-DM)以整理具有广泛概念覆盖范围的批量,(ii)频率最大化(CABS-FM)以整理具有高对象多重性的批量。通过在28个基准上的广泛评估,我们证明了我们的CABS方法显著提高了CLIP/SigLIP模型类,并产生了高性能的模型。总体而言,CABS代表了一个强大的开源替代品,用于专有的在线数据整理算法,使实践者能够定义自定义的概念分布以优化特定的下游任务。
Summary / 总结
This work asks what data a vision-language model should be trained on. Its first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on it, the authors introduce Concept-Aware Batch Sampling (CABS), a batch sampling framework that constructs batches on the fly to match specific target distributions. Two variants are proposed: CABS-DM for broad concept coverage and CABS-FM for high object multiplicity. Evaluations across 28 benchmarks show that CABS improves the performance of CLIP/SigLIP models.
The approach is positioned as online, task-adaptive, concept-based curation, in contrast to offline filtering that yields a static dataset.
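The diversity-maximizing variant can be illustrated with a greedy coverage heuristic: repeatedly pick the sample that adds the most concepts not yet in the batch. This is a generic sketch under that assumption, with a hypothetical candidate pool; it is not claimed to be the exact CABS-DM algorithm.

```python
def sample_diverse_batch(pool, batch_size):
    """Greedily build a batch that covers as many distinct concepts as possible.

    pool: list of (sample_id, set_of_concepts) pairs.
    Returns the chosen sample ids and the set of concepts they cover.
    """
    remaining = list(pool)
    covered = set()
    batch = []
    while remaining and len(batch) < batch_size:
        # Pick the sample contributing the most concepts not yet covered.
        best = max(remaining, key=lambda item: len(item[1] - covered))
        batch.append(best[0])
        covered |= best[1]
        remaining.remove(best)
    return batch, covered


# Hypothetical candidate pool with per-sample concept annotations.
pool = [
    ("a", {"dog", "grass"}),
    ("b", {"dog"}),
    ("c", {"car", "road", "sign"}),
    ("d", {"dog", "grass", "ball"}),
]
batch, covered = sample_diverse_batch(pool, batch_size=2)
print(batch, sorted(covered))
```

A frequency-oriented variant could instead rank samples by how many annotated objects they contain; that is left out here because the abstract gives no further detail.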
Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition
Authors: Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang
First: 2025-11-25T18:57:28+00:00 · Latest: 2025-11-25T18:57:28+00:00
Abstract
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
中文标题/摘要
标题:释放视觉语言模型在长尾多标签视觉识别中的潜力
长尾多标签视觉识别提出了重大挑战,因为图像通常包含多个具有高度不平衡类分布的标签,导致模型偏向于头部类而对尾部类表现不佳。最近的努力利用了预训练的视觉语言模型,如CLIP,结合长尾学习技术,利用丰富的视觉文本先验以提高性能。然而,现有方法通常直接从不平衡数据集中推导出语义类间关系,由于数据稀缺,这导致尾部类的不可靠相关性。此外,CLIP的零样本范式优化了单标签图像文本匹配,使其在多标签任务中表现不佳。为了解决这些问题,我们提出了相关性适应提示网络(CAPNET),这是一种新颖的端到端框架,明确从CLIP的文本编码器中建模标签相关性。该框架结合了图卷积网络进行标签感知传播,并使用可学习的软提示进行细化嵌入。它使用分布平衡的Focal损失和类感知重权进行优化训练。此外,它通过测试时集成提高泛化能力,并通过参数高效微调重新对齐视觉文本模态,以避免在不牺牲头部类性能的情况下过度拟合尾部类。在包括VOC-LT、COCO-LT和NUS-WIDE在内的基准测试上的广泛实验和消融研究表明,CAPNET在最先进的方法上取得了显著的改进,验证了其在实际长尾多标签视觉识别中的有效性。
Summary / 总结
The paper addresses long-tailed multi-label visual recognition by proposing CAPNET, a correlation adaptation prompt network that models label correlations from CLIP's textual encoder instead of deriving them from the imbalanced dataset. It combines a graph convolutional network for label-aware propagation, learnable soft prompts, and a distribution-balanced Focal loss with class-aware re-weighting. Experiments show that CAPNET outperforms existing methods on benchmarks such as VOC-LT, COCO-LT, and NUS-WIDE.
The framework is trained end to end, and the reported gains indicate that it handles long-tailed label distributions effectively.
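For reference, the standard focal loss with a per-class weight is $-w_c (1 - p_t)^{\gamma} \log p_t$, where $p_t$ is the predicted probability of the true outcome. The sketch below implements that plain weighted form for multi-label predictions; the distribution-balanced variant used by CAPNET adds rebalancing terms that are not reproduced here, and the weights are hypothetical.

```python
import math


def weighted_focal_loss(probs, targets, class_weights, gamma=2.0):
    """Mean multi-label focal loss with per-class weights.

    probs: predicted probabilities in (0, 1), one per class.
    targets: 0/1 labels, one per class.
    class_weights: larger values emphasize rare (tail) classes.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y, w in zip(probs, targets, class_weights):
        p_t = p if y == 1 else 1.0 - p  # probability assigned to the true outcome
        total += -w * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)


# A confident correct prediction is down-weighted relative to an uncertain one.
easy = weighted_focal_loss([0.9], [1], [1.0])
hard = weighted_focal_loss([0.6], [1], [1.0])
print(easy < hard)
```

With `gamma=0` the expression reduces to class-weighted binary cross-entropy, which makes the role of the focusing term easy to isolate.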
MotionV2V: Editing Motion in a Video
Authors: Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
First: 2025-11-25T18:57:25+00:00 · Latest: 2025-11-25T18:57:25+00:00
Abstract
While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V
中文标题/摘要
标题:MotionV2V:视频中编辑动作
虽然生成视频模型在保真度和一致性方面取得了显著成就,但将这些能力应用于视频编辑仍然是一个复杂的挑战。最近的研究探索了动作可控性以增强文本到视频生成或图像动画;然而,我们发现精确的动作控制是一个有前景但尚未充分探索的编辑现有视频的范式。在本文中,我们提出通过直接编辑从输入中提取的稀疏轨迹来修改视频动作。我们将输入和输出轨迹之间的偏差称为“动作编辑”,并证明这种表示,当与生成主干结合时,能够实现强大的视频编辑能力。为了实现这一点,我们引入了一种生成“动作反事实”的管道,即共享相同内容但不同动作的视频对,并在该数据集上微调了动作条件的视频扩散架构。我们的方法允许从任意时间戳开始进行编辑,并自然传播。在四组头对头的用户研究中,我们的模型在偏好上超过了先前的工作超过65%。请参见我们的项目页面:https://ryanndagreat.github.io/MotionV2V
Summary / 总结
This paper addresses precise motion control in video editing, proposing MotionV2V, which edits video motion by directly modifying sparse trajectories extracted from the input. The method fine-tunes a motion-conditioned video diffusion model on a dataset of motion counterfactuals, video pairs with identical content but different motion, so that edits can start at any timestamp and propagate naturally. In a four-way head-to-head user study, MotionV2V achieves over 65 percent preference against prior work.
本文旨在解决视频编辑中精确运动控制的挑战,提出了一种通过直接修改稀疏轨迹来编辑视频运动的方法。该方法使用生成模型和运动条件下的视频扩散架构,并通过数据集训练,以实现强大的编辑能力。实验结果显示,在四组头对头的用户测试中,该模型的性能优于先前的方法,超过65%的参与者更偏好由该模型生成的编辑视频。项目页面见 https://ryanndagreat.github.io/MotionV2V
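The "motion edit" described in the abstract is the deviation between input and output trajectories. The following is a minimal sketch of that idea only, assuming 2D point tracks with one (x, y) position per frame; the names `motion_edit` and `shift_from_frame` are hypothetical and are not taken from the MotionV2V code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Track = List[Point]  # one (x, y) position per frame (assumed representation)


def motion_edit(input_tracks: List[Track], output_tracks: List[Track]) -> List[Track]:
    """Per-track, per-frame displacement between edited and original trajectories."""
    edits = []
    for src, dst in zip(input_tracks, output_tracks):
        if len(src) != len(dst):
            raise ValueError("tracks must cover the same frames")
        edits.append([(xd - xs, yd - ys) for (xs, ys), (xd, yd) in zip(src, dst)])
    return edits


def shift_from_frame(track: Track, start: int, dx: float, dy: float) -> Track:
    """Hypothetical edit: keep frames before `start` untouched, offset the rest."""
    return [(x + dx, y + dy) if t >= start else (x, y) for t, (x, y) in enumerate(track)]


# One track moving right along the x-axis; the edit lifts it from frame 2 onward.
original = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]]
edited = [shift_from_frame(original[0], start=2, dx=0.0, dy=1.5)]
edit = motion_edit(original, edited)
```

Here the edit is zero before frame 2 and a constant offset afterwards, which mirrors the abstract's point that an edit may begin at any timestamp. How the diffusion backbone consumes such an edit is not covered by this sketch.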
Latent Collaboration in Multi-Agent Systems
Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
First: 2025-11-25T18:56:57+00:00 · Latest: 2025-11-25T18:56:57+00:00
Comments: Project: https://github.com/Gen-Verse/LatentMAS
Abstract
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
中文标题/摘要
标题:多智能体系统中的潜在协作
多智能体系统(MAS)将大型语言模型(LLMs)从独立的单模型推理扩展到协调的系统级智能。虽然现有的LLM智能体依赖于基于文本的调解进行推理和通信,但我们通过使模型能够在连续的潜在空间中直接协作来迈出一步。我们引入了LatentMAS,这是一种端到端无需训练的框架,使LLM智能体之间能够纯粹地进行潜在协作。在LatentMAS中,每个智能体首先通过最后一层隐藏嵌入进行自回归潜在思维生成。共享的潜在工作记忆则保存并转移每个智能体的内部表示,确保无损信息交换。我们提供了理论分析,证明LatentMAS在表达能力和无损信息保存方面比传统的基于文本的MAS具有更高的性能,且复杂度显著降低。此外,在涵盖数学和科学推理、常识理解和代码生成的9个全面基准测试中,LatentMAS在所有基准测试中都优于强大的单模型和基于文本的MAS基线,准确率最高可提高14.6%,输出令牌使用量减少70.8%-83.7%,端到端推理速度提高4-4.3倍。这些结果表明,我们的新潜在协作框架在提高系统级推理质量的同时,还提供了显著的效率提升,无需额外训练。代码和数据已完全开源,可在https://github.com/Gen-Verse/LatentMAS获取。
Summary / 总结
The research aims to enhance multi-agent systems (MAS) by enabling direct collaboration among large language model (LLM) agents in a continuous latent space, rather than relying on text-based mediation. LatentMAS, an end-to-end training-free framework, allows agents to generate latent thoughts from last-layer hidden embeddings and share internal representations through a shared latent working memory, which the authors describe as lossless information exchange. Across nine benchmarks, LatentMAS outperforms single-model and text-based MAS baselines, with up to 14.6% higher accuracy, 70.8%-83.7% fewer output tokens, and 4x-4.3x faster end-to-end inference, without additional training.
研究旨在通过让大型语言模型(LLM)代理在连续的潜在空间中直接协作,而不是依赖文本中介,来提升多代理系统(MAS)的质量。LatentMAS 是一个端到端无需训练的框架,允许代理生成潜在思想并通过共享的潜在工作记忆传递内部表示,确保信息无损交换。在九个基准测试中的实证评估表明,LatentMAS 在准确率、减少令牌使用量和加快端到端推理速度方面优于单模型和基于文本的MAS基线,且无需额外训练。
Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization
Authors: Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson
First: 2025-11-12T11:28:08+00:00 · Latest: 2025-11-25T18:56:26+00:00
Abstract
Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B). A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.
中文标题/摘要
标题:分支定界中的规划:基于模型的强化学习精确组合优化
混合整数线性规划(MILP)是许多实际世界组合优化(CO)问题的核心,传统上通过分支定界(B&B)求解。影响B&B求解器效率的关键因素是指导分支决策的变量选择启发式算法。为了超越静态的手工启发式算法,最近的研究探索了将传统强化学习(RL)算法适应到B&B环境中,旨在学习针对特定MILP分布的定制化分支策略。与此同时,RL代理通过利用环境模拟器利用蒙特卡洛树搜索(MCTS)在棋盘游戏中取得了显著的成功,这是一种非常特定类型的组合问题。在此基础上,我们引入了Plan-and-Branch-and-Bound(PlanB&B),一种基于模型的强化学习(MBRL)代理,利用对B&B动力学的内部模型来发现改进的分支策略。计算实验实证验证了我们的方法,我们的MBRL分支代理在四个标准MILP基准测试中优于先前最先进的RL方法。
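To make the role of the variable selection heuristic concrete, here is a toy branch-and-bound for the 0/1 knapsack problem with a pluggable `select` hook. This is a sketch under stated assumptions, not PlanB&B and not a MILP solver: the knapsack setting, the fractional relaxation bound, and the two selector functions are illustrative choices. A learned branching policy would, in spirit, take the place of `select`.

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[int, int]  # (value, weight), with weight > 0
Selector = Callable[[List[Item], List[int]], int]


def relaxation_bound(items: List[Item], fixed: Dict[int, int], capacity: int) -> float:
    """Upper bound: fixed-in items plus a greedy fractional fill of the free items."""
    value = float(sum(items[i][0] for i, x in fixed.items() if x == 1))
    room = float(capacity - sum(items[i][1] for i, x in fixed.items() if x == 1))
    if room < 0:
        return float("-inf")  # fixed items already exceed the capacity
    free = sorted((i for i in range(len(items)) if i not in fixed),
                  key=lambda i: items[i][0] / items[i][1], reverse=True)
    for i in free:
        if room <= 0:
            break
        v, w = items[i]
        take = min(1.0, room / w)
        value += take * v
        room -= take * w
    return value


def branch_and_bound(items: List[Item], capacity: int, select: Selector) -> Tuple[int, int]:
    """Return (best value, number of explored nodes) using depth-first search."""
    best, nodes = 0, 0
    stack: List[Dict[int, int]] = [{}]
    while stack:
        fixed = stack.pop()
        nodes += 1
        bound = relaxation_bound(items, fixed, capacity)
        if bound <= best:
            continue  # prune: infeasible, or cannot beat the incumbent
        free = [i for i in range(len(items)) if i not in fixed]
        if not free:
            best = int(bound)  # every variable is fixed, so the bound is exact
            continue
        i = select(items, free)  # the branching decision
        stack.append({**fixed, i: 0})
        stack.append({**fixed, i: 1})
    return best, nodes


def first_free(items: List[Item], free: List[int]) -> int:
    return free[0]


def best_ratio(items: List[Item], free: List[int]) -> int:
    return max(free, key=lambda i: items[i][0] / items[i][1])


items = [(60, 10), (100, 20), (120, 30)]
value_a, nodes_a = branch_and_bound(items, 50, first_free)
value_b, nodes_b = branch_and_bound(items, 50, best_ratio)
```

Both selectors reach the same optimum; only the number of explored nodes can differ, which is the quantity a better branching strategy aims to reduce.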
Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model
Authors: Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
First: 2025-11-25T18:55:12+00:00 · Latest: 2025-11-25T18:55:12+00:00
Abstract
Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.
中文标题/摘要
标题:Image2Gcode:使用扩散变换模型的图像到G代码生成技术
机械设计和制造工作流程通常始于概念设计,随后创建计算机辅助设计(CAD)模型并通过材料挤出(MEX)打印进行制造。这一过程需要将CAD几何图形转换为机器可读的G代码,通过切片和路径规划实现。虽然每一步都已成熟,但依赖CAD建模仍然是一个主要瓶颈:构建特定对象的3D几何图形速度较慢,且不适用于快速原型制作。即使是细微的设计变化通常也需要在CAD软件中手动更新,使得迭代过程耗时且难以扩展。为解决这一限制,我们引入了Image2Gcode,这是一种端到端的数据驱动框架,绕过了CAD阶段,直接从图像和零件图纸生成打印机就绪的G代码。该框架首先从图像中提取切片级的结构线索,然后使用G代码序列上的去噪扩散概率模型(DDPM)。通过迭代去噪,模型将高斯噪声转化为可执行的打印移动轨迹,与相应的挤出参数相对应,从而建立了从视觉输入到原生工具路径的直接映射。通过直接从二维图像生成结构化的G代码,Image2Gcode消除了对CAD或STL中间件的需求,降低了增材制造的入门门槛,并加速了从设计到制造的周期。该方法支持从简单的草图或视觉参考进行按需原型制作,并与上游的2D到3D重建模块集成,以实现从概念到物理制品的自动化流程。结果是一种灵活且计算高效的框架,促进了设计迭代、修复工作流程和分布式制造的可访问性。
Summary / 总结
The paper introduces Image2Gcode, an end-to-end framework that generates printer-ready G-code directly from images and part drawings, bypassing the need for CAD models. It first extracts slice-wise structural cues from the image and then uses a denoising diffusion probabilistic model (DDPM) over G-code sequences to transform Gaussian noise into executable print-move trajectories with extrusion parameters. The authors argue that removing CAD and STL intermediates accelerates the design-to-fabrication cycle and lowers the entry barrier for additive manufacturing, supporting on-demand prototyping from simple sketches or visual references; the abstract reports no quantitative results.
Image2Gcode 是一个端到端的框架,可以直接从图像和零件图纸生成打印机可读的 G-code,无需使用 CAD 模型。它使用去噪扩散概率模型将高斯噪声转化为可执行的打印轨迹。主要发现包括省去了 CAD 阶段,使设计到制造的周期更快更灵活,并支持从简单草图进行按需原型制作。
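The abstract describes the model output as print-move trajectories with corresponding extrusion parameters. As a rough illustration of what that final representation looks like, the sketch below serializes a list of moves into G-code lines. The move format, the `moves_to_gcode` name, and the feed rate are assumptions made for this example; the generative model itself is not shown.

```python
from typing import Iterable, List, Tuple

Move = Tuple[float, float, float]  # (x_mm, y_mm, cumulative_extrusion_mm), assumed layout


def moves_to_gcode(moves: Iterable[Move], feedrate: float = 1200.0) -> List[str]:
    """Serialize print moves as G1 commands with absolute coordinates."""
    lines = [
        "G21 ; millimetre units",
        "G90 ; absolute positioning",
        "M82 ; absolute extrusion",
    ]
    for x, y, e in moves:
        lines.append(f"G1 X{x:.3f} Y{y:.3f} E{e:.5f} F{feedrate:.0f}")
    return lines


# Hypothetical perimeter of a 20 mm square on a single layer.
square = [
    (0.0, 0.0, 0.0),
    (20.0, 0.0, 0.8),
    (20.0, 20.0, 1.6),
    (0.0, 20.0, 2.4),
    (0.0, 0.0, 3.2),
]
gcode = moves_to_gcode(square)
```

A real toolpath would also need layer changes, travel moves, and temperature commands, which are left out here.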
iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Authors: Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin
First: 2025-11-25T18:54:16+00:00 · Latest: 2025-11-25T18:54:16+00:00
Abstract
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.
中文标题/摘要
标题:iMontage:统一、多功能、高度动态的多对多图像生成
预训练的视频模型学习生成高质量、时间上连贯的内容的强大先验知识。尽管这些模型在时间连贯性方面表现出色,但它们的动力学往往受限于其连续训练数据的性质。我们假设,通过将丰富且不受约束的内容多样性从图像数据注入这种连贯的时间框架中,我们可以生成既包含自然过渡又具有更广泛动态范围的图像集。为此,我们引入了iMontage,这是一种统一框架,旨在将强大的视频模型重新用于全方位的图像生成器。该框架消耗和产生可变长度的图像集,统一了广泛的图像生成和编辑任务。为了实现这一点,我们提出了一种优雅且微创的适应策略,辅以定制的数据整理过程和训练范式。这种方法使模型能够在不破坏其宝贵的原始运动先验知识的情况下获得广泛的图像操作能力。iMontage在多个主流的多对多任务中表现出色,不仅保持了强大的跨图像上下文一致性,还生成了具有非凡动态范围的场景,超越了传统范围。请访问我们的主页:https://kr1sjfu.github.io/iMontage-web/
Summary / 总结
The research aims to enhance the dynamic range and diversity of image generation by combining the temporal coherence of pre-trained video models with the content diversity of image data. The iMontage framework repurposes a video model, through a minimally invasive adaptation strategy plus tailored data curation and training, into a generator that consumes and produces variable-length image sets. According to the abstract, iMontage performs strongly across several many-in-many-out tasks, maintaining cross-image contextual consistency while generating scenes with highly dynamic content.
研究旨在通过将预训练视频模型的时间连贯性与图像数据的丰富内容相结合,增强图像生成的动态范围和多样性。iMontage框架将强大的视频模型适应为生成多样且动态的图像集,实现自然过渡和广泛的操控能力。关键发现表明,iMontage在保持跨图像一致性的同时,生成的动态场景超越了传统方法的限制。
MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models
Authors: Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi
First: 2025-11-25T18:49:21+00:00 · Latest: 2025-11-25T18:49:21+00:00
Abstract
Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
中文标题/摘要
标题:MapReduce LoRA:在生成模型多偏好优化中的帕累托前沿推进
基于人类反馈的强化学习(RLHF)与奖励模型的进步使生成模型与人类审美和感知偏好更加一致。然而,同时优化多个奖励通常会带来一种对齐税,即在提高一个维度的同时会降低其他维度。为了解决这个问题,我们引入了两种互补的方法:MapReduce LoRA和奖励感知的标记嵌入(RaTE)。MapReduce LoRA 并行训练特定偏好的 LoRA 专家,并迭代合并它们以细化共享基础模型;RaTE 在推理时学习特定奖励的标记嵌入,以灵活控制偏好。在文本到图像生成(Stable Diffusion 3.5 Medium 和 FLUX.1-dev)实验中,分别在 GenEval、PickScore 和 OCR 上取得了 36.1%、4.6% 和 55.7% 以及 32.7%、4.3% 和 67.1% 的改进。在文本到视频生成(HunyuanVideo)中,视觉和运动质量分别提高了 48.1% 和 90.0%。在语言任务中,有益助手(使用 Llama-2 7B)的有益和无害分别提高了 43.4% 和 136.7%。我们的框架在不同模态中设定了新的多偏好对齐新标准。
Summary / 总结
This paper addresses the challenge of multi-preference optimization in generative models by introducing MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model, while RaTE learns reward-specific token embeddings for flexible preference control at inference. On text-to-image generation with Stable Diffusion 3.5 Medium and FLUX.1-dev, GenEval, PickScore, and OCR improve by 36.1%, 4.6%, and 55.7% and by 32.7%, 4.3%, and 67.1%, respectively. On HunyuanVideo, visual and motion quality improve by 48.1% and 90.0%, and on the Helpful Assistant task with Llama-2 7B, the helpful and harmless scores improve by 43.4% and 136.7%.
论文旨在解决生成模型中多偏好优化的挑战,特别是人类反馈强化学习(RLHF)中的问题。提出了两种方法:MapReduce LoRA,通过并行训练偏好特定的专家并在迭代中合并它们;以及Reward-aware Token Embedding (RaTE),学习奖励特定的词嵌入以实现灵活的偏好控制。实验结果显示在文本到图像和文本到视频生成以及语言任务中取得了显著改进,涵盖了不同模态的新最先进的多偏好对齐方法。
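The "reduce" step, merging preference-specific LoRA experts into a shared base, can be sketched as follows. The abstract does not state the merge rule, so uniform averaging of the low-rank updates is an assumption here, and `merge_experts` is a hypothetical name. Plain nested lists stand in for weight matrices to keep the example dependency-free.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_experts(base: Matrix, experts: List[Tuple[Matrix, Matrix]]) -> Matrix:
    """Add the average of the low-rank updates B @ A to the base weights.

    Uniform averaging is an assumed merge rule, not the paper's stated one.
    """
    merged = [row[:] for row in base]
    for b_factor, a_factor in experts:
        delta = matmul(b_factor, a_factor)
        for r in range(len(merged)):
            for c in range(len(merged[0])):
                merged[r][c] += delta[r][c] / len(experts)
    return merged


base = [[0.0, 0.0], [0.0, 0.0]]
# Two rank-1 experts, each standing in for a LoRA trained on one reward ("map" step).
expert_one = ([[1.0], [0.0]], [[2.0, 0.0]])  # B is 2x1, A is 1x2
expert_two = ([[0.0], [1.0]], [[0.0, 4.0]])
merged = merge_experts(base, [expert_one, expert_two])
```

In an iterative scheme, `merged` would become the base for the next round of expert training; that loop is omitted.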
Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems
Authors: Anastasia Mavridou, Divya Gopinath, Corina S. Păsăreanu
First: 2025-11-25T18:48:19+00:00 · Latest: 2025-11-25T18:48:19+00:00
Abstract
The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.
中文标题/摘要
标题:用AI对抗AI:利用基础模型确保AI驱动的安全关键系统
将AI组件,尤其是深度神经网络(DNNs),集成到航空和自动驾驶车辆等安全关键系统中,提出了确保方面的根本性挑战。AI系统的不透明性与高层要求和低层网络表示之间的语义差距,阻碍了传统验证方法的应用。这些特定于AI的挑战被长期存在的需求工程问题放大,包括自然语言规范的模糊性和形式化中的可扩展性瓶颈。我们提出了一种方法,通过两个互补的组件利用AI本身来应对这些挑战。REACT(需求工程中的AI一致性与测试)利用大型语言模型(LLMs)弥合非正式自然语言规范与形式化规范之间的差距,实现早期验证和验证。SemaLens(视觉感知的语义分析使用大型多模态模型)利用视觉语言模型(VLMs)使用人类可理解的概念来推理、测试和监控基于DNN的感知系统。这些组件共同提供了一个从非正式需求到验证实现的全面管道。
Summary / 总结
This paper addresses the challenges of assuring AI-enabled safety-critical systems by proposing an approach that uses AI itself. REACT leverages Large Language Models to translate informal natural language requirements into formal specifications, facilitating early verification and validation. SemaLens employs Vision Language Models to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. The authors present the two components as a pipeline from informal requirements to validated implementations for domains such as aerospace and autonomous vehicles; the abstract describes the approach and does not report experimental findings.
本文提出了一种利用AI自身来解决AI启用的安全系统保障问题的方法。REACT 使用大型语言模型将非正式的需求转化为正式规范,促进早期验证。SemaLens 利用视觉语言模型通过人类可理解的概念来分析和测试基于DNN的感知系统。主要发现表明,这些方法能够有效弥合语义差距,并在航空航天和自动驾驶等安全关键应用中提高AI系统的保障水平。
ShapeGen: Towards High-Quality 3D Shape Synthesis
Authors: Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang, Ding Liang, Yan-Pei Cao
Venue: SIGGRAPH Asia 2025
First: 2025-11-25T18:47:27+00:00 · Latest: 2025-11-25T18:47:27+00:00
Comments: Accepted to SIGGRAPH Asia 2025
Abstract
Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.
中文标题/摘要
标题:ShapeGen:迈向高质量3D形状合成
受图像和视频生成范式的启发,3D形状生成已取得显著进展,能够从单张图像快速合成高保真3D资产。然而,当前方法仍面临细节不足、表面过度平滑和碎片化薄壳结构等挑战。这些限制使得生成的3D资产仍未达到艺术家所青睐的标准。在本文中,我们提出了ShapeGen,通过3D表示和监督改进、分辨率放大以及线性变换的优势,实现了高质量的图像到3D形状生成。这些进步使生成的资产能够无缝集成到3D流水线中,促进了其在各种应用中的广泛应用。通过大量实验,我们验证了这些改进对整体性能的影响。最终,得益于这些增强措施的协同效应,ShapeGen在图像到3D生成方面取得了显著飞跃,建立了新的性能基准。
Summary / 总结
ShapeGen aims to improve the quality of 3D shape synthesis by addressing issues such as lack of details and overly smoothed surfaces. The method uses 3D representation and supervision improvements, resolution scaling, and linear transformers to enhance the generated 3D assets. Extensive experiments show that ShapeGen significantly improves the quality of image-to-3D shape generation, setting a new state-of-the-art performance.
ShapeGen旨在通过解决细节不足和表面过度平滑等问题来提升3D形状生成的质量。它利用3D表示和监督改进、分辨率放大和线性变换器来增强生成过程。大量实验表明,ShapeGen显著超越了现有方法,在图像到3D生成方面达到了新的技术水平。
Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI
Authors: Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng
First: 2025-11-25T18:43:55+00:00 · Latest: 2025-11-25T18:43:55+00:00
Abstract
Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at https://ai4ce.github.io/wanderland/.
中文标题/摘要
标题:Wanderland:基于几何的开放世界体态AI仿真
在体态AI如视觉导航中,可重现的闭环评估仍然是一个主要瓶颈。一条有希望的前进道路是结合逼真传感器渲染和几何上接地的交互的高保真仿真。尽管最近的视频-3DGS方法简化了开放世界场景的捕获,但由于视觉和几何上的仿真到现实的巨大差距,它们仍然不适合基准测试。为了解决这些挑战,我们引入了Wanderland,一个实到仿的框架,具备多传感器捕获、可靠的重建、精确的几何和稳健的视图合成功能。通过此流程,我们精心制作了一个多样化的室内-室外城市场景数据集,并系统地展示了仅图像管道的扩展性差、几何质量对新颖视图合成的影响,以及所有这些因素如何对导航策略学习和评估可靠性产生负面影响。除了作为可信的体态导航测试平台外,Wanderland丰富的原始传感器数据还允许对3D重建和新颖视图合成模型进行基准测试。我们的工作为开放世界体态AI的可重现研究奠定了新的基础。项目网站为https://ai4ce.github.io/wanderland/。
Summary / 总结
Wanderland is a real-to-sim framework designed to address the challenges of reproducible closed-loop evaluation in embodied AI, particularly in visual navigation. It combines multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis, and the authors use it to curate a diverse dataset of indoor-outdoor urban scenes. The study demonstrates that image-only pipelines scale poorly and that geometry quality impacts novel view synthesis, and that both adversely affect navigation policy learning and evaluation reliability. Beyond navigation, Wanderland's raw sensor data also allows benchmarking of 3D reconstruction and novel view synthesis models.
研究旨在通过提出Wanderland框架解决视觉导航等体态AI评估中的挑战,该框架结合了多传感器捕获、可靠重建和精确几何。研究显示,仅基于图像的管道表现不佳,几何质量对新视角合成有显著影响,进而影响导航策略学习和评估可靠性。该框架还允许对3D重建和新视角合成模型进行基准测试,为开放世界体态AI的可重复研究奠定了新的基础。
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
Authors: Xin Wang, Yuwei Zhou, Bin Huang, Hong Chen, Wenwu Zhu
Venue: IEEE Transactions on Circuits and Systems for Video Technology 2025
First: 2024-09-23T13:16:09+00:00 · Latest: 2025-11-25T18:43:50+00:00
Comments: 21 pages, 10 figures, 3 tables
Abstract
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for multi-modal understanding; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of multi-modal generation. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions which may contribute to the ongoing advancement of multi-modal generative AI.
中文标题/摘要
标题:多模态生成AI:多模态大语言模型、扩散模型及其统一
多模态生成AI(人工智能)引起了学术界和工业界的广泛关注。特别是,两种主要的技术家族已经出现:i) 多模态大语言模型(LLMs)展示了多模态理解的出色能力;ii) 扩散模型在多模态生成方面表现出显著的能力。因此,本文提供了多模态生成AI的全面概述,包括多模态LLMs、扩散模型以及理解和生成的统一。为了为统一模型奠定坚实的基础,我们首先分别对多模态LLMs和扩散模型进行了详细的回顾,包括它们的概率建模过程、多模态架构设计以及应用于图像/视频LLMs以及文本到图像/视频生成的高级应用。此外,我们探讨了朝着理解和生成统一模型的新兴努力。为了实现理解和生成的统一,我们研究了包括自回归建模和扩散建模在内的关键设计,以及密集和混合专家(MoE)架构。然后介绍了几种统一模型的策略,分析了它们的潜在优势和劣势。此外,我们总结了广泛用于多模态生成AI预训练的常见数据集。最后但同样重要的是,我们提出了几个具有挑战性的未来研究方向,这些方向可能有助于多模态生成AI的持续发展。
Summary / 总结
This survey covers multi-modal generative AI, focusing on multi-modal large language models and diffusion models. It reviews their probabilistic modeling, architecture design, and applications, and discusses the unification of understanding and generation through autoregressive-based and diffusion-based modeling as well as dense and Mixture-of-Experts (MoE) architectures. It also analyzes the potential advantages and disadvantages of several unified-model strategies, summarizes common pretraining datasets, and outlines future research directions.
该论文全面概述了多模态生成AI,重点介绍了多模态大型语言模型和扩散模型。它回顾了它们的概率建模、架构设计和应用,并探讨了理解和生成的统一模型。讨论了诸如自回归和扩散建模、密集和Mixture-of-Experts架构等关键设计及其优缺点。论文还总结了用于预训练的常用数据集,并提出了该领域未来研究方向的建议。
Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities
Authors: Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri
First: 2025-11-25T18:40:48+00:00 · Latest: 2025-11-25T18:40:48+00:00
Comments: 10 pages, 6 figures, 7 tables
Abstract
This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals, each performing 204 load-reaching tasks from different load positions while adopting various lifting and handling techniques. The model inputs consisted of the 3D hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. The transformer architecture, with a root-mean-square error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study supports the use of neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
中文标题/摘要
标题:评估深度学习模型在负载搬运活动中全身动态3D姿态预测中的性能
本研究旨在探索深度神经网络在动态负载搬运活动中对人体姿态预测的应用。使用双向长短期记忆(BLSTM)和变换器架构训练了两个时间序列模型。数据集包括20名正常体重健康男性的3D全身动态坐标,每人完成204次从不同负载位置进行的搬运任务,采用各种搬运和处理技术。模型输入包括手-负载位置的3D位置、搬运(弯腰、全蹲和半蹲)和处理(单手和双手)技术、体重和身高,以及任务前25%时间内的身体姿态3D坐标数据。这些输入用于预测任务剩余75%时间内的身体坐标。此外,提出了一种新颖的方法,通过优化新的成本函数来强制保持身体段长度的恒定,从而提高先前和当前姿态预测网络的准确性。结果表明,新的成本函数分别将手臂和腿部模型的预测误差降低了约8%和21%。我们表明,使用变换器架构,均方根误差为47.0毫米,比基于BLSTM的模型的长期性能提高了约58%。本研究强调了在3D运动帧中捕捉时间序列依赖性的神经网络的使用,为理解并预测手动搬运活动中运动动态提供了一种独特的方法。
Summary / 总结
This study aimed to evaluate the performance of deep learning models in predicting whole-body posture during dynamic load-reaching activities. Two models, using BLSTM and transformer architectures, were trained on 3D full-body plug-in gait marker coordinates from 20 healthy males, each performing 204 load-reaching tasks. A novel cost function that enforces constant body segment lengths was introduced to improve posture prediction accuracy, reducing errors by approximately 8% and 21% for the arm and leg models, respectively. The transformer model, with a root-mean-square error of 47.0 mm, showed roughly 58% more accurate long-term performance than the BLSTM-based model.
本研究旨在评估深度学习模型在预测动态搬运重物过程中人体整体姿态方面的性能。研究人员使用BLSTM和Transformer架构训练了模型,数据为20名健康男性每人完成204次搬运任务的3D全身plug-in gait标记点坐标。引入了一种强制身体段长度恒定的新成本函数来提高姿态预测的准确性,分别将手臂和腿部模型的预测误差降低了约8%和21%。Transformer模型的均方根误差为47.0毫米,长期预测精度比基于BLSTM的模型高约58%。
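The abstract says accuracy improves when constant body segment lengths are enforced through a new cost function, but it does not give the formula. The sketch below shows one plausible form, assumed for illustration: a mean squared deviation between predicted segment lengths and reference lengths, which could be added to a position loss. The joint names and the `segment_length_penalty` function are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]  # 3D position in millimetres
Frame = Dict[str, Joint]


def segment_length_penalty(
    frames: Sequence[Frame],
    segments: Sequence[Tuple[str, str]],
    reference_lengths: Sequence[float],
) -> float:
    """Mean squared deviation of predicted segment lengths from reference lengths."""
    total, count = 0.0, 0
    for joints in frames:
        for (a, b), ref in zip(segments, reference_lengths):
            total += (math.dist(joints[a], joints[b]) - ref) ** 2
            count += 1
    return total / count if count else 0.0


segments = [("shoulder", "elbow"), ("elbow", "wrist")]
reference = [300.0, 250.0]  # assumed subject-specific lengths, mm

predicted: List[Frame] = [
    # Frame 0: both segments have exactly the reference length.
    {"shoulder": (0.0, 0.0, 0.0), "elbow": (0.0, 0.0, -300.0), "wrist": (0.0, 0.0, -550.0)},
    # Frame 1: the elbow drifts 10 mm, stretching one segment and shrinking the other.
    {"shoulder": (0.0, 0.0, 0.0), "elbow": (0.0, 0.0, -310.0), "wrist": (0.0, 0.0, -550.0)},
]
penalty = segment_length_penalty(predicted, segments, reference)
```

A soft penalty like this discourages anatomically impossible predictions without hard-constraining the network; whether the paper uses a soft or hard constraint is not stated in the abstract.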
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Authors: Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou
First: 2025-11-25T18:40:25+00:00 · Latest: 2025-11-25T18:40:25+00:00
Comments: Project page: https://ouyangziheng.github.io/ImageCritic-Page/
Abstract
Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
中文标题/摘要
标题:生成图像一致性批评者:通过参考引导注意力对齐纠正生成图像的一致性问题
先前的工作已经探索了各种基于参考图像的定制生成任务,但仍面临在生成精细细节时的一致性限制。本文旨在通过应用参考引导的后编辑方法解决生成图像的一致性问题,并介绍我们的ImageCritic。我们首先构建了一个参考退化目标三元组数据集,通过基于VLM的选择和显式退化,有效地模拟了现有生成模型中常见的不准确或不一致性。此外,基于对模型注意力机制和内在表示的深入研究,我们相应地设计了一种注意力对齐损失和细节编码器,以精确纠正不一致性。ImageCritic可以集成到代理框架中,自动检测不一致性并在复杂场景中进行多轮和局部编辑进行纠正。广泛的实验表明,ImageCritic可以有效地解决各种定制生成场景中的细节相关问题,显著优于现有方法。
Summary / 总结
The research aims to address the inconsistency issue in generated images by proposing ImageCritic, a reference-guided post-editing approach. The method involves constructing a dataset of reference-degraded-target triplets and using an attention alignment loss and a detail encoder to correct inconsistencies. Experiments show that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, outperforming existing methods.
研究旨在通过提出一种参考引导的后编辑方法ImageCritic来解决生成图像中的不一致性问题。该方法包括构建参考降级目标三元组数据集,并使用注意力对齐损失和细节编码器来纠正不一致性。实验表明,ImageCritic可以有效解决各种定制生成场景中的细节相关问题,优于现有方法。
Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning
Authors: Panayiotis Danassis, Naman Goel
First: 2025-11-25T18:40:22+00:00 · Latest: 2025-11-25T18:40:22+00:00
Abstract
The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation, and this development has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and ~40k matches demonstrate (i) a clear superiority of human-coded agents (written by graduate students): the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon it, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
中文标题/摘要
标题:vibe编码能击败研究生CS学生吗?基于市场驱动的战略规划的LLM与人类编码比赛
大型语言模型(LLMs)的迅速发展已经革新了AI辅助的代码生成。LLMs的发展速度超出了我们对其正确基准测试的能力。现有的基准测试主要关注单元测试通过率和语法正确性,这些指标低估了许多需要规划、优化和战略互动的现实问题的难度。我们引入了一个基于现实世界物流优化问题(拍卖、取货和送货问题)的多智能体推理驱动基准,该问题结合了竞争性拍卖和容量约束路由。该基准要求构建能够(i)在不确定性下进行战略投标和(ii)优化规划者以最大化利润来完成任务的智能体。我们评估了40个LLM编码智能体(由多种最先进的LLM在多种提示方法下,包括vibe编码生成)与17个人类编码智能体(这些智能体是在LLM出现之前开发的)进行了12场双全赛和约40000场比赛。结果显示:(i)人类(研究生)编码智能体表现出明显优势:前五名始终由人类编码智能体获得;(ii)大多数LLM编码智能体(40个中的33个)被非常简单的基线击败;(iii)给定最佳人类解决方案作为输入并提示其改进时,表现最好的LLM反而使解决方案显著变差,而不是改进它。我们的结果突显了LLMs在生成能够在现实世界中竞争工作的代码方面的能力差距,并激励了新的评估,这些评估强调在现实世界场景中的推理驱动代码合成。
Summary / 总结
This study compares LLM-coded and human-coded agents on a strategic planning benchmark based on a real-world logistics optimization problem. The benchmark couples competitive auctions with capacity-constrained routing, requiring agents to bid strategically under uncertainty and to optimize planners that deliver tasks while maximizing profit. Across 12 double all-play-all tournaments, the 17 human-coded agents outperformed the 40 LLM-coded agents and consistently took the top 5 spots. Most LLM-coded agents (33 of 40) were beaten by very simple baselines, and when the best-performing LLM was given the best human solution and prompted to improve it, it made the solution significantly worse. These findings indicate a gap in LLMs' ability to generate competitive code in real-world scenarios.
研究使用一个基于真实物流优化问题的战略规划锦标赛来评估大型语言模型(LLMs)和人工编码的代理。基准测试要求代理在不确定性下进行战略投标,并优化计划以最大化利润。尽管LLMs功能强大,但人工编码的代理表现更优,始终占据前五名。大多数LLM编码的代理也被简单的基线所超越,即使给LLMs提供最佳的人类解决方案来改进,它们反而使解决方案变得更差。这突显了LLMs在现实世界中进行推理驱动的代码合成方面的局限性。
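The reported match count can be sanity-checked with a few lines of Python. This is a rough sketch under two assumptions that the abstract does not spell out: that all 57 agents (40 LLM-coded plus 17 human-coded) enter every tournament, and that "double all-play-all" means each pair of agents meets twice. The agent names are purely illustrative.

```python
from itertools import permutations

def double_round_robin(agents):
    """Return every ordered pairing (a, b) with a != b, so each pair meets twice."""
    return list(permutations(agents, 2))

# Agent counts come from the abstract; the naming scheme is hypothetical.
agents = [f"llm_{i}" for i in range(40)] + [f"human_{i}" for i in range(17)]

matches_per_tournament = len(double_round_robin(agents))  # 57 * 56
total_matches = 12 * matches_per_tournament               # 12 tournaments

print(matches_per_tournament)  # 3192
print(total_matches)           # 38304
```

Under these assumptions the total is 38,304 matches, which is consistent with the "~40k matches" figure quoted in the abstract.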
Sparse-to-Field Reconstruction via Stochastic Neural Dynamic Mode Decomposition
Authors: Yujin Kim, Sarah Dean
First: 2025-11-25T18:39:50+00:00 · Latest: 2025-11-25T18:39:50+00:00
Abstract
Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine learning. Dynamic Mode Decomposition (DMD) provides a simple, data-driven approximation, but practical use is limited by sparse/noisy observations from continuous fields, reliance on linear approximations, and the lack of principled uncertainty quantification. To address these issues, we introduce Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. Our approach enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Across four benchmarks, a synthetic setting and three physics-based flows, it surpasses a baseline in reconstruction accuracy when trained from only 10% observation density. It further recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Finally, on datasets with multiple realizations, our method learns a calibrated distribution over latent dynamics that preserves ensemble variability rather than averaging across regimes. Our code is available at: https://github.com/sedan-group/Stochastic-NODE-DMD
中文标题/摘要
标题:基于随机神经动态模式分解的稀疏到场重建
许多重要的现实世界系统,如风场和海洋流,是动态的且难以建模。学习其支配动力学仍然是科学机器学习中的一个核心挑战。动态模式分解(DMD)提供了一种简单的数据驱动近似方法,但其实际应用受限于连续场的稀疏/噪声观测、对线性近似的依赖以及缺乏原理性的不确定性量化。为了解决这些问题,我们引入了随机NODE-DMD,这是一种DMD的概率扩展,能够建模连续时间的非线性动力学,同时保持可解释性。我们的方法能够在任意坐标处实现连续时空重建,并量化预测不确定性。在四个基准测试、一个合成设置和三个基于物理的流中,当仅从10%的观测密度中进行训练时,它在重建精度上超过了基线。此外,它还通过使学习到的模式和连续时间特征值与真实值对齐来恢复动力学结构。最后,在具有多个实现的数据集上,我们的方法学习了一个保留集合变异性而非在不同模式上平均的潜在动力学的校准分布。我们的代码可在以下链接获取:https://github.com/sedan-group/Stochastic-NODE-DMD
Summary / 总结
This paper addresses the challenge of learning governing dynamics of complex systems like wind fields and ocean currents, which are dynamic and difficult to model. It introduces Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. The method enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Experiments show that it outperforms a baseline in reconstruction accuracy when trained from only 10% observation density and recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Additionally, it learns a calibrated distribution over latent dynamics that preserves ensemble variability on datasets with multiple realizations.
该论文旨在解决复杂系统(如风场和海洋流)的动态建模难题,这些系统动态且难以建模。它引入了Stochastic NODE-DMD,这是一种概率扩展的DMD方法,能够建模连续时间下的非线性动力学并保持可解释性。该方法能够在任意坐标上实现连续时空重建,并量化预测不确定性。实验表明,当仅从10%的观测密度中训练时,它在重建精度上优于基线,并通过将学习到的模式和连续时间特征值与真实值对齐来恢复动力学结构。此外,它在具有多个实现的数据集上学习了一个保留集合变异性而非在不同模式间平均的动态分布。
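For readers unfamiliar with the baseline technique, the following is a minimal sketch of classical (exact) DMD in NumPy. It is not the Stochastic NODE-DMD method from the paper: it assumes dense, noise-free snapshots on a fixed grid and purely linear dynamics, which are exactly the limitations the abstract sets out to remove. The function name `exact_dmd` and the toy system are illustrative assumptions.

```python
import numpy as np

def exact_dmd(snapshots, rank):
    """Classical exact DMD on a (n_features, n_times) snapshot matrix."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]       # time-shifted pairs
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]      # truncate to the chosen rank
    # Reduced linear operator: A_tilde = U^H Y V S^{-1}
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = (Y @ Vh.conj().T / s) @ W                # exact DMD modes
    return eigvals, modes

# Toy linear system: a decaying rotation with known eigenvalues 0.9 * exp(+-0.3i).
theta = 0.3
A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
traj = [np.array([1.0, 0.0])]
for _ in range(20):
    traj.append(A @ traj[-1])
snapshots = np.stack(traj, axis=1)                   # shape (2, 21)

eigvals, modes = exact_dmd(snapshots, rank=2)
dt = 1.0                                             # assumed sampling interval
continuous_eigvals = np.log(eigvals) / dt            # continuous-time eigenvalues
print(np.abs(eigvals), np.angle(eigvals))
```

On this toy system the recovered eigenvalues have modulus 0.9 and phase ±0.3, matching the generating matrix. Converting them with `np.log(eigvals) / dt` gives the continuous-time eigenvalues that the abstract compares against ground truth.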
How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets
Authors: Xiwen Huang, Pierre Pinson
First: 2025-11-25T18:34:33+00:00 · Latest: 2025-11-25T18:34:33+00:00
Comments: Submitted as a preprint. 34 pages, 14 figures, 4 tables
Abstract
We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.
Summary / 总结
The paper introduces active learning markets as a cost-effective method for acquiring labels, particularly useful for improving model fitting in predictive analytics. It formulates the market clearing as an optimization problem, incorporating budget constraints and improvement thresholds. Two active learning strategies—variance-based and query-by-committee-based—are proposed and compared to random sampling. The strategies are validated on real-world datasets from real estate pricing and energy forecasting, showing consistent superior performance with fewer labels needed compared to traditional methods.
论文提出了一种使用主动学习市场以成本效益的方式获取标签的方法,特别是在提高预测分析模型拟合度时。它将市场清算形式化为优化问题,纳入预算约束和改进阈值。提出了基于方差和基于委员会查询的两种主动学习策略,并将其与随机抽样方法进行了比较。这些策略在房地产定价和能源预测等实际数据集上得到了验证,显示了在获取更少标签的情况下比传统方法具有更优性能。
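The idea of buying labels under a budget can be illustrated with a small greedy sketch. This is not the paper's market-clearing optimisation: it simply ranks candidate labels by committee disagreement (prediction variance) per unit price and buys them until the budget runs out. The function `select_labels`, the `min_score` threshold, the prices, and the committee predictions are all hypothetical.

```python
from statistics import pvariance

def select_labels(committee_preds, prices, budget, min_score=0.0):
    """Greedily buy labels with the highest committee variance per unit price."""
    scores = {
        i: pvariance(preds) / prices[i]
        for i, preds in committee_preds.items()
    }
    chosen, spent = [], 0.0
    for i in sorted(scores, key=scores.get, reverse=True):
        # Skip labels that are unaffordable or below the improvement threshold.
        if scores[i] >= min_score and spent + prices[i] <= budget:
            chosen.append(i)
            spent += prices[i]
    return chosen, spent

# Hypothetical committee predictions for three unlabeled points.
committee_preds = {
    "a": [1.0, 1.1, 0.9],   # committee agrees
    "b": [0.0, 2.0, 4.0],   # strong disagreement
    "c": [1.0, 2.0, 3.0],   # moderate disagreement
}
prices = {"a": 1.0, "b": 2.0, "c": 1.0}

chosen, spent = select_labels(committee_preds, prices, budget=3.0)
print(chosen, spent)  # ['b', 'c'] 3.0
```

With a budget of 3.0, the sketch buys the two labels the committee disagrees on most and skips the one it already agrees on, which is the intuition behind acquiring fewer labels than random sampling would.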
On Evaluating LLM Alignment by Evaluating LLMs as Judges
Authors: Yixin Liu, Pengfei Liu, Arman Cohan
Venue: NeurIPS 2025
First: 2025-11-25T18:33:24+00:00 · Latest: 2025-11-25T18:33:24+00:00
Comments: NeurIPS 2025 Camera Ready
Abstract
Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
中文标题/摘要
标题:通过评估LLM作为法官来评估LLM的对齐
与人类偏好的一致性是LLM评估的重要方面,要求它们能够提供帮助、诚实、安全,并精确遵循人类指令。评估大型语言模型(LLM)的一致性通常涉及直接评估它们的开放式响应,这需要人类注释者或强大的LLM法官。相反,LLM本身也被广泛评估为评估一致性的法官。在这项工作中,我们探讨了LLM在生成和评估能力方面与人类偏好的一致性关系。为此,我们首先对各种LLM的生成-评估一致性(GE-一致性)进行了全面分析,揭示了当由强大的LLM偏好Oracle评估时,它们的生成和评估能力之间存在很强的相关性。利用这一发现,我们提出了一种基准测试范式,该范式通过评估LLM在评估者角色中的表现来衡量它们与人类偏好的一致性,而不是直接评估它们生成的输出。我们的评估表明,我们提出的基准测试AlignEval在捕捉人类偏好方面与广泛使用的自动LLM评估基准AlpacaEval和Arena-Hard相当或更优。我们的研究为LLM的生成和评估能力之间的联系提供了有价值的见解,并引入了一种无需直接评估模型输出即可评估一致性的基准测试。
Summary / 总结
This study examines whether the alignment of large language models (LLMs) with human preferences can be measured through their ability to act as judges. It finds a strong correlation between LLMs' generation and evaluation capabilities when both are scored by a strong LLM preference oracle, and proposes AlignEval, a benchmark that assesses alignment by evaluating LLMs in their role as evaluators rather than by judging their generated outputs. AlignEval matches or surpasses widely used automatic benchmarks such as AlpacaEval and Arena-Hard in capturing human preferences when ranking LLMs.
该研究通过利用LLMs的评估能力来评估其与人类偏好的一致性,并发现LLMs的生成能力和评估能力之间存在强烈的相关性。研究提出了一种名为AlignEval的基准,通过LLMs的评估角色来评估其一致性,AlignEval在捕捉人类偏好方面优于或与AlpacaEval和Arena-Hard等现有基准相当。
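One simple way to quantify how closely a generation ranking tracks an evaluation ranking is a rank correlation. The sketch below uses Spearman's coefficient as an assumption: the abstract reports a "strong correlation" but does not say which statistic is used. The scores for the five models are invented for illustration only.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-model scores: quality as a generator vs. accuracy as a judge.
generation_scores = [0.81, 0.64, 0.72, 0.40, 0.55]
evaluation_scores = [0.90, 0.70, 0.68, 0.52, 0.60]

rho = spearman(generation_scores, evaluation_scores)
print(round(rho, 3))  # 0.9
```

A value near 1 means that ranking models by judging ability reproduces the ranking by generation quality, which is the property a judge-based benchmark relies on.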
BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents
Authors: Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, Ninghui Li
First: 2025-11-25T18:28:35+00:00 · Latest: 2025-11-25T18:28:35+00:00
Abstract
The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood. In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach.
中文标题/摘要
标题:BrowseSafe:理解并防止AI浏览器代理中的提示注入
将人工智能(AI)代理集成到网络浏览器中引入了超越传统网络应用程序威胁模型的安全挑战。先前的研究已经识别出提示注入是网络代理的新攻击向量,但其在实际环境中的影响仍然不够了解。 在这项工作中,我们研究了提示注入攻击的景观,并综合了一个嵌入在现实HTML负载中的攻击基准。我们的基准超越了先前的工作,强调了可以影响实际操作的注入,而不是仅仅影响文本输出,并且展示了与实际代理遇到的复杂性和干扰频率相似的攻击负载。我们利用这个基准进行了一项全面的经验性评估,评估现有防御措施在一系列前沿AI模型中的有效性。我们提出了一种多层次的防御策略,包括架构和模型基础的防御,以抵御不断演变的提示注入攻击。我们的工作提供了一种通过多层次防御方法设计实用、安全的网络代理的蓝图。
Summary / 总结
The research aims to understand and prevent prompt injection attacks in AI browser agents, which pose security challenges beyond traditional web application threat models. The study synthesizes a benchmark of attacks embedded in realistic HTML payloads, emphasizing injections that can influence real-world actions rather than mere text outputs, and uses it to evaluate existing defenses across a suite of frontier AI models. It then proposes a multi-layered defense strategy, combining architectural and model-based defenses, to protect against evolving prompt injection attacks.
本研究针对AI浏览器代理中的提示注入攻击带来的安全挑战,这些挑战超出了传统网络应用程序威胁模型的范围。研究人员开发了一个包含复杂提示注入攻击的基准,这些攻击嵌入在真实的HTML内容中,以评估现有防御措施的有效性。研究发现,当前的防御措施不足以应对复杂的提示注入攻击,并提出了一种结合架构和模型基础的多层次防御策略,以增强安全性。
Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning
Authors: Charlotte Beylier, Hannah Selder, Arthur Fleig, Simon M. Hofmann, Nico Scherf
First: 2025-11-25T18:20:42+00:00 · Latest: 2025-11-25T18:20:42+00:00
Abstract
The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent's attention during training. In a controlled experiment, we tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and these differences in attention patterns translate into differences in the agent's behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent's attention developed in phases, and that these phases were consistent across game variations. Overall, we believe that ATOMs could help improve our understanding of the learning processes of RL agents and clarify the relationship between attention and learning.
中文标题/摘要
标题:注意力轨迹作为深度强化学习诊断轴
强化学习(RL)代理的学习过程在数学描述其学习算法之外仍然知之甚少。为了解决这一问题,我们引入了注意力导向的度量标准(ATOMs),以研究RL代理在训练期间注意力的发展。在一项受控实验中,我们对三种不同版本的Pong游戏测试了ATOMs,每种版本都旨在教会代理不同的行为,并辅以行为评估。ATOMs成功地界定了每个游戏版本训练的代理的注意力模式,且这些注意力模式的差异转化为代理行为的差异。通过在训练过程中持续监控ATOMs,我们观察到代理的注意力在阶段上发展,并且这些阶段在不同游戏版本中是一致的。总体而言,我们认为ATOMs有助于提高我们对RL代理学习过程的理解,并更好地理解注意力与学习之间的关系。
Summary / 总结
The study aims to better understand the learning process of reinforcement learning agents by introducing attention-oriented metrics (ATOMs) to analyze the development of an agent's attention during training. The method involves testing ATOMs on three variations of a Pong game, each teaching the agent different behaviors, and conducting a behavioral assessment. The key findings show that ATOMs can effectively distinguish the attention patterns of the agent trained on each game variation, and these patterns correlate with the agent's behavior. The study also observed that the agent's attention developed in distinct phases, consistent across different game variations, suggesting that ATOMs can provide insights into the learning processes of RL agents.
研究旨在通过引入注意力导向度量(ATOMs)来更好地理解强化学习代理的学习过程,通过在三种不同版本的Pong游戏中测试ATOMs,每种版本教会代理不同的行为,并进行行为评估。关键发现表明,ATOMs可以有效区分代理在每个游戏版本训练后的注意力模式,这些模式与代理的行为相关。研究还观察到,代理的注意力在不同游戏版本中都经历了不同的阶段,表明ATOMs可以提供关于RL代理学习过程的见解。
MSTN: Fast and Efficient Multivariate Time Series Model
Authors: Sumit S Shevtekar, Chandresh K Maurya, Gourab Sil
First: 2025-11-25T18:09:42+00:00 · Latest: 2025-11-25T18:09:42+00:00
Comments: 21 pages, 1 figure, 5 tables
Abstract
Real-world time-series data is highly non-stationary, with complex dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most existing models rely on fixed-scale structural priors, such as patch-based tokenization, fixed frequency transformations, or frozen backbone architectures. This often leads to over-regularization of temporal dynamics, which limits their ability to adaptively model the full spectrum of temporal variations and impairs their performance on unpredictable, sudden, high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a novel deep learning architecture founded on a hierarchical multi-scale and sequence modeling principle. The MSTN framework integrates: (i) a multi-scale convolutional encoder that constructs a hierarchical feature pyramid for local patterns; (ii) a sequence modeling component for long-range temporal dependencies, which we empirically validate with BiLSTM and Transformer variants, establishing a flexible foundation for future architectural advancements; and (iii) a gated fusion mechanism augmented with squeeze-and-excitation (SE) and multi-head temporal attention (MHTA) for dynamic, context-aware feature integration. This design enables MSTN to adaptively model temporal patterns from milliseconds to long-range dependencies within a unified framework. Extensive evaluations across time-series long-horizon forecasting, imputation, classification and generalizability study demonstrate that MSTN achieves competitive state-of-the-art (SOTA) performance, showing improvements over contemporary approaches including EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST. In total, MSTN establishes new SOTA performance on 24 of 32 benchmark datasets, demonstrating its consistent performance across diverse temporal tasks.
中文标题/摘要
标题:MSTN:快速高效的多变量时间序列模型
现实世界的时间序列数据高度非平稳且动态复杂,跨越多个时间尺度,从快速的短期变化到缓慢的长期趋势。大多数现有模型依赖于固定尺度的结构先验,如基于块的标记化、固定频率变换或冻结的骨干架构。这通常会导致时间动态的过度正则化,限制了它们适应性地建模时间变化全谱的能力,并影响其在不可预测、突发、高幅度事件上的性能。为解决这一问题,我们引入了多尺度时间网络(MSTN),这是一种基于分层多尺度和序列建模原则的新型深度学习架构。MSTN框架整合了:(i) 多尺度卷积编码器,构建局部模式的分层特征金字塔;(ii) 用于长程时间依赖性的序列建模组件。我们通过双向LSTM和Transformer变体进行实证验证,建立了未来架构进步的灵活基础;(iii) 带有挤压-激励(SE)和多头时间注意力(MHTA)的门控融合机制,实现动态、上下文感知的特征融合。此设计使MSTN能够在统一框架中自适应地建模从毫秒到长程依赖的时间模式。广泛的评估表明,MSTN在时间序列长视窗预测、插补、分类和泛化研究中实现了具有竞争力的最先进(SOTA)性能,优于包括EMTSF、LLM4TS、HiMTM、TIME-LLM、MTST、SOFTS、iTransformer、TimesNet和PatchTST在内的当代方法。总体而言,MSTN在32个基准数据集中的24个上建立了新的SOTA性能,展示了其在各种时间任务中的一致性能。
Summary / 总结
The research aims to address the limitations of existing models in handling non-stationary and complex multivariate time series data by introducing the Multi-scale Temporal Network (MSTN). MSTN uses a hierarchical multi-scale and sequence modeling approach, combining a multi-scale convolutional encoder, a sequence modeling component, and a gated fusion mechanism with squeeze-and-excitation and multi-head temporal attention. Experiments on forecasting, imputation, classification, and generalizability show that MSTN outperforms contemporary models like EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST, achieving state-of-the-art performance on 24 out of 32 benchmark datasets.
论文提出了MSTN,这是一种新的深度学习架构,旨在处理跨越多个时间尺度的非平稳和复杂的多变量时间序列数据。MSTN使用多尺度卷积编码器处理局部模式,序列建模组件处理长范围依赖性,并使用门控融合机制结合squeeze-and-excitation和多头时间注意力动态集成特征。实验表明,MSTN在长周期预测、插补、分类和泛化性方面优于几种当代模型,在32个基准数据集中的24个上达到了最先进的性能。
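The gated fusion step can be pictured with a small NumPy sketch. This is only a generic sigmoid gate that blends two feature maps elementwise; it omits the squeeze-and-excitation and multi-head temporal attention components the abstract mentions, and it is not the authors' implementation. The names `gated_fusion`, `W`, and `b`, along with the shapes, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(local_feats, seq_feats, W, b):
    """Blend two (time, channels) feature maps with an elementwise gate.

    The gate is computed from both inputs, so the mix can differ per
    time step and per channel.
    """
    stacked = np.concatenate([local_feats, seq_feats], axis=-1)  # (T, 2C)
    gate = sigmoid(stacked @ W + b)                              # (T, C), values in (0, 1)
    return gate * local_feats + (1.0 - gate) * seq_feats

T, C = 4, 3
rng = np.random.default_rng(0)
local_feats = rng.normal(size=(T, C))   # stand-in for convolutional encoder output
seq_feats = rng.normal(size=(T, C))     # stand-in for BiLSTM/Transformer output

# With zero weights the gate is 0.5 everywhere, so fusion is a plain average.
W, b = np.zeros((2 * C, C)), np.zeros(C)
fused = gated_fusion(local_feats, seq_feats, W, b)
print(fused.shape)  # (4, 3)
```

In a trained model `W` and `b` would be learned, letting the network lean on local convolutional features at some time steps and on long-range sequence features at others.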
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Authors: Chenhui Gou, Zilong Chen, Zeyu Wang, Feng Li, Deyao Zhu, Zicheng Duan, Kunchang Li, Chaorui Deng, Hongyi Yuan, Haoqi Fan, Cihang Xie, Jianfei Cai, Hamid Rezatofighi
First: 2025-11-25T18:06:22+00:00 · Latest: 2025-11-25T18:06:22+00:00
Abstract
This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
中文标题/摘要
标题:VQ-VA世界:迈向高质量视觉问题-视觉答案
本文研究了视觉问题-视觉答案(VQ-VA):生成图像而非文本来回答视觉问题的能力——这种能力最近在NanoBanana和GPT-Image等专有系统中出现。为了将这种能力带给开源模型,我们引入了VQ-VA世界,这是一个以数据为中心的框架,围绕一个自主的流水线进行大规模、定向的数据构建。利用大规模部署,该流水线爬取了约180万高质量的图像-文本样本供模型训练。为了评估,我们进一步发布了IntelligentBench,这是一个由人类精心策划的基准,系统地评估VQ-VA在世界知识、设计知识和推理方面的表现。使用VQ-VA世界数据进行训练带来了显著的经验收益:它帮助LightFusion在IntelligentBench上达到53.06分,大幅超越了最佳开源基线(即vanilla LightFusion的7.78分;UniWorld-V1的1.94分),并显著缩小了与领先专有系统的差距(例如,NanoBanana的81.67分;GPT-Image的82.64分)。通过发布完整的模型权重、数据集和流水线,我们希望激发未来对VQ-VA的研究。
E2E-GRec: An End-to-End Joint Training Framework for Graph Neural Networks and Recommender Systems
Authors: Rui Xue, Shichao Zhu, Liang Qin, Guangmou Pan, Yang Song, Tianfu Wu
First: 2025-11-25T17:59:22+00:00 · Latest: 2025-11-25T17:59:22+00:00
Abstract
Graph Neural Networks (GNNs) have emerged as powerful tools for modeling graph-structured data and have been widely used in recommender systems, such as for capturing complex user-item and item-item relations. However, most industrial deployments adopt a two-stage pipeline: GNNs are first pre-trained offline to generate node embeddings, which are then used as static features for downstream recommender systems. This decoupled paradigm leads to two key limitations: (1) high computational overhead, since large-scale GNN inference must be repeatedly executed to refresh embeddings; and (2) lack of joint optimization, as the gradient from the recommender system cannot directly influence the GNN learning process, causing the GNN to be suboptimally informative for the recommendation task. In this paper, we propose E2E-GRec, a novel end-to-end training framework that unifies GNN training with the recommender system. Our framework is characterized by three key components: (i) efficient subgraph sampling from a large-scale cross-domain heterogeneous graph to ensure training scalability and efficiency; (ii) a Graph Feature Auto-Encoder (GFAE) serving as an auxiliary self-supervised task to guide the GNN to learn structurally meaningful embeddings; and (iii) a two-level feature fusion mechanism combined with Gradnorm-based dynamic loss balancing, which stabilizes graph-aware multi-task end-to-end training. Extensive offline evaluations, online A/B tests (e.g., a +0.133% relative improvement in stay duration, a 0.3171% reduction in the average number of videos a user skips) on large-scale production data, together with theoretical analysis, demonstrate that E2E-GRec consistently surpasses traditional approaches, yielding significant gains across multiple recommendation metrics.
中文标题/摘要
标题:E2E-GRec:图神经网络和推荐系统联合训练的端到端框架
图神经网络(GNNs)已成为建模图结构数据的强大工具,并广泛应用于推荐系统中,如捕捉复杂用户-项目和项目-项目关系。然而,大多数工业部署采用两阶段管道:GNNs首先离线预训练生成节点嵌入,然后作为静态特征用于下游推荐系统。这种分离范式导致两个关键限制:(1)高计算开销,因为大规模GNN推理必须反复执行以刷新嵌入;(2)缺乏联合优化,因为推荐系统的梯度不能直接影响GNN的学习过程,导致GNN对推荐任务的信息不足。在本文中,我们提出了一种新颖的端到端训练框架E2E-GRec,将GNN训练与推荐系统统一起来。我们的框架由三个关键组件组成:(i)从大规模跨域异构图中高效采样子图以确保训练的可扩展性和效率;(ii)图特征自编码器(GFAE)作为辅助自监督任务,引导GNN学习结构上具有意义的嵌入;(iii)结合Gradnorm动态损失平衡的两级特征融合机制,稳定图感知的多任务端到端训练。大规模离线评估、在线A/B测试(例如,停留时间相对提高0.133%,用户跳过的平均视频数量减少0.3171%)以及生产数据上的理论分析表明,E2E-GRec在多个推荐指标上始终优于传统方法,取得了显著的提升。
Summary / 总结
The paper proposes E2E-GRec, an end-to-end training framework that integrates Graph Neural Networks (GNNs) with recommender systems to address the limitations of the two-stage pipeline. Key components include efficient subgraph sampling, a Graph Feature Auto-Encoder for self-supervised learning, and a two-level feature fusion mechanism with Gradnorm-based loss balancing. Experimental results show significant improvements in recommendation metrics, such as a +0.133% relative increase in stay duration and a 0.3171% reduction in skipped videos.
E2E-GRec 是一个端到端的训练框架,将图神经网络(GNN)与推荐系统结合,通过减少计算开销和实现联合优化来解决两阶段管道的限制。它包括高效的子图采样、用于自监督学习的图特征自动编码器以及带有动态损失平衡的两层特征融合机制。实验结果表明,E2E-GRec 在推荐指标上有所改进,例如相对停留时间增加了 0.133%,平均跳过的视频数量减少了 0.3171%。
A Reason-then-Describe Instruction Interpreter for Controllable Video Generation
Authors: Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua
First: 2025-11-25T17:59:07+00:00 · Latest: 2025-11-25T17:59:07+00:00
Comments: 27 pages, 13 figures, 13 tables, Project Page: https://sqwu.top/ReaDe/
Abstract
Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.
中文标题/摘要
标题:一种用于可控视频生成的先推理后描述指令解释器
扩散变换器显著提高了视频保真度和时间连贯性,然而实际可控性仍然有限。简洁、模糊且组成上复杂的用户输入与训练中使用的详细提示形成对比,导致意图与输出不匹配。我们提出了一种通用的、模型无关的解释器ReaDe,它可以将原始指令转换为精确的、可操作的规范,供下游视频生成器使用。ReaDe 遵循先推理后描述的范式:首先分析用户请求以识别核心要求并解决模糊性,然后生成详细的指导,以实现忠实且可控的生成。我们通过两阶段优化训练ReaDe:(i) 增强推理监督赋予分析解析和逐步追踪以及密集描述,(ii) 多维度奖励分配器实现自然风格描述的稳定、反馈驱动的改进。在单条件和多条件场景中的实验显示,在指令保真度、描述准确性以及下游视频质量方面的一致改进,并且在推理密集和未见过的输入上有强大的泛化能力。ReaDe 提供了一条将可控视频生成与准确解释用户意图相一致的实用途径。项目页面:https://sqwu.top/ReaDe/
Summary / 总结
The paper addresses the challenge of practical controllability in video generation by proposing ReaDe, a model-agnostic interpreter that converts ambiguous user instructions into precise specifications. ReaDe uses a reason-then-describe approach to first analyze user requests and then generate detailed guidance for video generation. Experiments show consistent improvements in instruction fidelity, caption accuracy, and video quality, with strong generalization to unseen inputs.
论文解决了视频生成中用户输入与详细训练提示不匹配的问题,提出了一个模型通用的解释器ReaDe,将原始指令转化为视频生成的具体规范。ReaDe采用先解释后描述的方法,首先分析用户请求,然后提供详细的指导。实验结果显示,在指令准确度、描述准确性和视频质量方面均有提升,并且在处理复杂推理和未见过的输入时表现出良好的泛化能力,表明ReaDe能够有效将用户意图与生成的视频对齐。
PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding
Authors: Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo
First: 2025-11-25T17:59:04+00:00 · Latest: 2025-11-25T17:59:04+00:00
Abstract
While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.
中文标题/摘要
标题:PhysChoreo:基于部分意识语义接地的物理可控视频生成
尽管近期的视频生成模型在视觉保真度方面取得了显著进展,但它们往往缺乏明确的物理可控性和合理性。为解决这一问题,一些近期的研究尝试通过基于物理的渲染来引导视频生成。然而,这些方法在准确建模复杂物理属性以及在长时间序列中有效控制物理行为方面存在固有的挑战。在本文中,我们提出了一种名为PhysChoreo的新框架,可以从单张图像生成具有多样可控性和物理真实感的视频。我们的方法分为两个阶段:首先,通过部分意识物理属性重构估计图像中所有物体的静态初始物理属性;然后,通过时间指令和物理可编辑的模拟,合成具有丰富动态行为和物理真实感的高质量视频。实验结果表明,PhysChoreo能够生成具有丰富行为和物理真实感的视频,在多个评估指标上优于现有最先进的方法。
Summary / 总结
The research aims to improve the physical controllability and plausibility of video generation by addressing the limitations of existing methods. PhysChoreo uses a two-stage process: first, it reconstructs the physical properties of objects in an image, and then it simulates the video with rich dynamic behaviors and physical realism through temporally instructed and physically editable simulation. The results demonstrate that PhysChoreo outperforms current state-of-the-art methods in generating videos with diverse controllability and physical realism.
研究旨在通过解决现有方法的局限性,增强视频生成中的物理可控性和真实性。提出的PhysChoreo框架首先在图像中重建物体的物理属性,然后使用时间指令和物理可编辑的模拟生成具有丰富行为和物理真实性的高质量视频。实验结果表明,PhysChoreo在生成具有丰富行为和物理真实性的视频方面优于现有方法。
Spatio-Temporal Hierarchical Causal Models
Authors: Xintong Li, Haoran Zhang, Xiao Zhou
First: 2025-11-25T17:56:43+00:00 · Latest: 2025-11-25T17:56:43+00:00
Abstract
The abundance of fine-grained spatio-temporal data, such as traffic sensor networks, offers vast opportunities for scientific discovery. However, inferring causal relationships from such observational data remains challenging, particularly due to unobserved confounders that are specific to units (e.g., geographical locations) yet influence outcomes over time. Most existing methods for spatio-temporal causal inference assume that all confounders are observed, an assumption that is often violated in practice. In this paper, we introduce Spatio-Temporal Hierarchical Causal Models (ST-HCMs), a novel graphical framework that extends hierarchical causal modeling to the spatio-temporal domain. At the core of our approach is the Spatio-Temporal Collapse Theorem, which shows that a complex ST-HCM converges to a simpler flat causal model as the amount of subunit data increases. This theoretical result enables a general procedure for causal identification, allowing ST-HCMs to recover causal effects even in the presence of unobserved, time-invariant unit-level confounders, a scenario where standard non-hierarchical models fail. We validate the effectiveness of our framework on both synthetic and real-world datasets, demonstrating its potential for robust causal inference in complex dynamic systems.
中文标题/摘要
标题:空间-时间层次因果模型
丰富的细粒度空间-时间数据,如交通传感器网络,为科学研究提供了巨大的机会。然而,从这种观察数据中推断因果关系仍然具有挑战性,特别是在由于特定于单元(例如地理位置)但随时间影响结果的未观察到的混杂因素方面。现有的大多数空间-时间因果推断方法假设所有混杂因素都是观察到的,而在实践中这一假设经常被违反。在本文中,我们引入了空间-时间层次因果模型(ST-HCMs),这是一种新的图形框架,将层次因果建模扩展到空间-时间领域。我们方法的核心是空间-时间坍缩定理,该定理表明,随着子单元数据量的增加,复杂的ST-HCM收敛于一个更简单的平面因果模型。这一理论结果使因果识别成为可能,即使在存在未观察到的时间不变单元级混杂因素的情况下,ST-HCMs也能恢复因果效应,而标准的非层次模型在这种情况下会失败。我们在合成数据集和真实世界数据集上验证了我们框架的有效性,展示了其在复杂动态系统中进行稳健因果推断的潜力。
Summary / 总结
This paper addresses the challenge of inferring causal relationships from spatio-temporal data, such as traffic sensor networks, by introducing Spatio-Temporal Hierarchical Causal Models (ST-HCMs). The method leverages a Spatio-Temporal Collapse Theorem to simplify complex models as more subunit data is available, enabling causal identification even with unobserved, time-invariant unit-level confounders. Experiments on synthetic and real-world datasets show that ST-HCMs can robustly recover causal effects in dynamic systems where standard models fail.
论文提出了时空层次因果模型(ST-HCMs),以解决从时空数据中推断因果关系的挑战,特别是在存在未观察到的混杂因素时。关键方法是时空坍缩定理,该定理随着子单元数据的增加,将复杂的ST-HCM简化为扁平的因果模型。该模型能够在标准非层次模型无法处理的情况下,成功恢复因果效应,通过合成数据和真实世界数据集的验证得到了证实。
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
Venue: NeurIPS 2025
First: 2025-09-21T17:53:30+00:00 · Latest: 2025-11-25T17:49:27+00:00
Comments: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM); update with trials on Gemini 3 Pro
Abstract
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
中文标题/摘要
标题:FlagEval 发现报告:大型推理模型在自动可验证文本和视觉问题上的初步评估
我们进行了一项中等规模的无污染评估,对当前的大型推理模型(LRMs)进行了初步发现。我们还发布了 ROME,这是一个用于视觉语言模型的评估基准,旨在测试从视觉线索中进行推理的能力。更多基准、评估数据和其他更新请参见:https://flageval-baai.github.io/LRM-Eval/
Summary / 总结
This study reports a moderate-scale, partly contamination-free evaluation of current large reasoning models (LRMs) on automatically verifiable textual and visual questions, and releases ROME, a benchmark for testing how vision language models reason from visual clues. The findings are described as preliminary; the benchmark, evaluation data, and updates are linked from the project homepage.
研究以无污染的方式评估了当前的大规模推理模型(LRMs),并引入了用于测试从视觉线索进行推理的基准测试ROME。主要发现包括对LRMs在自动可验证的文本和视觉问题上的表现的初步见解。
Time-Domain Linear Model-based Framework for Passive Acoustic Mapping of Cavitation Activity
Authors: Tatiana Gelvez-Barrera, Barbara Nicolas, Denis Kouamé, Bruno Gilles, Adrian Basarab
First: 2025-11-25T17:48:04+00:00 · Latest: 2025-11-25T17:48:04+00:00
Abstract
Passive acoustic mapping enables the spatial mapping and temporal monitoring of cavitation activity, playing a crucial role in therapeutic ultrasound applications. Most conventional beamforming methods, whether implemented in the time or frequency domains, suffer from limited axial resolution due to the absence of a reference emission onset time. While frequency-domain methods, the most efficient of which are based on the cross-spectral matrix, require long signals for accurate estimation, time-domain methods typically achieve lower spatial resolution. To address these limitations, we propose a linear model-based beamforming framework fully formulated in the time domain. The linear forward model relates a discretized spatiotemporal distribution of cavitation activity to the temporal signals recorded by a probe, explicitly accounting for time-of-flight delays dictated by the acquisition geometry. This model is then inverted using regularization techniques that exploit prior knowledge of cavitation activity in both spatial and temporal domains. Experimental results show that the proposed framework achieves enhanced or competitive cavitation map quality while using only 20% of the data typically required by frequency-domain methods. This highlights the substantial gain in data efficiency and the flexibility of our spatiotemporal regularization to adapt to diverse passive cavitation scenarios, outperforming state-of-the-art techniques.
中文标题/摘要
标题:基于时域线性模型的被动声学映射空化活动框架
被动声学映射能够实现空化活动的空间映射和时间监测,在治疗性超声应用中发挥着重要作用。大多数传统的波束形成方法,无论是实施在时域还是频域,都因缺乏参考发射起始时间而限制了轴向分辨率。虽然频域方法中最有效的方法基于交叉谱矩阵,需要长信号才能进行准确估计,但时域方法通常实现较低的空间分辨率。为了解决这些限制,我们提出了一种完全在时域中构建的基于线性模型的波束形成框架。该线性前向模型将空化活动的空间-时间分布与探头记录的时域信号相关联,并明确考虑了由采集几何结构决定的时间飞行延迟。然后使用利用空间和时间域中空化活动先验知识的正则化技术对该模型进行反演。实验结果表明,所提出的框架在使用仅需频域方法所需数据量的20%的情况下,实现了增强或竞争性的空化图质量。这突显了在数据效率方面的显著增益以及我们的时间-空间正则化在适应各种被动空化场景方面的灵活性,超越了现有最先进的技术。
Summary / 总结
The paper proposes a beamforming framework for passive acoustic mapping of cavitation activity that is formulated entirely in the time domain, targeting the limited axial resolution of conventional beamformers and the long signals required by frequency-domain methods. A linear forward model relates a discretized spatiotemporal distribution of cavitation activity to the recorded probe signals, accounting for time-of-flight delays, and is inverted with regularization that exploits spatial and temporal priors. Experiments show enhanced or competitive cavitation map quality using only 20% of the data typically required by frequency-domain methods.
研究旨在改进治疗超声应用中气泡活动的空间映射和时间监测。提出了一种基于时间域线性模型的波束形成框架,以克服传统方法的低轴向分辨率和需要长时间信号才能准确估计的问题。该框架使用一个考虑时间飞行延迟的线性前向模型,并通过利用空间和时间域的先验知识进行正则化反演。实验结果表明,所提出的方法仅使用频率域方法所需数据的20%就能获得高质量的气泡图,突显了其数据效率和对各种气泡场景的适应性。
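Illustrative sketch (not from the paper): the time-domain linear forward model described in the passive acoustic mapping abstract above can be pictured as a delay matrix that maps source activity at each pixel and emission time to sensor samples shifted by the time of flight. The sketch assumes unit-amplitude propagation without geometric spreading, delays rounded to whole samples, and plain Tikhonov (ridge) regularization swapped in for the spatial and temporal priors the authors mention. The names `build_forward_matrix` and `tikhonov_invert`, and all numeric values, are hypothetical.

```python
import numpy as np

def build_forward_matrix(sensor_pos, pixel_pos, n_src, c, fs):
    """Build A such that y = A @ x, with x the stacked source activity per pixel.

    sensor_pos: (M, 2) sensor coordinates in metres
    pixel_pos:  (P, 2) candidate source coordinates in metres
    n_src:      number of emission-time samples per pixel
    c, fs:      speed of sound (m/s) and sampling frequency (Hz)
    """
    dist = np.linalg.norm(sensor_pos[:, None, :] - pixel_pos[None, :, :], axis=2)
    delays = np.rint(dist / c * fs).astype(int)      # (M, P) time of flight in samples
    n_rec = n_src + int(delays.max())                # record long enough to avoid truncation
    n_sensors, n_pixels = delays.shape
    A = np.zeros((n_sensors * n_rec, n_pixels * n_src))
    for m in range(n_sensors):
        for p in range(n_pixels):
            for t in range(n_src):
                # Activity emitted at pixel p, time t, reaches sensor m at t + delay.
                A[m * n_rec + t + delays[m, p], p * n_src + t] = 1.0
    return A, delays, n_rec

def tikhonov_invert(A, y, lam):
    """Solve min_x ||A x - y||^2 + lam * ||x||^2 via the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Hypothetical geometry: three sensors on a line, two pixels at different depths.
sensors = np.array([[-0.01, 0.0], [0.0, 0.0], [0.01, 0.0]])
pixels = np.array([[0.0, 0.02], [0.0, 0.03]])
A, delays, n_rec = build_forward_matrix(sensors, pixels, n_src=4, c=1500.0, fs=1e6)

x_true = np.zeros(A.shape[1])
x_true[1 * 4 + 2] = 1.0                              # pixel 1 emits at sample 2
y = A @ x_true                                       # simulated sensor recordings
x_hat = tikhonov_invert(A, y, lam=1e-3)
cavitation_map = (x_hat.reshape(len(pixels), 4) ** 2).sum(axis=1)  # energy per pixel
```

Summing squared activity over emission time gives one value per pixel, which is one simple way to turn the recovered spatiotemporal activity into a map; the choice of regularizer is where a real method would encode its priors.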
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
First: 2025-11-25T17:47:11+00:00 · Latest: 2025-11-25T17:47:11+00:00
Abstract
Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
中文标题/摘要
标题:Flash-DMD:高效快速的图像生成框架,结合高效蒸馏和联合强化学习
扩散模型已经成为了生成模型中的领先类别,但其迭代采样过程仍然非常耗时。时间步长蒸馏是一种加速生成过程的有前途的技术,但它通常需要大量的训练,并导致图像质量下降。此外,使用强化学习(RL)对这些蒸馏模型进行微调以满足特定目标,如美学吸引力或用户偏好,通常非常不稳定,容易陷入奖励作弊。在本文中,我们提出了Flash-DMD,这是一种新颖的框架,能够通过蒸馏和联合RL优化实现快速收敛。具体来说,我们首先提出了一种高效的时间步长感知蒸馏策略,该策略显著降低了训练成本并增强了现实感,仅使用DMD2训练成本的2.1%就取得了更好的效果。其次,我们引入了一种联合训练方案,其中模型在使用RL目标进行微调的同时,时间步长蒸馏训练继续进行。我们证明,持续蒸馏提供的稳定且明确的损失作为强大的正则化器,有效地稳定了RL训练过程并防止了策略崩溃。在基于分数和流匹配模型的广泛实验中,我们提出的Flash-DMD不仅收敛速度显著加快,而且在少量步骤采样的情况下实现了最先进的生成质量,在视觉质量、人类偏好和文本-图像对齐指标方面均优于现有方法。我们的工作为训练高效、高保真和稳定的生成模型提供了一种有效的范式。代码即将发布。
Summary / 总结
Flash-DMD is a framework that combines efficient timestep-aware distillation with joint RL-based refinement for few-step image generation. Its distillation strategy outperforms DMD2 at only 2.1% of the training cost, and its joint training scheme keeps the distillation loss active during RL fine-tuning, where it acts as a regularizer that stabilizes training and prevents policy collapse. Experiments on score-based and flow matching models show faster convergence and state-of-the-art few-step generation quality in visual quality, human preference, and text-image alignment metrics.
Flash-DMD 是一种新型框架,通过高效的 timestep 蒸馏和联合 RL 精炼来加速扩散模型的生成过程。它引入了一种高效的蒸馏策略,既能减少训练成本又能保持图像质量,并且采用联合训练方案来稳定 RL 训练并防止策略崩溃。实验表明,Flash-DMD 不仅收敛速度更快,而且在视觉质量和文本-图像对齐指标上也优于现有方法。
New York Smells: A Large Multimodal Dataset for Olfaction
Authors: Ege Ozguroglu, Junbang Liang, Ruoshi Liu, Mia Chiquier, Michael DeTienne, Wesley Wei Qian, Alexandra Horowitz, Andrew Owens, Carl Vondrick
First: 2025-11-25T17:44:50+00:00 · Latest: 2025-11-25T17:44:50+00:00
Comments: Project website at https://smell.cs.columbia.edu
Abstract
While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured "in the wild." Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70× more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
中文标题/摘要
标题:纽约的气味:一种大型多模态嗅觉数据集
虽然嗅觉是动物感知世界的关键,但这种丰富的化学感知模态对机器来说仍然难以触及。一个主要瓶颈是缺乏在自然环境中收集的多样化的多模态嗅觉训练数据。我们介绍了纽约的气味,这是一个在野外捕获的图像和嗅觉信号配对的大规模数据集。我们的数据集包含来自室内和室外环境的3500个不同物体的7000个嗅觉-图像配对,比现有嗅觉数据集中的物体多约70倍。我们的基准测试包含三个任务:跨模态嗅觉到图像检索、仅从嗅觉识别场景、物体和材料、以及对草类进行精细区分。通过在我们数据集上的实验,我们发现视觉数据能够促进跨模态嗅觉表示学习,且我们学习到的嗅觉表示优于广泛使用的手工特征。
Summary / 总结
The paper introduces New York Smells, a large multimodal dataset of paired image and olfactory signals collected in natural settings, to address the lack of diverse olfactory training data. The dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, approximately 70× more objects than existing olfactory datasets. The benchmark covers three tasks: cross-modal smell-to-image retrieval; recognition of scenes, objects, and materials from smell alone; and fine-grained discrimination between grass species. Experiments show that visual data enables cross-modal olfactory representation learning and that the learned representations outperform widely used hand-crafted features.
研究旨在通过引入包含7000个嗅觉-图像配对的New York Smells数据集,解决机器学习中缺乏多样嗅觉数据的问题。该数据集涵盖了3500个物体的室内和室外环境下的嗅觉-图像配对,显著扩展了现有的嗅觉数据集。研究评估了三项任务:跨模态嗅觉到图像检索、仅凭嗅觉识别场景、物体和材料、以及草本植物种类区分。结果表明,视觉数据有助于学习嗅觉表示,且学习到的表示优于传统的手工特征。
Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media
Authors: Alhasan Abdellatif, Hannah P. Menke, Ahmed H. Elsheikh, Florian Doster, Kamaljit Singh
First: 2025-11-25T17:44:28+00:00 · Latest: 2025-11-25T17:44:28+00:00
Abstract
The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency components. While UFNO improves predictive accuracy over FNO, it inefficiently treats scalar inputs (e.g., temperature, injection rate) as spatially distributed fields by duplicating their values across the domain. This forces the model to process redundant constant signals within the frequency domain. Additionally, its standard loss function does not account for spatial variations in error sensitivity, limiting performance in regions of high physical importance. We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation (FiLM) layer, allowing the model to modulate spatial feature maps without introducing constant signals into the Fourier transform. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions. Our experiments on subsurface multiphase flow demonstrate a 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, highlighting the effectiveness of our approach in improving predictive accuracy.
中文标题/摘要
标题:特征调制UFNO以提高多相流在多孔介质中预测的准确性
UNet增强的傅里叶神经算子(UFNO)通过引入并行UNet路径扩展了傅里叶神经算子(FNO),使其能够保留高频和低频分量。尽管UFNO在预测准确性上优于FNO,但它以复制值的方式将标量输入(例如,温度、注入率)视为空间分布场,这导致模型在频域中处理冗余的常数信号。此外,其标准损失函数没有考虑到空间误差敏感性的变化,限制了在物理重要区域的性能。我们引入了UFNO-FiLM,这是一种增强的架构,包含两个关键创新。首先,我们使用特征值线性调制(FiLM)层将标量输入与空间特征解耦,使模型能够调节空间特征图而不引入常数信号到傅里叶变换中。其次,我们采用空间加权损失函数,优先在关键区域学习。我们在地下多相流实验中展示了与UFNO相比,气相饱和度的平均绝对误差(MAE)降低了21%,突显了我们方法在提高预测准确性方面的有效性。
Summary / 总结
The research aims to improve the prediction of multiphase flow in porous media by enhancing the UFNO model. It introduces UFNO-FiLM, which uses a Feature-wise Linear Modulation (FiLM) layer to decouple scalar inputs from spatial features and a spatially weighted loss function to focus on critical regions. Experiments show a 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, demonstrating improved predictive accuracy.
研究旨在通过改进UFNO模型来提高多相流在多孔介质中的预测精度,解决UFNO模型中存在的问题。UFNO-FiLM引入了Feature-wise Linear Modulation (FiLM)层来解耦标量输入和空间特征,并使用空间加权损失函数优先在关键区域学习。实验结果显示,与UFNO相比,气体饱和度的Mean Absolute Error (MAE)降低了21%,证明了该方法的有效性。
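Illustrative sketch (not the authors' code): the two ideas named in the UFNO-FiLM entry above, feature-wise linear modulation of spatial feature maps by scalar inputs and a spatially weighted error, can be written in a few lines of NumPy. The linear maps producing the scale and shift, the function names `film_modulate` and `weighted_mae`, and the weight map are all assumptions made for illustration; the paper's actual layer placement and weighting scheme are not specified in the abstract.

```python
import numpy as np

def film_modulate(features, scalars, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel using values predicted from scalars.

    features: (batch, channels, height, width) spatial feature maps
    scalars:  (batch, n_scalars) conditioning inputs, e.g. temperature, injection rate
    w_gamma, w_beta: (n_scalars, channels) hypothetical linear maps
    b_gamma, b_beta: (channels,) hypothetical biases
    """
    gamma = scalars @ w_gamma + b_gamma              # (batch, channels) per-channel scale
    beta = scalars @ w_beta + b_beta                 # (batch, channels) per-channel shift
    # Broadcast over the spatial axes; the scalars never become constant input fields.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

def weighted_mae(pred, target, weights):
    """Absolute error averaged with a per-pixel weight map of shape (height, width)."""
    err = np.abs(pred - target)
    w = np.broadcast_to(weights, err.shape)
    return float(np.sum(w * err) / np.sum(w))

# Hypothetical usage with random values.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 3, 4, 4))
conds = rng.normal(size=(2, 2))
out = film_modulate(
    feats, conds,
    w_gamma=rng.normal(size=(2, 3)), b_gamma=np.ones(3),
    w_beta=rng.normal(size=(2, 3)), b_beta=np.zeros(3),
)
weights = np.ones((4, 4))
weights[:2, :] = 5.0                                 # emphasise a hypothetical critical region
loss = weighted_mae(out[:, 0], feats[:, 0], weights)
```

With a uniform weight map `weighted_mae` reduces to the ordinary MAE, so the weighting only changes which regions dominate the average.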