arXiv 论文速递

CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback

Authors: Wenhang Ge, Guibao Shen, Jiawei Feng, Luozhou Wang, Hao Lu, Xingye Tian, Xin Tao, Ying-Cong Chen

First: 2026-01-22T18:59:56+00:00 · Latest: 2026-01-22T18:59:56+00:00

Abstract

Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment. However, the camera controllability still remains limited. In this work, we build upon Reward Feedback Learning and aim to further improve camera controllability. However, directly borrowing existing ReFL approaches faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latent into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization. Specifically, video latent along with the camera pose are decoded into 3D Gaussians. In this process, the camera pose not only acts as input, but also serves as a projection parameter. Misalignment between the video latent and camera pose will cause geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we explicitly optimize pixel-level consistency between the rendered novel views and ground-truth ones as reward. To accommodate the stochastic nature, we further introduce a visibility term that selectively supervises only deterministic regions derived via geometric warping. Extensive experiments conducted on RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of our proposed method. Project page: https://a-bigbao.github.io/CamPilot/.
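The reward described in the abstract above (pixel-level consistency between rendered and ground-truth views, restricted by a visibility term) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_pixel_reward`, the use of negative mean squared error, and the boolean mask standing in for the warping-derived visibility are all assumptions made for illustration.

```python
import numpy as np

def masked_pixel_reward(rendered, target, visibility):
    """Negative mean squared error over visible pixels only.

    rendered, target: float arrays of shape (H, W, 3).
    visibility: bool array of shape (H, W); True marks pixels treated as
    deterministic (a hypothetical stand-in for a warping-derived mask).
    """
    mask = visibility[..., None].astype(rendered.dtype)  # (H, W, 1)
    sq_err = (rendered - target) ** 2 * mask             # zero out hidden pixels
    n = mask.sum() * rendered.shape[-1]                  # visible pixels x channels
    if n == 0:
        return 0.0
    return -float(sq_err.sum() / n)

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
rendered = target.copy()
rendered[0, 0] += 0.5                    # corrupt a single pixel

visible_all = np.ones((4, 4), dtype=bool)
visible_skip = visible_all.copy()
visible_skip[0, 0] = False               # treat the corrupted pixel as non-deterministic

reward_all = masked_pixel_reward(rendered, target, visible_all)
reward_skip = masked_pixel_reward(rendered, target, visible_skip)
```

With the corrupted pixel masked out, the reward is no longer penalized by it, which is the role the abstract assigns to the visibility term.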

中文标题/摘要

标题：CamPilot：通过高效相机奖励反馈提高视频扩散模型中的相机控制

近期在相机控制视频扩散模型方面的进展显著提高了视频与相机的对齐。然而，相机的可控性仍然有限。在本工作中，我们基于奖励反馈学习，旨在进一步提高相机的可控性。然而，直接借用现有的奖励反馈学习（ReFL）方法面临几个挑战。首先，当前的奖励模型缺乏评估视频与相机对齐的能力。其次，将潜在变量解码为RGB视频以进行奖励计算引入了大量计算开销。第三，视频解码过程中通常忽略了3D几何信息。为解决这些限制，我们引入了一种高效的相机感知3D解码器，将视频潜在变量解码为3D表示以进行奖励量化。具体来说，视频潜在变量与相机姿态一起被解码为3D高斯分布。在这个过程中，相机姿态不仅作为输入，还作为投影参数。视频潜在变量与相机姿态之间的对齐不良会导致3D结构中的几何失真，从而产生模糊的渲染结果。基于这一特性，我们显式地优化渲染的新视角与真实视角之间的像素级一致性作为奖励。为了适应随机性，我们进一步引入了一个可见性项，仅监督通过几何变形得到的确定性区域。在RealEstate10K和WorldScore基准上的广泛实验表明了我们提出方法的有效性。项目页面：https://a-bigbao.github.io/CamPilot/。

Summary / 总结

The research aims to enhance camera control in video diffusion models by addressing limitations in current reward feedback learning approaches. The method introduces an efficient 3D decoder that decodes video latent into 3D representations, using camera pose as both input and projection parameter. This allows for explicit optimization of pixel-level consistency between rendered and ground-truth views as reward. Experiments on RealEstate10K and WorldScore benchmarks show the proposed method improves camera controllability and video-camera alignment effectively.

研究旨在通过解决现有奖励反馈学习方法的局限性，提升视频扩散模型中的相机可控性。引入了一种高效的相机感知3D解码器，将视频潜变量解码为3D表示，以量化奖励并优化渲染视图与真实视图的像素级一致性。在RealEstate10K和WorldScore基准上的实验表明，所提出的方法在提高相机可控性和视频-相机对齐方面具有有效性。

Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition

Authors: Geo Ahn, Inwoong Lee, Taeoh Kim, Minho Shim, Dongyoon Wee, Jinwoo Choi

First: 2026-01-22T18:59:13+00:00 · Latest: 2026-01-22T18:59:13+00:00

Comments: The code is available at https://github.com/KHU-VLL/RCORE

Abstract

We study Compositional Video Understanding (CVU), where models must recognize verbs and objects and compose them to generalize to unseen combinations. We find that existing Zero-Shot Compositional Action Recognition (ZS-CAR) models fail primarily due to an overlooked failure mode: object-driven verb shortcuts. Through systematic analysis, we show that this behavior arises from two intertwined factors: severe sparsity and skewness of compositional supervision, and the asymmetric learning difficulty between verbs and objects. As training progresses, the existing ZS-CAR model increasingly ignores visual evidence and overfits to co-occurrence statistics. Consequently, the existing model does not gain the benefit of compositional recognition in unseen verb-object compositions. To address this, we propose RCORE, a simple and effective framework that enforces temporally grounded verb learning. RCORE introduces (i) a composition-aware augmentation that diversifies verb-object combinations without corrupting motion cues, and (ii) a temporal order regularization loss that penalizes shortcut behaviors by explicitly modeling temporal structure. Across two benchmarks, Sth-com and our newly constructed EK100-com, RCORE significantly improves unseen composition accuracy, reduces reliance on co-occurrence bias, and achieves consistently positive compositional gaps. Our findings reveal object-driven shortcuts as a critical limiting factor in ZS-CAR and demonstrate that addressing them is essential for robust compositional video understanding.

中文标题/摘要

标题：为什么我打不开抽屉？缓解零样本组成动作识别中的对象驱动捷径

我们研究组成视频理解（CVU），其中模型必须识别动词和对象并组合它们以泛化到未见过的组合。我们发现现有的零样本组成动作识别（ZS-CAR）模型主要由于一个被忽视的失败模式而失败：对象驱动的动词捷径。通过系统的分析，我们表明这种行为源自两个交织的因素：组成监督的严重稀疏性和偏斜，以及动词和对象之间不对称的学习难度。随着训练的进行，现有的ZS-CAR模型越来越多地忽视视觉证据并过度拟合共现统计。因此，现有的模型在未见过的动词-对象组合中无法获得组成识别的好处。为了解决这个问题，我们提出了RCORE，这是一种简单而有效的框架，强制执行时间上一致的动词学习。RCORE引入了（i）一种组成感知的增强方法，它多样化了动词-对象组合而不破坏运动线索，以及（ii）一种时间顺序正则化损失，通过明确建模时间结构来惩罚捷径行为。在两个基准Sth-com和我们新构建的EK100-com上，RCORE显著提高了未见过的组合准确性，减少了对共现偏差的依赖，并实现了一致的正组成差距。我们的研究结果揭示了对象驱动的捷径是ZS-CAR中的一个关键限制因素，并证明解决这些问题对于稳健的组成视频理解是必不可少的。

Summary / 总结

The paper addresses the issue of object-driven verb shortcuts in Zero-Shot Compositional Action Recognition (ZS-CAR) models, which fail to generalize to unseen verb-object combinations. The authors propose RCORE, a framework that includes a composition-aware augmentation and a temporal order regularization loss to mitigate these shortcuts. Experiments on Sth-com and EK100-com benchmarks show that RCORE improves unseen composition accuracy and reduces reliance on co-occurrence bias, achieving consistent positive compositional gaps.

研究通过识别零样本组成动作识别（ZS-CAR）模型中的一个失败模式——对象驱动的动词捷径，来解决组成视频理解（CVU）的挑战。作者提出了RCORE框架，该框架引入了组成感知增强和时间顺序正则化损失来缓解这一问题。RCORE在两个基准Sth-com和EK100-com上提高了未见过的组成准确性，减少了对共现偏差的依赖，展示了其在稳健组成视频理解中的有效性。

PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

Authors: Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou

First: 2026-01-22T18:58:55+00:00 · Latest: 2026-01-22T18:58:55+00:00

Abstract

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.
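To make the idea of discretizing encoder features with a binary codebook more concrete, here is a generic sign-based quantization sketch. It is an assumption that this resembles the binary codes mentioned in the abstract; it is not the LaPQ module, it has no language guidance, and the function name `binary_quantize` is hypothetical.

```python
import numpy as np

def binary_quantize(features):
    """Map each D-dimensional feature vector to a {-1, +1} code and a token id.

    The token id interprets the D sign bits as a binary number, so an
    implicit codebook of size 2**D never has to be stored explicitly.
    """
    bits = (features > 0).astype(np.int64)        # (N, D) entries in {0, 1}
    codes = 2 * bits - 1                          # (N, D) entries in {-1, +1}
    weights = 1 << np.arange(features.shape[1])   # bit i carries weight 2**i
    token_ids = bits @ weights                    # (N,) integer token ids
    return codes, token_ids

feats = np.array([[0.3, -1.2, 0.7],
                  [-0.1, 0.4, -0.9]])
codes, ids = binary_quantize(feats)
```

In a pyramidal setting, the same function could be applied to features taken at several encoder depths, which is one way to read "a shared large binary codebook".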

中文标题/摘要

标题：PyraTok：语言对齐的分层Tokenizer用于视频理解和生成

离散视频VAEs是现代文本到视频生成和视频理解系统的基石，但现有的Tokenizer通常在单尺度上学习视觉码本，词汇量有限且语言监督浅，导致跨模态对齐差和零样本迁移效果差。我们提出了PyraTok，一种语言对齐的分层Tokenizer，能够在多个时空分辨率上学习语义结构化的离散潜在变量。PyraTok基于一个预训练的视频VAE和一个新颖的语言对齐分层量化（LaPQ）模块，使用共享的大二进制码本在多个深度上离散化编码特征，产生紧凑且表达能力强的视频Token序列。为了紧密耦合视觉Token与语言，PyraTok联合优化多尺度文本引导量化和Token层次上的全局自回归目标。在十个基准测试中，PyraTok在视频重构上达到最佳性能，一致提高文本到视频质量，并在视频分割、时序动作定位和视频理解上设置新的零样本最佳性能，能够稳健扩展到4K/8K分辨率。

Summary / 总结

PyraTok is a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions, improving cross-modal alignment and zero-shot transfer in video understanding and generation. It uses a Language-aligned Pyramidal Quantization (LaPQ) module to discretize encoder features at several depths with a shared large binary codebook, and jointly optimizes multi-scale text-guided quantization and a global autoregressive objective. PyraTok achieves state-of-the-art performance on video reconstruction, text-to-video quality, and zero-shot video segmentation, temporal action localization, and understanding benchmarks, scaling well to high resolutions up to 8K.

PyraTok 是一种语言对齐的分层 tokenizer，能够在多个时空分辨率下学习语义结构化的离散潜变量，从而改善跨模态对齐和零样本迁移。它使用 Language-aligned Pyramidal Quantization (LaPQ) 模块在多个深度上使用共享的大二进制码本对编码特征进行离散化，并联合优化多尺度文本引导量化和全局自回归目标。PyraTok 在视频重建、文本到视频质量、零样本视频分割、时序动作定位和理解基准测试中均达到最佳性能，并且能够扩展到高达 8K 的高分辨率。

CropCraft: Complete Structural Characterization of Crop Plants From Images

Authors: Albert J. Zhai, Xinlei Wang, Kaiyuan Li, Zhao Jiang, Junxiong Zhou, Sheng Wang, Zhenong Jin, Kaiyu Guan, Shenlong Wang

First: 2024-11-14T18:58:02+00:00 · Latest: 2026-01-22T18:58:18+00:00

Comments: 3DV 2026 (Oral). Project page: https://ajzhai.github.io/CropCraft

Abstract

The ability to automatically build 3D digital twins of plants from images has countless applications in agriculture, environmental science, robotics, and other fields. However, current 3D reconstruction methods fail to recover complete shapes of plants due to heavy occlusion and complex geometries. In this work, we present a novel method for 3D modeling of agricultural crops based on optimizing a parametric model of plant morphology via inverse procedural modeling. Our method first estimates depth maps by fitting a neural radiance field and then optimizes a specialized loss to estimate morphological parameters that result in consistent depth renderings. The resulting 3D model is complete and biologically plausible. We validate our method on a dataset of real images of agricultural fields, and demonstrate that the reconstructed canopies can be used for a variety of monitoring and simulation applications.
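The inverse procedural modeling step in the abstract above (choosing morphological parameters so that rendered depth agrees with estimated depth) can be shown on a toy problem. Everything here is an illustrative assumption: the one-parameter "plant", the 1D depth scan, the L1 loss, and the grid search are stand-ins, not the paper's parametric model or its specialized loss.

```python
import numpy as np

def render_depth(height, camera_z=3.0, width=8):
    """Toy procedural model: a plant row of the given height occupies the
    middle half of a 1D top-down depth scan; the rest is ground."""
    depth = np.full(width, camera_z)
    depth[width // 4 : 3 * width // 4] = camera_z - height
    return depth

def fit_height(observed_depth, candidates):
    """Pick the candidate height whose rendered depth best matches the observation."""
    losses = [np.abs(render_depth(h) - observed_depth).mean() for h in candidates]
    return float(candidates[int(np.argmin(losses))])

observed = render_depth(1.2)               # pretend this came from a fitted radiance field
candidates = np.linspace(0.0, 2.0, 21)     # heights 0.0, 0.1, ..., 2.0
best = fit_height(observed, candidates)
```

Because the procedural model always produces a full plant, the recovered shape is complete even where the observation would be occluded in practice.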

中文标题/摘要

标题：CropCraft：从图像中全面结构化表征农作物

从图像自动构建植物的3D数字双胞胎在农业、环境科学、机器人学和其他领域具有无数应用。然而，当前的3D重建方法由于严重的遮挡和复杂的几何结构，无法恢复完整的植物形状。在本工作中，我们提出了一种基于逆过程建模优化植物形态参数化模型的新方法。该方法首先通过拟合神经辐射场估计深度图，然后优化特定损失以估计导致一致深度渲染的形态参数。生成的3D模型是完整的且生物学上合理。我们在包含农业田地真实图像的数据集上验证了该方法，并展示了重建的树冠可以用于各种监测和模拟应用。

Summary / 总结

The research aims to develop a method for automatically creating 3D digital twins of crop plants from images, which is crucial for various applications in agriculture and environmental science. The method uses inverse procedural modeling to optimize a parametric model of plant morphology by first estimating depth maps with a neural radiance field and then refining morphological parameters to ensure consistent depth renderings. The key experimental finding is that the resulting 3D models are complete and biologically plausible, validated on real agricultural field images, and suitable for monitoring and simulation applications.

研究旨在通过图像自动构建作物的3D数字孪生体，这对于农业和环境科学等多个领域至关重要。方法采用逆过程建模优化植物形态的参数化模型，首先使用神经辐射场估计深度图，然后优化形态参数以确保深度渲染的一致性。生成的3D模型完整且生物学上合理，已在实际农田图像上进行了验证，并展示了监测和模拟应用的潜力。

LLM-in-Sandbox Elicits General Agentic Intelligence

Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei

First: 2026-01-22T18:57:09+00:00 · Latest: 2026-01-22T18:57:09+00:00

Comments: Project Page: https://llm-in-sandbox.github.io

Abstract

We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
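The behaviors listed in the abstract above (using the file system for long contexts, executing scripts) can be pictured with a small harness that runs a model-written script next to some context files. This is a hypothetical sketch, not the released package: `run_in_sandbox` is an invented name, and a temporary directory plus a subprocess provides no real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_in_sandbox(script, files=None, timeout=10):
    """Run a Python script in a throwaway working directory and return its stdout.

    NOTE: this is an illustration only; a temp directory is not a security
    boundary, and a real sandbox would need container or VM isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        for name, content in (files or {}).items():
            Path(workdir, name).write_text(content)       # stage context files
        Path(workdir, "main.py").write_text(script)       # stage the script
        result = subprocess.run(
            [sys.executable, "main.py"],
            cwd=workdir, capture_output=True, text=True, timeout=timeout,
        )
    return result.stdout

# A long document is stored as a file and summarized by a script,
# instead of being placed in the prompt.
long_context = "line\n" * 1000
output = run_in_sandbox(
    "print(len(open('context.txt').read().splitlines()))",
    files={"context.txt": long_context},
)
```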

中文标题/摘要

标题：LLM-in-Sandbox 激发通用代理智能

我们介绍了 LLM-in-Sandbox，使大语言模型能够在代码沙盒（即虚拟计算机）中探索，以激发非代码领域的通用智能。我们首先展示了强大的大语言模型在无需额外训练的情况下，能够利用代码沙盒来执行非代码任务的一般化能力。例如，大语言模型自发地访问外部资源以获取新知识，利用文件系统处理长文本，并执行脚本以满足格式要求。我们进一步表明，通过仅使用非代理数据训练用于沙盒探索的模型的 LLM-in-Sandbox 强化学习（LLM-in-Sandbox-RL），这些代理能力可以得到增强。实验表明，无论是在无训练模式还是在后训练模式下，LLM-in-Sandbox 都能够实现涵盖数学、物理、化学、生物医学、长文本理解以及指令遵循的稳健泛化。最后，我们从计算和系统角度分析了 LLM-in-Sandbox 的效率，并将其开源为 Python 包，以促进其实用部署。

Summary / 总结

The research introduces LLM-in-Sandbox, which lets large language models (LLMs) explore a code sandbox (a virtual computer) to elicit general intelligence in non-code domains. Without additional training, strong LLMs already generalize to using the sandbox for non-code tasks, for example accessing external resources, using the file system to handle long contexts, and executing scripts to satisfy formatting requirements. LLM-in-Sandbox-RL further enhances these capabilities through reinforcement learning that uses only non-agentic data. Experiments show robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. The research also analyzes the efficiency of LLM-in-Sandbox from computational and system perspectives and open-sources it as a Python package for deployment.

研究引入了LLM-in-Sandbox，使大型语言模型（LLMs）能够在代码沙箱中探索，以在非代码领域激发通用智能。研究展示了强大的LLMs可以泛化并在非代码任务中使用沙箱，例如访问外部资源、处理长文本和执行脚本。该方法进一步通过仅使用非智能体（non-agentic）数据训练模型来增强这些能力，即LLM-in-Sandbox强化学习。实验表明，LLM-in-Sandbox在数学、物理、化学、生物医学等多个领域以及指令遵循方面实现了稳健的泛化。研究还从计算和系统角度评估了LLM-in-Sandbox的效率，并将其开源为Python包以促进实际部署。

Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing

Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang

First: 2026-01-22T18:52:21+00:00 · Latest: 2026-01-22T18:52:21+00:00

Comments: Under review

Abstract

Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90% to about 1%.
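The smoothing idea in the abstract above can be sketched as a Monte Carlo average of encoder features under Gaussian input noise, followed by a cosine-similarity comparison between clean and perturbed inputs. This is an assumption-laden illustration: the toy encoder, the noise level `sigma`, the sample count, and averaging unnormalized features are all choices made here, and the sketch computes an empirical similarity, not the certified bound (FCSB).

```python
import numpy as np

def smoothed_encoder(encoder, x, sigma=0.25, n_samples=256, seed=0):
    """Estimate E[encoder(x + eps)] with eps ~ N(0, sigma^2 I) by sampling."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    feats = np.stack([encoder(x + eps) for eps in noise])
    return feats.mean(axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy encoder standing in for an MLLM feature encoder.
W = np.array([[1.0, -2.0, 0.5],
              [0.3, 0.8, -1.0]])

def toy_encoder(x):
    return np.tanh(W @ x)

x_clean = np.array([0.5, -0.2, 0.1])
x_adv = x_clean + np.array([0.05, 0.05, -0.05])   # small l2-bounded perturbation

sim = cosine_similarity(
    smoothed_encoder(toy_encoder, x_clean),
    smoothed_encoder(toy_encoder, x_adv),
)
```

The quantity `sim` is the kind of clean-versus-adversarial feature similarity that the paper lower-bounds; the certification itself requires the theoretical analysis, which this sketch does not reproduce.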

中文标题/摘要

标题：多模态大型语言模型特征空间平滑的可验证鲁棒性

多模态大型语言模型（MLLMs）在多种应用中表现出强大的能力，但仍然容易受到使特征表示失真并导致错误预测的对抗性扰动的影响。为了解决这一脆弱性，我们提出了特征空间平滑（FS）并理论上证明了FS为MLLMs的特征表示提供了可验证的鲁棒性。具体而言，FS将任何特征编码器转换为一种平滑变体，该变体在$\ell_2$有界攻击下保证了干净和对抗性表示之间的特征余弦相似度的可验证下界。此外，我们表明，由FS导出的特征余弦相似度界（FCSB）的值可以通过提高原始编码器上定义的高斯鲁棒性得分来改善。在此基础上，我们引入了净化器和平滑映射器（PSM），这是一种即插即用模块，可以提高MLLMs的高斯鲁棒性得分，从而在不需对MLLMs进行任何重新训练的情况下增强其在FS下的可验证鲁棒性。我们证明，FS与PSM不仅提供了强大的理论鲁棒性保证，而且相比对抗训练表现出更优越的实证性能。在多种MLLMs和下游任务上的广泛实验表明了FS-PSM的有效性，它将各种白盒攻击的攻击成功率（ASR）从接近90%降低到约1%。

Summary / 总结

The paper addresses the vulnerability of multimodal large language models (MLLMs) to adversarial perturbations by proposing Feature-space Smoothing (FS), which ensures certified robustness on feature representations. FS transforms feature encoders into smoothed variants, maintaining a lower bound on feature cosine similarity under $\ell_2$-bounded attacks. The Purifier and Smoothness Mapper (PSM) further enhances the Gaussian robustness score, improving certified robustness without retraining. Experiments show that FS with PSM reduces the Attack Success Rate from nearly 90% to about 1% across various MLLMs and tasks.

论文通过提出特征空间平滑（FS）来解决多模态大型语言模型（MLLMs）对对抗扰动的脆弱性问题。FS通过将特征编码器转换为平滑变体，确保在$\ell_2$-有界攻击下特征余弦相似度的下限，从而提供认证鲁棒性。Purifier和Smoothness Mapper（PSM）进一步通过提高高斯鲁棒性得分来增强这种鲁棒性，使攻击成功率（ASR）从接近90%显著降低到约1%。理论证明和各种MLLMs及任务的实验结果验证了FS-PSM的有效性。

Training-Free Geospatial Place Representation Learning from Large-Scale Point-of-Interest Graph Data

Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle

First: 2025-06-25T15:10:31+00:00 · Latest: 2026-01-22T18:46:50+00:00

Abstract

Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POI) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a training-free geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments using the tasks of population density estimation and housing price prediction as downstream tasks show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.

中文标题/摘要

标题：无需训练的地理空间地点表示学习从大规模兴趣点图数据

学习有效的城市环境表示需要捕捉超越固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预先定义的行政区域中，如普查单位或邮政编码区域，并为每个区域分配一个单一的嵌入。然而，POI往往形成具有语义意义的群体，跨越、位于或超出这些边界，定义了更好地反映人类活动和城市功能的地点。为了解决这一局限性，我们提出了一种无需训练的地理空间表示学习方法PlaceRep，该方法通过聚类空间上和语义上相关的POI来构建地点级表示。PlaceRep从美国Foursquare数据中总结大规模POI图，生成通用的城市区域嵌入，同时自动识别跨多个空间尺度的地点。通过消除模型预训练，PlaceRep提供了一种可扩展且高效的多粒度地理空间分析解决方案。使用人口密度估计和房价预测作为下游任务的实验表明，PlaceRep在大多数基于图的地理空间表示学习方法中表现更优，并在生成大规模POI图的区域级表示时实现了高达100倍的速度提升。PlaceRep的实现可在https://github.com/mohammadhashemii/PlaceRep获取。

Summary / 总结

The research aims to develop a training-free method for learning geospatial place representations from large-scale POI graphs, addressing the limitations of existing approaches that aggregate POIs into fixed administrative regions. PlaceRep clusters spatially and semantically related POIs to generate place-level representations, which are evaluated on downstream tasks such as population density estimation and housing price prediction. Experiments show that PlaceRep outperforms most state-of-the-art graph-based methods and provides up to a 100x speedup in generating region-level representations on large-scale POI graphs.

PlaceRep 是一种无需训练的方法，通过聚类 POI 来创建地方级别的表示，捕捉跨越行政边界的语义上有意义的群体。它在人口密度估计和房价预测等任务中优于现有方法，并且在生成大规模 POI 图的区域级别表示时可提供高达 100 倍的速度提升。

360Anything: Geometry-Free Lifting of Images and Videos to 360°

Authors: Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena

First: 2026-01-22T18:45:59+00:00 · Latest: 2026-01-22T18:45:59+00:00

Comments: Project page: https://360anything.github.io/

Abstract

Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, obscuring the application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of the seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results in zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
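The seam issue mentioned in the abstract above comes from the fact that the left and right edges of an equirectangular image are neighbors on the sphere, while zero-padding treats them as image borders. The sketch below contrasts zero-padding with wrap-around padding along the longitude axis using `numpy.pad`. It only illustrates the padding difference; how the paper's Circular Latent Encoding is actually implemented inside the VAE is not specified here, and `pad_longitude` is a hypothetical helper.

```python
import numpy as np

def pad_longitude(erp, pad, mode):
    """Pad a (height, width) equirectangular map along the width (longitude) axis only."""
    return np.pad(erp, ((0, 0), (pad, pad)), mode=mode)

# Tiny 2 x 4 "panorama": columns 0 and 3 are adjacent on the sphere.
erp = np.arange(8, dtype=float).reshape(2, 4)

zero_padded = pad_longitude(erp, 1, "constant")   # borders see zeros
circ_padded = pad_longitude(erp, 1, "wrap")       # borders see the opposite edge
```

With `"wrap"`, a convolution at the left edge sees the true neighbor from the right edge, so both sides of the seam are processed with consistent context.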

中文标题/摘要

标题：360Anything：无需几何的图像和视频到360°提升

将视角图像和视频提升为360°全景图可以生成沉浸式的3D世界。现有方法通常依赖于视角和等效圆柱投影（ERP）空间之间的显式几何对齐。然而，这需要已知的相机元数据，这在野外数据中通常是缺失或噪声的。我们提出了360Anything，一个基于预训练扩散变换器的几何无关框架。通过将视角输入和全景目标简单地视为标记序列，360Anything以完全数据驱动的方式学习视角到等效圆柱投影的映射，消除了对相机信息的需求。我们的方法在图像和视频视角到360°生成方面均达到了最先进的性能，优于使用真实相机信息的先前工作。我们还追踪了ERP边界处接缝伪影的根本原因，归因于VAE编码器中的零填充，并引入了循环潜编码以促进无缝生成。最后，我们在零样本相机视场和方向估计基准测试中展示了竞争力的结果，证明了360Anything在计算机视觉任务中的深刻几何理解和更广泛的应用。更多结果请参见https://360anything.github.io/

Summary / 总结

360Anything is a geometry-free framework that uses pre-trained diffusion transformers to lift perspective images and videos to 360° panoramas. It eliminates the need for camera metadata by treating the inputs and outputs as token sequences, achieving state-of-the-art performance and outperforming previous methods that rely on ground-truth camera information. The approach also introduces Circular Latent Encoding to address seam artifacts and demonstrates strong zero-shot camera FoV and orientation estimation capabilities.

研究旨在开发一种无需显式几何对齐即可将视角图像和视频转换为360°全景图的方法，这在实际场景中往往难以实现。提出的360Anything框架利用预训练的扩散变换器从数据中学习视角到球面投影的映射。该方法超越了依赖相机元数据的先前方法，并引入了循环潜编码以解决接缝伪影问题，实现了图像和视频转换任务的最先进结果。此外，该方法在估计相机视野和方向方面表现出色，表明其在计算机视觉任务中的更广泛用途。

SciArena: An Open Evaluation Platform for Non-Verifiable Scientific Literature-Grounded Tasks

Authors: Yilun Zhao, Kaiyan Zhang, Tiansheng Hu, Sihong Wu, Ronan Le Bras, Charles McGrady, Taira Anderson, Jonathan Bragg, Joseph Chee Chang, Jesse Dodge, Matt Latzke, Yixin Liu, Xiangru Tang, Zihang Wang, Chen Zhao, Hannaneh Hajishirzi, Doug Downey, Arman Cohan

Venue: NeurIPS 2025 Spotlight

First: 2025-07-01T17:51:59+00:00 · Latest: 2026-01-22T18:32:06+00:00

Comments: NeurIPS 2025 Datasets & Benchmarks Track (Spotlight)

Abstract

We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature-grounded tasks. Unlike traditional benchmarks for scientific literature understanding and synthesis, SciArena engages the research community directly, following the Chatbot Arena evaluation approach of community voting on model comparisons. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks that demand literature-grounded, long-form responses. The platform currently supports 47 foundation models and has collected over 20,000 votes from human researchers across diverse scientific domains. Our analysis of the data collected so far confirms its high quality. We discuss the results and insights based on the model ranking leaderboard. To further promote research in building model-based automated evaluation systems for literature tasks, we release SciArena-Eval, a meta-evaluation benchmark based on collected preference data. It measures the accuracy of models in judging answer quality by comparing their pairwise assessments with human votes. Our experiments highlight the benchmark's challenges and emphasize the need for more reliable automated evaluation methods.
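The meta-evaluation described in the abstract above (comparing a model's pairwise assessments with human votes) reduces to an agreement rate. The following sketch shows that computation on made-up votes; the label format ("A" or "B" for the preferred response) and the function name `judge_accuracy` are assumptions, and the data is purely illustrative.

```python
def judge_accuracy(human_votes, model_votes):
    """Fraction of pairwise comparisons where the model judge picks the
    same winner as the human voter."""
    if len(human_votes) != len(model_votes):
        raise ValueError("vote lists must have the same length")
    if not human_votes:
        raise ValueError("at least one comparison is required")
    agree = sum(h == m for h, m in zip(human_votes, model_votes))
    return agree / len(human_votes)

# Illustrative votes only: each entry names the preferred response in a pair.
human = ["A", "B", "A", "A", "B"]
model = ["A", "B", "B", "A", "A"]
acc = judge_accuracy(human, model)
```

A judge that guesses at random would land near 0.5 on balanced pairs, which gives a natural reference point when reading such accuracies.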

中文标题/摘要

标题：SciArena：一个开放的基于科学文献任务的基础模型评估平台

我们介绍了SciArena，一个开放且协作的平台，用于在科学文献驱动的任务上评估基础模型。与传统的科学文献理解和合成基准不同，SciArena 直接与研究社区互动，采用聊天机器人竞技场的评价方法，通过社区投票进行模型比较。通过利用集体智慧，SciArena 提供了一个由社区驱动的评估，评估模型在需要基于文献、长篇回答的开放性科学任务上的表现。该平台目前支持47个基础模型，并已从跨学科科学领域的研究人员那里收集了超过20,000票。我们对迄今为止收集的数据进行了分析，确认其高质量。我们基于模型排名排行榜讨论了结果和见解。为了进一步促进构建用于文献任务的基础模型自动化评估系统的研究，我们发布了SciArena-Eval，这是一个基于收集的偏好数据的元评估基准。它通过比较模型的成对评估与人类投票来衡量模型判断答案质量的准确性。我们的实验突出了基准的挑战，并强调了需要更可靠的自动化评估方法。

Summary / 总结

SciArena is an open platform for evaluating foundation models on scientific literature-grounded tasks, using community voting similar to the Chatbot Arena approach. It currently supports 47 models and has collected over 20,000 votes from researchers. The authors' analysis confirms that the collected data is of high quality, and they release SciArena-Eval, a meta-evaluation benchmark that measures how accurately models judge answer quality by comparing their pairwise assessments with human votes.

SciArena 是一个开放平台，用于评估基础模型在科学文献任务上的表现，采用类似于 Chatbot Arena 的社区投票方式。它支持 47 个模型，并已收到来自研究人员的超过 20,000 票。该平台展示了高质量的数据，并包含一个元评估基准 SciArena-Eval，用于衡量模型判断答案质量的准确性。

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun

First: 2026-01-22T18:24:00+00:00 · Latest: 2026-01-22T18:24:00+00:00

Comments: Code: https://github.com/test-time-training/discover

Abstract

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.

中文标题/摘要

标题：在测试时学习发现

我们如何使用AI在科学问题上发现新的前沿？先前的测试时缩放工作，如AlphaEvolve，通过提示冻结的LLM进行搜索。我们进行测试时的强化学习，因此LLM可以继续训练，但现在具有针对测试问题的经验。这种持续学习的形式非常特殊，因为它旨在产生一个最佳解决方案，而不是平均多个较好的解决方案，并且解决这个问题而不是泛化到其他问题。因此，我们的学习目标和搜索子程序设计优先考虑最有前途的解决方案。我们称这种方法为测试时训练以发现（TTT-Discover）。我们遵循先前的工作，专注于具有连续奖励的问题。我们报告了我们尝试的每个问题的结果，涵盖数学、GPU内核工程、算法设计和生物学。TTT-Discover在几乎所有问题上都设定了新的前沿：(i) Erdős的最小重叠问题和自相关不等式；(ii) GPUMode内核竞赛（比先前的最佳结果快至2倍）；(iii) 过去的AtCoder算法竞赛；和(iv) 单细胞分析中的去噪问题。我们的解决方案由专家或组织者审核。所有结果均使用OpenAI gpt-oss-120b开源模型获得，并可通过我们公开的代码重现，与之前的最佳结果相比，这些结果不需要封闭的前沿模型。我们的测试时训练运行使用Thinking Machines的Tinker API，每解决问题的成本仅为几百美元。

Summary / 总结

The research aims to use AI to discover new state-of-the-art solutions for scientific problems by performing reinforcement learning at test time. The method, Test-Time Training to Discover (TTT-Discover), allows the LLM to continue training with problem-specific experience, prioritizing the most promising solutions. Across mathematics, GPU kernel engineering, algorithm design, and biology, TTT-Discover sets a new state of the art on almost all problems attempted, including Erdős' minimum overlap problem and an autocorrelation inequality, a GPUMode kernel competition (up to 2× faster than prior art), past AtCoder algorithm competitions, and a denoising problem in single-cell analysis. All results are achieved with an open model, OpenAI gpt-oss-120b, and publicly available code, at a cost of a few hundred dollars per problem.

研究旨在通过在测试时进行强化学习来使用AI发现科学问题的新前沿解决方案。方法Test-Time Training to Discover (TTT-Discover) 允许LLM在具有特定问题经验的情况下继续训练，并优先考虑有前途的解决方案。TTT-Discover 在数学、GPU 内核工程、算法设计和生物学等多个领域设置了新的前沿结果，这些结果得到了专家或组织者的审查。该方法使用了一个开源模型，OpenAI gpt-oss-120b，并且可以通过公开的代码进行重现，与之前的最佳结果依赖于封闭模型的方法不同。

Is this chart lying to me? Automating the detection of misleading visualizations

Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych

First: 2025-08-29T14:36:45+00:00 · Latest: 2026-01-22T18:23:24+00:00

Comments: Preprint under review. Code and data available at: https://github.com/UKPLab/arxiv2025-misviz

Abstract

Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also create Misviz-synth, a synthetic dataset of 57,665 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and image-axis classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.
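As an illustration of what a rule-based detector for a single design-rule violation might look like, the sketch below flags a bar chart whose value axis does not include zero. This is a hypothetical example: the abstract does not list the 12 misleader types, so treating a truncated axis as one of them is an assumption, and `bar_axis_is_truncated` is an invented helper that works on axis limits, not on images.

```python
def bar_axis_is_truncated(y_min, y_max):
    """Return True when the value axis of a bar chart excludes zero.

    Bar lengths encode magnitude, so an axis that starts above zero
    (or ends below it) exaggerates differences between bars.
    """
    if y_min > y_max:
        raise ValueError("y_min must not exceed y_max")
    return y_min > 0 or y_max < 0

# Illustrative axis limits only.
truncated = bar_axis_is_truncated(50, 100)    # axis starts at 50
honest = bar_axis_is_truncated(0, 100)        # axis starts at 0
```

A rule like this needs the axis limits as structured input; recovering them from a rendered image is one reason the detection task is hard in practice.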

中文标题/摘要

标题：这张图表在欺骗我吗？自动化检测误导性可视化

误导性可视化是社交媒体和网络上信息误导的强大驱动因素。通过违反图表设计原则，它们扭曲数据并引导读者得出不准确的结论。先前的研究表明，无论是人类还是多模态大型语言模型（MLLMs）都经常被这些可视化所欺骗。自动检测误导性可视化并识别它们违反的具体设计规则可以帮助保护读者并减少信息误导的传播。然而，由于缺乏大型、多样且公开可用的数据集，AI模型的训练和评估受到了限制。在本研究中，我们引入了Misviz，这是一个包含2,604个真实世界可视化并标注了12种类型误导的基准数据集。为了支持模型训练，我们还创建了Misviz-synth，这是一个基于真实数据表生成的57,665个可视化数据集，使用Matplotlib生成。我们使用最先进的MLLMs、基于规则的系统和图像轴分类器对两个数据集进行了全面评估。我们的结果表明，该任务仍然极具挑战性。我们发布了Misviz、Misviz-synth及其配套代码。

Summary / 总结

This research addresses the issue of misleading visualizations that can spread misinformation. It introduces Misviz, a benchmark dataset of 2,604 real-world visualizations annotated with 12 types of misleaders, and Misviz-synth, a synthetic dataset of 57,665 visualizations. The study evaluates state-of-the-art multimodal large language models, rule-based systems, and image-axis classifiers on these datasets and finds that the task is still highly challenging. The authors release the datasets and code to advance the field.

该论文针对误导性可视化可能传播虚假信息的问题，引入了包含2,604个真实世界可视化和12种误导类型的Misviz基准数据集，以及基于真实数据表生成的57,665个可视化合成数据集Misviz-synth。作者使用最先进的模型、基于规则的系统和图像轴分类器对这些数据集进行了全面评估，并发现该任务仍然极具挑战性。该工作旨在通过自动化检测误导性可视化及其违反的具体设计规则来帮助保护读者免受虚假信息的影响。
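
As an illustration of what one rule in a rule-based detector might look like, the sketch below flags a bar chart whose value axis starts above zero. The truncated-axis rule, the `AxisMeta` structure, and its field names are all assumptions made for this example; the abstract does not list the 12 misleader types or describe the rule-based systems that were evaluated.

```python
from dataclasses import dataclass


@dataclass
class AxisMeta:
    """Hypothetical axis metadata extracted from a chart (not from the paper)."""
    chart_type: str  # e.g. "bar" or "line"
    y_min: float     # lowest value shown on the value axis
    y_max: float     # highest value shown on the value axis


def has_truncated_axis(axis: AxisMeta) -> bool:
    """Flag bar charts whose value axis starts above zero (assumed rule)."""
    if axis.chart_type != "bar":
        return False
    return axis.y_min > 0 and axis.y_max > axis.y_min


print(has_truncated_axis(AxisMeta("bar", 50.0, 100.0)))
print(has_truncated_axis(AxisMeta("bar", 0.0, 100.0)))
```

A real detector would first need to recover the axis metadata from the image, which is the hard part of the task and is not shown here.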

Structured Hints for Sample-Efficient Lean Theorem Proving

Authors: Zachary Burton

First: 2026-01-22T18:16:46+00:00 · Latest: 2026-01-22T18:16:46+00:00

Comments: 9 pages, 1 figure

Abstract

State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.

中文标题/摘要

标题：用于样本高效Lean定理证明的结构化提示

像DeepSeek-Prover-V1.5这样的先进神经定理证明器结合了大型语言模型和强化学习，通过复杂的训练取得了令人印象深刻的成果。我们提出的问题是：这些高度训练的模型在推理时是否仍然受益于简单的结构指导？我们在miniF2F基准测试上评估了一种轻量级干预措施：覆盖15种常见策略骨架的固定提示调度。这种简单的方法取得了21.7%的pass@16，而同一模型的标准采样为15.2%，在样本数（k=16）和最大生成长度（1024个标记）相同的情况下相对提升43%。我们的结果表明，即使是能力很强的RL训练证明器也未能充分利用策略语言中可用的结构先验，并且简单的推理时指导仍然是一种廉价的补充提升。

Summary / 总结

The research explores whether state-of-the-art neural theorem provers, despite extensive training, still benefit from simple structural guidance at inference time. The study evaluates a lightweight intervention, a fixed prompt schedule over 15 common tactic skeletons, on the miniF2F benchmark. The approach reaches 21.7% pass@16, compared to 15.2% for standard sampling from the same model, a 43% relative improvement with the same number of samples and the same maximum generation length. The findings suggest that even capable RL-trained provers underutilize structural priors and that simple inference-time guidance remains a cheap, complementary boost.

研究旨在探索尽管最先进的神经定理证明器经过高度训练，但在推理过程中是否仍能从简单的结构指导中受益。研究在miniF2F基准上评估了一种轻量级干预措施：15种常见策略骨架的固定提示调度。这种方法的pass@16达到21.7%，而标准采样为15.2%，在样本数和最大生成长度相同的情况下相对提升43%。研究结果表明，即使先进的模型也未能充分利用结构先验，而简单的推理时指导仍是一种廉价的补充提升。
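
The 43% figure is a relative gain over the 15.2% baseline, not a difference in percentage points. The sketch below checks that arithmetic and shows one way a fixed schedule could spread k=16 samples over 15 skeletons. The round-robin assignment and the placeholder skeleton names are assumptions; the abstract does not say how the schedule is laid out.

```python
from itertools import cycle, islice


def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def prompt_schedule(skeletons, k):
    """Assign one skeleton to each of k samples, cycling in a fixed order (assumed)."""
    return list(islice(cycle(skeletons), k))


# Placeholder names; the 15 actual tactic skeletons are not given in the abstract.
skeletons = [f"skeleton_{i}" for i in range(15)]
schedule = prompt_schedule(skeletons, 16)

gain = relative_improvement(15.2, 21.7)
print(f"{gain:.0%}")  # 43%
```

Under this assumed layout the sixteenth sample reuses the first skeleton, so the sample budget matches standard sampling exactly.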

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Authors: Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, Jinwei Gu

First: 2026-01-22T18:09:30+00:00 · Latest: 2026-01-22T18:09:30+00:00

Abstract

Recent video generation models demonstrate remarkable ability to capture complex physical interactions and scene evolution over time. To leverage their spatiotemporal priors, robotics works have adapted video models for policy learning but introduce complexity by requiring multiple stages of post-training and new architectural components for action generation. In this work, we introduce Cosmos Policy, a simple approach for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on the robot demonstration data collected on the target platform, with no architectural modifications. Cosmos Policy learns to directly generate robot actions encoded as latent frames within the video model's latent diffusion process, harnessing the model's pretrained priors and core learning algorithm to capture complex action distributions. Additionally, Cosmos Policy generates future state images and values (expected cumulative rewards), which are similarly encoded as latent frames, enabling test-time planning of action trajectories with higher likelihood of success. In our evaluations, Cosmos Policy achieves state-of-the-art performance on the LIBERO and RoboCasa simulation benchmarks (98.5% and 67.1% average success rates, respectively) and the highest average score in challenging real-world bimanual manipulation tasks, outperforming strong diffusion policies trained from scratch, video model-based policies, and state-of-the-art vision-language-action models fine-tuned on the same robot demonstrations. Furthermore, given policy rollout data, Cosmos Policy can learn from experience to refine its world model and value function and leverage model-based planning to achieve even higher success rates in challenging tasks. We release code, models, and training data at https://research.nvidia.com/labs/dir/cosmos-policy/

中文标题/摘要

标题：Cosmos Policy：针对视觉运动控制和规划微调视频模型

近期的视频生成模型展示了捕捉复杂物理交互和随时间演变的场景的非凡能力。为了利用其时空先验知识，机器人学工作将视频模型适应于策略学习，但引入了复杂性，需要多阶段的后训练和新的架构组件来生成动作。在本工作中，我们提出了Cosmos Policy，这是一种简单的方法，通过在目标平台上收集的机器人演示数据的单阶段后训练，将大型预训练视频模型(Cosmos-Predict2)适应为有效的机器人策略，无需架构修改。Cosmos Policy学习直接生成机器人动作，编码为视频模型的潜在扩散过程中的潜在帧，利用模型的预训练先验和核心学习算法捕捉复杂动作分布。此外，Cosmos Policy生成未来状态图像和值（预期累积奖励），同样编码为潜在帧，使测试时能够规划具有更高成功概率的动作轨迹。在我们的评估中，Cosmos Policy在LIBERO和RoboCasa模拟基准测试中分别实现了98.5%和67.1%的平均成功率，并在具有挑战性的真实世界双臂操作任务中获得了最高的平均分数，优于从零开始训练的强大扩散策略、基于视频模型的策略和在相同机器人演示上微调的最先进的视觉-语言-动作模型。此外，给定策略展开数据，Cosmos Policy可以从经验中学习改进其世界模型和价值函数，并利用基于模型的规划在具有挑战性的任务中实现更高的成功率。我们在 https://research.nvidia.com/labs/dir/cosmos-policy/ 发布了代码、模型和训练数据。

Summary / 总结

Cosmos Policy is a method for adapting a large pretrained video model (Cosmos-Predict2) into an effective robot policy through a single stage of post-training on robot demonstration data, without architectural modifications. It learns to generate robot actions and future state images as latent frames, leveraging the model's pretrained priors. In evaluations, Cosmos Policy outperforms other approaches on simulation benchmarks and real-world bimanual manipulation tasks, achieving state-of-the-art success rates and enabling model-based planning for higher success in challenging tasks.

Cosmos Policy 是一种方法，通过在机器人演示数据上进行一次阶段的后训练，将一个大型预训练视频模型（Cosmos-Predict2）转换为有效的机器人策略，无需修改架构。它学习将机器人动作和未来状态图像作为潜变量帧生成，利用模型的预训练先验。Cosmos Policy 在仿真基准测试和真实世界的双臂操作任务中表现出色，达到最先进的成功率。
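
The test-time planning idea in the abstract, generating candidate action trajectories together with predicted values and preferring the more promising ones, can be sketched as a best-of-N selection. The function names, the toy proposal and value functions, and the best-of-N strategy itself are assumptions for illustration; the abstract does not specify the planning algorithm.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[float]


def plan_best_of_n(
    propose: Callable[[random.Random], Trajectory],
    predict_value: Callable[[Trajectory], float],
    n_candidates: int,
    rng: random.Random,
) -> Tuple[Trajectory, float]:
    """Sample candidate trajectories and keep the one with the highest predicted value."""
    candidates = [propose(rng) for _ in range(n_candidates)]
    values = [predict_value(c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: values[i])
    return candidates[best], values[best]


# Toy stand-ins for the model's action and value outputs (hypothetical).
def toy_propose(rng: random.Random) -> Trajectory:
    return [rng.uniform(-1.0, 1.0) for _ in range(4)]


def toy_value(traj: Trajectory) -> float:
    return -sum(a * a for a in traj)


best_traj, best_value = plan_best_of_n(toy_propose, toy_value, 8, random.Random(0))
print(best_value)
```

In the paper's setting both the proposals and the value estimates would come from the fine-tuned video model; here they are replaced with simple functions so that the selection logic can run on its own.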

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

Authors: Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

First: 2025-05-25T21:29:00+00:00 · Latest: 2026-01-22T18:06:39+00:00

Comments: 45 pages, 21 figures, under review

Abstract

Ambivalence and hesitancy (A/H), two closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. A/H is a subtle and conflicting emotion that sets a person in a state between positive and negative orientations, or between acceptance and refusal to do something. It manifests as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H, as is done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no dataset currently exists for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours captured from 300 participants across Canada answering predefined questions to elicit A/H. It is intended to mirror real-world online personalized behaviour change interventions. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participants' meta-data are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation. The data, code, and pretrained weights are available.

中文标题/摘要

标题：BAH数据集：面向数字行为改变的视频矛盾/犹豫识别

矛盾和犹豫（A/H）是两个密切相关的概念，是个人推迟、避免或放弃健康行为改变的主要原因。这是一种微妙且矛盾的情绪，使人处于正向和负向态度之间，或接受与拒绝某事之间。它表现为情感在多种模态之间或同一模态内的不一致，如面部和语音表达以及肢体语言。尽管可以像面对面互动中那样训练专家来识别A/H，但将专家整合到数字健康干预措施中既昂贵又效果不佳。因此，自动识别A/H对于数字行为改变干预措施的个性化和成本效益至关重要。然而，目前尚无用于设计机器学习模型识别A/H的数据集。本文介绍了为视频中多模态识别A/H而收集的Behavioural Ambivalence/Hesitancy (BAH)数据集。该数据集包含1,427个视频，总时长10.60小时，来自加拿大各地的300名参与者，他们通过回答预定义问题来引发A/H。它旨在模拟现实世界的在线个性化行为改变干预措施。BAH由三位专家标注，提供A/H发生的时间戳，以及带有A/H线索的帧级和视频级标注。还提供了视频转录、裁剪和对齐的人脸以及参与者元数据。由于A和H在实践中表现相似，我们提供了二元标注，表明A/H的存在或不存在。此外，本文还给出了基线模型在BAH上的基准结果，涵盖帧级和视频级识别、零样本预测以及使用无源域适应的个性化。数据、代码和预训练权重均可用。

Summary / 总结

This paper introduces the BAH dataset for recognizing ambivalence and hesitancy (A/H) in videos, which is intended to support the personalization of digital behaviour change interventions. The dataset includes 1,427 videos (10.60 hours in total) from 300 participants across Canada answering predefined questions that elicit A/H. Three experts annotated timestamps where A/H occurs, along with frame- and video-level annotations with A/H cues. The paper also reports benchmarking results with baseline models for frame- and video-level recognition, zero-shot prediction, and personalization using source-free domain adaptation.

本文介绍了用于识别视频中矛盾和犹豫（A/H）的BAH数据集，旨在支持数字行为改变干预的个性化。该数据集包含来自加拿大各地300名参与者的1,427个视频（总时长10.60小时），参与者通过回答预定义问题来引发A/H。三位专家标注了A/H发生的时间戳，并提供了带有A/H线索的帧级和视频级标注。本文还报告了基线模型在帧级和视频级识别、零样本预测以及使用无源域适应的个性化方面的基准结果。
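
Timestamp annotations of the kind described above are commonly turned into per-frame binary labels before training a frame-level model. The sketch below shows one way to do that. The `(start, end)` segment format in seconds, the frame-timing convention, and the function names are assumptions; the abstract does not describe the dataset's file format.

```python
import math
from typing import List, Tuple


def frame_labels(segments: List[Tuple[float, float]], num_frames: int, fps: float) -> List[int]:
    """Label frame i as 1 if its timestamp i / fps falls inside any annotated segment."""
    labels = [0] * num_frames
    for start, end in segments:
        first = max(0, math.ceil(start * fps))
        last = min(num_frames - 1, math.floor(end * fps))
        for i in range(first, last + 1):
            labels[i] = 1
    return labels


def video_label(labels: List[int]) -> int:
    """Video-level binary label: 1 if A/H is present in at least one frame."""
    return int(any(labels))


labels = frame_labels([(1.0, 2.0)], num_frames=6, fps=2.0)
print(labels)               # [0, 0, 1, 1, 1, 0]
print(video_label(labels))  # 1
```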

Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems

Authors: Prakash Dhungana, Sayed Ahmad Salehi

First: 2026-01-22T17:59:31+00:00 · Latest: 2026-01-22T17:59:31+00:00

Comments: 12 pages, 8 figures, and 3 tables

Abstract

Keyword Spotting (KWS) systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges due to domain shifts caused by varying noise and recording conditions. To address this, we propose a comprehensive continual learning framework designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network that uses both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features, supported by a multi-stage denoising process involving discrete wavelet transform and spectral subtraction techniques, plus model and prototype update blocks. Unlike prior methods that restrict updates to specific layers, our approach updates the complete quantized model, which is made possible by the compact model architecture. A subset of input samples is selected at runtime using class prototypes and confidence-driven filtering; these samples are then pseudo-labeled and combined with a rehearsal buffer for incremental model retraining. Experimental results on a noisy test dataset demonstrate the framework's effectiveness, achieving 99.63% accuracy on clean data and maintaining robust performance (exceeding 94% accuracy) across diverse noisy environments, even at -10 dB Signal-to-Noise Ratio. The proposed framework confirms that integrating efficient denoising with prototype-based continual learning enables KWS models to operate autonomously and robustly in resource-constrained, dynamic environments.

中文标题/摘要

标题：资源受限系统中鲁棒高效关键词识别的领域增量连续学习

部署在边缘设备上的具有小体积模型的关键词识别（KWS）系统由于不同噪声和录音条件引起的领域偏移而面临显著的准确性和鲁棒性挑战。为解决这一问题，我们提出了一种全面的连续学习框架，旨在适应新领域的同时保持计算效率。所提出的管道集成了双输入卷积神经网络，利用梅尔频率倒谱系数（MFCC）和梅尔频谱图特征，并通过离散小波变换和频谱减法技术等多阶段去噪过程支持，还包括模型和原型更新模块。与先前方法仅限制更新特定层不同，我们的方法更新完整的量化模型，这得益于紧凑的模型架构。在运行时，根据类原型和置信度驱动的过滤选择一部分输入样本，然后进行伪标签并结合回放缓冲区进行增量模型重训练。在嘈杂测试数据集上的实验结果表明，该框架的有效性，其在干净数据上的准确率达到99.63%，并在多种嘈杂环境中保持鲁棒性能（超过94%的准确率），即使在-10 dB信噪比下也是如此。所提出的工作框架证实，将高效的去噪与基于原型的连续学习相结合，使KWS模型能够在资源受限、动态环境中自主且鲁棒地运行。

Summary / 总结

The paper addresses the challenges of Keyword Spotting (KWS) systems in resource-constrained devices, particularly the issues of domain shifts and varying noise conditions. It proposes a domain-incremental continual learning framework that integrates a dual-input Convolutional Neural Network and a multi-stage denoising process, updating the complete quantized model. The framework uses class prototypes and confidence-driven filtering to select and pseudo-label input samples for incremental retraining. Experiments show the framework achieves 99.63% accuracy on clean data and maintains robust performance across noisy environments, even at -10 dB Signal-to-Noise Ratio.

本文提出了一种针对资源受限设备中的关键词识别（KWS）系统的增量持续学习框架，以应对域偏移带来的挑战。该方法结合了使用MFCC和梅尔谱图特征的双输入卷积神经网络，并通过多阶段去噪过程和模型更新支持。关键发现表明，该框架在干净数据上的准确率达到99.63%，并在各种噪声环境中保持超过94%的鲁棒性能，即使在-10 dB信噪比下也能实现这一性能。
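
The runtime selection step, which picks samples using class prototypes and confidence-driven filtering and then pseudo-labels them, might look roughly like the sketch below. The cosine-similarity measure, both thresholds, and all names are assumptions; the abstract does not give the selection criteria.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_and_pseudo_label(
    embeddings: List[Vector],
    confidences: List[float],
    prototypes: Dict[str, Vector],
    min_conf: float = 0.9,  # assumed confidence threshold
    min_sim: float = 0.8,   # assumed prototype-similarity threshold
) -> List[Tuple[Vector, str]]:
    """Keep confident samples that lie close to a class prototype; label them with that class."""
    selected = []
    for emb, conf in zip(embeddings, confidences):
        label, sim = max(
            ((name, cosine(emb, proto)) for name, proto in prototypes.items()),
            key=lambda pair: pair[1],
        )
        if conf >= min_conf and sim >= min_sim:
            selected.append((emb, label))
    return selected


prototypes = {"yes": [1.0, 0.0], "no": [0.0, 1.0]}
embeddings = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.95]]
confidences = [0.95, 0.99, 0.5]
selected = select_and_pseudo_label(embeddings, confidences, prototypes)
print(selected)  # [([0.9, 0.1], 'yes')]
```

The selected pairs would then be merged with samples from the rehearsal buffer to form the retraining set; the second sample is rejected for being far from every prototype and the third for low confidence.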

HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

Authors: Zequn Xie, Xin Liu, Boyun Zhang, Yuxiao Lin, Sihang Cai, Tao Jin

Venue: ICASSP 2026

First: 2026-01-22T17:57:42+00:00 · Latest: 2026-01-22T17:57:42+00:00

Comments: Accepted by ICASSP 2026

Abstract

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

中文标题/摘要

标题：HVD：基于人类视觉的视频表示学习方法在文本-视频检索中的应用

CLIP的成功推动了文本-视频检索领域的显著进步。然而，当前的方法往往受到“盲视”特征交互的困扰，模型难以从背景噪声中区分关键的视觉信息，这主要是由于文本查询的稀疏性。为了解决这一问题，我们借鉴了人类的认知行为，并提出了Human Vision-Driven (HVD)模型。我们的框架建立了一种从粗到细的对齐机制，包括两个关键组件：帧特征选择模块（FFSM）和补丁特征压缩模块（PFCM）。FFSM通过选择关键帧来模拟人类的宏观感知能力，从而消除时间冗余。随后，PFCM通过先进的注意力机制将补丁特征聚合为显著的视觉实体，模拟微观感知，实现精确的实体级匹配。在五个基准上的广泛实验表明，HVD不仅捕捉到了类似人类的视觉焦点，还实现了最先进的性能。

Summary / 总结

The HVD model is designed to improve text-video retrieval by addressing the issue of 'blind' feature interaction, where models struggle to distinguish key visual information from background noise. It introduces a coarse-to-fine alignment mechanism with two components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM selects key frames to reduce temporal redundancy, while PFCM aggregates patch features into salient visual entities through an advanced attention mechanism. Experiments show that HVD captures human-like visual focus and achieves state-of-the-art performance on five benchmarks.

研究旨在通过解决模型难以区分关键视觉信息和背景噪声的问题，提高文本视频检索的效果。提出了Human Vision-Driven (HVD)模型，包括Frame Features Selection Module (FFSM)和Patch Features Compression Module (PFCM)。FFSM通过选择关键帧减少时间冗余，而PFCM使用高级注意力机制聚合片段特征为显著的视觉实体，实现精确的实体级匹配。在五个基准上的实验表明，HVD不仅捕捉到了人类的视觉焦点，还达到了最先进的性能。
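
Key-frame selection of the kind FFSM performs can be illustrated with a simple top-k filter. The sketch assumes frames are ranked by their dot-product similarity to the text query embedding and that features are already normalized; the abstract does not state the actual selection criterion, so this is only one plausible reading.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_key_frames(frame_feats: List[Vector], text_feat: Vector, k: int) -> List[int]:
    """Return indices of the k frames most similar to the text, in temporal order."""
    scores = [dot(f, text_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


frames = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
query = [1.0, 0.0]
print(select_key_frames(frames, query, k=2))  # [0, 2]
```

Returning the indices in temporal order keeps the selected frames usable by any later module that depends on frame ordering.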

ActionMesh: Animated 3D Mesh Generation with Temporal 3D Diffusion

Authors: Remy Sabathier, David Novotny, Niloy J. Mitra, Tom Monnier

First: 2026-01-22T17:41:13+00:00 · Latest: 2026-01-22T17:41:13+00:00

Abstract

Generating animated 3D objects is at the heart of many applications, yet most advanced works are typically difficult to apply in practice because of their limited setup, their long runtime, or their limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to modify existing 3D diffusion models to include a temporal axis, resulting in a framework we dubbed "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying and independent 3D shapes. Second, we design a temporal 3D autoencoder that translates a sequence of independent shapes into the corresponding deformations of a pre-defined reference shape, allowing us to build an animation. Combining these two components, ActionMesh generates animated 3D meshes from different inputs like a monocular video, a text description, or even a 3D mesh with a text prompt describing its animation. Besides, compared to previous approaches, our method is fast and produces results that are rig-free and topology consistent, hence enabling rapid iteration and seamless applications like texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performances on both geometric accuracy and temporal consistency, demonstrating that our model can deliver animated 3D meshes with unprecedented speed and quality.

中文标题/摘要

标题：ActionMesh：基于时间3D扩散的动画3D网格生成

生成动画3D对象是许多应用的核心，但大多数先进的工作由于其有限的设置、长时间的运行或有限的质量，通常难以在实践中应用。我们介绍了ActionMesh，这是一种生成模型，能够以前馈方式预测“在行动”中的生产级3D网格。受到早期视频模型的启发，我们的关键见解是修改现有的3D扩散模型，加入时间轴，从而形成我们称之为“时间3D扩散”的框架。具体来说，我们首先将3D扩散阶段适应为生成表示时间变化和独立3D形状的同步潜在变量序列。其次，我们设计了一个时间3D自编码器，将一系列独立形状转换为预定义参考形状的相应变形，使我们能够构建动画。结合这两个组件，ActionMesh可以从单目视频、文本描述甚至带有动画描述的3D网格等不同输入生成动画3D网格。此外，与以前的方法相比，我们的方法速度快，生成的结果无骨架且拓扑一致，因此能够实现快速迭代和无缝应用，如纹理化和目标变换。我们在标准视频到4D基准（Consistent4D，Objaverse）上评估了我们的模型，并在几何准确性和时间一致性方面报告了最先进的性能，证明了我们的模型能够以前所未有的速度和质量生成动画3D网格。

Summary / 总结

ActionMesh is a generative model that predicts animated 3D meshes in a feed-forward manner by incorporating a temporal axis into 3D diffusion models. It consists of two main components: a 3D diffusion stage that generates synchronized latents for time-varying shapes, and a temporal 3D autoencoder that translates these shapes into deformations of a reference shape. This model can generate animations from various inputs such as monocular videos, text descriptions, or 3D meshes with text prompts. ActionMesh is fast and produces rig-free, topology-consistent results, outperforming previous methods on geometric accuracy and temporal consistency in standard benchmarks.

ActionMesh 是一种生成模型，能够使用前馈方法预测生产级的 3D 网格动画。它通过将 3D 扩散模型修改为包含时间轴，形成了名为 '时间 3D 扩散' 的框架。该模型生成一系列同步的潜在变量，表示时间变化且独立的 3D 形状，并使用时间 3D 自编码器将这些形状转化为参考形状的变形，从而实现动画生成。ActionMesh 可以从各种输入生成 3D 网格动画，并且生成的结果速度快、无骨架且拓扑一致，在几何准确性和时间一致性基准测试中超越了之前的模型。
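
To make "topology consistent" concrete: if every frame is a deformation of one reference shape, the animation can be stored as per-frame vertex offsets with a single shared face list. The data layout and function name below are assumptions for illustration, not the model's actual output format.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]


def animate(
    reference: List[Vertex],
    faces: List[Face],
    offsets_per_frame: List[List[Vertex]],
) -> List[Tuple[List[Vertex], List[Face]]]:
    """Apply per-frame vertex offsets to a reference mesh; every frame reuses the same faces."""
    frames = []
    for offsets in offsets_per_frame:
        if len(offsets) != len(reference):
            raise ValueError("each frame needs one offset per reference vertex")
        verts = [tuple(c + d for c, d in zip(v, o)) for v, o in zip(reference, offsets)]
        frames.append((verts, faces))
    return frames


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]
offsets = [[(0.0, 0.0, 0.0)] * 3, [(0.0, 0.0, 0.5)] * 3]
frames = animate(reference, faces, offsets)
print(frames[1][0])
```

Because vertex count and connectivity never change, per-vertex attributes such as texture coordinates defined on the reference mesh remain valid for every frame.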

Dynamical Mechanisms for Coordinating Long-term Working Memory Based on the Precision of Spike-timing in Cortical Neurons

Authors: Terrence J. Sejnowski

First: 2025-12-17T19:05:18+00:00 · Latest: 2026-01-22T17:40:42+00:00

Comments: 31 pages, 13 figures

Abstract

In the last century, most sensorimotor studies of cortical neurons relied on average firing rates. Rate coding is efficient for fast sensorimotor processing that occurs within a few seconds. Much less is known about long-term working memory with a time scale of hours (Ericsson and Kintsch, 1995). The discovery of millisecond-precision spike initiation in cortical neurons was unexpected (Mainen and Sejnowski, 1995). Even more striking was the precision of spiking in vivo, in response to rapidly fluctuating sensory inputs, suggesting that neural circuits could preserve and manipulate sensory information through spike timing. High temporal resolution enables a broader range of neural codes. It could also support spike-timing-dependent plasticity (STDP), which is triggered by the relative timing of spikes between presynaptic and postsynaptic neurons in the millisecond range. What spike-timing mechanisms could regulate STDP in vivo? Cortical traveling waves have been observed across many frequency bands with high temporal precision. Traveling waves have wave fronts that could link spike timing to STDP. As a wave front passes through a cortical column, excitatory synapses on the dendrites of both pyramidal and basket cells are stimulated synchronously. Inhibitory basket cells form a calyx on pyramidal cell bodies, and inhibitory rebound following a strong transient hyperpolarization can trigger a backpropagating action potential, which arrives shortly after the excitatory inputs on pyramidal dendrites. STDP activated in this way could persist for hours, creating a second-tier network. This temporary network could support long-term working memory, a cognitive network riding above the long-term sensorimotor network. On their own, traveling waves and STDP have not yet yielded new insights into cortical function. Together, they could be responsible for how we think (Sejnowski, 2025).

中文标题/摘要

标题：基于皮层神经元放电时间精确性的长期工作记忆协调动力机制

在上个世纪，大多数关于皮层神经元的感觉运动研究依赖于平均放电率。速率编码对于几秒钟内发生的快速感觉运动处理是高效的。关于时间尺度为数小时的长期工作记忆，人们所知甚少（Ericsson和Kintsch, 1995）。皮层神经元毫秒级精确的放电起始的发现是出乎意料的（Mainen和Sejnowski, 1995）。更令人惊讶的是，在体神经元在响应快速波动的感官输入时放电的精确性，这表明神经回路可以通过放电时间来保存和操控感官信息。高时间分辨率能够支持更广泛的神经编码。它还可以支持放电时间依赖可塑性（STDP），后者由突触前和突触后神经元在毫秒范围内的相对放电时间触发。在体内，什么样的放电时间机制可以调节STDP？人们已在许多频段观察到具有高时间精度的皮层行波。行波的波前可以将放电时间与STDP联系起来。当波前通过一个皮层柱时，锥体细胞和篮细胞树突上的兴奋性突触被同步刺激。抑制性篮细胞在锥体细胞体上形成花萼状结构，强瞬时超极化后的抑制性反弹可以触发一个回传动作电位，该电位在兴奋性输入到达锥体细胞树突后不久到达。以这种方式激活的STDP可以持续数小时，形成一个二级网络。这个临时网络可以支持长期工作记忆，即凌驾于长期感觉运动网络之上的认知网络。单独来看，行波和STDP尚未带来关于皮层功能的新见解。结合在一起，它们可能决定了我们如何思考（Sejnowski, 2025）。

Summary / 总结

This paper proposes a mechanism for coordinating long-term working memory based on the precision of spike timing in cortical neurons. It hypothesizes that cortical traveling waves and spike-timing-dependent plasticity (STDP) act together: as a wave front passes through a cortical column, it synchronously stimulates excitatory synapses on pyramidal and basket cells, and inhibitory rebound can trigger a backpropagating action potential that arrives shortly after the excitatory inputs. STDP activated in this way could persist for hours and create a temporary second-tier network that supports long-term working memory.

该论文提出了一种基于皮层神经元放电时间精确度协调长期工作记忆的机制。论文假设皮层行波与放电时间依赖可塑性（STDP）共同起作用：当行波的波前通过皮层柱时，会同步刺激锥体细胞和篮细胞上的兴奋性突触，而抑制性反弹可触发回传动作电位，并在兴奋性输入之后不久到达。以这种方式激活的STDP可能持续数小时，形成支持长期工作记忆的临时二级网络。
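
For readers unfamiliar with STDP, the sketch below implements the common pair-based exponential-window model, in which a postsynaptic spike shortly after a presynaptic one strengthens the synapse and the reverse order weakens it. This textbook model and all parameter values are assumptions chosen for illustration; the abstract does not commit to a specific STDP rule.

```python
import math


def stdp_weight_change(
    dt_ms: float,
    a_plus: float = 0.01,     # assumed potentiation amplitude
    a_minus: float = 0.012,   # assumed depression amplitude
    tau_plus: float = 20.0,   # assumed potentiation time constant (ms)
    tau_minus: float = 20.0,  # assumed depression time constant (ms)
) -> float:
    """Weight change for one spike pair, where dt_ms = t_post - t_pre."""
    if dt_ms > 0:   # pre before post: potentiation
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:   # post before pre: depression
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0


for dt in (-20.0, -5.0, 5.0, 20.0):
    print(dt, stdp_weight_change(dt))
```

In the scenario the abstract describes, the backpropagating action potential arrives shortly after the excitatory inputs, which corresponds to a small positive `dt_ms` and therefore to potentiation under this model.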

Mantis: A Foundation Model for Mechanistic Disease Forecasting

Authors: Carson Dudley, Reiden Magdaleno, Christopher Harding, Ananya Sharma, Emily Martin, Marisa Eisenberg

First: 2025-08-17T06:55:29+00:00 · Latest: 2026-01-22T17:34:42+00:00

Comments: 11 pages, 4 figures

Abstract

Infectious disease forecasting in novel outbreaks or low-resource settings is hampered by the need for large disease and covariate data sets, bespoke training, and expert tuning, all of which can hinder rapid generation of forecasts for new settings. To help address these challenges, we developed Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. We evaluated Mantis against 48 forecasting models across six diseases with diverse modes of transmission, assessing both point forecast accuracy (mean absolute error) and probabilistic performance (weighted interval score and coverage). Despite using no real-world data during training, Mantis achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub when backtested on early pandemic forecasts which it had not previously seen. Across all other diseases tested, Mantis consistently ranked in the top two models across evaluation metrics. Mantis further generalized to diseases with transmission mechanisms not represented in its training data, demonstrating that it can capture fundamental contagion dynamics rather than memorizing disease-specific patterns. These capabilities illustrate that purely simulation-based foundation models such as Mantis can provide a practical foundation for disease forecasting: general-purpose, accurate, and deployable where traditional models struggle.

中文标题/摘要

标题：Mantis：一种基于机理的疾病预报基础模型

在新型疫情或低资源环境中进行传染病预报，受制于对大规模疾病和协变量数据集、定制化训练和专家调优的需求，这些都可能阻碍针对新环境快速生成预报。为应对这些挑战，我们开发了Mantis，一种完全基于机理模拟训练的基础模型，它能够跨疾病、地区和结果实现开箱即用的预报，即使在历史数据有限的环境中也是如此。我们在六种具有不同传播模式的疾病上将Mantis与48个预报模型进行了比较，评估了点预报准确性（平均绝对误差）和概率性能（加权区间评分和覆盖率）。尽管在训练过程中未使用任何真实世界数据，但在对其此前未见过的疫情早期预报进行回测时，Mantis的平均绝对误差低于CDC的COVID-19 Forecast Hub中的所有模型。在测试的所有其他疾病中，Mantis在各项评估指标上始终排名前二。此外，Mantis还能够泛化到传播机制未出现在其训练数据中的疾病，这表明它捕捉到了基本的传播动态，而不是记忆特定疾病的模式。这些能力表明，像Mantis这样纯粹基于模拟的基础模型可以为疾病预报提供实用的基础：通用、准确，并且可部署在传统模型难以发挥作用的环境中。

Summary / 总结

Mantis is a foundation model trained on mechanistic simulations to address the challenges of infectious disease forecasting in novel outbreaks or low-resource settings. It achieved lower mean absolute error than all models in the CDC's COVID-19 Forecast Hub during backtesting and consistently ranked in the top two models across various diseases and evaluation metrics. Mantis demonstrated the ability to generalize to diseases with transmission mechanisms not present in its training data, highlighting its potential as a practical foundation for disease forecasting.

Mantis 是一种基于机理模拟训练的基础模型，旨在跨疾病、地区和结果实现传染病的开箱即用预测，尤其适用于数据稀缺的环境。它与 48 种模型在六种疾病上的表现进行了对比，结果显示其在回测中的平均绝对误差低于 CDC 的 COVID-19 Forecast Hub 中的所有模型，并且在所有其他疾病测试中始终排名前二。Mantis 还能够泛化到传播机制未包含在其训练数据中的疾病，表明它捕捉的是基本的传播动态而非特定疾病的模式。
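
The three evaluation metrics named in the abstract can be computed as in the sketch below. The weighted interval score here follows the widely used definition for quantile-based forecasts, with weight 1/2 on the median and alpha/2 on each central (1 - alpha) interval. Whether the paper uses exactly this weighting is an assumption, and the function names are illustrative.

```python
from typing import Dict, List, Tuple


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def interval_score(lower: float, upper: float, y: float, alpha: float) -> float:
    """Width of the central (1 - alpha) interval plus penalties for missing y."""
    return (
        (upper - lower)
        + (2.0 / alpha) * max(lower - y, 0.0)
        + (2.0 / alpha) * max(y - upper, 0.0)
    )


def weighted_interval_score(
    median: float, intervals: Dict[float, Tuple[float, float]], y: float
) -> float:
    """WIS for one forecast; `intervals` maps alpha to (lower, upper)."""
    total = 0.5 * abs(y - median)
    for alpha, (lower, upper) in intervals.items():
        total += (alpha / 2.0) * interval_score(lower, upper, y, alpha)
    return total / (len(intervals) + 0.5)


def coverage(lowers: List[float], uppers: List[float], ys: List[float]) -> float:
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(1 for lo, hi, y in zip(lowers, uppers, ys) if lo <= y <= hi)
    return hits / len(ys)


print(weighted_interval_score(10.0, {0.5: (8.0, 12.0)}, y=14.0))
print(coverage([8.0, 8.0], [12.0, 12.0], [10.0, 14.0]))  # 0.5
```

Lower MAE and WIS are better, while coverage should sit close to the nominal level of the interval, for example 0.5 for a 50% interval.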

CGS-GAN: 3D Consistent Gaussian Splatting GANs for High Resolution Human Head Synthesis

Authors: Florian Barthel, Wieland Morgenstern, Paul Hinzer, Anna Hilsmann, Peter Eisert

First: 2025-05-23T07:56:25+00:00 · Latest: 2026-01-22T17:31:58+00:00

Comments: Main paper 12 pages, supplementary materials 8 pages

Abstract

Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check out our project page here: https://fraunhoferhhi.github.io/cgs-gan/

中文标题/摘要

标题：CGS-GAN：基于3D高斯点渲染的高分辨率人体头部合成GAN

近年来，基于3D高斯点渲染的3D GAN被提出，用于高质量的人体头部合成。然而，现有方法通过将随机潜在向量条件化于当前相机位置来稳定训练并提高从陡峭视角的渲染质量，这损害了3D一致性，因为我们观察到在每次相机移动时重新合成3D头部时存在显著的身份变化。相反，固定相机到单一视角可以为该视角提供高质量的渲染，但对新颖视角的表现较差。去除视点条件化通常会破坏GAN的训练稳定性，往往导致训练崩溃。为应对这些挑战，我们提出了CGS-GAN，这是一种新颖的3D高斯点渲染GAN框架，能够在不依赖视点条件化的情况下实现稳定的训练和高质量的3D一致合成。为了确保训练稳定性，我们引入了一种多视图正则化技术，该技术在最小的计算开销下增强了生成器的收敛性。此外，我们适应了现有3D高斯点渲染GAN中使用的条件损失，并提出了一种生成器架构，不仅能够稳定训练，还能够促进高效的渲染和简单的扩展，使输出分辨率可达$2048^2$。为了评估CGS-GAN的能力，我们从FFHQ中构建了一个新的数据集。该数据集支持非常高的分辨率，关注人体头部的更大部分，减少了视点依赖的伪影以提高3D一致性，并排除了手或其他物体遮挡主体的图像。因此，我们的方法实现了非常高的渲染质量，由竞争性的FID分数支持，同时确保了3D场景的一致生成。请访问我们的项目页面：https://fraunhoferhhi.github.io/cgs-gan/

Summary / 总结

CGS-GAN addresses the challenge of synthesizing high-resolution human heads by introducing a novel 3D Gaussian splatting GAN framework that avoids the need for view-conditioning, thus maintaining 3D consistency. It achieves this through a multi-view regularization technique and an adapted conditional loss, enabling stable training and high-quality renderings. The approach supports resolutions up to $2048^2$ and is evaluated using a new dataset derived from FFHQ, which improves 3D consistency and reduces view-dependent artifacts, resulting in competitive FID scores.

CGS-GAN通过引入一种新的3D高斯点云生成的GAN框架，避免了视点条件化的需求，从而保持了3D一致性。该方法通过多视图正则化技术和改进的条件损失，实现了稳定的训练和高质量的渲染。该方法支持高达$2048^2$的分辨率，并使用从FFHQ派生的新数据集进行评估，该数据集提高了3D一致性并减少了视点依赖的伪影，最终获得了竞争力的FID分数。

ViSymRe: Vision-guided Multimodal Symbolic Regression

Authors: Da Li, Junping Yin, Jin Xu, Xinxin Li, Juan Zhang

First: 2024-12-15T10:05:31+00:00 · Latest: 2026-01-22T17:29:53+00:00

Abstract

Extracting a simple mathematical expression from an observational dataset to describe complex natural phenomena is one of the core objectives of artificial intelligence (AI). This field is known as symbolic regression (SR). Traditional SR models are based on genetic programming (GP) or reinforcement learning (RL) and face well-known challenges such as low efficiency and overfitting. Recent studies have integrated SR with large language models (LLMs), enabling fast zero-shot inference by learning mappings from millions of dataset-expression pairs. However, since the input and output are inherently different modalities, such models often struggle to converge effectively. In this paper, we introduce ViSymRe, a vision-guided multimodal SR model that incorporates a third resource, the expression graph, to bridge the modality gap. Unlike traditional multimodal models, ViSymRe is trained to extract vision, termed virtual vision, from datasets without relying on the global availability of expression graphs. This addresses the essential challenge of visual SR: expression graphs are not available during inference. Evaluation results on multiple mainstream benchmarks show that ViSymRe achieves more competitive performance than the state-of-the-art dataset-only baselines. The expressions predicted by ViSymRe not only fit the dataset well but are also simple and structurally accurate, goals that SR models strive to achieve.

中文标题/摘要

标题：ViSymRe：视觉引导的多模态符号回归

从观测数据集中提取简单的数学表达式以描述复杂的自然现象是人工智能（AI）的核心目标之一。这一领域被称为符号回归（SR）。传统的SR模型基于遗传编程（GP）或强化学习（RL），面临着低效率和过拟合等已知挑战。最近的研究将SR与大型语言模型（LLMs）结合，通过学习数百万数据集-表达式对之间的映射，实现快速的零样本推理。然而，由于输入和输出是固有的不同模态，这类模型往往难以有效收敛。在本文中，我们介绍了ViSymRe，这是一种视觉引导的多模态SR模型，它结合了表达图这一资源来弥合模态差距。与传统的多模态模型不同，ViSymRe被训练从数据集中提取所谓的虚拟视觉，而无需依赖全局可用的表达图，这解决了视觉SR的基本挑战，即在推理过程中表达图不可用。在多个主流基准上的评估结果表明，ViSymRe在与数据集仅基线相比时，实现了更优的性能。ViSymRe预测的表达式不仅很好地拟合了数据集，而且简单且结构准确，这是SR模型努力实现的目标。

Summary / 总结

ViSymRe is a vision-guided multimodal symbolic regression model that addresses the challenges of traditional symbolic regression methods by incorporating expression graphs. Unlike previous approaches, ViSymRe trains on datasets to extract virtual vision, enabling effective inference without requiring global expression graphs. Experimental results demonstrate that ViSymRe outperforms state-of-the-art dataset-only baselines, producing simple and structurally accurate expressions that fit the datasets well.

ViSymRe 是一种基于视觉的多模态符号回归模型，通过整合表达图来弥合符号回归中的模态差距。与传统多模态模型不同，ViSymRe 在训练时通过数据集提取虚拟视觉，无需全局表达图，从而解决视觉符号回归的核心挑战。实验结果表明，ViSymRe 在多个主流基准上优于最先进的数据集仅基线模型，生成的表达式不仅拟合数据良好，而且简单且结构准确。
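
The stated goal of symbolic regression, expressions that fit the data well while staying simple, can be illustrated by scoring candidates with a fit term plus a complexity penalty. This is a generic illustration, not ViSymRe's method; the candidate set, the complexity counts, and the penalty weight are all assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[Callable[[float], float], int]  # (function, complexity)


def mse(fn: Callable[[float], float], xs: List[float], ys: List[float]) -> float:
    return sum((fn(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def pick_expression(
    candidates: Dict[str, Candidate], xs: List[float], ys: List[float], penalty: float = 0.01
) -> str:
    """Return the candidate that minimizes MSE plus a complexity penalty."""
    def cost(name: str) -> float:
        fn, complexity = candidates[name]
        return mse(fn, xs, ys) + penalty * complexity
    return min(candidates, key=cost)


# Toy data generated from y = x^2.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

candidates = {
    "x": (lambda x: x, 1),
    "x**2": (lambda x: x * x, 3),
    "x**2 + 0.01*sin(x)": (lambda x: x * x + 0.01 * math.sin(x), 8),
}
print(pick_expression(candidates, xs, ys))  # x**2
```

The third candidate fits almost as well as the second but is rejected for its extra terms, which is the kind of trade-off between accuracy and simplicity that SR benchmarks reward.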

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Authors: Tingyu Song, Yanzhao Zhang, Mingxin Li, Zhuoning Guo, Dingkun Long, Pengjun Xie, Siyue Zhang, Yilun Zhao, Shu Wu

First: 2026-01-22T17:26:52+00:00 · Latest: 2026-01-22T17:26:52+00:00

Comments: Under review

Abstract

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

中文标题/摘要

标题：重新思考组合图像检索评估：来自图像编辑的细粒度基准

组合图像检索（CIR）是多模态理解中的一个关键且复杂的任务。当前的CIR基准通常包含有限的查询类别，无法捕捉到现实场景中的多样化需求。为了弥合这种评估差距，我们利用图像编辑实现对修改类型和内容的精确控制，从而建立一个合成查询的管道，涵盖广泛类别。利用这一管道，我们构建了EDIR，这是一个新颖的细粒度CIR基准。EDIR包含5000个高质量的查询，结构化分布在五个主要类别和十五个子类别中。我们对13种多模态嵌入模型的全面评估揭示了显著的能力差距；即使是最先进的模型（如RzenEmbed和GME）也无法在所有子类别中保持一致表现，突显了我们基准的严格性。通过对比分析，我们进一步揭示了现有基准的内在局限性，如模态偏差和类别覆盖不足。此外，一个领域内训练实验展示了我们基准的可行性。该实验通过区分可利用目标数据解决的类别和暴露当前模型架构固有限制的类别，阐明了任务挑战。

Summary / 总结

The paper addresses the limitations of current Composed Image Retrieval (CIR) benchmarks by introducing EDIR, a fine-grained benchmark built with image editing. The pipeline synthesizes 5,000 queries across five main categories and fifteen subcategories. An evaluation of 13 multimodal embedding models reveals a significant capability gap: even state-of-the-art models such as RzenEmbed and GME fail to perform consistently across all subcategories. A comparative analysis also exposes limitations of existing benchmarks, such as modality biases and insufficient categorical coverage, and an in-domain training experiment separates categories that are solvable with targeted data from those that expose intrinsic limitations of current model architectures.

论文通过引入基于图像编辑的细粒度基准EDIR，解决了当前组合图像检索（CIR）基准的局限性。该方法合成跨多个类别的查询，并评估了13种多模态嵌入模型，揭示了显著的能力差距：即使是RzenEmbed和GME等最先进的模型，也无法在所有子类别中保持一致表现。研究结果强调了需要更全面的基准来涵盖多样化的现实场景，并揭示了现有基准和模型架构的内在局限性。

AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs

Authors: Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

First: 2025-11-17T11:45:41+00:00 · Latest: 2026-01-22T17:11:50+00:00

Abstract

Large Audio-Language Models (LALMs) have recently shown impressive progress in speech recognition, audio captioning, and auditory question answering. Yet, whether these models can perceive spatial dynamics, particularly the motion of sound sources, remains unclear. In this work, we uncover a systematic motion perception deficit in current LALMs. To investigate this issue, we introduce AudioMotionBench, the first benchmark explicitly designed to evaluate auditory motion understanding. AudioMotionBench is a controlled question-answering benchmark that evaluates whether LALMs can infer the direction and trajectory of moving sound sources from binaural audio. Comprehensive quantitative and qualitative analyses reveal that current models struggle to reliably recognize motion cues or distinguish directional patterns. The average accuracy remains below 50%, underscoring a fundamental limitation in auditory spatial reasoning. Our study highlights a fundamental gap between human and model auditory spatial reasoning, providing both a diagnostic tool and new insight for enhancing spatial cognition in future Audio-Language Models.

中文标题/摘要

标题：AudioMotionBench：评估音频LLMs的听觉运动感知

大型音频-语言模型（LALMs）在语音识别、音频描述和听觉问答方面最近取得了令人印象深刻的进展。然而，这些模型是否能够感知空间动态，特别是声源的运动，仍然不清楚。在本文中，我们揭示了当前LALMs在运动感知方面存在系统性的缺陷。为了研究这一问题，我们引入了AudioMotionBench，这是第一个明确设计用于评估听觉运动理解的基准。AudioMotionBench是一个受控的问答基准，旨在评估LALMs是否能够从双耳音频中推断出移动声源的方向和轨迹。全面的定量和定性分析表明，当前的模型在可靠地识别运动线索或区分方向模式方面存在困难。平均准确率低于50%，突显了听觉空间推理的基本局限性。我们的研究突显了人类和模型在听觉空间推理方面的根本差距，为未来音频-语言模型的空间认知增强提供了诊断工具和新的见解。

Summary / 总结

This study addresses the limitation of current Large Audio-Language Models (LALMs) in perceiving spatial dynamics, particularly the motion of sound sources. It introduces AudioMotionBench, a benchmark for evaluating auditory motion understanding, and finds that current models struggle to recognize motion cues or distinguish directional patterns, with average accuracy below 50%. This highlights a fundamental gap in auditory spatial reasoning between humans and models, suggesting a need for improvement in this area.

本研究关注当前大型音频语言模型（LALMs）在感知空间动态，特别是声音来源的运动方面的局限性。它引入了AudioMotionBench，一个用于评估听觉运动理解的基准，发现当前模型难以识别运动线索或区分方向模式，平均准确率低于50%。这表明人类和模型在听觉空间推理方面存在根本差距，需要在未来增强空间认知方面进行改进。

Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks

Authors: Abdul Hannan, Daniele Falavigna, Shah Nawaz, Mubashir Noman, Markus Schedl, Alessio Brutti

Venue: ICASSP 2026

First: 2026-01-22T17:11:44+00:00 · Latest: 2026-01-22T17:11:44+00:00

Comments: Accepted at ICASSP 2026

Abstract

Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of the available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, which also reduces overall computational complexity. However, existing $\mathcal{LD}$ methods greatly impact the dynamic model's performance for low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation utilizing well-known speech recognition methods, including conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for high and no dropping cases with $33.3\%$ reduction in training time.
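As a rough illustration of the idea described in this abstract (not the authors' implementation), the sketch below applies layer dropping to a toy residual stack and adds a distillation term that pulls the output of the dropped network toward the output of the full network. The scalar toy layers, the `drop_prob` parameter, and the squared-error distillation loss are assumptions made for illustration; the abstract does not specify any of them.

```python
import random


def forward(layers, x, drop_prob=0.0, rng=None):
    """Run residual layers in order, skipping each one with probability drop_prob."""
    if rng is None:
        rng = random.Random(0)
    for layer in layers:
        if rng.random() < drop_prob:
            continue  # a skipped residual layer behaves like the identity
        x = x + layer(x)
    return x


def distillation_loss(student_out, teacher_out):
    """Hypothetical distillation term: squared error between the two outputs."""
    return (student_out - teacher_out) ** 2


# Toy "network": three scalar residual layers (illustrative only).
layers = [lambda x, w=w: w * x for w in (0.1, 0.2, 0.3)]

teacher_out = forward(layers, 1.0, drop_prob=0.0)  # full model, no dropping
student_out = forward(layers, 1.0, drop_prob=0.5, rng=random.Random(1))
loss = distillation_loss(student_out, teacher_out)
```

In a real system the same weights would serve both passes, so one model can run at several compute budgets by choosing how many layers to skip at inference time.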

中文标题/摘要

标题：基于蒸馏的层舍弃（DLD）：端到端的动态语音网络有效框架

边缘设备在受限和变化的资源环境中运行，需要能够适应可用资源限制的动态架构。为了满足这些需求，通常使用层舍弃（$\mathcal{LD}$）方法将静态模型转换为动态模型，通过跳过网络的一部分来减少整体计算复杂度。然而，现有的$\mathcal{LD}$方法在低和高舍弃情况下极大地影响了动态模型的性能，恶化了性能-计算权衡。为此，我们提出了一种基于蒸馏的层舍弃（DLD）框架，该框架以端到端的方式有效结合了知识蒸馏和$\mathcal{LD}$的能力，从而在动态语音网络中实现了最先进的性能。利用包括conformer和WavLM在内的知名语音识别方法，在三个公开基准上的全面实验表明，我们的框架有效，高舍弃情况下将词错误率降低了$9.32\%$，无舍弃情况下降低了$2.25\%$，同时训练时间减少了$33.3\%$。

Summary / 总结

The research addresses the need for dynamic architectures on edge devices with constrained resources. It proposes a distillation-based layer dropping (DLD) framework that combines knowledge distillation and layer dropping in an end-to-end manner. Experiments with conformer and WavLM on three public benchmarks show that DLD reduces the word error rate by 9.32% and 2.25% for the high and no dropping cases, respectively, with a 33.3% reduction in training time.

论文提出了一种基于蒸馏的层丢弃（DLD）框架，以端到端的方式结合知识蒸馏和层丢弃来改进动态语音网络。该方法在高丢弃和无丢弃情况下分别将词错误率降低了9.32%和2.25%，同时将训练时间减少了33.3%，缓解了现有层丢弃方法在低丢弃和高丢弃情况下性能下降的问题。

Enhanced Climbing Image Nudged Elastic Band method with Hessian Eigenmode Alignment

Authors: Rohit Goswami, Miha Gunde, Hannes Jónsson

First: 2026-01-19T00:21:52+00:00 · Latest: 2026-01-22T17:11:23+00:00

Comments: 25 pages. 11 figures

Abstract

Accurate determination of transition states is central to an understanding of reaction kinetics. Double-endpoint methods where both initial and final states are specified, such as the climbing image nudged elastic band (CI-NEB), identify the minimum energy path between the two and thereby the saddle point on the energy surface that is relevant for the given transition, thus providing an estimate of the transition state within the harmonic approximation of transition state theory. Such calculations can, however, incur high computational costs and may suffer stagnation on exceptionally flat or rough energy surfaces. Conversely, methods that only require specification of an initial set of atomic coordinates, such as the minimum mode following (MMF) method, offer efficiency but can converge on saddle points that are not relevant for the transition of interest. Here, we present an adaptive hybrid algorithm that integrates the CI-NEB with the MMF method so as to get faster convergence to the relevant saddle point. The method is benchmarked for the Baker-Chan (BC) saddle point test set using the PET-MAD machine-learned potential as well as 59 transitions of a heptamer island on Pt(111) from the OptBench benchmark set. A Bayesian analysis of the performance shows a median reduction in energy and force calculations of 46% [95% CrI: -55%, -37%] relative to CI-NEB for the BC set, while a 28% reduction is found for the transitions of the heptamer island. These results establish this hybrid method as a highly effective tool for high-throughput automated chemical discovery of atomic rearrangements.
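Both ingredients named in this abstract share one standard building block: the force component along a chosen direction is inverted so that the system climbs along that direction while relaxing in all others. For a climbing image the direction is the path tangent; for minimum mode following it is the lowest-curvature Hessian eigenmode. The sketch below shows only this textbook force transformation, F_eff = F - 2 (F · v) v. It says nothing about how the paper switches between or aligns the two methods, and the function name `reflect_force` is hypothetical.

```python
def reflect_force(force, direction):
    """Invert the component of `force` along `direction`.

    `direction` is normalised internally, so it need not be a unit vector.
    """
    norm = sum(d * d for d in direction) ** 0.5
    unit = [d / norm for d in direction]
    along = sum(f * u for f, u in zip(force, unit))  # projection F . v
    return [f - 2.0 * along * u for f, u in zip(force, unit)]


# The force points downhill along x; after reflection the x component
# points uphill while the y component is left unchanged.
effective = reflect_force([1.0, 2.0], [1.0, 0.0])
```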

中文标题/摘要

标题：结合海森本征模对齐的增强型爬坡图像微动弹性带方法

准确确定过渡态是理解反应动力学的核心。同时指定初始态和最终态的双端点方法，如爬坡图像微动弹性带（CI-NEB）方法，可以找出两者之间的最低能量路径，从而确定能量面上与给定转变相关的鞍点，并在过渡态理论的谐波近似下给出过渡态的估计。然而，此类计算可能产生高昂的计算成本，并可能在异常平坦或粗糙的能量面上停滞不前。相比之下，仅需指定一组初始原子坐标的方法，如最小模式跟随（MMF）方法，虽然效率更高，但可能会收敛到与所关注的转变无关的鞍点。在此，我们提出了一种自适应混合算法，将CI-NEB方法与MMF方法结合，以更快地收敛到相关鞍点。该方法使用PET-MAD机器学习势对Baker-Chan（BC）鞍点测试集进行了基准测试，并对OptBench基准集中Pt(111)上七聚体岛的59个转变进行了基准测试。贝叶斯性能分析表明，相对于CI-NEB，BC集的能量和力计算次数的中位数减少46% [95% CrI: -55%，-37%]，而七聚体岛的转变则减少28%。这些结果确立了该混合方法作为一种高效工具，可用于原子重排的高通量自动化化学发现。

Summary / 总结

The research aims to improve the accuracy and efficiency of determining transition states in chemical reactions. It introduces an enhanced CI-NEB method combined with MMF for faster convergence to relevant saddle points. The method was benchmarked on the BC test set and the heptamer island transitions, showing a median reduction of 46% in energy and force calculations for the BC set and 28% for the heptamer transitions compared to CI-NEB.

研究旨在提高确定化学反应过渡态的准确性和效率。该方法结合了爬坡图像微动弹性带（CI-NEB）和最小模式跟随（MMF）方法，以更快地收敛到相关鞍点。该混合算法在Baker-Chan（BC）鞍点测试集和Pt(111)上七聚体岛的59个转变上进行了基准测试，结果显示BC集的能量和力计算次数的中位数减少了46%，而七聚体岛转变的计算次数减少了28%。

GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Authors: Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

First: 2025-05-24T15:57:07+00:00 · Latest: 2026-01-22T17:10:05+00:00

Comments: Accepted by NeurIPS2025

Abstract

Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. Furthermore, we also use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.

中文标题/摘要

标题：GenPO：生成式扩散模型与同策略强化学习的结合

强化学习（RL）的最新进展展示了基于生成扩散的策略的强大探索能力和多模态性。虽然在离线RL和异策略RL设置中取得了显著进展，但将扩散策略整合到像PPO这样的同策略框架中仍然鲜有探索。鉴于大规模并行GPU加速模拟器（如IsaacLab）的广泛应用，这一差距尤为重要，因为这些模拟器针对同策略RL算法进行了优化，使得复杂机器人任务的快速训练成为可能。一个关键挑战在于在扩散策略下计算状态-动作对数似然：这对于高斯策略来说很直接，但对于基于流的模型来说难以处理，原因是正向-反向过程不可逆以及存在离散化误差（例如Euler-Maruyama近似）。为了解决这一问题，我们提出了GenPO，一种利用精确扩散反演构建可逆动作映射的生成策略优化框架。GenPO引入了一种新颖的双重虚拟动作机制，通过交替更新实现可逆性，解决了对数似然计算障碍。此外，我们还使用动作对数似然进行无偏的熵和KL散度估计，从而在同策略更新中实现KL自适应学习率和熵正则化。在八个IsaacLab基准测试上的广泛实验，包括腿足运动（Ant、Humanoid、Anymal-D、Unitree H1、Go2）、灵巧操作（Shadow Hand）、空中控制（Quadcopter）和机械臂任务（Franka），证明了GenPO优于现有RL基线。值得注意的是，GenPO是第一个成功将扩散策略整合到同策略RL中的方法，释放了其在大规模并行化训练和实际机器人部署中的潜力。

Summary / 总结

GenPO is a generative policy optimization framework that integrates diffusion policies into on-policy reinforcement learning (RL) frameworks like PPO. It addresses the challenge of computing state-action log-likelihoods for flow-based models by proposing a novel doubled dummy action mechanism, enabling exact diffusion inversion and invertible action mappings. Extensive experiments on various IsaacLab benchmarks show that GenPO outperforms existing RL baselines and is the first method to successfully integrate diffusion policies into on-policy RL, facilitating large-scale parallelized training and real-world robotic deployment.

该论文提出了GenPO，一种将生成扩散策略整合到PPO等同策略强化学习（RL）方法中的生成策略优化框架。它通过提出一种新颖的双重虚拟动作机制，解决了在扩散策略下计算状态-动作对数似然的难题。实验结果表明，GenPO在包括腿足运动、灵巧操作、空中控制和机械臂任务在内的八个IsaacLab基准测试中优于现有RL基线，使扩散策略适用于大规模并行训练和实际机器人部署。

BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

First: 2026-01-21T17:15:22+00:00 · Latest: 2026-01-22T17:01:41+00:00

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
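The conditional PMI between an action and an instruction, given the visual observation, is the log-ratio of the language-conditioned posterior to the vision-only prior: log π(a | v, ℓ) - log p(a | v). The sketch below is a minimal numeric illustration, assuming each branch simply reports the probability it assigns to one chosen action; the function name and the example probabilities are hypothetical, and the actual training objective in the paper may include further terms.

```python
import math


def conditional_pmi(posterior_prob, prior_prob):
    """PMI(a; l | v) = log pi(a | v, l) - log p(a | v)."""
    return math.log(posterior_prob) - math.log(prior_prob)


# Vision shortcut: language does not change the action probability, PMI is zero.
shortcut = conditional_pmi(0.5, 0.5)

# Language-grounded action: the instruction makes the action four times
# more likely than vision alone, so PMI is log(4) > 0.
grounded = conditional_pmi(0.8, 0.2)
```

Maximizing this quantity rewards actions whose probability rises once the instruction is known, which is why a policy that ignores language gains nothing from it.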

Summary / 总结

BayesianVLA addresses the issue of information collapse in Vision-Language-Action models, where language instructions become predictable from visual inputs alone, leading to poor generalization. It introduces a dual-branch architecture with learnable latent action queries to estimate both vision-only and language-conditioned policies, optimizing the policy to maximize the conditional PMI between actions and instructions. This approach improves out-of-distribution generalization, achieving an 11.3% improvement on the OOD SimplerEnv benchmark.

研究旨在解决Vision-Language-Action (VLA)模型在新或复杂场景中泛化能力差的问题，这主要是由于信息坍缩现象。提出的BayesianVLA框架使用贝叶斯分解和潜在动作查询来强制执行指令遵循，并通过最大化动作和指令之间的条件点互信息来优化策略。该方法无需新数据即可提高泛化能力，在SimplerEnv和RoboCasa的实验中表现出显著的改进，特别是在OOD SimplerEnv基准测试中取得了11.3%的改进。

Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Authors: Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman

First: 2025-11-17T21:57:46+00:00 · Latest: 2026-01-22T16:56:57+00:00

Comments: 1 figure, 1 table, Accepted to ICSEE 2026

Abstract

We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
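The core of the idea can be sketched as a group-aware split: once every frame has a cluster label, whole clusters (never individual frames) are assigned to a partition, so near-duplicate frames cannot land on both sides of a split. The sketch below assumes the cluster labels already exist (how frames are embedded and clustered is not shown), and the greedy fill order, the `fractions` default, and the function name are illustrative choices rather than the authors' procedure.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test so no cluster spans two splits."""
    groups = defaultdict(list)
    for frame_idx, cid in enumerate(cluster_ids):
        groups[cid].append(frame_idx)

    clusters = sorted(groups)
    random.Random(seed).shuffle(clusters)

    names = ("train", "val", "test")
    splits = {name: [] for name in names}
    total = len(cluster_ids)
    current = 0
    for cid in clusters:
        # Move to the next split once the current one reached its target size.
        while (current < len(names) - 1
               and len(splits[names[current]]) >= fractions[current] * total):
            current += 1
        splits[names[current]].extend(groups[cid])
    return splits


# Ten frames whose visual clusters were computed elsewhere (toy labels).
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]
splits = split_by_cluster(cluster_ids)
```

Because clusters are indivisible, the resulting split sizes only approximate the requested fractions; with few large clusters the deviation can be substantial.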

中文标题/摘要

标题：找到泄漏，修正划分：基于聚类的方法防止视频衍生数据集中的泄漏

我们提出了一种基于聚类的帧选择策略，以减轻视频衍生帧数据集中信息泄漏的问题。通过在划分训练、验证和测试集之前对视觉上相似的帧进行分组，该方法生成了更具代表性的、更平衡和更可靠的数据集分区。

Summary / 总结

The research aims to reduce information leakage in video-derived datasets by proposing a cluster-based frame selection strategy. This method groups visually similar frames before partitioning them into training, validation, and test sets, resulting in more representative and balanced dataset splits.

研究旨在通过提出基于聚类的帧选择策略来减少视频衍生数据集中的信息泄漏。该方法在将帧分割为训练、验证和测试集之前，先将视觉上相似的帧分组，从而获得更具代表性和平衡的数据集分割。

Multimodal Climate Disinformation Detection: Integrating Vision-Language Models with External Knowledge Sources

Authors: Marzieh Adeli Shamsabad, Hamed Ghodrati

First: 2026-01-22T16:55:48+00:00 · Latest: 2026-01-22T16:55:48+00:00

Abstract

Climate disinformation has become a major challenge in today's digital world, especially with the rise of misleading images and videos shared widely on social media. These false claims are often convincing and difficult to detect, which can delay action on climate change. While vision-language models (VLMs) have been used to identify visual disinformation, they rely only on the knowledge available at the time of training. This limits their ability to reason about recent events or updates. The main goal of this paper is to overcome that limitation by combining VLMs with external knowledge. By retrieving up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, the system can better assess whether an image and its claim are accurate, misleading, false, or unverifiable. This approach improves the model's ability to handle real-world climate disinformation and supports efforts to protect public understanding of science in a rapidly changing information landscape.

中文标题/摘要

标题：多模态气候虚假信息检测：结合视觉-语言模型与外部知识源

气候虚假信息已成为当今数字世界的主要挑战，尤其是在社交媒体上广泛传播误导性的图片和视频的情况下。这些虚假声明往往令人信服且难以识别，这可能会延迟应对气候变化的行动。虽然视觉-语言模型（VLMs）已被用于识别视觉虚假信息，但它们仅依赖于训练时可用的知识。这限制了它们对近期事件或更新进行推理的能力。本文的主要目标是通过结合 VLMs 与外部知识来克服这一限制。通过检索最新的信息，如逆向图像搜索结果、在线事实核查和可信专家内容，该系统可以更好地评估图片及其声明是否准确、误导、虚假或无法验证。这种方法提高了模型处理现实世界气候虚假信息的能力，并支持在快速变化的信息环境中保护公众对科学的理解的努力。

Summary / 总结

This paper addresses the challenge of detecting climate disinformation by integrating vision-language models with external knowledge sources. The method retrieves up-to-date information such as reverse image search results, online fact-checks, and trusted expert content, so that the system can better judge whether an image and its claim are accurate, misleading, false, or unverifiable. According to the authors, this improves the model's ability to handle real-world climate disinformation shared on social media.

研究旨在应对气候虚假信息的挑战，特别是社交媒体上广泛传播的误导性图片和视频。它提出将视觉语言模型与外部知识源结合，以增强检测能力。通过整合最新的信息，如逆向图像搜索结果和在线事实核查，系统可以更准确地评估视觉声明的准确性，从而提高其处理现实世界虚假信息的能力。

YOLO Meets Mixture-of-Experts: Adaptive Expert Routing for Robust Object Detection

Authors: Ori Meiraz, Sharon Shalev, Avishai Weizman

First: 2025-11-17T13:11:11+00:00 · Latest: 2026-01-22T16:55:20+00:00

Comments: 1 figure, 1 table, Accepted to ICSEE 2026

Abstract

This paper presents a novel Mixture-of-Experts framework for object detection, incorporating adaptive routing among multiple YOLOv9-T experts to enable dynamic feature specialization and achieve higher mean Average Precision (mAP) and Average Recall (AR) compared to a single YOLOv9-T model.
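A common way to realize adaptive routing among experts is a softmax gate that weights each expert's output per input. The sketch below shows that generic mechanism on plain feature vectors; it is not the paper's architecture. The abstract does not say whether routing is dense or sparse, where in the detector it is applied, or how the gate is computed, so `gate_logits` and both function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, gate_logits):
    """Blend expert feature vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


# Two toy experts producing 2-D features; equal gate logits average them.
blended = route([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Raising one gate logit shifts the blend toward that expert, which is how a learned gate can specialize experts to different input conditions.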

中文标题/摘要

标题：YOLO与混合专家模型结合：适应性专家路由以实现稳健的目标检测

本文提出了一种新颖的目标检测混合专家框架，结合了多个YOLOv9-T专家之间的自适应路由，以实现动态特征专业化，并且在平均精度（mAP）和平均召回率（AR）方面优于单一的YOLOv9-T模型。

Summary / 总结

This paper introduces a Mixture-of-Experts framework for object detection that uses adaptive routing among multiple YOLOv9-T experts to enhance feature specialization. The method achieves higher mAP and AR compared to a single YOLOv9-T model.

该论文提出了一种混合专家框架用于目标检测，通过在多个YOLOv9-T专家之间进行自适应路由，实现了比单一YOLOv9-T模型更高的mAP和AR。该方法能够动态地进行特征专业化，从而提高目标检测的鲁棒性。

No Mesh, No Problem: Estimating Coral Volume and Surface from Sparse Multi-View Images

Authors: Diego Eustachio Farchione, Ramzi Idoughi, Peter Wonka

First: 2025-09-14T08:52:01+00:00 · Latest: 2026-01-22T16:55:00+00:00

Comments: Reverted to previous version due to clarity issues

Abstract

Effective reef monitoring requires the quantification of coral growth via accurate volumetric and surface area estimates, which is a challenging task due to the complex morphology of corals. We propose a novel, lightweight, and scalable learning framework that addresses this challenge by predicting the 3D volume and surface area of coral-like objects from 2D multi-view RGB images. Our approach utilizes a pre-trained module (VGGT) to extract dense point maps from each view; these maps are merged into a unified point cloud and enriched with per-view confidence scores. The resulting cloud is fed to two parallel DGCNN decoder heads, which jointly output the volume and the surface area of the coral, as well as their corresponding confidence estimates. To enhance prediction stability and provide uncertainty estimates, we introduce a composite loss function based on Gaussian negative log-likelihood in both real and log domains. Our method achieves competitive accuracy and generalizes well to unseen morphologies. This framework paves the way for efficient and scalable coral geometry estimation directly from a sparse set of images, with potential applications in coral growth analysis and reef monitoring.
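To make the loss described in this abstract concrete, the sketch below evaluates a Gaussian negative log-likelihood once on the raw target and once on its logarithm, then mixes the two. The abstract only says the loss is a composite in the real and log domains, so the mixing `weight`, the separate log-domain mean and variance arguments, and the function names are assumptions for illustration.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var))."""
    return 0.5 * (
        log_var
        + (target - mean) ** 2 / math.exp(log_var)
        + math.log(2.0 * math.pi)
    )


def composite_nll(target, mean, log_var, mean_log, log_var_log, weight=0.5):
    """Hypothetical mix of real-domain and log-domain Gaussian NLL.

    `target` must be positive (a volume or surface area), since the
    log-domain term takes its logarithm.
    """
    real_term = gaussian_nll(target, mean, log_var)
    log_term = gaussian_nll(math.log(target), mean_log, log_var_log)
    return weight * real_term + (1.0 - weight) * log_term


perfect = gaussian_nll(1.0, 1.0, 0.0)  # exact prediction, unit variance
off = gaussian_nll(2.0, 1.0, 0.0)      # same variance, wrong mean
```

The predicted log-variance doubles as the uncertainty estimate: a large value softens the penalty for an error but is itself penalized by the `log_var` term.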

中文标题/摘要

标题：无网格，无问题：从稀疏多视角图像估计珊瑚体积和表面积

有效的珊瑚礁监测需要通过准确的体积和表面积估计来量化珊瑚生长，但由于珊瑚的复杂形态，这是一个具有挑战性的任务。我们提出了一种新颖、轻量级且可扩展的学习框架，通过从2D多视角RGB图像预测珊瑚状物体的3D体积和表面积来应对这一挑战。我们的方法利用预训练模块（VGGT）从每个视角提取密集点图；这些图被合并成一个统一的点云，并附有每个视角的置信分数。得到的点云被输入到两个并行的DGCNN解码器头中，这两个头共同输出珊瑚的体积和表面积及其相应的置信估计。为了增强预测稳定性并提供不确定性估计，我们引入了一种基于实数域和对数域高斯负对数似然的复合损失函数。我们的方法达到了有竞争力的准确性，并且能够很好地泛化到未见过的形态。该框架为直接从稀疏图像集估计珊瑚几何形状铺平了道路，在珊瑚生长分析和珊瑚礁监测中具有潜在应用。

Summary / 总结

The research aims to accurately estimate the 3D volume and surface area of corals from sparse multi-view images, addressing the complexity of coral morphology. The method uses a pre-trained VGGT module to extract dense point maps from each view, merges them into a unified point cloud, and enriches it with confidence scores. Two parallel DGCNN decoder heads output the volume and surface area along with their confidence estimates. A composite loss function is introduced to enhance prediction stability and provide uncertainty estimates. The approach achieves competitive accuracy and generalizes well to unseen morphologies, enabling efficient coral geometry estimation from sparse images.

研究旨在通过稀疏多视角图像准确估计珊瑚的3D体积和表面积，解决珊瑚形态复杂的问题。方法使用预训练的VGGT模块从每个视角提取密集的点图，将其合并成统一的点云，并添加视图置信度分数。两个并行的DGCNN解码器头输出体积和表面积及其相应的置信度估计。引入复合损失函数以增强预测稳定性并提供不确定性估计。该方法实现了竞争性的准确性，并能很好地泛化到未见过的形态，从而能够从稀疏图像中高效地估计珊瑚的几何形状。

Benchmarking Deep Learning Models for Raman Spectroscopy Across Open-Source Datasets

Authors: Adithya Sineesh, Akshita Kamsali

First: 2026-01-22T16:54:53+00:00 · Latest: 2026-01-22T16:54:53+00:00

Comments: 17 pages, 3 figures

Abstract

Deep learning classifiers for Raman spectroscopy are increasingly reported to outperform classical chemometric approaches. However, their evaluations are often conducted in isolation or compared against traditional machine learning methods or trivially adapted vision-based architectures that were not originally proposed for Raman spectroscopy. As a result, direct comparisons between existing deep learning models developed specifically for Raman spectral analysis on shared open-source datasets remain scarce. To the best of our knowledge, this study presents one of the first systematic benchmarks comparing three or more published Raman-specific deep learning classifiers across multiple open-source Raman datasets. We evaluate five representative deep learning architectures under a unified training and hyperparameter tuning protocol across three open-source Raman datasets selected to support standard evaluation, fine-tuning, and explicit distribution-shift testing. We report classification accuracies and macro-averaged F1 scores to provide a fair and reproducible comparison of deep learning models for Raman-spectra-based classification.
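For readers unfamiliar with the second metric, macro-averaged F1 computes an F1 score per class and then takes the unweighted mean, so rare classes count as much as common ones. The sketch below is a plain-Python illustration of that definition with toy labels; it is not the benchmark's evaluation code, and the function names are chosen here for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: class 0 has F1 = 2/3, class 1 has F1 = 4/5.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Reporting both numbers matters on imbalanced spectral datasets, where a classifier can reach high accuracy while performing poorly on minority classes.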

中文标题/摘要

标题：跨开源数据集评估拉曼光谱深度学习模型

用于拉曼光谱的深度学习分类器越来越多地被报道优于经典化学计量方法。然而，这些评估通常是在孤立状态下进行的，或者仅与传统机器学习方法或并非为拉曼光谱提出、只经简单改造的视觉架构进行比较。因此，在共享的开源数据集上直接比较专门为拉曼光谱分析开发的现有深度学习模型的研究仍然很少见。据我们所知，本研究是最早在多个开源拉曼数据集上对三个或更多已发表的拉曼专用深度学习分类器进行系统基准比较的工作之一。我们按照统一的训练和超参数调优协议，在三个开源拉曼数据集上评估了五种代表性的深度学习架构，这些数据集的选择是为了支持标准评估、微调和明确的分布偏移测试。我们报告了分类准确率和宏平均F1分数，以便对基于拉曼光谱分类的深度学习模型进行公平且可重复的比较。

Summary / 总结

This study benchmarks deep learning classifiers for Raman spectroscopy by evaluating five representative architectures on three open-source datasets under a unified training and hyperparameter tuning protocol. The motivation is that earlier evaluations were often run in isolation or compared only against traditional machine learning methods and trivially adapted vision architectures. The datasets are chosen to support standard evaluation, fine-tuning, and explicit distribution-shift testing, and results are reported as classification accuracy and macro-averaged F1 score.

本研究在统一的训练和超参数调优协议下，在三个开源数据集上评估了五种代表性的拉曼专用深度学习架构，填补了文献中缺乏直接比较的空白。所选数据集支持标准评估、微调和明确的分布偏移测试，研究报告了分类准确率和宏平均F1分数，以提供公平且可重复的比较。

TDFlow: Agentic Workflows for Test Driven Development

Authors: Kevin Han, Siddharth Maddikayala, Tim Knappe, Om Patel, Austen Liao, Amir Barati Farimani

First: 2025-10-27T18:44:59+00:00 · Latest: 2026-01-22T16:50:52+00:00

Comments: Published in the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026 Main Conference)

Abstract

We introduce TDFlow, a novel test-driven agentic workflow that frames repository-scale software engineering as a test-resolution task, specifically designed to solve human-written tests. Given a set of tests, TDFlow repeatedly proposes, revises, and debugs repository-scale patches using precisely engineered sub-agents and tightly constrained tools. The workflow decomposes software engineering program repair into four components governed by respective sub-agents. This simple, forced decoupling of patch proposing, debugging, patch revision, and optional test generation (1) reduces long-context burden on any individual sub-agent, (2) focuses each sub-agent on specific, pre-defined sub-tasks, and (3) allows for specialized performance improvement on specific sub-tasks. When provided human-written tests, TDFlow attains 88.8% pass rate on SWE-Bench Lite (an absolute improvement of 27.8% over the next best system) and 94.3% on SWE-Bench Verified. Manual inspection of the 800 TDFlow runs within SWE-Bench Lite and Verified uncovers only 7 instances of test hacking, which were subsequently counted as failures. Furthermore, we show that the primary obstacle to human-level software engineering performance lies within writing successful reproduction tests. We envision a human-LLM interactive system powered by TDFlow where human developers write tests solved by LLM systems. Together, these results indicate that modern LLMs, when embedded in a narrowly engineered, test-driven workflow, already achieve human-level test resolution -- with the final frontier for fully autonomous repository repair being the accurate generation of valid reproduction tests.

中文标题/摘要

标题：TDFlow：面向测试驱动开发的代理工作流

我们介绍了TDFlow，这是一种新颖的测试驱动代理工作流，将仓库规模的软件工程表述为一个测试解决任务，专门用于解决人类编写的测试。给定一组测试，TDFlow使用精确设计的子代理和严格受限的工具，反复地提出、修订和调试仓库规模的补丁。该工作流将软件工程中的程序修复分解为四个由相应子代理负责的组件。这种对补丁提出、调试、补丁修订和可选的测试生成的简单强制解耦（1）减少了任何单一子代理的长上下文负担，（2）使每个子代理专注于特定的预定义子任务，（3）允许在特定子任务上进行专门的性能改进。当提供人类编写的测试时，TDFlow在SWE-Bench Lite上达到了88.8%的通过率（比次优系统绝对提升27.8%），在SWE-Bench Verified上达到了94.3%。对SWE-Bench Lite和Verified中的800次TDFlow运行进行人工检查，仅发现了7个测试作弊实例，这些实例随后被计为失败。此外，我们展示了达到人类级软件工程性能的主要障碍在于编写成功的复现测试。我们设想了一个由TDFlow支持的人类-大语言模型交互系统，其中人类开发人员编写测试，由大语言模型系统来解决。这些结果表明，当现代大语言模型嵌入到一个精确设计的测试驱动工作流中时，它们已经达到了人类级的测试解决能力，而完全自主的仓库修复的最后障碍是准确生成有效的复现测试。

Summary / 总结

TDFlow is a test-driven workflow that addresses software engineering tasks by framing them as test-resolution tasks. It uses sub-agents to propose, revise, and debug patches, focusing on specific sub-tasks to improve performance. TDFlow achieves an 88.8% pass rate on SWE-Bench Lite and 94.3% on SWE-Bench Verified, significantly outperforming other systems. Manual inspection found only 7 instances of test hacking, which were counted as failures. The main challenge is writing successful reproduction tests, suggesting that modern LLMs can achieve human-level test resolution when embedded in a narrowly engineered workflow.

TDFlow 是一种测试驱动的工作流，将软件工程任务视为测试解决任务。它使用子代理来提出、修订和调试补丁，专注于特定子任务以提高性能。TDFlow 在 SWE-Bench Lite 上实现了 88.8% 的通过率，在 SWE-Bench Verified 上达到了 94.3%，显著优于其他系统。人工检查发现只有 7 个测试作弊实例，被计为失败。主要挑战在于编写成功的复现测试，表明现代 LLM 在嵌入于精心设计的测试驱动工作流中时，已经可以实现人类级别的测试解决能力。

Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification

Authors: Zack Dewis, Yimin Zhu, Zhengsen Xu, Mabel Heffring, Saeid Taleghanidoozdoozan, Quinn Ledingham, Lincoln Linlin Xu

First: 2026-01-22T16:47:07+00:00 · Latest: 2026-01-22T16:47:07+00:00

Comments: 5 pages, 3 figures

Abstract

Although Mamba models greatly improve Hyperspectral Image (HSI) classification, they face critical challenges in defining efficient and adaptive token sequences to improve performance. This paper therefore presents the CSSMamba (Clustering-guided Spatial-Spectral Mamba) framework to better address these challenges, with the following contributions. First, to achieve efficient and adaptive token sequences for improved Mamba performance, we integrate the clustering mechanism into a spatial Mamba architecture, leading to a cluster-guided spatial Mamba module (CSpaMamba) that reduces the Mamba sequence length and improves Mamba feature learning capability. Second, to improve the learning of both spatial and spectral information, we integrate the CSpaMamba module with a spectral Mamba module (SpeMamba), leading to a complete clustering-guided spatial-spectral Mamba framework. Third, to further improve feature learning capability, we introduce an Attention-Driven Token Selection mechanism to optimize Mamba token sequencing. Last, to integrate clustering into the Mamba model in a coherent manner, we design a Learnable Clustering Module that learns the cluster memberships in an adaptive manner. Experiments on the Pavia University, Indian Pines, and Liao-Ning 01 datasets demonstrate that CSSMamba achieves higher accuracy and better boundary preservation compared to state-of-the-art CNN, Transformer, and Mamba-based methods.

中文标题/摘要

标题：聚类引导的空间光谱Mamba模型在高光谱图像分类中的应用

尽管Mamba模型极大地提高了高光谱图像（HSI）分类的效果，但在定义高效且自适应的标记序列以提高性能方面仍面临重大挑战。因此，本文提出了CSSMamba（聚类引导的空间光谱Mamba）框架，以更好地应对这些挑战，其贡献如下。首先，为了实现高效且自适应的标记序列以提高Mamba性能，我们将聚类机制整合到空间Mamba架构中，从而形成聚类引导的空间Mamba模块（CSpaMamba），该模块减少了Mamba序列长度并提高了Mamba特征学习能力。其次，为了提高空间和光谱信息的学习，我们将CSpaMamba模块与光谱Mamba模块（SpeMamba）结合，形成了完整的聚类引导的空间光谱Mamba框架。第三，为了进一步提高特征学习能力，我们引入了注意力驱动的标记选择机制来优化Mamba标记序列。最后，为了以连贯的方式无缝地将聚类整合到Mamba模型中，我们设计了一个可学习的聚类模块，该模块以自适应方式学习聚类成员身份。在Pavia University、Indian Pines和Liao-Ning 01数据集上的实验表明，CSSMamba在准确性和边界保持方面优于最先进的CNN、Transformer和基于Mamba的方法。

Summary / 总结

The research aims to enhance the performance of Mamba models in hyperspectral image classification by addressing the challenge of defining efficient and adaptive token sequences. The proposed CSSMamba framework integrates clustering into a spatial Mamba architecture, resulting in a cluster-guided spatial Mamba module (CSpaMamba) that reduces sequence length and improves feature learning. By combining CSpaMamba with a spectral Mamba module (SpeMamba), the framework enhances the learning of both spatial and spectral information. Additionally, an Attention-Driven Token Selection mechanism optimizes Mamba token sequencing, and a Learnable Clustering Module adapts cluster memberships. Experiments show that CSSMamba outperforms state-of-the-art methods in terms of accuracy and boundary preservation.

该论文提出了CSSMamba框架，将聚类机制整合到空间-光谱Mamba模型中以提升高光谱图像分类性能。该框架包括一个聚类引导的空间Mamba模块（CSpaMamba），用于减少序列长度和提高特征学习能力，一个光谱Mamba模块（SpeMamba）以更好地学习空间和光谱信息，以及一种注意力驱动的令牌选择机制以优化令牌序列。实验表明，CSSMamba在三个数据集上的准确性和边界保持方面优于最先进的方法。
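
The sequence-shortening idea behind CSpaMamba can be illustrated with a small sketch. This is not the authors' module: the abstract describes a Learnable Clustering Module, whereas the code below swaps in plain k-means, and the names `kmeans_tokens`, `pixels`, and `k` are hypothetical. It only shows how grouping N pixel features into K cluster tokens leaves a sequence model with K items to scan instead of N.

```python
import random


def kmeans_tokens(features, k, iters=10, seed=0):
    """Group feature vectors into k clusters; return (centroids, labels).

    Plain k-means, used here as a stand-in for a learnable clustering module.
    """
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, f in enumerate(features):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
            )
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [features[i] for i in range(len(features)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy per-pixel features (hypothetical values): two obvious groups.
pixels = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
tokens, labels = kmeans_tokens(pixels, k=2)
print(len(pixels), "pixels ->", len(tokens), "cluster tokens")
```

In this toy setting the four pixels collapse to two tokens. How the paper orders, selects, or learns those tokens is not specified in the abstract and is not modelled here.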

Neural Particle Automata: Learning Self-Organizing Particle Dynamics

Authors: Hyunsoo Kim, Ehsan Pajouheshgar, Sabine Süsstrunk, Wenzel Jakob, Jinah Park

First: 2026-01-22T16:46:28+00:00 · Latest: 2026-01-22T16:46:28+00:00

Comments: 15 pages, 15 figures

Abstract

We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scales quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.

中文标题/摘要

标题：神经粒子自动机：学习自组织粒子动力学

我们引入了神经粒子自动机（NPA），这是一种拉格朗日泛化，将神经细胞自动机（NCA）从静态网格扩展到动态粒子系统。与经典的欧拉NCA不同，后者将细胞固定在像素或体素上，NPA将每个细胞建模为具有连续位置和内部状态的粒子，两者均由一个共享的可学习神经规则更新。基于粒子的表述使得细胞具有明确的个体化，允许异质动力学，并仅在活动区域集中计算。同时，粒子系统带来了挑战：邻域是动态的，而局部交互的朴素实现，其开销随粒子数量呈平方级增长。我们用可微分的平滑粒子流体动力学（SPH）算子取代基于网格的邻域感知，并以内存高效、CUDA加速的内核作为支撑，从而应对这些挑战并实现可扩展的端到端训练。在包括形态发生、点云分类和基于粒子的纹理合成等任务中，我们展示了NPA保留了NCA的关键行为，如鲁棒性和自我再生，同时使粒子系统特有的新行为成为可能。这些结果共同将NPA定位为学习自组织粒子动力学的紧凑神经模型。

Summary / 总结

Neural Particle Automata (NPA) is introduced as a Lagrangian approach to model dynamic particle systems, extending the concept of Neural Cellular Automata (NCA) from static lattices. NPA treats each cell as a particle with continuous position and internal state, updated by a shared neural rule. This method overcomes the challenges of dynamic neighborhoods and quadratic scaling by using differentiable Smoothed Particle Hydrodynamics (SPH) operators. Experiments show that NPA retains NCA behaviors like robustness and self-regeneration, while enabling new particle-specific behaviors, making it a compact model for learning self-organizing particle dynamics.

该论文引入了基于拉格朗日方法的神经粒子自动机（NPA），将神经细胞自动机（NCA）的概念从静态网格扩展到动态粒子系统。NPA通过共享的神经规则更新每个粒子的位置和内部状态，允许异质动力学并仅在活动区域进行计算。为解决动态邻域和平方级复杂度的挑战，NPA使用了可微分的平滑粒子流体动力学（SPH）算子，并结合了CUDA加速的内核。实验表明，NPA保留了NCA的鲁棒性和自我再生等特性，同时在形态发生、点云分类和基于粒子的纹理合成等任务中实现了粒子系统特有的新行为。
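
To make the SPH-style neighborhood perception concrete, here is a minimal sketch under stated assumptions. It is the naive quadratic version that the abstract says the authors avoid with CUDA-accelerated kernels, it is not differentiable, and the function names, the choice of the poly6 kernel shape, and the normalized-average form are illustrative choices rather than the paper's operators.

```python
import math


def poly6_kernel(r, h):
    """Compact-support smoothing kernel (unnormalized): (h^2 - r^2)^3 for r < h."""
    return (h * h - r * r) ** 3 if r < h else 0.0


def sph_perceive(positions, states, h):
    """Kernel-weighted average of neighbor states for every particle.

    Naive O(N^2) loop over all pairs; each particle sees only neighbors
    closer than the smoothing radius h (including itself).
    """
    dim = len(states[0])
    perceived = []
    for xi in positions:
        weights = [poly6_kernel(math.dist(xi, xj), h) for xj in positions]
        total = sum(weights)  # > 0 because r = 0 for the particle itself
        perceived.append(
            [sum(w * s[d] for w, s in zip(weights, states)) / total for d in range(dim)]
        )
    return perceived


# Two nearby particles and one isolated particle (hypothetical toy data).
positions = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
states = [[1.0], [3.0], [100.0]]
print(sph_perceive(positions, states, h=1.0))
```

The isolated particle perceives only its own state, while the two nearby particles blend each other's states. In a learned system the perceived vector would feed the shared update rule that moves the particle and changes its internal state.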

SAMTok: Representing Any Mask with Two Words

Authors: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

First: 2026-01-22T16:44:09+00:00 · Latest: 2026-01-22T16:44:09+00:00

Comments: 27 pages, 11 figures

Abstract

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.

中文标题/摘要

标题：SAMTok：用两个词表示任何掩码

像素级能力对于构建交互式智能系统至关重要。然而，像素级多模态LLM（MLLMs）由于复杂的区域级编码器、专门的分割解码器和不兼容的训练目标，难以扩展。为了解决这些挑战，我们提出了SAMTok，这是一种离散掩码分词器，将任何区域掩码转换为两个特殊标记，并利用这些标记高保真地重建掩码。通过将掩码视为新的语言标记，SAMTok使基础MLLM（如QwenVL系列）能够通过标准的下一个标记预测和简单的强化学习学习像素级能力，无需进行架构修改和专门的损失设计。SAMTok基于SAM2构建，使用掩码编码器和残差向量量化器，在2.09亿（209M）个多样化掩码上训练，生成离散、紧凑且信息丰富的标记。通过500万（5M）个SAMTok格式的掩码理解和生成数据样本，QwenVL-SAMTok在区域描述、区域VQA、定位对话（grounded conversation）、指代分割、场景图解析和多轮交互分割上取得了最先进的或可比的结果。我们进一步引入了一种文本答案匹配奖励，使掩码生成的强化学习更加高效，在GRES和GCG基准上取得了显著改进。我们的结果展示了为MLLMs提供强大像素级能力的可扩展且简单的范式。我们的代码和模型已公开。

Summary / 总结

SAMTok is designed to enable pixel-wise capabilities in multi-modal LLMs by converting region masks into two special tokens, allowing base MLLMs to learn these capabilities through standard prediction and reinforcement learning. SAMTok achieves state-of-the-art or comparable results on various tasks including region captioning and segmentation, and introduces a textual reward for efficient mask generation, improving performance on GRES and GCG benchmarks.

SAMTok通过将区域掩码转换为两个特殊标记，使基础MLLM能够通过标准的下一个标记预测和强化学习来学习像素级能力，从而解决像素级多模态LLM的可扩展性挑战。SAMTok在区域描述、区域VQA、定位对话（grounded conversation）、指代分割、场景图解析和多轮交互分割等多种任务上取得了最先进的或可比的结果。此外，它还引入了文本答案匹配奖励，使掩码生成的强化学习更加高效，在GRES和GCG基准上取得了显著改进。
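
The abstract says SAMTok uses a residual vector quantizer and represents a mask with two tokens. A two-stage residual quantizer can be sketched as follows, on the assumption (not confirmed by the abstract) that each of the two tokens is the index chosen at one quantizer stage. The codebooks, the embedding, and all function names are hypothetical toy values; the real system quantizes a learned mask embedding.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(v, codebooks):
    """Quantize v stage by stage; each stage encodes the residual left so far."""
    tokens, residual = [], list(v)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected codeword of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [4.0, 4.0]]
fine = [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]]
embedding = [4.4, 3.6]

tokens = rvq_encode(embedding, [coarse, fine])
print(tokens, rvq_decode(tokens, [coarse, fine]))  # [1, 1] [4.5, 3.5]
```

The first index captures the coarse location of the embedding and the second refines it, which is why a short pair of discrete indices can still reconstruct the input closely.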

Controlling Long-Horizon Behavior in Language Model Agents with Explicit State Dynamics

Authors: Sukesh Subaharan

First: 2026-01-22T16:34:05+00:00 · Latest: 2026-01-22T16:34:05+00:00

Comments: Supplementary materials can be found here: https://github.com/drsukeshs/agent-behavior-ext-dynamics

Abstract

Large language model (LLM) agents often exhibit abrupt shifts in tone and persona during extended interaction, reflecting the absence of explicit temporal structure governing agent-level state. While prior work emphasizes turn-local sentiment or static emotion classification, the role of explicit affective dynamics in shaping long-horizon agent behavior remains underexplored. This work investigates whether imposing dynamical structure on an external affective state can induce temporal coherence and controlled recovery in multi-turn dialogue. We introduce an agent-level affective subsystem that maintains a continuous Valence-Arousal-Dominance (VAD) state external to the language model and governed by first- and second-order update rules. Instantaneous affective signals are extracted using a fixed, memoryless estimator and integrated over time via exponential smoothing or momentum-based dynamics. The resulting affective state is injected back into generation without modifying model parameters. Using a fixed 25-turn dialogue protocol, we compare stateless, first-order, and second-order affective dynamics. Stateless agents fail to exhibit coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis that increase with momentum, revealing a trade-off between stability and responsiveness.

中文标题/摘要

标题：在具有显式状态动力学的语言模型代理中控制长期行为

大型语言模型（LLM）代理在长时间交互中经常表现出语调和人设的突然转变，反映出缺乏管理代理级状态的显式时间结构。尽管先前的工作强调回合局部情感或静态情感分类，但显式情感动力学在塑造长期代理行为中的作用仍未得到充分研究。这项工作探讨了是否可以通过在外部情感状态上施加动力学结构来诱导多轮对话中的时间连贯性和可控恢复。我们引入了一个代理级情感子系统，该系统维护一个连续的愉悦度-唤醒度-支配度（VAD）状态，该状态独立于语言模型，并由一阶和二阶更新规则管理。瞬时情感信号使用固定的无记忆估计器提取，并通过指数平滑或基于动量的动力学随时间累积。由此产生的情感状态在不修改模型参数的情况下注入生成过程。使用固定的25轮对话协议，我们比较了无状态、一阶和二阶情感动力学。无状态代理无法表现出连贯的轨迹或恢复，而状态持续性使代理能够延迟响应并可靠地恢复。二阶动力学引入了随动量增加而增加的情感惯性和滞后，揭示了稳定性和响应性之间的权衡。

Summary / 总结

This work addresses the issue of abrupt shifts in tone and persona in long interactions by introducing an explicit affective state dynamics system. The method involves maintaining a continuous Valence-Arousal-Dominance (VAD) state external to the language model and updating it using first- and second-order dynamics. The study finds that stateless agents fail to show coherent trajectories or recovery, while state persistence enables delayed responses and reliable recovery. Second-order dynamics introduce affective inertia and hysteresis, revealing a trade-off between stability and responsiveness.

这项工作通过引入具有显式状态动态的代理级情感子系统来解决长时间交互中语调和人设突变的问题。该系统使用由一阶和二阶规则更新的Valence-Arousal-Dominance (VAD)状态，并通过指数平滑或动量动态整合情感信号。实验表明，无状态代理缺乏连贯的轨迹和恢复能力，而状态持续性能够实现延迟响应和可靠恢复。二阶动态引入了情感惯性和滞后，揭示了稳定性和响应性之间的权衡。
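
The difference between the first- and second-order rules can be sketched in a few lines. The abstract names exponential smoothing and momentum-based dynamics but does not give the equations, so the update forms, the parameter names `alpha` and `momentum`, and their values below are assumptions chosen as the standard textbook versions. A stateless agent would simply set the state equal to the current signal.

```python
def first_order_step(state, signal, alpha=0.3):
    """Exponential smoothing: move a fraction alpha of the way toward the signal."""
    return [(1 - alpha) * s + alpha * x for s, x in zip(state, signal)]


def second_order_step(state, velocity, signal, alpha=0.3, momentum=0.6):
    """Momentum update: velocity accumulates the pull toward the signal,
    so the state keeps drifting after the signal changes (inertia)."""
    velocity = [
        momentum * v + alpha * (x - s) for v, s, x in zip(velocity, state, signal)
    ]
    state = [s + v for s, v in zip(state, velocity)]
    return state, velocity


# Hypothetical 25-turn run: a negative valence shock for 5 turns, then neutral.
# Each signal is a (valence, arousal, dominance) triple from a memoryless estimator.
signals = [[-1.0, 0.5, 0.0]] * 5 + [[0.0, 0.0, 0.0]] * 20

s1 = [0.0, 0.0, 0.0]
s2, v2 = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for turn, x in enumerate(signals, start=1):
    s1 = first_order_step(s1, x)
    s2, v2 = second_order_step(s2, v2, x)
    if turn in (5, 8, 25):
        print(turn, round(s1[0], 3), round(s2[0], 3))
```

With `momentum=0` the second-order rule reduces to the first-order one. Raising the momentum makes the state slower to turn around after the shock ends, which is the stability versus responsiveness trade-off the abstract describes.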

FedIA: Towards Domain-Robust Aggregation in Federated Graph Learning

Authors: Zhanting Zhou, KaHou Tam, Yiding Feng, Ziqiang Zheng, Zeyu Ma, Yang Yang

First: 2025-09-17T13:04:11+00:00 · Latest: 2026-01-22T16:28:18+00:00

Abstract

Federated Graph Learning (FGL) enables a central server to coordinate model training across distributed clients without local graph data being shared. However, FGL significantly suffers from cross-silo domain shifts, where each "silo" (domain) contains a limited number of clients with distinct graph topologies. These heterogeneities induce divergent optimization trajectories, ultimately leading to global model divergence. In this work, we reveal a severe architectural pathology termed Structural Orthogonality: the topology-dependent message passing mechanism forces gradients from different domains to target disjoint coordinates in the parameter space. Through a controlled comparison between backbones, we statistically prove that GNN updates are near-perpendicular across domains (with projection ratios $\to$ 0). Consequently, naive averaging leads to Consensus Collapse, a phenomenon where sparse, informative structural signals from individual domains are diluted by the near-zero updates of others. This forces the global model into a "sub-optimal" state that fails to represent domain-specific structural patterns, resulting in poor generalization. To address this, we propose FedIA, a lightweight server-side framework designed to reconcile update conflicts without auxiliary communication. FedIA operates in two stages: (1) Global Importance Masking (GIM) identifies a shared parameter subspace to filter out domain-specific structural noise and prevent signal dilution; (2) Confidence-Aware Momentum Weighting (CAM) dynamically re-weights client contributions based on gradient reliability to amplify stable optimization signals.

中文标题/摘要

标题：FedIA：面向联邦图学习中域稳健聚合的研究

联邦图学习（FGL）允许中央服务器协调分布式客户端上的模型训练，而无需共享本地图数据。然而，FGL深受跨孤岛域偏移的困扰，每个“孤岛”（域）包含有限数量具有不同图拓扑结构的客户端。这些异质性导致优化轨迹发散，最终导致全局模型发散。在本文中，我们揭示了一种严重的架构病态，称为结构正交性：依赖于拓扑的消息传递机制迫使来自不同域的梯度指向参数空间中互不相交的坐标。通过对比不同骨干网络的受控实验，我们在统计上证明了GNN更新在域间几乎垂直（投影比→0）。因此，简单的平均会导致共识崩溃，即来自个别域的稀疏而有信息量的结构信号被其他域近乎零的更新稀释的现象。这迫使全局模型进入一个“次优”状态，无法表示域特定的结构模式，导致泛化能力差。为了解决这个问题，我们提出了FedIA，这是一种轻量级的服务器端框架，旨在在不使用辅助通信的情况下调和更新冲突。FedIA分两个阶段运行：（1）全局重要性掩码（GIM）识别一个共享参数子空间，以过滤掉域特定的结构噪声并防止信号稀释；（2）置信度感知动量加权（CAM）基于梯度可靠性动态重新加权客户端贡献，以放大稳定的优化信号。

Summary / 总结

The research addresses the issue of domain shifts in Federated Graph Learning (FGL), where different domains have distinct graph topologies leading to divergent optimization trajectories. The study identifies a problem called Structural Orthogonality, where GNN updates are nearly perpendicular across domains, causing consensus collapse and poor generalization. To solve this, the paper introduces FedIA, a framework that includes Global Importance Masking (GIM) to filter out domain-specific noise and Confidence-Aware Momentum Weighting (CAM) to re-weight client contributions based on gradient reliability, thereby improving model robustness.

研究针对联邦图学习（FGL）中的域偏移问题，识别出结构正交性这一问题，即不同域的梯度更新几乎垂直，导致泛化能力差和共识崩塌。为了解决这一问题，论文提出FedIA框架，包括全局重要性掩码来过滤掉域特定的噪声，以及基于梯度可靠性动态重新加权客户端贡献，从而提高全局模型的鲁棒性。
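
The dilution effect described as Consensus Collapse can be reproduced on a toy example. The sketch below is only a diagnostic, not FedIA itself: the abstract does not define the projection ratio, so it is taken here to be the absolute cosine similarity, and the update vectors and function names are hypothetical.

```python
import math


def projection_ratio(u, v):
    """|<u, v>| / (||u|| ||v||): 0 for perpendicular updates, 1 for parallel ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return abs(dot) / norm if norm > 0 else 0.0


def naive_average(updates):
    """FedAvg-style unweighted mean of client updates."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


# Two domains whose updates touch disjoint coordinates (hypothetical values).
domain_a = [1.0, 0.8, 0.0, 0.0]
domain_b = [0.0, 0.0, 0.9, 1.1]

print(projection_ratio(domain_a, domain_b))
print(naive_average([domain_a, domain_b]))
```

Because the two updates share no coordinates, their ratio is zero and averaging halves every informative entry. With K such domains each signal would shrink by a factor of K, which is the situation the masking and re-weighting stages are meant to counter.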

Probably Approximately Correct Maximum A Posteriori Inference

Authors: Matthew Shorvon, Frederik Mallmann-Trenn, David S. Watson

First: 2026-01-22T16:28:01+00:00 · Latest: 2026-01-22T16:28:01+00:00

Comments: 7 pages main text, 16 total, 3 figures

Abstract

Computing the conditional mode of a distribution, better known as the $\mathit{maximum\ a\ posteriori}$ (MAP) assignment, is a fundamental task in probabilistic inference. However, MAP estimation is generally intractable, and remains hard even under many common structural constraints and approximation schemes. We introduce $\mathit{probably\ approximately\ correct}$ (PAC) algorithms for MAP inference that provide provably optimal solutions under variable and fixed computational budgets. We characterize tractability conditions for PAC-MAP using information theoretic measures that can be estimated from finite samples. Our PAC-MAP solvers are efficiently implemented using probabilistic circuits with appropriate architectures. The randomization strategies we develop can be used either as standalone MAP inference techniques or to improve on popular heuristics, fortifying their solutions with rigorous guarantees. Experiments confirm the benefits of our method in a range of benchmarks.

中文标题/摘要

标题：大概率近似正确最大后验概率推理

计算分布的条件众数，也就是最大后验概率（MAP）赋值，是概率推理中的一个基本任务。然而，MAP估计通常是难解的，即使在许多常见的结构约束和近似方案下也依然困难。我们引入了大概率近似正确（PAC）的MAP推理算法，在可变和固定计算预算下提供可证明最优的解。我们使用信息论度量来表征PAC-MAP的可解性条件，这些度量可以从有限样本中估计出来。我们的PAC-MAP求解器通过使用具有适当架构的概率电路高效实现。我们开发的随机化策略既可以作为独立的MAP推理技术，也可以用来改进流行的启发式方法，为其解决方案提供严格的保证。实验结果证实了我们方法在一系列基准测试中的优势。

Summary / 总结

The paper addresses the intractability of maximum a posteriori (MAP) inference by introducing probably approximately correct (PAC) algorithms that provide provably optimal solutions under variable and fixed computational budgets. The tractability of PAC-MAP is characterized using information-theoretic measures that can be estimated from finite samples, and the solvers are implemented using probabilistic circuits. The randomization strategies can be used on their own or to strengthen popular heuristics with rigorous guarantees, and experiments confirm the benefits of the method across a range of benchmarks.

论文针对最大后验（MAP）推理的难解性，提出了大概率近似正确（PAC）算法，能够在给定的计算预算内提供可证明最优的解。论文通过信息论度量来表征PAC-MAP的可解性条件，并使用概率电路实现算法。实验结果证实了该方法在多种基准测试中的优势。
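
One generic way to obtain a PAC-style guarantee by randomization is best-of-m sampling, sketched below. This is a textbook argument and not necessarily the paper's algorithm: if each draw lands in the highest-probability region holding a fraction `eps` of the posterior mass with probability at least `eps`, then after m draws with (1 - eps)^m <= delta at least one draw lands there with probability at least 1 - delta, and the best draw is at least as probable as that one. The function names and the toy posterior are hypothetical.

```python
import math
import random


def sample_budget(eps, delta):
    """Smallest m with (1 - eps)^m <= delta, for eps and delta in (0, 1)."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))


def best_of_samples(sampler, log_prob, eps, delta):
    """Draw the budgeted number of samples and keep the most probable one."""
    m = sample_budget(eps, delta)
    return max((sampler() for _ in range(m)), key=log_prob)


# Toy posterior over two binary variables (hypothetical numbers).
posterior = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
rng = random.Random(0)
assignments = list(posterior)
weights = list(posterior.values())

estimate = best_of_samples(
    sampler=lambda: rng.choices(assignments, weights=weights)[0],
    log_prob=lambda x: math.log(posterior[x]),
    eps=0.1,
    delta=0.05,
)
print(sample_budget(0.1, 0.05), estimate)
```

The budget depends only on `eps` and `delta`, not on the size of the assignment space, which is what makes this kind of guarantee attractive when exact MAP is intractable. In practice the sampler and the probability evaluation would come from a tractable model such as a probabilistic circuit.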

Masked Modeling for Human Motion Recovery Under Occlusions

Authors: Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang

First: 2026-01-22T16:22:20+00:00 · Latest: 2026-01-22T16:22:20+00:00

Comments: Project page: https://mikeqzy.github.io/MoRo

Abstract

Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from a set of heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses motion and pose priors, finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on-par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.

中文标题/摘要

标题：遮罩建模在遮挡下的人体运动恢复

从单目视频中重建人体运动是计算机视觉中的基本挑战，广泛应用于AR/VR、机器人技术和数字内容创作，但在现实世界中频繁的遮挡下仍然具有挑战性。现有的基于回归的方法效率高但对缺失观测数据敏感，而基于优化和扩散的方法虽然提高了鲁棒性，但推理速度慢且需要大量的预处理步骤。为了解决这些限制，我们利用生成式遮罩建模的最新进展，提出了MoRo：面向遮挡下人体运动恢复的遮罩建模。MoRo是一个对遮挡鲁棒的端到端生成框架，将运动重建公式化为视频条件任务，并从RGB视频中高效地恢复一致全局坐标系下的人体运动。通过遮罩建模，MoRo自然地处理遮挡，同时实现高效的端到端推理。为了克服配对视频-运动数据的稀缺性，我们设计了一种跨模态学习方案，从一组异构数据集中学习多模态先验：(i) 在MoCap数据集上训练的轨迹感知运动先验，(ii) 在图像-姿态数据集上训练的图像条件姿态先验，捕捉每帧的多样化姿态，(iii) 融合运动和姿态先验的视频条件遮罩Transformer，在视频-运动数据集上微调，以将视觉线索与运动动力学结合起来进行鲁棒推理。在EgoBody和RICH上的广泛实验表明，MoRo在遮挡下的准确性和运动真实性方面显著优于最先进的方法，而在非遮挡场景下表现相当。MoRo在单个H200 GPU上实现了每秒70帧的实时推理。

Summary / 总结

The paper addresses the challenge of human motion recovery from monocular videos under occlusions, which is crucial for applications like AR/VR and robotics. It introduces MoRo, a masked modeling framework that formulates motion reconstruction as a video-conditioned task, enabling efficient and robust motion recovery. MoRo uses a cross-modality learning scheme to integrate motion and pose priors from different datasets, achieving superior performance in occluded scenarios while maintaining competitive results in non-occluded settings. Experiments show MoRo outperforms existing methods in accuracy and motion realism under occlusions, with real-time inference capabilities.

MoRo 是一种基于掩码建模的人体运动恢复框架，适用于单目视频，通过将运动重建视为视频条件任务，高效地在一致的全局坐标系中恢复人体运动。MoRo 在遮挡情况下的准确性和运动真实感方面优于现有方法，同时在非遮挡场景中保持相当的性能，实现实时推理，每秒 70 帧，在单个 H200 GPU 上运行。