Utonia: Toward One Encoder for All Point Clouds
Authors: Yujia Zhang, Xiaoyang Wu, Yunhan Yang, Xianzhe Fan, Han Li, Yuechen Zhang, Zehao Huang, Naiyan Wang, Hengshuang Zhao
First: 2026-03-03T18:59:58+00:00 · Latest: 2026-03-03T18:59:58+00:00
Comments: produced by Pointcept, project page: https://pointcept.github.io/Utonia
Abstract
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
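Chinese title/abstract (English translation)
Title: Utonia: A Step Toward a Universal Point Cloud Encoder
We dream of a future in which point clouds from all domains come together to shape a single model that benefits every one of them. To this end, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, including remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their different sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability and reveals intriguing emergent behaviors that appear only when the domains are trained jointly. Beyond perception, we observe that Utonia representations also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating the features into vision-language models yields gains in spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data and support downstream applications such as AR/VR, robotics, and autonomous driving.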
MIBURI: Towards Expressive Interactive Gesture Synthesis
Authors: M. Hamza Mughal, Rishabh Dabral, Vera Demberg, Christian Theobalt
Venue: CVPR 2026
First: 2026-03-03T18:59:51+00:00 · Latest: 2026-03-03T18:59:51+00:00
Comments: CVPR 2026. Project page: https://vcai.mpi-inf.mpg.de/projects/MIBURI/
Abstract
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing solutions for ECAs often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and part-level motion hierarchy in real time. Further, we introduce auxiliary objectives to encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal and real-time approach produces natural and contextually aligned gestures when compared against recent baselines. We urge the reader to explore demo videos on https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
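Chinese title/abstract (English translation)
Title: MIBURI: Toward Expressive Interactive Gesture Synthesis
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current conversational agents based on large language models (LLMs) lack embodiment and the expressive gestures that natural interaction requires. Existing ECA solutions often produce rigid, low-diversity motions that are unsuitable for human-like interaction. By contrast, generative methods for co-speech gesture synthesis produce natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We use body-part-aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then generated autoregressively by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling temporal dynamics and the part-level motion hierarchy in real time. In addition, we introduce auxiliary objectives that encourage expressive and diverse gestures while preventing convergence to static poses. Comparative evaluations show that our causal, real-time approach produces natural and contextually aligned gestures when compared with recent baselines. We urge the reader to view the demo videos at https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
Summary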
MIBURI is an online causal framework designed to generate expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue, addressing the limitations of existing solutions that produce rigid and low-diversity motions. It uses body-part aware gesture codecs to encode hierarchical motion details into discrete tokens, which are autoregressively generated by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings. Comparative evaluations show that MIBURI produces natural and contextually aligned gestures more effectively than recent baselines.
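MIBURI is an online causal framework that generates expressive full-body gestures and facial expressions synchronized with spoken dialogue in real time, addressing the limitations of current LLM-based conversational agents. It uses body-part-aware gesture codecs to encode hierarchical motion details into multi-level discrete tokens, which a two-dimensional causal framework generates autoregressively, conditioned on LLM-based speech-text embeddings, so that temporal dynamics and the part-level motion hierarchy are modeled in real time. Comparative evaluations show that MIBURI's natural, contextually relevant gestures outperform recent baselines, with better expressiveness and diversity.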
How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Authors: Toru Lin, Shuying Deng, Zhao-Heng Yin, Pieter Abbeel, Jitendra Malik
First: 2026-03-03T18:59:32+00:00 · Latest: 2026-03-03T18:59:32+00:00
Comments: Project page can be found at https://toruowo.github.io/peel
Abstract
Many essential manipulation tasks - such as food preparation, surgery, and craftsmanship - remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics, but also by their "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (e.g. how well a potato is peeled), making quantitative evaluation and reward engineering difficult. We present a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy via force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning using a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves over 90% average success rates on challenging produce including cucumbers, apples, and potatoes, with performance improving by up to 40% through preference-based finetuning. Remarkably, policies trained on a single produce category exhibit strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from different categories while maintaining over 90% success rates.
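Chinese title/abstract (English translation)
Title: How to Peel with a Knife: Aligning Fine-Grained Manipulation with Human Preference
Many important manipulation tasks, such as food preparation, surgery, and craftsmanship, remain intractable for autonomous robots. These tasks are characterized not only by contact-rich, force-sensitive dynamics but also by "implicit" success criteria: unlike pick-and-place, task quality in these domains is continuous and subjective (for example, how well a potato is peeled), which makes quantitative evaluation and reward engineering difficult. We propose a learning framework for such tasks, using peeling with a knife as a representative example. Our approach follows a two-stage pipeline: first, we learn a robust initial policy through force-aware data collection and imitation learning, enabling generalization across object variations; second, we refine the policy through preference-based finetuning with a learned reward model that combines quantitative task metrics with qualitative human feedback, aligning policy behavior with human notions of task quality. Using only 50-200 peeling trajectories, our system achieves average success rates above 90% on challenging produce including cucumbers, apples, and potatoes, and preference-based finetuning improves performance by up to 40%. Notably, policies trained on a single produce category show strong zero-shot generalization to unseen in-category instances and to out-of-distribution produce from other categories while maintaining success rates above 90%.
Summary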
The paper addresses the challenge of performing fine-grained manipulation tasks like peeling with a knife, which are difficult for autonomous robots due to their complex dynamics and subjective success criteria. It proposes a two-stage learning framework: first, a robust initial policy is learned through force-aware data collection and imitation learning, allowing for generalization across different objects; second, the policy is refined using preference-based fine-tuning with a learned reward model that combines quantitative metrics and qualitative human feedback. The system achieves over 90% success rates on various produce with up to 40% improvement through fine-tuning, demonstrating strong zero-shot generalization to unseen instances and different categories.
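This paper addresses why fine-grained manipulation tasks such as peeling with a knife are hard for autonomous robots: their success criteria are continuous and subjective. The authors propose a two-stage learning framework. First, force-aware data collection and imitation learning produce a robust initial policy that generalizes across objects; then a reward model that combines quantitative task metrics with qualitative human feedback refines the policy. The approach achieves success rates above 90% on various produce, improves performance by up to 40% through preference-based refinement, and shows strong zero-shot generalization to unseen instances and to out-of-distribution produce from other categories.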
ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation
Authors: Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui
First: 2026-03-03T18:59:29+00:00 · Latest: 2026-03-03T18:59:29+00:00
Comments: Project Page: https://ultra-humanoid.github.io/
Abstract
Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
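Chinese title/abstract (English translation)
Title: ULTRA: Unified Multimodal Control for Humanoid Whole-Body Loco-Manipulation
Achieving autonomous and versatile whole-body loco-manipulation remains a key barrier to making humanoid robots practically useful. Existing approaches, however, are fundamentally limited: retargeted data are often scarce or of low quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing that ranges from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness in out-of-distribution scenarios. This enables coordinated whole-body behavior without reference motions at test time. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA achieves autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.
Summary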
The research aims to enable autonomous and versatile whole-body locomotion and manipulation for humanoids. The method involves a unified framework with a physics-driven neural retargeting algorithm and a unified multimodal controller. Key findings show that ULTRA can generalize to autonomous, goal-conditioned whole-body behavior from egocentric perception, outperforming tracking-only baselines with limited skills in both simulation and real-world testing on a Unitree G1 humanoid.
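The research aims to give humanoid robots autonomous, flexible whole-body locomotion and manipulation. The proposed ULTRA framework comprises a physics-driven neural retargeting algorithm and a unified multimodal controller. It translates large-scale motion capture to humanoid robots and supports both dense references and sparse task specifications. Experiments show that ULTRA outperforms tracking-only baselines on autonomous, goal-conditioned whole-body behavior and can operate autonomously from egocentric perception.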
Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping
Authors: William Liang, Sam Wang, Hung-Ju Wang, Osbert Bastani, Yecheng Jason Ma, Dinesh Jayaraman
Venue: ICLR
First: 2026-03-03T18:59:07+00:00 · Latest: 2026-03-03T18:59:07+00:00
Comments: International Conference on Learning Representations (ICLR), 2026. Project website and code: https://tether-research.github.io
Abstract
The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
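Chinese title/abstract (English translation)
Title: Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping
The ability to interact and to learn from experience is a central challenge in robotics, and it offers a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy that is robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (<=10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is highly data-efficient and robust, even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world through a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates large amounts of high-quality data while reducing human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that steadily improves the performance of closed-loop imitation policies, ultimately yielding over 1000 expert-level trajectories and training policies that are competitive with those learned from human-collected demonstrations.
Summary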
Tether is a method for autonomous functional play in robotics, addressing the challenges of robust policy and continuous useful experience generation. It uses a novel open-loop policy that warps actions from a few source demonstrations based on semantic keypoint correspondences, showing high data efficiency and robustness. Tether continuously selects tasks, executes them, evaluates outcomes, and improves, generating diverse high-quality datasets with minimal human intervention. This method enables hours of autonomous multi-task play in a household-like setup, improving closed-loop imitation policies and producing over 1000 expert-level trajectories.
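Tether is a method for autonomous functional play that targets two challenges: policy robustness and the continuous generation of useful experience. It uses a novel open-loop policy that warps actions from a small number of source demonstrations by anchoring them to semantic keypoint correspondences in the target scene, which proves data-efficient and robust. Through a continuous cycle of task selection, execution, evaluation, and improvement guided by vision-language models, Tether generates diverse, high-quality datasets without extensive human intervention. It sustains long periods of autonomous multi-task play in a household-like multi-object setup, improves the performance of closed-loop imitation policies, and produces over 1000 expert-level trajectories.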
Beyond Language Modeling: An Exploration of Multimodal Pretraining
Authors: Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
First: 2026-03-03T18:58:00+00:00 · Latest: 2026-03-03T18:58:00+00:00
Comments: Project website at https://beyond-llms.github.io/
Abstract
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
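Chinese title/abstract (English translation)
Title: Beyond Language Modeling: An Exploration of Multimodal Pretraining
The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments that isolate the key factors of multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, and train on data that includes text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) the Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and jointly benefit downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) the Mixture-of-Experts (MoE) architecture makes multimodal scaling both efficient and effective while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is more data-hungry than language. We show that the MoE architecture reconciles this asymmetry by providing the high model capacity that language requires while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
Summary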
The research aims to advance foundation models by exploring multimodal pretraining beyond language modeling. The study uses the Transfusion framework with next-token prediction for language and diffusion for vision, training on diverse data. Key findings include the superiority of Representation Autoencoder in unified visual representation, complementary benefits of visual and language data, emergence of world modeling capabilities from unified pretraining, and the efficiency of Mixture-of-Experts in handling the data-intensive nature of vision and model capacity requirements of language.
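The research aims to advance foundation models by exploring multimodal pretraining beyond language modeling. It uses the Transfusion framework, with next-token prediction for language and diffusion for vision, and trains on diverse data. Key findings include the superiority of the Representation Autoencoder as a unified visual representation, the complementarity of visual and language data, the natural emergence of world modeling capabilities from unified multimodal pretraining, and the efficiency of the Mixture-of-Experts architecture in handling both the data-intensive nature of vision and the model capacity demands of language.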
Gravity Falls: A Comparative Analysis of Domain-Generation Algorithm (DGA) Detection Methods for Mobile Device Spearphishing
Authors: Adam Dorian Wong, John D. Hastings
First: 2026-03-03T18:55:44+00:00 · Latest: 2026-03-03T18:55:44+00:00
Comments: Disclaimer: The views expressed are those of the authors and do not necessarily reflect the official policy or position of the U.S. Department of Defense or the U.S. Government. References to external sites do not constitute endorsement. Cleared for release on 24 FEB 2026 (DOPSR 26-T-0771). Gravity Falls Dataset DOI: 10.5281/zenodo.17624554
Abstract
Mobile devices are frequent targets of eCrime threat actors through SMS spearphishing (smishing) links that leverage Domain Generation Algorithms (DGA) to rotate hostile infrastructure. Despite this, DGA research and evaluation largely emphasize malware C2 and email phishing datasets, leaving limited evidence on how well detectors generalize to smishing-driven domain tactics outside enterprise perimeters. This work addresses that gap by evaluating traditional and machine-learning DGA detectors against Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. Gravity Falls captures a single threat actor's evolution across four technique clusters, shifting from short randomized strings to dictionary concatenation and themed combo-squatting variants used for credential theft and fee/fine fraud. Two string-analysis approaches (Shannon entropy and Exp0se) and two ML-based detectors (an LSTM classifier and COSSAS DGAD) are assessed using Top-1M domains as benign baselines. Results are strongly tactic-dependent: performance is highest on randomized-string domains but drops on dictionary concatenation and themed combo-squatting, with low recall across multiple tool/cluster pairings. Overall, both traditional heuristics and recent ML detectors are ill-suited for consistently evolving DGA tactics observed in Gravity Falls, motivating more context-aware approaches and providing a reproducible benchmark for future evaluation.
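Chinese title/abstract (English translation)
Title: Gravity Falls: A Comparative Analysis of DGA Detection Methods for Mobile Device Phishing
Mobile devices are frequent targets of eCrime threat actors, who send SMS phishing (smishing) links that use Domain Generation Algorithms (DGA) to rotate hostile infrastructure. Despite this, DGA research and evaluation focus mainly on malware C2 and email phishing datasets, leaving little evidence on how well detectors generalize to smishing-driven domain tactics outside enterprise perimeters. This work fills that gap by evaluating traditional and machine-learning DGA detectors on Gravity Falls, a new semi-synthetic dataset derived from smishing links delivered between 2022 and 2025. Gravity Falls captures a single threat actor's evolution across four technique clusters, from randomized strings to dictionary concatenation and themed combo-squatting variants used for credential theft and fine fraud. Two string-analysis approaches (Shannon entropy and Exp0se) and two ML-based detectors (an LSTM classifier and COSSAS DGAD) are assessed using Top-1M domains as benign baselines. Results are strongly tactic-dependent: performance is best on randomized-string domains but worse on dictionary concatenation and themed combo-squatting, with low recall across multiple tool/cluster pairings. Overall, neither traditional heuristics nor recent ML detectors are well suited to the continually evolving DGA tactics observed in Gravity Falls, which motivates more context-aware approaches and provides a reproducible benchmark for future evaluation.
Summary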
This study evaluates traditional and machine-learning DGA detectors using a new semi-synthetic dataset, Gravity Falls, which captures smishing links from 2022 to 2025. The dataset includes four technique clusters evolving from randomized strings to themed combo-squatting. Results show that performance varies significantly depending on the tactic, with high performance on randomized-string domains but poor performance on dictionary concatenation and themed combo-squatting domains. Both traditional heuristics and ML detectors are found to be inadequate for evolving DGA tactics, highlighting the need for context-aware approaches.
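The study evaluates traditional and machine-learning DGA detectors on Gravity Falls, a new semi-synthetic dataset that covers smishing links from 2022 to 2025 and comprises four technique clusters evolving from randomized strings to themed combo-squatting. Results show that performance differs markedly by tactic: detectors do well on randomized-string domains but poorly on dictionary concatenation and themed combo-squatting domains. Neither traditional heuristics nor ML detectors are suited to evolving DGA tactics, which underlines the need for context-aware approaches.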
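The tactic dependence reported above is easy to see for the entropy heuristic. The following is a minimal sketch of a Shannon-entropy check on a domain's leftmost label, not the configuration evaluated in the paper: the function names, the 3.5-bit threshold, and the choice to score only the first label are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_random(domain: str, threshold: float = 3.5) -> bool:
    """Flag a domain whose leftmost label has high character entropy.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    label = domain.lower().split(".")[0]
    return shannon_entropy(label) >= threshold
```

A heuristic of this kind scores character-level randomness only, so labels built from dictionary words or themed keywords can fall below any threshold that keeps false positives on benign domains tolerable, which is consistent with the low recall the abstract reports for those clusters.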
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, Deqing Sun
First: 2026-03-03T18:55:37+00:00 · Latest: 2026-03-03T18:55:37+00:00
Comments: Project page: https://LoGeR-project.github.io/
Abstract
Feedforward geometric foundation models achieve strong short-window reconstruction, yet scaling them to minutes-long videos is bottlenecked by quadratic attention complexity or limited effective memory in recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, leveraging strong bidirectional priors for high-fidelity intra-chunk reasoning. To manage the critical challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory to anchor the global coordinate frame and prevent scale drift, alongside a non-parametric Sliding Window Attention (SWA) mechanism to preserve uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture enables LoGeR to be trained on sequences of 128 frames, and generalize up to thousands of frames during inference. Evaluated across standard benchmarks and a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods--reducing ATE on KITTI by over 74%--and achieves robust, globally consistent reconstruction over unprecedented horizons.
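Chinese title/abstract (English translation)
Title: LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory
Feedforward geometric foundation models perform well on short-window reconstruction, but scaling them to minutes-long videos is limited by quadratic attention complexity or by the limited effective memory of recurrent designs. We present LoGeR (Long-context Geometric Reconstruction), a novel architecture that scales dense 3D reconstruction to extremely long sequences without post-optimization. LoGeR processes video streams in chunks, using strong bidirectional priors for high-fidelity intra-chunk reasoning. To address the key challenge of coherence across chunk boundaries, we propose a learning-based hybrid memory module. This dual-component system combines a parametric Test-Time Training (TTT) memory, which anchors the global coordinate frame and prevents scale drift, with a non-parametric Sliding Window Attention (SWA) mechanism, which preserves uncompressed context for high-precision adjacent alignment. Remarkably, this memory architecture allows LoGeR to be trained on sequences of 128 frames and to generalize to thousands of frames at inference. Evaluated on standard benchmarks and on a newly repurposed VBR dataset with sequences of up to 19k frames, LoGeR substantially outperforms prior state-of-the-art feedforward methods, reducing ATE on KITTI by over 74%, and achieves robust, globally consistent reconstruction over unprecedented horizons.
Summary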
LoGeR is designed to perform long-context geometric reconstruction for extremely long video sequences by processing them in chunks and using a hybrid memory module. This module includes a parametric Test-Time Training memory to maintain the global coordinate frame and a non-parametric Sliding Window Attention mechanism to preserve context. LoGeR can be trained on sequences of 128 frames and generalize to thousands of frames, significantly outperforming previous methods on benchmarks and a newly repurposed VBR dataset with sequences up to 19k frames, reducing ATE on KITTI by over 74%.
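LoGeR scales dense 3D reconstruction by processing long video sequences in chunks and using a hybrid memory module to keep chunk boundaries coherent. The architecture combines a parametric Test-Time Training memory with a non-parametric Sliding Window Attention mechanism, which lets LoGeR train on 128 frames and generalize to thousands of frames at inference. Experiments show that LoGeR reduces absolute trajectory error (ATE) on KITTI by over 74% and achieves robust, globally consistent reconstruction over long horizons.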
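To make the sliding-window half of the hybrid memory concrete, here is a generic causal sliding-window attention mask in plain Python. It is only a sketch of the general technique: the abstract does not say how LoGeR defines its window, so the function name, the unit of attention (tokens or chunks), and the window size are assumptions.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask.

    Entry [q][k] is True when query position q may attend to key position k,
    that is, when k is one of the `window` most recent positions up to q.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Each position sees itself and the one position before it.
mask = sliding_window_mask(seq_len=5, window=2)
```

Because each query attends to at most `window` keys, attention cost grows linearly with sequence length instead of quadratically, which is the usual motivation for pairing a local window with a separate long-range memory.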
DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
Authors: Yufu Wang, Evonne Ng, Soyong Shin, Rawal Khirodkar, Yuan Dong, Zhaoen Su, Jinhyung Park, Kris Kitani, Alexander Richard, Fabian Prada, Michael Zollhofer
Venue: CVPR 2026
First: 2026-03-03T18:54:17+00:00 · Latest: 2026-03-03T18:54:17+00:00
Comments: CVPR 2026. Project page: https://yufu-wang.github.io/duomo/
Abstract
We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires solving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general, generating the motion of mesh vertices directly and bypassing parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method obtains a 16% reduction in world-space reconstruction error while maintaining low foot skating. On RICH, it obtains a 30% reduction in world-space error. Project page: https://yufu-wang.github.io/duomo/
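Chinese title/abstract (English translation)
Title: DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction
We present DuoMo, a generative method that recovers human motion in world coordinates from unconstrained videos with noisy or incomplete observations. Reconstructing such motion requires resolving a fundamental trade-off: generalizing from diverse and noisy video inputs while maintaining global motion consistency. Our approach addresses this problem by factorizing motion learning into two diffusion models. The camera-space model first estimates motion from videos in camera coordinates. The world-space model then lifts this initial estimate into world coordinates and refines it to be globally consistent. Together, the two models can reconstruct motion across diverse scenes and trajectories, even from highly noisy or incomplete observations. Moreover, our formulation is general: it generates the motion of mesh vertices directly and bypasses parametric models. DuoMo achieves state-of-the-art performance. On EMDB, our method reduces world-space reconstruction error by 16% while maintaining low foot skating. On RICH, it reduces world-space error by 30%. Project page: https://yufu-wang.github.io/duomo/
Summary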
DuoMo is a generative method that reconstructs human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. It addresses the challenge of generalizing from diverse and noisy inputs while maintaining global motion consistency by using two diffusion models: one for estimating motion in camera coordinates and another for lifting and refining this estimate into world coordinates. This approach achieves state-of-the-art performance, reducing world-space reconstruction error by 16% on EMDB and 30% on RICH compared to previous methods while maintaining low foot skating. Project page: https://yufu-wang.github.io/duomo/
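DuoMo is a generative method that reconstructs human motion in world coordinates from unconstrained videos with noisy or incomplete observations. It factorizes motion learning into two diffusion models: a camera-space model that produces the initial motion estimate and a world-space model that refines it for global consistency. DuoMo achieves state-of-the-art performance, reducing world-space reconstruction error by 16% on EMDB and by 30% on RICH while maintaining low foot skating.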
Physics-informed post-processing of stabilized finite element solutions for transient convection-dominated problems
Authors: Süleyman Cengizci, Ömür Uğur, Srinivasan Natesan
First: 2026-03-03T18:51:17+00:00 · Latest: 2026-03-03T18:51:17+00:00
Abstract
The numerical simulation of convection-dominated transient transport phenomena poses significant computational challenges due to sharp gradients and propagating fronts across the spatiotemporal domain. Classical discretization methods often generate spurious oscillations, requiring advanced stabilization techniques. However, even stabilized finite element methods may require additional regularization to accurately resolve localized steep layers. On the other hand, standalone physics-informed neural networks (PINNs) struggle to capture sharp solution structures in convection-dominated regimes and typically require a large number of training epochs. This work presents a hybrid computational framework that extends the PINN-Augmented SUPG with Shock-Capturing (PASSC) methodology from steady to unsteady problems. The approach combines a semi-discrete stabilized finite element method with a PINN-based correction strategy for transient convection-diffusion-reaction equations. Stabilization is achieved using the Streamline-Upwind Petrov-Galerkin (SUPG) formulation augmented with a YZbeta shock-capturing operator. Rather than training over the entire space-time domain, the neural network is applied selectively near the terminal time, enhancing the finite element solution using the last K_s temporal snapshots while enforcing residual constraints from the governing equations and boundary conditions. The network incorporates residual blocks with random Fourier features and employs progressive training with adaptive loss weighting. Numerical experiments on five benchmark problems, including boundary and interior layers, traveling waves, and nonlinear Burgers dynamics, demonstrate significant accuracy improvements at the terminal time compared to standalone stabilized finite element solutions.
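Chinese title/abstract (English translation)
Title: Physics-Informed Post-Processing of Stabilized Finite Element Solutions for Transient Convection-Dominated Problems
The numerical simulation of convection-dominated transient transport phenomena poses significant computational challenges because of sharp gradients and propagating fronts in the spatiotemporal domain. Classical discretization methods often generate spurious oscillations and therefore require advanced stabilization techniques. Even stabilized finite element methods, however, may need additional regularization to resolve localized steep layers accurately. Standalone physics-informed neural networks (PINNs), on the other hand, struggle to capture sharp solution structures in convection-dominated regimes and typically require a large number of training epochs. This work presents a hybrid computational framework that extends the PINN-Augmented SUPG with Shock-Capturing (PASSC) methodology from steady to unsteady problems. The approach combines a semi-discrete stabilized finite element method with a PINN-based correction strategy for transient convection-diffusion-reaction equations. Stabilization is achieved with the Streamline-Upwind Petrov-Galerkin (SUPG) formulation augmented with a YZbeta shock-capturing operator. Rather than being trained over the entire space-time domain, the neural network is applied selectively near the terminal time, where it enhances the finite element solution using the last K_s temporal snapshots while enforcing residual constraints from the governing equations and boundary conditions. The network contains residual blocks with random Fourier features and uses progressive training with adaptive loss weighting. Numerical experiments on five benchmark problems, including boundary layers, interior layers, traveling waves, and nonlinear Burgers dynamics, show significant accuracy improvements at the terminal time compared with standalone stabilized finite element solutions.
Summary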
This study addresses the computational challenges of simulating convection-dominated transient transport phenomena by developing a hybrid framework that integrates a physics-informed neural network (PINN) with a stabilized finite element method. The approach uses the Streamline-Upwind Petrov-Galerkin (SUPG) method augmented with a shock-capturing operator and applies the PINN selectively near the terminal time, using the last K_s temporal snapshots to enhance the finite element solution. The results show significant accuracy improvements compared to standalone stabilized finite element solutions in various benchmark problems, including boundary and interior layers, traveling waves, and nonlinear Burgers dynamics.
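The study addresses the computational challenges of simulating convection-dominated transient transport phenomena by combining a stabilized finite element method with a physics-informed neural network (PINN). The approach uses the Streamline-Upwind Petrov-Galerkin (SUPG) method together with a shock-capturing operator and applies the neural network selectively near the terminal time, using the last K_s temporal snapshots. Numerical experiments show significant accuracy gains over standalone stabilized finite element solutions on a range of benchmark problems.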
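For readers unfamiliar with the setting, the textbook form of a transient convection-diffusion-reaction problem and its semi-discrete SUPG formulation can be written as below. The abstract gives no equations, so the symbols (diffusivity \varepsilon, velocity field \mathbf{b}, reaction coefficient c, source f, element-wise stabilization parameter \tau_K) are assumed notation, and the YZbeta shock-capturing term is omitted rather than guessed.

```latex
% Strong form on \Omega \times (0, T], with suitable initial and boundary data
\frac{\partial u}{\partial t} - \varepsilon \Delta u
  + \mathbf{b} \cdot \nabla u + c\,u = f

% Semi-discrete SUPG: find u_h such that, for all test functions v_h,
\left(\frac{\partial u_h}{\partial t}, v_h\right)
  + \varepsilon \left(\nabla u_h, \nabla v_h\right)
  + \left(\mathbf{b} \cdot \nabla u_h + c\,u_h,\; v_h\right)
  + \sum_{K} \tau_K \left(R(u_h),\; \mathbf{b} \cdot \nabla v_h\right)_K
  = \left(f, v_h\right)

% Element residual used in the stabilization term
R(u_h) = \frac{\partial u_h}{\partial t} - \varepsilon \Delta u_h
  + \mathbf{b} \cdot \nabla u_h + c\,u_h - f
```

The stabilization term vanishes for the exact solution because it is weighted by the residual, which is why SUPG damps spurious oscillations without losing consistency. When \varepsilon is small relative to |\mathbf{b}|, the problem is convection-dominated and the sharp layers described in the abstract appear.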
Valet: A Standardized Testbed of Traditional Imperfect-Information Card Games
Authors: Mark Goadrich, Achille Morenville, Éric Piette
First: 2026-03-03T18:46:47+00:00 · Latest: 2026-03-03T18:46:47+00:00
Comments: 12 pages, 1 table, 4 figures
Abstract
AI algorithms for imperfect-information games are typically compared using performance metrics on individual games, making it difficult to assess robustness across game choices. Card games are a natural domain for imperfect information due to hidden hands and stochastic draws. To facilitate comparative research on imperfect-information game-playing algorithms and game systems, we introduce Valet, a diverse and comprehensive testbed of 21 traditional imperfect-information card games. These games span multiple genres, cultures, player counts, deck structures, mechanics, winning conditions, and methods of hiding and revealing information. To standardize implementations across systems, we encode the rules of each game in RECYCLE, a card game description language. We empirically characterize each game's branching factor and duration using random simulations, reporting baseline score distributions for a Monte Carlo Tree Search player against random opponents to demonstrate the suitability of Valet as a benchmarking suite.
中文标题/摘要
标题:Valet:传统不完美信息纸牌游戏的标准测试平台
不完美信息游戏的AI算法通常通过单个游戏上的性能指标进行比较,这使得跨游戏选择评估鲁棒性变得困难。纸牌游戏是不完美信息的自然领域,因为有隐藏的手牌和随机抽取。为了促进不完美信息博弈算法和游戏系统的比较研究,我们引入了Valet,一个包含21种传统不完美信息纸牌游戏的多样化和全面的测试平台。这些游戏跨越了多个类型、文化、玩家数量、牌组结构、机制、胜利条件以及信息的隐藏和揭示方法。为了在系统之间标准化实现,我们使用RECYCLE,一种纸牌游戏描述语言,来编码每种游戏的规则。我们通过随机模拟实证地表征每种游戏的分支因子和持续时间,报告蒙特卡洛树搜索玩家与随机对手的基线得分分布,以证明Valet作为基准测试套件的适用性。
Summary / 总结
The research provides a standardized testbed for evaluating AI algorithms on imperfect-information card games, a natural domain because of hidden hands and stochastic draws. Valet comprises 21 traditional card games that span genres, cultures, player counts, deck structures, mechanics, winning conditions, and methods of hiding and revealing information. The rules of each game are encoded in RECYCLE, a card game description language, to standardize implementations across systems. The authors characterize each game's branching factor and duration with random simulations and report baseline score distributions for a Monte Carlo Tree Search player against random opponents to demonstrate Valet's suitability as a benchmarking suite.
研究旨在提供一个标准化测试平台,用于评估AI算法在不完美信息纸牌游戏中的表现,这类游戏由于隐藏的手牌和随机抽取而自然适合。Valet包括21种传统纸牌游戏,具有多种特性。研究使用RECYCLE,一种纸牌游戏描述语言,来标准化游戏实现。关键发现表明,Valet适合作为基准测试套件,报告了蒙特卡洛树搜索玩家与随机对手对战的基本得分分布。
Theory of Code Space: Do Code Agents Understand Software Architecture?
Authors: Grigory Sapunov
First: 2026-02-28T11:40:17+00:00 · Latest: 2026-03-03T18:45:08+00:00
Comments: updated experiments
Abstract
AI code agents excel at isolated tasks yet struggle with complex, multi-file software engineering requiring understanding of how dozens of modules relate. We hypothesize these failures stem from inability to construct, maintain, and update coherent architectural beliefs during codebase exploration. We introduce Theory of Code Space (ToCS), a benchmark that evaluates this capability by placing agents in procedurally generated codebases under partial observability, requiring them to build structured belief states over module dependencies, cross-cutting invariants, and design intent. The framework features: (1) a procedural codebase generator producing medium-complexity Python projects with four typed edge categories reflecting different discovery methods -- from syntactic imports to config-driven dynamic wiring -- with planted architectural constraints and verified ground truth; (2) a partial observability harness where agents explore under a budget; and (3) periodic belief probing via structured JSON, producing a time-series of architectural understanding. We decompose the Active-Passive Gap from spatial reasoning benchmarks into selection and decision components, and introduce Architectural Constraint Discovery as a code-specific evaluation dimension. Preliminary experiments with four rule-based baselines and five frontier LLM agents from three providers validate discriminative power: methods span a wide performance range (F1 from 0.129 to 0.646), LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics -- revealing that belief externalization, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks. Open-source toolkit: https://github.com/che-shr-cat/tocs
中文标题/摘要
标题:代码空间理论:代码代理是否理解软件架构?
AI代码代理在执行孤立任务方面表现出色,但在处理需要理解数十个模块之间关系的复杂、多文件软件工程时却遇到困难。我们假设这些失败源于其在代码库探索过程中无法构建、维护和更新一致的架构信念。我们提出了代码空间理论(ToCS),通过将代理置于部分可观测的程序生成代码库中,要求它们构建模块依赖关系、切面不变量和设计意图的结构化信念状态来评估这一能力。该框架包括:(1) 一个程序生成代码库生成器,生成具有四种类型边类别的中等复杂度的Python项目,反映不同的发现方法——从语法导入到配置驱动的动态连接,并植入架构约束和验证的地面真相;(2) 一个部分可观测性框架,代理在预算内探索;(3) 通过结构化JSON进行定期信念探查,产生架构理解的时间序列。我们将空间推理基准中的主动-被动差距分解为选择和决策组件,并引入架构约束发现作为代码特定的评估维度。使用四个基于规则的基线和五个来自三个提供商的前沿LLM代理的初步实验验证了区分能力:方法的性能范围广泛(F1从0.129到0.646),LLM代理发现所有基线都无法识别的语义边类型,但较弱的模型得分低于简单启发式方法——揭示信念外化,将内部理解忠实序列化为结构化JSON,本身就是一项非平凡的能力,并且是信念探查基准中的首要混淆因素。开源工具包:https://github.com/che-shr-cat/tocs
Summary / 总结
The paper introduces Theory of Code Space (ToCS), a benchmark that evaluates whether AI code agents can build and maintain architectural beliefs about a codebase. A procedural generator creates medium-complexity Python projects with four typed edge categories, planted architectural constraints, and verified ground truth. Agents explore under partial observability and a budget, and their beliefs are probed periodically via structured JSON. Preliminary experiments with four rule-based baselines and five frontier LLM agents show a wide performance range (F1 from 0.129 to 0.646): LLM agents discover semantic edge types invisible to all baselines, yet weaker models score below simple heuristics. This indicates that belief externalization, that is, faithfully serializing internal understanding into structured JSON, is itself a non-trivial capability and a first-order confounder in belief-probing benchmarks.
研究旨在通过假设AI代码代理在复杂软件工程任务中表现不佳是因为无法构建和维护一致的架构信念来理解其原因。研究引入了代码空间理论(ToCS),通过部分可观测的程序生成代码库来评估这一能力。关键发现包括方法间广泛的表现范围(F1从0.129到0.646),LLM代理发现了规则基线无法发现的语义边类型,并揭示了将内部理解外部化为结构化JSON本身是一项非平凡的能力。
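The abstract reports F1 scores for the probed beliefs but does not spell out how they are computed. The following is a minimal sketch, assuming a belief is scored as a set of typed dependency edges against the ground-truth graph; the `Edge` schema, the function name `edge_f1`, and the toy module names are hypothetical and are not taken from the ToCS toolkit.

```python
from typing import Iterable, Set, Tuple

# Hypothetical edge schema: (source module, target module, edge type).
Edge = Tuple[str, str, str]


def edge_f1(predicted: Iterable[Edge], truth: Iterable[Edge]) -> dict:
    """Set-based precision, recall, and F1 over typed dependency edges."""
    pred: Set[Edge] = set(predicted)
    gold: Set[Edge] = set(truth)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy ground truth with one config-driven edge that a syntactic scan would miss.
truth = [
    ("app", "db", "import"),
    ("app", "cache", "config"),
    ("api", "app", "import"),
]
# Toy belief state: one correct edge, one spurious edge.
belief = [("app", "db", "import"), ("api", "db", "import")]

scores = edge_f1(belief, truth)
print(round(scores["f1"], 3))  # precision 1/2, recall 1/3, so F1 = 0.4
```

Scoring each probe in this way at successive exploration steps would yield the kind of time series of architectural understanding that the abstract describes.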
UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
First: 2026-01-07T23:49:52+00:00 · Latest: 2026-03-03T18:40:54+00:00
Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM
Abstract
World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .
中文标题/摘要
标题:UniDrive-WM:统一理解、规划和生成世界模型在自动驾驶中的应用
世界模型已成为自动驾驶的核心,准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型(VLMs)进行规划,但现有方法通常将感知、预测和规划视为独立模块。我们提出UniDrive-WM,这是一种基于VLM的统一世界模型,在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹,条件化VLM图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号,增强场景理解并逐步细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响,分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中,UniDrive-WM生成了高保真度的未来图像,并在L2轨迹误差和碰撞率方面分别提高了5.9%和9.2%,超过了之前的最佳方法。这些结果表明,将VLM驱动的推理、规划和生成世界建模紧密集成对于自动驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM 查看。
Summary / 总结
UniDrive-WM is a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. Its trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames; these predictions in turn provide additional supervisory signals for scene understanding and trajectory refinement. On the Bench2Drive benchmark, UniDrive-WM improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method.
UniDrive-WM 是一个统一的基于 VLM 的世界模型,集成了驾驶场景理解、轨迹规划和未来图像生成。轨迹规划器预测未来路径,条件化 VLM 生成可能的未来帧。实验表明,UniDrive-WM 在 Bench2Drive 基准上的 L2 轨迹误差降低了 5.9%,碰撞率降低了 9.2%,优于之前的方法,突显了将 VLM 驱动的推理和生成建模紧密集成的优势。
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
Authors: Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
First: 2026-03-03T18:36:16+00:00 · Latest: 2026-03-03T18:36:16+00:00
Abstract
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity for more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
中文标题/摘要
标题:UniG2U-Bench:统一模型是否推进了多模态理解?
统一多模态模型最近展示了强大的生成能力,但生成是否以及何时提升理解仍不清楚。现有基准缺乏对生成促进理解的具体任务的系统探索。为此,我们引入了UniG2U-Bench,这是一个全面的基准,将生成到理解(G2U)评估分为7个阶段和30个子任务,需要不同程度的隐式或显式的视觉转换。对超过30个模型的广泛评估揭示了三个核心发现:1)统一模型通常不如其基础视觉语言模型(VLM),生成后推理(GtA)通常会降低性能相对于直接推理。2)在空间智能、视觉错觉或多轮推理子任务中出现一致的增强,其中增强的空间和形状感知以及多步中间图像状态是有益的。3)具有相似推理结构的任务和共享架构的模型表现出相关行为,表明生成-理解耦合在任务、预训练数据和模型架构上诱导出类一致的归纳偏置。这些发现强调了需要更多样化的训练数据和新颖的范式来充分释放统一多模态建模的潜力。
Summary / 总结
The study introduces UniG2U-Bench, a benchmark that evaluates whether and when generation improves understanding in unified multimodal models. It categorizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks. An evaluation of over 30 models shows that unified models generally underperform their base VLMs and that Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. The study finds consistent enhancements in spatial intelligence, visual illusions, and multi-round reasoning subtasks, and suggests that generation-understanding coupling induces class-consistent inductive biases, highlighting the need for more diverse training data and novel paradigms.
研究引入了UniG2U-Bench基准,评估统一模型通过生成来提升理解的能力。该基准将任务分为7个类别和30个子任务,结果显示统一模型通常不如其基础视觉-语言模型表现,并且生成-然后-回答(GtA)推理往往降低性能。研究发现,在空间智能、视觉错觉和多轮推理子任务中存在一致的增强效果,并表明生成-理解耦合会诱导类别一致的归纳偏置,强调需要多样化的训练数据和新的范式来充分利用统一多模态建模的潜力。
COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design
Authors: Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski
First: 2026-03-03T18:31:46+00:00 · Latest: 2026-03-03T18:31:46+00:00
Abstract
Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Thus, such conditional mappings should be parametrised as data distributions. As a result, deterministic models tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modeling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
中文标题/摘要
标题:COP-GEN:用于哥白尼地球观测数据的潜在扩散变换器——设计上具有随机性
地球观测应用越来越多地依赖于多传感器数据,包括光学、雷达、高程和土地覆盖产品。这些模态之间的关系对于数据集成至关重要,但它们是本原非单射的:相同的条件信息可以对应多个物理上合理的观测结果。因此,这样的条件映射应该被参数化为数据分布。结果,确定性模型往往会向条件均值坍塌,并且无法表示诸如数据完成和跨传感器转换等任务所需的不确定性与变化性。我们引入了COP-GEN,这是一种多模态潜在扩散变换器,它在各自的原生空间分辨率下建模了异构地球观测模态的联合分布。通过将跨模态映射参数化为条件分布,COP-GEN 使灵活的任意到任意条件生成成为可能,包括零样本模态转换、光谱波段填充以及在部分或缺失输入下的生成,而无需针对特定任务重新训练。在大规模全球多模态数据集上的实验表明,COP-GEN 生成了多样且物理上一致的实现,同时在光学、雷达和高程模态中保持了强大的峰值保真度。定性和定量分析表明,该模型捕捉到了有意义的跨模态结构,并且随着条件信息的增加,系统地调整其输出不确定性。这些结果突显了随机生成建模在地球观测中的实际重要性,并激励了超越单一参考点度量的评估协议。
Summary / 总结
COP-GEN is a multimodal latent diffusion transformer designed for Earth observation data, addressing the challenge of non-injective relationships between different sensor modalities. By modeling these relationships as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation and spectral band infilling. Experiments show that COP-GEN generates diverse and physically consistent realizations while maintaining strong peak fidelity across various Earth observation modalities, highlighting the importance of stochastic generative modeling for Earth observation tasks.
COP-GEN 是一种多模态潜扩散变换器,用于从多种传感器生成地球观测数据。它在原生分辨率下建模不同地球观测模态的联合分布,并将跨模态映射参数化为条件分布。这种方法允许灵活的任意到任意条件生成,包括零样本模态转换和光谱带填充。实验表明,COP-GEN 生成了多样且物理上一致的实现,同时在光学、雷达和高程数据中保持了高峰值保真度。该模型能够捕捉有意义的跨模态结构,并根据提供的条件信息调整其输出不确定性,突出了在地球观测任务中随机生成建模的重要性。
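The remark that deterministic models collapse toward conditional means can be made concrete with a standard decomposition. This is a sketch under the assumption that the deterministic model $f$ is trained with a squared-error loss (the abstract does not name a loss); $x$ denotes the conditioning input, $y$ the target observation, and $\mu(x)$ is notation introduced here.

```latex
\begin{aligned}
\mathbb{E}\big[\|y - f(x)\|^2 \mid x\big]
  &= \mathbb{E}\big[\|y - \mu(x)\|^2 \mid x\big] + \|\mu(x) - f(x)\|^2,
  \qquad \mu(x) = \mathbb{E}[y \mid x], \\
f^\star(x) &= \arg\min_{f}\ \mathbb{E}\big[\|y - f(x)\|^2\big] = \mu(x).
\end{aligned}
```

The cross term vanishes because $\mathbb{E}[y - \mu(x) \mid x] = 0$. If the same $x$ is compatible with two equally likely observations $y_1$ and $y_2$, the optimum is their average $(y_1 + y_2)/2$, which need not be a plausible observation itself; a model of the conditional distribution avoids this averaging.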
On Geometry Regularization in Autoencoder Reduced-Order Models with Latent Neural ODE Dynamics
Authors: Mikhail Osipov
First: 2026-03-03T18:31:13+00:00 · Latest: 2026-03-03T18:31:13+00:00
Comments: 25 pages, 2 figures, 3 tables
Abstract
We investigate geometric regularization strategies for learned latent representations in encoder--decoder reduced-order models. In a fixed experimental setting for the advection--diffusion--reaction (ADR) equation, we model latent dynamics using a neural ODE and evaluate four regularization approaches applied during autoencoder pre-training: (a) near-isometry regularization of the decoder Jacobian, (b) a stochastic decoder gain penalty based on random directional gains, (c) a second-order directional curvature penalty, and (d) Stiefel projection of the first decoder layer. Across multiple seeds, we find that (a)--(c) often produce latent representations that make subsequent latent-dynamics training with a frozen autoencoder more difficult, especially for long-horizon rollouts, even when they improve local decoder smoothness or related sensitivity proxies. In contrast, (d) consistently improves conditioning-related diagnostics of the learned latent dynamics and tends to yield better rollout performance. We discuss the hypothesis that, in this setting, the downstream impact of latent-geometry mismatch outweighs the benefits of improved decoder smoothness.
中文标题/摘要
标题:关于自动编码器降阶模型中潜空间神经ODE动力学的几何正则化
我们研究了在编码器-解码器降阶模型中学习潜空间表示的几何正则化策略。在固定实验设置下的对流-扩散-反应(ADR)方程中,我们使用神经ODE建模潜空间动力学,并评估了四种在自动编码器预训练期间应用的正则化方法:(a) 解码器雅可比矩阵的近似等距正则化,(b) 基于随机方向增益的随机解码器增益惩罚,(c) 方向曲率二阶惩罚,以及(d) 第一个解码层的史蒂费尔投影。在多个随机种子下,我们发现(a)-(c) 经常导致后续冻结自动编码器的潜空间动力学训练更加困难,尤其是在长时序滚动预测中,即使它们能改善局部解码器平滑度或相关灵敏度代理。相比之下,(d) 一致地改善了学习到的潜空间动力学的条件诊断,并倾向于产生更好的滚动预测性能。我们讨论了在这种情况下,潜空间几何不匹配的下游影响超过了解码器平滑度改进带来的好处的假设。
Summary / 总结
The study explores geometric regularization techniques for latent representations in autoencoder reduced-order models using a neural ODE for latent dynamics. Four regularization methods were evaluated during autoencoder pre-training: near-isometry of the decoder Jacobian, stochastic decoder gain penalty, second-order directional curvature penalty, and Stiefel projection of the first decoder layer. Results show that the first three methods often complicate latent-dynamics training, especially for long-term predictions, despite improving local smoothness. In contrast, Stiefel projection consistently enhances the conditioning of the latent dynamics and leads to better performance in rollouts.
研究探讨了在描述对流--扩散--反应方程的自编码器降阶模型中,对潜在表示进行几何正则化的方法。测试了四种正则化方法:解码器雅可比矩阵的近等距正则化、随机解码器增益惩罚、二阶方向曲率惩罚以及第一层解码器的Stiefel投影。虽然前三者改善了局部解码器的平滑性,但往往使后续的潜在动态训练更加困难,尤其是在长期预测中。相比之下,Stiefel投影始终提高了学习到的潜在动态的条件,并改善了滚动预测的表现。
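Approach (d), Stiefel projection of the first decoder layer, can be illustrated with a short sketch. The abstract does not say how the projection is computed; the version below assumes the common choice of replacing a weight matrix with its nearest matrix with orthonormal columns in the Frobenius norm, which is the polar factor obtained from a thin SVD. The function name `stiefel_project` and the layer shape are hypothetical.

```python
import numpy as np


def stiefel_project(weight: np.ndarray) -> np.ndarray:
    """Return the polar factor U @ Vt of `weight`.

    For a tall matrix (rows >= columns) the result has orthonormal columns;
    for a wide matrix it has orthonormal rows.
    """
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
# Hypothetical first decoder layer mapping a 3-dimensional latent to 8 hidden units.
w = rng.normal(size=(8, 3))
q = stiefel_project(w)

# Columns are orthonormal, so the layer acts as an isometry on the latent space.
print(np.allclose(q.T @ q, np.eye(3)))
```

In a training loop, a projection of this kind would typically be reapplied after each optimizer step so that the layer stays on the Stiefel manifold; whether the paper does this, or uses a different parameterization, is not stated in the abstract.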
CIRCLE: A Framework for Evaluating AI from a Real-World Lens
Authors: Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda
First: 2026-02-27T14:43:23+00:00 · Latest: 2026-03-03T18:25:54+00:00
Comments: Accepted at Intelligent Systems Conference (IntelliSys) 2026
Abstract
This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.
中文标题/摘要
标题:CIRCLE:一种从现实角度评估AI的框架
本文提出了一种名为CIRCLE的六阶段生命周期框架,旨在弥合模型中心性能指标与部署中AI实际成果之间的现实差距。虽然现有的框架如MLOps侧重于系统稳定性,基准测试衡量的是抽象能力,但AI堆栈之外的决策者缺乏系统性的证据,以了解AI技术在实际用户变异性及约束条件下的行为。CIRCLE通过将TEVV(测试、评估、验证和验证)中的验证阶段具体化,将堆栈之外的利益相关者关切转化为可测量的信号。与通常局限于特定场景的参与式设计或事后进行的算法审计不同,CIRCLE提供了一种结构化的前瞻性协议,将情境敏感的定性见解与可扩展的定量指标联系起来。通过将现场测试、红队测试和纵向研究等方法整合到协调的管道中,CIRCLE产生了一种系统性知识:这种证据在不同地点之间具有可比性,但又对当地环境敏感。这可以基于实际下游影响而非理论能力进行治理。
Summary / 总结
CIRCLE is a six-stage, lifecycle-based framework designed to bridge the gap between model-centric performance metrics and real-world AI outcomes in deployment. It operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by translating stakeholder concerns into measurable signals, providing a structured, prospective protocol for linking qualitative insights to quantitative metrics. By integrating field testing, red teaming, and longitudinal studies into a coordinated pipeline, it produces systematic knowledge that is comparable across sites yet sensitive to local context, which can enable governance based on materialized downstream effects rather than theoretical capabilities.
CIRCLE 是一个六阶段框架,旨在弥合模型中心性能指标与实际AI成果之间的差距。它通过将利益相关者的关切转化为可测量的信号来实现TEVV验证阶段的标准化和前瞻性流程。CIRCLE 结合了现场测试、红队测试和纵向研究,生成了既可跨地点比较又对当地环境敏感的系统性知识,从而基于实际下游效果而非理论能力进行治理。
AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
Authors: Zihang Zeng, Jiaquan Zhang, Pengze Li, Yuan Qi, Xi Chen
First: 2026-03-03T18:25:00+00:00 · Latest: 2026-03-03T18:25:00+00:00
Abstract
Large Language Models (LLMs) demonstrate potentials for automating scientific code generation but face challenges in reliability, error propagation in multi-agent workflows, and evaluation in domains with ill-defined success metrics. We present a Bayesian adversarial multi-agent framework specifically designed for AI for Science (AI4S) tasks in the form of a Low-code Platform (LCP). Three LLM-based agents are coordinated under the Bayesian framework: a Task Manager that structures user inputs into actionable plans and adaptive test cases, a Code Generator that produces candidate solutions, and an Evaluator providing comprehensive feedback. The framework employs an adversarial loop where the Task Manager iteratively refines test cases to challenge the Code Generator, while prompt distributions are dynamically updated using Bayesian principles by integrating code quality metrics: functional correctness, structural alignment, and static analysis. This co-optimization of tests and code reduces dependence on LLM reliability and addresses evaluation uncertainty inherent to scientific tasks. LCP also streamlines human-AI collaboration by translating non-expert prompts into domain-specific requirements, bypassing the need for manual prompt engineering by practitioners without coding backgrounds. Benchmark evaluations demonstrate LCP's effectiveness in generating robust code while minimizing error propagation. The proposed platform is also tested on an Earth Science cross-disciplinary task and demonstrates strong reliability, outperforming competing models.
中文标题/摘要
标题:面向科学的低代码平台与贝叶斯对抗多智能体框架
大型语言模型(LLMs)展示了自动化科学代码生成的潜力,但面临可靠性和多智能体工作流中的错误传播挑战,以及在成功度量不明确的领域中的评估挑战。我们提出了一种专门针对科学人工智能(AI4S)任务的贝叶斯对抗多智能体框架,以低代码平台(LCP)的形式呈现。在贝叶斯框架下协调了三个基于LLM的智能体:任务管理器将用户输入结构化为可执行计划和自适应测试案例,代码生成器生成候选解决方案,评估器提供全面反馈。框架采用了一个对抗循环,其中任务管理器迭代细化测试案例以挑战代码生成器,同时使用贝叶斯原则动态更新提示分布,结合代码质量指标:功能正确性、结构对齐和静态分析。这种测试和代码的协同优化减少了对LLM可靠性的依赖,并解决了科学任务固有的评估不确定性。LCP通过将非专家提示翻译成特定领域的规范,简化了人机协作,绕过了没有编程背景的从业者需要的手动提示工程。基准评估表明,LCP在生成稳健代码的同时最大限度地减少了错误传播。所提出的平台还在地球科学跨学科任务中进行了测试,并展示了强大的可靠性,优于竞争对手模型。
Summary / 总结
The research aims to address the limitations of Large Language Models (LLMs) in scientific code generation, particularly focusing on reliability and error propagation. The study introduces a Bayesian adversarial multi-agent framework within a Low-code Platform (LCP) to enhance AI for Science tasks. Three LLM-based agents work together: a Task Manager that creates actionable plans and test cases, a Code Generator that produces code, and an Evaluator that provides feedback. The framework uses an adversarial loop to refine test cases and update prompt distributions, which helps in reducing reliance on LLM reliability and addressing evaluation uncertainty. Experimental results show that LCP effectively generates robust code and outperforms competing models in Earth Science tasks.
研究旨在通过大型语言模型(LLMs)解决自动化科学代码生成中的可靠性和错误传播问题。提出了一种贝叶斯对抗多代理框架,嵌入低代码平台(LCP)中,以增强AI for Science任务的自动化。三个基于LLM的代理——任务管理器、代码生成器和评估器——协调工作以生成和改进代码,通过迭代测试和反馈形成对抗循环,提高代码质量。实验结果表明,LCP能够有效生成稳健的代码,减少错误传播,并在地球科学任务中表现出色,优于其他竞争模型。
SynthCharge: An Electric Vehicle Routing Instance Generator with Feasibility Screening to Enable Learning-Based Optimization and Benchmarking
Authors: Mertcan Daysalilar, Fuat Uyguroglu, Gabriel Nicolosi, Adam Meyers
First: 2026-03-03T18:23:38+00:00 · Latest: 2026-03-03T18:23:38+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
The electric vehicle routing problem with time windows (EVRPTW) extends the classical VRPTW by introducing battery capacity constraints and charging station decisions. Existing benchmark datasets are often static and lack verifiable feasibility, which restricts reproducible evaluation of learning-based routing models. We introduce SynthCharge, a parametric generator that produces diverse, feasibility-screened EVRPTW instances across varying spatiotemporal configurations and scalable customer counts. While SynthCharge can currently generate large-scale instances of up to 500 customers, we focus our experiments on sizes ranging from 5 to 100 customers. Unlike static benchmark suites, SynthCharge integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement. To guarantee structural validity, the generator systematically filters out unsolvable instances through a fast feasibility screening process. Ultimately, SynthCharge provides the dynamic benchmarking infrastructure needed to systematically evaluate the robustness of emerging neural routing and data-driven approaches.
中文标题/摘要
标题:SynthCharge:一种带可行性筛选的电动汽车路径问题实例生成器,以支持基于学习的优化和基准测试
电动汽车带时间窗的路径问题(EVRPTW)在经典的VRPTW基础上引入了电池容量约束和充电站决策。现有的基准数据集往往是静态的,缺乏可验证的可行性,这限制了基于学习的路径模型的可重复评估。我们引入了SynthCharge,这是一种参数生成器,可以生成在不同空间时间和客户数量配置下具有可行性的多样化EVRPTW实例。虽然SynthCharge目前可以生成多达500个客户的大型实例,但我们主要在5到100个客户的规模上进行实验。与静态基准套件不同,SynthCharge将实例几何与自适应能量容量缩放和范围感知充电站布局相结合。为了保证结构有效性,生成器通过快速的可行性筛选过程系统地过滤出不可解的实例。最终,SynthCharge提供了系统评估新兴神经路径和数据驱动方法鲁棒性的动态基准测试基础设施。
Summary / 总结
The research addresses the electric vehicle routing problem with time windows (EVRPTW) by introducing SynthCharge, a parametric generator that produces diverse, feasibility-screened instances. It integrates instance geometry with adaptive energy capacity scaling and range-aware charging station placement, and it filters out unsolvable instances through a fast feasibility screening step to guarantee structural validity. SynthCharge can currently generate instances with up to 500 customers, although the reported experiments focus on 5 to 100 customers, and it is intended to support systematic evaluation of learning-based routing models.
研究通过引入SynthCharge,一种参数生成器,来解决带时间窗口的电动汽车路由问题(EVRPTW),生成多样且经过可行性筛选的实例。该生成器结合实例几何、自适应能量容量缩放和范围感知充电站布局,并通过快速可行性筛选过滤掉不可解的实例,以保证结构有效性。SynthCharge目前可生成多达500个客户的实例,而实验集中在5到100个客户的规模上,从而支持对基于学习的路由模型进行系统评估。
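The abstract mentions a fast feasibility screening step without describing it. The sketch below is an illustration only, not the authors' procedure: it checks two necessary conditions per customer (an energy bound and a time-window bound), so passing the screen does not prove that an instance is solvable. All names (`Customer`, `passes_screen`), the unit energy consumption per distance, and the toy coordinates are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Customer:
    location: Point
    due_time: float  # latest allowed arrival time


def dist(a: Point, b: Point) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])


def passes_screen(depot: Point, stations: List[Point], customers: List[Customer],
                  battery_range: float, speed: float = 1.0) -> bool:
    """Reject instances that violate simple necessary conditions.

    Energy: a vehicle reaches each customer from some recharge point (depot or
    station) and must reach a recharge point afterwards, so twice the distance
    to the nearest recharge point may not exceed the battery range.
    Time: the direct depot-to-customer travel time may not exceed the due time.
    """
    recharge_points = [depot] + stations
    for c in customers:
        nearest = min(dist(r, c.location) for r in recharge_points)
        if 2 * nearest > battery_range:
            return False
        if dist(depot, c.location) / speed > c.due_time:
            return False
    return True


depot = (0.0, 0.0)
stations = [(50.0, 0.0)]
near = Customer((20.0, 0.0), due_time=30.0)
via_station = Customer((70.0, 0.0), due_time=100.0)
too_far = Customer((120.0, 0.0), due_time=200.0)

print(passes_screen(depot, stations, [near, via_station], battery_range=60.0))
print(passes_screen(depot, stations, [near, via_station, too_far], battery_range=60.0))
```

Both bounds follow from the triangle inequality, which is why they can be checked per customer without solving a routing problem. A full screen would also need to account for charging times, service times, and station-to-station reachability.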
Stabilized Adaptive Loss and Residual-Based Collocation for Physics-Informed Neural Networks
Authors: Divyavardhan Singh, Shubham Kamble, Dimple Sonone, Kishor Upla
First: 2026-03-03T18:17:28+00:00 · Latest: 2026-03-03T18:17:28+00:00
Comments: 6 pages, 2 Figures, 4 tables
Abstract
Physics-Informed Neural Networks (PINNs) have been recognized as a mesh-free alternative to solve partial differential equations where physics information is incorporated. However, in dealing with problems characterized by high stiffness or shock-dominated dynamics, traditional PINNs have been found to have limitations, including unbalanced training and inaccuracy in solution, even with small physics residuals. In this research, we seek to address these limitations using the viscous Burgers' equation with low viscosity and the Allen-Cahn equation as test problems. In addressing unbalanced training, we have developed a new adaptive loss balancing scheme using smoothed gradient norms to ensure satisfaction of initial and boundary conditions. Further, to address inaccuracy in the solution, we have developed an adaptive residual-based collocation scheme to improve the accuracy of solutions in the regions with high physics residuals. The proposed new approach significantly improves solution accuracy with consistent satisfaction of physics residuals. For instance, in the case of Burgers' equation, the relative L2 error is reduced by about 44 percent compared to traditional PINNs, while for the Allen-Cahn equation, the relative L2 error is reduced by approximately 70 percent. Additionally, we show the trustworthy solution comparison of the proposed method using a robust finite difference solver.
中文标题/摘要
标题:稳定自适应损失和基于残差的配置方法在物理信息神经网络中的应用
物理信息神经网络(PINNs)已被视为一种无网格替代方案,用于解决包含物理信息的偏微分方程。然而,在处理具有高刚度或冲击主导动力学的问题时,传统的PINNs显示出局限性,包括训练不平衡和解的不准确,即使物理残差较小也是如此。在本研究中,我们使用低粘度的粘性Burgers方程和Allen-Cahn方程作为测试问题,旨在解决这些局限性。为了解决训练不平衡,我们开发了一种新的自适应损失平衡方案,使用平滑梯度范数确保满足初始和边界条件。此外,为了解决解的不准确性,我们开发了一种自适应基于残差的配置方案,以提高高物理残差区域解的准确性。所提出的新方法显著提高了解的准确性,同时保持了物理残差的一致满足。例如,在Burgers方程的情况下,相对L2误差降低了约44%,而在Allen-Cahn方程的情况下,相对L2误差降低了约70%。此外,我们使用稳健的有限差分求解器展示了所提出方法的解的可信比较。
Summary / 总结
This research aims to improve the performance of Physics-Informed Neural Networks (PINNs) in solving problems with high stiffness or shock-dominated dynamics. The authors introduce a new adaptive loss balancing scheme using smoothed gradient norms to address unbalanced training and an adaptive residual-based collocation scheme to enhance solution accuracy in regions with high physics residuals. Experimental results show a significant reduction in relative L2 error for both the viscous Burgers' equation and the Allen-Cahn equation, with improvements of about 44% and 70%, respectively, compared to traditional PINNs.
该研究旨在提高物理信息神经网络(PINNs)在解决具有高刚性或冲击主导动力学的偏微分方程问题时的性能。作者引入了一种使用平滑梯度范数的自适应损失平衡方案来解决训练不平衡问题,并提出了一种自适应残差基配置方案以提高在高物理残差区域的解精度。实验结果表明,与传统PINNs相比,伯格斯方程的相对L2误差减少了约44%,而艾伦-坎方程的相对L2误差减少了约70%。
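The residual-based collocation idea can be sketched in a few lines. The abstract does not give the exact sampling rule, so this example assumes the simplest variant: evaluate the physics residual on a pool of candidate points and add the points with the largest residual magnitude to the training set. The function `refine_collocation` and the toy residual, which peaks near x = 0 to mimic a steep front in a low-viscosity Burgers solution, are hypothetical.

```python
import math
from typing import Callable, List


def refine_collocation(points: List[float],
                       residual: Callable[[float], float],
                       candidates: List[float],
                       n_add: int) -> List[float]:
    """Append the n_add candidates with the largest |residual| to `points`."""
    ranked = sorted(candidates, key=lambda x: abs(residual(x)), reverse=True)
    return points + ranked[:n_add]


def toy_residual(x: float) -> float:
    # Stand-in for the PDE residual of a partially trained network.
    return math.exp(-(x / 0.05) ** 2)


base_points = [-1.0, -0.5, 0.5, 1.0]
candidates = [i / 100 for i in range(-100, 101)]
refined = refine_collocation(base_points, toy_residual, candidates, n_add=5)

# The added points cluster around the front at x = 0.
print(sorted(refined[len(base_points):]))
```

In practice the residual would be recomputed from the network every few training epochs, and a probabilistic rule (sampling proportionally to the residual magnitude) is a common alternative to the top-k selection used here.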
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Authors: Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song, Shuo Li, Kezhen Chen
First: 2026-02-23T05:17:41+00:00 · Latest: 2026-03-03T18:16:35+00:00
Abstract
We introduce CFE-Bench (Classroom Final Exam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains. CFE-Bench is curated from repeatedly used, authentic university homework and exam problems, paired with reference solutions provided by course instructors. CFE-Bench remains challenging for frontier models: the newly released Gemini-3.1-pro-preview achieves 59.69% overall accuracy, while the second-best model, Gemini-3-flash-preview, reaches 55.46%, leaving substantial room for improvement. Beyond aggregate scores, we conduct a diagnostic analysis by decomposing instructor reference solutions into structured reasoning flows. We find that while frontier models often answer intermediate sub-questions correctly, they struggle to reliably derive and maintain correct intermediate states throughout multi-step solutions. We further observe that model-generated solutions typically contain more reasoning steps than instructor solutions, indicating lower step efficiency and a higher risk of error accumulation. Data and code are available at https://github.com/Analogy-AI/CFE_Bench.
中文标题/摘要
标题:教室期末考试:由教师测试的推理基准
我们介绍了CFE-Bench(教室期末考试),这是一个多模态基准,用于评估大型语言模型在超过20个STEM领域的推理能力。CFE-Bench 从反复使用的、真实的大学作业和考试问题中精选而来,并配以课程教师提供的参考解决方案。CFE-Bench 对前沿模型仍然具有挑战性:新发布的Gemini-3.1-pro-preview 的总体准确率为59.69%,而第二好的模型Gemini-3-flash-preview 达到55.46%,留有很大改进空间。除了总分外,我们通过将教师参考解决方案分解为结构化的推理流程进行了诊断分析。我们发现,虽然前沿模型通常能正确回答中间子问题,但在多步解决方案中可靠地推导和保持正确中间状态方面存在困难。我们还观察到,模型生成的解决方案通常包含比教师解决方案更多的推理步骤,表明较低的步骤效率和更高的错误累积风险。数据和代码可在https://github.com/Analogy-AI/CFE_Bench 获取。
NutriBench: A Dataset for Evaluating Large Language Models on Nutrition Estimation from Meal Descriptions
Authors: Andong Hua, Mehak Preet Dhaliwal, Laya Pullela, Ryan Burke, Yao Qin
Venue: ICLR 2025
First: 2024-07-04T15:10:51+00:00 · Latest: 2026-03-03T18:03:31+00:00
Comments: ICLR 2025
Abstract
Accurate nutrition estimation helps people make informed dietary choices and is essential in the prevention of serious health complications. We present NutriBench, the first publicly available natural language meal description nutrition benchmark. NutriBench consists of 11,857 meal descriptions generated from real-world global dietary intake data. The data is human-verified and annotated with macro-nutrient labels, including carbohydrates, proteins, fats, and calories. We conduct an extensive evaluation of NutriBench on the task of carbohydrate estimation, testing twelve leading Large Language Models (LLMs), including GPT-4o, Llama3.1, Qwen2, Gemma2, and OpenBioLLM models, using standard, Chain-of-Thought and Retrieval-Augmented Generation strategies. Additionally, we present a study involving professional nutritionists, finding that LLMs can provide comparable but significantly faster estimates. Finally, we perform a real-world risk assessment by simulating the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes. Our work highlights the opportunities and challenges of using LLMs for nutrition estimation, demonstrating their potential to aid professionals and laypersons and improve health outcomes. Our benchmark is publicly available at: https://mehak126.github.io/nutribench.html
中文标题/摘要
标题:NutriBench:用于评估大型语言模型从餐食描述中估计营养成分的数据集
准确的营养估计有助于人们做出知情的饮食选择,并且对于预防严重的健康并发症至关重要。我们介绍了NutriBench,这是首个公开的自然语言餐食描述营养基准数据集。NutriBench 包含来自全球实际饮食摄入数据的 11,857 条餐食描述。数据由人工验证并标注了宏量营养素标签,包括碳水化合物、蛋白质、脂肪和卡路里。我们对NutriBench 进行了广泛的碳水化合物估计任务评估,测试了包括GPT-4o、Llama3.1、Qwen2、Gemma2 和 OpenBioLLM 模型在内的十二个领先的大规模语言模型(LLMs),使用了标准、链式思维和检索增强生成策略。此外,我们还进行了一项由专业营养师参与的研究,发现LLMs 可以提供与之相当但显著更快的估计。最后,我们通过模拟碳水化合物预测对糖尿病患者血糖水平的影响进行了实际风险评估。我们的工作突显了使用LLMs 进行营养估计的机会和挑战,展示了它们在帮助专业人士和普通人群提高健康结果方面的潜力。我们的基准数据集可在以下网址获取:https://mehak126.github.io/nutribench.html
Summary / 总结
NutriBench is a dataset for evaluating large language models on estimating nutrition from meal descriptions, consisting of 11,857 human-verified meal descriptions. The study evaluates twelve leading LLMs on carbohydrate estimation and, in a comparison with professional nutritionists, finds that LLMs provide comparable estimates significantly faster. The work also simulates the effect of carbohydrate predictions on the blood glucose levels of individuals with diabetes, highlighting both the opportunities and the challenges of using LLMs for nutrition estimation. The benchmark is publicly available at https://mehak126.github.io/nutribench.html.
NutriBench 是一个用于评估大型语言模型从餐食描述中估计营养成分的数据集,包含 11,857 条经过人工验证的餐食描述。该数据集使用十二种领先的 LLM 进行了碳水化合物估计的评估,结果显示 LLM 可以提供与专业营养师相当但更快的估计。研究还评估了碳水化合物预测对糖尿病患者的影响,突出了使用 LLM 进行营养估计的机会和挑战。
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
Venue: ICLR 2026
First: 2025-08-25T17:57:49+00:00 · Latest: 2026-03-03T17:59:41+00:00
Comments: Accepted by ICLR 2026. Code: https://github.com/Ironieser/mmtok , Project Homepage: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, a generic criterion that can be applied to different modalities is lacking. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens simultaneously. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline by a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Finally, with only four vision tokens, 87.7% of the original performance is still preserved on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection. The code is available at https://github.com/Ironieser/mmtok
中文标题/摘要
标题:MMTok:多模态覆盖率最大化以提高VLMs高效推理
视觉-语言模型(VLMs)通过将视觉输入转换为视觉标记来理解带有语言指令的视觉内容,表现出令人印象深刻的性能。然而,视觉标记中的冗余性导致了VLMs推理效率的下降。虽然已经提出了许多算法来减少视觉标记的数量,但大多数算法仅使用单模态信息(即视觉/文本)进行剪枝,忽略了视觉-语言任务的固有多模态特性。此外,缺乏一个适用于不同模态的通用标准。为了解决这一局限性,本文提出利用视觉和文本标记来通过覆盖率标准选择信息性的视觉标记。首先,将子集选择问题形式化为最大覆盖问题。之后,优化一个视觉标记子集,使其同时覆盖文本标记和原始的视觉标记集。所提出的方法MMTok在不同VLMs的基准数据集上进行了广泛评估。比较结果表明,视觉和文本信息是互补的,结合多模态信息可以明显超越单模态基线。此外,在POPE数据集上的最大覆盖标准下,我们的方法在LLaVA-NeXT-13B上实现了1.87倍的速度提升,同时保持了98.7%的原始性能。最后,仅使用四个视觉标记,LLaVA-1.5-7B的原始性能仍保留了87.7%。这些结果突显了覆盖率在标记选择中的有效性。代码可在https://github.com/Ironieser/mmtok 获取。
Summary / 总结
The research aims to improve the inference efficiency of Vision-Language Models (VLMs) by reducing redundant vision tokens while preserving performance. The method, MMTok, leverages both vision and text tokens to select informative vision tokens based on a coverage criterion. Experiments show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup with 98.7% of the original performance on LLaVA-NeXT-13B and maintaining 87.7% of the performance with only four vision tokens on LLaVA-1.5-7B.
研究旨在通过减少冗余的视觉标记来提高视觉语言模型(VLMs)的推理效率,同时保持性能。方法MMTok利用视觉和文本标记根据覆盖标准选择信息性的视觉标记。实验表明,结合多模态信息优于单模态基线,实现了1.87倍的速度提升,并在LLaVA-NeXT-13B上保持了98.7%的原始性能,在LLaVA-1.5-7B上仅使用四个视觉标记仍保持了87.7%的原始性能。
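The coverage idea in the MMTok abstract can be illustrated with a small greedy selection routine. This is a minimal sketch, not the authors' implementation: the cosine similarity, the clipped max-similarity coverage score, and the names `cosine` and `greedy_coverage_select` are all assumptions made for illustration. The abstract only states that a subset of vision tokens is chosen to cover both the text tokens and the original vision tokens.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def greedy_coverage_select(vision_tokens, text_tokens, k):
    """Greedily pick up to k vision-token indices that maximize a coverage score.

    Assumed objective: each target (every text token and every vision token)
    is covered by its highest non-negative similarity to a selected token,
    and the score is the sum of these coverages over all targets.
    """
    targets = list(text_tokens) + list(vision_tokens)
    # sim[i][j]: how well candidate vision token i covers target j
    sim = [[max(0.0, cosine(v, t)) for t in targets] for v in vision_tokens]
    best = [0.0] * len(targets)  # current coverage of each target
    selected = []
    for _ in range(min(k, len(vision_tokens))):
        top_gain, top_idx = -1.0, None
        for i, row in enumerate(sim):
            if i in selected:
                continue
            # Marginal gain: total coverage improvement if token i is added
            gain = sum(max(s - b, 0.0) for s, b in zip(row, best))
            if gain > top_gain:
                top_gain, top_idx = gain, i
        selected.append(top_idx)
        best = [max(b, s) for b, s in zip(best, sim[top_idx])]
    return selected

# Two near-duplicate tokens and one distinct token; the text points at the distinct one.
vision = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
text = [[0.0, 1.0]]
print(greedy_coverage_select(vision, text, k=2))
```

With non-negative similarities, this kind of max-similarity objective is monotone and submodular, which is why a simple greedy loop is a common choice for it; whether MMTok uses exactly this objective or solver is not stated in the abstract.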
ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments
Authors: Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, Gen Luo, Haolan Kang, Shuang Luo, Yue Zhou, Yong Luo, Li Shen, Xiaosong Jia, Yao Mu, Xue Yang, Chunxiao Liu, Junchi Yan, Hengshuang Zhao, Dacheng Tao, Xiaogang Wang
First: 2026-03-03T17:53:45+00:00 · Latest: 2026-03-03T17:53:45+00:00
Comments: Code: https://github.com/ACE-BRAIN-Team/ACE-Brain-0 Hugging Face: https://huggingface.co/ACE-Brain/ACE-Brain-0-8B
Abstract
Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, for existing embodied brains, training a unified model over diverse embodiments frequently triggers long-tail data issues, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
中文标题/摘要
标题:ACE-Brain-0:作为通用体态基础架构的空间智能
通用体态智能要求在异构体态(如自动驾驶、机器人和无人机)之间进行稳健的泛化。然而,现有的体态大脑在训练统一模型时,经常遇到长尾数据、梯度干扰和灾难性遗忘等问题,使得在保持通用泛化的同时兼顾领域特定能力变得极其困难。在本报告中,我们介绍了ACE-Brain-0,这是一种通用基础大脑,它在一个多模态大型语言模型(MLLM)中统一了空间推理、自动驾驶和体态操作。我们的核心见解是,空间智能是跨不同物理体态的通用基础:尽管车辆、机器人和无人机在形态上差异巨大,但它们都对建模三维心理空间有共同的需求,使空间认知成为一种自然的、领域无关的基础,适用于跨体态迁移。基于这一见解,我们提出了支架-专业化-调和(SSR)范式,首先建立共享的空间基础,然后培养领域专业化专家,最后通过无需数据的模型合并使它们和谐共存。此外,我们采用了组相对策略优化(GRPO)以增强模型的综合能力。广泛的实验表明,ACE-Brain-0 在 24 个空间和体态相关的基准测试中实现了具有竞争力甚至达到最先进的性能。
Summary / 总结
The research aims to develop a unified model for universal embodied intelligence across different physical embodiments like autonomous driving, robotics, and UAVs. The method involves a Scaffold-Specialize-Reconcile (SSR) paradigm that first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Key experimental results show that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
研究旨在开发一种统一模型,以实现跨不同物理载体(如自动驾驶、机器人和无人机)的通用嵌入式智能。方法包括一种称为Scaffold-Specialize-Reconcile (SSR)的范式,首先建立共享的空间基础,然后培养领域特定的专家,并通过无数据模型合并最终协调它们。关键实验结果表明,ACE-Brain-0在24个空间和载体相关的基准测试中实现了竞争性和甚至是最先进的性能。
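The ACE-Brain-0 abstract mentions GRPO without giving training details. As a rough illustration of the "group relative" part only, the sketch below normalizes the rewards of several sampled responses to the same prompt against their group mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details taken from the report.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses sampled for one prompt.
    Returns (r - mean) / (std + eps) for each reward; eps (an assumed value)
    avoids division by zero when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that score above their group's average receive a positive advantage and the rest a negative one, so no separately learned value baseline is needed for this step.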
Specificity-aware reinforcement learning for fine-grained open-world classification
Authors: Samuele Angheben, Davide Berasi, Alessandro Conti, Elisa Ricci, Yiming Wang
Venue: CVPR 2026
First: 2026-03-03T17:52:39+00:00 · Latest: 2026-03-03T17:52:39+00:00
Comments: Accepted at CVPR 2026
Abstract
Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.
中文标题/摘要
标题:基于特定性的强化学习在开放世界细粒度分类中的应用
在开放世界设置下对细粒度视觉概念进行分类,即在没有预定义标签集的情况下进行分类,要求模型既准确又具体。最近的推理多模态模型(LMMs)展示了强大的视觉理解能力,但在执行细粒度图像分类时倾向于产生过于通用的预测。初步分析表明,模型确实具备内在的细粒度领域知识。然而,在不牺牲正确性(准确性)的情况下促进更具体预测(特定性)仍然是一个非平凡且未充分研究的挑战。在本文中,我们研究如何引导推理LMMs产生既准确又具体预测。我们提出了一种新颖的基于特定性的强化学习框架SpeciaRL,用于在开放世界设置下对细粒度图像分类进行微调。SpeciaRL引入了一个动态的、基于验证者的奖励信号,该信号锚定在线上滚动中的最佳预测,促进特定性同时尊重模型的能力以防止错误预测。我们的跨域实验表明,SpeciaRL在广泛的细粒度基准测试中提供了准确性与特定性之间的最佳权衡,超越了现有方法并推动了开放世界细粒度图像分类的发展。代码和模型可在https://github.com/s-angheben/SpeciaRL公开获取。
Summary / 总结
This paper addresses the challenge of fine-grained open-world classification by proposing SpeciaRL, a specificity-aware reinforcement learning framework. SpeciaRL aims to make the predictions of reasoning Large Multimodal Models (LMMs) more specific without sacrificing correctness. The framework introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, which promotes specificity while respecting the model's capabilities so that it is not pushed toward incorrect predictions. Out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across various fine-grained benchmarks, surpassing existing methods.
本文提出了一种新颖的特定性感知强化学习框架SpeciaRL,旨在提高推理大型多模态模型(LMMs)预测的特定性,同时不牺牲正确性。该框架引入了一个动态的、基于验证器的奖励信号,该信号以在线rollout(采样轨迹)中的最佳预测为锚点,帮助模型生成更具体且正确的预测。实验结果表明,SpeciaRL 在各种细粒度基准测试中在正确性和特定性之间提供了最佳的权衡,优于现有方法。
Chain of World: World Model Thinking in Latent Motion
Authors: Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Chen Wei, Tonghua Su, Baorui Ma
First: 2026-03-03T17:52:06+00:00 · Latest: 2026-03-03T17:52:06+00:00
Comments: Accepted by CVPR2026. Project page: https://fx-hit.github.io/cowvla-io/
Abstract
Vision-Language-Action (VLA) models are a promising path toward embodied intelligence, yet they often overlook the predictive and temporal-causal structure underlying visual dynamics. World-model VLAs address this by predicting future frames, but waste capacity reconstructing redundant backgrounds. Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new "Chain of World" paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. First, a pretrained video VAE serves as a latent motion extractor, explicitly factorizing video segments into structure and motion latents. Then, during pre-training, the VLA learns from an instruction and an initial frame to infer a continuous latent motion chain and predict the segment's terminal frame. Finally, during co-fine-tuning, this latent dynamic is aligned with discrete action prediction by jointly modeling sparse keyframes and action sequences in a unified autoregressive decoder. This design preserves the world-model benefits of temporal reasoning and world knowledge while retaining the compactness and interpretability of latent actions, enabling efficient visuomotor learning. Extensive experiments on robotic simulation benchmarks show that CoWVLA outperforms existing world-model and latent-action approaches and achieves moderate computational efficiency, highlighting its potential as a more effective VLA pretraining paradigm. The project website can be found at https://fx-hit.github.io/cowvla-io.
中文标题/摘要
标题:世界链:潜在运动的世界模型思维
视觉-语言-动作(VLA)模型是实现具身智能的有希望的道路,但它们往往忽视了视觉动态背后的预测性和时间因果结构。世界模型VLA通过预测未来帧来解决这一问题,但会浪费容量重建冗余背景。潜在动作VLA紧凑地编码帧到帧的过渡,但缺乏连续动态建模和世界知识。为克服这些限制,我们引入了CoWVLA(世界链VLA),这是一种新的“世界链”范式,将世界模型的时间推理与解耦的潜在运动表示统一起来。首先,预训练的视频VAE作为潜在运动提取器,显式地将视频片段分解为结构和运动潜在变量。然后,在预训练期间,VLA从指令和初始帧中学习,以推断连续的潜在运动链并预测片段的终端帧。最后,在联合微调期间,通过联合建模稀疏关键帧和动作序列,将这种潜在动态与离散动作预测对齐。此设计保留了世界模型的时间推理和世界知识的优势,同时保持了潜在动作的紧凑性和可解释性,从而实现高效的视动学习。在机器人模拟基准上的广泛实验表明,CoWVLA在世界模型和潜在动作方法中表现出色,并实现了适度的计算效率,突显了其作为更有效的VLA预训练范式的潜力。项目网站可访问 https://fx-hit.github.io/cowvla-io/
Summary / 总结
CoWVLA (Chain-of-World VLA) addresses the limitations of existing Vision-Language-Action models by unifying world-model temporal reasoning with a disentangled latent motion representation. It uses a pretrained video VAE to extract latent motion, allowing the VLA to learn a continuous latent motion chain and predict the terminal frame during pre-training. During co-fine-tuning, the latent dynamic is aligned with discrete action prediction. Experiments show that CoWVLA outperforms existing approaches and achieves moderate computational efficiency, making it a promising VLA pretraining paradigm.
CoWVLA(Chain-of-World VLA)通过将世界模型的时间推理与解耦的潜在运动表示统一起来,解决了现有Vision-Language-Action模型的局限性。它使用预训练的视频VAE来提取潜在运动,并在预训练过程中,VLA学习根据指令和初始帧推断连续的潜在运动链并预测终端帧。在联合微调过程中,潜在动态与动作预测对齐。实验表明,CoWVLA在性能上优于现有方法,并且具有适度的计算效率,使其成为更有前景的VLA预训练范式。
MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization
Authors: Ashutosh Chaubey, Jiacheng Pang, Mohammad Soleymani
Venue: CVPR 2026
First: 2026-03-03T17:50:24+00:00 · Latest: 2026-03-03T17:50:24+00:00
Comments: CVPR 2026. Project Page: https://mod-dpo.github.io/
Abstract
Omni-modal large language models (omni LLMs) have recently achieved strong performance across audiovisual understanding tasks, yet they remain highly susceptible to cross-modal hallucinations arising from spurious correlations and dominant language priors. In this work, we propose Modality-Decoupled Direct Preference Optimization (MoD-DPO), a simple and effective framework for improving modality grounding in omni LLMs. MoD-DPO introduces modality-aware regularization terms that explicitly enforce invariance to corruptions in irrelevant modalities and sensitivity to perturbations in relevant modalities, thereby reducing unintended cross-modal interactions. To further mitigate over-reliance on textual priors, we incorporate a language-prior debiasing penalty that discourages hallucination-prone text-only responses. Extensive experiments across multiple audiovisual hallucination benchmarks demonstrate that MoD-DPO consistently improves perception accuracy and hallucination resistance, outperforming previous preference optimization baselines under similar training budgets. Our findings underscore the importance of modality-faithful alignment and demonstrate a scalable path toward more reliable and resilient multimodal foundation models.
中文标题/摘要
标题:MoD-DPO:通过模态解耦直接偏好优化减轻全模态大语言模型的跨模态幻觉
全模态大语言模型(全模态LLM)在最近的跨模态理解任务中取得了强大的性能,但它们仍然高度容易受到由虚假相关性和主导语言先验引起的跨模态幻觉的影响。在本文中,我们提出了模态解耦直接偏好优化(MoD-DPO),这是一种简单而有效的框架,用于提高全模态LLM的模态定位。MoD-DPO引入了模态感知的正则化项,明确地强制不变性于无关模态的干扰,并对相关模态的扰动敏感,从而减少不必要的跨模态交互。为了进一步减轻对文本先验的过度依赖,我们引入了一种语言先验去偏置惩罚,以防止产生幻觉的纯文本响应。在多个跨模态幻觉基准上的广泛实验表明,MoD-DPO在感知准确性和幻觉抵抗力方面始终优于以前的偏好优化基线,在相似的训练预算下表现出色。我们的研究结果强调了模态忠实对齐的重要性,并展示了更可靠和更具弹性的多模态基础模型的可扩展路径。
Summary / 总结
The research aims to address the issue of cross-modal hallucinations in omni-modal large language models (omni LLMs) by proposing Modality-Decoupled Direct Preference Optimization (MoD-DPO). MoD-DPO introduces modality-aware regularization terms to enhance modality grounding and reduce unintended cross-modal interactions, while also incorporating a language-prior debiasing penalty to discourage text-only hallucinations. Experiments show that MoD-DPO improves perception accuracy and hallucination resistance, outperforming previous preference optimization methods with similar training budgets.
研究旨在通过提出MoD-DPO框架来解决全模态大语言模型中的跨模态幻觉问题,该框架利用模态感知正则化来提高模态对齐。MoD-DPO减少了无意中的跨模态交互,并引入了语言先验去偏置惩罚以减轻对文本先验的过度依赖。实验表明,MoD-DPO提高了感知准确性和幻觉抵抗力,并在相似的训练预算下优于之前的偏好优化基线方法。
CoPeP: Benchmarking Continual Pretraining for Protein Language Models
Authors: Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
First: 2026-02-27T19:04:32+00:00 · Latest: 2026-03-03T17:50:01+00:00
Comments: 29 pages, 25 figures
Abstract
Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta-information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real-world application.
中文标题/摘要
标题:CoPeP:蛋白质语言模型连续预训练基准测试
蛋白质语言模型(pLMs)因其能够从进化统计中揭示序列、结构和功能之间的关系,从而加速药物发现而引起了广泛关注。这些模型从不断更新的蛋白质数据库中学习,其动态性质促使应用连续学习,不仅为了跟上不断增长的数据,而且利用这一过程中产生的时间元信息。因此,我们引入了连续蛋白质语言模型预训练(CoPeP)基准测试,这是一个用于评估连续学习方法在pLMs上的新型基准测试。具体来说,我们从UniProt知识库中编排了一个跨越十年的蛋白质数据集序列,并定义了评估pLM性能的指标,涵盖31项蛋白质理解任务。我们评估了连续学习文献中的多种方法,包括重放、遗忘和基于可塑性的方法,其中一些方法从未应用于如此规模的模型和数据。我们的研究发现,即使与联合训练所有任务的数据相比,纳入时间元信息也能将困惑度提高多达7%。此外,即使在大规模应用中,多种连续学习方法也优于简单的连续预训练。CoPeP基准测试为在具有重大实际意义的应用中大规模研究这些方法提供了机会。
Summary / 总结
The research aims to evaluate continual learning approaches for protein language models (pLMs) by introducing the CoPeP benchmark. The method involves curating a sequence of protein datasets from the UniProt Knowledgebase over a decade and assessing pLM performance across 31 tasks. Key findings show that incorporating temporal meta-information improves perplexity by up to 7% compared to joint training, and several continual learning methods outperform naive continual pretraining at scale.
研究旨在通过引入CoPeP基准来评估蛋白质语言模型(pLMs)的持续学习方法,该基准使用了来自UniProt知识库的十年跨度的蛋白质数据集序列。研究评估了重放、遗忘和塑性基方法等,并发现将时间元信息纳入考虑可将困惑度提高多达7%,优于联合训练。即使在大规模情况下,也有几种持续学习方法优于简单的持续预训练,为pLMs的有效策略提供了见解。
ProSMA-UNet: Decoder Conditioning for Proximal-Sparse Skip Feature Selection
Authors: Chun-Wun Cheng, Yanqi Cheng, Peiyuan Jing, Guang Yang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
First: 2026-03-03T17:46:29+00:00 · Latest: 2026-03-03T17:46:29+00:00
Abstract
Medical image segmentation commonly relies on U-shaped encoder-decoder architectures such as U-Net, where skip connections preserve fine spatial detail by injecting high-resolution encoder features into the decoder. However, these skip pathways also propagate low-level textures, background clutter, and acquisition noise, allowing irrelevant information to bypass deeper semantic filtering -- an issue that is particularly detrimental in low-contrast clinical imaging. Although attention gates have been introduced to address this limitation, they typically produce dense sigmoid masks that softly reweight features rather than explicitly removing irrelevant activations. We propose ProSMA-UNet (Proximal-Sparse Multi-Scale Attention U-Net), which reformulates skip gating as a decoder-conditioned sparse feature selection problem. ProSMA constructs a multi-scale compatibility field using lightweight depthwise dilated convolutions to capture relevance across local and contextual scales, then enforces explicit sparsity via an $\ell_1$ proximal operator with learnable per-channel thresholds, yielding a closed-form soft-thresholding gate that can remove noisy responses. To further suppress semantically irrelevant channels, ProSMA incorporates decoder-conditioned channel gating driven by global decoder context. Extensive experiments on challenging 2D and 3D benchmarks demonstrate state-of-the-art performance, with particularly large gains ($\approx20$\%) on difficult 3D segmentation tasks. Project page: https://math-ml-x.github.io/ProSMA-UNet/
中文标题/摘要
标题:ProSMA-UNet: 解码器条件下的近似稀疏跳连特征选择
医学图像分割通常依赖于U形编码器-解码器架构,如U-Net,其中跳连路径通过将高分辨率的编码器特征注入解码器来保留精细的空间细节。然而,这些跳连路径也会传播低级纹理、背景杂乱和采集噪声,使无关信息绕过了深层语义过滤——这对低对比度临床成像尤其不利。尽管已经引入了注意力门控来解决这一限制,但它们通常生成密集的Sigmoid掩码,软性重新加权特征,而不是明确去除无关激活。我们提出了一种ProSMA-UNet(近似稀疏多尺度注意力U-Net),它将跳连门控重新表述为解码器条件下的稀疏特征选择问题。ProSMA 使用轻量级深度可分离卷积构建多尺度相关性场,以捕捉局部和上下文尺度的相关性,然后通过可学习的通道阈值的$\ell_1$近邻算子强制显式稀疏性,从而产生一个闭式软阈值门控,可以去除噪声响应。为了进一步抑制语义无关的通道,ProSMA 结合了由全局解码器上下文驱动的解码器条件通道门控。在具有挑战性的2D和3D基准上的广泛实验表明,其性能达到了最先进的水平,特别是在困难的3D分割任务上取得了显著的提升(约20%)。项目页面:https://math-ml-x.github.io/ProSMA-UNet/
Summary / 总结
ProSMA-UNet addresses the issue of irrelevant information propagation in U-Net architectures by proposing a decoder-conditioned sparse feature selection method. It uses multi-scale compatibility fields and an $\ell_1$ proximal operator to create a soft-thresholding gate that removes noisy responses. Experiments show that ProSMA-UNet achieves state-of-the-art performance, especially on 3D segmentation tasks, with improvements of about 20% on difficult tasks.
ProSMA-UNet通过提出解码器条件下的稀疏特征选择方法来解决U-Net架构中无关信息传播的问题。它使用多尺度兼容性字段和$\ell_1$近端算子来创建一个软阈值门,以去除噪声响应。实验表明,ProSMA-UNet在3D分割任务上实现了最先进的性能,特别是在困难任务上的改进达到了约20%。
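The closed-form gate named in the ProSMA-UNet abstract rests on a standard fact: the proximal operator of tau * |x| is soft-thresholding, sign(x) * max(|x| - tau, 0). The sketch below applies it channel by channel to plain Python lists. It is a minimal sketch: how the paper builds the compatibility field and combines the gate with skip features is not described in the abstract, and the names `soft_threshold` and `sparse_gate` are hypothetical.

```python
def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrink x toward zero by tau.

    Values with |x| <= tau become exactly 0.0, which is what makes the
    gate sparse rather than a soft reweighting.
    """
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def sparse_gate(features, thresholds):
    """Apply soft-thresholding with one threshold per channel.

    features: list of channels, each a list of activations.
    thresholds: one non-negative threshold per channel (these would be
    learnable parameters in a real model; fixed numbers here).
    """
    if len(features) != len(thresholds):
        raise ValueError("need exactly one threshold per channel")
    return [[soft_threshold(v, tau) for v in channel]
            for channel, tau in zip(features, thresholds)]

# Channel 0 uses a small threshold, channel 1 a large one.
print(sparse_gate([[1.5, -0.25, 0.5], [2.0, -2.0, 0.75]], [0.5, 1.0]))
```

Unlike a sigmoid mask, which scales every activation by a value strictly between 0 and 1, this operator sets small responses to exactly zero, matching the abstract's contrast between explicit removal and soft reweighting.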
Echoing: Identity Failures when LLM Agents Talk to Each Other
Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese
Venue: ICLR
First: 2025-11-12T20:17:10+00:00 · Latest: 2026-03-03T17:39:30+00:00
Comments: Published at ICLR workshop. Related blog post: https://www.salesforce.com/blog/agent-to-agent-interaction/
Abstract
As large language model (LLM) based agents interact autonomously with one another, a new class of failures emerges that cannot be predicted from single agent performance: behavioral drifts in agent-agent conversations (AxA). Unlike human-agent interactions, where humans ground and steer conversations, AxA lacks such stabilizing signals, making these failures unique. We investigate one such failure, echoing, where agents abandon their assigned roles and instead mirror their conversational partners, undermining their intended objectives. Through experiments across 66 AxA configurations, 4 domains (3 transactional, 1 advisory), and 2500+ conversations (over 250000 LLM inferences), we show that echoing occurs across major LLM providers, with echoing rates as high as 70% depending on the model and domain. Moreover, we find that echoing persists even in advanced reasoning models, with substantial rates (32.8%) that are not reduced by increased reasoning effort. We analyze prompts and conversation dynamics, showing that echoing arises as interactions grow longer (7+ agent turns) and is not merely an artifact of sub-optimal experiment design. Finally, we introduce a protocol-level mitigation where targeted use of structured responses reduces echoing to 9%.
中文标题/摘要
标题:回声效应:LLM代理相互交流时的身份失败
当基于大型语言模型(LLM)的代理自主相互交互时,会出现一种新的失败类型,这种失败无法从单个代理的表现中预测:代理-代理对话中的行为漂移(AxA)。与人类-代理交互不同,人类可以为对话提供稳定信号,AxA缺乏这种稳定信号,使这些失败变得独特。我们研究了一种名为回声效应的失败,其中代理放弃其分配的角色,反而模仿其对话伙伴,从而削弱其预定目标。通过在66种AxA配置、4个领域(3个交易性,1个咨询性)和2500多次对话(超过250000次LLM推理)中进行实验,我们展示了回声效应在主要LLM提供商中普遍存在,回声率高达70%,具体取决于模型和领域。此外,我们发现即使在具有大量推理能力的模型中,回声效应也具有持久性,推理努力并不能降低其发生率(32.8%)。我们分析了提示和对话动态,表明回声效应随着交互增长而出现(超过7次代理回合),并非仅仅是实验设计不佳的产物。最后,我们提出了一种协议级缓解措施,通过有针对性地使用结构化响应,将回声效应降低至9%。
Summary / 总结
The study investigates a new class of failures, echoing, in autonomous interactions between large language model (LLM) agents, where agents abandon their roles and mirror their conversational partners. Through extensive experiments involving 66 configurations across 4 domains, the research shows echoing rates up to 70%, even in advanced reasoning models, and introduces a mitigation strategy that reduces echoing to 9%.
研究探讨了大型语言模型(LLM)代理之间自主交互中出现的一种新类失败现象——回声效应,即代理镜像其对话伙伴而非执行其分配的角色。通过在66种配置、4个领域和超过2500场对话中进行的大量实验,研究发现回声效应在不同LLM提供商中普遍存在,最高可达70%的回声率,并且即使在高级推理模型中,这种现象也持续存在且不受推理努力的影响。研究还指出,回声效应随对话时间增长而增加,且不是由于实验设计问题造成的,但通过使用结构化响应,回声效应可以被降低到9%。
FEAST: Retrieval-Augmented Multi-Hierarchical Food Classification for the FoodEx2 System
Authors: Lorenzo Molfetta, Alessio Cocchieri, Stefano Fantazzini, Giacomo Frisoni, Luca Ragazzi, Gianluca Moro
Venue: ECAI 2025: 28th European Conference on Artificial Intelligence, Frontiers in Artificial Intelligence and Applications (FAIA), 2025, pages 4169-4176
First: 2026-03-03T17:31:32+00:00 · Latest: 2026-03-03T17:31:32+00:00
Comments: Accepted for publication at ECAI 2025. Please cite the definitive, copyrighted, peer reviewed and edited version of this Article published in ECAI 2025, edited by I. Lynce et al., FAIA, pp. 4169-4176, 2025. DOI: https://doi.org/10.3233/FAIA251309
Abstract
Hierarchical text classification (HTC) and extreme multi-label classification (XML) tasks face compounded challenges from complex label interdependencies, data sparsity, and extreme output dimensions. These challenges are exemplified in the European Food Safety Authority's FoodEx2 system, a standardized food classification framework essential for food consumption monitoring and contaminant exposure assessment across Europe. FoodEx2 coding transforms natural language food descriptions into a set of codes from multiple standardized hierarchies, but faces implementation barriers due to its complex structure. Given a food description (e.g., "organic yogurt"), the system identifies its base term ("yogurt"), all the applicable facet categories (e.g., "production method"), and then every relevant facet descriptor for each category (e.g., "organic production"). While existing models perform adequately on well-balanced and semantically dense hierarchies, no prior work has addressed the practical constraints imposed by the FoodEx2 system. The limited literature addressing such real-world scenarios further compounds these challenges. We propose FEAST (Food Embedding And Semantic Taxonomy), a novel retrieval-augmented framework that decomposes FoodEx2 classification into a three-stage approach: (1) base term identification, (2) multi-label facet prediction, and (3) facet descriptor assignment. By leveraging the system's hierarchical structure to guide training and performing deep metric learning, FEAST learns discriminative embeddings that mitigate data sparsity and improve generalization on rare and fine-grained labels. Evaluated on the multilingual FoodEx2 benchmark, FEAST outperforms the prior European's CNN baseline F1 scores by 12-38% on rare classes.
中文标题/摘要
标题:FEAST:增强检索的多层级食品分类系统用于FoodEx2
层次文本分类(HTC)和极端多标签分类(XML)任务面临着复杂标签相互依赖性、数据稀疏性和极端输出维度的复合挑战。这些挑战在欧洲食品安全局的FoodEx2系统中得到了体现——这是一个标准化的食品分类框架,对于欧洲范围内的食品消费监测和污染物暴露评估至关重要。FoodEx2编码将自然语言食品描述转换为多个标准化层级的一组代码,但由于其复杂的结构,实施存在障碍。给定一个食品描述(例如,“有机酸奶”),系统会识别其基词(“酸奶”),所有适用的方面类别(例如,“生产方法”),以及每个类别的相关方面描述(例如,“有机生产”)。虽然现有的模型在平衡且语义密集的层级上表现良好,但没有工作针对FoodEx2系统施加的实际约束条件。有限的相关文献进一步加剧了这些挑战。我们提出了FEAST(食品嵌入和语义分类法),这是一种新颖的检索增强框架,将FoodEx2分类分解为三个阶段:(1)基词识别,(2)多标签方面预测,(3)方面描述分配。通过利用系统的层级结构来指导训练并进行深度度量学习,FEAST学习了区分性嵌入,缓解了数据稀疏性并提高了对稀有和细粒度标签的一般化能力。在多语言FoodEx2基准测试上,FEAST在稀有类别的F1分数上比之前欧洲的CNN基线高出12-38%。
Saarthi for AGI: Towards Domain-Specific General Intelligence for Formal Verification
Authors: Aman Kumar, Deepak Narayan Gadde, Luu Danh Minh, Vaisakh Naduvodi Viswambharan, Keerthan Kopparam Radhakrishna, Sivaram Pothireddypalli
First: 2026-03-03T17:30:36+00:00 · Latest: 2026-03-03T17:30:36+00:00
Comments: Published at the DVCon U.S. 2026
Abstract
Saarthi is an agentic AI framework that uses multi-agent collaboration to perform end-to-end formal verification. Even though the framework provides a complete flow from specification to coverage closure, with around 40% efficacy, there are several challenges that need to be addressed to make it more robust and reliable. Artificial General Intelligence (AGI) is still a distant goal, and current Large Language Model (LLM)-based agents are prone to hallucinations and making mistakes, especially when dealing with complex tasks such as formal verification. However, with the right enhancements and improvements, we believe that Saarthi can be a significant step towards achieving domain-specific general intelligence for formal verification. Especially for problems that require Short Term, Short Context (STSC) capabilities, such as formal verification, Saarthi can be a powerful tool to assist verification engineers in their work. In this paper, we present two key enhancements to the Saarthi framework: (1) a structured rulebook and specification grammar to improve the accuracy and controllability of SystemVerilog Assertion (SVA) generation, and (2) integration of advanced Retrieval Augmented Generation (RAG) techniques, such as GraphRAG, to provide agents with access to technical knowledge and best practices for iterative refinement and improvement of outputs. We also benchmark these enhancements for the overall Saarthi framework using challenging test cases from NVIDIA's CVDP benchmark targeting formal verification. Our benchmark results stand out with a 70% improvement in the accuracy of generated assertions, and a 50% reduction in the number of iterations required to achieve coverage closure.
中文标题/摘要
标题:Saarthi 用于 AGI:迈向形式验证领域的专用通用智能
Saarthi 是一种代理型 AI 框架,利用多智能体协作进行端到端的形式验证。尽管该框架从规范到覆盖率闭合提供了一个完整的流程,但其有效性约为 40%,仍需解决多个挑战以使其更加稳健可靠。通用人工智能(AGI)仍然是一个遥远的目标,当前基于大型语言模型(LLM)的代理容易产生幻觉并出错,尤其是在处理复杂任务如形式验证时。然而,通过适当的增强和改进,我们相信 Saarthi 可以成为实现形式验证领域专用通用智能的重要一步。特别是对于需要短期、短期上下文(STSC)能力的问题,如形式验证,Saarthi 可以成为协助验证工程师工作的强大工具。在本文中,我们提出了对 Saarthi 框架的两个关键增强:(1)结构化的规则手册和规范语法,以提高 SystemVerilog Assertion (SVA) 生成的准确性和可控性;(2)集成先进的检索增强生成(RAG)技术,如 GraphRAG,以使代理能够访问技术知识和最佳实践,从而实现输出的迭代改进。我们还使用 NVIDIA 的 CVDP 基准测试中的挑战性测试案例对整体 Saarthi 框架进行了基准测试。我们的基准测试结果显示出生成断言准确性的 70% 提高,以及覆盖率闭合所需迭代次数减少 50%。
Summary / 总结
Saarthi is an AI framework for formal verification using multi-agent collaboration. The paper presents two key enhancements: a structured rulebook for improving SVA generation accuracy and an integration of advanced RAG techniques to provide agents with technical knowledge. These enhancements led to a 70% improvement in assertion accuracy and a 50% reduction in iterations for coverage closure, demonstrating significant progress towards domain-specific general intelligence for formal verification.
Saarthi 是一个使用多智能体协作进行形式验证的 AI 框架。论文提出了两项改进:结构化的规则书以提高 SVA 生成的准确性,以及集成 GraphRAG 以访问技术知识。这些改进使得生成断言的准确性提高了 70%,并且减少了 50% 的迭代次数以达到覆盖率闭合。
MultiSessionCollab: Learning User Preferences with Memory to Improve Long-Term Collaboration
Authors: Shuhaib Mehri, Priyanka Kargupta, Tal August, Dilek Hakkani-Tür
First: 2026-01-06T04:26:22+00:00 · Latest: 2026-03-03T17:29:36+00:00
Abstract
As conversational agents accumulate experience collaborating with users, adapting to user preferences is essential for fostering long-term relationships and improving collaboration quality over time. We introduce MultiSessionCollab, a benchmark that evaluates how well agents can learn user preferences and leverage them to improve collaboration quality throughout multiple sessions. To develop agents that succeed in this setting, we present long-term collaborative agents equipped with a memory that persists and refines user preference as interaction experience accumulates. Moreover, we demonstrate that learning signals can be derived from user simulator behavior in MultiSessionCollab to train agents to generate more comprehensive reflections and update their memory more effectively. Extensive experiments show that equipping agents with memory improves long-term collaboration, yielding higher task success rates, more efficient interactions, and reduced user effort. Finally, we conduct a human user study that demonstrates that memory helps improve user experience in real-world settings.
中文标题/摘要
标题:MultiSessionCollab:通过记忆学习用户偏好以提高长期协作质量
随着对话代理与用户合作的经验积累,适应用户偏好对于培养长期关系并随着时间的推移提高协作质量至关重要。我们介绍了MultiSessionCollab基准,评估代理如何学习用户偏好并利用这些偏好在多个会话中提高协作质量。为了开发能够在这种环境中取得成功的代理,我们提出了配备记忆的长期协作代理,该记忆能够持久保存,并随着交互经验的积累不断完善用户偏好。此外,我们展示了可以从MultiSessionCollab中的用户模拟器行为中推导出学习信号,以训练代理生成更全面的反思并更有效地更新其记忆。大量实验表明,为代理配备记忆可以改善长期协作,带来更高的任务成功率、更高效的交互以及更少的用户精力投入。最后,我们进行了一项人类用户研究,证明记忆在实际场景中有助于改善用户体验。
Summary / 总结
The research aims to improve long-term collaboration between conversational agents and users by adapting to user preferences. MultiSessionCollab is introduced as a benchmark to evaluate agents' ability to learn and use user preferences across multiple sessions. The study shows that agents equipped with memory perform better, achieving higher task success rates, more efficient interactions, and reduced user effort. A human study further confirms that memory enhances user experience in real-world settings.
研究旨在通过适应用户偏好来提高对话代理与用户之间的长期合作。引入了MultiSessionCollab作为评估代理在多个会话中学习和利用用户偏好的基准。研究开发了具有持久记忆的代理,该记忆会随着交互经验的积累而不断优化用户偏好,并表明这种方法提高了任务成功率、交互效率和用户体验。人类研究进一步证实了在实际场景中使用记忆的好处。
Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach
Authors: Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li
First: 2026-03-01T17:23:29+00:00 · Latest: 2026-03-03T17:17:30+00:00
Abstract
Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape via the local learning coefficient (LLC), a measure of the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging this theory, we interpret grokking in quadratic networks as a phase transition between competing near-zero-loss solution basins. Our contributions are two-fold: we derive closed-form expressions for the LLC in quadratic networks trained on modular arithmetic tasks, with corresponding empirical verification; and we provide empirical evidence that LLC trajectories are a reliable tool for tracking generalisation dynamics and interpreting phase transitions during training.
中文标题/摘要
标题:Grokking 作为竞争盆地之间的相变:一种奇异学习理论方法
Grokking 指在长时间训练后从记忆到泛化的突然转变,这一现象表明存在具有不同统计特性的竞争性解盆地。我们通过奇异学习理论(SLT)的视角研究这一现象,SLT是一种贝叶斯框架,通过局部学习系数(LLC)来表征损失景观的几何结构,LLC衡量损失表面的局部退化程度。SLT将较低LLC的盆地与较高的后验质量集中度和较低的预期泛化误差联系起来。利用这一理论,我们将二次网络中的 grokking 解释为竞争性近零损失解盆地之间的相变。我们的贡献有两个方面:我们推导出在模算术任务上训练的二次网络的LLC的闭式表达式,并进行了相应的实证验证;同时给出实证证据,表明LLC轨迹可以作为跟踪泛化动态和解释训练期间相变的可靠工具。
Summary / 总结
The study investigates grokking, an abrupt transition from memorization to generalization after extended training, by examining competing solution basins using Singular Learning Theory (SLT). The research derives closed-form expressions for the local learning coefficient (LLC) in quadratic networks trained on modular arithmetic tasks and empirically verifies these expressions. The key finding is that LLC trajectories can reliably track generalization dynamics and phase transitions during training, linking lower-LLC basins to higher posterior mass concentration and lower expected generalization error.
研究将 grokking 视为竞争性解盆地之间的相变,探讨这种在长时间训练后从记忆到泛化的突然转变。研究利用奇异学习理论(SLT),通过局部学习系数(LLC)来表征损失景观,LLC 衡量损失表面的局部退化程度。研究显示,较低 LLC 的盆地与较高的后验质量集中度和较低的预期泛化误差相关。在模算术任务上训练的二次网络的实验证据支持了这些发现,并且 LLC 轨迹被发现是跟踪泛化动态和解释训练期间相变的可靠工具。
EP-GAT: Energy-based Parallel Graph Attention Neural Network for Stock Trend Classification
Authors: Zhuodong Jiang, Pengju Zhang, Peter Martin
First: 2025-07-10T21:45:09+00:00 · Latest: 2026-03-03T17:07:33+00:00
Comments: Accepted by IJCNN 2025, oral presentation
Abstract
Graph neural networks have shown remarkable performance in forecasting stock movements, which arises from learning complex inter-dependencies between stocks and intra-dynamics of stocks. Existing approaches based on graph neural networks typically rely on static or manually defined factors to model changing inter-dependencies between stocks. Furthermore, these works often struggle to preserve hierarchical features within stocks. To bridge these gaps, this work presents the Energy-based Parallel Graph Attention Neural Network, a novel approach for predicting future movements for multiple stocks. First, it generates a dynamic stock graph with the energy difference between stocks and Boltzmann distribution, capturing evolving inter-dependencies between stocks. Then, a parallel graph attention mechanism is proposed to preserve the hierarchical intra-stock dynamics. Extensive experiments on five real-world datasets are conducted to validate the proposed approach, spanning US stock markets (NASDAQ, NYSE, SP) and UK stock markets (FTSE, LSE). The experimental results demonstrate that EP-GAT consistently outperforms five competitive baselines on test periods across various metrics. The ablation studies and hyperparameter sensitivity analysis further validate the effectiveness of each module in the proposed method. The raw dataset and code are available at https://github.com/theflash987/EP-GAT.
中文标题/摘要
标题:EP-GAT:基于能量的并行图注意力神经网络在股票趋势分类中的应用
图神经网络在预测股票动向方面表现出色,这得益于其学习股票之间复杂依赖关系和股票内部动态的能力。现有基于图神经网络的方法通常依赖静态或手动定义的因素来建模股票之间变化的依赖关系,且往往难以保留股票内的层次特征。为弥补这些不足,本文提出了一种基于能量的并行图注意力神经网络,这是一种用于预测多只股票未来动向的新方法。首先,通过计算股票之间的能量差和玻尔兹曼分布生成动态股票图,捕捉股票之间的演变依赖关系。然后,提出了一种并行图注意力机制以保留股票内部的层次动态。在来自美国股票市场(NASDAQ、NYSE、SP)和英国股票市场(FTSE、LSE)的五个真实数据集上进行了广泛的实验以验证该方法的有效性。实验结果表明,EP-GAT在测试期内的各项指标上始终优于五个有竞争力的基线方法。消融研究和超参数敏感性分析进一步验证了所提方法中每个模块的有效性。原始数据集和代码可在 https://github.com/theflash987/EP-GAT 获取。
Summary / 总结
This paper introduces EP-GAT, a novel Energy-based Parallel Graph Attention Neural Network designed for predicting stock movements. It generates a dynamic stock graph using the energy difference and Boltzmann distribution to capture evolving inter-dependencies, and employs a parallel graph attention mechanism to preserve hierarchical intra-stock dynamics. Experiments on five real-world datasets show that EP-GAT outperforms five competitive baselines across various metrics, and ablation studies and hyperparameter sensitivity analysis further validate the effectiveness of each module.
本文提出了一种新型的基于能量的并行图注意力神经网络EP-GAT,用于股票趋势分类。该方法利用能量差异和玻尔兹曼分布生成动态股票图,以捕捉股票间不断变化的依赖关系。提出了一种并行图注意力机制来保留股票内的层次动态。在五个真实世界数据集上的实验表明,EP-GAT在各种指标上均优于五个竞争性基线,通过消融研究和超参数敏感性分析进一步验证了该方法的有效性。
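The dynamic-graph idea in the EP-GAT entry above (edge weights derived from energy differences through a Boltzmann distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `boltzmann_adjacency`, the use of a single scalar energy per stock, the absolute difference as the energy gap, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def boltzmann_adjacency(energies, temperature=1.0):
    """Row-normalised edge weights from pairwise energy differences.

    Hypothetical helper: each stock i has one scalar energy E_i, and the
    weight of edge i -> j is proportional to exp(-|E_i - E_j| / T).
    """
    n = len(energies)
    adjacency = []
    for i in range(n):
        # Unnormalised Boltzmann factors; self-loops are left at zero.
        row = [
            0.0 if i == j
            else math.exp(-abs(energies[i] - energies[j]) / temperature)
            for j in range(n)
        ]
        total = sum(row)
        # Normalise so each row is a probability distribution over neighbours.
        adjacency.append([w / total for w in row] if total > 0 else row)
    return adjacency


# Made-up energies for three stocks; the first two are close in energy.
adj = boltzmann_adjacency([0.1, 0.2, 1.5])
```

Under these assumptions, stocks with similar energies receive larger mutual weights, and recomputing the matrix at each time step yields a graph that changes with the data.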
Kling-MotionControl Technical Report
Authors: Kling Team, Jialu Chen, Yikang Ding, Zhixue Fang, Kun Gai, Kang He, Xu He, Jingyun Hua, Mingming Lao, Xiaohan Li, Hui Liu, Jiwen Liu, Xiaoqiang Liu, Fan Shi, Xiaoyu Shi, Peiqin Sun, Songlin Tang, Pengfei Wan, Tiancheng Wen, Zhiyong Wu, Haoxian Zhang, Runze Zhao, Yuanxing Zhang, Yan Zhou
First: 2026-03-03T17:02:45+00:00 · Latest: 2026-03-03T17:02:45+00:00
Comments: Access: https://app.klingai.com/global/video-motion-control/new
Abstract
Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.
中文标题/摘要
标题:Kling-MotionControl 技术报告
角色动画旨在通过将驱动视频中的运动动态迁移到参考图像上来生成逼真的视频。生成模型的最新进展为高保真角色动画铺平了道路。在此项工作中,我们提出了Kling-MotionControl,这是一种基于DiT的统一框架,专为稳健、精确且富有表现力的整体角色动画而设计。通过在统一系统中采用分而治之的策略,该模型协调了针对身体、面部和手部不同特征定制的异构运动表示,有效地兼顾了大尺度的结构稳定性和细粒度的动作表现力。为了确保跨身份的稳健泛化,我们引入了自适应身份无关学习,使从写实人类到风格化卡通的各类角色都能实现自然的运动重定向。同时,我们通过细致的身份注入和融合设计确保了忠实的外观保留,并辅以利用全面参考上下文的主体库机制。为了确保实用价值,我们实现了一个先进的加速框架,利用多阶段蒸馏,将推理速度提高了超过10倍。Kling-MotionControl 的突出之处在于智能的语义运动理解和精确的文本响应能力,从而支持超越视觉输入的灵活控制。人类偏好评估表明,Kling-MotionControl 优于领先的商业和开源解决方案,在整体运动控制、开放域泛化以及视觉质量和连贯性方面实现了卓越的保真度。这些结果确立了Kling-MotionControl 作为高质量、可控和逼真角色动画的稳健解决方案的地位。
Summary / 总结
Kling-MotionControl is a unified DiT-based framework designed for robust, precise, and expressive holistic character animation. It employs a divide-and-conquer strategy to handle body, face, and hand motions separately, ensuring large-scale structural stability and fine-grained articulatory expressiveness. The model incorporates adaptive identity-agnostic learning for natural motion retargeting and meticulous identity injection to preserve appearance. An advanced acceleration framework boosts inference speed by over 10x. Experimental results show that Kling-MotionControl outperforms leading commercial and open-source solutions in terms of holistic motion control, open domain generalization, and visual quality and coherence, making it a robust solution for high-quality character animation.
Kling-MotionControl 是一个基于 DiT 的统一框架,用于实现稳健、精确且富有表现力的整体角色动画。它采用分而治之的策略分别处理身体、面部和手部动作,兼顾大尺度的结构稳定性和细粒度的动作表现力。该模型结合了自适应身份无关学习以实现自然的运动重定向,并通过身份注入和融合设计保持外观的一致性。此外,它还包含一个先进的加速框架,可将推理速度提升超过 10 倍。实验结果表明,Kling-MotionControl 在整体运动控制、开放域泛化以及视觉质量和连贯性方面优于领先的商业和开源解决方案。
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
Authors: Xiaolong Zeng, Yitong Yu, Shiyao Xiong, Jinhua Hao, Ming Sun, Chao Zhou, Bin Wang
Venue: CVPR 2026
First: 2026-03-01T04:00:23+00:00 · Latest: 2026-03-03T17:01:47+00:00
Comments: Accepted to CVPR 2026
Abstract
Look-Up Table based methods have emerged as a promising direction for efficient image restoration tasks. Recent LUT-based methods focus on improving their performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment in edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, a Learnable Spatial Shift (LSS) module is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets on feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8$\times$ larger receptive field and improves average PSNR by over 0.21 dB across multiple standard benchmarks, while maintaining a small storage size and inference time. The code is available at: https://github.com/Sailor-t/ShiftLUT.
中文标题/摘要
标题:ShiftLUT:空间位移增强查找表以实现高效的图像恢复
基于查找表的方法已成为高效图像恢复任务的一个有前途的方向。最近的基于查找表的方法侧重于通过扩大感受野来提高其性能。然而,这不可避免地引入了额外的计算和存储开销,这阻碍了它们在边缘设备上的部署。为了解决这个问题,我们提出了一种名为ShiftLUT的新框架,该框架在所有基于查找表的方法中具有最大的感受野,同时保持高效率。我们的关键见解在于三个互补的组件。首先,引入了可学习的空间位移模块(LSS),通过在特征图上应用可学习的通道级空间偏移来扩大感受野。其次,我们提出了一种不对称双分支架构,将更多的计算分配给信息密集分支,显著减少了推理延迟,同时不牺牲恢复质量。最后,我们引入了一种特征级查找表压缩策略,称为误差有界自适应采样(EAS),以最小化存储开销。与之前的最先进的方法TinyLUT相比,ShiftLUT的感受野大3.8倍,在多个标准基准上的平均PSNR提高了超过0.21 dB,同时保持了较小的存储大小和推理时间。代码可在 https://github.com/Sailor-t/ShiftLUT 获取。
Summary / 总结
ShiftLUT is a novel framework for efficient image restoration that addresses the computational and storage overhead issues of previous LUT-based methods. It introduces a Learnable Spatial Shift module to expand the receptive field, an asymmetric dual-branch architecture to reduce inference latency, and a feature-level LUT compression strategy. Compared to TinyLUT, ShiftLUT achieves a larger receptive field and higher PSNR while maintaining efficiency.
ShiftLUT 是一种新颖的框架,通过扩展感受野来提高图像恢复的效率,同时不增加计算和存储开销。它引入了可学习的空间偏移模块来扩展感受野,不对称的双分支架构来减少推理延迟,并采用误差有界自适应采样策略来最小化存储。与 TinyLUT 相比,ShiftLUT 在多个标准基准上的平均 PSNR 提高了超过 0.21 dB,同时保持了较小的存储和推理时间。
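To make the channel-wise spatial offset idea in the ShiftLUT entry above concrete, here is a minimal sketch that shifts each channel of a feature map by its own offset. It is only an illustration: the helper names `shift_channel` and `spatial_shift`, the integer offsets, and the zero padding are assumptions, and the offsets are fixed here rather than learned as in the LSS module described in the abstract.

```python
def shift_channel(channel, dy, dx, fill=0.0):
    """Shift one H x W channel by (dy, dx) pixels, padding with `fill`."""
    height, width = len(channel), len(channel[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            # Copy only where the source pixel lies inside the feature map.
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = channel[src_y][src_x]
    return shifted


def spatial_shift(features, offsets):
    """Apply a separate (dy, dx) offset to every channel of a C x H x W map."""
    return [shift_channel(ch, dy, dx) for ch, (dy, dx) in zip(features, offsets)]


# Two 2x2 channels: shift the first right by one pixel, the second down by one.
features = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
shifted = spatial_shift(features, [(0, 1), (1, 0)])
```

Because each channel is displaced in a different direction, a pointwise operation applied afterwards sees values from several neighbouring positions, which is one way to enlarge the receptive field without a larger kernel or look-up table.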
PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
Authors: Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang
Venue: ICRA 2026
First: 2025-09-29T03:20:49+00:00 · Latest: 2026-03-03T16:49:04+00:00
Comments: ICRA 2026
Abstract
Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Code released: https://github.com/siyandong/PROFusion.
中文标题/摘要
标题:PROFusion:通过相机姿态回归与优化实现稳健且准确的密集重建
在不稳定相机运动期间实时进行密集场景重建对于机器人技术至关重要,但当前的RGB-D SLAM系统在相机经历大幅度视角变化、快速运动或突然震动时会失效。经典的基于优化的方法精度高,但在大幅运动下会因初始化不佳而失效;基于学习的方法虽然具有鲁棒性,但在密集重建方面缺乏足够的精度。我们通过结合基于学习的初始化与基于优化的细化来应对这一挑战。我们的方法使用相机姿态回归网络从连续的RGB-D帧中预测出具有度量感知的相对姿态,这些姿态作为随机优化算法的可靠起始点,进一步将深度图像与场景几何对齐。大量实验表明,我们的方法在具有挑战性的基准测试中优于最佳竞争对手,同时在稳定运动序列上保持了相当的精度。该系统实时运行,展示了简单且有原则的技术相结合可以同时实现不稳定运动下的鲁棒性和密集重建的准确性。代码已发布:https://github.com/siyandong/PROFusion。
Summary / 总结
The research aims to improve real-time dense scene reconstruction during unstable camera motions, which is crucial for robotics. The method combines a camera pose regression network for robust initialization with an optimization-based refinement algorithm to align depth images accurately. Experiments show that this approach outperforms existing methods on challenging benchmarks while maintaining accuracy on stable motion sequences, and operates in real-time. The system demonstrates that integrating simple and principled techniques can achieve both robustness and accuracy for dense reconstruction. Code is available at https://github.com/siyandong/PROFusion.
研究旨在解决不稳定的相机运动下实时密集场景重建的问题,这对于机器人技术至关重要。方法结合了基于学习的相机姿态回归网络进行鲁棒初始化,以及基于优化的细化算法对深度图像与场景几何进行对齐。实验表明,PROFusion方法在具有挑战性的基准测试中优于现有方法,同时在稳定运动序列上保持了准确性,并且能够实时运行。
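The PROFusion entry above describes using a predicted relative pose as the starting point for optimization-based refinement. A minimal sketch of that initialization step follows, assuming 4x4 homogeneous transforms and the convention T_world_cur = T_world_prev * T_prev_cur. The helper names are hypothetical, the "predicted" pose is a hard-coded stand-in for a regression network output, and the refinement stage itself is not shown.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def translation(tx, ty, tz):
    """Homogeneous transform with identity rotation and the given translation."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def initial_pose(prev_world_pose, relative_pose):
    """Chain the previous camera pose with the predicted relative pose.

    Assumed convention: T_world_cur = T_world_prev * T_prev_cur.
    """
    return matmul4(prev_world_pose, relative_pose)


prev_pose = translation(1.0, 0.0, 0.0)          # pose of the previous frame
predicted_relative = translation(0.0, 0.5, 0.0)  # stand-in for a network output
init = initial_pose(prev_pose, predicted_relative)
```

The resulting `init` would then be handed to whatever optimizer aligns the depth image with the scene geometry; a good initial guess matters most when the motion between frames is large.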
MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos
Authors: Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao
First: 2026-02-02T14:07:00+00:00 · Latest: 2026-03-03T16:46:06+00:00
Abstract
We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.
中文标题/摘要
标题:MLV-Edit:迈向分钟级视频的一致且高效编辑
我们提出了一种无需训练、基于流的框架MLV-Edit,以应对分钟级视频编辑的独特挑战。尽管现有技术在短视频编辑方面表现出色,但将它们扩展到长时视频仍然具有挑战性,因为计算开销巨大且难以在数千帧中保持全局时间一致性。为了解决这个问题,MLV-Edit 采用分而治之的策略进行分段编辑,通过两个核心模块实现:Velocity Blend 通过对齐相邻片段的流场来纠正片段边界处的运动不一致性,消除分段视频处理中常见的闪烁和边界伪影;Attention Sink 将局部片段特征锚定到全局参考帧,有效抑制累积结构漂移。大量定量和定性实验表明,MLV-Edit 在时间稳定性和语义保真度方面始终优于现有最先进的方法。
Summary / 总结
MLV-Edit is a training-free, flow-based framework designed to handle the unique challenges of editing minute-level videos. It uses a divide-and-conquer strategy to address issues of computational overhead and global temporal consistency. The framework includes two core modules: Velocity Blend, which aligns flow fields to eliminate motion inconsistencies at segment boundaries, and Attention Sink, which anchors local features to global frames to suppress structural drift. Experimental results show that MLV-Edit outperforms existing methods in terms of temporal stability and semantic fidelity.
MLV-Edit 是一个无需训练、基于流的框架,旨在解决分钟级视频编辑的挑战。它采用分而治之的策略对视频进行分段编辑,其中两个关键模块是:Velocity Blend 通过对齐相邻片段的流场来消除片段边界处的运动不一致和伪影,而 Attention Sink 将局部特征锚定到全局参考帧以抑制结构漂移。实验表明,MLV-Edit 在时间稳定性和语义保真度方面优于现有方法。
Agentic AI-based Coverage Closure for Formal Verification
Authors: Sivaram Pothireddypalli, Ashish Raman, Deepak Narayan Gadde, Aman Kumar
First: 2026-03-03T16:35:03+00:00 · Latest: 2026-03-03T16:35:03+00:00
Comments: Published at IEEE International Conference on Intelligent Processing, Hardware, Electronics, and Radio Systems 2026
Abstract
Coverage closure is a critical requirement in the Integrated Chip (IC) development process and a key metric for verification sign-off. However, traditional exhaustive approaches often fail to achieve full coverage within project timelines. This study presents an agentic AI-driven workflow that utilizes Large Language Model (LLM)-enabled Generative AI (GenAI) to automate coverage analysis for formal verification, identify coverage gaps, and generate the required formal properties. The framework accelerates verification efficiency by systematically addressing coverage holes. Benchmarking open-source and internal designs reveals a measurable increase in coverage metrics, with improvements correlated to the complexity of the design. Comparative analysis validates the effectiveness of this approach. These results highlight the potential of agentic AI-based techniques to improve formal verification productivity and support comprehensive coverage closure.
中文标题/摘要
标题:面向形式验证的基于代理式AI的覆盖率闭合
覆盖率闭合是集成电路(IC)开发过程中的关键要求,也是验证签核的关键指标。然而,传统的穷尽方法往往无法在项目时间内实现全面的覆盖率。本研究提出了一种代理式AI驱动的工作流,利用大型语言模型(LLM)赋能的生成式AI(GenAI)自动化形式验证的覆盖率分析,识别覆盖率缺口,并生成所需的形式属性。该框架通过系统地解决覆盖率缺口来提升验证效率。对开源设计和内部设计的基准测试表明,覆盖率指标有可测量的提升,且提升幅度与设计复杂度相关。对比分析验证了该方法的有效性。这些结果突显了代理式AI技术在提高形式验证生产力和实现全面覆盖率闭合方面的潜力。
Summary / 总结
This study addresses the challenge of achieving full coverage in IC verification by introducing an agentic AI-driven workflow. The method leverages LLM-enabled Generative AI to automate coverage analysis, identify gaps, and generate formal properties. Benchmarks on open-source and internal designs show a measurable increase in coverage metrics, with improvements correlated to design complexity, supporting the approach's effectiveness in improving verification efficiency and coverage closure.
该研究通过提出一种基于代理式AI的工作流来解决IC验证中实现全面覆盖的挑战。方法利用LLM驱动的生成式AI来自动化覆盖率分析、识别缺口并生成形式属性。对开源设计和内部设计的基准测试表明,覆盖率指标有可测量的提升,且提升幅度与设计复杂度相关,验证了该方法在提高形式验证生产力和实现覆盖率闭合方面的有效性。