arXiv Paper Digest / arXiv 论文速递

Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

Authors: Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh

First: 2025-11-04T18:59:09+00:00 · Latest: 2025-11-04T18:59:09+00:00

Comments: 16 pages, 7 figures, 14 tables. Under Review

Abstract

Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
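
The delegation pattern described in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the agent functions, the `AGENTS` registry, and `master_agent` are hypothetical stand-ins, not the paper's implementation, and the stub agents return canned strings where a real system would call a foundation model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for modality-specific foundation models.
def text_agent(payload: str, question: str) -> str:
    return f"text says: {payload}"

def image_agent(payload: str, question: str) -> str:
    return f"image shows: {payload}"

def audio_agent(payload: str, question: str) -> str:
    return f"audio contains: {payload}"

# Registry mapping a modality name to the agent that handles it.
AGENTS: Dict[str, Callable[[str, str], str]] = {
    "text": text_agent,
    "image": image_agent,
    "audio": audio_agent,
}

def master_agent(question: str, inputs: Dict[str, str]) -> str:
    """Delegate one subtask per supplied modality, then merge the answers."""
    findings: List[str] = []
    for modality, payload in inputs.items():
        agent = AGENTS.get(modality)
        if agent is None:
            findings.append(f"[no agent registered for {modality}]")
            continue
        findings.append(agent(payload, question))
    # Integration step: a real system would prompt a reasoning model here;
    # this sketch only concatenates the per-modality findings.
    return f"Q: {question} | " + "; ".join(findings)

answer = master_agent("What is happening?", {"image": "a dog", "audio": "barking"})
print(answer)
```

Adding a modality in this sketch means registering one more entry in `AGENTS`, which mirrors the modularity the abstract claims without any retraining step.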

中文标题/摘要

标题：Agent-Omni：通过模型协调进行测试时多模态推理以理解一切

多模态大型语言模型（MLLMs）展示了强大的能力，但仍然局限于固定的模态对，并且需要使用大量对齐的数据集进行昂贵的微调。构建能够整合文本、图像、音频和视频的全功能模型仍然不切实际，缺乏稳健的推理支持。在本文中，我们提出了一种Agent-Omni框架，通过主代理系统协调现有的基础模型，从而在无需重新训练的情况下实现灵活的多模态推理。主代理解释用户意图，将子任务委派给特定模态的代理，并将它们的输出整合为连贯的响应。在文本、图像、音频、视频和全功能基准上的广泛实验表明，Agent-Omni在各种任务中始终能够达到最先进的性能，特别是在需要复杂跨模态推理的任务中。其基于代理的设计使专业基础模型的无缝集成成为可能，确保对各种输入的适应性，同时保持透明性和可解释性。此外，该框架是模块化的，易于扩展，允许随着更强的模型变得可用而进行未来的改进。

Summary / 总结

Agent-Omni is designed to enable flexible multimodal reasoning by coordinating existing foundation models through a master-agent system, which interprets user intent and delegates tasks to modality-specific agents. The framework consistently achieves state-of-the-art performance across various benchmarks, especially in tasks requiring complex cross-modal reasoning. It supports seamless integration of specialized models, ensuring adaptability and transparency.

Agent-Omni 是一个框架，通过协调现有的基础模型来实现无需重新训练的灵活多模态推理。它使用主代理系统来解释用户意图，分配任务给特定模态的代理，并整合它们的输出。广泛的实验表明，Agent-Omni 在复杂的跨模态推理任务上达到了最先进的性能，其基于代理的设计确保了适应性和透明性。该框架是模块化的，易于扩展，允许在未来使用更强的模型进行改进。

TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System

Authors: Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, C. Karen Liu

First: 2025-11-04T18:58:35+00:00 · Latest: 2025-11-04T18:58:35+00:00

Comments: Website: https://yanjieze.com/TWIST2

Abstract

Large-scale data has driven breakthroughs in robotics, from language models to vision-language-action models in bimanual manipulation. However, humanoid robotics lacks equally effective data collection frameworks. Existing humanoid teleoperation systems either use decoupled control or depend on expensive motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid teleoperation and data collection system that preserves full whole-body control while advancing scalability. Our system leverages PICO4U VR for obtaining real-time whole-body human motions, with a custom 2-DoF robot neck (cost around $250) for egocentric vision, enabling holistic human-to-humanoid control. We demonstrate long-horizon dexterous and mobile humanoid skills and we can collect 100 demonstrations in 15 minutes with an almost 100% success rate. Building on this pipeline, we propose a hierarchical visuomotor policy framework that autonomously controls the full humanoid body based on egocentric vision. Our visuomotor policy successfully demonstrates whole-body dexterous manipulation and dynamic kicking tasks. The entire system is fully reproducible and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also open-sourced at https://twist-data.github.io .

中文标题/摘要

标题：TWIST2：可扩展、便携且全面的人形数据采集系统

大规模数据推动了机器人领域的突破，从语言模型到双臂操作的视觉-语言-动作模型。然而，人形机器人缺乏同等有效的数据采集框架。现有的人形远程操作系统要么采用解耦控制，要么依赖昂贵的运动捕捉设备。我们引入了TWIST2，这是一种便携且无需运动捕捉的人形远程操作和数据采集系统，同时保持了全身控制的完整性并提升了可扩展性。我们的系统利用PICO4U VR获取实时全身人类动作，使用一个自定义的2-DoF机器人颈部（成本约250美元）实现第一人称视觉，从而实现全面的人类到人形的控制。我们展示了长期的灵巧和移动人形技能，并能在15分钟内收集100个演示，成功率接近100%。在此基础上，我们提出了一种基于第一人称视觉的分层视觉-运动策略框架，能够自主控制整个机器人身体。我们的视觉-运动策略成功展示了全身灵巧操作和动态踢球任务。整个系统完全可复现并开源在https://yanjieze.com/TWIST2 。我们收集的数据集也已开源在https://twist-data.github.io 。

Summary / 总结

TWIST2 is a portable, mocap-free humanoid teleoperation and data collection system that addresses the lack of effective data collection frameworks in humanoid robotics. It uses PICO4U VR to obtain real-time whole-body human motion and a custom 2-DoF robot neck for egocentric vision, enabling whole-body control. TWIST2 can collect 100 demonstrations in 15 minutes with an almost 100% success rate and demonstrates long-horizon dexterous and mobile skills. The system is open-sourced and includes a hierarchical visuomotor policy for autonomous control based on egocentric vision, which performs whole-body dexterous manipulation and dynamic kicking tasks.

TWIST2 是一个便携式的人形机器人远程操作和数据采集系统，旨在解决人形机器人领域缺乏有效数据采集框架的问题。该系统利用 PICO4U VR 进行实时全身人体动作捕捉，并配备一个自定义的 2 自由度机器人颈部以实现第一人称视角，从而实现全身控制。TWIST2 可以在 15 分钟内收集 100 个演示，成功率接近 100%，并展示长时间的灵巧和移动技能。该系统是开源的，并包含基于第一人称视角的分层视觉-运动策略，成功执行了全身灵巧操作和动态踢球任务。

GeoCrossBench: Cross-Band Generalization for Remote Sensing

Authors: Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, Hrant Khachatrian

First: 2025-11-04T18:58:20+00:00 · Latest: 2025-11-04T18:58:20+00:00

Abstract

The number and diversity of remote sensing satellites grow over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer a 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can get relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.
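
The three evaluation settings in this protocol differ only in how the test satellite's bands relate to the training bands, which can be expressed as a set comparison. The sketch below is a hypothetical helper, not code from the benchmark; the function name and setting labels are assumptions, and the band identifiers (Sentinel-2 style `B02`/`B03`/`B04`/`B08`, Sentinel-1 style `VV`/`VH`) are used only as illustrative inputs.

```python
from typing import Iterable

def evaluation_setting(train_bands: Iterable[str], test_bands: Iterable[str]) -> str:
    """Classify a train/test band pairing into one of the protocol's settings."""
    train, test = frozenset(train_bands), frozenset(test_bands)
    if test == train:
        return "in-distribution"
    if train.isdisjoint(test):
        return "no-band-overlap"
    if train < test:  # test has every training band plus extra ones
        return "additional-bands"
    return "other"  # partial overlap, not one of the three named settings

RGB = ["B02", "B03", "B04"]
print(evaluation_setting(RGB, RGB))
print(evaluation_setting(RGB, ["VV", "VH"]))
print(evaluation_setting(RGB, RGB + ["B08"]))
```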

中文标题/摘要

标题：GeoCrossBench：遥感跨频段泛化

遥感卫星的数量和多样性不断增加，而绝大多数标注数据来自较旧的卫星。随着地球观测基础模型的规模扩大，支持新卫星的（重新）训练成本也随之增加，因此模型向新卫星的泛化能力变得越来越重要。在本文中，我们介绍了GeoCrossBench，这是广泛使用的GeoBench基准的一个扩展，具有新的评估协议：它测试分布内性能、对没有波段重叠的卫星的泛化，以及对相对于训练集具有额外波段的卫星的泛化。我们还开发了ChannelViT的自监督扩展ChiViT，以提高其跨卫星性能。首先，我们展示了即使是最佳的遥感基础模型（DOFA，TerraFM）在分布内设置中也无法超越DINOv3等通用模型。其次，当泛化到没有波段重叠的新卫星时，所有模型的性能下降2-4倍，而ChiViT显著优于亚军DINOv3。第三，在测试时给出额外波段时，所有测试模型的性能平均下降5-25%。最后，我们展示了仅使用来自所有波段的真值（oracle）标签微调这些模型的最后一层线性层，即可在所有卫星上获得相对一致的性能，突显了该基准远未饱和。我们公开发布了代码和数据集，以鼓励开发具有更强跨卫星泛化能力、更具前瞻性的遥感模型。

Summary / 总结

GeoCrossBench extends GeoBench with a protocol that evaluates remote sensing models in three settings: in-distribution performance, generalization to satellites with no band overlap, and generalization to satellites with additional bands. The study shows that even the best remote sensing foundation models (DOFA, TerraFM) do not outperform general-purpose models such as DINOv3 in the in-distribution setting. When generalizing to new satellites with no band overlap, all models suffer a 2-4x performance drop, and ChiViT, a self-supervised extension of ChannelViT, significantly outperforms the runner-up DINOv3. Performance of all tested models drops by 5-25% on average when additional bands are supplied at test time. Fine-tuning only the last linear layer with oracle labels from all bands yields relatively consistent performance across satellites, indicating that the benchmark is far from saturated.

GeoCrossBench 通过测试分布内性能、对无波段重叠的新卫星的泛化以及对具有额外波段的新卫星的泛化，评估了遥感模型的跨波段泛化能力。研究引入了ChiViT，这是一种ChannelViT的自监督扩展，当泛化到无波段重叠的新卫星时，其性能优于DINOv3。研究显示，即使是先进的遥感模型DOFA和TerraFM，在分布内设置中也不如通用模型DINOv3表现好，而当测试时引入额外波段时，所有测试模型的性能平均下降5-25%。使用来自所有波段的真值（oracle）标签微调这些模型的最后一层线性层，可以在不同卫星上获得相对一致的性能，表明该基准远未饱和。

Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks

Authors: Dmitrii Pozdeev, Alexey Artemov, Ananta R. Bhattarai, Artem Sevastopolsky

First: 2025-11-04T18:58:03+00:00 · Latest: 2025-11-04T18:58:03+00:00

Comments: Project page: https://diddone.github.io/densemarks/ . Video: https://youtu.be/o8DOOYFW0gI . 21 pages, 13 figures, 2 tables

Abstract

We propose DenseMarks - a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. In order to train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking heads videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmarks and segmentation constraints, as well as imposing spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck makes sure the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
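
The abstract states only that a contrastive loss encourages matched points to have close embeddings; it does not give the exact form. As a sketch under that caveat, the block below uses an InfoNCE-style loss, which is a swapped-in assumption, on toy 3D embeddings. The function names, the dot-product similarity, and the temperature value are all hypothetical.

```python
import math
from typing import List, Sequence

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def info_nce(emb_a: List[Sequence[float]], emb_b: List[Sequence[float]],
             temperature: float = 0.1) -> float:
    """emb_a[i] and emb_b[i] embed a matched point pair; other rows of emb_b
    act as negatives for emb_a[i]. Returns the mean loss over all pairs."""
    total = 0.0
    for i, a in enumerate(emb_a):
        logits = [dot(a, b) / temperature for b in emb_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(emb_a)

points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
aligned_loss = info_nce(points, points)          # matches share embeddings
shuffled_loss = info_nce(points, points[::-1])   # matches point at the wrong row
print(aligned_loss < shuffled_loss)
```

The loss is small when each matched pair is the most similar pair in its row and large otherwise, which is the behaviour the abstract describes.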

中文标题/摘要

标题：DenseMarks：通过点轨迹学习人体头部的标准嵌入表示

我们提出了DenseMarks——一种新的学习表示，用于人体头部，能够为人体头部图像提供高质量的密集对应关系。对于一个人体头部的2D图像，视觉变换器网络预测每个像素的3D嵌入，该嵌入对应3D标准单位立方体中的一个位置。为了训练我们的网络，我们收集了一个由最先进的点跟踪器在多种多样的人体头部视频中估计的成对点匹配数据集，并通过对比损失引导映射，鼓励匹配点具有相近的嵌入。我们还使用了面部特征点和分割约束的多任务学习，并通过潜在立方体特征施加嵌入的空间连续性，从而产生一个可解释且可查询的标准空间。该表示可用于查找共同的语义部分、面部/头部跟踪和立体重建。由于强监督，我们的方法对姿态变化具有鲁棒性，并覆盖整个头部，包括头发。此外，标准空间瓶颈确保了获得的表示在不同姿态和个体之间的一致性。我们在几何感知点匹配和单目头部跟踪方面展示了最先进的结果，使用3D可变形模型。代码和模型检查点将对公众开放。

Summary / 总结

DenseMarks learns a dense embedding for human head images using a Vision Transformer network, predicting a 3D embedding for each pixel. The network is trained with a contrastive loss and multi-task learning, resulting in a robust and interpretable canonical space that can handle pose variations and cover the entire head, including hair. The method achieves state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models.

DenseMarks 是一种用于人类头部的学习表示，通过 Vision Transformer 网络为 2D 人类头部图像中的每个像素预测 3D 嵌入。该方法在来自多种真实场景（in-the-wild）说话人头部视频的点匹配数据集上进行训练，并使用对比损失来鼓励匹配点具有相近的嵌入。它还采用多任务学习和空间连续性约束来创建一个可解释的规范空间。该表示对姿态变化具有鲁棒性，并在几何感知点匹配和使用 3D 可变形模型的单目头部跟踪任务上取得了最先进的结果。

Imagine Beyond! Distributionally Robust Auto-Encoding for State Space Coverage in Online Reinforcement Learning

Authors: Nicolas Castanet, Olivier Sigaud, Sylvain Lamprier

First: 2025-05-23T12:43:55+00:00 · Latest: 2025-11-04T18:56:38+00:00

Abstract

Goal-Conditioned Reinforcement Learning (GCRL) enables agents to autonomously acquire diverse behaviors, but faces major challenges in visual environments due to high-dimensional, semantically sparse observations. In the online setting, where agents learn representations while exploring, the latent space evolves with the agent's policy, to capture newly discovered areas of the environment. However, without an incentive to maximize state coverage in the representation, classical approaches based on auto-encoders may converge to latent spaces that over-represent a restricted set of states frequently visited by the agent. This is exacerbated in an intrinsic motivation setting, where the agent uses the distribution encoded in the latent space to sample the goals it learns to master. To address this issue, we propose to progressively enforce distributional shifts towards a uniform distribution over the full state space, to ensure full coverage of skills that can be learned in the environment. We introduce DRAG (Distributionally Robust Auto-Encoding for GCRL), a method that combines the $\beta$-VAE framework with Distributionally Robust Optimization. DRAG leverages an adversarial neural weighter of training states of the VAE, to account for the mismatch between the current data distribution and unseen parts of the environment. This allows the agent to construct semantically meaningful latent spaces beyond its immediate experience. Our approach improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control involving walls to bypass, without pre-training or prior environment knowledge.
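
To make the reweighting idea concrete, here is a minimal sketch of distributionally robust weighting of per-state training losses. It is not DRAG itself: the paper uses a learned adversarial neural weighter, whereas this sketch substitutes the closed-form weights of a KL-regularized DRO objective (weights proportional to the exponential of the loss divided by a temperature). The function names, loss values, and temperature are hypothetical.

```python
import math
from typing import List

def dro_weights(losses: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over losses: poorly reconstructed (typically rare) states get
    more weight. A lower temperature makes the adversary more aggressive."""
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_loss(losses: List[float], weights: List[float]) -> float:
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-state VAE losses; the last state stands for a rarely
# visited region that the auto-encoder represents poorly.
losses = [0.1, 0.2, 2.0]
weights = dro_weights(losses, temperature=0.5)
robust = weighted_loss(losses, weights)
uniform = sum(losses) / len(losses)
print(weights, robust, uniform)
```

Training against the reweighted objective pushes the encoder to spend capacity on under-represented states, which is the effect the abstract attributes to the adversarial weighter.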

中文标题/摘要

标题：超越想象！分布鲁棒自编码在在线强化学习状态空间覆盖中的应用

目标条件强化学习（GCRL）使智能体能够自主获取多样化的行为，但在视觉环境中面临巨大挑战，因为高维且语义稀疏的观察使得问题复杂化。在在线学习场景中，智能体在探索过程中学习表示，其潜在空间会随着智能体策略的变化而演变，以捕捉新发现的环境区域。然而，如果没有激励最大化表示中的状态覆盖，基于自编码的经典方法可能会收敛到过度代表智能体频繁访问的有限状态集的潜在空间。在内在动机设置中，智能体使用潜在空间中编码的分布来采样其学习掌握的目标，这一问题被进一步放大。为了解决这个问题，我们提出逐步强制潜在空间向整个状态空间的均匀分布转移，以确保能够覆盖在环境中可以学习的所有技能。我们引入了DRAG（分布鲁棒自编码用于GCRL），该方法结合了β-VAE框架与分布鲁棒优化。DRAG利用对抗神经权重器来调整VAE的训练状态分布，以弥补当前数据分布与未见环境部分之间的差异。这使智能体能够构建超越其即时经验的语义有意义的潜在空间。我们的方法在迷宫和涉及绕过墙壁的机器人控制等具有挑战性的探索环境中提高了状态空间覆盖和下游控制性能，无需预训练或先验环境知识。

Summary / 总结

The paper addresses state space coverage in goal-conditioned reinforcement learning (GCRL) for visual environments, where auto-encoder-based representations may converge to latent spaces that over-represent the states the agent visits most often. To tackle this, the authors propose DRAG, which combines the $\beta$-VAE framework with Distributionally Robust Optimization: an adversarial neural weighter reweights the VAE's training states so that the training distribution shifts progressively toward a uniform distribution over the state space. The method improves state space coverage and downstream control performance on hard exploration environments such as mazes and robotic control tasks with walls to bypass.

论文针对视觉环境中的目标条件强化学习（GCRL）中状态空间覆盖的问题，经典自编码方法可能会收敛到过度表示的潜在空间。为此，作者提出了DRAG方法，通过分布鲁棒优化来强制分布向整个状态空间的均匀分布转变，确保全面覆盖。实验结果显示，在迷宫和涉及绕过障碍物的机器人控制任务中，该方法能够提高状态空间覆盖和下游控制性能，无需预训练或先验环境知识。

PLUTO-4: Frontier Pathology Foundation Models

Authors: Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed

First: 2025-11-04T18:54:58+00:00 · Latest: 2025-11-04T18:54:58+00:00

Abstract

Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.

中文标题/摘要

标题：PLUTO-4：前沿病理学基础模型

基于大规模病理图像数据集训练的基础模型在多种组织病理学任务中展现了强大的迁移能力。在此基础上，我们引入了PLUTO-4，这是病理学基础模型的下一代产品，将Pathology-Universal Transformer (PLUTO) 扩展到了前沿规模。我们分享了PLUTO-4家族中的两种互补的视觉变换器架构：一种是紧凑且高效的PLUTO-4S模型，通过FlexiViT设置和2D-RoPE嵌入优化，适用于多尺度部署；另一种是前沿规模的PLUTO-4G模型，通过单一的补丁大小训练，以最大化表示能力和稳定性。两种模型均使用从DINOv2中派生的自监督目标在包含来自137,144名患者、551,164张WSI、跨越50多家机构、涵盖60多种疾病类型和100多种染色的大规模多机构数据集上进行预训练。在公共和内部基准测试中的全面评估表明，PLUTO-4在需要不同空间和生物上下文的任务中达到了最先进的性能，包括补丁级分类、分割和切片级诊断。紧凑的PLUTO-4S提供了高通量和稳健的性能，适用于实际部署，而PLUTO-4G则在多个病理学基准测试中建立了新的性能前沿，包括皮肤病理学诊断性能提高了11%。这些多样化的改进突显了PLUTO-4作为转化研究和诊断用途的骨干模型的潜力。

Summary / 总结

PLUTO-4 is the next generation of pathology foundation models, extending the Pathology-Universal Transformer (PLUTO) to frontier scale. It includes two models: PLUTO-4S, a compact and efficient version for multi-scale deployment, and PLUTO-4G, a frontier-scale model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large dataset of 551,164 WSIs from 137,144 patients. PLUTO-4 achieves state-of-the-art performance across various tasks, with PLUTO-4S providing high-throughput and robust performance, and PLUTO-4G setting new performance frontiers, including an 11% improvement in dermatopathology diagnosis.

PLUTO-4 是一种扩展了病理通用变换器的下一代病理基础模型，包括紧凑高效的 PLUTO-4S 和大型的 PLUTO-4G。两者均在大规模多机构数据集上进行自监督预训练。PLUTO-4 在多种任务中表现出色，特别是 PLUTO-4G 在皮肤病理诊断方面设立了新的基准。

Optimizing AI Agent Attacks With Synthetic Data

Authors: Chloe Loughridge, Paul Colognese, Avery Griffin, Tyler Tracy, Jon Kutasov, Joe Benton

First: 2025-11-04T18:48:56+00:00 · Latest: 2025-11-04T18:48:56+00:00

Abstract

As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.

中文标题/摘要

标题：使用合成数据优化AI代理攻击

随着AI部署变得更加复杂和高风险，准确评估其风险变得越来越重要。AI控制是一种进行此类评估的框架。然而，良好的控制评估需要引出强大的攻击策略。在计算资源有限且环境复杂的情况下，这可能具有挑战性。在本研究中，我们展示了如何在SHADE-Arena数据集中优化攻击策略，该数据集包含多种现实的控制环境。我们通过将攻击能力分解为五个组成部分——怀疑建模、攻击选择、计划合成、执行和微妙性——并分别优化每个组件来实现这一点。为了解决数据有限的限制，我们开发了一个攻击动力学的概率模型，使用此模拟优化我们的攻击超参数，然后证明结果可以转移到SHADE-Arena。这导致攻击强度有了显著提高，使用我们的支架将安全分数从基线的0.87降低到0.41。

Summary / 总结

This research aims to enhance the evaluation of AI risk by optimizing attack policies in complex environments. The authors decompose attack capability into five skills and optimize each individually using a probabilistic model of attack dynamics. This approach, despite limited data, significantly improves attack strength, reducing the safety score from 0.87 to 0.41 in SHADE-Arena.

该研究旨在通过优化复杂环境中的攻击策略来增强对AI风险的评估。作者将攻击能力分解为五个技能，并分别优化这些技能。尽管数据有限，但通过使用攻击动力学的概率模型进行优化，显著提高了攻击强度，将安全评分从0.87降低到0.41。

Orion-MSP: Multi-Scale Sparse Attention for Tabular In-Context Learning

Authors: Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu

First: 2025-11-04T18:43:44+00:00 · Latest: 2025-11-04T18:43:44+00:00

Abstract

Tabular data remain the predominant format for real-world applications. Yet, developing effective neural models for tabular data remains challenging due to heterogeneous feature types and complex interactions occurring at multiple scales. Recent advances in tabular in-context learning (ICL), such as TabPFN and TabICL, have achieved state-of-the-art performance comparable to gradient-boosted trees (GBTs) without task-specific fine-tuning. However, current architectures exhibit key limitations: (1) single-scale feature processing that overlooks hierarchical dependencies, (2) dense attention with quadratic scaling in table width, and (3) strictly sequential component processing that prevents iterative representation refinement and cross-component communication. To address these challenges, we introduce Orion-MSP, a tabular ICL architecture featuring three key innovations: (1) multi-scale processing to capture hierarchical feature interactions; (2) block-sparse attention combining windowed, global, and random patterns for scalable efficiency and long-range connectivity; and (3) a Perceiver-style memory enabling safe bidirectional information flow across components. Across diverse benchmarks, Orion-MSP matches or surpasses state-of-the-art performance while scaling effectively to high-dimensional tables, establishing a new standard for efficient tabular in-context learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-MSP .
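
The combination of windowed, global, and random attention patterns can be sketched as a boolean mask over blocks. This is an illustrative construction only, not the Orion-MSP implementation: the function name, its parameters, and the choice of block 0 as the global block are assumptions.

```python
import random
from typing import List, Sequence

def block_sparse_mask(n_blocks: int, window: int = 1,
                      global_blocks: Sequence[int] = (0,),
                      n_random: int = 1, seed: int = 0) -> List[List[bool]]:
    """mask[i][j] is True when query block i may attend to key block j."""
    rng = random.Random(seed)  # seeded so the random pattern is reproducible
    mask = [[False] * n_blocks for _ in range(n_blocks)]
    for i in range(n_blocks):
        for j in range(n_blocks):
            if abs(i - j) <= window:                       # windowed (local)
                mask[i][j] = True
            if i in global_blocks or j in global_blocks:   # global
                mask[i][j] = True
        for j in rng.sample(range(n_blocks), n_random):    # random long-range
            mask[i][j] = True
    return mask

n = 16
mask = block_sparse_mask(n)
allowed = sum(cell for row in mask for cell in row)
print(f"{allowed} of {n * n} block pairs are attended")
```

With a fixed window and a fixed number of global and random blocks, the number of attended pairs grows roughly linearly in the number of blocks rather than quadratically, which is the scaling motivation given in the abstract.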

中文标题/摘要

标题：Orion-MSP：多尺度稀疏注意机制的表格上下文学习

表格数据仍然是现实世界应用的主要格式。然而，由于特征类型异构和在多个尺度上发生的复杂交互，开发有效的神经模型来处理表格数据仍然具有挑战性。最近在表格上下文学习（ICL）方面的进展，如TabPFN和TabICL，已经实现了与梯度提升树（GBTs）相当的性能，且无需特定任务的微调。然而，当前的架构存在几个关键限制：（1）单尺度特征处理，忽略了层次依赖性；（2）密集注意机制，其在表格宽度上的计算复杂度为平方级；（3）严格顺序组件处理，阻止了迭代表示精炼和跨组件通信。为了解决这些挑战，我们引入了Orion-MSP，这是一种表格ICL架构，具有三个关键创新：（1）多尺度处理以捕获层次特征交互；（2）块稀疏注意机制结合了窗口、全局和随机模式，以实现可扩展性和长距离连接；（3）一种类似于Perceiver的记忆机制，使组件间的信息双向流动更加安全。在多种基准测试中，Orion-MSP 的性能与最先进的方法相当或更优，同时能够有效扩展到高维表格，确立了高效表格上下文学习的新标准。该模型已在https://github.com/Lexsi-Labs/Orion-MSP 公开可用。

Summary / 总结

Orion-MSP is designed to improve tabular in-context learning by addressing limitations of existing models such as single-scale feature processing, dense attention, and sequential component processing. It introduces multi-scale processing, block-sparse attention, and a Perceiver-style memory to capture hierarchical dependencies, enable scalable long-range connectivity, and allow bidirectional information flow, respectively. Experimental results show that Orion-MSP matches or surpasses state-of-the-art performance across various benchmarks, especially in handling high-dimensional tables.

Orion-MSP 通过引入多尺度稀疏注意力机制来处理表格数据，能够捕捉层次特征交互并利用块稀疏注意力实现高效的长距离连接。模型还包含一种类似感知器的记忆机制，以实现组件间的双向信息流。实验结果显示，Orion-MSP 在各种基准测试中达到了或超越了最先进的性能，特别是在处理高维表格方面表现出色。

Hybrid Quantum-Classical Recurrent Neural Networks

Authors: Wenduan Xu

First: 2025-10-29T14:21:49+00:00 · Latest: 2025-11-04T18:43:14+00:00

Comments: Clarified expectation-value-based readouts and made minor text edits

Abstract

We present a hybrid quantum-classical recurrent neural network (QRNN) architecture in which the recurrent core is realized as a parametrized quantum circuit (PQC) controlled by a classical feedforward network. The hidden state is the quantum state of an $n$-qubit PQC in an exponentially large Hilbert space $\mathbb{C}^{2^n}$, which serves as a coherent recurrent quantum memory. The PQC is unitary by construction, making the hidden-state evolution norm-preserving without external constraints. At each timestep, mid-circuit Pauli expectation-value readouts are combined with the input embedding and processed by the feedforward network, which provides explicit classical nonlinearity. The outputs parametrize the PQC, which updates the hidden state via unitary dynamics. The QRNN is compact and physically consistent, and it unifies (i) unitary recurrence as a high-capacity memory, (ii) partial observation via mid-circuit readouts, and (iii) nonlinear classical control for input-conditioned parametrization. We evaluate the model in simulation with up to 14 qubits on sentiment analysis, MNIST, permuted MNIST, copying memory, and language modeling. For sequence-to-sequence learning, we further devise a soft attention mechanism over the mid-circuit readouts and show its effectiveness for machine translation. To our knowledge, this is the first model (RNN or otherwise) grounded in quantum operations to achieve competitive performance against strong classical baselines across a broad class of sequence-learning tasks.
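
The recurrence loop described above (read out an expectation value, combine it with the input in a classical controller, use the result to parametrize a unitary update) can be mimicked with a one-qubit toy simulation. This is a sketch under strong simplifying assumptions and is not the paper's circuit: it uses a single qubit, only RY rotations (so real amplitudes suffice), and a hypothetical one-line `controller` with made-up weights in place of the feedforward network.

```python
import math
from typing import List, Tuple

State = Tuple[float, float]  # real amplitudes of |0> and |1>

def ry(state: State, theta: float) -> State:
    """Apply the single-qubit rotation RY(theta), which is unitary."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    a, b = state
    return (c * a - s * b, s * a + c * b)

def expect_z(state: State) -> float:
    """Pauli-Z expectation value <Z> = |a|^2 - |b|^2."""
    a, b = state
    return a * a - b * b

def controller(x: float, z: float, w_x: float = 1.0, w_z: float = 0.5) -> float:
    """Hypothetical classical nonlinearity mapping (input, readout) to an angle."""
    return math.pi * math.tanh(w_x * x + w_z * z)

def run(inputs: List[float]) -> Tuple[State, List[float]]:
    state: State = (1.0, 0.0)  # start in |0>
    readouts = []
    for x in inputs:
        z = expect_z(state)                      # readout, simulated without collapse
        state = ry(state, controller(x, z))      # unitary hidden-state update
        readouts.append(expect_z(state))
    return state, readouts

final_state, readouts = run([0.3, -1.2, 0.8])
norm = final_state[0] ** 2 + final_state[1] ** 2
print(norm, readouts)
```

Because every update is a rotation, the state norm stays at 1 up to floating-point error without any explicit normalization, which is the norm-preserving property the abstract highlights.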

中文标题/摘要

标题：混合量子-经典循环神经网络

我们提出了一种混合量子-经典的循环神经网络（QRNN）架构，其中循环核心由参数化量子电路（PQC）实现，由经典前馈网络控制。隐藏状态是指数级大的希尔伯特空间$\mathbb{C}^{2^n}$中$n$量子比特PQC的量子态，作为相干的循环量子记忆。PQC通过构造是幺正的，使得隐藏状态的演化保持范数不变，无需外部约束。在每个时间步，电路中途的泡利期望值读出与输入嵌入结合，并由前馈网络处理，提供明确的经典非线性。输出参数化PQC，通过幺正动力学更新隐藏状态。QRNN紧凑且物理上一致，统一了(i)幺正循环作为高容量记忆，(ii)通过中途读出进行部分观测，以及(iii)用于输入条件参数化的非线性经典控制。我们在模拟中使用至多14个量子比特对情感分析、MNIST、置换MNIST、复制记忆和语言建模进行评估。对于序列到序列学习，我们进一步设计了一种基于中途读出的软注意力机制，并展示了其在机器翻译中的有效性。据我们所知，这是第一个基于量子操作的模型（无论是RNN还是其他模型），在广泛的序列学习任务中取得了可与强大的经典基线相竞争的性能。

Summary / 总结

This paper introduces a hybrid quantum-classical recurrent neural network (QRNN) that combines a parametrized quantum circuit (PQC) with a classical feedforward network. The hidden state is an $n$-qubit quantum state, providing a coherent quantum memory. The PQC is unitary, ensuring norm-preserving evolution. The model is evaluated on various tasks including sentiment analysis, MNIST, and language modeling, showing competitive performance against classical models. A soft attention mechanism is also proposed for sequence-to-sequence learning tasks.

论文提出了一种混合量子-经典循环神经网络（QRNN），其中循环核心是一个参数化量子电路（PQC），由经典前馈网络控制。隐藏状态是一个$n$量子比特的量子态，提供了一种相干的量子记忆。PQC是幺正的，确保了状态演化保持范数不变。该模型在情感分析、MNIST和语言建模等任务上进行了评估，展示了与经典基线相当的性能。论文还为序列到序列学习任务提出了一种软注意力机制。

Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Authors: Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley

First: 2025-11-04T18:42:12+00:00 · Latest: 2025-11-04T18:42:12+00:00

Comments: Preprint

Abstract

As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.
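
The task shape described here, per-chunk analysis followed by aggregation into a distributional answer, can be shown with a short sketch. The example is hypothetical and is not taken from the benchmark: `classify` is a keyword rule standing in for a model's per-chunk judgment, and the labels and chunks are invented.

```python
from collections import Counter
from typing import Iterable

def classify(chunk: str) -> str:
    """Hypothetical keyword rule standing in for a model's per-chunk label."""
    lowered = chunk.lower()
    if "refund" in lowered:
        return "complaint"
    if "thanks" in lowered:
        return "praise"
    return "other"

def label_distribution(chunks: Iterable[str]) -> Counter:
    """Atomic step (classify each chunk), then aggregate the labels by count."""
    return Counter(classify(c) for c in chunks)

chunks = ["I want a refund", "Thanks, great help", "Refund please", "What time is it?"]
dist = label_distribution(chunks)
most_common_label, count = dist.most_common(1)[0]
print(most_common_label, count)
```

Unlike a retrieval task, no chunk can be skipped here: dropping any one of them can change the counts, which is why such questions stress use of the whole context.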

中文标题/摘要

标题：乌龙：评估长上下文推理和聚合能力

随着模型上下文长度的不断增长，关于模型是否有效利用完整上下文长度的担忧一直存在。虽然最近已经发布了几个精心设计的长上下文评估，但这些评估往往依赖于从上下文的一个或多个部分检索信息，这使得几乎所有的上下文词都可以被视为噪声。这仅代表了一种可能使用长上下文的任务类型。我们引入了乌龙，这是一个长上下文推理任务基准，要求模型在原子级别分析文本片段，然后将这些分析聚合以回答分布性问题。乌龙分为两个任务集：乌龙-synth，一组自然主义合成任务，我们可以轻松地消除推理问题中的组件；乌龙-real，一个需要在真实世界对话数据上进行推理的下游设置。乌龙要求模型处理大量示例，进行上下文中的分类和计数，并处理时间关系和用户关系。即使是前沿模型在乌龙上也表现不佳，GPT-5、Claude-Sonnet-4 和 Gemini-2.5-Pro 在 128K 分割上都未能达到 50% 的准确率。我们发布了乌龙的数据和评估框架，以促进能够处理大量文本的模型的发展。

Summary / 总结

Oolong evaluates models' long-context reasoning and aggregation capabilities by requiring them to analyze individual text chunks and aggregate the results to answer distributional questions. It includes synthetic and real-world tasks. Even advanced models like GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro achieve less than 50% accuracy on both splits at 128K context length.

Oolong 通过要求模型分析单个文本片段并汇总结果来回答分布性问题，来评估模型的长上下文推理和聚合能力。它包括合成和真实世界的任务。即使是先进的模型如 GPT-5、Claude-Sonnet-4 和 Gemini-2.5-Pro 在 128K 上下文长度下，也仅在合成和真实世界任务中达到不到 50% 的准确率。

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Authors: Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han

First: 2025-11-04T18:27:39+00:00 · Latest: 2025-11-04T18:27:39+00:00

Comments: Project page: https://github.com/icip-cas/MemSearcher

Abstract

Typical search agents concatenate the entire interaction history into the LLM context, preserving information integrity but producing long, noisy contexts, resulting in high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines the current turn with it. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks, with relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher .
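
The phrase "propagates trajectory-level advantages across all conversations within them" can be illustrated with a small sketch. It assumes the usual group-relative normalization associated with GRPO (reward minus group mean, divided by group standard deviation); the use of the population standard deviation, the `eps` term, and all function names and numbers are assumptions rather than details from the paper.

```python
from statistics import mean, pstdev
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's reward against its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def propagate(advantages: List[float], conversations: List[int]) -> List[List[float]]:
    """Give every conversation (per-turn context) of trajectory i the
    trajectory-level advantage i."""
    return [[a] * n for a, n in zip(advantages, conversations)]

rewards = [1.0, 0.0, 1.0, 0.0]   # hypothetical outcome rewards for 4 trajectories
turns = [3, 2, 4, 1]             # hypothetical number of conversations in each
adv = group_advantages(rewards)
per_conversation = propagate(adv, turns)
print(per_conversation)
```

Because each turn runs in its own compact context, a single trajectory yields several separate conversations, and sharing the trajectory-level advantage gives all of them a training signal from one final reward.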

中文标题/摘要

标题：MemSearcher：通过端到端强化学习训练LLMs进行推理、搜索和管理记忆

典型搜索代理将整个交互历史拼接到LLM上下文中，保持信息完整性但产生长且嘈杂的上下文，导致高计算和内存成本。相比之下，仅使用当前回合可以避免这种开销，但会丢弃重要信息。这种权衡限制了搜索代理的可扩展性。为解决这一挑战，我们提出了MemSearcher，一种迭代维护紧凑记忆并将其与当前回合结合的代理工作流。每次回合，MemSearcher将用户的问题与记忆融合生成推理轨迹，执行搜索操作，并更新记忆以仅保留解决任务所需的重要信息。此设计在多回合交互中稳定上下文长度，提高效率而不牺牲准确性。为了优化此工作流，我们引入了多上下文GRPO，这是一种端到端的RL框架，联合优化MemSearcher代理的推理、搜索策略和记忆管理。具体而言，多上下文GRPO在不同上下文中采样轨迹组，并在它们的所有对话中传播轨迹级优势。MemSearcher在与Search-R1相同的数据集上训练，相对于强基线在七个公开基准上取得了显著改进：Qwen2.5-3B-Instruct上提高了11%，Qwen2.5-7B-Instruct上提高了12%的相对平均增益。值得注意的是，基于3B的MemSearcher甚至优于基于7B的基线，表明在信息完整性和效率之间取得平衡可以同时提高准确性和降低计算开销。代码和模型将在https://github.com/icip-cas/MemSearcher公开

TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu

First: 2025-11-04T18:25:17+00:00 · Latest: 2025-11-04T18:25:17+00:00

Abstract

Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at https://github.com/Lexsi-Labs/TabTune .

中文标题/摘要

标题：TabTune：统一的表格基础模型推理与微调库

表格基础模型代表了结构化数据学习中日益增长的范式，将大规模预训练的优势扩展到表格领域。然而，由于异构预处理管道、碎片化的API、不一致的微调程序以及缺乏针对部署导向度量（如校准和公平性）的标准评估，其采用仍然受到限制。我们提出了TabTune，这是一个统一的库，通过单一接口标准化了表格基础模型的完整工作流程。TabTune 提供了对七种最先进的模型的一致访问，支持多种适应策略，包括零样本推理、元学习、监督微调（SFT）和参数高效微调（PEFT）。该框架自动化了模型感知的预处理，内部管理了架构异质性，并集成了性能、校准和公平性的评估模块。TabTune 设计用于扩展性和可重复性，使用户能够一致地评估表格基础模型的适应策略。该库是开源的，可在 https://github.com/Lexsi-Labs/TabTune 获取。

GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality

Authors: Anastasiya Pechko, Piotr Borycki, Joanna Waczyńska, Daniel Barczyk, Agata Szymańska, Sławomir Tadeja, Przemysław Spurek

First: 2025-10-13T19:36:47+00:00 · Latest: 2025-11-04T18:24:59+00:00

Abstract

As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object's mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.

中文标题/摘要

标题：GS-Verse：基于网格的高斯点积方法在虚拟现实中的物理感知交互

随着沉浸式3D内容需求的增长，直观且高效的交互方法变得至关重要。当前在虚拟现实（VR）中物理操作3D内容的技术往往面临显著限制，包括依赖于工程密集型过程和简化几何表示，如四面体笼，这可能损害视觉保真度和物理准确性。在本文中，我们介绍了GS-Verse（用于虚拟环境渲染和场景编辑的高斯点积方法），这是一种新型方法，旨在通过直接将对象的网格与高斯点积（GS）表示集成来克服这些挑战。我们的方法能够更精确地表面近似，从而实现高度逼真的变形和交互。通过利用现有的3D网格资产，GS-Verse促进了内容的无缝重用并简化了开发流程。此外，我们的系统设计为与物理引擎无关，为开发者提供了强大的部署灵活性。这种多功能架构提供了一种高度逼真、灵活且直观的交互3D操作方法。我们通过一项涉及18名参与者的对比用户研究，严格验证了我们的方法与当前最先进的技术相比，该技术将VR与GS耦合在一起。具体而言，我们证明了我们的方法在物理感知拉伸操作方面统计上显著优于现有方法，并且在其他基于物理的操作，如扭转和晃动方面也更为一致。进一步的评估表明，我们的方法在各种交互和场景中始终表现出高且可靠的表现，显示出其作为现有方法替代方案的潜力。

Summary / 总结

GS-Verse integrates an object's mesh with a Gaussian Splatting (GS) representation for more precise and realistic 3D interaction in VR. It addresses limitations of current techniques with a physics-engine-agnostic design that aims to preserve visual fidelity and physical accuracy. In a comparative user study with 18 participants, GS-Verse was statistically significantly better than the current state-of-the-art technique that couples VR with GS for physics-aware stretching, and more consistent in other physics-based manipulations such as twisting and shaking.

GS-Verse 是一种将物体网格与高斯斑点（GS）相结合的新方法，以实现虚拟现实（VR）中精确和逼真的变形。该方法克服了现有技术的局限性，提供了更准确的表面逼近，并简化了开发流程。实验结果显示，GS-Verse 在物理感知拉伸操作中表现更优，并且在其他物理基础操作如扭曲和摇晃中更为一致，证明其在各种交互和场景中具有高可靠性能，显示出其作为现有方法替代方案的潜力。

Program Synthesis Dialog Agents for Interactive Decision-Making

Authors: Matthew Toles, Nikhil Balwani, Rattandeep Singh, Valentina Giulia Sartori Rodriguez, Zhou Yu

First: 2025-02-26T22:53:01+00:00 · Latest: 2025-11-04T18:24:03+00:00

Abstract

Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.

中文标题/摘要

标题：程序合成对话代理用于交互式决策

许多现实世界中的资格问题，从医疗诊断到税务规划，都可以映射为自然语言表达的决策问题，其中模型必须根据用户特征做出二元选择。大规模领域如法律法规或频繁更新的资助机会使得人工标注（例如，网页表单或决策树）变得不切实际，突显了需要能够自动协助决策的代理的需求。由于相关信息往往只有用户知道，因此这些代理提出正确问题至关重要。当代理决定何时终止对话时，它们面临着准确性和提问数量之间的权衡，这是用户体验和成本的关键指标。为了评估这一任务，我们提出了BeNYfits，一个新的基准，用于通过交互式决策确定用户是否符合多个重叠的社会福利机会。我们的实验表明，当前的语言模型在频繁出现幻觉方面存在困难，GPT-4o仅以35.7的F1值使用了ReAct风格的思维链。为了解决这一问题，我们引入了ProADA，这是一种新颖的方法，利用程序合成来协助决策，通过将对话规划映射为代码生成问题，并利用结构化数据中的空白来确定最佳的下一步行动。我们的代理ProADA将F1分数提高到55.6，同时几乎保持了相同的对话回合数。

Summary / 总结

The paper addresses the challenge of assisting users in making decisions through interactive dialog, particularly in domains like legal codes and funding opportunities where manual annotation is impractical. It introduces BeNYfits, a benchmark for evaluating agents in this task, and proposes ProADA, a method that uses program synthesis to generate dialog plans, improving the F1 score to 55.6 while keeping the number of dialog turns similar to previous approaches.

该论文针对在大规模领域中进行决策时，人工标注不切实际的问题。它引入了BeNYfits基准，用于评估在社会福利互动决策中的代理性能。作者提出了一种名为ProADA的方法，该方法利用程序合成生成对话计划，将F1分数显著提高到55.6，同时保持对话回合数与之前的方法相近。
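
The idea of mapping dialog planning to code and using gaps in structured data to pick the next question can be sketched as follows. This is an illustration under stated assumptions, not ProADA's implementation: the eligibility rule, the feature names, and the helpers `UserFacts`, `MissingFeature`, and `decide` are all hypothetical. The eligibility check is an ordinary function over a dictionary of user features, and a missing key is treated as the signal to ask the user about that feature.

```python
from typing import Callable, List, Tuple


class MissingFeature(Exception):
    """Raised when the eligibility program reads a feature not yet known."""

    def __init__(self, key: str) -> None:
        super().__init__(key)
        self.key = key


class UserFacts(dict):
    """Dictionary of known user features; unknown keys raise MissingFeature."""

    def __missing__(self, key: str):
        raise MissingFeature(key)


def eligible_for_childcare_benefit(user: UserFacts) -> bool:
    # Hypothetical rule for illustration only, not taken from BeNYfits.
    if user["household_income"] > 50000:
        return False
    return bool(user["has_child_under_5"])


def decide(program: Callable[[UserFacts], bool],
           ask: Callable[[str], object]) -> Tuple[bool, List[str]]:
    """Re-run the program, asking only about features it actually needs."""
    facts = UserFacts()
    questions: List[str] = []
    while True:
        try:
            return program(facts), questions
        except MissingFeature as gap:
            questions.append(gap.key)
            facts[gap.key] = ask(gap.key)


answers = {"household_income": 60000, "has_child_under_5": True}
result, asked = decide(eligible_for_childcare_benefit, answers.__getitem__)
print(result, asked)
```

In this toy run the program returns as soon as the income check fails, so the second feature is never asked about, which is the kind of saving in dialog turns the abstract describes as a key metric.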

Can LLMs subtract numbers?

Authors: Mayank Jobanputra, Nils Philipp Walter, Maitrey Mehta, Blerta Veseli, Evan Parker Kelly Chapple, Yifan Wang, Sneha Chetani, Ellie Pavlick, Antonio Vergari, Vera Demberg

First: 2025-11-04T18:20:17+00:00 · Latest: 2025-11-04T18:20:17+00:00

Comments: Work-in-progress; MathNLP non-archival presentation

Abstract

We present a systematic study of subtraction in large language models (LLMs). While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a<b$). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs' performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.

中文标题/摘要

标题：大语言模型能否进行减法运算？

我们对大型语言模型（LLM）的减法能力进行了系统研究。尽管先前的基准测试主要关注加法和乘法，但减法作为非交换运算，却受到了相对较少的关注。我们评估了四个家族中的八种预训练LLM在加法和减法问题上的表现。实验结果显示，减法的准确性远低于加法。我们发现，对于（a-b）的情况，当（a<b）时，LLM们经常能给出正确的数值但忽略了负号。探针分析表明，LLM内部会编码结果是否应为负数，但这些信息往往未反映在生成的输出中。我们还测试了诸如少样本学习和指令调优等已知技术，以观察它们能否改善LLM的性能。结果显示，少样本提示可以带来适度的提升，而指令调优模型在生成负号方面几乎达到了完美。这些发现为LLM在减法运算中的局限性和可恢复性提供了更清晰的描述。

Summary / 总结

This study investigates subtraction in large language models (LLMs), finding that accuracy lags well behind addition. Errors for a-b are concentrated in cases where a<b (the minuend is smaller than the subtrahend): models frequently produce the correct magnitude but omit the negative sign, even though probing shows they internally encode whether the result should be negative. Few-shot prompting yields modest gains, while instruction-tuned models achieve near-perfect accuracy in generating the negative sign.

研究考察了大型语言模型（LLMs）在减法中的表现，发现其准确性远低于加法。当被减数小于减数时，LLMs 常常产生正确的数值但遗漏负号。少量示例学习提供了一些改进，但指令调优使得生成正确负号的准确率达到近乎完美。
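
The error pattern reported in the abstract (correct magnitude, missing negative sign when a<b) is easy to separate from other errors when scoring model outputs. The sketch below is an assumed scoring helper, not the authors' evaluation code; `classify` and `tally` are hypothetical names, and the predictions are made-up integers standing in for parsed model outputs.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(a: int, b: int, prediction: int) -> str:
    """Label a predicted value for a - b."""
    truth = a - b
    if prediction == truth:
        return "correct"
    # Right magnitude, but the negative sign was dropped.
    if truth < 0 and prediction == -truth:
        return "sign_omitted"
    return "other_error"


def tally(cases: Iterable[Tuple[int, int, int]]) -> Counter:
    """Count outcome labels over (a, b, prediction) triples."""
    return Counter(classify(a, b, p) for a, b, p in cases)


# Made-up predictions for illustration.
cases = [(8, 3, 5), (3, 8, 5), (3, 8, -5), (10, 25, 14)]
print(tally(cases))
```

Reporting the `sign_omitted` share separately makes it visible when a low subtraction score is mostly a sign problem rather than a magnitude problem.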

When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

Authors: Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, Paul Pu Liang

Venue: NeurIPS 2025

First: 2025-11-04T18:20:13+00:00 · Latest: 2025-11-04T18:20:13+00:00

Comments: Accepted at the Multimodal Algorithmic Reasoning (MAR) Workshop, NeurIPS 2025

Abstract

Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles, providing insight into whether failures may arise from dataset artifacts or model limitations. More broadly, our framework offers a diagnostic scaffold for multimodal reasoning, supporting principled auditing of fusion dynamics and informing possible interventions.

中文标题/摘要

标题：一种诊断视角下的多模态推理：当一种模态破坏其他模态时

尽管多模态大型语言模型（MLLMs）迅速发展，但其推理过程仍然不透明：通常不清楚哪个模态驱动预测，冲突如何解决，或者哪个模态占主导地位。在本文中，我们引入了模态破坏这一诊断失效模式，即高置信度的单模态错误会覆盖其他证据并误导融合结果。为了分析这种动态，我们提出了一种轻量级、模型无关的评估层，将每个模态视为一个代理，生成候选标签和简短的自我评估，用于审计。简单的融合机制汇总这些输出，揭示贡献者（支持正确结果的模态）和破坏者（误导的模态）。在使用基础模型对多模态情感识别基准进行案例研究时，我们的诊断层揭示了系统的可靠性特征，提供了有关失败可能是由数据集缺陷还是模型限制引起的见解。更广泛地说，我们的框架为多模态推理提供了一种诊断框架，支持融合动态的原理性审计，并为可能的干预措施提供信息。

Summary / 总结

This paper addresses the opacity of reasoning in multimodal large language models by introducing modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. Its lightweight, model-agnostic evaluation layer treats each modality as an agent that produces candidate labels and a brief self-assessment; a simple fusion mechanism then exposes contributors and saboteurs. Applied in a case study on multimodal emotion recognition benchmarks, the layer revealed systematic reliability profiles that give insight into whether failures may arise from dataset artifacts or model limitations, supporting principled auditing and informing possible interventions.

该论文针对多模态大型语言模型推理过程不透明的问题，提出了“模态破坏”这一诊断性失效模式：高置信度的单模态错误覆盖其他证据并误导融合结果。论文提出的轻量级、模型无关的评估层将每个模态视为一个代理，生成候选标签和简短的自我评估，再通过简单的融合机制识别贡献者和破坏者。研究将该诊断层应用于多模态情感识别基准的案例研究，揭示了系统性的可靠性特征，有助于判断失败源于数据集缺陷还是模型限制，并为原则性的审计和可能的干预提供依据。
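
A minimal sketch of the audit described above, under stated assumptions: the abstract only says that a simple fusion mechanism aggregates per-modality outputs, so the confidence-weighted vote in `fuse` and the exact definition of a saboteur in `audit` are choices made here for illustration, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# modality name -> (candidate label, self-reported confidence in [0, 1])
Votes = Dict[str, Tuple[str, float]]


def fuse(votes: Votes) -> str:
    """Confidence-weighted vote over the labels proposed by each modality."""
    scores: Dict[str, float] = defaultdict(float)
    for label, confidence in votes.values():
        scores[label] += confidence
    return max(scores, key=scores.get)


def audit(votes: Votes, gold: str) -> Tuple[str, List[str], List[str]]:
    """Return the fused label, the contributors, and the saboteurs."""
    fused = fuse(votes)
    # Contributors: modalities that supported the correct outcome.
    contributors = sorted(m for m, (label, _) in votes.items() if label == gold)
    # Saboteurs (assumed definition): modalities that backed a wrong fused label.
    saboteurs = sorted(
        m for m, (label, _) in votes.items() if fused != gold and label == fused
    )
    return fused, contributors, saboteurs


votes = {"text": ("happy", 0.4), "audio": ("happy", 0.3), "video": ("angry", 0.95)}
print(audit(votes, gold="happy"))
```

In this made-up example two modalities agree on the correct label, yet a single high-confidence video prediction outweighs them, which is the sabotage pattern the paper names.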

AI-Generated Image Detection: An Empirical Study and Future Research Directions

Authors: Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik

First: 2025-11-04T18:13:48+00:00 · Latest: 2025-11-04T18:13:48+00:00

Abstract

The threats posed by AI-generated media, particularly deepfakes, now raise significant challenges for multimedia forensics, misinformation detection, and biometric systems, resulting in erosion of public trust in the legal system, a significant increase in fraud, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) on seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.

中文标题/摘要

标题：AI生成图像检测：实证研究与未来研究方向

AI生成媒体，尤其是深度伪造，现在对多媒体取证、虚假信息检测和生物识别系统构成了重大挑战，导致公众对法律体系的信任下降，欺诈和社交工程攻击显著增加。尽管提出了多种取证方法，但它们存在三个关键缺陷：(i) 使用非标准化基准与GAN或扩散生成的图像，(ii) 不一致的训练协议（例如，从零开始、冻结、微调），(iii) 有限的评估指标无法捕捉泛化能力和可解释性。这些限制阻碍了公平比较，模糊了真正的鲁棒性，并限制了在安全关键应用中的部署。本文介绍了一个统一的基准框架，用于在受控和可重复条件下系统评估取证方法。我们对十种最先进的（从零开始、冻结和微调）取证方法和七种公开可用的数据集（GAN和扩散）进行了基准测试，进行了广泛的系统评估。我们使用多种指标评估性能，包括准确率、平均精度、ROC-AUC、错误率和类别敏感性。我们还进一步使用置信曲线和Grad-CAM热图分析模型可解释性。我们的评估表明，某些方法在同分布性能上表现出色，但在跨模型迁移性上有所下降。本研究旨在引导研究界更深入地理解当前取证方法的优势和局限性，并激发开发更鲁棒、更具泛化能力和可解释性的解决方案。

Summary / 总结

This paper addresses the challenges posed by AI-generated media, particularly deepfakes, by introducing a unified benchmarking framework to evaluate forensic methods. The study benchmarks ten state-of-the-art forensic methods across seven publicly available datasets, using multiple metrics to assess performance and model interpretability. Key findings show significant variability in generalization, with some methods excelling in in-distribution performance but struggling with cross-model transferability.

本文针对AI生成媒体，特别是深度伪造，在多媒体取证和虚假信息检测中带来的挑战。该研究引入了一个统一的基准框架，在受控条件下评估十种最先进的取证方法。研究使用七个公开可用的数据集对这些方法进行基准测试，并使用多种指标进行评估，揭示了显著的泛化差异。某些方法在其训练分布内表现良好，但在跨模型迁移时表现较差。

Measuring AI Diffusion: A Population-Normalized Metric for Tracking Global AI Usage

Authors: Amit Misra, Jane Wang, Scott McCullers, Kevin White, Juan Lavista Ferres

First: 2025-11-04T18:03:51+00:00 · Latest: 2025-11-04T18:03:51+00:00

Comments: 18 pages, 6 figures, 2 tables. Also available at https://aka.ms/AI_Diffusion_Technical_Report

Abstract

Measuring global AI diffusion remains challenging due to a lack of population-normalized, cross-country usage data. We introduce AI User Share, a novel indicator that estimates the share of each country's working-age population actively using AI tools. Built from anonymized Microsoft telemetry and adjusted for device access and mobile scaling, this metric spans 147 economies and provides consistent, real-time insight into global AI diffusion. We find wide variation in adoption, with a strong correlation between AI User Share and GDP. High uptake is concentrated in developed economies, though usage among internet-connected populations in lower-income countries reveals substantial latent demand. We also detect sharp increases in usage following major product launches, such as DeepSeek in early 2025. While the metric's reliance solely on Microsoft telemetry introduces potential biases related to this user base, it offers an important new lens into how AI is spreading globally. AI User Share enables timely benchmarking that can inform data-driven AI policy.

中文标题/摘要

标题：测量AI扩散：一种人口标准化的全球AI使用度量

由于缺乏人口标准化的跨国使用数据，全球AI扩散的测量仍然具有挑战性。我们引入了AI用户份额这一新型指标，以估算每个国家劳动年龄人口中活跃使用AI工具的比例。该指标基于匿名的微软遥测数据，并调整了设备访问和移动扩展因素，覆盖147个经济体，提供了关于全球AI扩散的持续、实时洞察。我们发现采用率存在广泛差异，AI用户份额与GDP之间存在显著相关性。高采用率集中在发达经济体，但低收入国家联网人口的使用情况显示了巨大的潜在需求。我们还发现，在主要产品发布后，如2025年初的DeepSeek，使用率出现了急剧增长。尽管该指标仅依赖于微软遥测数据，可能会引入与该用户群体相关的潜在偏差，但它提供了一个重要的新视角，用于观察AI在全球的传播情况。AI用户份额能够实现及时基准测试，从而为数据驱动的AI政策提供信息。
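
The abstract does not give the formula behind AI User Share, so the following is only a sketch of what population normalization with a coverage adjustment could look like. The function `ai_user_share` and its `coverage` parameter are hypothetical, and the single adjustment factor is an assumption standing in for the device-access and mobile-scaling corrections the authors mention.

```python
def ai_user_share(observed_users: float,
                  working_age_population: float,
                  coverage: float = 1.0) -> float:
    """Estimate the share of the working-age population using AI tools.

    coverage is a hypothetical factor: the fraction of actual AI users that
    the telemetry source is assumed to observe (1.0 means no adjustment).
    """
    if working_age_population <= 0:
        raise ValueError("working_age_population must be positive")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    estimated_users = observed_users / coverage
    # A share cannot exceed the whole population.
    return min(1.0, estimated_users / working_age_population)


# Made-up numbers for illustration.
print(ai_user_share(100, 1000))                # no adjustment
print(ai_user_share(100, 1000, coverage=0.5))  # telemetry assumed to see half of users
```

Dividing by the working-age population rather than reporting raw user counts is what makes the figure comparable between large and small countries.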

Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design

Authors: Chris Zhuang, Debadyuti Mukherjee, Yingzhou Lu, Tianfan Fu, Ruqi Zhang

First: 2025-02-14T02:03:39+00:00 · Latest: 2025-11-04T18:02:45+00:00

Abstract

Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.

中文标题/摘要

标题：Gradient GA：用于药物分子设计的梯度遗传算法

分子发现为化学工业带来了巨大利益。各种分子设计技术被开发出来以识别具有理想性质的分子。传统的优化方法，如遗传算法，继续在多个分子设计基准上取得最先进的结果。然而，这些技术仅依赖于随机探索，这阻碍了最终解决方案的质量和收敛速度。为了解决这一局限性，我们提出了一种名为梯度遗传算法（Gradient GA）的新方法，该方法将目标函数的梯度信息整合到遗传算法中。每个提议的样本通过遵循梯度方向迭代地向最优解前进，而不是进行随机探索。我们通过设计由神经网络参数化的可微目标函数，并利用离散拉梅辛提案来在离散分子空间中实现梯度指导，实现了这一点。实验结果表明，我们的方法在收敛速度和解决方案质量上都有显著改进，超越了最先进的技术。例如，它在顶级10分上比传统的遗传算法提高了25%。代码可在https://github.com/debadyuti23/GradientGA上公开获取。

Summary / 总结

The research aims to enhance the efficiency and quality of molecular design by addressing the limitations of traditional genetic algorithms. The Gradient Genetic Algorithm (Gradient GA) incorporates gradient information into genetic algorithms, allowing for directed optimization. Experiments show that Gradient GA significantly improves convergence speed and solution quality, achieving up to a 25% better top-10 score compared to the vanilla genetic algorithm.

研究旨在通过解决传统遗传算法的局限性，提高分子设计的效率和质量。提出了梯度遗传算法（Gradient GA），该算法将目标函数的梯度信息整合到遗传算法中。该方法使用由神经网络参数化的可微目标函数和离散朗之万提议（Discrete Langevin Proposal）来引导搜索方向，从而加快收敛速度并获得更好的解。实验表明，Gradient GA的top-10得分比原始遗传算法最多提高25%。

LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Authors: Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel

First: 2025-09-22T22:43:44+00:00 · Latest: 2025-11-04T18:01:01+00:00

Comments: 17 pages, 8 figures. EMNLP2025 Findings

Abstract

Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, the Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1&2&3 tasks (1K-8K context length) and the BABILong benchmark (QA2&QA3, 0K-16K context length), requiring less than 0.1% of the pre-training tokens compared with pre-trained models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources. Code is released at: https://github.com/zeyuliu1037/LAWCAT

中文标题/摘要

标题：LAWCAT：通过跨令牌卷积将二次注意力高效蒸馏为线性注意力以进行长上下文建模

尽管Transformer架构在多个领域取得了最先进的性能，但其相对于序列长度的二次计算复杂度仍然是一个显著瓶颈，对延迟敏感的长上下文应用尤其如此。虽然最近的线性复杂度替代方案越来越强大，但从头开始有效训练它们仍然耗费大量资源。为克服这些限制，我们提出了LAWCAT（Linear Attention with Convolution Across Time），这是一种新型线性化框架，旨在将预训练Transformer的能力高效迁移到高性能的线性注意力架构中。LAWCAT集成因果Conv1D层以增强局部依赖建模，并采用归一化门控线性注意力以提高在不同上下文长度上的泛化能力。我们的全面评估表明，仅使用1K长度的序列蒸馏Mistral-7B，即可在最长22K令牌的范围内获得超过90%的passkey检索准确率，显著扩展了其有效上下文窗口。同样，Llama3.2-1B的LAWCAT变体在S-NIAH 1&2&3任务（1K-8K上下文长度）和BABILong基准（QA2&QA3，0K-16K上下文长度）上取得了有竞争力的表现，所需的预训练令牌不到预训练模型的0.1%。此外，当序列长度超过8K令牌时，LAWCAT的预填充速度快于FlashAttention-2。因此，LAWCAT为构建适合边缘部署的高性能长上下文线性模型提供了一条高效途径，减少了对大量长序列训练数据和计算资源的依赖。代码发布在：https://github.com/zeyuliu1037/LAWCAT
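
The abstract mentions causal Conv1D layers for local dependency modeling. As a reminder of what "causal" means here, the sketch below implements a single-channel causal 1D convolution in plain Python: the output at position t depends only on inputs at positions t, t-1, and so on. It is a generic illustration, not LAWCAT's layer; `causal_conv1d` is a hypothetical helper, and the gated linear attention part of the model is not shown.

```python
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Compute y[t] = sum_j kernel[j] * x[t - j], treating x[t] as 0 for t < 0."""
    out: List[float] = []
    for t in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            if t - j >= 0:  # never look at future positions
                acc += weight * x[t - j]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(signal, kernel=[1.0, 1.0]))

# Changing a later input leaves earlier outputs untouched.
changed = [1.0, 2.0, 3.0, 100.0]
print(causal_conv1d(changed, kernel=[1.0, 1.0])[:3])
```

This property is what lets such a layer be used in an autoregressive model without leaking information from future tokens.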

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Authors: Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

First: 2025-11-04T18:00:51+00:00 · Latest: 2025-11-04T18:00:51+00:00

Comments: 28 pages, 15 figures

Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images - such as sketches, structural diagrams, or path drawings - to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.

中文标题/摘要

标题：当可视化成为推理第一步：MIRA，一种视觉链式思考基准

我们提出了MIRA，一种新的基准，旨在评估在生成中间视觉图像对于成功推理至关重要的场景中的模型性能。与仅依赖文本的传统链式思考方法不同，MIRA中的任务要求模型生成并利用中间图像（如草图、结构图或路径图）来引导其推理过程。这种设置类似于人类通过“边画边想”解决复杂问题的方式。MIRA关注那些本质上具有挑战性且涉及复杂结构、空间关系或难以仅通过语言表达的推理步骤的任务。为了确保评估数据的质量，我们包括了546个多模态问题，并且这些问题都标注了中间视觉图像和最终答案。我们还为MIRA提出了一个统一的评估协议，涵盖了三个级别的评估输入：仅图像和问题的直接输入，仅文本的链式思考输入，以及包含标注图像线索和文本思考提示的视觉链式思考输入。为了探索模型在基准上的上限，我们还报告了在不同k设置下的pass@k和多数投票准确率。实验结果表明，现有的多模态大型语言模型，包括最强的私有模型以及强大的开源模型，在仅依赖文本提示时表现不佳。然而，当提供中间视觉线索时，模型的性能会持续提升，所有模型和任务的平均相对增益为33.7%。我们还通过扩大搜索空间并设计与视觉链式思考相匹配的文本提示来探索上限，但这些方法与我们的视觉链式思考设置相比，仅带来了有限的改进。这些结果强调了想象中的视觉信息在MIRA中成功推理中的关键作用。
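
The abstract reports pass@k and majority-voting accuracies but does not say how they are computed. A common choice, assumed here rather than confirmed by the source, is the unbiased pass@k estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, together with a plain majority vote over sampled answers. The function names below are hypothetical.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated, c: number of correct samples, k: draw size.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


print(pass_at_k(n=4, c=1, k=1))
print(majority_vote(["B", "A", "B"]))
```

The two metrics answer different questions: pass@k asks whether any sample is right, while majority voting asks whether the right answer is also the most common one.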

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Authors: Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

First: 2025-11-04T18:00:18+00:00 · Latest: 2025-11-04T18:00:18+00:00

Comments: Project page: https://csu-jpg.github.io/VCode Github: https://github.com/CSU-JPG/VCode

Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs; their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.

中文标题/摘要

标题：VCode：一种以SVG为符号视觉表示的多模态编码基准

代码已成为代理时代进行推理和行动的精确且可执行的媒介。然而，进展主要集中在诸如程序合成和调试的语言中心任务上，视觉中心的编码则被严重忽视。受人类如何推理草图的启发，我们提倡使用SVG代码作为紧凑、可解释且可执行的视觉表示。我们引入了VCode，一个将多模态理解重新定义为代码生成的基准：给定一张图片，模型必须生成SVG代码以保留符号意义供后续推理使用。VCode涵盖了三个领域——通用常识（MM-Vet）、专业学科（MMMU）和视觉中心感知（CV-Bench）。为了评估符号保真度，我们提出了CodeVQA，一种新的评估协议，在该协议中，策略模型对渲染的SVG进行问答；正确答案表明符号保真度良好。实证研究表明，前沿的视觉语言模型在生成忠实的SVG方面存在困难，揭示了语言中心与视觉中心编码之间的持续差距。为了缩小这一差距，我们引入了VCoder，一种代理框架，沿着两个轴线增强视觉语言模型：（i）迭代分析差异并改进SVG代码的思考与修订；（ii）使用视觉工具行动，其中检测器和解析器提供结构化线索，如对象、形状和文本，超出模型的固有能力。在各种基准测试中，具有强大推理能力的前沿视觉语言模型总体上表现良好，但在专业知识和3D推理方面仍有限制。VCoder相对于表现最佳的Claude-4-Opus实现了12.3分的整体提升。人类研究表明，无论是人类还是视觉语言模型在渲染的SVG上表现都较差，这种一致性揭示了符号视觉表示的潜力。基准和代码可在https://github.com/CSU-JPG/VCode获取。

PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing

Authors: Antonio Oroz, Matthias Nießner, Tobias Kirschstein

Venue: www

First: 2025-11-04T17:59:15+00:00 · Latest: 2025-11-04T17:59:15+00:00

Comments: Project Page: https://antoniooroz.github.io/PercHead/ Video: https://www.youtube.com/watch?v=4hFybgTk4kE

Abstract

We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE

中文标题/摘要

标题：PercHead：单张图像三维头部重建与编辑的感知头部模型

我们提出了PercHead，一种用于单张图像三维头部重建和语义三维编辑的方法——这两个任务由于严重的视角遮挡、弱的感知监督以及三维空间编辑的模糊性而具有固有的挑战性。我们开发了一个统一的基础模型，可以从单张输入图像中重建视图一致的三维头部。该模型采用双分支编码器，随后是基于ViT的解码器，通过迭代交叉注意力将2D特征提升到三维空间。渲染使用高斯点绘制。我们方法的核心是一种基于DINOv2和SAM2.1的新型感知监督策略，为几何和外观保真度提供了丰富的、通用的信号。我们的模型在新颖视角合成方面达到了最先进的性能，并且与现有基线相比，在极端视角下表现出色。此外，该基础模型可以通过替换编码器并微调网络无缝扩展用于语义三维编辑。在这一变体中，我们通过两种不同的输入模态分离几何和风格：分割图控制几何，文本提示或参考图像指定外观。我们通过一个轻量级的交互式GUI展示了我们模型直观且强大的三维编辑能力，用户可以通过绘制分割图轻松塑造几何，并通过自然语言或图像提示进行外观修饰。

Summary

PercHead is a method for single-image 3D head reconstruction and semantic 3D editing that addresses challenges such as severe view occlusions and weak perceptual supervision. It uses a unified base model with a dual-branch encoder and a ViT-based decoder that lifts 2D features into 3D, and it renders with Gaussian Splatting. A novel perceptual supervision strategy based on DINOv2 and SAM2.1 yields state-of-the-art performance in novel-view synthesis and strong robustness to extreme viewing angles. The base model can be extended to semantic 3D editing by swapping the encoder and fine-tuning the network, so that users control geometry through segmentation maps and appearance through text or image prompts.

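PercHead is a method for single-image 3D head reconstruction and semantic 3D editing that addresses challenges such as view occlusion and weak perceptual supervision. It uses a unified base model with a dual-branch encoder and a ViT-based decoder for 3D feature lifting, and renders with Gaussian Splatting. The model uses DINOv2 and SAM2.1 for perceptual supervision, achieving the best performance in novel-view synthesis and performing well at extreme viewing angles. By replacing the encoder and fine-tuning the network, the base model can also be extended to semantic 3D editing, where users control geometry with segmentation maps and specify appearance with text or image prompts.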
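
The abstract describes perceptual supervision built on features from DINOv2 and SAM2.1 but does not give the loss itself. As a rough illustration only, the sketch below compares patch features of a rendered view and a target view with an average cosine distance. The loss form, the function names, and the plain-list feature layout are all assumptions made for this example, and the frozen feature extractor is left out entirely.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector counts as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def perceptual_loss(rendered_feats, target_feats):
    """Average cosine distance over corresponding patch features.

    Both arguments are lists of per-patch feature vectors, standing in for
    what a frozen encoder would return for the rendered and target images
    (hypothetical layout, not taken from the paper).
    """
    if not rendered_feats or len(rendered_feats) != len(target_feats):
        raise ValueError("feature maps must be non-empty and the same length")
    distances = [cosine_distance(r, t) for r, t in zip(rendered_feats, target_feats)]
    return sum(distances) / len(distances)
```

In a real training loop the two feature lists would come from the same frozen network applied to the rendering and to the ground-truth image, and the loss would be backpropagated through the renderer only.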

STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation

Authors: Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka, Joseph A Morrone, Jianying Hu, Parthasarathy Suryanarayanan

First: 2025-11-04T17:56:00+00:00 · Latest: 2025-11-04T17:56:00+00:00

Comments: 16 pages, 3 figures, 2 tables

Abstract

The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.

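Translated Title / Abstract

Title: STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation

The chemical space of drug-like molecules is vast, which motivates generative models that learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and generate molecules quickly. Meeting these goals depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address these challenges, we propose STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation makes conditional generation possible: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient fine-tuning with low-rank adapters (LoRA) in both the encoder and the decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when combined with principled conditioning and parameter-efficient fine-tuning.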

Summary

STAR-VAE generates drug-like molecules efficiently and controllably using a Transformer-based autoregressive variational autoencoder trained on 79 million molecules from PubChem, with SELFIES guaranteeing syntactic validity. It matches or exceeds baselines on the GuacaMol and MOSES benchmarks, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding, indicating effective property-guided generation. Low-rank adapters (LoRA) in both the encoder and the decoder allow efficient fine-tuning with limited property and activity data.

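The research aims to develop a scalable and controllable model for generating drug-like molecules by addressing the challenges of learning broad chemical distributions, conditional generation, and fast molecular generation. It proposes STAR-VAE, a Transformer-based VAE that uses SELFIES to guarantee syntactic validity and a latent-variable formulation for conditional generation. The model achieves competitive results on the GuacaMol and MOSES benchmarks, and latent-space analyses show smooth, semantically structured representations that support unconditional exploration and property-aware generation. On the Tartarus benchmarks, the model demonstrates effective property-guided generation by shifting docking-score distributions toward stronger predicted binding.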
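
The abstract says the conditioning signal is applied to the latent prior, the inference network, and the decoder. In a standard conditional VAE this gives an objective of the form E_q(z|x,c)[log p(x|z,c)] - KL(q(z|x,c) || p(z|c)). The sketch below computes that objective for diagonal Gaussian posteriors and priors. The Gaussian parameterization, the `beta` weight, and all function names are assumptions for illustration, not details confirmed by the paper.

```python
import math


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances.

    q plays the role of the inference network q(z | x, c) and p the
    conditional prior p(z | c); both are assumed Gaussian here.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def conditional_elbo(recon_log_likelihood, mu_q, logvar_q, mu_p, logvar_p, beta=1.0):
    """Reconstruction term minus a beta-weighted KL to the conditional prior.

    recon_log_likelihood stands for log p(x | z, c) from the decoder;
    beta is a hypothetical weighting knob, with beta=1 giving the plain ELBO.
    """
    return recon_log_likelihood - beta * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

Because the prior parameters `mu_p` and `logvar_p` depend on the condition, the KL term pulls the posterior toward a property-specific region of latent space instead of toward a single fixed standard normal.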

A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Authors: Guoxuan Xia, Harleen Hanspal, Petru-Daniel Tudosiu, Shifeng Zhang, Sarah Parisot

First: 2025-07-21T15:33:49+00:00 · Latest: 2025-11-04T17:54:35+00:00

Comments: TMLR https://openreview.net/forum?id=loT6xhgLYK

Abstract

Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency.

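Translated Title / Abstract

Title: A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Enabling spatial control of image generation models is an important research area, allowing users to generate images that better follow their own fine-grained specifications (for example, edge maps or poses). Although this task has improved markedly in recent years, the push to produce stronger models quickly has come at the expense of detailed and fair scientific comparison. Differing training data, model architectures, and generation paradigms make it difficult to separate the factors that affect performance. Meanwhile, the motivations and nuances of certain approaches have been lost in the literature. In this work, we aim to give clear takeaways to practitioners who wish to develop transformer-based systems for spatially-controlled generation, clarifying the literature and filling knowledge gaps. We run controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive models. First, we establish control token prefilling as a simple, general, and performant baseline for transformers. We then study previously underexplored sampling-time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, has a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but fall short of full training in generation-control consistency.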

Summary

This study investigates spatially-controlled image generation with transformers, aiming to give practitioners clear takeaways across generation paradigms. The authors run controlled experiments on ImageNet with diffusion-based/flow-based and autoregressive models. They establish control token prefilling as a simple baseline and examine sampling-time enhancements, finding that extending classifier-free guidance to the control signal and applying softmax truncation both strongly affect control-generation consistency. They also revisit adapter-based approaches, showing that adapters mitigate "forgetting" and maintain generation quality when trained on limited downstream data but underperform full training in generation-control consistency.

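The study examines spatially-controlled image generation with transformers, aiming to give practitioners clear guidance. The authors run controlled experiments on ImageNet across different models and paradigms. They establish control token prefilling as a baseline approach and explore sampling-time enhancements, finding that classifier-free guidance and softmax truncation improve control consistency. They also re-evaluate adapter-based approaches, showing that these maintain generation quality with limited downstream data but perform worse on control consistency.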
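
The two sampling-time enhancements named in the abstract can be sketched for the autoregressive case as follows. The first function applies the usual classifier-free guidance extrapolation, with the control signal as the condition being dropped. The second keeps only the `top_k` largest logits before normalizing, which is one common reading of "softmax truncation" and is an assumption here. Function names, the list-based logits, and the choice of top-k are all illustrative and not taken from the paper.

```python
import math


def guided_logits(logits_with_control, logits_without_control, guidance_scale):
    """Classifier-free guidance with the control signal as the condition.

    Extrapolates from the prediction made without control toward the one
    made with control. A scale of 1.0 returns the controlled logits.
    """
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(logits_with_control, logits_without_control)
    ]


def truncated_softmax(logits, top_k):
    """Softmax restricted to the top_k largest logits; the rest get probability 0."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    max_logit = max(logits[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - max_logit) for i in keep}
    total = sum(weights.values())
    return [weights[i] / total if i in weights else 0.0 for i in range(len(logits))]
```

A sampler would call `guided_logits` on the two forward passes for the next token and then draw from `truncated_softmax` of the result. For diffusion-based or flow-based models the same extrapolation would apply to the predicted noise or velocity instead of to logits.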