arXiv Paper Digest

Glyph: Scaling Context Windows via Visual-Text Compression

Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang

First: 2025-10-20T17:58:56+00:00 · Latest: 2025-10-20T17:58:56+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
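The compression figures in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, "128K" is assumed to mean 128 × 1024 tokens, and "1M" is assumed to mean 1,000,000 tokens. It computes how many text tokens fit in a fixed visual-token budget at a given compression ratio, and what ratio would be needed to reach the 1M level.

```python
def effective_text_tokens(visual_context_tokens: int, compression_ratio: float) -> int:
    """Text tokens covered when each visual token stands in for
    `compression_ratio` text tokens on average."""
    return int(visual_context_tokens * compression_ratio)


def required_ratio(target_text_tokens: int, visual_context_tokens: int) -> float:
    """Compression ratio needed to fit the target text into the visual budget."""
    return target_text_tokens / visual_context_tokens


# Assumption: "128K" is read as 128 * 1024 = 131,072 tokens.
CONTEXT_128K = 128 * 1024

for ratio in (3.0, 4.0):
    print(f"{ratio:.0f}x compression covers {effective_text_tokens(CONTEXT_128K, ratio):,} text tokens")

# Assumption: "1M" is read as 1,000,000 text tokens.
print(f"ratio needed for 1M tokens: {required_ratio(1_000_000, CONTEXT_128K):.2f}x")
```

Under these assumptions, the reported 3-4x range covers roughly 0.4M to 0.5M text tokens, which is consistent with the abstract reserving the 1M-token claim for "extreme compression".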

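Title and Abstract (translated from Chinese)

Title: Glyph: Scaling Context Windows via Visual-Text Compression

Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, extending context windows to the million-token level incurs enormous computational and memory costs, which limits the practical use of long-context LLMs. In this work, we take a different perspective, visual context scaling, to address this challenge. Rather than extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses the textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify the visual rendering configurations that best balance accuracy and compression. Through extensive experiments, we show that our method achieves 3-4x token compression on various long-context benchmarks while maintaining accuracy comparable to leading LLMs such as Qwen3-8B. This compression also makes prefilling and decoding about 4x faster and SFT training about 2x faster. Furthermore, under extreme compression, a 128K-context VLM can scale to handle 1M-token-level text tasks. The rendered text data also benefits real-world multimodal tasks such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.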

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty

First: 2025-10-20T17:52:06+00:00 · Latest: 2025-10-20T17:52:06+00:00

Comments: 29 pages, 9 tables, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
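The abstract names "iterative rejection-sampling supervised finetuning" without giving details, so the following is only a generic sketch of that idea, not the authors' pipeline. All names (`rejection_sample_round`, `iterative_rejection_sampling`, `toy_judge`) are hypothetical, the judge is a random stub, and the fine-tuning step between rounds is left as a comment.

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]                                  # {"prompt": ..., "label": ...}
Judge = Callable[[str, random.Random], Tuple[str, str]]  # returns (rationale, verdict)


def rejection_sample_round(samples: List[Sample], judge: Judge, k: int,
                           rng: random.Random) -> List[Dict[str, str]]:
    """Draw up to k judgments per sample; keep the first one whose verdict
    matches the ground-truth label and reject the rest."""
    kept = []
    for sample in samples:
        for _ in range(k):
            rationale, verdict = judge(sample["prompt"], rng)
            if verdict == sample["label"]:
                kept.append({"prompt": sample["prompt"],
                             "response": rationale,
                             "verdict": verdict})
                break
    return kept


def iterative_rejection_sampling(samples: List[Sample], judge: Judge,
                                 rounds: int, k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Accumulate accepted judgments over several rounds, retrying only the
    samples that have not yet produced an accepted judgment."""
    rng = random.Random(seed)
    sft_data: List[Dict[str, str]] = []
    remaining = list(samples)
    for _ in range(rounds):
        kept = rejection_sample_round(remaining, judge, k, rng)
        sft_data.extend(kept)
        solved = {row["prompt"] for row in kept}
        remaining = [s for s in remaining if s["prompt"] not in solved]
        # Assumption: a real pipeline would fine-tune the judge on sft_data here
        # before the next round; this sketch keeps the judge fixed.
    return sft_data


def toy_judge(prompt: str, rng: random.Random) -> Tuple[str, str]:
    """Stand-in for a generative evaluator: picks a pairwise verdict at random."""
    verdict = rng.choice(["A", "B"])
    return f"Response {verdict} is better for: {prompt}", verdict


data = [{"prompt": "Which proof is correct?", "label": "A"},
        {"prompt": "Which solution is faster?", "label": "B"}]
sft_rows = iterative_rejection_sampling(data, toy_judge, rounds=2, k=4)
print(len(sft_rows), "accepted judgments")
```

Every accepted row agrees with its label by construction, which is the property that makes the filtered set usable as SFT targets.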

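Title and Abstract (translated from Chinese)

Title: Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Fine-tuning specialized generative evaluators has become a popular paradigm for meeting the growing demand for scalable evaluation during both training and test time. However, recent work has mostly focused on training evaluators with new methods such as reinforcement learning (RL), while shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five distinct evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains centered on reasoning evaluation. With this data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, using a simple iterative rejection-sampling supervised fine-tuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we also evaluate FARE on real-world tasks. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves downstream RL-trained model performance by up to 14.1% compared with string-matching verifiers. A continually fine-tuned FARE-Code, initialized from FARE, outperforms gpt-oss-20B by 65% on evaluating test-case quality.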

Summary

This research develops scalable automatic evaluators for reasoning-centric tasks by curating a dataset of 2.5 million samples across five evaluation tasks and multiple domains. A simple iterative rejection-sampling supervised finetuning approach is used to train Foundational Automatic Reasoning Evaluators (FARE) at 8B and 20B parameters. FARE-8B challenges larger specialized RL-trained evaluators, while FARE-20B sets a new standard for open-source evaluators, surpassing specialized 70B+ evaluators. In real-world use, FARE-20B achieves near-oracle performance on MATH as an inference-time reranker, and as a verifier in RL training FARE improves downstream model performance by up to 14.1% over string-matching verifiers. Additionally, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

SoftMimic: Learning Compliant Whole-body Control from Examples

Authors: Gabriel B. Margolis, Michelle Wang, Nolan Fey, Pulkit Agrawal

First: 2025-10-20T17:49:27+00:00 · Latest: 2025-10-20T17:49:27+00:00

Comments: Website: https://gmargo11.github.io/softmimic/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning allows humanoids to quickly learn new skills, but existing methods incentivize stiff control that aggressively corrects deviations from a reference motion, leading to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach leverages an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which we use to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and generalize to varied tasks from a single motion clip. We validate our method through simulations and real-world experiments, demonstrating safe and effective interaction with the environment.
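To make the contrast between "matching compliant responses" and "rigidly tracking the reference motion" concrete, here is a minimal sketch. It assumes a linear spring model (displacement equals force divided by stiffness) and an exponential tracking reward. Both are illustrative choices, not details taken from the paper, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def compliant_target(reference_pos: Sequence[float],
                     external_force: Sequence[float],
                     stiffness: float) -> List[float]:
    """Shift the reference position by force / stiffness on each axis,
    as a spring with the given stiffness (N/m) would."""
    return [p + f / stiffness for p, f in zip(reference_pos, external_force)]


def tracking_reward(actual_pos: Sequence[float],
                    target_pos: Sequence[float],
                    sigma: float = 0.1) -> float:
    """Reward in (0, 1] that decays with squared distance to the target."""
    sq_err = sum((a - t) ** 2 for a, t in zip(actual_pos, target_pos))
    return math.exp(-sq_err / sigma ** 2)


reference = [0.5, 0.0, 1.0]   # hand position from the motion clip (m)
push = [10.0, 0.0, 0.0]       # unexpected external force (N)
target = compliant_target(reference, push, stiffness=100.0)

yielded_hand = target         # a policy that gives way under the push
print("reward vs compliant target:", tracking_reward(yielded_hand, target))
print("reward vs stiff reference: ", tracking_reward(yielded_hand, reference))
```

A policy that yields under the push scores higher against the compliant target than against the unmodified reference, which is the incentive the abstract describes.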

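Title and Abstract (translated from Chinese)

Title: SoftMimic: Learning Compliant Whole-body Control from Examples

We introduce SoftMimic, a framework for learning compliant whole-body control policies for humanoid robots from example motions. Imitating human motions with reinforcement learning lets robots learn new skills quickly, but existing methods tend to incentivize stiff control that aggressively corrects deviations from the reference motion, which leads to brittle and unsafe behavior when the robot encounters unexpected contacts. In contrast, SoftMimic enables robots to respond compliantly to external forces while maintaining balance and posture. Our approach uses an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which is used to train a reinforcement learning policy. By rewarding the policy for matching compliant responses rather than rigidly tracking the reference motion, SoftMimic learns to absorb disturbances and to generalize to varied tasks from a single motion clip. We validate the method in simulation and in real-world experiments, demonstrating safe and effective interaction with the environment.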

Summary

SoftMimic is a framework that learns compliant whole-body control policies for humanoid robots from example motions. It addresses the stiff control produced by existing methods by enabling robots to respond compliantly to external forces while maintaining balance and posture. SoftMimic uses an inverse kinematics solver to generate an augmented dataset of feasible compliant motions, which is then used to train a reinforcement learning policy. The policy is rewarded for matching compliant responses rather than rigidly tracking the reference motion, allowing the robot to absorb disturbances and generalize to various tasks from a single motion clip. Simulations and real-world experiments demonstrate the safety and effectiveness of SoftMimic in interacting with the environment.

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Authors: Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

First: 2025-10-20T17:48:26+00:00 · Latest: 2025-10-20T17:48:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action -- seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.
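One way to picture "hybrid action" is as a single action space that is the union of GUI primitives and programmatic tool calls. The sketch below is a hypothetical illustration of that data structure. `GuiAction`, `ToolCall`, `execute`, and the `set_cell` tool are invented names, not part of UltraCUA's released interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class GuiAction:
    """Low-level primitive: click, type, or scroll at screen level."""
    kind: str
    args: Dict[str, object] = field(default_factory=dict)


@dataclass
class ToolCall:
    """High-level programmatic call, e.g. an API or MCP-style tool."""
    name: str
    args: Dict[str, object] = field(default_factory=dict)


HybridAction = Union[GuiAction, ToolCall]


def execute(action: HybridAction,
            tools: Dict[str, Callable[..., object]],
            gui_log: List[str]) -> object:
    """Dispatch one action: tool calls run directly, while GUI primitives are
    only logged here (a real agent would drive the desktop instead)."""
    if isinstance(action, ToolCall):
        return tools[action.name](**action.args)
    gui_log.append(f"{action.kind}({action.args})")
    return None


def set_cell(cell: str, value: str) -> str:
    """Hypothetical spreadsheet tool used only for this example."""
    return f"{cell}={value}"


tools = {"set_cell": set_cell}

# The same edit expressed as a GUI-only chain and as a single tool call.
gui_only = [GuiAction("click", {"x": 120, "y": 340}),
            GuiAction("type", {"text": "42"}),
            GuiAction("click", {"x": 10, "y": 10})]
hybrid = [ToolCall("set_cell", {"cell": "B2", "value": "42"})]

log: List[str] = []
for step in gui_only:
    execute(step, tools, log)
print(len(gui_only), "GUI steps vs", len(hybrid), "hybrid step")
```

Shorter chains of this kind are one plausible reading of the abstract's claim that hybrid action reduces error propagation, since each GUI step depends on correct visual grounding.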

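Title and Abstract (translated from Chinese)

Title: UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Multimodal agents for computer use rely exclusively on primitive actions such as click, type, and scroll, which require accurate visual grounding and lengthy execution chains and therefore lead to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action, seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine that produces over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale, high-quality collection of hybrid action trajectories containing both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline that combines supervised fine-tuning with online reinforcement learning, enabling the agent to alternate strategically between low-level and high-level actions. Experiments with our 7B and 32B models show substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models while needing 11% fewer steps. In out-of-domain evaluation on WindowsAgentArena, our model reaches a 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.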

Summary

UltraCUA is designed to enhance multimodal agents for computer use by integrating GUI primitives with high-level programmatic tool calls. It comprises an automated pipeline for scaling programmatic tools, a synthetic data engine, a large-scale hybrid action trajectory collection, and a two-stage training pipeline. Experiments show that UltraCUA models achieve an average 22% relative improvement over base models on OSWorld and a 21.7% success rate in out-of-domain evaluation on WindowsAgentArena, outperforming baselines trained on Windows data.

SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Authors: Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu

First: 2025-10-20T17:35:47+00:00 · Latest: 2025-10-20T17:35:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks -- while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
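The two-stage idea in the abstract (query-agnostic pruning at prefill, query-aware retrieval at decode) can be sketched on toy data. This is an illustration only: the salience scores, the dot-product relevance measure, and the function names are assumptions, and the real system operates on a model's KV cache rather than Python lists.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def prune_prefill(salience: Sequence[float], keep_ratio: float) -> List[int]:
    """Query-agnostic stage: keep the most salient fraction of visual tokens
    and return their indices in original order."""
    k = max(1, int(len(salience) * keep_ratio))
    top = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)[:k]
    return sorted(top)


def retrieve_for_decode(cache_keys: Sequence[Sequence[float]],
                        kept: Sequence[int],
                        query: Sequence[float],
                        top_k: int) -> List[int]:
    """Query-aware stage: from the retained cache, select the top_k tokens
    whose keys score highest against the current query."""
    ranked = sorted(kept, key=lambda i: dot(cache_keys[i], query), reverse=True)
    return sorted(ranked[:top_k])


# Toy visual cache: one 2-d key per visual token, plus a salience score.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
salience = [0.9, 0.8, 0.7, 0.1]

kept = prune_prefill(salience, keep_ratio=0.75)                        # done once
turn_1 = retrieve_for_decode(keys, kept, query=[0.0, 1.0], top_k=2)    # first question
turn_2 = retrieve_for_decode(keys, kept, query=[1.0, 0.0], top_k=2)    # follow-up question
print(kept, turn_1, turn_2)
```

Because the cache retained after prefill is larger than what any single turn attends to, different conversation rounds can retrieve different tokens. That matches the abstract's explanation of how multi-turn fidelity is preserved.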

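Title and Abstract (translated from Chinese)

Title: SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference

Vision Language Models (VLMs) have advanced rapidly in integrating visual and textual reasoning, powering applications such as high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens, which dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity, because it retains most of the visual cache so that query-relevant tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0x faster prefilling, 2.5x faster decoding, and an overall 2.6x end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.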

Summary

SparseVILA improves the efficiency of Vision Language Models (VLMs) by decoupling visual sparsity across the prefilling and decoding stages. It prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, achieving up to 4.0 times faster prefilling, 2.5 times faster decoding, and a 2.6 times end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By separating query-agnostic pruning from query-aware retrieval, SparseVILA offers a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.