arXiv Paper Digest

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang

First: 2025-12-26T18:59:47+00:00 · Latest: 2025-12-26T18:59:47+00:00

Abstract

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
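The two constraints in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the KL direction, the hinge form of the separation term, and the names `bips_style_loss`, `sep_weight`, and `margin` are all assumptions made for illustration. It treats the model's answer as a discrete distribution and combines a consistency term (to be driven down) with a separation term (to be pushed up to a margin).

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def bips_style_loss(p_orig, p_preserved, p_ablated, sep_weight=1.0, margin=1.0):
    """Hypothetical combination of the two constraints (assumed form).

    p_orig:      answer distribution given the original image
    p_preserved: answer distribution given the evidence-preserving view
    p_ablated:   answer distribution given the evidence-ablated view
    """
    # Consistency: the evidence-preserving view should yield the same answer
    # distribution as the original image, so this KL term is minimized.
    consistency = kl_divergence(p_orig, p_preserved)
    # Separation: the evidence-ablated view should yield a different answer
    # distribution, so a penalty applies while the KL stays below the margin.
    separation = kl_divergence(p_orig, p_ablated)
    return consistency + sep_weight * max(0.0, margin - separation)


# Toy usage: a model that depends on the visual evidence scores lower
# than one that answers identically even when the evidence is masked.
p_orig = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]
loss_grounded = bips_style_loss(p_orig, p_orig, uniform)
loss_shortcut = bips_style_loss(p_orig, uniform, p_orig)
```

Under these assumptions, `loss_grounded` is smaller than `loss_shortcut`, which matches the stated goal of discouraging text-only shortcuts.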

Summary

This paper targets vision-language models that rely on intermediate visual cues, whether injected by external tools or generated as latent visual tokens; such cues often overlook fine-grained visual evidence and generalize poorly. The proposed Bi-directional Perceptual Shaping (BiPS) method transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training, strengthening the model's reliance on fine-grained visual evidence. BiPS improves Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and generalizes strongly out of domain to unseen datasets and image types.

ProEdit: Inversion-based Editing From Prompts Done Right

Authors: Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng

First: 2025-12-26T18:59:14+00:00 · Latest: 2025-12-26T18:59:14+00:00

Comments: Equal contributions from first two authors. Project page: https://isee-laboratory.github.io/ProEdit/ Code: https://github.com/iSEE-Laboratory/ProEdit

Abstract

Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue in both the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
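The two components named in the abstract can be sketched on toy data. This is only an illustration under stated assumptions, not the ProEdit code: the function names `kv_mix` and `shift_latents`, the linear blending rule, the `mix_ratio` and `strength` parameters, and the use of Gaussian noise as the perturbation are all hypothetical. Features and latents are modeled as per-token lists of floats, with a boolean mask marking the edited region.

```python
import random


def kv_mix(src_feats, tgt_feats, edit_mask, mix_ratio=0.5):
    """Blend source and target K/V features inside the edited region only.

    Tokens outside the mask keep the source features, which is the assumed
    mechanism for preserving the background.
    """
    mixed = []
    for src, tgt, edited in zip(src_feats, tgt_feats, edit_mask):
        if edited:
            mixed.append(
                [(1 - mix_ratio) * s + mix_ratio * t for s, t in zip(src, tgt)]
            )
        else:
            mixed.append(list(src))
    return mixed


def shift_latents(src_latent, edit_mask, strength=0.5, seed=0):
    """Perturb the source latent inside the edited region (assumed: noise blend)."""
    rng = random.Random(seed)
    shifted = []
    for vec, edited in zip(src_latent, edit_mask):
        if edited:
            shifted.append(
                [(1 - strength) * v + strength * rng.gauss(0.0, 1.0) for v in vec]
            )
        else:
            shifted.append(list(vec))
    return shifted


# Toy usage: two tokens, the first inside the edited region.
src = [[1.0, 1.0], [2.0, 2.0]]
tgt = [[3.0, 3.0], [4.0, 4.0]]
mask = [True, False]
mixed_kv = kv_mix(src, tgt, mask)
shifted = shift_latents(src, mask)
```

In this sketch the background token passes through unchanged in both functions, while the edited token moves away from the source, which mirrors the abstract's description of weakening source influence only where the edit applies.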

Summary

ProEdit addresses over-reliance on source image information in inversion-based visual editing, which degrades edits in the target image. It introduces KV-mix, which mixes the KV features of the source and target in the edited region, and Latents-Shift, which perturbs the edited region of the source latent. Experiments show that ProEdit achieves state-of-the-art performance on several image and video editing benchmarks and that it is plug-and-play, integrating with existing methods such as RF-Solver, FireFlow, and UniEdit.

Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications

Authors: Shengkun Cui, Rahul Krishna, Saurabh Jha, Ravishankar K. Iyer

First: 2025-12-26T18:56:18+00:00 · Latest: 2025-12-26T18:56:18+00:00