arXiv 论文速递

VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Authors: Baolu Li, Yiming Zhang, Qinghe Wang, Liqian Ma, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Zhenfei Yin, Yunzhi Zhuge, Huchuan Lu, Xu Jia

First: 2025-10-29T17:59:53+00:00 · Latest: 2025-10-29T17:59:53+00:00

Comments: Project page: https://libaolu312.github.io/VFXMaster/

Abstract

Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential effect attributes, allowing a single unified model to master the effect imitation without information leakage. In addition, we propose an efficient one-shot effect adaptation mechanism to boost generalization capability on tough unseen effects from a single user-provided video rapidly. Extensive experiments demonstrate that our method effectively imitates various categories of effect information and exhibits outstanding generalization to out-of-domain effects. To foster future research, we will release our code, models, and a comprehensive dataset to the community.

中文标题/摘要

标题：VFXMaster：通过上下文学习解锁动态视觉效果生成

视觉效果（VFX）是数字媒体表达能力的关键，但其创建仍然是生成式AI的主要挑战。现有方法通常依赖于每种效果一个LoRA的范式，这既资源密集又无法泛化到未见过的效果，从而限制了可扩展性和创作能力。为应对这一挑战，我们提出了VFXMaster，这是首个统一的、基于参考的VFX视频生成框架。它将效果生成重新定义为上下文学习任务，使其能够从参考视频中复制出多样化的动态效果到目标内容上。此外，它还展示了对未见过的效果类别的出色泛化能力。具体来说，我们设计了一种上下文条件策略，通过参考示例提示模型。设计了一种上下文注意力掩码，以精确解耦并注入关键效果属性，使单一统一模型能够掌握效果模仿而不泄露信息。此外，我们提出了一种高效的单次效果适应机制，以快速从单个用户提供的视频中增强对难以泛化的未见过效果的泛化能力。大量实验表明，我们的方法能够有效模仿各种效果类别信息，并在域外效果上表现出色的泛化能力。为了促进未来研究，我们将发布我们的代码、模型和一个全面的数据集。

Summary / 总结

VFXMaster addresses the challenge of generating dynamic visual effects in digital media by introducing a unified framework that leverages in-context learning. It uses an in-context conditioning strategy and an attention mask to decouple and inject essential effect attributes, enabling a single model to generalize to unseen effects. Experiments show that VFXMaster effectively imitates various effect categories and demonstrates strong generalization to out-of-domain effects.

VFXMaster 是一个利用上下文学习来从参考视频中复制动态效果到目标内容的统一框架。它使用上下文条件策略和注意力掩码来精确注入效果属性，使单个模型能够处理各种效果而不会泄露信息。实验表明，VFXMaster 能够有效模仿多种效果类别，并且对未见过的效果具有出色的泛化能力。

Gaperon: A Peppered English-French Generative Language Model Suite

Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

First: 2025-10-29T17:59:39+00:00 · Latest: 2025-10-29T17:59:39+00:00

Abstract

We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.

中文标题/摘要

标题：Gaperon：一种英语-法语生成语言模型套件

我们发布了Gaperon，一个完全开源的法语-英语-编程语言模型套件，旨在促进大规模模型训练中的透明性和可再现性。Gaperon家族包括15亿、80亿和240亿参数模型，训练数据量为2-4万亿个标记，并附带整个训练管道的所有元素：用神经质量分类器过滤的法语和英语数据集，高效的数据整理和训练框架，以及数百个中间检查点。通过这项工作，我们研究了数据过滤和污染如何相互作用以影响基准测试和生成性能。我们发现，过滤以提高语言质量可以增强文本流畅性和连贯性，但会导致基准测试结果不佳；而晚期故意污染（继续在包含测试集的数据混合中进行训练）可以恢复有竞争力的分数，同时仅在合理范围内损害生成质量。我们讨论了通常的神经过滤如何无意中放大基准泄漏。为了支持进一步研究，我们还在预训练期间引入了无害的数据投毒，为安全性研究提供了一个现实的测试平台。通过公开发布所有模型、数据集、代码和检查点，Gaperon为探索多语言语言模型开发中的数据整理、评估、安全性和开放性之间的权衡奠定了可再现的基础。

Summary / 总结

Gaperon is a fully open suite of French-English-coding language models aimed at improving transparency and reproducibility in large-scale model training. The models, with 1.5B, 8B, and 24B parameters, are trained on 2-4 trillion tokens and released with the full training pipeline: French and English datasets filtered with a neural quality classifier, the data curation and training framework, and hundreds of intermediate checkpoints. The study finds that filtering for linguistic quality improves text fluency and coherence but yields subpar benchmark results, while late deliberate contamination (continued training on data mixes that include test sets) recovers competitive benchmark scores at a reasonable cost to generation quality. Gaperon also introduces harmless data poisoning during pretraining as a testbed for safety studies.

Gaperon 是一套开源的法英双语生成语言模型，旨在提高大规模模型训练中的透明度和可重复性。该模型包括1.5B、8B和24B参数版本，并经过了广泛的训练数据和多种数据过滤及污染阶段。研究发现，过滤以提高语言质量可以提升文本流畅性和连贯性，但会降低基准测试结果。然而，在训练后期进行有意的污染可以恢复基准测试分数，同时保持合理的生成质量。研究还指出了神经过滤可能带来的问题，并引入了一种在预训练期间进行无害数据污染的方法，以支持安全性研究。

Neural Stochastic Flows: Solver-Free Modelling and Inference for SDE Solutions

Authors: Naoki Kiyohara, Edward Johns, Yingzhen Li

Venue: NeurIPS 2025 poster

First: 2025-10-29T17:59:06+00:00 · Latest: 2025-10-29T17:59:06+00:00

Comments: NeurIPS 2025 (poster). Project page: https://nkiyohara.github.io/nsf-neurips2025/

Abstract

Stochastic differential equations (SDEs) are well suited to modelling noisy and irregularly sampled time series found in finance, physics, and machine learning. Traditional approaches require costly numerical solvers to sample between arbitrary time points. We introduce Neural Stochastic Flows (NSFs) and their latent variants, which directly learn (latent) SDE transition laws using conditional normalising flows with architectural constraints that preserve properties inherited from stochastic flows. This enables one-shot sampling between arbitrary states and yields up to two orders of magnitude speed-ups at large time gaps. Experiments on synthetic SDE simulations and on real-world tracking and video data show that NSFs maintain distributional accuracy comparable to numerical approaches while dramatically reducing computation for arbitrary time-point sampling.

中文标题/摘要

标题：神经随机流：无需求解器的SDE解建模与推断

随机微分方程（SDEs）非常适合建模金融、物理和机器学习中发现的嘈杂且不规则采样的时间序列。传统方法需要昂贵的数值求解器在任意时间点之间采样。我们引入了神经随机流（NSFs）及其潜在变体，直接使用具有保持从随机流继承的属性的架构约束条件的条件归一化流来学习（潜在的）SDE转换定律。这使得在任意状态之间的一次采样成为可能，并在大时间间隔时提供了高达两个数量级的速度提升。在合成SDE模拟和真实世界跟踪及视频数据上的实验表明，NSFs在保持与数值方法相当的分布准确性的同时，大幅减少了任意时间点采样的计算量。

Summary / 总结

The research aims to address the computational challenges of sampling from stochastic differential equations (SDEs) by introducing Neural Stochastic Flows (NSFs). NSFs directly learn the transition laws of SDEs using conditional normalizing flows with constraints that maintain the properties of stochastic flows. This approach allows for efficient one-shot sampling between arbitrary states, achieving up to two orders of magnitude speed-ups at large time gaps. Experiments demonstrate that NSFs maintain distributional accuracy similar to numerical methods while significantly reducing computational requirements for sampling at arbitrary time points.

研究旨在通过引入神经随机流（NSFs）来解决从随机微分方程（SDEs）中采样的计算挑战。NSFs直接使用条件归一化流学习SDE的转换规律，并通过结构约束保持随机流的性质。这种方法允许在任意状态下进行高效的单次采样，对于大时间间隔的采样速度提升可达两个数量级。实验表明，NSFs在保持与数值方法相似的分布准确性的同时，显著减少了任意时间点采样的计算需求。
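
The contrast between solver-based and one-shot sampling described in this entry can be illustrated with a minimal sketch. It uses an Ornstein-Uhlenbeck process, chosen here only because its transition law is known in closed form. NSFs instead learn the transition law with a conditional normalising flow, which this sketch does not implement. The function names and all parameter values are hypothetical.

```python
import math
import random
import statistics

def euler_maruyama_ou(x0, t, theta, mu, sigma, n_steps, rng):
    """Solver-based sampling: step dx = theta*(mu - x) dt + sigma dW n_steps times."""
    dt = t / n_steps
    x = x0
    for _ in range(n_steps):
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

def one_shot_ou(x0, t, theta, mu, sigma, rng):
    """One-shot sampling: draw directly from the Gaussian transition law p(x_t | x_0)."""
    mean = mu + (x0 - mu) * math.exp(-theta * t)
    var = sigma ** 2 / (2.0 * theta) * (1.0 - math.exp(-2.0 * theta * t))
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

rng = random.Random(0)
params = dict(t=1.0, theta=1.5, mu=0.0, sigma=0.5)  # illustrative values only

# 200 solver steps per sample versus a single draw per sample.
solver_samples = [euler_maruyama_ou(2.0, n_steps=200, rng=rng, **params) for _ in range(2000)]
direct_samples = [one_shot_ou(2.0, rng=rng, **params) for _ in range(2000)]

solver_mean = statistics.fmean(solver_samples)
direct_mean = statistics.fmean(direct_samples)
solver_std = statistics.pstdev(solver_samples)
direct_std = statistics.pstdev(direct_samples)
print(solver_mean, direct_mean)
print(solver_std, direct_std)
```

Both samplers target the same distribution at time t, but the solver's cost grows with the number of steps needed to cover the gap, while the direct draw costs the same for any gap. That is the regime in which the abstract reports the largest speed-ups.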

FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion

Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu

First: 2025-10-29T17:58:14+00:00 · Latest: 2025-10-29T17:58:14+00:00

Abstract

Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.

中文标题/摘要

标题：FreeArt3D：无需训练的3D可动物体生成方法利用3D扩散

3D可动物体在机器人学、AR/VR和动画等领域中至关重要。最近对这类物体建模的方法要么依赖于需要密集视角监督的优化重建管道，要么依赖于生成前馈模型，这些模型生成粗略的几何近似，往往忽略了表面纹理。相比之下，静态3D物体的开放世界生成已经取得了显著成功，尤其是在3D扩散模型（如Trellis）的出现之后。然而，将这些方法扩展到可动物体，通过训练3D扩散模型来生成可动物体，面临着重大挑战。在本文中，我们提出了FreeArt3D，这是一种无需训练的可动3D物体生成框架。FreeArt3D 不是针对有限的可动数据训练新模型，而是重新利用一个预先训练好的静态3D扩散模型（例如Trellis）作为强大的形状先验。它将Score Distillation Sampling (SDS) 扩展到3D到4D领域，将可动性视为额外的生成维度。给定不同可动状态下的少量图像，FreeArt3D 联合优化物体的几何形状、纹理和可动参数，无需特定任务的训练或访问大规模可动数据集。我们的方法生成高保真几何形状和纹理，准确预测潜在的运动结构，并在多种物体类别中表现出良好的泛化能力。尽管遵循单个实例优化范式，FreeArt3D 完成时间仅需几分钟，并且在质量和多功能性方面显著优于先前的先进方法。

Summary / 总结

FreeArt3D is a training-free framework for generating articulated 3D objects. It repurposes a pre-trained static 3D diffusion model to generate high-fidelity geometry and textures, optimizing the object's geometry, texture, and articulation parameters without requiring task-specific training. The method outperforms prior approaches in both quality and versatility, accurately predicting kinematic structures and generalizing across diverse object categories.

FreeArt3D 是一个无需训练的框架，用于生成 articulated 3D 对象。它将一个预先训练好的静态 3D 扩散模型用作形状先验，并将 Score Distillation Sampling 扩展到 3D 到 4D 领域。给定对象在不同 articulation 状态下的几幅图像，FreeArt3D 优化对象的几何形状、纹理和 articulation 参数，无需特定任务的训练。该方法生成高保真几何形状和纹理，准确预测了运动结构，并在多种对象类别中表现出良好的泛化能力，优于之前的先进方法。

3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency

Authors: Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni

First: 2025-10-21T01:03:46+00:00 · Latest: 2025-10-29T17:57:23+00:00

Abstract

AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.

中文标题/摘要

标题：3D优化以适应AI推理扩展：平衡准确度、成本和延迟

AI推理扩展通常通过1D启发式方法（固定推理轮次）或2D双变量权衡（如性能与计算能力）进行调整，这些方法未能考虑成本和延迟约束。我们提出了一种3D优化框架，可以在统一的决策空间内同时校准准确度、成本和延迟，从而实现约束感知的推理扩展。通过在三个代表性场景和九个模拟的大语言模型上进行蒙特卡洛模拟，我们评估了四种优化方法以解决3D多目标优化（MOO）问题。将推理扩展置于MOO框架中，形成了1D和2D优化无法捕捉的可行空间，从而实现环境自适应的推理扩展选择。结果表明，膝点优化实现了最佳平衡，而当优先考虑精度时，准确度最大化仍然有利。该框架为不同运营环境下的部署感知推理扩展奠定了理论基础。

Summary / 总结

The paper addresses the limitations of 1D heuristics and 2D trade-offs in AI inference scaling by proposing a 3D optimization framework that considers accuracy, cost, and latency simultaneously. Using Monte Carlo simulations, four optimization methods were evaluated across three scenarios and nine simulated large language models, showing that knee-point optimization provides the best balance among the three objectives, with accuracy-maximization being preferable when precision is prioritized. This framework offers a theoretical basis for deployment-aware inference scaling in various operational contexts.

论文通过提出一个同时考虑准确度、成本和延迟的3D优化框架，解决了1D和2D启发式方法在AI推理缩放中的局限性。使用蒙特卡洛模拟，研究评估了四种优化方法来解决3D多目标优化问题。结果表明，膝点优化在三个目标之间提供了最佳平衡，而当优先考虑精度时，准确度最大化更为有利。该框架为各种操作环境下的部署感知推理缩放提供了理论基础。
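
As a rough illustration of choosing the inference scaling k under cost and latency constraints, the sketch below filters infeasible candidates and then picks the one closest to the ideal point (best accuracy, lowest cost, lowest latency) after normalisation. This closest-to-ideal rule is a common stand-in for knee-point selection and is an assumption here, since the abstract does not define the knee-point method. The accuracy, cost, and latency curves are likewise hypothetical.

```python
import math

def knee_point_k(candidates, max_cost, max_latency):
    """Pick k from (k, accuracy, cost, latency) tuples.

    Drops candidates that violate the constraints, rescales each objective
    to [0, 1] over the feasible set, and returns the k closest to the ideal
    point (accuracy 1, cost 0, latency 0). Returns None if nothing is feasible.
    """
    feasible = [c for c in candidates if c[2] <= max_cost and c[3] <= max_latency]
    if not feasible:
        return None

    def span(idx):
        vals = [c[idx] for c in feasible]
        lo, hi = min(vals), max(vals)
        return lo, (hi - lo) or 1.0  # avoid division by zero for constant objectives

    (a_lo, a_rng), (c_lo, c_rng), (l_lo, l_rng) = span(1), span(2), span(3)

    def distance(c):
        acc = (c[1] - a_lo) / a_rng
        cost = (c[2] - c_lo) / c_rng
        lat = (c[3] - l_lo) / l_rng
        return math.sqrt((1.0 - acc) ** 2 + cost ** 2 + lat ** 2)

    return min(feasible, key=distance)[0]

# Hypothetical curves: each pass succeeds with probability p, and cost and
# latency grow linearly with the number of passes k.
p = 0.5
candidates = [(k, 1.0 - (1.0 - p) ** k, 0.02 * k, 1.5 * k) for k in range(1, 17)]
best_k = knee_point_k(candidates, max_cost=0.25, max_latency=20.0)
print(best_k)
```

Maximising accuracy alone would instead pick the largest feasible k, which is the accuracy-maximization behaviour the abstract contrasts with knee-point selection.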

SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars

Authors: Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo

First: 2025-07-02T17:49:52+00:00 · Latest: 2025-10-29T17:57:03+00:00

Comments: 27 pages, 8 figures, 5 tables. Minor update: added corrected acknowledgments and corrected a misstated hyperparameter value (noted in footnote) for reproducibility. Submitted to AAS Journals. Comments welcome

Abstract

In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy.

中文标题/摘要

标题：SpecCLIP：为恒星光谱测量对齐和翻译

近年来，大规模语言模型（LLMs）通过庞大的数据集和大规模参数化，彻底改变了自然语言理解。受此成功的启发，我们提出了SpecCLIP，这是一种基础模型框架，将LLM启发的方法扩展到恒星光谱分析。恒星光谱类似于结构化语言，编码了丰富的物理和化学信息。通过在大规模光谱数据集上训练基础模型，我们的目标是学习稳健且信息丰富的嵌入，以支持各种下游应用。作为概念验证，SpecCLIP 包括在两种光谱类型——LAMOST 低分辨率和Gaia XP——上进行预训练，然后使用适应不同仪器的光谱关联对比对齐框架CLIP。这种对齐通过最大化嵌入和输入光谱之间的互信息来补充特定于光谱的辅助解码器，从而保留光谱特定的信息并实现不同光谱类型的翻译。结果，SpecCLIP 提供了一个跨光谱框架，能够进行内在校准并在不同仪器上实现灵活应用。我们证明，这些模型在中等大小的标记数据集上进行微调可以提高恒星参数估计和化学丰度确定等任务的适应性。SpecCLIP 还提高了与外部调查数据基准的参数估计的准确性和精确度。此外，其相似性搜索和跨光谱预测能力为异常检测提供了潜在可能性。我们的结果表明，通过光谱感知解码器增强的对比训练基础模型可以推进精确恒星光谱学。

Summary / 总结

SpecCLIP is a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. It pre-trains on two spectral types, LAMOST low-resolution and Gaia XP, and then uses CLIP-style contrastive alignment to associate spectra from different instruments, with auxiliary decoders that preserve spectrum-specific information and enable translation between spectral types. Fine-tuning on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination, and parameter estimates gain accuracy and precision when benchmarked against external survey data. Its similarity search and cross-spectrum prediction capabilities also offer potential for anomaly detection.

SpecCLIP 是一种将大型语言模型方法扩展到恒星光谱分析的框架。它包括在大规模光谱数据集上进行预训练，并使用对比对齐来对齐不同仪器的光谱。关键发现表明，微调 SpecCLIP 可以提高恒星参数估计和化学丰度确定等任务的适应性，与外部调查数据相比，提高了准确性和精度。此外，其相似性搜索和跨光谱预测能力为异常检测提供了潜在机会。
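
The contrastive alignment step can be sketched with the symmetric CLIP-style loss below, written in plain Python for readability. It assumes each row of `emb_a` and `emb_b` is an embedding of the same star from two instruments. The function names, the temperature value, and the toy vectors are illustrative assumptions, not SpecCLIP's actual configuration.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def clip_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss: row i of emb_a should match row i of emb_b."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    n = len(a)
    # Cosine similarity between every pair, sharpened by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # The correct "class" for row i is column i (the matched pair).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))

# Toy embeddings: three stars, each seen by two hypothetical instruments.
spectra_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = clip_loss(spectra_a, spectra_a)
shuffled_loss = clip_loss(spectra_a, spectra_a[1:] + spectra_a[:1])
print(aligned_loss, shuffled_loss)
```

The loss is small when matched pairs are the most similar entries in both directions and large when the pairing is scrambled, which is what pulls embeddings of the same star together across instruments.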

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Authors: Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

First: 2025-10-29T17:55:43+00:00 · Latest: 2025-10-29T17:55:43+00:00

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.

中文标题/摘要

标题：大型模型时代多模态空间推理：综述与基准

人类具备通过多模态观察（如视觉和听觉）理解空间的能力。大型多模态推理模型通过学习感知和推理，扩展了这些能力，并在多种空间任务中表现出色。然而，这些模型的系统性综述和公开可用的基准仍然有限。在本综述中，我们对大型模型的多模态空间推理任务进行了全面回顾，对多模态大型语言模型（MLLMs）的最新进展进行了分类，并介绍了用于评估的开放基准。我们首先概述了通用的空间推理，重点是后训练技术、可解释性和架构。除了经典的2D任务外，我们还探讨了空间关系推理、场景和布局理解，以及三维空间中的视觉问答和视觉定位。我们还回顾了具身AI的进展，包括视觉-语言导航和动作模型。此外，我们还考虑了新兴的模态，如音频和第一人称视频，这些模态通过新传感器带来了新的空间理解。我们认为，本综述为多模态空间推理这一不断发展的领域奠定了坚实的基础，并提供了见解。有关本综述的最新信息、开放基准的代码和实现可以在 https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning 找到。

Summary / 总结

This survey aims to review and benchmark multimodal spatial reasoning tasks using large models, covering general spatial reasoning, 2D and 3D tasks, embodied AI, and emerging modalities like audio and egocentric video. The study introduces open benchmarks for evaluating these models and provides insights into their performance across various spatial tasks, highlighting the need for systematic reviews and benchmarks in this field.

本文旨在回顾和基准化使用大型模型的多模态空间推理任务，涵盖一般的空间推理、2D和3D任务、具身AI以及新兴的音频和第一人称视频等模态。研究引入了评估这些模型的公开基准，并提供了对其在各种空间任务中的性能的见解，强调了在这一领域需要系统性的回顾和基准的需求。

Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning

Authors: Ethan Harvey, Dennis Johan Loevlie, Michael C. Hughes

First: 2025-10-29T17:55:17+00:00 · Latest: 2025-10-29T17:55:17+00:00

Abstract

Multiple instance learning (MIL) is often used in medical imaging to classify high-resolution 2D images by processing patches or classify 3D volumes by processing slices. However, conventional MIL approaches treat instances separately, ignoring contextual relationships such as the appearance of nearby patches or slices that can be essential in real applications. We design a synthetic classification task where accounting for adjacent instance features is crucial for accurate prediction. We demonstrate the limitations of off-the-shelf MIL approaches by quantifying their performance compared to the optimal Bayes estimator for this task, which is available in closed-form. We empirically show that newer correlated MIL methods still struggle to generalize as well as possible when trained from scratch on tens of thousands of instances.

中文标题/摘要

标题：合成数据揭示相关多重实例学习中的泛化差距

多重实例学习（MIL）常用于医学成像，通过处理片段对高分辨率2D图像进行分类，或通过处理切片对3D体积进行分类。然而，传统的MIL方法将实例单独处理，忽略了附近片段或切片的出现等上下文关系，这些在实际应用中可能是至关重要的。我们设计了一个分类任务，在此任务中，考虑相邻实例特征对于准确预测至关重要。我们通过量化这些方法与该任务的闭式最优贝叶斯估计器相比的表现，展示了现成的MIL方法的局限性。我们实证表明，即使从头开始训练数万个实例，最新的相关MIL方法仍然难以完全泛化。

Summary / 总结

The research highlights the limitations of multiple instance learning (MIL) approaches in medical imaging by designing a synthetic classification task in which the features of adjacent instances are crucial for accurate prediction. The study measures off-the-shelf MIL methods against the optimal Bayes estimator, which is available in closed form for this task, and finds that even newer correlated MIL methods fall short of the best achievable generalization when trained from scratch on tens of thousands of instances.

研究旨在通过设计一个合成任务来突出传统多实例学习（MIL）方法在医学成像中的局限性，该任务中实例之间的上下文至关重要。研究将现成的MIL方法与最优贝叶斯估计器进行了比较，并发现即使是最新的相关MIL方法，在大规模数据集上从头训练时也无法很好地泛化。
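
A toy example shows why treating instances separately can fail when neighbouring instances matter. The labelling rule below (a bag is positive only if two adjacent instances are both active) is a hypothetical stand-in, not the synthetic task defined in the paper, and the function names are illustrative.

```python
def label_adjacent(bag):
    """Hypothetical rule: positive iff two neighbouring instances are both active."""
    return int(any(a == 1 and b == 1 for a, b in zip(bag, bag[1:])))

def mean_pool_score(bag):
    """Instance-independent pooling: the order of instances is ignored."""
    return sum(bag) / len(bag)

bag_pos = [1, 1, 0, 0]  # two active instances side by side
bag_neg = [1, 0, 1, 0]  # same instances, but not adjacent

print(label_adjacent(bag_pos), label_adjacent(bag_neg))
print(mean_pool_score(bag_pos), mean_pool_score(bag_neg))
```

The two bags carry different labels but receive identical pooled scores, so any classifier built on that pooled value cannot separate them. Capturing the difference requires modelling the relationship between neighbouring instances.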

TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling

Authors: He Hu, Yucheng Zhou, Chiyuan Ma, Qianning Wang, Zheng Zhang, Fei Ma, Laizhong Cui, Qi Tian

First: 2025-10-29T17:54:20+00:00 · Latest: 2025-10-29T17:54:20+00:00

Abstract

Large language models (LLMs) in psychological counseling have attracted increasing attention. However, existing approaches often lack emotional understanding, adaptive strategies, and the use of therapeutic methods across multiple sessions with long-term memory, leaving them far from real clinical practice. To address these critical gaps, we introduce TheraMind, a strategic and adaptive agent for longitudinal psychological counseling. The cornerstone of TheraMind is a novel dual-loop architecture that decouples the complex counseling process into an Intra-Session Loop for tactical dialogue management and a Cross-Session Loop for strategic therapeutic planning. The Intra-Session Loop perceives the patient's emotional state to dynamically select response strategies while leveraging cross-session memory to ensure continuity. Crucially, the Cross-Session Loop empowers the agent with long-term adaptability by evaluating the efficacy of the applied therapy after each session and adjusting the method for subsequent interactions. We validate our approach in a high-fidelity simulation environment grounded in real clinical cases. Extensive evaluations show that TheraMind outperforms other methods, especially on multi-session metrics like Coherence, Flexibility, and Therapeutic Attunement, validating the effectiveness of its dual-loop design in emulating strategic, adaptive, and longitudinal therapeutic behavior. The code is publicly available at https://0mwwm0.github.io/TheraMind/.

中文标题/摘要

标题：TheraMind：纵向心理辅导的战略性和适应性代理

在心理辅导中，大型语言模型（LLMs）已引起越来越多的关注。然而，现有方法往往缺乏情感理解、适应策略以及在多个会话中使用治疗方法并具有长期记忆的能力，使其远未达到临床实践的标准。为解决这些关键差距，我们介绍了TheraMind，这是一种用于纵向心理辅导的战略性和适应性代理。TheraMind的核心是一个新颖的双环架构，将复杂的咨询过程分解为会话内环，用于战术对话管理，以及跨会话环，用于战略治疗规划。会话内环感知患者的情感状态，动态选择响应策略，同时利用跨会话记忆确保连续性。关键的是，跨会话环通过评估每次会话中应用疗法的有效性，并调整后续互动的方法，赋予代理长期适应性。我们在基于真实临床案例的高保真模拟环境中验证了我们的方法。广泛的评估表明，TheraMind在连贯性、灵活性和治疗调适等多会话指标上优于其他方法，验证了其双环设计在模拟战略性、适应性和纵向治疗行为方面的有效性。代码可在https://0mwwm0.github.io/TheraMind/公开获取。

Summary / 总结

TheraMind is a strategic and adaptive agent for longitudinal psychological counseling, addressing the limitations of existing approaches by incorporating a dual-loop architecture. The Intra-Session Loop manages tactical dialogue, while the Cross-Session Loop ensures long-term adaptability through strategic planning and evaluation. TheraMind outperforms other methods in metrics such as Coherence, Flexibility, and Therapeutic Attunement, validating its effectiveness in emulating clinical therapeutic behavior over multiple sessions.

TheraMind 是一个针对长期心理辅导的战略性和自适应代理，通过引入双环架构来解决现有方法的局限性。会话内循环根据患者的情绪状态管理对话策略，而跨会话循环则通过在每次会话后评估疗法效果并调整方法来确保长期适应性。TheraMind 在连贯性、灵活性和治疗共鸣等指标上优于其他方法，验证了其双环设计在模拟战略性、自适应和长期治疗行为方面的有效性。

Curiosity-driven RL for symbolic equation solving

Authors: Kevin P. O'Keeffe

Venue: NeurIPS 2025 MATH-AI Workshop

First: 2025-10-19T22:04:57+00:00 · Latest: 2025-10-29T17:52:01+00:00

Comments: Accepted at the NeurIPS 2025 MATH-AI Workshop

Abstract

We explore if RL can be useful for symbolic mathematics. Previous work showed contrastive learning can solve linear equations in one variable. We show model-free PPO (Schulman et al., 2017) augmented with curiosity-based exploration and graph-based actions can solve nonlinear equations such as those involving radicals, exponentials, and trig functions. Our work suggests curiosity-based exploration may be useful for general symbolic reasoning tasks.

中文标题/摘要

标题：自主探索的强化学习在符号方程求解中的应用

我们探索了RL是否可以用于符号数学。先前的工作表明，对比学习可以解决一元线性方程。我们展示了无模型的PPO [1] 结合基于好奇心的探索和基于图的动作可以解决非线性方程，如涉及根号、指数和三角函数的方程。我们的工作表明，基于好奇心的探索可能对一般的符号推理任务有用。

Summary / 总结

This study investigates the application of reinforcement learning (RL) in symbolic mathematics, specifically focusing on solving nonlinear equations. The research employs model-free PPO with curiosity-based exploration and graph-based actions, demonstrating its capability to solve equations involving radicals, exponentials, and trigonometric functions. The key finding is that curiosity-driven exploration can be beneficial for general symbolic reasoning tasks.

该研究探讨了强化学习（RL）在符号数学中的应用，使用了无模型的PPO结合好奇心驱动的探索和基于图的动作来解决非线性方程。主要发现是，这种方法能够有效处理涉及根号、指数和三角函数的方程，表明好奇心驱动的方法在一般符号推理任务中的潜力。
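
To give a feel for curiosity-based exploration, the sketch below implements a count-based novelty bonus, one simple form of intrinsic reward. The abstract does not say which curiosity signal the paper uses, so this choice, the `CountBonus` class, the `beta` scale, and the string-encoded equation states are all assumptions made for illustration.

```python
import math
from collections import Counter

class CountBonus:
    """Intrinsic reward that shrinks as a state is revisited: beta / sqrt(visits)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.visits = Counter()

    def __call__(self, state):
        self.visits[state] += 1
        return self.beta / math.sqrt(self.visits[state])

bonus = CountBonus(beta=0.1)
# Hypothetical trajectory of equation states; the first state is visited twice.
trajectory = ["x+1=3", "x=3-1", "x+1=3", "x=2"]
rewards = [bonus(state) for state in trajectory]
print(rewards)
```

In a PPO training loop, a bonus like this would be added to the environment reward at each step, nudging the policy toward equation states it has rarely seen rather than cycling through familiar rewrites.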