arXiv 论文速递

GTAlign: Game-Theoretic Alignment of LLM Assistants for Social Welfare

Authors: Siqi Zhu, David Zhang, Pedro Cisneros-Velarde, Jiaxuan You

First: 2025-10-10T00:05:14+00:00 · Latest: 2025-11-03T18:54:17+00:00

Comments: 31 pages, 6 figures

Abstract

Large Language Models (LLMs) have achieved remarkable progress in reasoning, yet sometimes produce responses that are suboptimal for users in tasks such as writing, information seeking, or providing practical guidance. Conventional alignment practices typically assume that maximizing model reward also maximizes user welfare, but this assumption frequently fails in practice: models may over-clarify or generate overly verbose reasoning when users prefer concise answers. Such behaviors resemble the prisoner's dilemma, where individually rational choices lead to socially suboptimal outcomes. The fundamental challenge is the lack of a principled decision making mechanism that mutually benefits both the LLM and the user. We propose Game-Theoretic Alignment (GTAlign), an alignment framework that integrates game-theoretic decision making into both reasoning and training. During reasoning, the model explicitly treats user-LLM interaction as a strategic game: it constructs payoff matrices within its reasoning chain to estimate welfare for both itself and the user, and then selects actions that are mutually beneficial. During training, we introduce a social welfare reward that reinforces cooperative responses, aligning model behavior with socially efficient outcomes. In addition, we introduce an inference technique that leverages game-theoretic reasoning to dynamically adapt LLM's response when pricing policies of LLM service change. Extensive experiments demonstrate that GTAlign substantially improves reasoning efficiency, answer quality, and social welfare compared to baselines across diverse tasks. The code is available at https://github.com/ulab-uiuc/GTAlign .

中文标题/摘要

标题：GTAlign：博弈论对齐大型语言模型助手以促进社会福利

大型语言模型（LLMs）在推理方面取得了显著进展，但在诸如写作、信息检索或提供实用指导等任务中，有时会产生对用户不理想的回应。传统的对齐实践通常假设最大化模型奖励也最大化了用户福利，但在实践中这种假设经常失败：模型可能会过度解释或生成过于冗长的推理，而用户可能更希望简洁的答案。这种行为类似于囚徒困境，个体理性选择导致了社会上不理想的结局。根本挑战在于缺乏一种既能使LLM和用户都受益的有原则的决策机制。我们提出了博弈论对齐（GTAlign），这是一种将博弈论决策机制整合到推理和训练中的对齐框架。在推理过程中，模型明确地将用户-LLM交互视为一种战略博弈：它在其推理链中构建收益矩阵来估算自身和用户的福利，然后选择互惠互利的动作。在训练过程中，我们引入了一种社会福利奖励，以强化合作回应，使模型行为与社会有效结果相一致。此外，我们引入了一种推理技术，利用博弈论推理动态适应LLM响应，当LLM服务的价格政策发生变化时。广泛的实验表明，与基线相比，GTAlign在各种任务中显著提高了推理效率、答案质量和社会福利。代码可在https://github.com/ulab-uiuc/GTAlign 获取。

Summary / 总结

GTAlign is a game-theoretic alignment framework for LLMs that addresses the issue of suboptimal responses by integrating game-theoretic decision-making into both reasoning and training. During reasoning, the model considers user-LLM interaction as a strategic game, estimating welfare for both the model and the user, and selecting mutually beneficial actions. During training, a social welfare reward is introduced to reinforce cooperative responses. Experiments show that GTAlign enhances reasoning efficiency, answer quality, and social welfare compared to baseline methods across various tasks.

GTAlign 是一种博弈论对齐框架，通过将博弈论决策机制整合到推理和训练中来解决 LLM 的次优响应问题。在推理过程中，模型将用户-LLM 交互视为策略博弈，估算模型和用户双方的福利，并选择对双方都有利的行动。在训练中，引入了社会福利奖励来强化合作响应。实验表明，与基线方法相比，GTAlign 在各种任务中提高了推理效率、答案质量和社会福利。
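
The payoff-matrix step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the action names, the payoff values, and the choice of a plain sum as the social welfare function are all assumptions made for illustration.

```python
# Hypothetical payoffs per candidate LLM action: (user_welfare, llm_welfare).
# The numbers are illustrative only and do not come from the paper.
payoffs = {
    "concise_answer": (0.9, 0.7),
    "verbose_reasoning": (0.5, 0.8),
    "ask_clarification": (0.4, 0.6),
}


def social_welfare(user_welfare, llm_welfare):
    # Assumption: welfare is aggregated by a simple sum.
    return user_welfare + llm_welfare


def pick_cooperative(table):
    """Choose the action with the highest joint (user + LLM) welfare."""
    return max(table, key=lambda action: social_welfare(*table[action]))


def pick_selfish(table):
    """Choose the action that maximizes the LLM's own payoff only."""
    return max(table, key=lambda action: table[action][1])


cooperative_action = pick_cooperative(payoffs)
selfish_action = pick_selfish(payoffs)
```

With these made-up numbers the two rules disagree: maximizing the model's own payoff favors the verbose response, while maximizing joint welfare favors the concise one, which mirrors the mismatch between model reward and user welfare that the abstract describes.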

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Authors: Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

First: 2025-08-04T05:51:55+00:00 · Latest: 2025-11-03T18:47:32+00:00

Abstract

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents' interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.

中文标题/摘要

标题：SE-Agent: 多步推理中基于LLM代理的自我进化轨迹优化

基于大型语言模型（LLM）的代理最近在通过多步与环境交互进行复杂推理和工具使用方面展示了令人印象深刻的性能。尽管这些代理有能力解决复杂任务，但它们的问题解决过程，即代理完成任务的交互轨迹，仍然被忽视。这些轨迹包含丰富的反馈，可以引导代理朝着正确解决问题的方向前进。尽管现有的方法，如蒙特卡洛树搜索（MCTS），能够有效地平衡探索和利用，但它们忽略了各种轨迹之间的相互依赖性，缺乏搜索空间的多样性，导致重复推理和次优结果。为了解决这些挑战，我们提出了SE-Agent，这是一种自我进化框架，使代理能够迭代优化其推理过程。我们的方法通过三种关键操作——修订、重组和细化——重新审视并增强了先前的试点轨迹。这种进化机制提供了两个关键优势：（1）通过智能探索由先前轨迹引导的多样化解决方案路径，超越局部最优，扩展搜索空间；（2）利用跨轨迹的启发式，高效提升性能，同时减轻次优推理路径的影响。通过这些机制，SE-Agent实现了逐步自我进化，逐步提高推理质量。我们在SWE-bench上验证了SE-Agent，能够解决真实的GitHub问题。在五个强大的LLM上的实验结果表明，集成SE-Agent可实现高达55%的相对改进，在SWE-bench上验证的所有开源代理中达到最先进的性能。我们的代码和演示材料可在https://github.com/JARVIS-Xs/SE-Agent上公开获取。

Summary / 总结

The paper proposes SE-Agent, a self-evolution framework that optimizes multi-step reasoning in LLM-based agents. It revisits and enhances earlier trajectories through revision, recombination, and refinement, which expands the search space and draws on cross-trajectory inspiration. On SWE-bench Verified, a benchmark of real-world GitHub issues, integrating SE-Agent yields up to 55% relative improvement across five strong LLMs and achieves state-of-the-art performance among open-source agents.

论文提出了SE-Agent，一种自我进化框架，旨在优化基于LLM的代理的多步推理过程。它通过修订、重组和细化重新审视先前的轨迹，扩展搜索空间并利用跨轨迹的启发。实验表明，SE-Agent在SWE-bench上提高了推理质量，与现有代理相比，在解决真实世界GitHub问题方面实现了高达55%的相对改进。

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Authors: Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

Venue: NeurIPS 2025 spotlight

First: 2025-06-20T07:14:48+00:00 · Latest: 2025-11-03T18:47:03+00:00

Comments: Accepted (spotlight) at NeurIPS 2025 Datasets and Benchmarks Track. v4: fixed links in comments. v3: NeurIPS camera-ready version. v2: fixed author list. 51 pages. Code available at https://tabarena.ai/code and examples at https://tabarena.ai/code-examples and dataset curation at https://tabarena.ai/data-tabular-ml-iid-study and https://tabarena.ai/dataset-curation

Abstract

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

中文标题/摘要

标题：TabArena：面向表格数据机器学习的持续更新基准

随着深度学习和基础模型在表格数据中的流行，标准化和可靠的基准测试需求比以往任何时候都高。然而，当前的基准测试是静态的，即使发现了缺陷、更新了模型版本或发布了新模型，其设计也不会更新。为了解决这一问题，我们引入了TabArena，这是第一个持续维护的动态表格基准测试系统。为了启动TabArena，我们手动整理了一个具有代表性的数据集和实现良好的模型集合，进行了大规模的基准测试研究以初始化公共排行榜，并组建了一支经验丰富的维护团队。我们的结果强调了验证方法和超参数配置集成对基准模型性能的影响。虽然梯度提升树在实际表格数据集上仍然表现出色，但我们观察到，在更大的时间预算下，集成的深度学习方法已经迎头赶上。同时，基础模型在较小的数据集上表现出色。最后，我们展示了模型集成推动了表格机器学习的最新进展。我们观察到，由于验证集过拟合，一些深度学习模型在跨模型集成中过度代表，我们鼓励模型开发者解决这一问题。我们以公共排行榜、可复现的代码和维护协议启动了TabArena，并可在https://tabarena.ai/访问。

Summary / 总结

TabArena is a continuously maintained benchmark for machine learning on tabular data that addresses the limitations of static benchmarks. Its launch comprises a manually curated collection of datasets and models, a large-scale benchmarking study that initializes a public leaderboard, and a team of experienced maintainers. Key findings highlight the influence of the validation method and of ensembling hyperparameter configurations: gradient-boosted trees remain strong contenders, deep learning methods catch up under larger time budgets with ensembling, and foundation models excel on smaller datasets. Ensembles across models advance the state of the art, although some deep learning models are overrepresented in them due to validation set overfitting. TabArena provides a public leaderboard, reproducible code, and maintenance protocols at https://tabarena.ai.

TabArena 是一个持续维护的表格数据基准系统，旨在解决静态基准的局限性。它包含了一个精心挑选的数据集和模型集合，并通过大规模基准测试研究初始化了一个公开的排行榜。研究结果强调了验证方法和超参数集成的重要性，发现深度学习方法在更大的时间预算下与梯度提升树方法竞争，并且在较小的数据集上，基础模型表现出色。模型集成显著提升了表格机器学习的最新成果，但一些深度学习模型由于验证集过拟合而过度代表。该系统通过一个公开的排行榜和维护协议在 https://tabarena.ai 上发布。

RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs

Authors: Joe Meyer, Divyansha Lachi, Mahmoud Mohammadi, Roshan Reddy Upendra, Eva L. Dyer, Mark Li, Tom Palczewski

First: 2025-10-22T18:27:49+00:00 · Latest: 2025-11-03T18:42:57+00:00

Comments: 6 pages

Abstract

Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.

中文标题/摘要

标题：RELATE：面向多模态关系图的模式无关 Perceiver 编码器

电子商务、医疗保健和科学研究等领域中的关系多表数据可以自然地表示为具有多模态节点属性的异构时序图。现有的图神经网络（GNN）依赖于特定模式的特征编码器，需要为每种节点类型和特征列单独设置模块，这妨碍了可扩展性和参数共享。我们引入了RELATE（关系编码器，用于类型化实体的潜在聚合），这是一种无模式的、即插即用的特征编码器，可以与任何通用图神经网络结合使用。RELATE 使用共享的模态特定编码器来处理分类、数值、文本和时间属性，然后通过一种类似于感知器的交叉注意力模块将特征聚合为固定大小的、置换不变的节点表示。我们在 RelBench 基准上评估了 RELATE，它在 ReLGNN 和 HGT 上的表现与特定模式的编码器相差不超过 3%，同时参数数量减少了最多 5 倍。这种设计支持变化的模式，并使通用图神经网络能够进行多数据集预训练，为关系图数据奠定了基础模型的道路。

Summary / 总结

The paper introduces RELATE, a schema-agnostic feature encoder for GNNs that uses shared modality-specific encoders and a Perceiver-style cross-attention module to aggregate features into a fixed-size, permutation-invariant node representation. Evaluated with ReLGNN and HGT on the RelBench benchmark, RELATE performs within 3% of schema-specific encoders while reducing parameter counts by up to 5x. The design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs.

论文提出了RELATE，一种用于GNN的模式无关特征编码器，使用共享的模态特定编码器和Perceiver风格的交叉注意力模块将特征聚合为固定大小的、置换不变的节点表示。在RelBench基准测试中，RELATE的性能与模式特定编码器相差不超过3%，同时将参数数量减少至多5倍，支持不同的模式并使通用GNN能够进行多数据集预训练，为关系图数据的基础模型铺平了道路。
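
The aggregation step can be sketched as follows: a fixed set of latent queries attends over however many feature tokens a node has, so the output size depends only on the number of latents and does not change when the tokens are reordered. This is a simplified illustration and not the RELATE code; it omits learned query/key/value projections, and the function name and all numbers are hypothetical.

```python
import math


def cross_attention_pool(latents, tokens):
    """Pool a variable-length list of feature tokens into len(latents) vectors.

    Simplification: keys and values are the raw tokens (no learned projections).
    """
    dim = len(tokens[0])
    pooled = []
    for query in latents:
        # Scaled dot-product score between this latent query and every token.
        scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
                  for tok in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        # Softmax-weighted average of the tokens.
        pooled.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) / norm
                       for d in range(dim)])
    return pooled


# Two latent queries; one node with 3 feature columns and one with 5 (made-up values).
latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
node_a = [[0.2, 0.1, 0.5], [0.9, 0.3, 0.0], [0.4, 0.8, 0.7]]
node_b = [[0.6, 0.6, 0.1], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5],
          [0.0, 1.0, 0.2], [0.7, 0.1, 0.9]]

out_a = cross_attention_pool(latents, node_a)
out_a_shuffled = cross_attention_pool(latents, node_a[::-1])
out_b = cross_attention_pool(latents, node_b)
```

Both nodes map to a 2 x 3 representation regardless of how many columns they have, and reversing the column order of `node_a` leaves the result unchanged up to floating-point rounding, which is the property that lets one encoder serve tables with different schemas.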

Automotive Crash Dynamics Modeling Accelerated with Machine Learning

Authors: Mohammad Amin Nabian, Sudeep Chavare, Deepak Akhare, Rishikesh Ranade, Ram Cherukuri, Srinivas Tadepalli

First: 2025-10-17T00:03:33+00:00 · Latest: 2025-11-03T18:19:07+00:00

Abstract

Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.

中文标题/摘要

标题：利用机器学习加速汽车碰撞动力学建模

碰撞耐撞性评估是汽车设计中的关键方面，传统上依赖于高保真有限元（FE）模拟，这些模拟计算成本高且耗时。本研究介绍了使用NVIDIA PhysicsNeMo框架开发基于机器学习的代理模型，以高效预测碰撞场景中的结构变形的探索性比较研究。鉴于将机器学习应用于结构碰撞动力学的先前工作有限，本研究的主要贡献在于展示了本研究中探索的各种建模方法的可行性和工程实用性。我们研究了两种最先进的神经网络架构来建模碰撞动力学：MeshGraphNet和Transolver。此外，我们还研究了三种用于建模瞬态动力学的方法：时间条件化、标准自回归方法以及结合基于回放训练的稳定性增强自回归方案。这些模型在包含150个详细FE模拟的全面车身白车身（BIW）碰撞数据集上进行了评估，该数据集代表了一个结构丰富的车辆装配，包括超过200个组件，其中38个关键组件具有变化的厚度分布，以捕捉现实的制造变异性。每个模型都利用未变形的网格几何形状和组件特性作为输入，以预测碰撞序列中变形网格的空间-时间演变。评估结果显示，这些模型在总体变形趋势的捕捉上具有合理的精度，展示了将机器学习应用于结构碰撞动力学的可行性。尽管尚未达到FE的完全精度，但这些模型实现了计算成本的数量级减少，从而能够快速进行设计探索和碰撞耐撞性评估的早期优化。

Summary / 总结

This exploratory study aims to accelerate crashworthiness assessment in automotive design by developing machine learning-based surrogate models with the NVIDIA PhysicsNeMo framework. It investigates two neural network architectures, MeshGraphNet and Transolver, together with three strategies for modeling transient dynamics: time-conditional, standard autoregressive, and a stability-enhanced autoregressive scheme with rollout-based training. The models are evaluated on a Body-in-White crash dataset of 150 LS-DYNA FE simulations. They capture overall deformation trends with reasonable fidelity and, although they do not yet match full FE accuracy, reduce computational cost by orders of magnitude, which enables rapid design exploration and early-stage optimization.

本研究旨在通过使用NVIDIA PhysicsNeMo框架开发机器学习代理模型，加速汽车设计中的碰撞安全性评估。研究调查了两种神经网络架构（MeshGraphNet和Transolver）和三种瞬态动力学建模策略。模型在包含150个详细FE模拟的全面Body-in-White碰撞数据集上进行了评估。结果表明，这些模型能够以合理的精度捕捉整体变形趋势，实现显著的计算成本降低，从而在碰撞安全性评估的早期阶段实现快速设计探索和优化。

CosmoBench: A Multiscale, Multiview, Multitask Cosmology Benchmark for Geometric Deep Learning

Authors: Ningyuan Huang, Richard Stiskalek, Jun-Young Lee, Adrian E. Bayer, Charles C. Margossian, Christian Kragh Jespersen, Lucia A. Perez, Lawrence K. Saul, Francisco Villaescusa-Navarro

Venue: NeurIPS 2025

First: 2025-07-04T16:46:25+00:00 · Latest: 2025-11-03T18:09:02+00:00

Comments: Accepted at NeurIPS 2025

Abstract

Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks -- to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches -- from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training time. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at https://cosmobench.streamlit.app

中文标题/摘要

标题：CosmoBench：一种多尺度、多视图、多任务的宇宙学基准测试，用于几何深度学习

宇宙学模拟提供了大量的数据，以点云和定向树的形式存在。一个关键目标是从这些数据中提取见解，以揭示宇宙的性质和组成。本文介绍了CosmoBench，这是一个基准数据集，从最先进的宇宙学模拟中精心挑选而来，这些模拟运行需要超过4100万核心小时，并生成了超过两拍字节的数据。CosmoBench是此类数据集中最大的一个：它包含来自三个不同长度尺度的暗物质晕和星系模拟的34000个点云，以及在两个不同时间尺度上记录晕形成历史的25000个定向树。CosmoBench中的数据可用于多种任务——从点云和合并树预测宇宙参数，从集体位置预测单个晕和星系的速度，以及从粗时间尺度的合并树重建更细时间尺度的合并树。我们提供了这些任务的几种基线，一些基于宇宙学建模中的成熟方法，另一些则基于机器学习。对于后者，我们研究了不同的方法——从简单线性模型到深度学习中的更大且计算需求更高的模型，如图神经网络。我们发现，使用少量不变特征的最小二乘拟合有时会优于具有更多参数和更长训练时间的深度架构。然而，通过结合机器学习和宇宙学，仍然有很大的潜力来改进这些基线，充分利用数据。CosmoBench为大规模将宇宙学与几何深度学习相结合奠定了基础。我们邀请社区通过与该数据集互动来推动科学发现，数据集可在https://cosmobench.streamlit.app 获取。

Summary / 总结

CosmoBench is a benchmark dataset curated from state-of-the-art cosmological simulations. It contains 34,000 point clouds of dark matter halos and galaxies at three length scales and 25,000 directed trees that record halo formation history on two time scales. The dataset supports predicting cosmological parameters from point clouds and merger trees, predicting halo and galaxy velocities from their collective positions, and reconstructing merger trees on finer time scales from coarser ones. Baselines range from simple linear models to graph neural networks, and least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters. The work sets the stage for bridging cosmology and geometric deep learning at scale and invites the community to improve on these baselines.

CosmoBench 是一个源自宇宙学模拟的数据集，包含 34,000 个点云和 25,000 个定向树，旨在支持预测宇宙参数、星系速度和重建合并树等任务。研究提供了使用宇宙学建模和机器学习方法的多种基线，包括图神经网络。研究发现，具有不变特征的简单模型有时可以超越更复杂的深度学习模型。该数据集旨在将宇宙学与几何深度学习相结合，邀请社区进一步探索和改进这些方法。

RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features

Authors: Forouzan Fallah, Wenwen Li, Chia-Yu Hsu, Hyunho Lee, Yezhou Yang

First: 2025-10-27T19:56:43+00:00 · Latest: 2025-11-03T17:58:03+00:00

Abstract

Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow's core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model's outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.

中文标题/摘要

标题：RareFlow：物理感知的流匹配跨传感器稀有稀土特征超分辨率

遥感图像的超分辨率（SR）在异常分布（OOD）条件下往往表现不佳，例如由多种传感器捕获的稀有地貌特征，会产生视觉上合理但物理上不准确的结果。我们提出了RareFlow，这是一种为OOD鲁棒性设计的物理感知SR框架。RareFlow的核心是一个双条件架构。Gated ControlNet 从低分辨率输入中保留精细的几何保真度，而文本提示则为合成复杂特征提供语义指导。为了确保输出的物理合理性，我们引入了一种多方面的损失函数，以确保光谱和辐射度与传感器特性的一致性。此外，该框架通过使用随机前向传递方法量化自身的预测不确定性；结果输出的方差直接识别出不熟悉的输入，从而减轻特征幻觉。我们在一个新编纂的多传感器卫星图像基准上验证了RareFlow。在盲测中，地球物理专家评价我们的模型输出接近真实图像的保真度，显著优于最先进的基线。这种定性的优越性得到了感知度量的定量增益的支持，包括FID几乎降低了40%。RareFlow为数据稀缺的科学领域提供了高保真合成的稳健框架，并为在严重领域转移下可控生成提供了一个新范式。
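
The uncertainty estimate mentioned in the abstract (repeat a stochastic forward pass and inspect the output variance) can be sketched with a toy model. The abstract does not say how the stochasticity is introduced, so the dropout-like weight mask below, the weights, and the function names are assumptions for illustration only.

```python
import random
import statistics


def stochastic_forward(x, rng, drop_prob=0.2):
    """One noisy pass of a toy linear model with a dropout-like random mask."""
    weights = [0.5, -0.3, 0.8]  # hypothetical fixed weights
    kept = [w if rng.random() >= drop_prob else 0.0 for w in weights]
    # Rescale so the expected output matches the mask-free model.
    return sum(w * xi for w, xi in zip(kept, x)) / (1.0 - drop_prob)


def predictive_uncertainty(x, n_passes=200, seed=0):
    """Return (mean, variance) of the output over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, rng) for _ in range(n_passes)]
    return statistics.mean(samples), statistics.pvariance(samples)


# In this toy, an input far from the nominal scale yields a much larger variance.
_, var_near = predictive_uncertainty([0.1, 0.1, 0.1])
_, var_far = predictive_uncertainty([10.0, 10.0, 10.0])
```

A downstream check could flag any input whose variance exceeds a chosen threshold as unfamiliar; picking that threshold is application-specific and is not addressed here.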

Non-Contact Health Monitoring During Daily Personal Care Routines

Authors: Xulin Ma, Jiankai Tang, Zhang Jiang, Songqin Cheng, Yuanchun Shi, Dong LI, Xin Liu, Daniel McDuff, Xiaojing Liu, Yuntao Wang

First: 2025-06-11T13:29:21+00:00 · Latest: 2025-11-03T17:30:56+00:00

Comments: IEEE BSN 2025

Abstract

Remote photoplethysmography (rPPG) enables non-contact, continuous monitoring of physiological signals and offers a practical alternative to traditional health sensing methods. Although rPPG is promising for daily health monitoring, its application in long-term personal care scenarios, such as mirror-facing routines in high-altitude environments, remains challenging due to ambient lighting variations, frequent occlusions from hand movements, and dynamic facial postures. To address these challenges, we present LADH (Long-term Altitude Daily Health), the first long-term rPPG dataset containing 240 synchronized RGB and infrared (IR) facial videos from 21 participants across five common personal care scenarios, along with ground-truth PPG, respiration, and blood oxygen signals. Our experiments demonstrate that combining RGB and IR video inputs improves the accuracy and robustness of non-contact physiological monitoring, achieving a mean absolute error (MAE) of 4.99 BPM in heart rate estimation. Furthermore, we find that multi-task learning enhances performance across multiple physiological indicators simultaneously. Dataset and code are open at https://github.com/McJackTang/FusionVitals.

中文标题/摘要

标题：日常个人护理过程中非接触式健康监测

远程光体积描记图(rPPG)使非接触式、连续监测生理信号成为可能，并提供了传统健康传感方法的实用替代方案。尽管rPPG在日常健康监测方面具有潜力，但在高海拔等长期个人护理场景中，如面对镜子的日常护理，由于环境光照变化、手部动作频繁遮挡以及面部姿态动态变化，其应用仍面临挑战。为应对这些挑战，我们提出了LADH（长期高海拔日常健康）数据集，这是首个包含21名参与者在五种常见个人护理场景中同步的240段RGB和红外(IR)面部视频的数据集，附带真实光体积描记图(PPG)、呼吸和血氧信号。我们的实验表明，结合RGB和IR视频输入可以提高非接触式生理监测的准确性和鲁棒性，在心率估计中平均绝对误差(MAE)为4.99次/分钟。此外，我们发现多任务学习可以同时提高多种生理指标的性能。数据集和代码可在https://github.com/McJackTang/FusionVitals上获取。

Summary / 总结

The research aims to improve non-contact health monitoring during daily personal care routines, particularly in challenging environments like high altitudes. The study introduces LADH, a long-term rPPG dataset with synchronized RGB and IR videos from 21 participants in five common scenarios. Combining RGB and IR inputs enhances accuracy, achieving a mean absolute error of 4.99 BPM in heart rate estimation, and multi-task learning improves performance across multiple physiological indicators.

研究旨在通过解决环境光照和遮挡等问题，改善日常个人护理过程中的非接触健康监测。研究引入了LADH数据集，包含21名参与者在五个常见个人护理场景下的同步RGB和IR视频。结合RGB和IR输入并使用多任务学习提高了生理信号的准确性，心率估计的平均绝对误差为4.99 BPM。

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

Venue: ICCV 2025

First: 2025-04-01T07:47:55+00:00 · Latest: 2025-11-03T17:23:02+00:00

Comments: Published as a conference paper at ICCV 2025. Project page: https://github.com/icip-cas/ShortV

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.

中文标题/摘要

标题：ShortV：通过冻结无效层中的视觉标记提高多模态大型语言模型的效率

多模态大型语言模型（MLLMs）由于其庞大的规模和大量的视觉标记而面临高昂的计算成本。本文通过引入一个新的度量标准——层贡献（LC），研究了MLLMs中的层间冗余性，该度量标准量化了层的变换对视觉和文本标记的影响。LC的计算涉及测量移除层对指定标记的变换后模型输出的差异。我们的初步实验表明，在处理视觉标记时，MLLMs中的许多层几乎没有贡献。受此观察的启发，我们提出了一种无需训练的方法——ShortV，利用LC来识别无效层，并在这些层中冻结视觉标记的更新。实验表明，ShortV可以在大约60%的MLLM层中冻结视觉标记，从而大幅降低与更新视觉标记相关的计算成本。例如，它在LLaVA-NeXT-13B上实现了50%的FLOPs减少，同时保持了优越的性能。代码将在https://github.com/icip-cas/ShortV公开。

Summary / 总结

This paper addresses the high computational costs of Multimodal Large Language Models (MLLMs) by introducing a novel metric, Layer Contribution (LC), which identifies layers whose transformations have minimal impact on visual tokens. The proposed training-free ShortV method freezes visual token updates in these layers, which cover approximately 60% of the MLLM layers. For instance, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining performance. According to the abstract, the code will be released at https://github.com/icip-cas/ShortV.

研究旨在通过识别无效层并在其中冻结视觉标记的更新，来降低多模态大型语言模型（MLLMs）的计算成本。方法ShortV使用了一种新的指标Layer Contribution (LC)，量化每一层对视觉和文本标记的影响。实验表明，ShortV可以在大约60%的MLLM层中冻结视觉标记的更新，从而在LLaVA-NeXT-13B上实现50%的FLOPs减少，同时保持性能。摘要称代码将在https://github.com/icip-cas/ShortV公开。
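
The idea behind Layer Contribution can be sketched on a toy model: run the model normally, run it again with one layer's update skipped for the visual tokens, and measure how far the output distribution moves. This is not the ShortV implementation. The mean-pooling "layer" stands in for attention, KL divergence is an assumed choice of divergence, and every name and number is hypothetical.

```python
import math


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def run_model(tokens, layer_scales, skip_layer=None, skip_tokens=()):
    """Toy residual model: each layer adds scale * (mean over tokens) to every token.

    Tokens listed in skip_tokens keep their previous state at skip_layer (frozen).
    """
    hidden = [list(tok) for tok in tokens]
    for idx, scale in enumerate(layer_scales):
        context = [sum(col) / len(hidden) for col in zip(*hidden)]
        for t, state in enumerate(hidden):
            if idx == skip_layer and t in skip_tokens:
                continue  # frozen: no update for this token in this layer
            hidden[t] = [h + scale * c for h, c in zip(state, context)]
    return softmax(hidden[-1])  # output distribution read from the last (text) token


def layer_contribution(tokens, layer_scales, layer, token_ids):
    full = run_model(tokens, layer_scales)
    ablated = run_model(tokens, layer_scales, skip_layer=layer, skip_tokens=token_ids)
    return kl_divergence(full, ablated)


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # tokens 0 and 1 play the visual role
visual_ids = (0, 1)
layer_scales = [0.5, 0.0, 0.3]
contributions = [layer_contribution(tokens, layer_scales, i, visual_ids)
                 for i in range(len(layer_scales))]
ineffective_layers = [i for i, lc in enumerate(contributions) if lc < 1e-9]
```

In this toy, layer 0 has a nonzero contribution, while layers 1 and 2 have none (layer 1 applies a zero update, and freezing visual tokens in the final layer cannot affect the text token read-out), so those are the layers where visual token updates could be frozen.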

PO-CKAN: Physics Informed Deep Operator Kolmogorov Arnold Networks with Chunk Rational Structure

Authors: Junyi Wu, Guang Lin

First: 2025-10-09T20:18:24+00:00 · Latest: 2025-11-03T16:54:38+00:00

Abstract

We propose PO-CKAN, a physics-informed deep operator framework based on Chunkwise Rational Kolmogorov-Arnold Networks (KANs), for approximating the solution operators of partial differential equations. This framework leverages a Deep Operator Network (DeepONet) architecture that incorporates Chunkwise Rational Kolmogorov-Arnold Network (CKAN) sub-networks for enhanced function approximation. The principles of Physics-Informed Neural Networks (PINNs) are integrated into the operator learning framework to enforce physical consistency. This design enables the efficient learning of physically consistent spatio-temporal solution operators and allows for rapid prediction for parametric time-dependent PDEs with varying inputs (e.g., parameters, initial/boundary conditions) after training. Validated on challenging benchmark problems, PO-CKAN demonstrates accurate operator learning with results closely matching high-fidelity solutions. PO-CKAN adopts a DeepONet-style branch-trunk architecture with its sub-networks instantiated as rational KAN modules, and enforces physical consistency via a PDE residual (PINN-style) loss. On Burgers' equation with ν = 0.01, PO-CKAN reduces the mean relative L² error by approximately 48% compared to PI-DeepONet, and achieves competitive accuracy on the Eikonal and diffusion-reaction benchmarks.

中文标题/摘要

标题：PO-CKAN：基于分块有理柯尔莫哥洛夫-阿诺尔德网络的物理导向深度算子网络

我们提出了一种基于分块有理柯尔莫哥洛夫-阿诺尔德网络（CKAN）的物理导向深度算子框架PO-CKAN，用于近似偏微分方程的解算子。该框架利用了具有CKAN子网络的深度算子网络（DeepONet）架构，以增强函数逼近能力。将物理导向神经网络（PINNs）的原则整合到算子学习框架中，以确保物理一致性。这种设计使得能够高效地学习物理一致的空间-时间解算子，并在训练后能够快速预测具有变化输入（例如，参数、初始/边界条件）的参数时间依赖偏微分方程。PO-CKAN在具有挑战性的基准问题上得到了验证，展示了与高保真解结果接近的准确算子学习。PO-CKAN采用DeepONet风格的分支-干架构，其子网络实例化为有理KAN模块，并通过PDE残差（PINN风格）损失来确保物理一致性。在ν=0.01的Burgers' 方程上，PO-CKAN将平均相对L²误差降低了约48%，并在Eikonal和扩散-反应基准测试中实现了竞争力的精度。

Summary / 总结

PO-CKAN is a physics-informed deep operator framework that uses Chunkwise Rational Kolmogorov-Arnold Networks (CKAN) to approximate partial differential equation solution operators. It integrates principles of Physics-Informed Neural Networks (PINNs) to ensure physical consistency. PO-CKAN shows accurate operator learning, reducing the mean relative L² error by about 48% on Burgers' equation compared to PI-DeepONet and achieving competitive accuracy on other benchmarks.

PO-CKAN 是一个基于 Chunkwise Rational Kolmogorov-Arnold Networks (CKAN) 的物理导向深度算子框架，用于近似偏微分方程的解算子。它结合了 Physics-Informed Neural Networks (PINNs) 的原理以确保物理一致性。在 Burgers' 方程上，PO-CKAN 的平均相对 L² 误差比 PI-DeepONet 降低了约 48%，并在其他基准测试中达到了有竞争力的精度。
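
A PDE residual (PINN-style) loss of the kind the abstract mentions can be sketched for the viscous Burgers equation u_t + u u_x = ν u_xx. This is only an illustration: physics-informed training normally obtains derivatives by automatic differentiation through the network, whereas the sketch below substitutes central finite differences on a plain Python function, and the function names and sample points are hypothetical.

```python
def burgers_residual(u, x, t, nu, h=1e-3):
    """Central-difference estimate of u_t + u * u_x - nu * u_xx at (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t + u(x, t) * u_x - nu * u_xx


def physics_loss(u, points, nu):
    """Mean squared PDE residual over a set of collocation points."""
    return sum(burgers_residual(u, x, t, nu) ** 2 for x, t in points) / len(points)


def exact_solution(x, t):
    # u = x / (1 + t) satisfies the Burgers equation for any nu, since u_xx = 0.
    return x / (1.0 + t)


def poor_candidate(x, t):
    # A function that does not satisfy the equation.
    return x * t


collocation_points = [(0.2, 0.1), (-0.5, 0.4), (0.8, 0.9)]
loss_exact = physics_loss(exact_solution, collocation_points, nu=0.01)
loss_poor = physics_loss(poor_candidate, collocation_points, nu=0.01)
```

The loss is close to zero for the function that solves the equation and clearly positive for the one that does not, which is the signal a physics-informed operator network minimizes alongside its initial and boundary condition terms.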

Rethinking Visual Intelligence: Insights from Video Pretraining

Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro

First: 2025-10-28T14:12:11+00:00 · Latest: 2025-11-03T16:32:22+00:00

Comments: Updated version from preprint arXiv:2506.07280 (Gen2Gen) focused on visual intelligence. This work can be considered as v2

Abstract

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.

中文标题/摘要

标题：重新思考视觉智能：来自视频预训练的见解

大规模语言模型（LLMs）表明，大规模预训练使系统能够在语言领域中在少量监督的情况下迅速适应新问题。然而，这种成功并未在视觉领域得到有效的转化，模型，包括LLMs，仍然在组合理解、样本效率和通用问题解决方面挣扎。我们研究视频扩散模型（VDMs）作为弥合这一差距的有前途的方向。基于时空数据的预训练赋予这些模型强大的归纳偏置，以支持结构和动态，我们认为这可以支持广泛的任务适应性。为了测试这一点，我们设计了一项受控评估，在此评估中，一个预训练的LLM和一个预训练的VDM都配备了轻量级适配器，并被呈现给它们自然模态的任务。在包括ARC-AGI、ConceptARC、视觉游戏、路线规划和细胞自动机等基准测试中，VDMs在数据效率方面优于其语言对应物。综上所述，我们的结果表明，视频预训练提供了支持向视觉基础模型发展的归纳偏置。

Summary / 总结

This study explores Video Diffusion Models (VDMs) as a route to stronger visual intelligence through pretraining on spatiotemporal data. In a controlled evaluation where a pretrained LLM and a pretrained VDM are each equipped with lightweight adapters, VDMs show higher data efficiency than their language counterparts across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata. This suggests that video pretraining can provide inductive biases beneficial for visual foundation models.

本研究探讨了通过时空数据预训练来提升视觉智能的潜力，使用视频扩散模型（VDMs）。与大型语言模型（LLMs）相比，VDMs在ARC-AGI、ConceptARC、视觉游戏、路线规划和细胞自动机等各个基准测试中表现出更好的数据效率。这表明视频预训练可以为视觉基础模型提供有益的归纳偏置。

Forging Time Series with Language: A Large Language Model Approach to Synthetic Data Generation

Authors: Cécile Rousseau, Tobia Boschi, Giandomenico Cornacchia, Dhaval Salwala, Alessandra Pascale, Juan Bernabe Moreno

First: 2025-05-21T08:50:49+00:00 · Latest: 2025-11-03T16:31:16+00:00

Abstract

SDForger is a flexible and efficient framework for generating high-quality multivariate time series using LLMs. Leveraging a compact data representation, SDForger provides synthetic time series generation from a few samples and low-computation fine-tuning of any autoregressive LLM. Specifically, the framework transforms univariate and multivariate signals into tabular embeddings, which are then encoded into text and used to fine-tune the LLM. At inference, new textual embeddings are sampled and decoded into synthetic time series that retain the original data's statistical properties and temporal dynamics. Across a diverse range of datasets, SDForger outperforms existing generative models in many scenarios, both in similarity-based evaluations and downstream forecasting tasks. By enabling textual conditioning in the generation process, SDForger paves the way for multimodal modeling and the streamlined integration of time series with textual information. The model is open-sourced at https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series.

中文标题/摘要

标题：语言构建时间序列：大规模语言模型在合成数据生成中的应用

SDForger 是一个灵活高效的框架，使用大语言模型生成高质量的多变量时间序列。通过紧凑的数据表示，SDForger 可以从少量样本和低计算量微调任何自回归大语言模型来生成合成时间序列。具体来说，该框架将单变量和多变量信号转换为表格嵌入，然后编码为文本并用于微调大语言模型。在推理时，新的文本嵌入被采样并解码为保留原始数据统计特性和时间动态的合成时间序列。在多种数据集上，SDForger 在许多场景中优于现有生成模型，无论是基于相似性的评估还是下游预测任务。通过在生成过程中启用文本条件，SDForger 为多模态建模铺平了道路，并简化了时间序列与文本信息的集成。该模型在 https://github.com/IBM/fms-dgt/tree/main/fms_dgt/public/databuilders/time_series 开源。

Summary / 总结

SDForger is a framework that uses large language models to generate high-quality multivariate time series from a few samples with low-computation fine-tuning. It transforms signals into tabular embeddings, encodes them into text, and fine-tunes the LLM. The generated synthetic time series retain the original data's statistical properties and temporal dynamics. SDForger outperforms existing generative models in various scenarios, including similarity-based evaluations and downstream forecasting tasks.

SDForger 是一个框架，利用大型语言模型从少量样本生成高质量的多变量时间序列，并通过低计算量的微调实现。它将信号转换为表格嵌入，编码为文本，并微调 LLM。生成的合成时间序列保留了原始数据的统计特性和时间动态。SDForger 在许多场景下的相似性评估和下游预测任务中优于现有生成模型。

Benchmarking LLMs in Web API Integration Tasks

Authors: Daniel Maninger, Leon Chemnitz, Amir Molzam Sharifloo, Jannis Brugger, Mira Mezini

First: 2025-09-24T14:36:44+00:00 · Latest: 2025-11-03T16:12:09+00:00

Comments: To be published in Proceedings of 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware), Data & Benchmark Track; switched to IEEE conference template

Abstract

API integration is a cornerstone of our digital infrastructure, enabling software systems to connect and interact. However, as shown by many studies, writing or generating correct code to invoke APIs, particularly web APIs, is challenging. Although large language models (LLMs) have become popular in software development, their effectiveness in automating the generation of web API integration code remains unexplored. In order to address this, we present WAPIIBench, a dataset and evaluation pipeline designed to assess the ability of LLMs to generate web API invocation code. Our experiments with several open-source LLMs reveal that generating API invocations poses a significant challenge, resulting in hallucinated endpoints, incorrect argument usage, and other errors. None of the evaluated open-source models was able to solve more than 40% of the tasks.

中文标题/摘要

标题：Web API集成任务中LLM基准测试

API集成是数字基础设施的基石，使软件系统能够连接和交互。然而，如许多研究所示，编写或生成调用API，特别是Web API的正确代码是具有挑战性的。尽管大型语言模型（LLM）在软件开发中变得流行，但它们在自动化生成Web API集成代码方面的有效性尚未得到探索。为了应对这一挑战，我们提出了WAPIIBench，这是一个数据集和评估管道，旨在评估LLM生成Web API调用代码的能力。我们的实验显示，生成API调用构成了一个重大挑战，导致生成虚假的端点、错误的参数使用和其他错误。评估的开源模型中没有一个能够解决超过40%的任务。

Summary / 总结

The study evaluates the ability of large language models (LLMs) to generate web API invocation code, a task central to connecting software systems. The authors present WAPIIBench, a dataset and evaluation pipeline for this purpose. Experiments with several open-source LLMs show that generating correct API invocations is challenging, with errors such as hallucinated endpoints and incorrect argument usage. None of the evaluated open-source models solved more than 40% of the tasks.

研究旨在评估大型语言模型（LLMs）在生成Web API调用代码方面的能力，这对于软件系统之间的交互至关重要。作者提出了WAPIIBench数据集和评估管道来评估LLMs。对若干开源LLM的实验显示，生成正确的API调用存在挑战，出现了虚构的端点和参数使用错误等问题。所评估的开源模型中没有一个能够解决超过40%的任务。
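
The two error classes named above (hallucinated endpoints and incorrect argument usage) can be illustrated with a small checker. This is a hedged sketch only: `API_SPEC` and `check_invocation` are hypothetical names invented here, and nothing below reflects how WAPIIBench itself evaluates generated code.

```python
# Hypothetical API description used only for illustration; not from WAPIIBench.
API_SPEC = {
    ("GET", "/users/{id}"): {"required": set(), "optional": {"fields"}},
    ("POST", "/users"): {"required": {"name", "email"}, "optional": {"role"}},
}

def check_invocation(method, path_template, args):
    """Return a list of problems found in one generated API call."""
    endpoint = (method.upper(), path_template)
    if endpoint not in API_SPEC:
        # The model invented a method/path pair that the API does not offer.
        return ["hallucinated endpoint"]
    spec = API_SPEC[endpoint]
    given = set(args)
    problems = []
    missing = spec["required"] - given
    unknown = given - spec["required"] - spec["optional"]
    if missing:
        problems.append("missing arguments: " + ", ".join(sorted(missing)))
    if unknown:
        problems.append("unknown arguments: " + ", ".join(sorted(unknown)))
    return problems
```

For example, `check_invocation("GET", "/accounts", {})` reports a hallucinated endpoint, while a `POST /users` call without `email` reports a missing argument.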

FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies

Authors: Dongyue Lu, Lingdong Kong, Gim Hee Lee, Camille Simon Chane, Wei Tsang Ooi

Venue: NeurIPS 2025

First: 2024-12-09T17:57:14+00:00 · Latest: 2025-11-03T16:11:17+00:00

Comments: NeurIPS 2025; 28 pages, 14 figures, 10 tables; Code at https://flexevent.github.io/

Abstract

Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.

中文标题/摘要

标题：FlexEvent：面向可变操作频率的灵活事件-帧目标检测

事件相机由于其微秒级的时间分辨率和异步操作，在动态环境中提供了前所未有的实时感知优势。然而，现有的事件检测器受限于固定频率的范式，未能充分利用事件数据的高时间分辨率和适应性。为了解决这些限制，我们提出了FlexEvent，一种新型框架，能够在可变频率下进行检测。我们的方法包括两个关键组件：FlexFuse，一种自适应事件-帧融合模块，将高频事件数据与丰富的RGB帧语义信息集成，以及FlexTune，一种频率自适应微调机制，生成频率调整后的标签以增强模型在不同操作频率下的泛化能力。这种组合使我们的方法能够在快速移动和静态场景中以高精度检测物体，并适应动态环境。在大规模事件相机数据集上的广泛实验表明，我们的方法超越了最先进的方法，在标准和高频率设置中均实现了显著的性能提升。值得注意的是，当从20 Hz扩展到90 Hz时，我们的方法保持了稳健的性能，并在高达180 Hz的检测中提供了准确的结果，证明了其在极端条件下的有效性。我们的框架为基于事件的目标检测设定了新的基准，并为更具适应性的实时视觉系统铺平了道路。

Summary / 总结

FlexEvent is a novel framework designed to enable flexible event-frame object detection at varying operational frequencies. It consists of FlexFuse, which integrates high-frequency event data with semantic information from RGB frames, and FlexTune, which generates frequency-adjusted labels to improve generalization across frequencies. Experiments show that FlexEvent outperforms existing methods in both standard and high-frequency settings, maintaining robust performance when scaling from 20 Hz to 90 Hz and delivering accurate detection up to 180 Hz.

FlexEvent 是一种新型框架，旨在实现不同操作频率下的灵活事件-帧目标检测。它包括 FlexFuse，该模块将高频事件数据与 RGB 帧的语义信息结合，以及 FlexTune，该机制生成频率调整后的标签以提升跨频率的泛化能力。实验表明，FlexEvent 在标准和高频设置下均优于现有方法，在从 20 Hz 扩展到 90 Hz 时保持稳健的性能，并在高达 180 Hz 时仍能准确检测。

Identity Increases Stability in Neural Cellular Automata

Authors: James Stovold

First: 2025-08-08T15:18:01+00:00 · Latest: 2025-11-03T16:04:41+00:00

Comments: Accepted to ALIFE 2025

Abstract

Neural Cellular Automata (NCAs) offer a way to study the growth of two-dimensional artificial organisms from a single seed cell. From the outset, NCA-grown organisms have had issues with stability, their natural boundary often breaking down and exhibiting tumour-like growth or failing to maintain the expected shape. In this paper, we present a method for improving the stability of NCA-grown organisms by introducing an 'identity' layer with simple constraints during training. Results show that NCAs grown in close proximity are more stable compared with the original NCA model. Moreover, only a single identity value is required to achieve this increase in stability. We observe emergent movement from the stable organisms, with increasing prevalence for models with multiple identity values. This work lays the foundation for further study of the interaction between NCA-grown organisms, paving the way for studying social interaction at a cellular level in artificial organisms. Code/Videos available at: https://github.com/jstovold/ALIFE2025

中文标题/摘要

标题：身份提高神经细胞自动机的稳定性

神经细胞自动机（NCAs）提供了一种研究从单个种子细胞生长出二维人工有机体的方法。从一开始，由NCAs生长出的有机体就存在稳定性问题，它们的自然边界常常会崩溃并表现出肿瘤样的生长，或者无法维持预期的形状。在本文中，我们提出了一种通过在训练过程中引入一个“身份”层并施加简单约束来提高NCAs生长出的有机体稳定性的方法。结果显示，相邻生长的NCAs相比原始的NCA模型更加稳定。此外，只需要一个身份值就可以实现这种稳定性的提升。我们观察到，稳定的有机体中出现了自发的运动，对于具有多个身份值的模型，这种运动更为普遍。这项工作为进一步研究NCAs生长出的有机体之间的相互作用奠定了基础，为在人工有机体中研究细胞层面的社会互动铺平了道路。相关代码/视频可在：https://github.com/jstovold/ALIFE2025 获取。

Summary / 总结

This paper addresses the stability issues of organisms grown by Neural Cellular Automata (NCAs) by introducing an 'identity' layer with simple constraints during training. The results show that NCAs grown in close proximity are more stable than with the original NCA model, and that a single identity value is enough to obtain this gain. Emergent movement is also observed in the stable organisms, with increasing prevalence in models with multiple identity values.

本文通过在训练过程中引入带有简单约束的‘身份’层来解决神经细胞自动机（NCAs）生长出的有机体的稳定性问题。结果表明，相邻生长的NCAs比原始NCA模型更稳定，且只需一个身份值即可实现这一提升。此外，在稳定的有机体中观察到自发运动，在具有多个身份值的模型中更为普遍。

SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding

Authors: Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic

Venue: NeurIPS 2025

First: 2025-05-28T17:55:35+00:00 · Latest: 2025-11-03T15:59:11+00:00

Comments: NeurIPS 2025; 24 pages, 10 figures, 9 tables; Code at https://dekai21.github.io/SPIRAL/

Abstract

Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.

中文标题/摘要

标题：SPIRAL：语义感知渐进式LiDAR场景生成与理解

利用近期的扩散模型，基于LiDAR的大规模3D场景生成已经取得了巨大成功。虽然近期基于体素的方法可以同时生成几何结构和语义标签，但现有的距离视图方法仅限于生成未标记的LiDAR场景。依赖预训练的分割模型预测语义图通常会导致跨模态一致性不佳。为了解决这一限制，同时保留距离视图表示的优势，如计算效率和简化网络设计，我们提出了Spiral，一种新颖的距离视图LiDAR扩散模型，可以同时生成深度、反射图像和语义图。此外，我们引入了新的语义感知评估指标来评估生成的带标签距离视图数据的质量。在SemanticKITTI和nuScenes数据集上的实验表明，Spiral在参数量最小的情况下达到了最先进的性能，优于结合生成和分割模型的两步方法。此外，我们验证了Spiral生成的距离图像可以有效地用于下游分割训练中的合成数据增强，显著减少了LiDAR数据的标注工作量。

Summary / 总结

The research aims to improve LiDAR-based 3D scene generation, where existing range-view methods produce only unlabeled scenes. Spiral, a novel range-view LiDAR diffusion model, generates depth, reflectance images, and semantic maps simultaneously, avoiding the cross-modal inconsistency that arises when a pretrained segmentation model predicts the labels afterwards. Experiments on SemanticKITTI and nuScenes show that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods, and that its generated range images can serve as synthetic data augmentation for downstream segmentation training, reducing labeling effort.

研究旨在改进基于LiDAR的3D场景生成，现有的距离视图方法只能生成未标记的场景。Spiral 是一种新颖的距离视图LiDAR扩散模型，能够同时生成深度、反射图像和语义图，从而避免事后依赖预训练分割模型预测标签所带来的跨模态不一致。在SemanticKITTI和nuScenes上的实验表明，Spiral 以最小的参数量达到了最先进的性能，优于两步法，并且生成的距离图像可以有效用于下游分割训练的合成数据增强，减少LiDAR数据的标注工作量。

Efficient Remote Sensing Change Detection with Change State Space Models

Authors: Elman Ghazaei, Erchan Aptoula

First: 2025-04-15T11:25:10+00:00 · Latest: 2025-11-03T15:53:45+00:00

Abstract

Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.

中文标题/摘要

标题：基于变化状态空间模型的高效遥感变化检测

尽管卷积神经网络（ConvNets）和视觉变换器（ViT）常用于变化检测，但它们各自存在明显的局限性：前者难以建模长程依赖关系，而后者计算效率低下，难以在大规模数据集上进行训练。基于状态空间模型的Vision Mamba架构作为一种替代方案，解决了上述问题，并已被应用于遥感变化检测，尽管主要用作特征提取骨干。本文介绍了一种专门设计用于变化检测的变化状态空间模型，该模型通过关注双时相图像之间的相关变化，有效过滤掉无关信息。通过仅关注变化特征，减少了网络参数的数量，显著提高了计算效率，同时保持了高检测性能和对输入降级的鲁棒性。所提出的模型在三个基准数据集上的评估结果显示，其在计算复杂度仅为ConvNets、ViTs和基于Mamba的对应模型的一小部分的情况下，性能更优。该实现将在接受后在https://github.com/Elman295/CSSM公开。

Summary / 总结

The research addresses the limitations of ConvNets and ViTs in remote sensing change detection: ConvNets struggle to model long-range dependencies, while ViTs are computationally inefficient. It introduces the Change State Space Model, which focuses on the relevant changes between bi-temporal images, thereby reducing the number of network parameters and improving computational efficiency. On three benchmark datasets the model outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity, while maintaining robustness against input degradation.

研究针对卷积神经网络和视觉变换器在遥感变化检测中的局限性：前者难以建模长程依赖，后者计算效率低下。提出了变化状态空间模型，该模型专注于双时相图像之间的相关变化，减少了网络参数并提高了计算效率。该模型在三个基准数据集上的表现优于卷积神经网络、视觉变换器和基于 Mamba 的同类模型，且计算复杂度仅为它们的一小部分，同时保持了对输入降级的鲁棒性。

Dynamic Forgetting and Spatio-Temporal Periodic Interest Modeling for Local-Life Service Recommendation

Authors: Zhaoyu Hu, Jianyang Wang, Hao Guo, Yuan Tian, Erpeng Xue, Xianyang Qi, Hongxiang Lin, Lei Wang, Sheng Chen

First: 2025-08-04T14:16:49+00:00 · Latest: 2025-11-03T15:46:33+00:00

Abstract

In the context of the booming digital economy, recommendation systems, as a key link connecting users and numerous services, face challenges in modeling user behavior sequences on local-life service platforms, including the sparsity of long sequences and strong spatio-temporal dependence. Such challenges can be addressed by drawing an analogy to the forgetting process in human memory. This is because users' responses to recommended content follow the recency effect and the cyclicality of memory. By exploring this, this paper introduces the forgetting curve and proposes Spatio-Temporal periodic Interest Modeling (STIM) with long sequences for local-life service recommendation. STIM integrates three key components: a dynamic masking module based on the forgetting curve, which is used to extract both recent spatiotemporal features and periodic spatiotemporal features; a query-based mixture of experts (MoE) approach that can adaptively activate expert networks under different dynamic masks, enabling the collaborative modeling of time, location, and items; and a hierarchical multi-interest network unit, which captures multi-interest representations by modeling the hierarchical interactions between the shallow and deep semantics of users' recent behaviors. By introducing the STIM method, we conducted online A/B tests and achieved a 1.54% improvement in gross transaction volume (GTV). In addition, extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system, serving hundreds of millions of daily active users in core application scenarios.

中文标题/摘要

标题：本地生活服务推荐中的动态遗忘与时空周期兴趣建模

在数字经济蓬勃发展的背景下，推荐系统作为连接用户和众多服务的关键环节，面临着在本地生活服务平台上建模用户行为序列的挑战，包括长序列的稀疏性和强烈的时空依赖性。这些挑战可以通过类比人类记忆中的遗忘过程来解决。因为用户对推荐内容的响应遵循近期效应和记忆的周期性。通过探索这一点，本文引入了遗忘曲线，并提出了一种结合长序列的时空周期兴趣建模（STIM）方法，用于本地生活服务推荐。STIM整合了三个关键组件：基于遗忘曲线的动态遮罩模块，用于提取近期时空特征和周期时空特征；基于查询的专家混合（MoE）方法，可以在不同动态遮罩下自适应激活专家网络，实现时间、地点和项目的协同建模；以及层次多兴趣网络单元，通过建模用户近期行为的浅层和深层语义之间的层次交互来捕获多兴趣表示。通过引入STIM方法，我们进行了在线A/B测试，并实现了1.54%的总交易量（GTV）提升。此外，扩展的离线实验也显示了改进。STIM已在大规模本地生活服务推荐系统中部署，服务于核心应用场景中的数亿日活跃用户。

Summary / 总结

This paper addresses the challenges of modeling user behavior sequences on local-life service platforms by proposing Spatio-Temporal periodic Interest Modeling (STIM), which uses a dynamic masking module based on the forgetting curve to capture both recent and periodic spatiotemporal features. The method also includes a query-based mixture of experts (MoE) for adaptive expert network activation and a hierarchical multi-interest network unit for capturing multi-interest representations. Online A/B tests showed a 1.54% improvement in gross transaction volume (GTV), and extended offline experiments also showed improvements. STIM has been deployed in a large-scale local-life service recommendation system serving hundreds of millions of daily active users.

本文针对本地生活服务平台上用户行为序列建模的挑战，如长序列的稀疏性和强烈的时空依赖性，提出了时空周期兴趣建模（STIM）。STIM 使用基于遗忘曲线的动态掩码模块提取近期和周期性的时空特征，使用基于查询的混合专家方法在不同动态掩码下自适应地建模时间、位置和项目，并使用分层多兴趣网络单元捕获用户近期行为的浅层和深层语义的多兴趣表示。在线 A/B 测试结果显示总交易量（GTV）提高了1.54%，离线实验也显示了改进。STIM 已部署在大规模本地生活服务推荐系统中，服务于数亿日活跃用户。
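
The idea of masking a behavior sequence by recency and by periodicity can be sketched as follows. This is an illustration under stated assumptions, not STIM's actual module: it assumes the classic Ebbinghaus form $R = e^{-t/S}$ for the forgetting curve, and the function names, the 72-hour memory strength, the 0.5 threshold, the weekly period, and the tolerance are all hypothetical choices made here.

```python
import math

def retention(elapsed_hours, strength_hours=72.0):
    """Ebbinghaus-style retention R = exp(-t / S); S is an assumed constant."""
    return math.exp(-elapsed_hours / strength_hours)

def recency_mask(elapsed_hours_list, threshold=0.5, strength_hours=72.0):
    """1 keeps a behavior whose retention is at least threshold, 0 masks it."""
    return [1 if retention(t, strength_hours) >= threshold else 0
            for t in elapsed_hours_list]

def periodic_mask(elapsed_hours_list, period_hours=168.0, tolerance_hours=6.0):
    """1 keeps behaviors that happened close to a whole number of periods ago."""
    mask = []
    for t in elapsed_hours_list:
        offset = t % period_hours
        distance = min(offset, period_hours - offset)
        # Require at least half a period of age so very recent events
        # are handled by the recency mask instead.
        is_periodic = t >= period_hours / 2 and distance <= tolerance_hours
        mask.append(1 if is_periodic else 0)
    return mask
```

With elapsed times `[2, 30, 100, 165, 170, 340]` hours, the recency mask keeps the first two behaviors and the periodic mask keeps the last three, which fall near one or two weeks ago.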

SonarSplat: Novel View Synthesis of Imaging Sonar via Gaussian Splatting

Authors: Advaith V. Sethuraman, Max Rucker, Onur Bagoren, Pou-Chun Kung, Nibarkavi N. B. Amutha, Katherine A. Skinner

First: 2025-03-31T19:13:45+00:00 · Latest: 2025-11-03T15:16:31+00:00

Abstract

In this paper, we present SonarSplat, a novel Gaussian splatting framework for imaging sonar that demonstrates realistic novel view synthesis and models acoustic streaking phenomena. Our method represents the scene as a set of 3D Gaussians with acoustic reflectance and saturation properties. We develop a novel method to efficiently rasterize Gaussians to produce a range/azimuth image that is faithful to the acoustic image formation model of imaging sonar. In particular, we develop a novel approach to model azimuth streaking in a Gaussian splatting framework. We evaluate SonarSplat using real-world datasets of sonar images collected from an underwater robotic platform in a controlled test tank and in a real-world river environment. Compared to the state-of-the-art, SonarSplat offers improved image synthesis capabilities (+3.2 dB PSNR) and more accurate 3D reconstruction (77% lower Chamfer Distance). We also demonstrate that SonarSplat can be leveraged for azimuth streak removal.

中文标题/摘要

标题：SonarSplat：基于高斯泼溅的成像声纳新颖视图合成

在本文中，我们提出了SonarSplat，一种新颖的面向成像声纳的高斯泼溅框架，展示了逼真的新颖视图合成，并建模了声学条纹现象。我们的方法将场景表示为具有声学反射率和饱和度属性的3D高斯集合。我们开发了一种新颖的方法，高效地光栅化高斯以生成忠实于成像声纳声学成像模型的距离/方位图像。特别是，我们开发了一种新颖的方法来在高斯泼溅框架中建模方位条纹。我们使用水下机器人平台在受控测试水箱和真实河流环境中采集的真实声纳图像数据集评估了SonarSplat。与最先进的技术相比，SonarSplat提供了改进的图像合成能力（+3.2 dB PSNR）和更准确的3D重建（Chamfer 距离降低77%）。我们还展示了SonarSplat可以用于方位条纹去除。

Summary / 总结

SonarSplat is a Gaussian splatting framework for imaging sonar that synthesizes realistic novel views and models acoustic streaking. It represents the scene with 3D Gaussians and efficiently rasterizes them to produce range/azimuth images. Compared to existing methods, SonarSplat improves image synthesis by 3.2 dB PSNR and achieves 77% lower Chamfer Distance in 3D reconstruction. It also enables azimuth streak removal.

SonarSplat是一种基于高斯泼溅的成像声纳新颖视图合成框架，能够模拟声学条纹现象。该方法通过3D高斯表示场景，并高效地将其光栅化为距离/方位图像。与现有方法相比，它将图像合成的PSNR提高了3.2 dB，并将3D重建的Chamfer距离降低了77%。此外，还展示了SonarSplat在去除方位条纹方面的应用。
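
The two metrics quoted above can be computed as in the sketch below. It assumes the standard PSNR definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ and one common symmetric Chamfer convention (mean squared nearest-neighbor distance in both directions, summed). Chamfer conventions vary between papers, so the variant used by SonarSplat may differ, and the function names are illustrative.

```python
import math

def psnr(image_a, image_b, max_value=1.0):
    """PSNR in dB between two equally sized images given as flat lists."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (brute force)."""
    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Because PSNR is logarithmic in the mean squared error, a gain of 3.2 dB at a fixed peak value corresponds to an MSE roughly 2.1 times lower.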

Towards Large-Scale In-Context Reinforcement Learning by Meta-Training in Randomized Worlds

Authors: Fan Wang, Pengtao Shao, Yiming Zhang, Bo Yu, Shaoshan Liu, Ning Ding, Yang Cao, Yu Kang, Haifeng Wang

First: 2025-02-05T03:59:13+00:00 · Latest: 2025-11-03T14:21:27+00:00

Comments: NeurIPS 2025

Abstract

In-Context Reinforcement Learning (ICRL) enables agents to learn automatically and on-the-fly from their interactive experiences. However, a major challenge in scaling up ICRL is the lack of scalable task collections. To address this, we propose the procedurally generated tabular Markov Decision Processes, named AnyMDP. Through a carefully designed randomization process, AnyMDP is capable of generating high-quality tasks on a large scale while maintaining relatively low structural biases. To facilitate efficient meta-training at scale, we further introduce decoupled policy distillation and induce prior information in the ICRL framework. Our results demonstrate that, with a sufficiently large scale of AnyMDP tasks, the proposed model can generalize to tasks that were not considered in the training set through versatile in-context learning paradigms. The scalable task set provided by AnyMDP also enables a more thorough empirical investigation of the relationship between data distribution and ICRL performance. We further show that the generalization of ICRL potentially comes at the cost of increased task diversity and longer adaptation periods. This finding carries critical implications for scaling robust ICRL capabilities, highlighting the necessity of diverse and extensive task design, and prioritizing asymptotic performance over few-shot adaptation.

中文标题/摘要

标题：通过元训练在随机世界中实现大规模上下文相关强化学习

上下文相关强化学习（ICRL）使智能体能够从与其交互的经验中自动学习。然而，ICRL 扩大规模的主要挑战是没有可扩展的任务集合。为了解决这一问题，我们提出了程序生成的表格马尔可夫决策过程（AnyMDP）。通过精心设计的随机化过程，AnyMDP 能够大规模生成高质量的任务，同时保持相对较低的结构偏差。为了促进大规模的元训练，我们进一步引入了解耦策略蒸馏，并在 ICRL 框架中引入先验信息。我们的结果表明，通过灵活的上下文相关学习范式，使用足够大的 AnyMDP 任务规模，所提出的模型可以泛化到训练集中未考虑的任务。AnyMDP 提供的可扩展任务集还使我们能够更深入地研究数据分布与 ICRL 性能之间的关系。我们还表明，ICRL 的泛化可能会以增加任务多样性和延长适应期为代价。这一发现对扩展稳健的 ICRL 能力具有关键意义，强调了多样化和广泛的任务设计的必要性，并优先考虑渐近性能而非少量样本适应。
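
To make "procedurally generated tabular Markov Decision Processes" concrete, here is a naive sketch that samples a random tabular MDP and solves it with value iteration. It is not AnyMDP's randomization process, which the abstract describes only as carefully designed; the uniform sampling scheme, function names, and parameters below are assumptions chosen for brevity.

```python
import random

def random_tabular_mdp(num_states, num_actions, seed=0):
    """Sample transitions P[s][a][s'] and rewards R[s][a] for a tabular MDP."""
    rng = random.Random(seed)
    transitions, rewards = [], []
    for _ in range(num_states):
        state_rows, state_rewards = [], []
        for _ in range(num_actions):
            # Normalize random weights into a next-state distribution.
            weights = [rng.random() for _ in range(num_states)]
            total = sum(weights)
            state_rows.append([w / total for w in weights])
            state_rewards.append(rng.uniform(-1.0, 1.0))
        transitions.append(state_rows)
        rewards.append(state_rewards)
    return transitions, rewards

def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8):
    """Return optimal state values via repeated Bellman optimality backups."""
    values = [0.0] * len(transitions)
    while True:
        new_values = [
            max(
                rewards[s][a]
                + gamma * sum(p * v for p, v in zip(transitions[s][a], values))
                for a in range(len(rewards[s]))
            )
            for s in range(len(transitions))
        ]
        if max(abs(n - o) for n, o in zip(new_values, values)) < tol:
            return new_values
        values = new_values
```

Varying `seed`, `num_states`, and `num_actions` yields an unbounded supply of tasks, and the value-iteration solution gives a reference against which an in-context learner's returns could be compared.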