arXiv Paper Digest

2026-03-12 03:50
Snapshot: 20260312_0350
Task Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes
Authors: Aleksei Rozanov, Arvind Renganathan, Vipin Kumar
Venue: AAAI 2026
First: 2026-03-10T17:59:29+00:00 · Latest: 2026-03-10T17:59:29+00:00
Comments: Accepted to the KGML Bridge at AAAI 2026 (non-archival)
Abstract
Accurately upscaling terrestrial carbon fluxes is central to estimating the global carbon budget, yet remains challenging due to the sparse and regionally biased distribution of ground measurements. Existing data-driven upscaling products often fail to generalize beyond observed domains, leading to systematic regional biases and high predictive uncertainty. We introduce Task-Aware Modulation with Representation Learning (TAM-RL), a framework that couples spatio-temporal representation learning with a knowledge-guided encoder-decoder architecture and a loss function derived from the carbon balance equation. Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL improves predictive performance relative to existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance ($R^2$) from 19.4% to 43.8%, depending on the target flux. These results demonstrate that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
Summary
The research aims to improve the accuracy of upscaling terrestrial carbon fluxes, which is crucial for estimating the global carbon budget. The authors propose Task-Aware Modulation with Representation Learning (TAM-RL), which combines spatio-temporal representation learning with a knowledge-guided encoder-decoder architecture and a loss function derived from the carbon balance equation. Across 150+ flux tower sites, TAM-RL outperforms existing methods, reducing RMSE by 8-9.6% and increasing $R^2$ from 19.4% to 43.8% for different target fluxes, indicating enhanced robustness and transferability of global carbon flux estimates.
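The study aims to improve the accuracy of upscaling terrestrial carbon fluxes, which is essential for estimating the global carbon budget. The authors develop the Task-Aware Modulation with Representation Learning (TAM-RL) framework, which combines spatio-temporal representation learning, a knowledge-guided encoder-decoder architecture, and a loss function derived from the carbon balance equation. Across more than 150 flux tower sites, TAM-RL outperforms existing methods, reducing RMSE by 8-9.6% and increasing $R^2$ from 19.4% to 43.8%. This indicates that combining physical constraints with adaptive learning can substantially improve the robustness and transferability of global carbon flux estimates.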
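The abstract mentions a loss function derived from the carbon balance equation but does not give its form. The following is a minimal sketch of what such a knowledge-guided loss could look like, assuming the common convention NEE = RECO - GPP. The function name, the tuple layout, and `balance_weight` are hypothetical and are not taken from the paper.

```python
def carbon_balance_loss(pred, obs, balance_weight=0.1):
    """Data-fit error plus a penalty for violating NEE = RECO - GPP.

    pred and obs are lists of (gpp, reco, nee) tuples for the same samples.
    balance_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    n = len(pred)
    # Mean squared error over all three fluxes.
    data_term = sum(
        (p - o) ** 2
        for p_row, o_row in zip(pred, obs)
        for p, o in zip(p_row, o_row)
    ) / (3 * n)
    # Physics term: predicted NEE should equal predicted RECO minus predicted GPP.
    balance_term = sum((nee - (reco - gpp)) ** 2 for gpp, reco, nee in pred) / n
    return data_term + balance_weight * balance_term
```

A prediction that matches the observations and satisfies the balance incurs zero loss. A prediction whose NEE disagrees with its own RECO and GPP is penalized even where no NEE observation is available to correct it.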
DUEL: Exact Likelihood for Masked Diffusion via Deterministic Unmasking
Authors: Gilad Turok, Chris De Sa, Volodymyr Kuleshov
First: 2026-03-02T01:56:03+00:00 · Latest: 2026-03-10T17:59:15+00:00
Comments: 22 pages, 5 figures, 8 tables
Abstract
Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and then predicting tokens at those positions. Yet MDMs lack proper likelihood evaluation: the evidence lower bound (ELBO) is not only a loose bound on log-likelihood, but, as we show, is also computed under the training distribution rather than the test-time distribution. We resolve this within our DUEL framework, which unifies leading MDM sampling strategies that employ $\textit{deterministic}$ position selection. We prove that DUEL samplers admit $\textbf{exact likelihood computation under the test-time distribution}$ -- giving MDMs $\textit{proper}$ likelihood, and hence proper perplexity, for the first time. This proper perplexity is the natural analogue of autoregressive perplexity and lets us revisit key questions about MDMs. $\textbf{MDMs are substantially better than previously thought}$: the MDM-autoregressive perplexity gap shrinks by up to $32\%$ on in-domain data and $82\%$ on zero-shot benchmarks. DUEL enables the first principled comparison of fast, parallel samplers across compute budgets -- an analysis impossible with the ELBO and unreliable with generative perplexity -- identifying a strong default method. Finally, oracle search over position orderings reveals MDMs can far surpass autoregressive models -- achieving $36.47$ vs. $52.11$ perplexity on AG News -- demonstrating the ceiling of MDM performance has not yet been reached.
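Title / abstract (translated from Chinese)
Title: DUEL: Exact Likelihood for Masked Diffusion Models via Deterministic Unmasking
Masked diffusion models (MDMs) generate text by iteratively selecting positions to unmask and predicting the tokens at those positions. However, MDMs lack proper likelihood evaluation: the evidence lower bound (ELBO) is not only a loose bound on the log-likelihood but, as we show, is also computed under the training distribution rather than the test-time distribution. We resolve this with the DUEL framework, which unifies leading MDM sampling strategies that use deterministic position selection. We prove that DUEL samplers admit exact likelihood computation under the test-time distribution, giving MDMs a proper likelihood and, for the first time, a proper perplexity. This proper perplexity is the natural counterpart of autoregressive perplexity and lets us revisit key questions about MDMs. MDMs are in fact much better than previously thought: the perplexity gap between MDMs and autoregressive models shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL enables the first principled comparison of fast parallel samplers across compute budgets, which is impossible with the ELBO and unreliable with generative perplexity. Finally, an oracle search over position orderings shows that MDMs can far surpass autoregressive models, reaching a perplexity of 36.47 versus 52.11 on AG News, which indicates that the performance ceiling of MDMs has not yet been reached.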
Summary
DUEL addresses the issue of proper likelihood evaluation in masked diffusion models (MDMs) by unifying deterministic position selection strategies. It enables exact likelihood computation under the test-time distribution, providing MDMs with proper likelihood and perplexity for the first time. Experimental results show that MDMs are significantly better than previously thought, with a perplexity gap reduction of up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL also allows for a principled comparison of fast, parallel samplers and reveals that MDMs can surpass autoregressive models, achieving lower perplexity on AG News.
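The paper introduces the DUEL framework to address exact likelihood evaluation for masked diffusion models (MDMs); the framework unifies deterministic position-selection strategies. DUEL lets MDMs compute exact likelihoods under the test-time distribution, giving them a proper likelihood and perplexity for the first time. Experiments show that MDMs are much better than previously thought: the perplexity gap between MDMs and autoregressive models shrinks by up to 32% on in-domain data and 82% on zero-shot benchmarks. DUEL also enables the first principled comparison of fast parallel samplers across compute budgets, and an oracle search shows that MDMs can surpass autoregressive models, reaching lower perplexity on AG News.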
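To illustrate why deterministic position selection makes the likelihood exact, here is a toy sketch. It is not the DUEL implementation. It assumes one token is unmasked per step and a greedy "most confident position" rule, and `toy_denoiser` is a hypothetical stand-in for a trained MDM. Because the next position is a deterministic function of the partially revealed sequence, each full sequence has exactly one unmasking trajectory, so its probability is the product of the per-step token probabilities along that trajectory.

```python
import math
from itertools import product

MASK = None          # placeholder for a still-masked position
VOCAB = (0, 1)       # toy two-token vocabulary


def toy_denoiser(state):
    """Hypothetical stand-in for a trained MDM.

    Returns, for every position, a distribution over VOCAB that depends on
    the position index and on how many 1-tokens are already revealed.
    """
    revealed_ones = sum(1 for tok in state if tok == 1)
    dists = []
    for i in range(len(state)):
        logit = 0.8 * i - 1.0 + 0.9 * revealed_ones
        p_one = 1.0 / (1.0 + math.exp(-logit))
        dists.append((1.0 - p_one, p_one))
    return dists


def select_position(state, dists):
    """Assumed deterministic rule: most confident masked position, lowest index on ties."""
    masked = [i for i, tok in enumerate(state) if tok is MASK]
    return max(masked, key=lambda i: (max(dists[i]), -i))


def exact_log_likelihood(tokens):
    """Replay the sampler's own unmasking order and sum the per-step log-probabilities."""
    state = [MASK] * len(tokens)
    log_prob = 0.0
    while any(tok is MASK for tok in state):
        dists = toy_denoiser(state)
        i = select_position(state, dists)
        log_prob += math.log(dists[i][tokens[i]])
        state[i] = tokens[i]
    return log_prob


def perplexity(tokens):
    """Per-token perplexity under the sampler's own distribution."""
    return math.exp(-exact_log_likelihood(tokens) / len(tokens))
```

A quick sanity check is that the probabilities of all sequences of a given length sum to one, which confirms that the quantity is a likelihood rather than a bound.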
ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
Authors: Jannis Becktepe, Julian Dierkes, Carolin Benjamins, Aditya Mohan, David Salinas, Raghu Rajan, Frank Hutter, Holger Hoos, Marius Lindauer, Theresa Eimer
Venue: Journal of Data-centric Machine Learning Research; 2026
First: 2024-09-27T15:22:28+00:00 · Latest: 2026-03-10T17:58:13+00:00
Abstract
Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.
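Title / abstract (translated from Chinese)
Title: ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
Hyperparameters are a key factor in reliably training high-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated methods for tuning them is both expensive and time-consuming. As a result, such methods are usually evaluated on a single domain or algorithm, which makes comparison difficult and limits insight into their generality. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows diverse HPO methods to be compared while being highly efficient to evaluate. To enable HPO research in RL even with low compute resources, we select a representative subset of HPO tasks covering a variety of algorithm and environment combinations. This selection makes it possible to generate a performance profile of an automated RL (AutoRL) method with only a fraction of the compute previously required, so that more researchers can work on HPO in RL. Built on the extensive, large-scale dataset of hyperparameter landscapes that underlies our selection, ARLBench is an efficient, flexible, and future-oriented foundation for AutoRL research. The benchmark and dataset are available at https://github.com/automl/arlbench.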
Summary
ARLBench is designed to facilitate the comparison of hyperparameter optimization (HPO) methods in reinforcement learning (RL) by selecting a representative subset of tasks that cover various RL algorithms and environments. This benchmark enables efficient evaluation, even with limited computational resources, and provides a large-scale dataset of hyperparameter landscapes. As a result, ARLBench supports broader research into HPO in RL and offers a future-oriented foundation for AutoRL studies.
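By selecting a representative set of tasks covering a variety of RL algorithms and environments, ARLBench aims to facilitate the comparison of HPO methods in RL. The benchmark allows efficient evaluation even with limited compute and provides a large-scale dataset of hyperparameter landscapes. ARLBench therefore supports broader research on HPO in RL and offers a future-oriented foundation for AutoRL research.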
ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
Authors: Freeman Cheng, Botao Ye, Xueting Li, Junqi You, Fangneng Zhan, Ming-Hsuan Yang
First: 2026-03-10T17:58:08+00:00 · Latest: 2026-03-10T17:58:08+00:00
Abstract
Online novel view synthesis remains challenging, requiring robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model supporting posed or unposed inputs, with or without camera intrinsics. While assembling local Gaussians using camera poses scales better than canonical-space prediction, it creates a dilemma during training: using ground-truth poses ensures stability but causes a distribution mismatch when predicted poses are used at inference. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy combining early-layer truncation with chunk-level selective retention, reducing the KV cache size by over 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in- and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at https://freemancheng.com/ReCoSplat .
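Title / abstract (translated from Chinese)
Title: ReCoSplat: Autoregressive Feed-Forward Gaussian Splatting Using Render-and-Compare
Online novel view synthesis remains challenging because it requires robust scene reconstruction from sequential, often unposed, observations. We present ReCoSplat, an autoregressive feed-forward Gaussian Splatting model that supports posed or unposed inputs, with or without camera intrinsics. Although assembling local Gaussians using camera poses scales better than predicting in canonical space, it creates a dilemma during training: using ground-truth poses ensures stability, but using predicted poses at inference causes a distribution mismatch. To address this, we introduce a Render-and-Compare (ReCo) module. ReCo renders the current reconstruction from the predicted viewpoint and compares it with the incoming observation, providing a stable conditioning signal that compensates for pose errors. To support long sequences, we propose a hybrid KV cache compression strategy that combines early-layer truncation with chunk-level selective retention, reducing the KV cache size by more than 90% for 100+ frames. ReCoSplat achieves state-of-the-art performance across different input settings on both in-distribution and out-of-distribution benchmarks. Code and pretrained models will be released. Our project page is at https://freemancheng.com/ReCoSplat .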
Summary
ReCoSplat is an autoregressive feed-forward Gaussian Splatting model designed for online novel view synthesis, capable of handling both posed and unposed inputs. It introduces a Render-and-Compare (ReCo) module to address the challenge of pose errors during training, ensuring stability and compensating for mismatches. Additionally, a hybrid KV cache compression strategy is proposed to manage long sequences efficiently. Experimental results show that ReCoSplat outperforms existing methods on various benchmarks, both in- and out-of-distribution.
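ReCoSplat is an autoregressive feed-forward Gaussian Splatting model for online novel view synthesis that supports inputs with known or unknown poses. It introduces a Render-and-Compare (ReCo) module to handle pose errors during training, ensuring stability and compensating for the mismatch. It also proposes a hybrid KV cache compression strategy to manage long sequences efficiently. Experiments show that ReCoSplat outperforms existing methods on a range of benchmarks, both in-distribution and out-of-distribution.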
Emotional Modulation in Swarm Decision Dynamics
Authors: David Freire-Obregón
First: 2026-03-10T17:56:42+00:00 · Latest: 2026-03-10T17:56:42+00:00
Comments: Accepted for presentation at the International Conference on Agents and Artificial Intelligence (ICAART 2026)
Abstract
Collective decision-making in biological and human groups often emerges from simple interaction rules that amplify minor differences into consensus. The bee equation, developed initially to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here, we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) act as modulators of interaction rates, effectively altering the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, allowing the study of emotional contagion in consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect" in which consensus accelerates after surpassing intermediate support thresholds. Results show that emotional modulation can bias decision outcomes and alter convergence times by shifting effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even in fully symmetric emotional conditions. These findings link classical swarm decision theory with affective and social modelling, highlighting how both emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in both natural and artificial systems.
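Title / abstract (translated from Chinese)
Title: Emotional Modulation in Swarm Decision Dynamics
Collective decisions in biological and human groups often arise from simple interaction rules that amplify small differences into consensus. The bee equation, originally used to describe nest-site selection in honeybee swarms, captures this dynamic through recruitment and inhibition processes. Here we extend the bee equation into an agent-based model in which emotional valence (positive-negative) and arousal (low-high) modulate interaction rates, effectively changing the recruitment and cross-inhibition parameters. Agents display simulated facial expressions mapped from their valence-arousal states, which makes it possible to study emotional contagion during consensus formation. Three scenarios are explored: (1) the joint effect of valence and arousal on consensus outcomes and speed, (2) the role of arousal in breaking ties when valence is matched, and (3) the "snowball effect", in which consensus accelerates once support passes intermediate thresholds. The results show that emotional modulation can bias decision outcomes and change convergence times by shifting the effective recruitment and inhibition rates. At the same time, intrinsic non-linear amplification can produce decisive wins even under fully symmetric emotional conditions. These findings connect classical swarm decision theory with affective and social modelling and highlight how emotional asymmetries and structural tipping points shape collective outcomes. The proposed framework offers a flexible tool for studying the emotional dimensions of collective choice in natural and artificial systems.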
Summary
This study extends the bee equation to model emotional modulation in swarm decision dynamics using an agent-based approach. By incorporating emotional valence and arousal, the model explores how these factors influence consensus outcomes and speed. Key findings include the biasing effect of emotional modulation on decision outcomes and the acceleration of consensus formation after surpassing intermediate support thresholds. The research bridges swarm decision theory with affective and social modeling, emphasizing the role of emotional asymmetries and structural tipping points in shaping collective outcomes.
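The study extends the bee equation of swarm decision dynamics with an agent-based approach in which emotional valence and arousal level affect interaction rates. Key findings are that emotional states can bias decision outcomes and change convergence times, and that non-linear amplification can lead to decisive wins under symmetric emotional conditions. The model examines how emotional asymmetries and structural tipping points affect collective outcomes, linking swarm decision theory with affective and social modelling.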
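The abstract describes emotions as modulators of the recruitment and cross-inhibition rates. The sketch below shows that idea on a generic two-option mean-field model with recruitment and cross-inhibition. This model form is an assumption and is not necessarily the paper's bee equation or its agent-based implementation. The `modulate` mapping and its linear gains are likewise hypothetical.

```python
def modulate(base_rate, valence, arousal):
    """Hypothetical mapping from emotion to an interaction rate.

    valence is in [-1, 1] and arousal in [0, 1]; the linear gains are
    illustrative assumptions, not values from the paper.
    """
    return base_rate * (1.0 + 0.5 * valence) * (1.0 + 0.5 * arousal)


def simulate(emotion_a, emotion_b, a0=0.05, b0=0.05, dt=0.01, steps=2000):
    """Euler integration of a two-option recruitment / cross-inhibition model.

    a and b are the fractions committed to options A and B; u = 1 - a - b
    is the uncommitted fraction. emotion_* are (valence, arousal) pairs.
    """
    recruit_a = modulate(1.0, *emotion_a)   # A supporters recruit uncommitted agents
    recruit_b = modulate(1.0, *emotion_b)
    inhibit_a = modulate(1.0, *emotion_a)   # A supporters push B supporters back to uncommitted
    inhibit_b = modulate(1.0, *emotion_b)
    a, b = a0, b0
    history = [(a, b)]
    for _ in range(steps):
        u = 1.0 - a - b
        da = recruit_a * a * u - inhibit_b * a * b
        db = recruit_b * b * u - inhibit_a * a * b
        a, b = a + dt * da, b + dt * db
        history.append((a, b))
    return history
```

Calling `simulate((0.5, 0.5), (0.0, 0.5))` gives option A a higher valence, and its support pulls ahead of B. With identical emotions this deterministic sketch stays exactly tied, so reproducing the decisive wins under symmetric conditions that the abstract reports would need a stochastic, agent-level version.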
BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Authors: Xinyu Gao, Gang Chen, Javier Alonso-Mora
First: 2026-03-10T17:56:16+00:00 · Latest: 2026-03-10T17:56:16+00:00
Comments: 8 pages. Project page: https://xin-yu-gao.github.io/beacon
Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we conduct detailed experimental analysis to validate both our BEV space formulation and the design choices of each module. Our method improves the accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline on the validation subset with occluded target locations. Our project page is: https://xin-yu-gao.github.io/beacon.
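Title / abstract (translated from Chinese)
Title: BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. They therefore struggle to infer target locations in regions occluded by furniture or moving people. To address this, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region that includes occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset built in the Habitat simulator, we carry out a detailed experimental analysis to validate our BEV-space formulation and the design choices of each module. On the validation subset with occluded target locations, our method improves accuracy averaged across geodesic thresholds by 22.74 percentage points over the state-of-the-art image-space baseline. Our project page is: https://xin-yu-gao.github.io/beacon.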
Summary
BEACON addresses the challenge of predicting target locations in occluded regions for language-conditioned local navigation. It predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region using a vision-language model and depth-derived BEV features. Experiments show a 22.74 percentage point improvement in accuracy over existing methods on the validation subset with occluded target locations.
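The work aims to improve language-conditioned local navigation by addressing the challenge of inferring target locations in occluded regions. BEACON predicts an ego-centric BEV affordance heatmap over a local region by injecting spatial cues into a vision-language model and fusing its output with depth-derived BEV features. Experiments show that, on the validation subset with occluded target locations, accuracy improves by 22.74 percentage points over the state-of-the-art image-space baseline.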
From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding
Authors: Wenzhao Xiang, Yue Wu, Hongyang Yu, Feng Gao, Fan Yang, Xilin Chen
First: 2026-03-10T17:51:12+00:00 · Latest: 2026-03-10T17:51:12+00:00
Abstract
Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
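Title / abstract (translated from Chinese)
Title: From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding
Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" caused by semantically agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene level), instance masks (object level), and RGB images (pixel level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder reconstructs sequentially from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we build a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design for learning more robust and generalizable representations.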
Summary
C2FMAE is a coarse-to-fine masked autoencoder that addresses the limitations of contrastive learning and masked image modeling by learning hierarchical visual representations across semantic, instance, and pixel levels. It uses a cascaded decoder and a progressive masking curriculum to enforce a top-down learning principle, improving performance on image classification, object detection, and semantic segmentation. The method constructs a large-scale multi-granular dataset with high-quality pseudo-labels to support its framework and demonstrates significant gains in robust and generalizable representations.
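C2FMAE is a coarse-to-fine masked autoencoder that addresses the limitations of contrastive learning and masked image modeling by learning hierarchical visual representations at the semantic, instance, and pixel levels. It uses a cascaded decoder that reconstructs sequentially from scene semantics to object instances to pixel details, and a progressive masking curriculum that gradually shifts the training focus from global to local features. Experiments show improved performance on image classification, object detection, and semantic segmentation, supporting the effectiveness of the hierarchical design for learning robust and general representations.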
Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Authors: Luca Ciampi, Gabriele Lagani, Giuseppe Amato, Fabrizio Falchi
First: 2025-04-02T09:41:43+00:00 · Latest: 2026-03-10T17:47:57+00:00
Abstract
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi-supervised learning is a well-established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi-supervised teacher-student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle-consistency constraint based on noise-corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling capabilities. Finally, to further enhance performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation
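Title / abstract (translated from Chinese)
Title: Semi-Supervised Biomedical Image Segmentation via Diffusion Models and Teacher-Student Co-Training
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it usually requires large annotated training datasets, which limits its scalability in clinical settings. Semi-supervised learning, which uses both labeled and unlabeled data, is a well-established way to address this challenge. In this paper we introduce a novel semi-supervised teacher-student framework inspired by the recent success of generative models. Our approach uses denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained without supervision using a cycle-consistency constraint based on reconstructing noise-corrupted images, which enables it to generate informative semantic masks. The teacher is then integrated into a co-training process with a twin-student network. The student learns from ground-truth labels when they are available and from teacher-generated pseudo-labels otherwise, while the teacher continuously improves its pseudo-labeling ability. Finally, to further improve performance, we introduce a multi-round pseudo-label generation strategy that iteratively improves the pseudo-labeling process. We evaluate our approach on several biomedical imaging benchmarks spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state-of-the-art semi-supervised techniques, highlighting its effectiveness when annotated data is limited. The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation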
Summary
This paper presents a semi-supervised learning framework for biomedical image segmentation using diffusion models and teacher-student co-training. The method leverages both labeled and unlabeled data to improve segmentation accuracy without requiring large annotated datasets. The teacher model is trained unsupervised to generate informative semantic masks, which are then used to guide the learning of a student model through co-training. The student learns from both ground-truth labels and pseudo-labels generated by the teacher. The approach also includes a multi-round pseudo-label generation strategy to iteratively refine the pseudo-labeling process. Experiments on various biomedical imaging benchmarks demonstrate that this method outperforms existing semi-supervised techniques, especially in scenarios with limited annotated data.
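The paper proposes a semi-supervised teacher-student framework based on diffusion models for biomedical image segmentation. The method uses denoising diffusion probabilistic models to generate segmentation masks from noisy inputs. The teacher model is trained without supervision to produce pseudo-labels, which are used together with ground-truth labels to train the student model. The method also includes a multi-round pseudo-label generation strategy to improve performance. Experiments on several biomedical imaging benchmarks show that it outperforms existing semi-supervised techniques when annotated data is limited.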
When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
Authors: Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí
First: 2026-03-10T17:46:31+00:00 · Latest: 2026-03-10T17:46:31+00:00
Abstract
Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor-critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon from the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neuron's inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
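Title / abstract (translated from Chinese)
Title: When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic
Deep reinforcement learning systems are highly sensitive to the learning rate (LR), and selecting stable, well-performing training runs usually requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor-critic methods, small LR values lead to slow convergence, whereas large LR values can cause instability or collapse. We analyse this phenomenon with the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between LR and activation sign changes, clarifying how a correct evolution of the neurons' inner structure depends on the step size. Empirically, across three discrete-control environments, we show that OUI measured at 10% of training already distinguishes between LR regimes. We observe a consistent asymmetry: critic networks that achieve the highest return sit in an intermediate OUI band (avoiding saturation), whereas actor networks that achieve the highest return show comparatively high OUI values. We then compare OUI-based screening rules with early-return, clip-based, divergence-based, and flip-based criteria under matched recall. In this setting OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI gives the highest precision in the best-performing screening regimes, allowing unpromising runs to be pruned aggressively without full training.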
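Summary
This study investigates the sensitivity of Proximal Policy Optimization (PPO) actor-critic methods to the learning rate (LR) by analyzing the behavior of hidden neurons using the Overfitting-Underfitting Indicator (OUI). The research shows that OUI measured early in training can effectively discriminate between different LR regimes. Critic networks achieving high returns operate in an intermediate OUI band, while actor networks exhibit higher OUI values. Under matched recall, OUI-based screening gives a stronger early signal than the early-return, clip-based, divergence-based, and flip-based criteria it is compared with, enabling efficient pruning of unpromising runs.
The paper studies the sensitivity of Proximal Policy Optimization (PPO) actor-critic methods to the learning rate (LR), using the Overfitting-Underfitting Indicator (OUI) to analyze the behavior of hidden neurons. It shows that OUI measured early in training can effectively distinguish between LR regimes: critic networks that achieve high returns operate in an intermediate OUI band, while actor networks that achieve high returns show higher OUI values. The experiments indicate that OUI provides a strong early screening signal that can efficiently filter out unpromising runs without full training.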
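The comparison "under matched recall" can be made concrete with a small sketch. It assumes each run has one early-training score where higher means more promising (for example early return, or a score derived from OUI) and a label saying whether the fully trained run succeeded. The function name and the keep-top-k screening rule are illustrative assumptions, not the paper's exact protocol.

```python
def precision_at_recall(scores, successful, target_recall):
    """Precision of the strictest keep-top-k rule that reaches target_recall.

    scores: early-training signal per run (higher = more promising).
    successful: booleans, True if the fully trained run succeeded.
    """
    total_success = sum(successful)
    if total_success == 0:
        raise ValueError("need at least one successful run")
    # Rank runs from most to least promising according to the early signal.
    ranked = sorted(zip(scores, successful), key=lambda pair: pair[0], reverse=True)
    kept_success = 0
    for kept, (_, ok) in enumerate(ranked, start=1):
        kept_success += ok
        if kept_success / total_success >= target_recall:
            return kept_success / kept
    # Only reached if target_recall exceeds 1: every run is kept.
    return total_success / len(ranked)
```

Evaluating two criteria at the same `target_recall` keeps the comparison fair: both retain the same share of successful runs, and the criterion with higher precision wastes less compute on runs that end up failing.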
No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
Authors: Yundi Zhang, Sevgi Gokce Kafali, Niklas Bubeck, Daniel Rueckert, Jiazhen Pan
First: 2026-03-10T17:38:38+00:00 · Latest: 2026-03-10T17:38:38+00:00
Abstract
Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, forcing an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: it attempts to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space, rather than directly extracting the low-dimensional physiological labels actually required for diagnosis. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images into a shared semantic manifold. Leveraging a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore anatomical information lost to undersampling directly within the latent space, bypassing the explicit inverse problem for downstream analysis. We demonstrate that this latent alignment enables the dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR achieves highly competitive performance against state-of-the-art image-domain baselines. By showcasing that precise spatial geometries and multi-task features can be successfully recovered directly from the k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.
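Title / abstract (translated from Chinese)
Title: No Image, No Problem: End-to-End Multi-Task Cardiac Analysis from Undersampled k-Space
Conventional clinical CMR pipelines rely on a sequential "reconstruct-then-analyze" paradigm, which forces an ill-posed intermediate step that introduces avoidable artifacts and information bottlenecks. This creates a fundamental mathematical paradox: the pipeline tries to recover high-dimensional pixel arrays (i.e., images) from undersampled k-space instead of directly extracting the low-dimensional physiological labels that diagnosis actually requires. To unlock the direct diagnostic potential of k-space, we propose k-MTR (k-space Multi-Task Representation), a k-space representation learning framework that aligns undersampled k-space data and fully-sampled images in a shared semantic manifold. Using a large-scale controlled simulation of 42,000 subjects, k-MTR forces the k-space encoder to restore, directly in the latent space, the anatomical information lost to undersampling, bypassing the explicit inverse problem for downstream analysis. We show that this latent alignment yields a dense latent space embedded with high-level physiological semantics directly from undersampled frequencies. Across continuous phenotype regression, disease classification, and anatomical segmentation, k-MTR is highly competitive with state-of-the-art image-domain baselines. By showing that precise spatial geometry and multi-task features can be recovered directly from k-space representations, k-MTR provides a robust architectural blueprint for task-aware cardiac MRI workflows.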
Summary
The research aims to improve cardiac MRI analysis by directly extracting physiological labels from undersampled k-space data, avoiding the intermediate image reconstruction step. The k-MTR framework aligns undersampled k-space data with fully-sampled images in a shared semantic manifold, allowing for direct extraction of high-level physiological semantics. Experiments show that k-MTR is highly competitive with state-of-the-art image-domain baselines in continuous phenotype regression, disease classification, and anatomical segmentation, demonstrating the potential for task-aware cardiac MRI workflows directly from k-space data.
The paper proposes the k-MTR framework, which extracts physiological labels directly from undersampled frequencies by aligning undersampled k-space data and fully-sampled images in a shared semantic manifold. Experiments show that k-MTR is highly competitive with existing image-domain baselines on continuous phenotype regression, disease classification, and anatomical segmentation, demonstrating the potential of using k-space directly for diagnosis without explicit image reconstruction.
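For readers unfamiliar with the input k-MTR operates on, the sketch below shows what undersampled k-space is, using NumPy's FFT routines. The sampling pattern (every `acceleration`-th row plus a few central rows on a Cartesian grid) and the function names are illustrative assumptions; the abstract does not specify the paper's sampling scheme.

```python
import numpy as np


def undersample_kspace(image, acceleration=4, center_lines=8):
    """Return centered k-space with rows zeroed out, plus the row mask used.

    Keeps every `acceleration`-th row and `center_lines` rows around the
    k-space center (an assumed Cartesian pattern).
    """
    # k-space is the 2D Fourier transform of the image, shifted so that
    # low frequencies sit in the middle of the array.
    kspace = np.fft.fftshift(np.fft.fft2(image))
    n_rows = image.shape[0]
    mask = np.zeros(n_rows, dtype=bool)
    mask[::acceleration] = True
    mid = n_rows // 2
    mask[mid - center_lines // 2 : mid + center_lines // 2] = True
    return kspace * mask[:, None], mask


def zero_filled_reconstruction(kspace):
    """Naive inverse FFT of centered k-space; missing rows stay zero."""
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace)))
```

The zero-filled reconstruction is the ill-posed intermediate image that a "reconstruct-then-analyze" pipeline has to clean up before analysis. A k-space-native model would instead consume the masked array returned by `undersample_kspace` directly.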
Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Authors: Angad Singh Ahuja
First: 2026-03-07T19:06:49+00:00 · Latest: 2026-03-10T17:36:57+00:00
Comments: 25 pages, 3 figures
Abstract
Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting where an adversary selects a hidden initial latent distribution before the episode, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we demonstrate that targeted exposure to shifted latent distributions reduces average robustness gaps between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget. Furthermore, iterative best-response training exhibits budget-sensitive behavior that is qualitatively consistent with the theorem-guided diagnostics once one accounts for discounted PPO surrogates and finite-sample noise. Ultimately, we show that for latent-initial-state problems, the framework yields a clean evaluation game and useful theorem-motivated diagnostics while also making clear where implementation-level surrogates and optimization limits enter.
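Title / abstract (translated from Chinese)
Title: Adversarial Latent-State Training for Robust Policies in Partially Observable Domains
Robustness under latent distribution shift remains challenging in partially observable reinforcement learning. We formalize a focused setting in which an adversary selects a hidden initial latent distribution before the episode begins, termed an adversarial latent-initial-state POMDP. Theoretically, we prove a latent minimax principle, characterize worst-case defender distributions, and derive approximate best-response inequalities with finite-sample concentration bounds that make the optimization and sampling terms explicit. Empirically, using a Battleship benchmark, we show that targeted exposure to shifted latent distributions reduces the average robustness gap from 10.3 to 3.1 shots at equal budget. In addition, iterative best-response training shows budget-sensitive behavior that is consistent with the theorem-guided diagnostics once discounted PPO surrogates and finite-sample noise are taken into account. Ultimately, we show that for latent-initial-state problems the framework yields a clean evaluation game and useful theorem-motivated diagnostics, while also making clear where implementation-level surrogates and optimization limits enter.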
Summary / 总结
This paper addresses the challenge of robustness in partially observable reinforcement learning under latent distribution shift. It introduces an adversarial latent-initial-state POMDP where an adversary can choose the hidden initial state before the episode. Theoretical analysis proves a latent minimax principle and derives approximate best-response inequalities with finite-sample concentration bounds. Empirically, the method reduces the robustness gap between Spread and Uniform distributions from 10.3 to 3.1 shots at equal budget using a Battleship benchmark. Iterative best-response training shows budget-sensitive behavior consistent with the theorem-guided diagnostics, despite the use of discounted PPO surrogates and finite-sample noise.
该论文研究部分可观测强化学习在潜在分布偏移下的鲁棒性问题。它形式化了对抗潜在初始状态POMDP,其中对手在回合开始前选择隐藏的初始潜在分布。理论分析证明了潜在最小最大原则,并推导出带有有限样本集中界的近似最优响应不等式。实验上,在Battleship基准上,该方法在相同预算下将Spread和Uniform分布之间的平均鲁棒性差距从10.3发减少到3.1发。在考虑折扣PPO替代目标和有限样本噪声之后,迭代最优响应训练表现出与定理引导的诊断定性一致的预算敏感行为。
PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs
Authors: Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu
First: 2026-03-10T17:35:49+00:00 · Latest: 2026-03-10T17:35:49+00:00
Abstract
Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
中文标题/摘要
标题:PathMem:病理MLLM的认知对齐记忆转换
计算病理学既需要视觉模式识别,也需要动态整合结构化的领域知识,包括分类学、分级标准和临床证据。实践中,诊断推理需要将形态学证据与正式的诊断和分级标准联系起来。尽管多模态大型语言模型(MLLMs)展示了强大的视觉语言推理能力,但它们缺乏明确的知识结构整合机制和可解释的记忆控制。因此,现有模型在推理过程中难以一致地整合病理学特定的诊断标准。受人类病理学家分层记忆过程的启发,我们提出PathMem,这是一种以记忆为中心的多模态框架,用于病理学MLLMs。PathMem 将结构化的病理学知识组织成长期记忆(LTM),并引入了一个记忆变换器,通过多模态记忆激活和上下文感知的知识接地,建模从LTM到工作记忆(WM)的动态过渡,从而实现上下文感知的记忆细化以支持下游推理。PathMem 在基准测试中实现了SOTA性能,WSI-Bench报告生成的WSI-精确度提高了12.8%,WSI-相关性提高了10.1%,开放性诊断分别提高了9.7%和8.9%,优于之前的基于WSI的模型。
Summary / 总结
PathMem is a memory-centric multimodal framework that strengthens the reasoning of pathology multimodal large language models (MLLMs) by integrating structured domain knowledge. It organizes pathology knowledge as long-term memory (LTM) and uses a Memory Transformer to model the dynamic transition from LTM to working memory (WM), enabling context-aware memory refinement. PathMem improves WSI-Bench report generation by 12.8% in WSI-Precision and 10.1% in WSI-Relevance, and improves open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
PathMem 是一种以记忆为中心的多模态框架,旨在通过整合结构化领域知识来增强病理多模态大型语言模型(MLLMs)的推理能力。它将病理知识组织成长期记忆(LTM),并使用记忆变换器建模从 LTM 到工作记忆(WM)的动态过渡,从而实现上下文感知的记忆精炼。PathMem 在 WSI-Bench 报告生成上将 WSI-Precision 提高了 12.8%,WSI-Relevance 提高了 10.1%,并且在开放性诊断方面相对于之前基于 WSI 的模型分别提高了 9.7% 和 8.9%。
MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers
Authors: Arash Ahmadi, Sarah Sharif, Yaser M. Banad
First: 2025-04-11T22:19:48+00:00 · Latest: 2026-03-10T17:34:59+00:00
Comments: 42 pages, 28 figures
Abstract
Large Language Models (LLMs) are increasingly augmented with external tools through standardized interfaces like the Model Context Protocol (MCP). However, current MCP implementations face critical limitations: they typically require local process execution through STDIO transports, making them impractical for resource-constrained environments like mobile devices, web browsers, and edge computing. We present MCP Bridge, a lightweight RESTful proxy that connects to multiple MCP servers and exposes their capabilities through a unified API. Unlike existing solutions, MCP Bridge is fully LLM-agnostic, supporting any backend regardless of vendor. The system implements a risk-based execution model with three security levels (standard execution, confirmation workflow, and Docker isolation) while maintaining backward compatibility with standard MCP clients. However, reliable execution within this framework requires models that can strictly adhere to protocol schemas. To this end, we also fine-tuned the Qwen3 4B and 8B models on the Agent-Ark/Toucan-1.5M dataset using four Reinforcement Learning techniques: Group Relative Policy Optimization (GRPO), Dr. GRPO, Beta Normalization Policy Optimization (BNPO), and Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO). Evaluated on the MCPToolBench++ benchmark, our optimized model achieves an F1 score of 73.0%, outperforming GPT-OSS-120B (62.17%) and remaining competitive with the 70B+ parameter baselines. Evaluation demonstrates that MCP Bridge successfully addresses the constraints of direct MCP connections while providing enhanced security controls and cross-platform compatibility, enabling sophisticated LLM-powered applications in previously inaccessible environments.
中文标题/摘要
标题:MCP桥接器:一种轻量级、LLM无关的RESTful代理,用于模型上下文协议服务器
大型语言模型(LLMs)越来越多地通过模型上下文协议(MCP)等标准化接口与外部工具集成。然而,当前的MCP实现面临关键限制:它们通常需要通过STDIO传输进行本地进程执行,这使得它们在资源受限的环境中(如移动设备、网页浏览器和边缘计算)不切实际。我们提出了MCP桥接器,这是一种轻量级的RESTful代理,可以连接到多个MCP服务器并通过统一的API暴露其功能。与现有解决方案不同,MCP桥接器完全不依赖于特定的LLM,支持任何后端,无论供应商如何。该系统实现了一种基于风险的执行模型,具有三个安全级别:标准执行、确认工作流和Docker隔离,同时保持与标准MCP客户端的向后兼容性。然而,在此框架内可靠执行需要能够严格遵守协议模式的模型。为此,我们还在Agent-Ark/Toucan-1.5M数据集上,使用四种强化学习技术对Qwen3 4B和8B模型家族进行了微调:组相对策略优化(GRPO)、Dr. GRPO、贝塔归一化策略优化(BNPO)以及解耦剪辑与动态采样策略优化(DAPO)。在MCPToolBench++基准测试中,我们的优化模型实现了73.0%的F1分数,优于GPT-OSS-120B(62.17%),并且与70B+参数基线保持竞争力。评估表明,MCP桥接器成功地解决了直接MCP连接的限制,同时提供了增强的安全控制和跨平台兼容性,使复杂的LLM驱动的应用程序能够在以前无法访问的环境中运行。
Summary / 总结
The paper introduces MCP Bridge, a lightweight RESTful proxy for connecting to multiple Model Context Protocol (MCP) servers, which supports any LLM backend without vendor-specific limitations. It implements a risk-based execution model with three security levels and maintains backward compatibility with standard MCP clients. The authors fine-tuned the Qwen3 4B and 8B models using four RL techniques and achieved an F1 score of 73.0% on MCPToolBench++, outperforming GPT-OSS-120B (62.17%) and remaining competitive with larger models. This system addresses the limitations of direct MCP connections, providing enhanced security and cross-platform compatibility for LLM-powered applications.
MCP Bridge 是一个轻量级的 RESTful 代理,旨在连接多个 Model Context Protocol (MCP) 服务器并通过统一的 API 展示其功能,解决当前 MCP 实现需要本地进程执行且不适用于资源受限环境的问题。该系统支持任何后端 LLM,并实现了一种基于风险的执行模型,包含三个安全级别。作者使用四种强化学习技术对 Qwen3 4B 和 8B 模型进行了微调,并在 MCPToolBench++ 基准测试中取得了 73.0% 的 F1 分数,超过了 GPT-OSS-120B,且与更大规模的模型保持竞争力。
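The abstract above names GRPO and Dr. GRPO among the fine-tuning techniques. As a rough illustration of the group-relative idea only, the sketch below scores each sampled completion against the other completions drawn for the same prompt. The reward values are made up, the function name is hypothetical, and the `normalize_std=False` branch reflects the commonly described Dr. GRPO change of dropping the standard-deviation division; none of this is taken from the paper's training code.

```python
import numpy as np

def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Advantage of each sampled completion relative to its own group.

    rewards: one scalar reward per completion sampled for the same prompt.
    """
    r = np.asarray(rewards, dtype=float)
    centered = r - r.mean()
    if normalize_std:
        # GRPO-style: also divide by the group's reward standard deviation.
        return centered / (r.std() + eps)
    # Dr. GRPO-style (as commonly described): keep the centered reward only.
    return centered

# Illustrative rewards, e.g. 1.0 when a tool call matched the expected schema.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
adv_centered = group_relative_advantages(rewards, normalize_std=False)
```

Completions that beat their group's mean get a positive advantage and the rest get a negative one, so no separate value network is needed to form the baseline.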
Towards Flexible Spectrum Access: Data-Driven Insights into Spectrum Demand
Authors: Mohamad Alkadamani, Amir Ghasemi, Halim Yanikomeroglu
First: 2026-03-10T17:34:16+00:00 · Latest: 2026-03-10T17:34:16+00:00
Comments: 7 pages, 5 figures. Presented at IEEE VTC 2024, Washington, DC. Published in the IEEE conference proceedings
Abstract
In the diverse landscape of 6G networks, where wireless connectivity demands surge and spectrum resources remain limited, flexible spectrum access becomes paramount. The success of crafting such schemes hinges on our ability to accurately characterize spectrum demand patterns across space and time. This paper presents a data-driven methodology for estimating spectrum demand variations over space and identifying key drivers of these variations in the mobile broadband landscape. By leveraging geospatial analytics and machine learning, the methodology is applied to a case study in Canada to estimate spectrum demand dynamics in urban regions. Our proposed model captures 70\% of the variability in spectrum demand when trained on one urban area and tested on another. These insights empower regulators to navigate the complexities of 6G networks and devise effective policies to meet future network demands.
中文标题/摘要
标题:迈向灵活频谱接入:基于数据的频谱需求洞察
在6G网络的多样化景观中,随着无线连接需求的激增和频谱资源的有限性,灵活的频谱接入变得至关重要。构建此类方案的成功取决于我们准确刻画空间和时间上的频谱需求模式的能力。本文提出了一种基于数据的方法,用于估计空间上的频谱需求变化,并识别移动宽带环境中这些变化的关键驱动因素。通过利用地理空间分析和机器学习,该方法应用于加拿大的一个案例研究,以估计城市地区的频谱需求动态。我们提出的模型在训练于一个城市区域并在另一个城市区域测试时,能够捕捉到70%的频谱需求变化。这些见解使监管机构能够应对6G网络的复杂性,并制定有效的政策以满足未来网络需求。
Summary / 总结
This paper characterizes spectrum demand patterns to support flexible spectrum access in 6G networks. It uses geospatial analytics and machine learning to estimate spectrum demand variations in urban regions of Canada and to identify their key drivers. The model captures 70% of the variability in spectrum demand when trained on one urban area and tested on another, providing insights regulators can use when devising spectrum policy for 6G networks.
该论文旨在通过刻画6G网络中的频谱需求模式来支持灵活的频谱接入。它使用地理空间分析和机器学习来估计加拿大城市地区频谱需求的变化,并识别其关键驱动因素。该模型在一个城市区域训练并在另一个城市区域测试时,能够捕捉到70%的频谱需求变化,为6G网络的监管政策提供了有价值的见解。
SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
Authors: Fredrik K. Gustafsson, Xiao Gu, Mattia Carletti, Patitapaban Palo, David W. Eyre, David A. Clifton
First: 2026-03-10T17:32:28+00:00 · Latest: 2026-03-10T17:32:28+00:00
Comments: Code is available at https://github.com/fregu856/SignalMC-MED
Abstract
Recent biosignal foundation models (FMs) have demonstrated promising performance across diverse clinical prediction tasks, yet systematic evaluation on long-duration multimodal data remains limited. We introduce SignalMC-MED, a benchmark for evaluating biosignal FMs on synchronized single-lead electrocardiogram (ECG) and photoplethysmogram (PPG) data. Derived from the MC-MED dataset, SignalMC-MED comprises 22,256 visits with 10-minute overlapping ECG and PPG signals, and includes 20 clinically relevant tasks spanning prediction of demographics, emergency department disposition, laboratory value regression, and detection of prior ICD-10 diagnoses. Using this benchmark, we perform a systematic evaluation of representative time-series and biosignal FMs across ECG-only, PPG-only, and ECG + PPG settings. We find that domain-specific biosignal FMs consistently outperform general time-series models, and that multimodal ECG + PPG fusion yields robust improvements over unimodal inputs. Moreover, using the full 10-minute signal consistently outperforms shorter segments, and larger model variants do not reliably outperform smaller ones. Hand-crafted ECG domain features provide a strong baseline and offer complementary value when combined with learned FM representations. Together, these results establish SignalMC-MED as a standardized benchmark and provide practical guidance for evaluating and deploying biosignal FMs.
中文标题/摘要
标题:SignalMC-MED:一种用于评估生物信号基础模型在单导联ECG和PPG上的多模态基准
近年来,生物信号基础模型(FMs)在多种临床预测任务中表现出有希望的性能,但在长时程多模态数据上的系统评估仍然有限。我们引入了SignalMC-MED基准,用于评估生物信号FMs在同步单导联心电图(ECG)和光电容积描记图(PPG)数据上的表现。该基准源自MC-MED数据集,包含22,256次就诊,每次就诊含有10分钟时间上重叠的ECG和PPG信号,并包括20项临床相关任务,涵盖人口统计学预测、急诊科处置、实验室值回归以及既往ICD-10诊断的检测。使用该基准,我们对代表性的时序和生物信号FMs在仅ECG、仅PPG以及ECG + PPG设置下进行了系统评估。我们发现,特定领域的生物信号FMs始终优于通用时序模型,而多模态ECG + PPG融合相对于单模态输入带来了稳健的改进。此外,使用完整的10分钟信号始终优于较短的片段,而较大的模型变体并不总是优于较小的模型。手工构建的ECG领域特征提供了强大的基线,并且当与学习到的FM表示结合时,提供了补充价值。这些结果共同确立了SignalMC-MED作为标准化基准,并为评估和部署生物信号FMs提供了实用指导。
Summary / 总结
SignalMC-MED is a benchmark for evaluating biosignal foundation models on synchronized single-lead ECG and PPG data, covering 20 clinically relevant tasks. The study finds that domain-specific biosignal models outperform general time-series models, and that multimodal ECG + PPG fusion improves over unimodal inputs. Using the full 10-minute signal consistently outperforms shorter segments, whereas larger model variants do not reliably outperform smaller ones. Hand-crafted ECG features form a strong baseline and complement learned representations.
SignalMC-MED 是一个用于评估同步单导联 ECG 和 PPG 数据的生物信号基础模型的基准,涵盖了 20 个临床相关任务。研究显示,领域特定的生物信号模型优于通用时间序列模型,而多模态 ECG + PPG 融合相对于单模态输入提高了性能。使用完整的 10 分钟信号始终优于较短的片段,而更大的模型并不总是优于较小的模型。手工构建的 ECG 特征是强有力的基线,并能补充学习到的模型表示。
Generative Drifting is Secretly Score Matching: a Spectral and Variational Perspective
Authors: Erkan Turan, Maks Ovsjanikov
First: 2026-03-10T17:30:35+00:00 · Latest: 2026-03-10T17:30:35+00:00
Abstract
Generative Modeling via Drifting has recently achieved state-of-the-art one-step image generation through a kernel-based drift operator, yet the success is largely empirical and its theoretical foundations remain poorly understood. In this paper, we make the following observation: \emph{under a Gaussian kernel, the drift operator is exactly a score difference on smoothed distributions}. This insight allows us to answer all three key questions left open in the original work: (1) whether a vanishing drift guarantees equality of distributions ($V_{p,q}=0\Rightarrow p=q$), (2) how to choose between kernels, and (3) why the stop-gradient operator is indispensable for stable training. Our observations position drifting within the well-studied score-matching family and enable a rich theoretical perspective. By linearizing the McKean-Vlasov dynamics and probing them in Fourier space, we reveal frequency-dependent convergence timescales comparable to \emph{Landau damping} in plasma kinetic theory: the Gaussian kernel suffers an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. We also propose an exponential bandwidth annealing schedule $σ(t)=σ_0 e^{-rt}$ that reduces convergence time from $\exp(O(K_{\max}^2))$ to $O(\log K_{\max})$. Finally, by formalizing drifting as a Wasserstein gradient flow of the smoothed KL divergence, we prove that the stop-gradient operator is derived directly from the frozen-field discretization mandated by the JKO scheme, and removing it severs training from any gradient-flow guarantee. This variational perspective further provides a general template for constructing novel drift operators, demonstrated with a Sinkhorn divergence drift.
中文标题/摘要
标题:生成漂移实际上是秘密的得分匹配:谱和变分视角
通过核基漂移算子进行生成建模的方法最近在单步图像生成方面达到了最先进的水平,但其成功主要是经验性的,其理论基础仍然知之甚少。在本文中,我们观察到:在高斯核下,漂移算子实际上是平滑分布上的得分差。这一洞察使我们能够回答原始工作中留下的三个关键问题:(1)是否漂移消失保证分布相等($V_{p,q}=0\Rightarrow p=q$),(2)如何选择核,以及(3)为什么必须使用停止梯度算子以实现稳定的训练。我们的观察将漂移置于广为研究的得分匹配家族中,并使其具备丰富的理论视角。通过线性化麦康恩-弗拉索夫动力学并在傅里叶空间中进行探查,我们揭示了频率依赖的收敛时间尺度,与等离子体动力学理论中的兰道阻尼相当:高斯核遭受了高频率的指数瓶颈,解释了为何实验上偏好拉普拉斯核。我们还提出了一个指数带宽退火计划$σ(t)=σ_0 e^{-rt}$,将收敛时间从$\exp(O(K_{\max}^2))$减少到$O(\log K_{\max})$。最后,通过将漂移形式化为平滑KL散度的Wasserstein梯度流,我们证明了停止梯度算子直接来源于JKO方案强制的冻结场离散化,移除它将切断训练与任何梯度流保证的联系。这种变分视角还为构建新的漂移算子提供了一个通用模板,通过Sinkhorn散度漂移进行了演示。
Summary / 总结
This paper addresses the theoretical foundations of Generative Modeling via Drifting, a method that uses a kernel-based drift operator for one-step image generation. The authors observe that under a Gaussian kernel, the drift operator is equivalent to a score difference on smoothed distributions, which helps answer key questions about the method's theoretical guarantees. They reveal that the Gaussian kernel has an exponential high-frequency bottleneck, explaining the empirical preference for the Laplacian kernel. Additionally, they propose an exponential bandwidth annealing schedule and formalize drifting as a Wasserstein gradient flow, proving the necessity of the stop-gradient operator for gradient-flow guarantees.
本文探讨了基于漂移操作的生成建模理论基础,该方法使用核基漂移操作进行单步图像生成。作者观察到,在高斯核下,漂移操作等同于光滑分布上的得分差异,为该方法的实证成功提供了理论基础。他们回答了三个关键问题:漂移消失是否意味着分布相等、如何选择核以及停止梯度操作的必要性。研究揭示了高斯核在高频段的瓶颈问题,解释了为何偏好拉普拉斯核。此外,他们提出了一种指数带宽退火计划以提高收敛性,并证明了停止梯度操作对于保持梯度流保证是必不可少的。
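To make the phrase "score difference on smoothed distributions" from the abstract above concrete, here is a minimal NumPy sketch under stated assumptions. It computes the score of a Gaussian-smoothed empirical distribution in closed form, takes the difference between a data sample set and a generated sample set, and evaluates the annealing schedule $σ(t)=σ_0 e^{-rt}$ quoted in the abstract. The function names, the sign convention (data minus generated), and the absence of any extra scaling factor are illustrative choices, not the paper's definitions.

```python
import numpy as np

def smoothed_score(x, samples, sigma):
    """Score (gradient of log density) at x of the empirical distribution
    of `samples` convolved with an isotropic Gaussian of bandwidth sigma."""
    diffs = samples - x                               # shape (n, d)
    logits = -np.sum(diffs**2, axis=1) / (2.0 * sigma**2)
    w = np.exp(logits - logits.max())                 # stable softmax weights
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / sigma**2

def score_difference(x, data, generated, sigma):
    """Smoothed data score minus smoothed generated score (assumed sign)."""
    return smoothed_score(x, data, sigma) - smoothed_score(x, generated, sigma)

def annealed_bandwidth(t, sigma0, r):
    """Exponential bandwidth schedule sigma(t) = sigma0 * exp(-r * t)."""
    return sigma0 * np.exp(-r * t)

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=(200, 2))        # synthetic "data" samples
generated = rng.normal(loc=0.0, size=(200, 2))   # synthetic "generated" samples
x = np.zeros(2)
drift = score_difference(x, data, generated, sigma=annealed_bandwidth(0.0, 1.0, 0.5))
```

When both sample sets coincide the difference is exactly zero, which matches the intuition that the drift vanishes once the generated distribution equals the data distribution.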
Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Authors: Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
First: 2026-03-10T17:26:45+00:00 · Latest: 2026-03-10T17:26:45+00:00
Abstract
Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80\% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
中文标题/摘要
标题:自适应临床感知潜在扩散在多模态脑影像生成和缺失模态插补中的应用
多模态神经影像学为阿尔茨海默病诊断提供了互补的见解,但临床数据集经常存在缺失模态的问题。我们提出了ACADiff框架,通过自适应临床感知扩散来合成缺失的脑影像模态。ACADiff通过逐步去噪潜在表示,同时关注可用的影像数据和临床元数据,学习不完整多模态观察与目标模态之间的映射关系。该框架采用自适应融合,根据输入可用性动态重新配置,并通过GPT-4o编码提示提供语义临床指导。三个专门的生成器使sMRI、FDG-PET和AV45-PET之间实现双向合成。在ADNI受试者上进行评估,ACADiff在生成质量上表现出色,并且即使在80%的极端缺失情况下仍能保持稳健的诊断性能,优于所有现有基线。为了促进可重复性,代码可在https://github.com/rongzhou7/ACADiff获取
Summary / 总结
The research aims to address the issue of missing modalities in clinical neuroimaging datasets for Alzheimer's disease diagnosis. ACADiff, a framework that uses adaptive clinical-aware latent diffusion, is proposed to synthesize missing brain imaging modalities. By progressively denoising latent representations and attending to available imaging data and clinical metadata, ACADiff achieves high-quality multimodal brain image generation and maintains robust diagnostic performance even with 80% missing data, outperforming existing methods. The framework includes three specialized generators for bidirectional synthesis among sMRI, FDG-PET, and AV45-PET, and the code is available for reproducibility.
研究旨在解决临床神经影像数据集中缺失模态的问题,以提高阿尔茨海默病的诊断。ACADiff框架采用自适应临床感知的潜在扩散方法,用于合成缺失的大脑影像模态。通过逐步去噪潜在表示并关注可用的影像数据和临床元数据,ACADiff实现了高质量的多模态大脑影像生成,并在80%数据缺失的情况下仍保持稳健的诊断性能,优于现有方法。该框架包括三个专门的生成器,用于sMRI、FDG-PET和AV45-PET之间的双向合成,代码已公开以促进可重复性。
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Authors: Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao
First: 2026-03-10T17:26:42+00:00 · Latest: 2026-03-10T17:26:42+00:00
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable text-motion alignment. Extensive experiments on HumanML3D and KIT-ML show that our method outperforms state-of-the-art text-motion retrieval approaches while offering interpretable fine-grained correspondences between text and motion. The code is available in the supplementary material.
中文标题/摘要
标题:基于关节角度运动图像和标记块晚交互的细粒度运动检索
文本-运动检索旨在学习自然语言描述与3D人体运动骨架序列之间的语义对齐潜在空间,使两种模态之间能够双向搜索。现有大多数方法使用双编码器框架,将运动和文本压缩为全局嵌入,丢弃了细粒度的局部对应关系,从而降低了准确性。此外,这些全局嵌入方法对检索结果的可解释性有限。为克服这些限制,我们提出了一种可解释的基于关节角度的运动表示,将关节级局部特征映射到与预训练的视觉变换器兼容的结构化伪图像。对于文本到运动检索,我们采用MaxSim,这是一种标记级晚交互机制,并通过掩码语言建模正则化增强它,以促进稳健且可解释的文本-运动对齐。在HumanML3D和KIT-ML上的广泛实验表明,我们的方法优于最先进的文本-运动检索方法,同时提供了文本与运动之间可解释的细粒度对应关系。代码可在补充材料中获取。
Summary / 总结
This paper addresses the challenge of text-motion retrieval by proposing a joint-angle-based motion representation that maps joint-level local features into structured pseudo-images compatible with Vision Transformers. The method uses MaxSim with Masked Language Modeling regularization for text-to-motion retrieval, offering interpretable fine-grained correspondences. Experiments on HumanML3D and KIT-ML demonstrate superior performance compared to existing methods.
研究旨在通过解决现有全局嵌入方法的局限性,提高文本-动作检索的准确性和可解释性。提出的方法使用基于关节角度的动作表示,并采用一种基于标记的晚期交互机制MaxSim,该机制通过掩码语言建模正则化增强。在HumanML3D和KIT-ML上的实验表明,该方法优于最先进的方法,并提供了文本和动作之间的可解释的细粒度对应关系。
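The MaxSim late-interaction score mentioned in the abstract above can be sketched in a few lines. The version below follows the usual late-interaction recipe: cosine similarity between every text token and every motion patch, a maximum over patches for each token, and a sum over tokens. The embeddings are toy vectors, and the function name and the use of cosine similarity are assumptions rather than details confirmed by the abstract.

```python
import numpy as np

def maxsim_score(text_tokens, motion_patches):
    """Late-interaction score between one text and one motion.

    text_tokens:    (num_tokens, dim) token embeddings
    motion_patches: (num_patches, dim) patch embeddings of the motion image
    Returns the score and, per token, the index of its best-matching patch.
    """
    t = text_tokens / np.linalg.norm(text_tokens, axis=1, keepdims=True)
    p = motion_patches / np.linalg.norm(motion_patches, axis=1, keepdims=True)
    sim = t @ p.T                          # (num_tokens, num_patches) cosine similarities
    best_patch = sim.argmax(axis=1)        # which patch each token attends to
    return float(sim.max(axis=1).sum()), best_patch

# Toy 2-D embeddings: two tokens, three patches.
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
score, best_patch = maxsim_score(tokens, patches)
```

The `best_patch` indices are what makes this kind of scoring inspectable: each text token points at the single patch that contributed its similarity.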
A Distributional Treatment of Real2Sim2Real for Object-Centric Agent Adaptation in Vision-Driven Deformable Linear Object Manipulation
Authors: Georgios Kamaras, Subramanian Ramamoorthy
Venue: In IEEE Robotics and Automation Letters, Volume 10, Issue 8, August 2025, Pages 8075-8082
First: 2025-02-25T20:01:06+00:00 · Latest: 2026-03-10T17:25:35+00:00
Abstract
We present an integrated (or end-to-end) framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) based on visual perception. Working with a parameterised set of DLOs, we use likelihood-free inference (LFI) to compute the posterior distributions for the physical parameters using which we can approximately simulate the behaviour of each specific DLO. We use these posteriors for domain randomisation while training, in simulation, object-specific visuomotor policies (i.e. assuming only visual and proprioceptive sensory) for a DLO reaching task, using model-free reinforcement learning. We demonstrate the utility of this approach by deploying sim-trained DLO manipulation policies in the real world in a zero-shot manner, i.e. without any further fine-tuning. In this context, we evaluate the capacity of a prominent LFI method to perform fine classification over the parametric set of DLOs, using only visual and proprioceptive data obtained in a dynamic manipulation trajectory. We then study the implications of the resulting domain distributions in sim-based policy learning and real-world performance.
中文标题/摘要
标题:基于视觉驱动的可变形线性物体操作中物体中心代理适应的实2仿2实分布处理方法
我们提出了一种集成(或端到端)框架,用于基于视觉感知操纵可变形线性物体(DLOs)的实2仿2实问题。针对参数化的DLO集合,我们使用无似然推断(LFI)来计算物理参数的后验分布,从而可以近似模拟每个特定DLO的行为。我们在仿真训练中使用这些后验分布进行领域随机化,通过无模型强化学习为DLO到达任务训练特定于物体的视觉运动策略(即,假设只有视觉和本体感觉)。我们以零样本方式(即无需任何进一步微调)将仿真训练的DLO操作策略部署到现实世界,以展示该方法的实用性。在此背景下,我们评估了一种主流LFI方法在仅使用动态操作轨迹中获得的视觉和本体感觉数据的情况下,对参数化DLO集合进行精细分类的能力。然后我们研究了所得领域分布对基于仿真的策略学习和现实世界性能的影响。
Summary / 总结
The paper presents an integrated framework for the Real2Sim2Real problem of manipulating deformable linear objects (DLOs) from visual perception. It uses likelihood-free inference (LFI) to compute posterior distributions over each DLO's physical parameters, which are then used for domain randomisation while training object-specific visuomotor policies in simulation with model-free reinforcement learning. The sim-trained policies are deployed in the real world zero-shot, without further fine-tuning. The study evaluates how well a prominent LFI method can finely classify DLOs within the parametric set using only visual and proprioceptive data from a dynamic manipulation trajectory, and examines the implications of the resulting domain distributions for sim-based policy learning and real-world performance.
论文提出了一种集成框架,用于基于视觉感知操纵可变形线性物体(DLOs)的实2仿2实问题。它使用无似然推断(LFI)来计算DLOs物理参数的后验分布,然后在仿真训练中使用这些后验分布进行领域随机化。仿真训练的策略以零样本方式部署到现实世界,无需进一步微调。研究评估了一种主流LFI方法仅使用动态操作轨迹中的视觉和本体感觉数据进行精细分类的能力,并考察了所得领域分布对基于仿真的策略学习和现实世界性能的影响。
Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
Authors: Gilad Lerman, Kang Li, Tyler Maunu, Teng Zhang
First: 2025-06-25T15:23:32+00:00 · Latest: 2026-03-10T17:23:27+00:00
Abstract
Robust subspace estimation is fundamental to many machine learning and data analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and empirically effective approach to this problem, yet its theoretical properties remain poorly understood. This paper establishes that, under deterministic conditions, a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization. We extend these guarantees to affine subspace estimation, a setting that lacks prior recovery theory. Additionally, we illustrate the practical benefits of IRLS through an application to low-dimensional neural network training. Our results provide the first global convergence guarantees for IRLS in robust subspace recovery and, more broadly, for nonconvex IRLS on a Riemannian manifold.
中文标题/摘要
标题:鲁棒子空间恢复中迭代重加权最小二乘法的全局收敛性
鲁棒子空间估计是许多机器学习和数据分析任务的基础。迭代加权最小二乘法(IRLS)是解决这一问题的一个优雅且经验上有效的办法,但其理论性质仍知之甚少。本文在确定性条件下证明,一种带有动态平滑正则化的IRLS变体可以从任何初始化线性收敛到潜在的子空间。我们还将这些保证扩展到仿射子空间估计,这是一个缺乏先验恢复理论的设置。此外,我们通过低维神经网络训练的应用展示了IRLS的实际益处。我们的结果提供了IRLS在鲁棒子空间恢复中的首个全局收敛保证,并且更广泛地,对于黎曼流形上的非凸IRLS提供了首个全局收敛保证。
Summary / 总结
This paper addresses the theoretical understanding of Iteratively Reweighted Least Squares (IRLS) for robust subspace recovery, which is crucial for various machine learning tasks. It proves that a variant of IRLS with dynamic smoothing regularization converges linearly to the underlying subspace from any initialization under deterministic conditions. The study also extends these guarantees to affine subspace estimation, a setting without prior recovery theory. Practical benefits of IRLS are demonstrated through its application in low-dimensional neural network training, providing the first global convergence guarantees for IRLS in robust subspace recovery and for nonconvex IRLS on a Riemannian manifold.
论文探讨了迭代加权最小二乘法(IRLS)在鲁棒子空间恢复中的理论理解,这对于机器学习和数据分析至关重要。研究证明,在确定性条件下,带有动态平滑正则化的IRLS变体可以从任何初始化线性收敛到潜在的子空间。此外,研究还将这些保证扩展到缺乏先验恢复理论的仿射子空间估计。通过将其应用于低维神经网络训练,展示了IRLS的实际益处,并提供了鲁棒子空间恢复中IRLS的第一个全局收敛保证以及非凸IRLS在黎曼流形上的保证。
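For readers unfamiliar with IRLS in this setting, the following is a minimal sketch of a generic IRLS loop for linear subspace recovery, assuming the common formulation in which each point is reweighted by the inverse of its distance to the current subspace. The shrinking smoothing parameter `delta` is only a stand-in for the "dynamic smoothing regularization" mentioned in the abstract; the actual update rule in the paper is not given there, so the geometric schedule, the PCA initialization, and all parameter names here are assumptions.

```python
import numpy as np

def irls_subspace(X, d, iters=50, delta0=1.0, shrink=0.5, delta_min=1e-10):
    """Generic IRLS sketch for a d-dimensional linear subspace.

    X: (n, D) data rows. Returns a (D, d) orthonormal basis.
    """
    # Initialize with ordinary PCA (top-d right singular vectors).
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    U = vt[:d].T
    delta = delta0
    for _ in range(iters):
        residual = X - (X @ U) @ U.T            # component orthogonal to the subspace
        dist = np.linalg.norm(residual, axis=1)
        w = 1.0 / np.maximum(dist, delta)       # smoothed inverse-distance weights
        C = (X * w[:, None]).T @ X              # weighted second-moment matrix
        _, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
        U = eigvecs[:, -d:]                     # top-d eigenvectors
        delta = max(delta * shrink, delta_min)  # assumed shrinking schedule
    return U

# Synthetic check: inliers on the first coordinate axis plus Gaussian outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(100, 1)) * np.array([[1.0, 0.0, 0.0]])
outliers = rng.normal(size=(30, 3))
U = irls_subspace(np.vstack([inliers, outliers]), d=1)
```

Points far from the current subspace get small weights, so outliers lose influence with each pass while points near the subspace dominate the next eigen-decomposition.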
On the Structural Failure of Chamfer Distance in 3D Shape Optimization
Authors: Chang-Yong Song, David Hyde
First: 2026-03-10T17:21:23+00:00 · Latest: 2026-03-10T17:21:23+00:00
Comments: 27 pages, including supplementary material
Abstract
Chamfer distance is the standard training loss for point cloud reconstruction, completion, and generation, yet directly optimizing it can produce worse Chamfer values than not optimizing it at all. We show that this paradoxical failure is gradient-structural. The per-point Chamfer gradient creates a many-to-one collapse that is the unique attractor of the forward term and cannot be resolved by any local regularizer, including repulsion, smoothness, and density-aware re-weighting. We derive a necessary condition for collapse suppression: coupling must propagate beyond local neighborhoods. In a controlled 2D setting, shared-basis deformation suppresses collapse by providing global coupling; in 3D shape morphing, a differentiable MPM prior instantiates the same principle, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5$\times$ improvement on the topologically complex dragon. The presence or absence of non-local coupling determines whether Chamfer optimization succeeds or collapses. This provides a practical design criterion for any pipeline that optimizes point-level distance metrics.
中文标题/摘要
标题:关于3D形状优化中倒角距离结构失效的研究
倒角距离是点云重建、补全和生成的标准训练损失,但直接优化它可能会产生比完全不优化更差的倒角值。我们证明这种反常的失败源于梯度结构。逐点的倒角梯度会产生多对一的坍缩,这是前向项的唯一吸引子,任何局部正则化(包括排斥、平滑性和密度感知重加权)都无法解决。我们推导出抑制坍缩的必要条件:耦合必须传播到局部邻域之外。在受控的2D设置中,共享基变形通过提供全局耦合来抑制坍缩;在3D形状变形中,可微分的MPM先验实现了相同的原则,在20个有向配对上一致地减小了倒角差距,在拓扑复杂的龙模型上取得了2.5倍的改进。非局部耦合的存在与否决定了倒角优化是成功还是坍缩。这为任何优化点级距离度量的流程提供了实用的设计准则。
Summary / 总结
The study addresses the paradoxical failure of directly optimizing Chamfer distance in 3D shape optimization, showing that the per-point gradient creates a many-to-one collapse that no local regularizer can resolve. It derives a necessary condition for suppressing collapse: coupling must propagate beyond local neighborhoods. Shared-basis deformation provides this global coupling in a controlled 2D setting, and a differentiable MPM prior does so in 3D shape morphing, consistently reducing the Chamfer gap across 20 directed pairs with a 2.5x improvement on the topologically complex dragon.
论文探讨了直接优化Chamfer距离为何会导致3D形状优化效果变差的问题。研究发现,逐点梯度会产生多对一的坍缩,而任何局部正则化都无法解决这一问题。研究推导出抑制坍缩的必要条件:耦合必须传播到局部邻域之外。共享基变形和可微MPM先验分别在2D和3D设置中提供了这种非局部耦合,在20个有向配对上一致地减小了Chamfer差距,并在拓扑复杂的龙模型上取得了2.5倍的改进。
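The many-to-one behaviour described in the abstract above is easy to see on a toy example. The sketch below computes the two directed terms of a squared Chamfer distance and the nearest-target assignment for each source point. Treating "forward" as source-to-target is an assumption about the paper's terminology, and the point sets are made up for illustration.

```python
import numpy as np

def chamfer_terms(source, target):
    """Directed squared-Chamfer terms and each source point's nearest target.

    source: (n, d) points being optimized; target: (m, d) reference points.
    """
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    nearest_target = d2.argmin(axis=1)   # target index chosen by each source point
    forward = d2.min(axis=1).mean()      # source -> target term
    backward = d2.min(axis=0).mean()     # target -> source term
    return forward, backward, nearest_target

# Four source points clustered near the origin, four targets spread along a line.
source = np.array([[0.1, 0.0], [0.2, 0.0], [-0.1, 0.0], [0.0, 0.1]])
target = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
forward, backward, nearest_target = chamfer_terms(source, target)
```

Here every source point picks the same target, so a gradient step on the forward term alone pulls all four points toward one location while three targets stay uncovered, which is the collapse pattern the abstract refers to.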
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
Authors: Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He
First: 2026-03-10T17:18:53+00:00 · Latest: 2026-03-10T17:18:53+00:00
Comments: Accepted by CVPR26, codes and weights are publicly available
Abstract
Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
中文标题/摘要
标题:WikiCLIP:开放领域视觉实体识别的一种高效对比基准
开放领域视觉实体识别(VER)旨在将图像与维基百科等百科知识库中的实体关联起来。最近为VER量身定制的生成方法表现出强大的性能,但计算成本高昂,限制了其可扩展性和实际部署。在本文中,我们重新审视了VER中的对比范式,并引入了WikiCLIP,这是一种简单而有效的框架,为开放领域VER建立了强大的高效基准。WikiCLIP利用大型语言模型嵌入作为知识丰富的实体表示,并通过视觉引导知识适配器(VGKA)在图像块(patch)级别对文本语义与视觉线索进行对齐。为了进一步促进细粒度的区分,一种硬负样本合成机制在训练过程中生成视觉相似但语义不同的负样本。在OVEN等流行的开放领域VER基准测试上,实验结果表明,WikiCLIP显著优于强大的基线。具体而言,WikiCLIP在具有挑战性的OVEN未见集上实现了16%的改进,而与领先的生成模型AutoVER相比,推理延迟降低了近100倍。项目页面可在https://artanic30.github.io/project_pages/WikiCLIP/获取。
Summary / 总结
WikiCLIP is a contrastive framework designed for open-domain visual entity recognition, aiming to efficiently associate images with Wikipedia entities. It uses large language model embeddings and a Vision-Guided Knowledge Adaptor to align textual and visual information, and a Hard Negative Synthesis Mechanism to enhance discrimination. WikiCLIP significantly outperforms existing baselines on benchmarks like OVEN, achieving a 16% improvement on the unseen set and reducing inference latency by nearly 100 times compared to the leading generative model, AutoVER.
WikiCLIP 是一种用于开放领域视觉实体识别的对比框架,旨在降低最近生成方法的高计算成本。它使用大型语言模型嵌入和视觉引导知识适配器来对齐文本和视觉信息,并通过困难负样本合成机制增强区分能力。在 OVEN 等基准测试上,WikiCLIP 出色地超越了强基线,实现了 OVEN 未见集 16% 的改进,并将推理延迟降低了近 100 倍,与 AutoVER 相比。
Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Authors: Anika Tabassum Meem, Muntasir Hossain Nadid, Md Zesun Ahmed Mia
First: 2026-02-12T18:15:32+00:00 · Latest: 2026-03-10T17:16:47+00:00
Abstract
Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom jointly optimize accuracy and energy efficiency, with particularly limited exploration on event-based datasets. We propose an energy-aware spike budgeting framework for continual SNN learning that integrates experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Our approach exhibits modality-dependent behavior: on frame-based datasets (MNIST, CIFAR-10), spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while reducing spike rates by up to 47\%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation enables accuracy gains up to 17.45 percentage points with minimal computational overhead. Across five benchmarks spanning both modalities, our method demonstrates consistent performance improvements while minimizing dynamic power consumption, advancing the practical viability of continual learning in neuromorphic vision systems.
中文标题/摘要
标题:基于突触神经网络的神经形态视觉中的能效意识尖峰预算化连续学习
基于突触神经网络(SNN)的神经形态视觉系统为事件驱动和帧驱动的相机提供超低功耗感知,但灾难性遗忘仍然是在不断变化的环境中部署的关键障碍。现有的连续学习方法主要针对人工神经网络开发,很少同时优化准确性和能效,特别是在事件驱动的数据集上探索有限。我们提出了一种能效意识的尖峰预算化框架,用于连续SNN学习,该框架结合了经验回放、可学习的漏型积分-放电神经元参数以及自适应尖峰调度器,在训练过程中强制执行特定数据集的能效约束。我们的方法表现出模态依赖性:在帧驱动的数据集(MNIST,CIFAR-10)上,尖峰预算化作为稀疏性诱导的正则化器,提高准确率的同时将尖峰率降低高达47%;在事件驱动的数据集(DVS-Gesture,N-MNIST,CIFAR-10-DVS)上,受控的预算放松可实现高达17.45个百分点的准确率提升,同时具有最小的计算开销。在五个涵盖不同模态的基准测试中,我们的方法在提高性能的同时最小化动态功耗,推进了神经形态视觉系统中连续学习的实际可行性。
Summary / 总结
The research aims to address catastrophic forgetting in continual learning for neuromorphic vision systems using spiking neural networks (SNNs). It introduces an energy-aware spike budgeting framework that combines experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to optimize both accuracy and energy efficiency. On frame-based datasets (MNIST, CIFAR-10), spike budgeting improves accuracy while reducing spike rates by up to 47%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation yields accuracy gains of up to 17.45 percentage points with minimal computational overhead.
论文针对基于脉冲神经网络(SNN)的神经形态视觉系统中的灾难性遗忘问题,提出了一种能量感知的脉冲预算框架,结合了经验回放、可学习的神经元参数和自适应调度器,以优化准确性和能效。该方法在帧基数据集上实现了高达47%的脉冲率减少,并在事件基数据集上实现了高达17.45个百分点的准确率提升,同时保持了低计算开销,展示了在多种基准上的性能改进和动态功耗的最小化。
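As a rough illustration of the spike-budgeting idea summarized above, the following minimal sketch simulates a single leaky integrate-and-fire neuron and computes a penalty when its spike rate exceeds a budget. The function names (`lif_spike_train`, `spike_rate`, `budget_penalty`), the leak factor `beta`, the hard reset, and the hinge-style penalty are all assumptions made for illustration; the abstract does not specify the authors' scheduler or loss.

```python
def lif_spike_train(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron; return a list of 0/1 spikes."""
    v = 0.0  # membrane potential
    spikes = []
    for x in inputs:
        v = beta * v + x  # leaky integration of the input current
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike (assumed reset rule)
        else:
            spikes.append(0)
    return spikes


def spike_rate(spikes):
    """Fraction of time steps on which the neuron fired."""
    return sum(spikes) / len(spikes) if spikes else 0.0


def budget_penalty(rate, budget, weight=1.0):
    """Hypothetical hinge penalty: zero under budget, linear above it."""
    return weight * max(0.0, rate - budget)


if __name__ == "__main__":
    train = lif_spike_train([0.6] * 10)
    rate = spike_rate(train)
    print(train, rate, budget_penalty(rate, budget=0.3))
```

In a training loop, a penalty of this kind would be added to the task loss so that tightening or relaxing `budget` trades spike activity against accuracy; how the budget is scheduled per dataset is left open here.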
The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Authors: Alper Yıldırım
First: 2026-03-05T14:41:01+00:00 · Latest: 2026-03-10T17:16:04+00:00
Comments: 19 pages, 2 figures, 3 tables. Code available at https://github.com/AlperYildirim1/geometric-grokking
Abstract
Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task's intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
中文标题/摘要
标题:理解“grokking”的几何归纳偏见:通过架构拓扑绕过相变
机制可解释性通常依赖于对训练网络的后验分析。我们采用干预性方法:通过修改架构拓扑来测试假设,观察训练动态。我们研究了在循环模块加法(Zp)上训练的Transformer中的“grokking”现象——延迟泛化,探讨特定的架构自由度是否延长了记忆阶段。 我们确定了标准Transformer中的两个独立结构因素:无界的表示幅度和数据依赖的注意力路由。首先,我们引入了一个完全有界的球形拓扑,强制在整个残差流中进行L2归一化,并使用一个固定温度尺度的嵌入矩阵。这消除了幅度相关的自由度,不使用权重衰减的情况下,将“grokking”起始时间减少了超过20倍。其次,均匀注意力消融用均匀分布覆盖了数据依赖的查询-键路由,将注意力层简化为连续的词袋(CBOW)聚合器。尽管消除了自适应路由,这些模型在所有种子上实现了100%的泛化,并完全绕过了“grokking”延迟。 为了评估这种加速是否是特定任务的几何对齐,而不是通用的优化稳定器,我们使用非交换的S5置换组合作为负对照。在S5上施加球形约束不会加速泛化。这表明消除记忆阶段强烈依赖于使架构先验与任务固有的对称性对齐。这些发现共同提供了干预性证据,表明架构自由度显著影响“grokking”,暗示了一种预测性的结构视角来理解训练动力学。
Summary / 总结
This study investigates grokking in Transformers trained on cyclic modular addition by intervening on architectural topology rather than analyzing trained networks post hoc. It identifies two structural factors that prolong memorization: unbounded representational magnitude and data-dependent attention routing. Enforcing a fully bounded spherical topology reduces grokking onset time by over 20x without weight decay, and replacing query-key routing with uniform attention reaches 100% generalization across all seeds with no grokking delay. A negative control on non-commutative S5 permutation composition shows no acceleration under spherical constraints, suggesting that the effect depends on aligning architectural priors with the task's intrinsic symmetries.
该研究通过修改架构拓扑来探究Transformer中grokking的几何归纳偏见。通过引入完全有界的球形拓扑和均匀注意力,研究人员显著缩短了grokking的起始时间,并完全绕过了记忆阶段。研究结果表明,架构自由度可以显著影响grokking,而将这些先验与任务内在对称性对齐是加速泛化的关键。
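The two interventions described in this entry can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: attention is treated as non-causal, `temperature=10.0` is an arbitrary placeholder, and the helper names (`l2_normalize`, `uniform_attention`, `spherical_logits`) are hypothetical rather than taken from the authors' repository.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Project a vector onto the unit sphere (bounded magnitude)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def uniform_attention(values):
    """Uniform attention weights: every position receives the mean value vector.

    This replaces data-dependent query-key routing with a bag-of-words
    style average over all positions (non-causal, by assumption).
    """
    n = len(values)
    dim = len(values[0])
    mean = [sum(v[d] for v in values) / n for d in range(dim)]
    return [list(mean) for _ in range(n)]


def spherical_logits(hidden, unembed_rows, temperature=10.0):
    """Logits as fixed-temperature cosine similarity to each unembedding row."""
    h = l2_normalize(hidden)
    return [
        temperature * sum(a * b for a, b in zip(h, l2_normalize(row)))
        for row in unembed_rows
    ]


if __name__ == "__main__":
    print(uniform_attention([[1.0, 2.0], [3.0, 4.0]]))
    print(spherical_logits([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-3.0, 0.0]]))
```

Because both the hidden state and the unembedding rows are normalized, the logits are bounded by the temperature regardless of how large the raw activations grow, which is the magnitude-based degree of freedom the abstract says is removed.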
AI-Enabled Data-driven Intelligence for Spectrum Demand Estimation
Authors: Colin Brown, Mohamad Alkadamani, Halim Yanikomeroglu
First: 2026-03-10T17:11:36+00:00 · Latest: 2026-03-10T17:11:36+00:00
Comments: Presented at an IEEE ICC 2025 Workshop and published in the conference proceedings
Abstract
Accurately forecasting spectrum demand is a key component for efficient spectrum resource allocation and management. With the rapid growth in demand for wireless services, mobile network operators and regulators face increasing challenges in ensuring adequate spectrum availability. This paper presents a data-driven approach leveraging artificial intelligence (AI) and machine learning (ML) to estimate and manage spectrum demand. The approach uses multiple proxies of spectrum demand, drawing from site license data and derived from crowdsourced data. These proxies are validated against real-world mobile network traffic data to ensure reliability, achieving an $R^2$ value of 0.89 for an enhanced proxy. The proposed ML models are tested and validated across five major Canadian cities, demonstrating their generalizability and robustness. These contributions assist spectrum regulators in dynamic spectrum planning, enabling better resource allocation and policy adjustments to meet future network demands.
中文标题/摘要
标题:AI驱动的频谱需求估计智能
准确预测频谱需求是高效频谱资源分配和管理的关键组成部分。随着无线服务需求的快速增长,移动网络运营商和监管机构面临着确保足够频谱可用性的不断增加的挑战。本文提出了一种基于人工智能(AI)和机器学习(ML)的数据驱动方法,用于估计和管理频谱需求。该方法利用了频谱需求的多个代理指标,这些指标来源于站点许可数据和从众包数据中推导而来。这些代理指标通过与实际移动网络流量数据进行验证,确保了可靠性,增强后的代理指标的R²值达到0.89。提出的ML模型在加拿大五大城市进行了测试和验证,展示了其普适性和稳健性。这些贡献有助于频谱监管机构进行动态频谱规划,从而更好地进行资源分配和政策调整以满足未来网络需求。
Summary / 总结
This paper addresses the challenge of accurately forecasting spectrum demand to support spectrum allocation and management. It introduces an AI-driven method that uses machine learning to estimate spectrum demand from proxies built on site license data and crowdsourced data. The proxies are validated against real-world mobile network traffic data, with an enhanced proxy reaching an $R^2$ value of 0.89. The models were tested in five major Canadian cities, showing generalizability and robustness, which can aid spectrum regulators in planning and policy adjustments for future network demands.
该论文旨在通过准确预测频谱需求来提高无线服务效率。它提出了一种基于AI的数据驱动方法,使用机器学习模型验证了多种频谱需求的代理指标,包括站点许可数据和众包信息。这些模型在五个主要加拿大城市进行了测试,R$^2$值达到0.89,显示出良好的普适性和鲁棒性。
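For readers unfamiliar with the metric quoted in this entry, the coefficient of determination is $R^2 = 1 - SS_{res} / SS_{tot}$, the share of variance in the observed values explained by the predictions. A minimal sketch of the computation follows; the function name `r_squared` and the toy numbers are illustrative assumptions and have nothing to do with the paper's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]      # toy traffic values (made up)
    proxy = [1.1, 1.9, 3.2, 3.8]         # toy proxy predictions (made up)
    print(r_squared(observed, proxy))
```

A value of 1 means the proxy reproduces the observations exactly, and a value of 0 means it does no better than predicting the mean.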
MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems
Authors: Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li
First: 2026-03-10T17:03:11+00:00 · Latest: 2026-03-10T17:03:11+00:00
Abstract
While Multi-Agent Systems (MAS) show potential for complex clinical decision support, the field remains hindered by architectural fragmentation and the lack of standardized multimodal integration. Current medical MAS research suffers from non-uniform data ingestion pipelines, inconsistent visual-reasoning evaluation, and a lack of cross-specialty benchmarking. To address these challenges, we present MedMASLab, a unified framework and benchmarking platform for multimodal medical multi-agent systems. MedMASLab introduces: (1) A standardized multimodal agent communication protocol that enables seamless integration of 11 heterogeneous MAS architectures across 24 medical modalities. (2) An automated clinical reasoning evaluator, a zero-shot semantic evaluation paradigm that overcomes the limitations of lexical string-matching by leveraging large vision-language models to verify diagnostic logic and visual grounding. (3) The most extensive benchmark to date, spanning 11 organ systems and 473 diseases, standardizing data from 11 clinical benchmarks. Our systematic evaluation reveals a critical domain-specific performance gap: while MAS improves reasoning depth, current architectures exhibit significant fragility when transitioning between specialized medical sub-domains. We provide a rigorous ablation of interaction mechanisms and cost-performance trade-offs, establishing a new technical baseline for future autonomous clinical systems. The source code and data are publicly available at: https://github.com/NUS-Project/MedMASLab/
中文标题/摘要
标题:MedMASLab:统一编排框架,用于评估多模态医疗多智能体系统
尽管多智能体系统(MAS)在复杂临床决策支持方面显示出潜力,但该领域仍受到架构碎片化和缺乏标准化多模态集成的阻碍。当前的医疗MAS研究遭受非统一数据摄入管道、不一致的视觉推理评估和跨专科基准测试不足的困扰。为了解决这些挑战,我们提出了MedMASLab,这是一个用于多模态医疗多智能体系统的统一框架和基准平台。MedMASLab 引入了:(1)一个标准化的多模态智能体通信协议,使11种异构MAS架构在24种医疗模态之间无缝集成。(2)一个自动临床推理评估器,这是一种零样本语义评估范式,通过利用大型视觉语言模型来验证诊断逻辑和视觉定位,克服了词法字符串匹配的局限性。(3)迄今为止最广泛的基准测试,涵盖了11个器官系统和473种疾病,标准化了11个临床基准的数据。我们的系统评估揭示了一个关键的专业领域性能差距:尽管MAS提高了推理深度,但当前架构在从专业医学子领域过渡时表现出显著的脆弱性。我们提供了交互机制和成本性能权衡的严格分析,为未来的自主临床系统建立了新的技术基准。源代码和数据可在:https://github.com/NUS-Project/MedMASLab/ 公开获取。
Summary / 总结
MedMASLab is a unified framework and benchmarking platform for multimodal medical multi-agent systems, addressing the architectural fragmentation and lack of standardized multimodal integration in current research. It introduces a standardized communication protocol that integrates 11 heterogeneous MAS architectures across 24 medical modalities, an automated clinical reasoning evaluator based on large vision-language models, and the most extensive benchmark to date, covering 11 organ systems and 473 diseases drawn from 11 clinical benchmarks. The evaluation reveals a domain-specific performance gap: MAS improves reasoning depth, but current architectures are fragile when moving between specialized medical sub-domains. The framework provides a technical baseline for future autonomous clinical systems, and its code and data are publicly available.
MedMASLab 是一个统一框架,旨在评估多模态医疗多智能体系统,解决当前研究中架构碎片化和缺乏标准化多模态集成的问题。它引入了一个标准化的通信协议,支持11种异构MAS架构在24种医疗模态下的无缝集成,一个自动化的临床推理评估器,以及迄今为止最大的基准测试,涵盖11个器官系统和473种疾病。评估结果显示,通用和专业医疗子领域之间存在显著的性能差距,强调了需要具有鲁棒性的领域特定架构。该框架为未来自主临床系统提供了技术基准,并已公开发布。
Robot Control Stack: A Lean Ecosystem for Robot Learning at Scale
Authors: Tobias Jülg, Pierre Krack, Seongjin Bien, Yannik Blei, Khaled Gamal, Ken Nakahara, Johannes Hechtl, Roberto Calandra, Wolfram Burgard, Florian Walter
Venue: ICRA 2026
First: 2025-09-18T13:12:16+00:00 · Latest: 2026-03-10T16:58:47+00:00
Comments: Accepted at ICRA 2026
Abstract
Vision-Language-Action models (VLAs) mark a major shift in robot learning. They replace specialized architectures and task-tailored components of expert policies with large-scale data collection and setup-specific fine-tuning. In this machine learning-focused workflow that is centered around models and scalable training, traditional robotics software frameworks become a bottleneck, while robot simulations offer only limited support for transitioning from and to real-world experiments. In this work, we close this gap by introducing Robot Control Stack (RCS), a lean ecosystem designed from the ground up to support research in robot learning with large-scale generalist policies. At its core, RCS features a modular and easily extensible layered architecture with a unified interface for simulated and physical robots, facilitating sim-to-real transfer. Despite its minimal footprint and dependencies, it offers a complete feature set, enabling both real-world experiments and large-scale training in simulation. Our contribution is twofold: First, we introduce the architecture of RCS and explain its design principles. Second, we evaluate its usability and performance along the development cycle of VLA and RL policies. Our experiments also provide an extensive evaluation of Octo, OpenVLA, and Pi Zero on multiple robots and shed light on how simulation data can improve real-world policy performance. Our code, datasets, weights, and videos are available at: https://robotcontrolstack.github.io/
中文标题/摘要
标题:机器人控制堆栈:大规模机器人学习的精简生态系统
视觉-语言-行动模型(VLAs)标志着机器人学习的重大转变。它们用大规模数据收集和特定场景的微调取代了专有架构和任务定制的专家策略。在以模型为中心、注重大规模训练的机器学习工作流程中,传统的机器人软件框架成为瓶颈,而机器人模拟仅提供有限的支持,用于从模拟到现实世界的实验过渡。在这项工作中,我们通过引入机器人控制堆栈(RCS),填补了这一空白,RCS 是从头开始设计的,旨在支持大规模通用策略的机器人学习研究。其核心是模块化且易于扩展的分层架构,具有统一的接口,适用于模拟和物理机器人,促进从模拟到现实的过渡。尽管其占用空间和依赖性很小,但它提供了完整的功能集,支持现实世界的实验和大规模的模拟训练。我们的贡献有两个方面:首先,我们介绍了RCS的架构及其设计原则;其次,我们评估了其在VLAs和RL策略开发周期中的可用性和性能。我们的实验还对Octo、OpenVLA和Pi Zero在多种机器人上的表现进行了全面评估,并揭示了模拟数据如何提高现实世界策略性能。我们的代码、数据集、权重和视频可在:https://robotcontrolstack.github.io/ 获取。
Summary / 总结
This paper introduces Robot Control Stack (RCS), a lean ecosystem designed to support robot learning research with large-scale generalist policies such as Vision-Language-Action models. RCS addresses the limitations of traditional robotics software frameworks and offers a modular, layered architecture with a unified interface for both simulated and physical robots, facilitating sim-to-real transfer. The authors evaluate its usability and performance along the development cycle of VLA and RL policies, evaluate Octo, OpenVLA, and Pi Zero on multiple robots, and examine how simulation data can improve real-world policy performance.
本文介绍了Robot Control Stack (RCS),这是一种旨在支持大规模机器人学习的轻量级生态系统,使用Vision-Language-Action模型。RCS解决了传统机器人软件框架的局限性,并提供了一个模块化的架构,具有统一的接口,适用于模拟和物理机器人,便于从模拟到现实的过渡。实验表明,RCS能够支持真实的实验和大规模的模拟训练,并通过模拟数据提高了Octo、OpenVLA和Pi Zero等设备在现实世界中的策略性能。
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
Authors: Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong
First: 2026-03-10T16:50:32+00:00 · Latest: 2026-03-10T16:50:32+00:00
Abstract
Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
中文标题/摘要
标题:将VLMs带入赛场:评估体育中的空间智能
体育运动长期以来一直吸引着广泛的关注,因为它们推动了人类身体和认知能力的极限。随着对视觉语言模型(VLMs)的空间智能兴趣日益增长,体育运动为理解高强度的人体运动和动态物体交互提供了一个自然的测试平台。为此,我们提出了CourtSI,这是首个针对体育场景的空间智能大型数据集。CourtSI 包含超过100万对问答,按照全面的分类系统系统地涵盖了空间计数、距离测量、定位和关系推理,覆盖了代表性网球场上包括羽毛球、网球和乒乓球在内的运动项目。利用明确的场地几何作为度量锚点,我们开发了一种半自动数据引擎来重建体育场景,从而实现CourtSI的大规模整理。此外,我们还引入了CourtSI-Bench,这是一个高质量的评估基准,包含3,686对经过严格人工验证的问答对。我们在CourtSI-Bench上评估了25个专有和开源的VLMs,揭示了人类与AI之间的性能差距,并且现有空间智能基准的泛化能力有限。这些发现表明,体育场景揭示了现有基准所捕捉的空间智能能力的局限性。进一步地,对Qwen3-VL-8B进行微调后,其在CourtSI-Bench上的准确率提高了23.5个百分点。调整后的模型还能够有效泛化到基于类似但未见过的运动构建的CourtSI-Ext评估集,并展示了增强的空间感知评论生成能力。总之,这些发现表明,CourtSI为推动VLMs在体育中的空间智能提供了可扩展的途径。
Summary / 总结
The paper introduces CourtSI, a large-scale dataset for spatial intelligence in sports, containing over 1 million QA pairs covering spatial counting, distance measurement, localization, and relational reasoning across badminton, tennis, and table tennis. CourtSI-Bench, a human-verified evaluation benchmark of 3,686 QA pairs, is used to evaluate 25 VLMs, revealing a human-AI performance gap and limited generalization from existing benchmarks. Fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points; the adapted model also generalizes to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and shows enhanced spatial-aware commentary generation.
论文介绍了CourtSI,一个包含超过100万问答对的大规模体育领域空间智能数据集,涵盖了空间计数、距离测量、定位和关系推理。使用CourtSI-Bench这一高质量评估基准来评估25种VLM,揭示了人类与AI之间的性能差距以及现有基准的有限泛化能力。通过在CourtSI上微调Qwen3-VL-8B,准确率提高了23.5个百分点,并在未见过的运动场景中增强了空间感知的评论生成能力。
MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning
Authors: Yiyang Lu, Yu He, Jianlong Chen, Hongyuan Zha
First: 2026-03-10T16:49:44+00:00 · Latest: 2026-03-10T16:49:44+00:00
Abstract
Continual fine-tuning of large language models (LLMs) is becoming increasingly crucial as these models are deployed in dynamic environments where tasks and data distributions evolve over time. While strong adaptability enables rapid acquisition of new knowledge, it also exposes LLMs to catastrophic forgetting, where previously learned skills degrade during sequential training. Existing replay-based strategies, such as fixed interleaved replay, accuracy-supervised, and loss-driven scheduling, remain limited: some depend on heuristic rules and provide only partial mitigation of forgetting, while others improve performance but incur substantial computational overhead. Motivated by retention dynamics under sequential fine-tuning, we propose Memory-Inspired Sampler and Scheduler Replay (MSSR), an experience replay framework that estimates sample-level memory strength and schedules rehearsal at adaptive intervals to mitigate catastrophic forgetting while maintaining fast adaptation. Extensive experiments across three backbone models and 11 sequential tasks show that MSSR consistently outperforms state-of-the-art replay baselines, with particularly strong gains on reasoning-intensive and multiple-choice benchmarks.
中文标题/摘要
标题:MSSR:面向持续LLM微调的记忆感知自适应重放
随着大型语言模型(LLMs)部署在动态环境中,任务和数据分布随时间演变,持续微调LLMs变得越来越重要。虽然强大的适应性能够快速获取新知识,但也使LLMs面临灾难性遗忘的问题,即在顺序训练过程中之前学到的技能会退化。现有的基于重放的策略,如固定交错重放、准确度监督和损失驱动调度,仍然有限:一些依赖于启发式规则,只能部分缓解遗忘,而另一些则提高了性能但带来了巨大的计算开销。受顺序微调下的保留动态启发,我们提出了记忆启发式采样和调度重放(MSSR),这是一种经验重放框架,能够估计样本级别的记忆强度,并在自适应间隔内安排复习,以缓解灾难性遗忘并保持快速适应。在三个骨干模型和11个顺序任务上的广泛实验表明,MSSR在所有基准测试中都优于最先进的重放基线,特别是在推理密集型和多项选择基准测试中表现尤为突出。
Summary / 总结
The research aims to address catastrophic forgetting in the continual fine-tuning of large language models (LLMs) by proposing MSSR, a memory-aware adaptive replay framework. MSSR estimates sample-level memory strength and schedules rehearsal intervals adaptively to mitigate forgetting while enabling fast adaptation. Experiments across three backbone models and 11 sequential tasks demonstrate that MSSR outperforms existing replay baselines, especially on reasoning-intensive and multiple-choice benchmarks.
MSSR 是一种经验回放框架,旨在减轻大规模语言模型(LLM)连续微调中的灾难性遗忘。它通过估计样本级别的记忆强度并适应性地安排复习来保持快速适应并减少遗忘。实验表明,MSSR 在三个骨干模型和 11 个连续任务上优于现有回放基线,特别是在推理密集型和多项选择基准上表现出色。
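To make the idea of scheduling rehearsal from an estimated memory strength concrete, here is a small sketch. It is not the MSSR algorithm: the abstract gives no formulas, so the exponential forgetting curve, the doubling and halving update rule, and all names and default values (`retention`, `next_interval`, `update_strength`, `min_retention`, `loss_threshold`) are assumptions chosen purely for illustration.

```python
import math


def retention(elapsed_steps, strength):
    """Assumed forgetting curve: retention decays exponentially with elapsed steps."""
    return math.exp(-elapsed_steps / strength)


def next_interval(strength, min_retention=0.5):
    """Steps until the assumed retention curve falls to min_retention."""
    return -strength * math.log(min_retention)


def update_strength(strength, loss, loss_threshold=1.0, growth=2.0, decay=0.5):
    """Hypothetical update: strengthen well-retained samples, weaken forgotten ones."""
    return strength * (growth if loss < loss_threshold else decay)


if __name__ == "__main__":
    s = 10.0
    print(next_interval(s))            # replay sooner for weak memories
    s = update_strength(s, loss=0.2)   # low loss on replay, so memory is stronger
    print(next_interval(s))            # the next rehearsal is spaced further out
```

Under these assumptions, samples that are still predicted well are rehearsed at increasingly long intervals, while samples whose loss has risen are pulled back into the replay buffer sooner, which keeps replay cost concentrated on what is actually being forgotten.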
Influencing LLM Multi-Agent Dialogue via Policy-Parameterized Prompts
Authors: Hongbo Bo, Jingyu Hu, Weiru Liu
First: 2026-03-10T16:47:25+00:00 · Latest: 2026-03-10T16:47:25+00:00
Abstract
Large Language Models (LLMs) have emerged as a new paradigm for multi-agent systems. However, existing research on the behaviour of LLM-based multi-agents relies on ad hoc prompts and lacks a principled policy perspective. Different from reinforcement learning, we investigate whether prompt-as-action can be parameterized so as to construct a lightweight policy which consists of a sequence of state-action pairs to influence conversational behaviours without training. Our framework regards prompts as actions executed by LLMs, and dynamically constructs prompts through five components based on the current state of the agent. To test the effectiveness of parameterized control, we evaluated the dialogue flow based on five indicators: responsiveness, rebuttal, evidence usage, non-repetition, and stance shift. We conduct experiments using different LLM-driven agents in two discussion scenarios related to the general public and show that prompt parameterization can influence the dialogue dynamics. This result shows that policy-parameterised prompts offer a simple and effective mechanism to influence the dialogue process, which will help the research of multi-agent systems in the direction of social simulation.
中文标题/摘要
标题:通过策略参数化提示影响LLM多智能体对话
大型语言模型(LLMs)已成为多智能体系统的新范式。然而,现有基于LLM的多智能体行为研究依赖于随意的提示,缺乏原则性的策略视角。不同于强化学习,我们研究提示是否可以参数化,以便构建一个轻量级策略,该策略由一系列状态-动作对组成,以影响对话行为而无需训练。我们的框架将提示视为由LLM执行的动作,并基于当前智能体的状态动态构建提示,基于五个组件。为了测试参数化控制的有效性,我们根据响应性、反驳、证据使用、不重复和立场转变五个指标评估了对话流程。我们使用两种与公众相关的讨论场景中的不同LLM驱动智能体进行实验,表明提示参数化可以影响对话动态。这一结果表明,策略参数化提示提供了一种简单而有效的机制来影响对话过程,这将有助于多智能体系统研究朝着社会模拟的方向发展。
Summary / 总结
This study explores the use of policy-parameterized prompts to influence the conversational behaviors of LLM-based multi-agent systems. By treating prompts as actions and dynamically constructing them based on the current state, the researchers were able to control dialogue flow without training. The evaluation across five indicators—responsiveness, rebuttal, evidence usage, non-repetition, and stance shift—demonstrated that parameterized prompts effectively influence dialogue dynamics, suggesting a simple and effective method for social simulation in multi-agent systems.
该研究探讨了使用策略参数化提示来影响基于LLM的多智能体系统的对话行为。通过将提示视为动作并根据当前状态动态构建它们,研究人员能够在无需训练的情况下控制对话流程。通过对响应性、反驳、证据使用、不重复和立场转变五个指标的评估,表明参数化提示能够有效影响对话动态,提出了一种简单而有效的社会模拟方法,有助于多智能体系统的研究方向。
LCA: Local Classifier Alignment for Continual Learning
Authors: Tung Tran, Danilo Vasconcellos Vargas, Khoat Than
First: 2026-03-10T16:46:09+00:00 · Latest: 2026-03-10T16:46:09+00:00
Abstract
A fundamental requirement for intelligent systems is the ability to learn continuously under changing environments. However, models trained in this regime often suffer from catastrophic forgetting. Leveraging pre-trained models has recently emerged as a promising solution, since their generalized feature extractors enable faster and more robust adaptation. While some earlier works mitigate forgetting by fine-tuning only on the first task, this approach quickly deteriorates as the number of tasks grows and the data distributions diverge. More recent research instead seeks to consolidate task knowledge into a unified backbone, or to adapt the backbone as new tasks arrive. However, such approaches may create a (potential) mismatch between task-specific classifiers and the adapted backbone. To address this issue, we propose a novel Local Classifier Alignment (LCA) loss to better align the classifier with the backbone. Theoretically, we show that this LCA loss can enable the classifier not only to generalize well for all observed tasks, but also to improve robustness. Furthermore, we develop a complete solution for continual learning, following the model merging approach and using LCA. Extensive experiments on several standard benchmarks demonstrate that our method often achieves leading performance, sometimes surpassing state-of-the-art methods by a large margin.
中文标题/摘要
标题:LCA:连续学习中的局部分类器对齐
智能系统的基本要求是在不断变化的环境中持续学习的能力。然而,在这种模式下训练的模型往往会出现灾难性遗忘。利用预训练模型最近被认为是一种有前途的解决方案,因为它们泛化的特征提取器能够实现更快和更稳健的适应。虽然一些早期的工作通过仅在第一个任务上进行微调来减轻遗忘,但随着任务数量的增加和数据分布的差异,这种方法很快就会失效。更近期的研究则试图将任务知识整合到一个统一的骨干网络中,或者在新任务到来时适应骨干网络。然而,这些方法可能会在任务特定分类器和适应后的骨干网络之间造成(潜在的)不匹配。为了解决这个问题,我们提出了一种新的“局部分类器对齐”(LCA)损失,以更好地使分类器与骨干网络对齐。理论上,我们证明这种LCA损失可以使分类器不仅能够很好地泛化到所有已观察到的任务,还能提高鲁棒性。此外,我们还开发了一个完整的连续学习解决方案,遵循模型合并方法并使用LCA。在多个标准基准上的广泛实验表明,我们的方法通常能够实现领先性能,有时甚至大幅超越现有最佳方法。
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in continual learning by proposing a Local Classifier Alignment (LCA) loss to better align task-specific classifiers with the backbone. Theoretical analysis shows that LCA enhances generalization and robustness. Experiments on standard benchmarks show that the proposed method often outperforms existing approaches, sometimes with significant margins.
论文提出了一种局部分类器对齐(LCA)损失,以更好地使分类器与骨干网络对齐,解决连续学习中的灾难性遗忘问题。理论分析表明,LCA能够提升分类器的泛化能力和鲁棒性。在标准基准上的实验结果表明,所提出的方法通常优于现有方法,有时差距很大。
DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary
Authors: Jiazhi Guan, Quanwei Yang, Luying Huang, Junhao Liang, Borong Liang, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou, Jingdong Wang
First: 2026-03-10T16:40:41+00:00 · Latest: 2026-03-10T16:40:41+00:00
Abstract
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page is available at https://mumuwei.github.io/DISPLAY/
中文标题/摘要
标题:DISPLAY: 通过稀疏运动指导和多任务辅助实现可操控的人机物交互视频生成
以人为中心的视频生成技术取得了快速进展,但现有方法在生成可控且物理上一致的人机物交互(HOI)视频方面仍存在困难。现有工作依赖密集的控制信号、模板视频或精心设计的文字提示,这限制了其灵活性和对新物体的泛化能力。我们提出了一种名为DISPLAY的框架,该框架由稀疏运动指导驱动,仅包含手腕关节坐标和形状无关的对象边界框。这种轻量级的指导减轻了人类和物体表示之间的不平衡,并使用户能够直观地控制。为了在如此稀疏的条件下提高保真度,我们提出了一种对象强调注意力机制,以提高对象的鲁棒性。为了解决高质量HOI数据稀缺的问题,我们进一步开发了一种多任务辅助训练策略,并设计了一个专用的数据整理管道,使模型能够从可靠的HOI样本和辅助任务中受益。全面的实验表明,我们的方法在各种任务中实现了高质量、可控的HOI生成。项目页面可访问 https://mumuwei.github.io/DISPLAY/
Summary / 总结
The research aims to generate controllable and physically consistent Human-Object Interaction (HOI) videos by introducing a framework called DISPLAY, which is driven by Sparse Motion Guidance consisting only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance reduces the imbalance between human and object representations and allows intuitive user control. An Object-Stressed Attention mechanism improves object robustness under such sparse conditions, and a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline compensates for the scarcity of high-quality HOI data. Comprehensive experiments show that DISPLAY produces high-fidelity, controllable HOI videos across diverse tasks.
研究旨在通过引入名为DISPLAY的框架来生成可控且物理上一致的人与物交互(HOI)视频,该框架使用稀疏运动指导,包括手腕关节坐标和无形状物体边界框。这种方法增强了模型对新型物体的灵活性和泛化能力。方法包括一种物体强调注意力机制,以在稀疏条件下提高物体的鲁棒性,以及一种多任务辅助训练策略,以利用辅助任务来更好地利用数据。实验表明,DISPLAY可以在各种任务中生成高质量和可控的HOI视频。
Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
Authors: Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang
First: 2026-03-10T16:40:30+00:00 · Latest: 2026-03-10T16:40:30+00:00
Comments: Project Page: https://pku-epic.github.io/DAPL/
Abstract
Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.
中文标题/摘要
标题:通过动力学感知策略学习在杂乱场景中新兴的外部灵巧性
外部灵巧性利用环境接触来克服抓握操作的局限性。然而,在杂乱场景中实现这种灵巧性仍然具有挑战性和未被充分探索,因为它需要在多个相互作用对象之间选择性地利用接触,而这些对象具有固有的耦合动力学。现有方法缺乏对这种复杂动力学的显式建模,因此在杂乱环境中的非抓握操作方面表现不佳,这反过来限制了它们在真实环境中的实际应用。在本文中,我们提出了一种动力学感知策略学习(DAPL)框架,该框架可以通过学习杂乱环境中接触引起的物体动力学的表示来促进策略学习。这种表示通过显式的世界建模学习,并用于条件强化学习,从而在无需手工接触启发式或复杂奖励塑造的情况下使外部灵巧性自然涌现。我们在仿真和真实世界中评估了我们的方法。我们的方法在对未见过的具有不同密度的杂乱场景的仿真中,成功率达到高出25%以上。在真实世界中,成功率约为50%,覆盖了10个杂乱场景,而实际的杂货部署进一步证明了从仿真到现实的稳健转移和适用性。
Summary / 总结
This paper addresses the challenge of achieving extrinsic dexterity in cluttered scenes by introducing a Dynamics-Aware Policy Learning (DAPL) framework. The method learns a representation of contact-induced object dynamics to condition reinforcement learning, avoiding the need for hand-crafted contact heuristics or complex reward shaping. The approach significantly outperforms prehensile manipulation, human teleoperation, and prior representation-based policies, with success rates over 25% higher in simulated cluttered scenes and around 50% in real-world cluttered scenes.
本文通过引入动态感知策略学习(DAPL)框架,解决了在杂乱场景中实现外在灵巧性的挑战。该框架学习接触引起的物体动力学表示,以条件强化学习,从而在无需手工设计启发式规则的情况下实现灵巧性。实验结果表明,该方法在未见过的杂乱场景中的成功率比传统的抓握操作、人类远程操作和先前的基于表示的策略高出25%以上,在10个杂乱场景中的成功率约为50%,并进一步展示了从仿真到现实的鲁棒性转移和实用性。
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
Authors: Changyao Tian, Danni Yang, Guanzhou Chen, Erfei Cui, Zhaokai Wang, Yuchen Duan, Penghao Yin, Sitao Chen, Ganlin Yang, Mingxin Liu, Zirun Zhu, Ziqian Fan, Leyao Gu, Haomin Wang, Qi Wei, Jinhui Yin, Xue Yang, Zhihang Zhong, Qi Qin, Yi Xin, Bin Fu, Yihao Liu, Jiaye Ge, Qipeng Guo, Gen Luo, Hongsheng Li, Yu Qiao, Kai Chen, Hongjie Zhang
First: 2026-03-10T16:38:33+00:00 · Latest: 2026-03-10T16:38:33+00:00
Comments: technical report, 61 pages, https://github.com/OpenGVLab/InternVL-U
Abstract
Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.
中文标题/摘要
标题:InternVL-U:普及统一多模态模型以实现理解、推理、生成和编辑
统一多模态模型(UMMs)将理解、推理、生成和编辑集成于一体,面临着在保持强大的语义理解能力和获得强大的生成能力之间固有的权衡。在本报告中,我们介绍了InternVL-U,这是一种轻量级的4B参数UMM,它在统一框架内普及了这些能力。InternVL-U遵循统一上下文建模和特定模态模块化设计的原则,解耦视觉表示,将最先进的多模态大型语言模型(MLLM)与专门的MMDiT基视觉生成头部相结合。为了进一步弥合美学生成与高层次智能之间的差距,我们构建了一个全面的数据合成管道,针对高语义密度任务,如文本渲染和科学推理,采用以推理为中心的范式,利用思维链(CoT)更好地将抽象用户意图与精细的视觉生成细节对齐。广泛的实验表明,InternVL-U实现了性能与效率的优越平衡。尽管仅使用4B参数,它在各种生成和编辑任务中始终优于比其大3倍以上的统一基线模型,如BAGEL(14B),同时保留了强大的多模态理解和推理能力。
Summary / 总结
InternVL-U is a lightweight 4B-parameter unified multimodal model that integrates understanding, reasoning, generation, and editing. It combines a state-of-the-art Multimodal Large Language Model with a specialized visual generation head, and uses a reasoning-centric paradigm with Chain-of-Thought to align abstract user intent with visual generation details. Experiments show that InternVL-U outperforms larger models like BAGEL (14B) on various tasks while maintaining strong multimodal understanding and reasoning capabilities.
InternVL-U 是一个轻量级的 4B 参数统一多模态模型,整合了理解和推理、生成和编辑功能。该模型基于统一上下文建模和模态特定模块化设计,结合了一个最先进的 MLLM 和专门的视觉生成头部。实验表明,尽管规模较小,InternVL-U 在各种生成和编辑任务上仍优于更大规模的模型(如 BAGEL 14B),同时保持了强大的多模态理解和推理能力。
Learning responsibility allocations for multi-agent interactions: A differentiable optimization approach with control barrier functions
Authors: Isaac Remy, David Fridovich-Keil, Karen Leung
First: 2024-10-09T20:20:41+00:00 · Latest: 2026-03-10T16:37:32+00:00
Comments: 8 pages, 7 figures
Abstract
From autonomous driving to package delivery, ensuring safe yet efficient multi-agent interaction is challenging as the interaction dynamics are influenced by hard-to-model factors such as social norms and contextual cues. Understanding these influences can aid in the design and evaluation of socially-aware autonomous agents whose behaviors are aligned with human values. In this work, we seek to codify factors governing safe multi-agent interactions via the lens of responsibility, i.e., an agent's willingness to deviate from their desired control to accommodate safe interaction with others. Specifically, we propose a data-driven modeling approach based on control barrier functions and differentiable optimization that efficiently learns agents' responsibility allocation from data. We demonstrate on synthetic and real-world datasets that we can obtain an interpretable and quantitative understanding of how much agents adjust their behavior to ensure the safety of others given their current environment.
中文标题/摘要
标题:多智能体交互的责任分配学习:基于控制障碍函数的可微优化方法
从自动驾驶到包裹配送,确保多智能体交互既安全又高效极具挑战性,因为交互动力学受到社会规范和上下文线索等难以建模的因素的影响。理解这些影响有助于设计和评估行为与人类价值观相一致的、具有社会意识的自主智能体。在本文中,我们从责任的视角出发,梳理支配安全多智能体交互的因素;这里的责任是指智能体为了与他人安全交互而偏离其期望控制的意愿。具体而言,我们提出了一种基于控制障碍函数和可微优化的数据驱动建模方法,以高效地从数据中学习智能体的责任分配。我们在合成和真实世界数据集上证明,该方法可以对智能体在当前环境下为确保他人安全而调整自身行为的程度给出可解释的定量理解。
Summary / 总结
This paper addresses the challenge of ensuring safe and efficient multi-agent interactions in scenarios like autonomous driving and package delivery. It proposes a method using control barrier functions and differentiable optimization to learn agents' responsibility allocations from data. The key finding is that this approach provides an interpretable and quantitative understanding of how agents adjust their behavior to ensure safety in different environments.
本文研究了在自动驾驶和包裹配送等场景中确保多个代理安全高效互动的挑战。它提出了一种使用控制障碍函数和可微优化的数据驱动建模方法来从数据中学习代理的责任分配。关键发现表明,这种方法可以提供一个可解释和定量的理解,即在不同环境中代理为了确保安全需要调整其行为的程度。
MissBench: Benchmarking Multimodal Affective Analysis under Imbalanced Missing Modalities
Authors: Tien Anh Pham, Phuong-Anh Nguyen, Duc-Trong Le, Cam-Van Thi Nguyen
First: 2026-03-10T16:36:45+00:00 · Latest: 2026-03-10T16:36:45+00:00
Abstract
Multimodal affective computing underpins key tasks such as sentiment analysis and emotion recognition. Standard evaluations, however, often assume that textual, acoustic, and visual modalities are equally available. In real applications, some modalities are systematically more fragile or expensive, creating imbalanced missing rates and training biases that task-level metrics alone do not reveal. We introduce MissBench, a benchmark and framework for multimodal affective tasks that standardizes both shared and imbalanced missing-rate protocols on four widely used sentiment and emotion datasets. MissBench also defines two diagnostic metrics. The Modality Equity Index (MEI) measures how fairly different modalities contribute across missing-modality configurations. The Modality Learning Index (MLI) quantifies optimization imbalance by comparing modality-specific gradient norms during training, aggregated across modality-related modules. Experiments on representative method families show that models that appear robust under shared missing rates can still exhibit marked modality inequity and optimization imbalance under imbalanced conditions. These findings position MissBench, together with MEI and MLI, as practical tools for stress-testing and analyzing multimodal affective models in realistic incomplete-modality settings. For reproducibility, we release our code at: https://anonymous.4open.science/r/MissBench-4098/
中文标题/摘要
标题:MissBench:在不平衡缺失模态下的多模态情感分析基准测试
多模态情感计算是情感分析和情绪识别等关键任务的基础。然而,标准评估通常假设文本、声学和视觉模态是同等可用的。在实际应用中,某些模态系统性地更脆弱或更昂贵,导致不平衡的缺失率和训练偏差,而仅凭任务级指标无法揭示这些偏差。我们引入了MissBench,这是一个用于多模态情感任务的基准和框架,它在四个广泛使用的情感和情绪数据集上标准化了共享缺失率和不平衡缺失率两类协议。MissBench还定义了两个诊断指标。模态公平指数(MEI)衡量不同模态在各种缺失模态配置下贡献的公平程度。模态学习指数(MLI)通过比较训练过程中各模态特定的梯度范数来量化优化不平衡,这些梯度范数在与模态相关的模块上汇总。针对代表性方法家族的实验表明,在共享缺失率下看似稳健的模型,在不平衡条件下仍可能表现出明显的模态不公平和优化不平衡。这些发现将MissBench以及MEI和MLI定位为在现实的不完整模态设置中对多模态情感模型进行压力测试和分析的实用工具。为了可重复性,我们发布了我们的代码:https://anonymous.4open.science/r/MissBench-4098/
Summary / 总结
MissBench is a benchmark for evaluating multimodal affective analysis under imbalanced missing modalities, addressing the limitations of standard evaluations that assume equal availability of textual, acoustic, and visual data. It introduces two diagnostic metrics: Modality Equity Index (MEI) to measure fairness of modality contributions and Modality Learning Index (MLI) to quantify optimization imbalance. Experiments show that models robust under shared missing rates can still exhibit modality inequity and optimization imbalance under imbalanced conditions, highlighting the need for MissBench as a practical tool for stress-testing multimodal affective models.
MissBench 是一个用于评估在不平衡缺失模态下多模态情感分析的基准,解决了标准评估假设文本、声学和视觉数据同等可用的局限性。它引入了两个诊断指标:模态公平指数(MEI)用于衡量不同模态贡献的公平程度,模态学习指数(MLI)用于量化优化不平衡。实验表明,即使在共享缺失率下表现稳健的模型,在不平衡条件下也可能表现出模态不公平和优化不平衡,突显了MissBench作为多模态情感模型压力测试实用工具的必要性。
CarbonBench: A Global Benchmark for Upscaling of Carbon Fluxes Using Zero-Shot Learning
Authors: Aleksei Rozanov, Arvind Renganathan, Yimeng Zhang, Vipin Kumar
First: 2026-03-10T16:33:28+00:00 · Latest: 2026-03-10T16:33:28+00:00
Abstract
Accurately quantifying terrestrial carbon exchange is essential for climate policy and carbon accounting, yet models must generalize to ecosystems underrepresented in sparse eddy covariance observations. Despite this challenge being a natural instance of zero-shot spatial transfer learning for time series regression, no standardized benchmark exists to rigorously evaluate model performance across geographically distinct locations with different climate regimes and vegetation types. We introduce CarbonBench, the first benchmark for zero-shot spatial transfer in carbon flux upscaling. CarbonBench comprises over 1.3 million daily observations from 567 flux tower sites globally (2000-2024). It provides: (1) stratified evaluation protocols that explicitly test generalization across unseen vegetation types and climate regimes, separating spatial transfer from temporal autocorrelation; (2) a harmonized set of remote sensing and meteorological features to enable flexible architecture design; and (3) baselines ranging from tree-based methods to domain-generalization architectures. By bridging machine learning methodologies and Earth system science, CarbonBench aims to enable systematic comparison of transfer learning methods, serves as a testbed for regression under distribution shift, and contributes to the next-generation climate modeling efforts.
中文标题/摘要
标题:CarbonBench:一种基于零样本学习的全球碳通量放大基准
准确量化陆地碳交换对于气候政策和碳核算至关重要,但模型必须泛化到稀疏涡动相关观测中未被代表的生态系统。尽管这是一个自然的零样本空间迁移学习的时间序列回归实例,但没有标准化基准可以严格评估模型在具有不同气候类型和植被类型的地理上不同的位置上的性能。 我们介绍了CarbonBench,这是第一个用于碳通量放大的零样本空间迁移基准。CarbonBench 包含来自全球567个通量塔站超过130万天的观测数据(2000-2024年)。它提供了:(1) 分层评估协议,明确测试在未见植被类型和气候类型下的泛化能力,将空间迁移与时间自相关分开;(2) 一套协调的遥感和气象特征,以实现灵活的架构设计;(3) 从基于树的方法到领域泛化架构的基线。通过将机器学习方法与地球系统科学相结合,CarbonBench旨在实现迁移学习方法的系统比较,作为分布转移下的回归测试平台,并为下一代气候建模做出贡献。
Summary / 总结
CarbonBench is a benchmark for evaluating zero-shot spatial transfer learning in carbon flux upscaling, addressing the challenge of generalizing models to underrepresented ecosystems. It comprises over 1.3 million daily observations from 567 flux tower sites worldwide (2000-2024). It provides stratified evaluation protocols that test generalization across unseen vegetation types and climate regimes, a harmonized set of remote sensing and meteorological features, and baselines ranging from tree-based methods to domain-generalization architectures.
CarbonBench 是一个用于评估碳通量升尺度中零样本空间迁移学习的基准,旨在解决模型向代表性不足的生态系统泛化的挑战。它包含来自全球567个通量塔站的超过130万条每日观测数据(2000-2024年),提供了明确测试未见植被类型和气候类型下泛化能力的分层评估协议、一套协调的遥感和气象特征,以及从基于树的方法到领域泛化架构的基线。
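The stratified, zero-shot evaluation idea described in the abstract can be illustrated with a minimal sketch. This is not CarbonBench's actual code or API: the record fields (`site`, `veg`, `day`, `flux`) and the function name `holdout_by_group` are hypothetical, chosen only to show how holding out an entire vegetation type keeps every test site unseen during training.

```python
def holdout_by_group(records, group_key, held_out_groups):
    """Split records so that every record in a held-out group goes to test.

    Splitting by group (e.g. vegetation type) rather than by row means no
    time series from a test site can leak into training.
    """
    held_out = set(held_out_groups)
    train, test = [], []
    for rec in records:
        if rec[group_key] in held_out:
            test.append(rec)
        else:
            train.append(rec)
    return train, test


# Hypothetical toy data: three sites, two daily observations each.
records = [
    {"site": "A", "veg": "forest", "day": 1, "flux": 2.1},
    {"site": "A", "veg": "forest", "day": 2, "flux": 2.3},
    {"site": "B", "veg": "grassland", "day": 1, "flux": 0.9},
    {"site": "B", "veg": "grassland", "day": 2, "flux": 1.0},
    {"site": "C", "veg": "cropland", "day": 1, "flux": 1.6},
    {"site": "C", "veg": "cropland", "day": 2, "flux": 1.4},
]

train, test = holdout_by_group(records, "veg", ["grassland"])
train_sites = {r["site"] for r in train}
test_sites = {r["site"] for r in test}
```

Splitting on `site` instead of `veg` gives the plainer leave-sites-out variant; either way the split is made over groups rather than individual days, which is what separates spatial transfer from temporal autocorrelation.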
Improved Robustness of Deep Reinforcement Learning for Control of Time-Varying Systems by Bounded Extremum Seeking
Authors: Shaifalee Saxena, Alan Williams, Rafael Fierro, Alexander Scheinker
First: 2025-10-02T18:53:02+00:00 · Latest: 2026-03-10T16:27:14+00:00
Abstract
In this paper, we study the use of robust model independent bounded extremum seeking (ES) feedback control to improve the robustness of deep reinforcement learning (DRL) controllers for a class of nonlinear time-varying systems. DRL has the potential to learn from large datasets to quickly control or optimize the outputs of many-parameter systems, but its performance degrades catastrophically when the system model changes rapidly over time. Bounded ES can handle time-varying systems with unknown control directions, but its convergence speed slows down as the number of tuned parameters increases and, like all local adaptive methods, it can get stuck in local minima. We demonstrate that together, DRL and bounded ES result in a hybrid controller whose performance exceeds the sum of its parts with DRL taking advantage of historical data to learn how to quickly control a many-parameter system to a desired setpoint while bounded ES ensures its robustness to time variations. We present a numerical study of a general time-varying system and a combined ES-DRL controller for automatic tuning of the Low Energy Beam Transport section at the Los Alamos Neutron Science Center linear particle accelerator.
中文标题/摘要
标题:通过有界极值搜索改进深度强化学习控制时变系统的鲁棒性
在本文中,我们研究了使用鲁棒模型独立的有界极值搜索(ES)反馈控制来提高一类非线性时变系统中深度强化学习(DRL)控制器鲁棒性的方法。DRL 有潜力从大量数据中学习,快速控制或优化多参数系统的输出,但当系统模型随着时间迅速变化时,其性能会灾难性地下降。有界ES 可以处理具有未知控制方向的时变系统,但随着调整参数数量的增加,其收敛速度会减慢,就像所有局部自适应方法一样,它可能会卡在局部最小值中。我们证明,DRL 和有界ES 结合在一起,形成的混合控制器的性能超过了它们各自的部分,DRL 利用历史数据学习如何快速控制多参数系统达到期望的目标值,而有界ES 确保其对时间变化的鲁棒性。我们对一般时变系统和结合ES-DRL 控制器进行了数值研究,用于洛斯阿拉莫斯中子科学中心线性粒子加速器低能束传输部分的自动调谐。
Summary / 总结
This paper investigates the integration of robust model-independent bounded extremum seeking (ES) feedback control with deep reinforcement learning (DRL) to enhance the robustness of DRL controllers for nonlinear time-varying systems. The study demonstrates that the hybrid controller outperforms individual DRL and ES methods by leveraging historical data for quick control and ensuring robustness against time variations. The performance of the combined ES-DRL controller was validated through numerical studies and its application in tuning the Low Energy Beam Transport section at the Los Alamos Neutron Science Center.
本文研究了将鲁棒的、不依赖模型的有界极值搜索(ES)反馈控制与深度强化学习(DRL)结合,以增强DRL控制器在非线性时变系统中的鲁棒性。研究表明,这种混合控制器利用历史数据实现快速控制,同时确保对时间变化的鲁棒性,其表现优于单独使用DRL或ES。研究通过一般时变系统的数值研究,以及洛斯阿拉莫斯中子科学中心线性粒子加速器低能束传输部分的自动调谐来证明这一点。
A Graph-Based Approach to Spectrum Demand Prediction Using Hierarchical Attention Networks
Authors: Mohamad Alkadamani, Halim Yanikomeroglu, Amir Ghasemi
First: 2026-03-10T16:20:51+00:00 · Latest: 2026-03-10T16:20:51+00:00
Comments: 7 pages, 6 figures. Presented at IEEE GLOBECOM 2025, Taiwan. To appear in the conference proceedings
Abstract
The surge in wireless connectivity demand, coupled with the finite nature of spectrum resources, compels the development of efficient spectrum management approaches. Spectrum sharing presents a promising avenue, although it demands precise characterization of spectrum demand for informed policy-making. This paper introduces HR-GAT, a hierarchical resolution graph attention network model, designed to predict spectrum demand using geospatial data. HR-GAT adeptly handles complex spatial demand patterns and resolves issues of spatial autocorrelation that usually challenge standard machine learning models, often resulting in poor generalization. Tested across five major Canadian cities, HR-GAT improves predictive accuracy of spectrum demand by 21% over eight baseline models, underscoring its superior performance and reliability.
中文标题/摘要
标题:基于图的频谱需求预测方法研究——采用分层注意网络
无线连接需求的激增与频谱资源的有限性促使了高效频谱管理方法的发展。频谱共享是一个有前景的途径,但需要精确地对频谱需求进行表征,以便进行明智的政策制定。本文介绍了一种分层分辨率图注意网络模型HR-GAT,该模型利用地理空间数据预测频谱需求。HR-GAT能够有效处理复杂的空间需求模式,并解决空间自相关问题;这类问题通常困扰标准机器学习模型,往往导致泛化能力不佳。在加拿大五个主要城市进行测试后,HR-GAT在频谱需求预测上的准确率比八个基线模型高出21%,证明了其优越的性能和可靠性。
Summary / 总结
This paper addresses the challenge of predicting spectrum demand to enhance efficient spectrum management, especially in the context of increasing wireless connectivity. It proposes HR-GAT, a hierarchical resolution graph attention network model, which outperforms eight baseline models by 21% in predictive accuracy across five major Canadian cities, demonstrating its effectiveness in handling complex spatial demand patterns and spatial autocorrelation issues.
该论文旨在通过预测频谱需求来有效管理无线连接资源。它提出了HR-GAT模型,这是一种分层分辨率图注意力网络模型,相比八个基线模型,在五个加拿大城市中将预测准确性提高了21%。该模型能够有效处理复杂的空间模式,并减少空间自相关性问题,从而提高泛化能力。
SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases
Authors: Laya Iyer, Angelina Wang, Sanmi Koyejo
First: 2026-03-10T16:15:12+00:00 · Latest: 2026-03-10T16:15:12+00:00
Comments: Accepted to EACL 2026 (Main Conference). 10 pages, 10 figures. Camera-ready version
Abstract
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech recognition (ASR). This paper closes that gap by proposing a benchmark suite, SCENEBench (Spatial, Cross-lingual, Environmental, Non-speech Evaluation), that targets a broad form of audio comprehension across four real-world categories: background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. These four categories are selected based on understudied needs from accessibility technology and industrial noise monitoring. In addition to performance, we also measure model latency. The purpose of this benchmark suite is to assess audio beyond just what words are said - rather, how they are said and the non-speech components of the audio. Because our audio samples are synthetically constructed (e.g., by overlaying two natural audio samples), we further validate our benchmark against 20 natural audio items per task, sub-sampled from existing datasets to match our task criteria, to assess ecological validity. We assess five state-of-the-art LALMs and find critical gaps: performance varies across tasks, with some tasks performing below random chance and others achieving high accuracy. These results provide direction for targeted improvements in model capabilities.
中文标题/摘要
标题:SCENEBench:基于辅助技术和工业应用场景的音频理解基准
大型语言模型(LLMs)的进步使音频处理能力显著增强,催生了现在被称为大型音频语言模型(LALMs)的最先进模型。然而,很少有工作致力于测量自动语音识别(ASR)之外的音频理解能力。本文通过提出一个基准套件SCENEBench(空间、跨语言、环境、非言语评估)填补了这一空白,该基准套件针对四个现实世界类别中的广泛音频理解形式:背景声音理解、噪声定位、跨语言语音理解以及语音特征识别。这四个类别是基于无障碍技术和工业噪声监测中研究不足的需求而选择的。除了性能之外,我们还测量了模型的延迟。该基准套件的目的是评估音频时不仅关注说了什么,还关注如何说的以及音频中的非言语成分。由于我们的音频样本是合成构建的(例如,通过叠加两个自然音频样本),我们进一步针对每个任务使用20个自然音频样本对基准进行验证,这些样本从现有数据集中抽取并符合我们的任务标准,以评估生态效度。我们评估了五种最先进的LALMs,并发现存在关键差距:任务间性能不同,有些任务的性能低于随机猜测,而其他任务则达到了高准确率。这些结果为模型能力的针对性改进提供了方向。
Summary / 总结
SCENEBench is a benchmark suite designed to evaluate audio understanding beyond automatic speech recognition, focusing on background sound understanding, noise localization, cross-linguistic speech understanding, and vocal characterizer recognition. It includes both synthetic and natural audio samples to ensure ecological validity. Five state-of-the-art Large Audio Language Models (LALMs) were assessed, revealing significant variations in performance across different tasks, with some tasks performing below random chance and others achieving high accuracy. This highlights critical gaps in current LALM capabilities and provides directions for targeted improvements.
该论文提出了SCENEBench基准套件,用于评估自动语音识别之外的音频理解能力,重点关注背景声音理解、噪声定位、跨语言语音理解和语音特征识别。它评估了五种最先进的大型音频语言模型,并发现不同任务之间的性能存在显著差异,有些任务的表现甚至低于随机猜测,而其他任务则达到了高准确率。基准还测量了模型的延迟,并包含自然音频样本以确保生态效度,突显了当前大型音频语言模型的关键差距。