Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava
First: 2026-02-20T18:59:50+00:00 · Latest: 2026-02-20T18:59:50+00:00
Comments: Project page: see https://vatsalag99.github.io/memstream/
Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
中文标题/摘要
标题:重返记忆之路:通过动态KV缓存记忆扩展视频流理解的令牌预算
视频流理解需要模型从连续视频流中稳健地编码、存储和检索信息,以支持准确的视频问答(VQA)。现有最先进的方法依赖于键值缓存来随着时间累积帧级信息,但每帧使用的令牌数量有限,导致丢失了细粒度的视觉细节。在本文中,我们提出扩展令牌预算以实现更精细的空间-时间理解和推理。首先,我们发现当前方法无法有效处理密集流:它们的特征编码导致查询帧相似度分数随时间增加,偏向于后期帧的检索。为了解决这个问题,我们引入了一种自适应选择策略,减少令牌冗余同时保留局部空间-时间信息。我们还提出了一种无需训练的检索专家混合模型,利用外部模型更好地识别相关帧。我们的方法MemStream在CG-Bench上提高了8.0%,在LVBench上提高了8.5%,在VideoMME(长)上相对于ReKV与Qwen2.5-VL-7B提高了2.4%。
Summary / 总结
This paper addresses the challenge of robustly encoding, storing, and retrieving information from continuous video streams for accurate video question answering. It proposes scaling the token budget to enhance spatiotemporal understanding and introduces an adaptive selection strategy and a retrieval mixture-of-experts to reduce token redundancy and better identify relevant frames. The method, MemStream, improves performance by +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
本文旨在解决从连续视频流中稳健地编码、存储和检索信息以实现准确的视频问答的问题。它提出扩大令牌预算以增强时空理解,并引入了一种自适应选择策略和检索混合专家,以减少令牌冗余并更好地识别相关帧。该方法MemStream在CG-Bench上提高了8.0%,在LVBench上提高了8.5%,在VideoMME(长)上提高了2.4%,优于ReKV与Qwen2.5-VL-7B。
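The retrieval step mentioned in the MemStream abstract (scoring cached frames against a query and keeping the best matches) can be sketched roughly as below. This is a generic cosine-similarity top-k lookup, not MemStream's or ReKV's actual implementation; the function names, the one-vector-per-frame representation, and the toy vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query, frame_keys, k):
    """Return the indices of the k cached frames most similar to the query,
    in temporal order. frame_keys holds one (hypothetical) pooled key per frame."""
    ranked = sorted(range(len(frame_keys)),
                    key=lambda i: cosine(query, frame_keys[i]),
                    reverse=True)
    return sorted(ranked[:k])

# Toy cache: frames 0-1 point along the first axis, frames 2-3 along the second.
frame_keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(retrieve_top_k([0.0, 1.0], frame_keys, k=2))
```

With a lookup of this kind, any systematic upward drift in similarity scores over time, as the abstract reports for dense streams, would tilt the selection toward later frames.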
SARAH: Spatially Aware Real-time Agentic Humans
Authors: Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard
First: 2026-02-20T18:59:35+00:00 · Latest: 2026-02-20T18:59:35+00:00
Comments: Project page: https://evonneng.github.io/sarah/
Abstract
As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
中文标题/摘要
标题:SARAH: 空间感知实时自主人类
随着具身代理在VR、远程呈现和数字人类应用中的核心地位日益凸显,它们的运动必须超越与言语同步的手势:代理应面向用户,响应其动作,并保持自然的目光。当前方法缺乏这种空间感知能力。我们通过提出首个实时、完全因果的空间感知对话运动方法来填补这一空白,该方法适用于流式VR头显。给定用户的位置和二元音频,我们的方法生成全身运动,使手势与言语同步,同时根据用户调整代理的方向。我们的架构结合了因果变换器基VAE和交错的潜在令牌以实现流式推理,以及基于用户轨迹和音频的流动匹配模型。为了支持不同的注视偏好,我们引入了一种注视评分机制和无分类引导,以解耦学习与控制:模型从数据中捕捉自然的空间对齐,而用户可以在推理时调整眼神接触的强度。在Embody 3D数据集上,我们的方法在超过300 FPS的速度下实现了最先进的运动质量——比非因果基线快3倍——同时捕捉自然对话的微妙空间动态。我们通过实时VR系统验证了该方法,将空间感知对话代理带到实时部署。请参见https://evonneng.github.io/sarah/ 获取更多详情。
Summary / 总结
This paper addresses the need for spatial awareness in embodied agents for VR and telepresence applications. It introduces SARAH, a real-time, fully causal method that generates full-body motion aligned with speech and oriented toward the user. The method combines a causal transformer-based VAE with a flow matching model conditioned on user trajectory and audio, running at over 300 FPS, 3x faster than non-causal baselines. A gaze scoring mechanism with classifier-free guidance lets users adjust eye contact intensity at inference time. Experiments on the Embody 3D dataset show state-of-the-art motion quality, and the approach is validated on a live VR system.
研究旨在通过增强空间意识来提升VR和远程存在应用中的实体代理。方法使用因果变换器基线VAE与交错的潜在令牌和流匹配模型来生成与言语对齐的手势并使代理根据用户进行定向。该方法在超过300 FPS的速度下实现了最先进的运动质量,捕捉自然的空间动态,并通过注视评分机制支持不同的注视偏好。在Embody 3D数据集上,该方法优于非因果基线,并可在流式VR头显上部署。
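The SARAH abstract relies on classifier-free guidance to expose eye contact intensity as an inference-time control. The standard guidance blend is sketched below; applying it to a flow-matching velocity conditioned on a gaze score is an assumption about how the pieces fit together, and `cfg_blend` and its argument names are hypothetical.

```python
def cfg_blend(uncond, cond, scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (or past) the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

# scale = 0 ignores the condition, scale = 1 reproduces the conditional
# prediction, and scale > 1 exaggerates the effect of the condition.
unconditional = [0.0, 1.0]
conditional = [1.0, 1.0]
for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_blend(unconditional, conditional, scale))
```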
Spatio-Spectroscopic Representation Learning using Unsupervised Convolutional Long-Short Term Memory Networks
Authors: Kameswara Bharadwaj Mantha, Lucy Fortson, Ramanakumar Sankar, Claudia Scarlata, Chris Lintott, Sandor Kruk, Mike Walmsley, Hugh Dickinson, Karen Masters, Brooke Simmons, Rebecca Smethurst
Venue: ICML
First: 2026-02-20T18:48:36+00:00 · Latest: 2026-02-20T18:48:36+00:00
Comments: This manuscript was previously submitted to ICML for peer review. Reviewers noted that while the underlying VAE-based architecture builds on established methods, its application to spatially-resolved IFS data is promising for unsupervised representation learning in astronomy. This version is released for community visibility. Reviewer decisions: Weak accept and Weak reject (Final: Reject)
Abstract
Integral Field Spectroscopy (IFS) surveys offer a unique new landscape in which to learn in both spatial and spectroscopic dimensions and could help uncover previously unknown insights into galaxy evolution. In this work, we demonstrate a new unsupervised deep learning framework using Convolutional Long-Short Term Memory Network Autoencoders to encode generalized feature representations across both spatial and spectroscopic dimensions spanning $19$ optical emission lines ($3800\,\text{\AA} < \lambda < 8000\,\text{\AA}$) among a sample of $\sim 9000$ galaxies from the MaNGA IFS survey. As a demonstrative exercise, we assess our model on a sample of $290$ Active Galactic Nuclei (AGN) and highlight scientifically interesting characteristics of some highly anomalous AGN.
中文标题/摘要
标题:基于无监督卷积长短期记忆网络的空间光谱表示学习
积分场光谱学(IFS)调查提供了一个独特的全新视角,可以在空间和光谱维度上进行学习,有助于揭示星系演化中未知的见解。在本文中,我们展示了一种新的无监督深度学习框架,使用卷积长短期记忆网络自编码器来编码跨越19种光学发射线(3800A < λ < 8000A)的空间和光谱维度上的通用特征表示,样本来自MaNGA IFS调查的约9000个星系。作为演示练习,我们评估了该模型在290个活动星系核(AGN)样本上的表现,并强调了一些高度异常AGN的一些科学上有趣的特征。
Summary / 总结
This study aims to leverage Integral Field Spectroscopy (IFS) surveys to explore galaxy evolution by learning representations in both spatial and spectroscopic dimensions. The authors use an unsupervised Convolutional Long-Short Term Memory Network Autoencoder to encode feature representations across 19 optical emission lines for approximately 9000 galaxies from the MaNGA IFS survey. The model was tested on 290 Active Galactic Nuclei (AGN), revealing some highly anomalous AGN with interesting characteristics.
该研究旨在利用积分场光谱学(IFS)调查来探索星系演化,通过在空间和光谱维度上学习表示。作者使用卷积长短期记忆网络自编码器对来自MaNGA IFS调查的约9000个星系的19条光学发射线进行特征表示编码。该模型在290个活动星系核(AGN)上进行了测试,揭示了一些高度异常的AGN具有有趣的特征。
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Authors: Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich
First: 2026-02-20T18:46:27+00:00 · Latest: 2026-02-20T18:46:27+00:00
Abstract
Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test whether VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
中文标题/摘要
标题:CapNav:基于能力条件的室内导航基准测试
视觉-语言模型(VLMs)在视觉-语言导航(VLN)方面取得了显著进展,为导航决策提供了新的可能性,这不仅对机器人平台,也对人类用户有益。然而,现实世界的导航本质上受到代理移动约束的条件限制。例如,清洁机器人无法穿越楼梯,而四足机器人可以。我们引入了基于能力条件的导航(CapNav),这是一种旨在评估VLMs在给定代理特定物理和操作能力的情况下如何导航复杂室内空间的基准测试。CapNav 定义了五种代表性的人类和机器人代理,每种代理都描述了其物理尺寸、移动能力和环境交互能力。CapNav 提供了 45 个真实世界的室内场景、473 个导航任务和 2365 个问答对,以测试 VLMs 是否可以根据代理能力穿越室内环境。我们评估了 13 种现代 VLMs,发现当前 VLM 的导航性能随着移动约束的收紧而急剧下降,即使是最先进的模型也难以应对需要在空间维度上进行推理的障碍类型。最后,我们讨论了能力感知导航的含义以及未来 VLMs 中增强实体空间推理的机会。基准测试可在 https://github.com/makeabilitylab/CapNav 获取
Summary / 总结
The research introduces CapNav, a benchmark to evaluate Vision-Language Models (VLMs) in navigating indoor environments based on the physical capabilities of agents. It defines five human and robot agents with specific mobility constraints and provides 45 real-world indoor scenes and 473 navigation tasks. The study finds that current VLMs perform poorly under tight mobility constraints and struggle with spatial reasoning for certain obstacles, indicating a need for improved capability-aware navigation capabilities in VLMs.
研究旨在评估视觉语言模型(VLMs)在给定特定物理和操作能力的代理时,能否有效导航室内空间。研究引入了CapNav基准,包括五种人类和机器人代理、45个真实世界的室内场景和473个导航任务,以测试VLMs。主要发现表明,当前的VLMs在应对移动限制和空间推理方面存在困难,特别是在处理需要空间维度推理的障碍物时表现不佳,这表明需要在能力感知导航方面进行改进。
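The idea of capability-conditioned traversal in the CapNav abstract can be illustrated with a small data-structure sketch. The field names, obstacle kinds, and dimensions below are hypothetical and are not taken from the benchmark's schema; only the sweeping-robot and quadruped stair example comes from the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str
    width_m: float           # physical dimension (hypothetical field)
    can_climb_stairs: bool   # mobility capability
    can_open_doors: bool     # environmental interaction ability

@dataclass(frozen=True)
class Obstacle:
    kind: str                            # "doorway", "stairs", or "closed_door"
    clearance_m: float = float("inf")    # free width available to pass through

def can_pass(agent, obstacle):
    """Check one obstacle against the agent's capabilities and size."""
    if obstacle.kind == "stairs":
        return agent.can_climb_stairs
    if obstacle.kind == "closed_door" and not agent.can_open_doors:
        return False
    return agent.width_m <= obstacle.clearance_m

def route_feasible(agent, route):
    """A route is feasible only if every obstacle along it can be passed."""
    return all(can_pass(agent, obstacle) for obstacle in route)

sweeper = Agent("sweeping robot", 0.35, can_climb_stairs=False, can_open_doors=False)
quadruped = Agent("quadruped", 0.50, can_climb_stairs=True, can_open_doors=False)
route = [Obstacle("doorway", 0.8), Obstacle("stairs")]
print(route_feasible(sweeper, route), route_feasible(quadruped, route))
```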
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
First: 2026-02-20T18:45:29+00:00 · Latest: 2026-02-20T18:45:29+00:00
Comments: Project page here: https://codeysun.github.io/generated-reality
Abstract
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.
中文标题/摘要
标题:生成现实:基于交互视频生成与手部及摄像机控制的人本世界模拟
扩展现实(XR)需要能够响应用户跟踪的现实世界运动的生成模型,但当前的视频世界模型仅接受粗略的控制信号,如文本或键盘输入,限制了其在具身交互中的应用。我们提出了一种基于跟踪头部姿态和关节级手部姿态的人本视频世界模型。为此,我们评估了现有的扩散变换器条件策略,并提出了一种有效的3D头部和手部控制机制,使手物交互更加灵巧。我们使用此策略训练了一个双向视频扩散模型教师,并将其提炼为一个因果、交互式系统,生成第一人称的虚拟环境。我们用人类受试者评估了此生成现实系统,并展示了任务性能的提高,以及在执行动作时显著更高的控制感,相比相关基线有显著提升。
Summary / 总结
The research aims to develop a video world model that responds to users' tracked real-world motion for extended reality (XR) applications. The model is conditioned on tracked head pose and joint-level hand poses; a bidirectional video diffusion teacher is trained with this conditioning and then distilled into a causal, interactive system that generates egocentric virtual environments. In a human-subject evaluation, the system shows improved task performance and a higher level of perceived control than relevant baselines.
研究旨在开发一种能够响应用户真实世界动作的视频世界模型,适用于扩展现实(XR)应用。方法包括使用头部和手部姿态进行控制,并训练双向视频扩散模型生成以自我为中心的虚拟环境。该系统在任务性能和对执行动作的感知控制水平上优于现有方法。
Benchmarking Graph Neural Networks in Solving Hard Constraint Satisfaction Problems
Authors: Geri Skenderi, Lorenzo Buffoni, Francesco D'Amico, David Machado, Raffaele Marino, Matteo Negri, Federico Ricci-Tersenghi, Carlo Lucibello, Maria Chiara Angelini
First: 2026-02-20T18:41:48+00:00 · Latest: 2026-02-20T18:41:48+00:00
Abstract
Graph neural networks (GNNs) are increasingly applied to hard optimization problems, often claiming superiority over classical heuristics. However, such claims risk being unfounded due to a lack of standard benchmarks on truly hard instances. From a statistical physics perspective, we propose new hard benchmarks based on random problems. We provide these benchmarks, along with performance results from both classical heuristics and GNNs. Our fair comparison shows that classical algorithms still outperform GNNs. We discuss the challenges for neural networks in this domain. Future claims of superiority can be made more robust using our benchmarks, available at https://github.com/ArtLabBocconi/RandCSPBench.
中文标题/摘要
标题:在解决硬约束满足问题时图神经网络基准测试
图神经网络(GNNs)越来越多地应用于硬优化问题,通常声称优于经典启发式方法。然而,由于缺乏针对真正硬实例的标准基准,这些声称可能缺乏坚实性。从统计物理学的角度出发,我们提出了基于随机问题的新硬基准。我们提供了这些基准,以及来自经典启发式方法和GNNs的性能结果。公平比较显示,经典算法仍然优于GNNs。我们讨论了神经网络在该领域面临的挑战。未来使用我们的基准可以更牢固地提出优越性声明,基准代码可在https://github.com/ArtLabBocconi/RandCSPBench获取。
Summary / 总结
The study benchmarks graph neural networks (GNNs) against classical heuristics on hard constraint satisfaction problems, proposing new benchmarks based on random problems. The results show that classical algorithms outperform GNNs, highlighting challenges for neural networks in this domain. Future claims of GNN superiority should use these benchmarks for validation.
研究将图神经网络(GNNs)与经典启发式算法在硬约束满足问题上进行基准测试,提出了基于随机问题的新基准。结果显示,经典算法优于GNNs,指出了神经网络在该领域面临的挑战。未来关于GNN优越性的声明应使用这些基准进行验证。
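To make "benchmarks based on random problems" concrete, the sketch below generates a random constraint satisfaction instance and scores a candidate assignment. The abstract does not name the problem families used, so random 3-SAT is chosen here purely as a familiar illustration; the function names and parameters are assumptions and this is not code from RandCSPBench.

```python
import random

def random_ksat(n_vars, n_clauses, k=3, seed=0):
    """Draw n_clauses clauses, each over k distinct variables with random signs.
    A literal is +v (variable v true) or -v (variable v false)."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)
        clauses.append([v if rng.random() < 0.5 else -v for v in variables])
    return clauses

def count_violated(clauses, assignment):
    """Number of clauses with no satisfied literal; assignment maps var -> bool."""
    def literal_true(lit):
        return assignment[abs(lit)] == (lit > 0)
    return sum(1 for clause in clauses
               if not any(literal_true(lit) for lit in clause))

clauses = random_ksat(n_vars=20, n_clauses=85)
all_true = {v: True for v in range(1, 21)}
print(count_violated(clauses, all_true))
```

The ratio of clauses to variables is the usual knob for difficulty in random k-SAT; the value used here (85/20) is only an example setting.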
Learning Performance Maximizing Ensembles with Explainability Guarantees
Authors: Vincent Pisztora, Jia Li
First: 2023-12-20T02:21:26+00:00 · Latest: 2026-02-20T18:33:44+00:00
Abstract
In this paper we propose a method for the optimal allocation of observations between an intrinsically explainable glass box model and a black box model. An optimal allocation is defined as one which, for any given explainability level (i.e. the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task and, subject to that maximal ensemble performance condition, maximizes the performance of the explainable model on the observations allocated to it. The proposed method is shown to produce such explainability optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining $74\%$ of observations on average), and in some cases even to outperform both the component explainable and black box models while improving explainability.
中文标题/摘要
标题:具有解释性保证的最大化学习性能集成
在本文中,我们提出了一种方法,用于在固有可解释的玻璃盒模型和黑盒模型之间进行最优观察分配。最优分配被定义为在任何给定解释性水平(即解释性模型作为预测函数的比例观察数)下,最大化集成在底层任务上的性能,并最大化分配给解释性模型的观察数上的性能,同时满足最大集成性能条件。所提出的方法在多种解释性模型和黑盒模型类型以及表格数据集基准套件上展示了这样的解释性最优分配。这些学习到的分配在高解释性水平下(平均解释74%的观察数)保持了集成性能,并且在某些情况下甚至优于组件解释性和黑盒模型,同时提高了解释性。
Summary / 总结
The paper proposes a method for optimally allocating observations between an explainable model and a black box model to maximize ensemble performance while ensuring explainability. The method determines the best allocation for any given explainability level, balancing the performance of the ensemble and the explainable model. Experiments on benchmark datasets show that the learned allocations maintain high ensemble performance even at high explainability levels, often outperforming the individual models while improving explainability.
论文提出了一种方法,通过在解释性模型和黑盒模型之间优化分配观察值来最大化集成性能并确保解释性。该方法通过定义一种优化分配来实现这一目标,该分配最大化了集成的性能和解释性模型在其分配观察值上的性能。实验表明,学习到的分配即使在高解释性水平(平均解释74%的观察值)下也能保持高集成性能,并且在某些情况下甚至可以超越个体模型。
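The quantities defined in the abstract above (an allocation of observations, the explainability level, and ensemble performance) can be written down in a few lines. The sketch only evaluates a given allocation; it does not show how the paper learns one. The toy models, data, and function names are assumptions for illustration.

```python
def ensemble_predict(xs, to_glass_box, glass_box, black_box):
    """Route each observation to the glass box model when its flag is True,
    otherwise to the black box model."""
    return [glass_box(x) if flag else black_box(x)
            for x, flag in zip(xs, to_glass_box)]

def explainability_level(to_glass_box):
    """Proportion of observations predicted by the explainable model."""
    return sum(to_glass_box) / len(to_glass_box)

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy task: the label is True outside the interval [-1, 0].
xs = [-2.0, -0.5, 0.5, 2.0]
labels = [True, False, True, True]
glass_box = lambda x: x > 0                # simple threshold rule, wrong on x = -2
black_box = lambda x: x > 0 or x < -1      # stands in for a flexible model

allocation = [False, True, True, True]     # only x = -2 goes to the black box
preds = ensemble_predict(xs, allocation, glass_box, black_box)
print(explainability_level(allocation), accuracy(preds, labels))
```

In this toy case the ensemble matches the black box's accuracy while the glass box handles three of the four observations, which is the kind of trade-off the abstract describes.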
Leakage and Second-Order Dynamics Improve Hippocampal RNN Replay
Authors: Josue Casco-Rodriguez, Nanda H. Krishna, Richard G. Baraniuk
First: 2026-02-20T18:07:09+00:00 · Latest: 2026-02-20T18:07:09+00:00
Abstract
Biological neural networks (like the hippocampus) can internally generate "replay" resembling stimulus-driven activity. Recent computational models of replay use noisy recurrent neural networks (RNNs) trained to path-integrate. Replay in these networks has been described as Langevin sampling, but new modifiers of noisy RNN replay have surpassed this description. We re-examine noisy RNN replay as sampling to understand or improve it in three ways: (1) Under simple assumptions, we prove that the gradients replay activity should follow are time-varying and difficult to estimate, but readily motivate the use of hidden state leakage in RNNs for replay. (2) We confirm that hidden state adaptation (negative feedback) encourages exploration in replay, but show that it incurs non-Markov sampling that also slows replay. (3) We propose the first model of temporally compressed replay in noisy path-integrating RNNs through hidden state momentum, connect it to underdamped Langevin sampling, and show that, together with adaptation, it counters slowness while maintaining exploration. We verify our findings via path-integration of 2D triangular and T-maze paths and of high-dimensional paths of synthetic rat place cell activity.
中文标题/摘要
标题:泄漏和第二阶动力学提高海马RNN重放
生物神经网络(如海马)可以在内部生成类似于刺激驱动活动的“重放”。最近的重放计算模型使用噪声循环神经网络(RNN)进行路径积分训练。这些网络中的重放被描述为朗格维采样,但噪声RNN重放的新修饰已经超越了这一描述。我们重新审视噪声RNN重放作为采样以理解或改进它三个方面:(1)在简单假设下,我们证明重放活动应遵循的梯度是时间变化且难以估计的,但容易促使在RNN中使用隐藏状态泄漏。(2)我们确认隐藏状态适应(负反馈)鼓励重放中的探索,但显示它会导致非马尔可夫采样,从而减慢重放。(3)我们提出了噪声路径积分RNN中时间压缩重放的第一个模型,通过隐藏状态动量连接到欠阻尼朗格维采样,并表明与适应结合使用时,它可以抵消缓慢性同时保持探索。我们通过2D三角形和T迷宫路径的路径积分以及合成小鼠位置细胞活动的高维路径验证了这些发现。
Summary / 总结
The study investigates how leakage and second-order dynamics improve replay in noisy recurrent neural networks (RNNs) that model hippocampal function. It proves that, under simple assumptions, the gradients that replay activity should follow are time-varying and difficult to estimate, which motivates hidden state leakage. It also shows that hidden state adaptation encourages exploration but introduces non-Markov sampling and slows replay. To address this, the authors propose hidden state momentum, connect it to underdamped Langevin sampling, and show that, combined with adaptation, it counters slowness while maintaining exploration. Experiments confirm these findings through path integration of 2D triangular and T-maze paths and of high-dimensional synthetic rat place cell activity.
研究探讨了泄漏和第二阶动态如何增强模仿海马体活动的噪声循环神经网络(RNN)中的回放过程。研究证明,回放活动的梯度是时间变化且难以估计,这促使使用隐藏状态泄漏。研究还表明,隐藏状态适应可以促进探索,但引入了非马尔可夫采样,从而减慢了回放速度。通过提出隐藏状态动量,研究解决了速度问题同时保持了探索性,并将其与欠阻尼拉格朗日采样联系起来。这些发现通过2D三角形和T迷宫路径以及高维合成大鼠位置细胞活动的路径整合实验得到了验证。
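The contrast the replay abstract draws between Langevin sampling and its underdamped (momentum) variant can be seen in the textbook update rules. The sketch below uses plain Euler-style discretizations on a one-dimensional standard Gaussian target; it is not the paper's RNN model, and the step size, friction value, and function names are assumptions.

```python
import math
import random

def overdamped_step(x, score, dt, rng, noise=1.0):
    """One Euler-Maruyama step of overdamped Langevin dynamics:
    drift along the score (gradient of log density) plus Gaussian noise."""
    return x + dt * score(x) + noise * math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)

def underdamped_step(x, v, score, dt, friction, rng, noise=1.0):
    """One step of underdamped Langevin dynamics: the score and the noise act
    on a velocity (momentum) variable, which then moves the position."""
    v = (v + dt * (score(x) - friction * v)
         + noise * math.sqrt(2.0 * friction * dt) * rng.gauss(0.0, 1.0))
    x = x + dt * v
    return x, v

score = lambda x: -x   # score of a standard Gaussian target
rng = random.Random(0)
x, v = 3.0, 0.0
for _ in range(1000):
    x, v = underdamped_step(x, v, score, dt=0.01, friction=1.0, rng=rng)
print(x, v)
```

Setting `noise=0.0` turns both updates into deterministic dynamics, which is a convenient way to check the drift terms in isolation.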
Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis
Authors: Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Daniel C. Alexander, Le Zhang
First: 2026-02-20T18:05:39+00:00 · Latest: 2026-02-20T18:05:39+00:00
Abstract
Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis, that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness, generalizability, and flexibility. The code is available at https://github.com/JK-Liu7/CoPeDiT .
中文标题/摘要
标题:利用完整性感知的扩散变换器进行统一的3D MRI合成
多模态脑MRI中的缺失模态和心脏MRI中的缺失切片等问题,给临床实践带来了重大挑战。现有方法依赖外部指导来提供详细的缺失状态,以指导生成模型合成缺失的MRI。然而,在实际临床环境中,由于环境的不可预测性,手动指示可能不可用或不可靠。此外,这些显式的掩码不足以提供提高语义一致性的指导。在本文中,我们认为生成模型应该以自我感知的方式推断和识别缺失状态,从而更好地捕捉细微的解剖和病理变化。为了实现这一目标,我们提出了CoPeDiT,这是一种通用的潜在扩散模型,配备了完整性感知能力,用于统一合成3D MRI。具体而言,我们将在我们的分词器CoPeVAE中引入专门的预训练任务,使其能够学习完整性感知的判别提示,并设计MDiT3D,这是一种专门的3D MRI合成扩散变换器架构,能够有效利用学习到的提示作为指导,增强3D空间中的语义一致性。在三个大规模MRI数据集上的全面评估表明,CoPeDiT显著优于现有最先进的方法,实现了更高的鲁棒性、通用性和灵活性。代码可在https://github.com/JK-Liu7/CoPeDiT 获取。
Summary / 总结
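This work addresses missing data in multi-modal brain MRI and cardiac MRI by proposing CoPeDiT, a generative model that infers missing states in a self-perceptive manner. CoPeDiT combines a completeness-aware tokenizer with a specialized 3D diffusion transformer to enhance semantic consistency. Experiments on three large-scale MRI datasets show that CoPeDiT outperforms existing methods in robustness, generalizability, and flexibility.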
该研究针对多模态脑MRI和心脏MRI中缺失数据的问题,提出了一种通过自我感知来推断缺失状态的生成模型CoPeDiT。CoPeDiT利用了具备完整性感知的分词器和专门的3D扩散变换器来增强语义一致性。在三个大规模MRI数据集上的实验结果表明,CoPeDiT在鲁棒性、通用性和灵活性方面优于现有方法。
Self-Aware Object Detection via Degradation Manifolds
Authors: Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens
First: 2026-02-20T17:58:46+00:00 · Latest: 2026-02-20T17:58:46+00:00
Abstract
Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector's nominal operating regime. We refer to this capability as self-aware object detection.
We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector's feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling.
To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence.
Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation.
中文标题/摘要
标题:基于退化流形的自我意识目标检测
目标检测器在标准成像条件下表现出色,但在遭遇模糊、噪声、压缩、恶劣天气或分辨率变化时可能会无声地失效。在安全关键环境中,仅生成预测而不评估输入是否仍处于检测器的标准工作范围内是不够的。我们称这种能力为自我意识目标检测。
我们提出了一种基于退化流形的退化感知自我意识框架,该框架根据图像退化而非语义内容明确地结构化检测器的特征空间。我们的方法通过多层对比学习增强了一个标准检测主干网络,带有轻量级嵌入头。具有相同退化组成的图像被拉近,而不同退化配置的图像则被推开,从而生成一个几何上组织良好的表示,该表示捕捉退化类型和严重程度,而无需退化标签或显式密度建模。
为了锚定学习到的几何结构,我们从干净的训练嵌入中估计一个原始原型,定义表示空间中的一个标准工作点。自我意识表现为几何偏离这一参考,提供了一个独立于检测置信度的退化诱导变化的内在、图像级信号。
在合成退化基准测试、跨数据集零样本迁移以及自然天气诱导的分布偏移的广泛实验中,我们展示了强大的原始退化可分性、在多个检测器架构上的一致行为以及在语义偏移下的稳健泛化。这些结果表明,退化感知表示几何结构提供了一个实用且检测器无关的基础。
Summary / 总结
The paper addresses the issue of object detectors failing under non-nominal imaging conditions by introducing a self-aware object detection framework. This framework uses degradation manifolds to organize the feature space based on image degradation rather than semantic content. The method enhances a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning, enabling the detection of degradation without explicit labels. Experiments show strong separability between pristine and degraded images, consistent behavior across different detector architectures, and robust generalization under semantic shift, indicating the practicality and detector-agnostic nature of the proposed approach.
研究旨在通过开发自意识检测框架,在复杂成像条件下改进物体检测。该框架使用退化流形来根据图像退化而非语义内容组织特征空间。方法通过多层对比学习增强标准检测骨干网,轻量级嵌入头在无需明确退化标签的情况下,能够有效区分原始和退化图像。实验表明,该方法在各种检测架构和自然天气条件下表现出稳健性能,表明退化感知的表示几何提供了实用且检测器无关的基础,用于自意识物体检测。
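The "pristine prototype" idea in the abstract above, a reference point estimated from clean embeddings with degradation read off as deviation from it, can be sketched as follows. Using the mean as the prototype and Euclidean distance as the deviation are assumptions; the abstract does not state the exact estimator or metric, and the embeddings here are toy values.

```python
import math

def pristine_prototype(clean_embeddings):
    """Estimate the nominal operating point as the mean clean embedding."""
    n = len(clean_embeddings)
    return [sum(column) / n for column in zip(*clean_embeddings)]

def deviation(embedding, prototype):
    """Image-level degradation signal: distance from the nominal point."""
    return math.dist(embedding, prototype)

clean = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.3], [1.0, -0.3]]
proto = pristine_prototype(clean)
print(deviation([1.1, 0.1], proto))   # near the prototype: likely nominal
print(deviation([1.0, 3.0], proto))   # far from it: likely degraded
```

Because the score depends only on the embedding, it is independent of the detector's confidence outputs, which matches the property the abstract emphasizes.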
Learning to Tune Pure Pursuit in Autonomous Racing: Joint Lookahead and Steering-Gain Control with PPO
Authors: Mohamed Elgouhary, Amr S. El-Wakeel
First: 2026-02-20T17:48:21+00:00 · Latest: 2026-02-20T17:48:21+00:00
Abstract
Pure Pursuit (PP) is widely used in autonomous racing for real-time path tracking due to its efficiency and geometric clarity, yet performance is highly sensitive to how its key parameters (lookahead distance and steering gain) are chosen. Standard velocity-based schedules adjust these only approximately and often fail to transfer across tracks and speed profiles. We propose a reinforcement-learning (RL) approach that jointly chooses the lookahead Ld and a steering gain g online using Proximal Policy Optimization (PPO). The policy observes compact state features (speed and curvature taps) and outputs (Ld, g) at each control step. Trained in F1TENTH Gym and deployed in a ROS 2 stack, the policy drives PP directly (with light smoothing) and requires no per-map retuning. Across simulation and real-car tests, the proposed RL-PP controller that jointly selects (Ld, g) consistently outperforms fixed-lookahead PP, velocity-scheduled adaptive PP, and an RL lookahead-only variant, and it also exceeds a kinematic MPC raceline tracker under our evaluated settings in lap time, path-tracking accuracy, and steering smoothness, demonstrating that policy-guided parameter tuning can reliably improve classical geometry-based control.
中文标题/摘要
标题:基于自主赛车的纯追求调参学习:联合前瞻距离和转向增益控制的PPO方法
纯追求(PP)在自主赛车中广泛用于实时路径跟踪,因其效率和几何清晰度而被广泛应用,但其性能高度依赖于关键参数——前瞻距离和转向增益的选择。标准的速度基调度方法仅能近似调整这些参数,且往往无法在不同赛道和速度配置下转移。我们提出了一种强化学习(RL)方法,使用近端策略优化(PPO)在线联合选择前瞻距离Ld和转向增益g。策略观察紧凑的状态特征(速度和曲率样本),并在每个控制步骤输出(Ld, g)。该策略在F1TENTH Gym中训练并在ROS 2堆栈中部署,直接驱动PP(带有轻微平滑处理),无需针对每张地图重新调整。在模拟和真实车辆测试中,联合选择(Ld, g)的提出的RL-PP控制器始终优于固定前瞻距离的PP、速度调度自适应PP以及RL前瞻距离仅有的变体,并且在圈速、路径跟踪精度和转向平滑度方面也超过了我们评估设置下的动力学MPC赛车线追踪器,证明了策略引导的参数调优可以可靠地改进经典几何控制。
Summary / 总结
The paper addresses the sensitivity of Pure Pursuit (PP) performance to its key parameters, lookahead distance and steering gain, by proposing a reinforcement learning approach using Proximal Policy Optimization (PPO). The method learns these parameters online based on compact state features, and the policy is trained in F1TENTH Gym and deployed in a ROS 2 stack. Experimental results show that the proposed RL-PP controller outperforms fixed-lookahead PP, velocity-scheduled adaptive PP, and an RL lookahead-only variant, achieving better lap times, path-tracking accuracy, and steering smoothness.
论文针对纯追求(PP)的关键参数——前瞻距离和转向增益对性能的高度敏感性,提出了一种基于Proximal Policy Optimization (PPO)的强化学习方法。该方法根据紧凑的状态特征在线学习这些参数,并在F1TENTH Gym中训练,在ROS 2堆栈中部署。实验结果表明,所提出的RL-PP控制器在环路时间、路径跟踪精度和转向平滑度方面优于固定前瞻距离的PP、基于速度调度的自适应PP以及RL前瞻距离仅优化的变体。
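For readers unfamiliar with the controller being tuned, the classical Pure Pursuit steering law is sketched below. It assumes the lookahead point has already been selected on the path at distance Ld and expressed in the vehicle frame; treating the steering gain g as a multiplier on the resulting angle is an assumption, since the abstract does not say where the gain enters. The wheelbase value is a placeholder.

```python
import math

def pure_pursuit_steer(target_x, target_y, wheelbase, gain=1.0):
    """Steering angle (radians) toward a lookahead point given in the
    vehicle frame (x forward, y to the left)."""
    ld_sq = target_x ** 2 + target_y ** 2      # squared lookahead distance Ld^2
    if ld_sq == 0.0:
        raise ValueError("lookahead point coincides with the vehicle")
    curvature = 2.0 * target_y / ld_sq         # arc through vehicle and target
    return gain * math.atan(wheelbase * curvature)

# A policy in the spirit of the abstract would supply (Ld, g) at each step;
# here both are fixed by hand.
print(pure_pursuit_steer(1.0, 0.0, wheelbase=0.33))            # straight ahead
print(pure_pursuit_steer(1.0, 1.0, wheelbase=0.33, gain=0.8))  # target to the left
```

A shorter lookahead yields a larger curvature for the same lateral offset, which is why the choice of Ld trades tracking tightness against smoothness.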
Adaptive GR(1) Specification Repair for Liveness-Preserving Shielding in Reinforcement Learning
Authors: Tiberiu-Andrei Georgescu, Alexander W. Goodall, Dalal Alrajeh, Francesco Belardinelli, Sebastian Uchitel
First: 2025-11-04T14:27:28+00:00 · Latest: 2026-02-20T17:44:58+00:00
Abstract
Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent's actions remain compliant with formal specifications. Classical shielding approaches, however, are often static, in the sense that they assume fixed logical specifications and hand-crafted abstractions. While these static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop an adaptive shielding framework based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable and expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to automatically repair GR(1) specifications online, in a systematic and interpretable way. As a result, the shield evolves gracefully, keeping liveness achievable and weakening goals minimally and only when necessary. We consider two case studies, Minepump and Atari Seaquest, showing that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance compared with static shields.
中文标题/摘要
标题:自适应GR(1)规范修复以保持活锁保护的强化学习防护
防护在强化学习(RL)中广泛用于确保代理的行为符合正式规范。传统的防护方法通常是静态的,即它们假设固定的逻辑规范和手工构建的抽象。虽然这些静态防护在名义假设下提供了安全性,但在环境假设被违反时却无法适应。在本文中,我们基于广义反应性等级1(GR(1))规范开发了一个自适应防护框架,这是一种线性时序逻辑(LTL)的可处理且表达力强的片段,能够捕捉安全性和活锁属性。我们的方法在运行时检测环境假设的违反,并使用归纳逻辑编程(ILP)在线自动修复GR(1)规范,以系统且可解释的方式。这确保了防护能够优雅地演变,仅在必要时最小削弱目标以确保活锁的实现。我们考虑了两个案例研究:Minepump和Atari Seaquest;表明(i)在优化辅助奖励时,静态符号控制器通常严重次优,(ii)配备我们自适应防护的RL代理与静态防护相比,保持了接近最优的奖励和完美的逻辑合规性。
Summary / 总结
This paper addresses the limitations of static shielding in reinforcement learning by proposing an adaptive shielding framework based on GR(1) specifications. The method detects environment assumption violations at runtime and uses Inductive Logic Programming to repair the GR(1) specifications online, so that the shield adapts while liveness remains achievable and goals are weakened only when necessary. In the Minepump and Atari Seaquest case studies, static symbolic controllers are often severely suboptimal on auxiliary rewards, whereas RL agents with the adaptive shield maintain near-optimal reward and perfect logical compliance.
本文针对强化学习中静态屏蔽的局限性,提出了一种基于GR(1)规范的自适应屏蔽框架。该方法在运行时检测违规行为,并使用归纳逻辑编程自动修复规范,确保保持活性并在必要时仅最小削弱目标。实验结果表明,自适应屏蔽在奖励优化和逻辑合规性方面均优于静态屏蔽。
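The basic role of a shield, as described at the start of the abstract above, is to sit between the agent and the environment and override non-compliant actions. A minimal static version is sketched below; the runtime specification repair via ILP that the paper contributes is not shown. The safety predicate is a toy rule loosely modeled on a mine pump setting, and all names are hypothetical.

```python
def shield(proposed, state, is_safe, fallbacks):
    """Pass the agent's action through if the safety predicate allows it;
    otherwise substitute the first fallback action that it does allow."""
    if is_safe(state, proposed):
        return proposed
    for action in fallbacks:
        if is_safe(state, action):
            return action
    raise ValueError("no safe action available in this state")

# Toy safety rule (an assumption): never run the pump while methane is present.
def is_safe(state, action):
    return not (state["methane"] and action == "pump_on")

print(shield("pump_on", {"methane": True}, is_safe, ["pump_off"]))
print(shield("pump_on", {"methane": False}, is_safe, ["pump_off"]))
```

A static shield like this one keeps the same predicate forever, which is exactly the limitation the abstract points to when environment assumptions stop holding.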
Theory and interpretability of Quantum Extreme Learning Machines: a Pauli-transfer matrix approach
Authors: Markus Gross, Hans-Martin Rieser
First: 2026-02-20T17:33:27+00:00 · Latest: 2026-02-20T17:33:27+00:00
Comments: 34 pages, 12 figures
Abstract
Quantum reservoir computers (QRCs) have emerged as a promising approach to quantum machine learning, since they utilize the natural dynamics of quantum systems for data processing and are simple to train. Here, we consider n-qubit quantum extreme learning machines (QELMs) with continuous-time reservoir dynamics. QELMs are memoryless QRCs capable of various ML tasks, including image classification and time series forecasting. We apply the Pauli transfer matrix (PTM) formalism to theoretically analyze the influence of encoding, reservoir dynamics, and measurement operations, including temporal multiplexing, on the QELM performance. This formalism makes explicit that the encoding determines the complete set of (nonlinear) features available to the QELM, while the quantum channels linearly transform these features before they are probed by the chosen measurement operators. Optimizing a QELM can therefore be cast as a decoding problem in which one shapes the channel-induced transformations such that task-relevant features become available to the regressor. The PTM formalism allows one to identify the classical representation of a QELM and thereby guide its design towards a given training objective. As a specific application, we focus on learning nonlinear dynamical systems and show that a QELM trained on such trajectories learns a surrogate-approximation to the underlying flow map.
中文标题/摘要
标题:量子极限学习机的理论与可解释性:Pauli传输矩阵方法
量子蓄水库计算机(QRCs)已成为量子机器学习的一种有前途的方法,因为它们利用量子系统的自然动力学进行数据处理,并且易于训练。在这里,我们考虑具有连续时间蓄水库动力学的n量子比特量子极限学习机(QELMs)。QELMs是无记忆的QRCs,能够执行各种机器学习任务,包括图像分类和时间序列预测。我们应用Pauli传输矩阵(PTM)形式化方法来理论分析编码、蓄水库动力学和测量操作(包括时间复用)对QELM性能的影响。这种形式化方法明确指出,编码决定了QELM可用的完整集(非线性)特征,而量子通道线性地将这些特征转换为由所选测量算子探查的形式。优化QELM可以被表述为一个解码问题,在此问题中,通过塑造由通道引起的变换,使相关特征对回归器可用。PTM形式化方法允许识别QELM的经典表示,从而指导其设计以实现给定的训练目标。作为具体应用,我们专注于学习非线性动力系统,并展示了QELM在这些轨迹上的训练学习到了底层流形的近似替代。
Summary / 总结
This paper explores the theory and interpretability of quantum extreme learning machines (QELMs) using the Pauli transfer matrix (PTM) formalism. The authors analyze how encoding, reservoir dynamics, and measurement operations affect QELM performance, showing that encoding determines the set of nonlinear features available to the QELM, while quantum channels linearly transform these features. The PTM formalism helps in optimizing QELMs by shaping the channel-induced transformations to make task-relevant features available. As an application, the paper demonstrates that QELMs can learn surrogate approximations to the underlying flow map of nonlinear dynamical systems.
研究旨在通过Pauli转移矩阵(PTM)方法增强量子极端学习机(QELM)的理论理解和可解释性。方法涉及分析编码、量子存储器动力学和测量操作对QELM性能的影响。关键发现表明,编码决定了QELM可用的非线性特征集,而量子通道在测量前线性地变换这些特征。研究还表明,优化QELM可以被表述为一个解码问题,并将其应用于学习非线性动力系统,其中QELM在这些轨迹上进行训练,能够学习到底层流形的近似表示。
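As a rough illustration of the Pauli transfer matrix (PTM) formalism mentioned in the abstract, the sketch below computes the PTM of a single-qubit channel from its Kraus operators using the standard definition R[i, j] = Tr(P_i E(P_j)) / 2. The function names and the depolarizing-channel example are illustrative assumptions and are not taken from the paper, which treats n-qubit reservoirs with continuous-time dynamics.

```python
import numpy as np

# Single-qubit Pauli basis {I, X, Y, Z}.
PAULIS = [
    np.array([[1, 0], [0, 1]], dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def pauli_transfer_matrix(kraus_ops):
    """Return the 4x4 PTM R with R[i, j] = Tr(P_i E(P_j)) / 2."""
    ptm = np.zeros((4, 4))
    for j, p_j in enumerate(PAULIS):
        # Apply the channel E(rho) = sum_k K rho K^dagger to the basis element P_j.
        out = sum(k @ p_j @ k.conj().T for k in kraus_ops)
        for i, p_i in enumerate(PAULIS):
            ptm[i, j] = np.real(np.trace(p_i @ out)) / 2
    return ptm


def depolarizing_kraus(p):
    """Kraus operators of a depolarizing channel with error probability p (illustrative choice)."""
    return [np.sqrt(1 - p) * PAULIS[0]] + [np.sqrt(p / 3) * q for q in PAULIS[1:]]


# A unitary X gate flips the sign of the Y and Z components.
ptm_x = pauli_transfer_matrix([PAULIS[1]])
# Depolarizing noise shrinks the X, Y, Z components by the factor 1 - 4p/3.
ptm_dep = pauli_transfer_matrix(depolarizing_kraus(0.3))
```

In this representation a channel acts as an ordinary real matrix on the vector of Pauli expectation values, which is the sense in which the abstract says channels transform the encoded features linearly.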
Zero-shot Interactive Perception
Authors: Venkatesh Sripada, Frank Guerin, Amir Ghalamzan
First: 2026-02-20T17:30:25+00:00 · Latest: 2026-02-20T17:30:25+00:00
Comments: Original manuscript submitted on April 24, 2025. Timestamped and publicly available on OpenReview: https://openreview.net/forum?id=7MhpFcr5Nx
Abstract
Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.
中文标题/摘要
标题:零样本交互感知
交互感知(IP)使机器人能够在其工作空间中提取隐藏信息并通过物理交互物体和改变环境状态来执行操作计划——这对于解决复杂、部分可观测场景中的遮挡和模糊至关重要。我们提出了零样本IP(ZS-IP),这是一种新颖的框架,将多策略操作(推和抓取)与记忆驱动的视觉语言模型(VLM)结合,以指导机器人交互并解决语义查询。ZS-IP 结合了三个关键组件:(1)增强观察(EO)模块,该模块通过常规关键点和我们提出的推线——一种针对推操作定制的新型2D视觉增强,增强了VLM的视觉感知,(2)记忆引导的操作模块,通过上下文查找强化语义推理,以及(3)基于VLM输出执行推、拉或抓取的机器人控制器。与针对拾取和放置优化的基于网格的增强不同,推线捕捉接触丰富的操作的利用机会,显著提高了推操作的性能。我们在具有不同遮挡和任务复杂度的7-DOF Franka Panda 手臂上评估了ZS-IP。我们的实验表明,ZS-IP 在推操作任务中优于被动和视角基于的感知技术,如基于标记的视觉提示(MOKA),同时保持非目标元素的完整性。
Summary / 总结
Zero-Shot Interactive Perception (ZS-IP) is a framework that combines multi-strategy manipulation with a memory-driven Vision Language Model to resolve occlusions and ambiguity in complex scenarios. It includes an Enhanced Observation module that uses pushlines for pushing actions, a memory-guided action module for semantic reasoning, and a robotic controller for executing actions. Experiments show that ZS-IP outperforms passive and viewpoint-based techniques like MOKA, especially in pushing tasks, while maintaining the integrity of non-target elements.
研究旨在通过交互感知使机器人能够在复杂环境中解决遮挡和模糊问题,这涉及与物体进行物理互动。提出的零样本交互感知(ZS-IP)框架结合了多策略操作和基于记忆的视觉语言模型来指导机器人互动。关键组件包括增强观察模块,使用推线进行推动作,记忆引导动作模块进行语义推理,以及机器人控制器根据视觉语言模型输出执行动作。实验表明,ZS-IP 在推动作任务中优于被动和视角基于的感知技术,同时保持非目标元素的完整性。
Visual Planning: Let's Think Only with Images
Authors: Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, Ivan Vulić
Venue: ICLR 2026 Oral
First: 2025-05-16T16:17:22+00:00 · Latest: 2026-02-20T17:09:35+00:00
Comments: ICLR 2026 (Oral)
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements in planning in a selection of representative visual navigation tasks, FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
中文标题/摘要
标题:视觉规划:仅用图像思考
近期大型语言模型(LLMs)及其多模态扩展(MLLMs)在多种任务中的机器推理能力有了显著提升。然而,这些模型主要依赖纯文本作为表达和结构化推理的媒介,即使存在视觉信息也是如此。在本文中,我们提出,对于涉及空间和几何信息的任务,语言可能不是最自然或有效的推理模态。受此启发,我们提出了一种新的范式——视觉规划,通过纯视觉表示进行规划,作为语言推理的补充渠道,特别适用于“视觉优先”任务。在这一范式中,规划通过一系列图像序列执行,这些图像编码视觉域中的逐步推理,类似于人类如何草图或可视化未来动作。我们引入了一种新的强化学习框架——基于GRPO的强化学习视觉规划(VPRL),在训练后增强大型视觉模型,显著提高了在FrozenLake、迷宫和MiniBehavior等代表性视觉导航任务中的规划性能。我们的视觉规划范式优于所有在纯文本空间中进行推理的规划变体。我们的结果确立了视觉规划作为语言推理的有效补充,为受益于直观、基于图像的推理的任务开辟了新的途径。
Summary / 总结
This work introduces Visual Planning, a new paradigm for reasoning tasks that emphasizes visual representations over text. Motivated by the limitations of text-based reasoning in spatial and geometrical tasks, the authors propose using sequences of images to encode step-by-step inference. They develop a reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), which significantly improves performance in visual navigation tasks such as FrozenLake, Maze, and MiniBehavior. The results demonstrate that visual planning outperforms text-only reasoning methods, suggesting its potential as a valuable supplement to language-based reasoning.
本文提出了一种新的视觉规划范式,通过纯粹的视觉表示来进行规划,特别适用于涉及空间和几何信息的任务。作者提出了一种基于强化学习的框架,视觉规划通过强化学习(VPRL),利用GRPO对大型视觉模型进行后训练。研究结果表明,这种视觉规划方法在FrozenLake、迷宫和MiniBehavior等视觉导航任务中优于纯文本推理,展示了其作为语言推理补充的潜力。
Quantum-enhanced satellite image classification
Authors: Qi Zhang, Anton Simen, Carlos Flores-Garrigós, Gabriel Alvarado Barrios, Paolo A. Erdman, Enrique Solano, Aaron C. Kemp, Vincent Beltrani, Vedangi Pathak, Hamed Mohammadbagherpoor
First: 2026-02-20T17:02:16+00:00 · Latest: 2026-02-20T17:02:16+00:00
Abstract
We demonstrate the application of a quantum feature extraction method to enhance multi-class image classification for space applications. By harnessing the dynamics of many-body spin Hamiltonians, the method generates expressive quantum features that, when combined with classical processing, lead to quantum-enhanced classification accuracy. Using a strong and well-established ResNet50 baseline, we achieved a maximum classical accuracy of 83%, which can be improved to 84% with a transfer learning approach. In contrast, our quantum-classical method increases accuracy to 87%, demonstrating a clear and reproducible improvement over robust classical approaches. Implemented on several of IBM's quantum processors, our hybrid quantum-classical approach delivers consistent gains of 2-3% in absolute accuracy. These results highlight the practical potential of current and near-term quantum processors in high-stakes, data-driven domains such as satellite imaging and remote sensing, while suggesting broader applicability in real-world machine learning tasks.
中文标题/摘要
标题:量子增强的卫星图像分类
我们展示了将量子特征提取方法应用于增强空间应用中的多类图像分类。通过利用多体自旋哈密顿量的动力学,该方法生成了具有表现力的量子特征,当与经典处理结合时,可以提高量子增强的分类准确性。使用强大的ResNet50基线,我们实现了83%的最大经典准确性,通过迁移学习方法可以提高到84%。相比之下,应用我们的量子-经典方法,性能提高到87%的准确性,证明了与稳健的经典方法相比的明显且可重复的改进。在IBM的几种量子处理器上实现,我们的混合量子-经典方法在绝对准确性上提供了2-3%的一致性增益。这些结果突显了当前和短期内量子处理器在高风险、数据驱动领域如卫星成像和遥感中的实际潜力,同时也暗示了在实际机器学习任务中的更广泛适用性。
Summary / 总结
The research applies a quantum feature extraction method to multi-class image classification for space applications. Combining quantum features with classical processing raises accuracy to 87%, compared with 83% for a classical ResNet50 baseline and 84% with transfer learning. Implemented on several of IBM's quantum processors, the hybrid approach delivers consistent gains of 2-3% in absolute accuracy. This demonstrates the potential of current and near-term quantum processors in satellite imaging and remote sensing tasks.
研究旨在利用量子特征提取方法提升空间应用中的多类图像分类。该方法利用多体自旋哈密顿量生成表达性强的量子特征,然后与经典处理相结合。研究表明,量子-经典混合方法将分类准确率从83%提高到87%,超过了稳健的经典方法。在IBM的量子处理器上,混合方法一致实现了2-3%的更高准确率,表明当前和近期内的量子处理器在卫星成像和遥感等高风险数据驱动领域具有实际应用潜力。
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Authors: Pavithra PM Nair, Preethu Rose Anish
Venue: AAAI 2026
First: 2026-02-20T16:57:44+00:00 · Latest: 2026-02-20T16:57:44+00:00
Abstract
In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara's explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
中文标题/摘要
标题:Vichara:印度司法系统中的上诉判决预测与解释
在印度等司法案件积压严重的地区,人工智能为法律判决预测提供了变革性的潜力。上诉案件是高等法院对低级法院判决进行审查而发布的正式决定,构成了积压案件的重要部分。为此,我们提出了Vichara,一种针对印度司法系统的新型框架,用于预测和解释上诉判决。Vichara 处理英文上诉案件程序文件,并将其分解为决策点。决策点是包含法律问题、决定依据、结果、推理和时间背景的离散法律判断。结构化的表示形式隔离了核心判断及其背景,从而实现准确的预测和可解释的解释。Vichara 的解释遵循由IRAC(问题-规则-应用-结论)框架启发并适应印度法律推理的结构化格式。这增强了可解释性,使法律专业人士能够高效地评估预测的合理性。我们使用PredEx数据集和印度法律文件语料库(ILDC_expert)的专家注释子集对Vichara进行了评估,使用了四个大型语言模型:GPT-4o mini、Llama-3.1-8B、Mistral-7B 和 Qwen2.5-7B。Vichara 在两个数据集上的表现均优于现有判决预测基准,GPT-4o mini 在 PredEx 上的性能最高(F1: 81.5,在 ILDC_expert 上为 80.3),其次是 Llama-3.1-8B。生成的解释在清晰度、关联性和有用性方面的评估中,GPT-4o mini 的可解释性更优。
Summary / 总结
Vichara is a framework designed to predict and explain appellate judgments in the Indian judicial system. It processes appellate case documents and decomposes them into decision points, which are then used to make accurate predictions and provide interpretable explanations following the IRAC framework. Vichara outperforms existing benchmarks, with GPT-4o mini achieving the highest F1 scores on both evaluation datasets, and superior interpretability as rated by human evaluators.
Vichara 是一个旨在预测和解释印度司法系统中上诉判决的框架。它处理上诉案件文件,将其分解为包含法律问题、结果和推理的决策点。Vichara 使用受 IRAC 框架启发的结构化格式提供可解释的解释。在两个数据集上使用四种大型语言模型进行评估,Vichara 超过了现有基准,GPT-4o mini 在两个数据集上的 F1 分数最高。人类评估还表明生成解释的可读性更优。
FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
Authors: Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong
First: 2025-11-04T03:25:17+00:00 · Latest: 2026-02-20T16:21:35+00:00
Abstract
Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
中文标题/摘要
标题:FATE:多难度层次形式代数基准系列
近年来,大型语言模型(LLMs)在形式定理证明方面展现了令人印象深刻的性能,特别是在IMO等竞赛数学基准测试中。然而,这些竞赛未能反映现代数学研究的深度、广度和抽象性。为弥合这一差距,我们引入了FATE(形式代数定理评估),这是一个新的形式代数基准系列,旨在引导高级数学推理的发展。我们提出了两个新组件,FATE-H和FATE-X,每个都包含100个抽象和交换代数问题。FATE系列涵盖了从本科练习到超过博士资格考试难度的问题范围。值得注意的是,FATE-X是第一个超越博士水平考试难度和Mathlib库覆盖范围的形式基准。我们对最新一代LLM证明器在这项新基准上的评估显示,与竞赛数学相比,它们的表现存在显著差距:最佳模型在FATE-H上的准确率为3%(pass@64),在FATE-X上为0%。我们的两阶段评估表明,模型的自然语言推理比其形式化推理的能力更为准确。我们系统地分类了在这一形式化过程中出现的常见错误。此外,一项比较研究显示,专门的证明器在自然语言阶段的表现可能不如通用模型,从而降低了其准确性。我们认为,FATE提供了一个稳健且具有挑战性的基准,为通往研究级形式数学推理的道路上设立了必要的检查点。
Summary / 总结
FATE is a new benchmark series for formal algebra designed to evaluate advanced mathematical reasoning, spanning from undergraduate exercises to problems exceeding PhD qualifying exams. It includes two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. Evaluations of state-of-the-art LLM provers show poor performance, with the best model achieving only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. A two-stage evaluation reveals that models are notably more accurate at natural-language reasoning than at formalizing that reasoning, and a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. FATE provides a robust benchmark for formal mathematical reasoning.
FATE 是一个新的形式代数基准系列,旨在评估大型语言模型(LLMs)在高级数学推理方面的能力。它包括两个部分,FATE-H 和 FATE-X,每个部分包含 100 个问题,难度范围从本科到博士水平。评估结果显示,最先进的 LLM 证明器表现不佳,在 FATE-H 上仅达到 3% 的准确率,在 FATE-X 上为 0%。研究发现,模型的自然语言推理比其形式化推理更准确,而专门的证明器可能不如通用模型在这一过程中有效。
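The abstract reports accuracy as pass@64 but does not say how it was computed. The sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled attempts of which c are correct; applying it to FATE is an assumption, and the function names and the toy per-problem counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct,
    given n samples drawn for a problem of which c were correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts, n, k):
    """Average pass@k over problems; correct_counts[i] is the number of
    correct samples for problem i."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Toy data: 100 problems, 64 attempts each, only 3 problems ever solved.
toy_counts = [1, 2, 5] + [0] * 97
toy_score = benchmark_score(toy_counts, n=64, k=64)
```

With n equal to k, the estimator reduces to asking whether any of the 64 attempts succeeded, so the toy score is simply the fraction of problems solved at least once.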
GRPO is Secretly a Process Reward Model
Authors: Michael Sullivan, Alexander Koller
Venue: ICML 2026
First: 2025-09-25T13:40:36+00:00 · Latest: 2026-02-20T16:20:57+00:00
Comments: 15 pages, 7 figures; under review at ICML 2026
Abstract
Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($λ$-GRPO), and show that LLMs tuned with $λ$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks, and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
中文标题/摘要
标题:GRPO实际上是过程奖励模型
过程奖励模型(PRMs)在强化学习(RL)中允许精细的信用分配,而结果奖励模型(ORMs)则为整个轨迹分配单一奖励。然而,我们在本文中提供了理论证明,表明配备ORM的Group Relative Policy Optimization(GRPO)RL算法实际上等同于一个具备非平凡的、基于蒙特卡洛的PRM的PRM意识RL目标(在轻微假设下)。利用GRPO-as-a-PRM的框架,我们识别出GRPO目标中的一个缺陷,该缺陷与过程步骤和奖励的不平衡相互作用,从而阻碍了探索和利用(在不同条件下)。我们提出了一种简单的算法修改($λ$-GRPO)来缓解这一缺陷,并展示了使用$λ$-GRPO调优的LLMs在下游推理任务中优于使用标准GRPO调优的LLMs,并且达到最佳性能的速度更快。这些结果表明,我们可以通过利用vanilla GRPO算法中隐含的内置PRM结构来提升模型性能,而无需使用显式的PRM,并且对训练时间和成本的影响微乎其微。
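For readers unfamiliar with GRPO, the sketch below illustrates the group-relative advantage that standard GRPO derives from outcome rewards: each completion's reward is normalized against the other completions sampled for the same prompt, and the resulting value is shared by every token of that completion. It does not reproduce the paper's PRM equivalence or the $λ$-GRPO modification; the function names and toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, completion_lengths):
    """Give every token of completion i the same advantage value."""
    return [[a] * n for a, n in zip(advantages, completion_lengths)]


# Toy group: four completions for one prompt, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
token_advantages = broadcast_to_tokens(advantages, [3, 5, 2, 4])
```

Because the advantage is constant across a completion's tokens, any step-level credit assignment has to come from how completions in the group overlap, which is the structure the abstract describes as a hidden PRM.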
Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis
Authors: Ziteng Cui, Shuhong Liu, Xiaoyu Dong, Xuangeng Chu, Lin Gu, Ming-Hsuan Yang, Tatsuya Harada
Venue: CVPR 2025
First: 2026-02-20T16:20:50+00:00 · Latest: 2026-02-20T16:20:50+00:00
Comments: Journal extension version of CVPR 2025 paper: arXiv:2504.01503
Abstract
High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.
中文标题/摘要
标题:基于视图自适应曲线调整的色彩和亮度统一校正方法及其在鲁棒三维新颖视图合成中的应用
在现实环境中获得高质量图像仍然具有挑战性,因为存在复杂的光照变化和相机成像管道的固有限制。这些问题在多视角捕获中尤为严重,不同光照、传感器响应和图像信号处理器(ISP)配置之间的差异引入了光度和色彩不一致,违反了现代三维新颖视图合成(NVS)方法,包括神经辐射场(NeRF)和三维高斯点积(3DGS)方法的假设,导致重建和渲染质量下降。我们提出了一种基于3DGS的鲁棒NVS框架Luminance-GS++,以应对多种光照条件。该方法结合了全局视图自适应亮度调整和局部像素级残差细化,以实现精确的色彩校正。我们还设计了无监督目标,以同时强制执行亮度校正和多视角几何和光度一致性。大量实验表明,该方法在低光、过曝和复杂亮度和色彩变化等具有挑战性的场景中表现出最先进的性能。与先前修改底层表示的方法不同,我们的方法保留了显式的3DGS公式,提高了重建保真度,同时保持了实时渲染效率。
Summary / 总结
The paper addresses the challenge of photometric and chromatic inconsistencies in multi-view capture due to complex lighting conditions and camera limitations. It introduces Luminance-GS++, a 3DGS-based framework that combines global view-adaptive lightness adjustment with local pixel-wise color refinement. The method enforces lightness correction and multi-view consistency through unsupervised objectives. Experiments show that Luminance-GS++ outperforms existing methods in low-light, overexposure, and complex lighting scenarios, maintaining real-time rendering efficiency while improving reconstruction fidelity.
论文解决了多视角捕捉中由于复杂光照条件导致的光度和色彩不一致问题。它提出了Luminance-GS++,一种基于3DGS的框架,结合了全局视角自适应亮度调整和局部像素级色彩校正。该方法通过无监督目标同时确保亮度校正和多视角几何及光度一致性,实现了在低光照、过曝和复杂光照场景中的优越性能,优于现有的NeRF和3DGS等方法。
Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting
Authors: Tianyi Song, Danail Stoyanov, Evangelos Mazomenos, Francisco Vasconcelos
First: 2026-02-20T16:14:21+00:00 · Latest: 2026-02-20T16:14:21+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
中文标题/摘要
标题:Diff2DGS:通过2D高斯点绘制可靠重建遮挡的手术场景
实时重建可变形的手术场景对于推进机器人手术、改善外科医生指导和实现自动化至关重要。最近的方法能够从达芬奇机器人手术视频中实现密集重建,通过图形加速,高斯点绘制(GS)提供了实时性能。然而,遮挡区域的重建质量仍然有限,深度准确性也未得到充分评估,因为基准如EndoNeRF和StereoMIS缺乏3D真实值。我们提出了一种新颖的两阶段框架Diff2DGS,用于可靠重建遮挡的手术场景。在第一阶段,一种基于扩散的视频模块结合时间先验,以高空间-时间一致性填补器械遮挡的组织。在第二阶段,我们通过可学习变形模型(LDM)适应2D高斯点绘制(2DGS),以捕捉动态组织变形和解剖几何结构。我们还通过在SCARED数据集上进行定量深度准确性分析,超越了先前的图像质量度量标准。Diff2DGS在EndoNeRF和StereoMIS上分别达到38.02 dB PSNR和34.40 dB,优于最先进的方法。此外,我们的实验表明,仅优化图像质量并不一定能够转化为最佳的3D重建准确性。为了解决这个问题,我们进一步优化了重建3D结果的深度质量,确保除了高保真外观之外,还具有更忠实的几何结构。
Summary / 总结
The research aims to improve the reconstruction of occluded surgical scenes in real-time for robotic surgery. It proposes Diff2DGS, a two-stage framework combining diffusion-based video inpainting and 2D Gaussian Splatting with a Learnable Deformation Model. The method outperforms existing approaches in both appearance and geometry, achieving high PSNR scores on benchmark datasets. The study also highlights the importance of optimizing depth accuracy for better 3D reconstruction.
研究旨在提高变形手术场景的实时重建,特别是在遮挡区域的表现,以改善机器人手术的指导。提出的Diff2DGS框架包括两个阶段:一个基于扩散的模块用于修复遮挡的组织,以及一个带有可学习变形模型的2D高斯点绘制方法,以捕捉动态组织变形。该方法在外观和几何结构上均优于现有方法,分别在EndoNeRF和StereoMIS上达到38.02 dB PSNR和34.40 dB。研究还指出,仅优化图像质量可能不会导致最佳的3D重建精度,因此进一步优化重建3D结果的深度质量,以实现更准确的几何结构和高保真的外观。
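The PSNR figures quoted above follow the standard definition 10 * log10(MAX^2 / MSE). The sketch below computes it for two images given as flat lists of intensities in [0, 1]; it is a generic illustration, not the evaluation code used for EndoNeRF or StereoMIS, and the function name is hypothetical.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    each given as a flat sequence of pixel intensities."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Toy example: a uniform error of 0.1 on every pixel gives an MSE of 0.01.
toy_psnr = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Since PSNR only measures per-pixel appearance error, a high value says nothing about depth error, which is consistent with the abstract's point that image quality alone does not guarantee accurate 3D geometry.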
View Invariant Learning for Vision-Language Navigation in Continuous Environments
Authors: Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, Mark Crowley
First: 2025-07-05T18:04:35+00:00 · Latest: 2026-02-20T16:14:13+00:00
Comments: This paper is accepted to RA-L 2026
Abstract
Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e. variations in camera height and viewing angle. Here we introduce a more general scenario, V$^2$-VLNCE (VLNCE with Varied Viewpoints) and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components. Empirical results show that our method outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Evaluation of VIL in standard VLNCE settings shows that despite being trained for varied viewpoints, VIL often still improves performance. On the harder RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method. We further evaluate VIL for simulated camera placements derived from real robot configurations (e.g. Stretch RE-1, LoCoBot), showing consistent improvements of performance. Finally, we present a proof-of-concept real-robot evaluation in two physical environments using a panoramic RGB sensor combined with LiDAR. The code is available at https://github.com/realjoshqsun/V2-VLNCE.
中文标题/摘要
标题:连续环境中的视景语言导航的视点不变学习
连续环境中的视景语言导航(VLNCE),其中智能体遵循指令自由移动以到达目的地,是嵌入式人工智能中的关键研究问题。然而,大多数现有方法对视点变化敏感,即相机高度和视角的变化。我们在此引入了一个更通用的场景,V$^2$-VLNCE(具有变化视点的VLNCE),并提出了一种视点不变后训练框架,称为VIL(视点不变学习),使现有的导航策略对相机视点的变化更具鲁棒性。VIL 使用对比学习框架学习稀疏且视点不变的特征。我们还引入了一种教师-学生框架用于路径预测模块,这是VLNCE基线标准组成部分之一,其中视点依赖的教师模型将知识提炼到视点不变的学生模型中。我们采用端到端的训练范式联合优化这些组件。实验证明,我们的方法在V$^2$-VLNCE的两个标准基准数据集R2R-CE和RxR-CE上,基于成功率的度量,比最先进的方法高出8-15%。在标准VLNCE设置中评估VIL表明,尽管训练的是变化视点,VIL 通常仍然能提高性能。在更难的RxR-CE数据集上,我们的方法在所有指标上也达到了最先进的性能。这表明添加VIL 不会削弱标准视点性能,可以作为插件即用的后训练方法。我们进一步评估了VIL 对从真实机器人配置(例如Stretch RE-1,LoCoBot)派生的模拟相机放置的表现,显示出一致的性能改进。最后,我们使用全景RGB传感器结合LiDAR在两个物理环境中进行了真实机器人评估。代码可在https://github.com/realjoshqsun/V2-VLNCE/ 获取。
Summary / 总结
The research addresses the challenge of viewpoint sensitivity in Vision-Language Navigation in Continuous Environments (VLNCE) by introducing V$^2$-VLNCE and proposing a view-invariant post-training framework called VIL. VIL uses contrastive learning to learn sparse and view-invariant features and includes a teacher-student framework for the Waypoint Predictor Module. Empirical results show that VIL outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15% in Success Rate on the R2R-CE and RxR-CE datasets. In the standard VLNCE setting it often still improves performance and achieves state-of-the-art results on the harder RxR-CE dataset across all metrics, demonstrating its effectiveness in improving navigation robustness to viewpoint changes.
研究通过引入V$^2$-VLNCE和提出基于对比学习的视点不变后训练框架VIL来解决视觉-语言导航在连续环境中的视点敏感性问题。VIL使用对比学习框架学习稀疏且视点不变的特征,并包含一个教师-学生框架用于航点预测模块。实验结果表明,VIL在两个基准数据集上的成功率为8-15%的性能上优于现有最佳方法,并且在标准VLNCE设置和更难的RxR-CE数据集上也提高了性能,实现了最佳性能。VIL还在模拟真实机器人配置和物理环境中的全景RGB传感器与LiDAR结合的实机器人评估中表现出一致的性能提升。
Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation
Authors: Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani
First: 2026-02-20T16:07:31+00:00 · Latest: 2026-02-20T16:07:31+00:00
Comments: Project page: https://intelligolabs.github.io/lots/
Abstract
Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.
中文标题/摘要
标题:通过局部文本和素描配对实现多级条件化以生成时尚图像
素描为设计师提供了一种简洁而富有表现力的早期时尚创意媒介,用于指定结构、轮廓和空间关系,而文本描述则补充素描以传达材料、颜色和风格细节。有效结合文本和视觉模态需要在利用文本局部属性指导时遵循素描的视觉结构。我们提出了LOcalized Text and Sketch with多级指导(LOTS)框架,通过结合全局素描指导和多个局部素描-文本配对来增强时尚图像生成。LOTS采用多级条件化阶段独立地在共享潜在空间中编码局部特征,同时保持全局结构协调。然后,扩散配对指导阶段通过注意力指导在扩散模型的多步去噪过程中整合局部和全局条件。为了验证我们的方法,我们开发了Sketchy,这是第一个每张图像提供多个文本-素描配对的时尚数据集。Sketchy提供了高质量、干净且专业外观的素描,具有一致的结构。为了评估该设置之外的鲁棒性,我们还包含了一个“野外”分割,其中包含非专家素描,具有更高的变化性和不完美性。实验表明,我们的方法增强了全局结构的一致性,同时利用了更丰富的局部语义指导,实现了对现有最佳方法的改进。数据集、平台和代码已公开。
Summary / 总结
This paper addresses the challenge of integrating textual and sketch-based guidance for fashion image generation. The proposed LOTS framework uses a multi-level conditioning stage to encode local features while maintaining global structural coherence, and a diffusion pair guidance stage to integrate both local and global conditioning. The authors introduce the Sketchy dataset, which includes multiple text-sketch pairs per image, and also provides an 'in the wild' split with non-expert sketches. Experiments show that LOTS improves global structural adherence and leverages richer localized semantic guidance, outperforming existing methods.
研究旨在通过结合局部文本和素描输入来提升时尚图像生成。方法LOTS使用多级条件阶段来编码局部特征同时保持全局结构,并通过扩散模型中的注意力引导来整合局部和全局条件。实验表明,LOTS增强了全局结构的一致性并利用了更丰富的局部语义指导,优于现有方法。数据集、平台和代码已公开可用。
JPmHC Dynamical Isometry via Orthogonal Hyper-Connections
Authors: Biswa Sengupta, Jinhua Wang, Leo Brunswic
First: 2026-02-20T16:06:01+00:00 · Latest: 2026-02-20T16:06:01+00:00
Abstract
Recent advances in deep learning, exemplified by Hyper-Connections (HC), have expanded the residual connection paradigm by introducing wider residual streams and diverse connectivity patterns. While these innovations yield significant performance gains, they compromise the identity mapping property of residual connections, leading to training instability, limited scalability, and increased memory overhead. To address these challenges, we propose JPmHC (Jacobian-spectrum Preserving manifold-constrained Hyper-Connections), a framework that replaces identity skips with a trainable linear mixer acting on n parallel streams while explicitly controlling gradient conditioning. By constraining the mixer M on operator-norm-bounded manifolds (e.g., bistochastic, Stiefel, Grassmann), JPmHC prevents gradient pathologies and enhances stability. JPmHC introduces three key contributions: (i) a free-probability analysis that predicts Jacobian spectra for structured skips, providing actionable design rules for mixer selection; (ii) memory-efficient implicit differentiation for fixed-point projections, reducing activation memory and synchronization overhead; and (iii) a Stiefel-constrained mixer via Cayley transforms, ensuring orthogonality without post-hoc normalization. Empirical evaluations on ARC-AGI demonstrate that JPmHC achieves faster convergence, higher accuracy, and lower computational cost compared to bistochastic baselines. As a flexible and scalable extension of HC, JPmHC advances spectrum-aware, stable, and efficient deep learning, offering insights into topological architecture design and foundational model evolution.
中文标题/摘要
标题:JPmHC 动态同构通过正交超连接
深度学习的最新进展,以超连接(HC)为例,通过引入更宽的残差流和多样的连接模式,扩展了残差连接的范式。虽然这些创新带来了显著的性能提升,但它们破坏了残差连接的恒等映射特性,导致训练不稳定、可扩展性受限以及内存开销增加。为了解决这些挑战,我们提出了JPmHC(保持雅可比谱的流形约束超连接),该框架用一个可训练的线性混合器替换恒等跳连,同时显式控制梯度条件。通过在算子范数有界流形(例如,双随机、Stiefel、Grassmann)上约束混合器M,JPmHC防止了梯度病态现象并增强了稳定性。JPmHC引入了三个关键贡献:(i)自由概率分析,预测结构跳连的雅可比谱,提供可操作的设计规则以选择混合器;(ii)固定点投影的内存高效隐式梯度法,减少激活内存和同步开销;(iii)通过Cayley变换约束的Stiefel混合器,确保正交性而无需后处理归一化。在ARC-AGI上的实证评估表明,与双随机基线相比,JPmHC实现了更快的收敛速度、更高的准确性和更低的计算成本。作为HC的灵活且可扩展的扩展,JPmHC推进了谱感知、稳定和高效的深度学习,提供了拓扑架构设计和基础模型演化的见解。
Summary / 总结
The paper addresses the challenges of using Hyper-Connections in deep learning, such as training instability and increased memory overhead. It proposes JPmHC, which uses a trainable linear mixer on multiple streams while controlling gradient conditioning. Key contributions include a free-probability analysis for mixer selection, memory-efficient implicit differentiation, and a Stiefel-constrained mixer. Experiments show JPmHC improves convergence, accuracy, and computational efficiency compared to bistochastic baselines.
研究旨在解决深度学习模型中Hyper-Connections(HC)带来的训练不稳定性和内存开销增加的问题。JPmHC方法通过在多个并行流上引入可训练的线性混合器,并约束混合器以保持梯度条件,来解决这些问题。关键贡献包括用于预测雅可比谱的自由概率分析、内存高效的隐式微分以及通过Cayley变换约束的Stiefel混合器。实验结果表明,JPmHC在ARC-AGI上提高了收敛速度、准确性和计算效率,优于双随机基线。
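The Cayley-transform construction named in the JPmHC abstract can be illustrated with a minimal sketch. It assumes a square mixer built from an unconstrained matrix `w` (a hypothetical name); the paper's actual parameterisation, stream count, and training setup are not described in this digest.

```python
import numpy as np

def cayley_mixer(w: np.ndarray) -> np.ndarray:
    """Map an unconstrained square matrix to an orthogonal one via the Cayley transform."""
    a = w - w.T                      # skew-symmetric: a.T == -a
    eye = np.eye(w.shape[0])
    # Q = (I - A)^{-1} (I + A); I - A is invertible for any skew-symmetric A
    return np.linalg.solve(eye - a, eye + a)

rng = np.random.default_rng(0)
mixer = cayley_mixer(rng.normal(size=(4, 4)))      # 4 parallel streams (illustrative)
orth_error = np.abs(mixer.T @ mixer - np.eye(4)).max()
```

Because the result is orthogonal by construction, no normalisation step is needed after a gradient update to `w`, which is the property the abstract highlights.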
VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Authors: Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
First: 2026-02-20T16:05:06+00:00 · Latest: 2026-02-20T16:05:06+00:00
Abstract
Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.
中文标题/摘要
标题:VeriSoftBench:面向Lean的开源形式验证基准
大型语言模型在交互定理证明方面取得了令人瞩目的成果,特别是在Lean中。然而,大多数基于LLM的证明自动化基准来自Mathlib生态系统的数学,而软件验证中的证明则在定义丰富的代码库中开发,包含大量项目特定的库。我们介绍了VeriSoftBench,这是一个包含500个来自开源形式方法开发的Lean 4证明义务的基准,打包以保留真实的仓库上下文和跨文件依赖关系。我们对前沿的LLM和专门的证明器的评估得出三个观察结果。首先,针对Mathlib风格数学进行调整的证明器在这一基于仓库的设置中表现不佳。其次,成功与传递的仓库依赖性密切相关:那些证明依赖于大量多跳依赖闭包的任务更不可能被解决。第三,提供针对证明依赖闭包的精选上下文比暴露整个仓库更能提高性能,但仍然有很大的改进空间。我们的基准和评估套件在https://github.com/utopia-group/VeriSoftBench发布。
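The transitive dependency closure that the VeriSoftBench abstract relates to task difficulty can be sketched as a graph traversal. The dependency map and declaration names below are hypothetical and are not taken from the benchmark.

```python
from collections import deque

def dependency_closure(deps: dict[str, set[str]], target: str) -> set[str]:
    """Return every declaration reachable from `target` through one or more dependency hops."""
    seen: set[str] = set()
    queue = deque(deps.get(target, ()))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        queue.extend(deps.get(name, ()))
    return seen

# Hypothetical repository: a theorem depends on a lemma, which depends on definitions.
deps = {
    "thm_safe": {"lemma_step", "def_state"},
    "lemma_step": {"def_state", "def_trans"},
    "def_trans": {"def_event"},
}
closure = dependency_closure(deps, "thm_safe")
```

Here `def_event` is two hops away from `thm_safe`, so it appears in the closure even though the theorem never references it directly.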
CAIMAN: Causal Action Influence Detection for Sample-efficient Loco-manipulation
Authors: Yuanchen Yuan, Jin Cheng, Núria Armengol Urpí, Stelian Coros
First: 2025-02-02T16:16:53+00:00 · Latest: 2026-02-20T15:50:11+00:00
Abstract
Enabling legged robots to perform non-prehensile loco-manipulation is crucial for enhancing their versatility. Learning behaviors such as whole-body object pushing often requires sophisticated planning strategies or extensive task-specific reward shaping, especially in unstructured environments. In this work, we present CAIMAN, a practical reinforcement learning framework that encourages the agent to gain control over other entities in the environment. CAIMAN leverages causal action influence as an intrinsic motivation objective, allowing legged robots to efficiently acquire object pushing skills even under sparse task rewards. We employ a hierarchical control strategy, combining a low-level locomotion module with a high-level policy that generates task-relevant velocity commands and is trained to maximize the intrinsic reward. To estimate causal action influence, we learn the dynamics of the environment by integrating a kinematic prior with data collected during training. We empirically demonstrate CAIMAN's superior sample efficiency and adaptability to diverse scenarios in simulation, as well as its successful transfer to real-world systems without further fine-tuning. A video demo is available at https://www.youtube.com/watch?v=dNyvT04Cqaw.
中文标题/摘要
标题:CAIMAN:因果动作影响检测以提高样本效率的腿足机器人操作与搬运
使腿足机器人能够执行非抓取操作对于提高其灵活性至关重要。学习全身物体推举等行为通常需要复杂的规划策略或大量的任务特定奖励塑造,特别是在非结构化环境中。在本工作中,我们提出了CAIMAN,这是一种实用的强化学习框架,鼓励代理获得对环境其他实体的控制。CAIMAN利用因果动作影响作为内在动机目标,使腿足机器人即使在稀疏的任务奖励下也能高效地获得物体推举技能。我们采用分层控制策略,结合低级运动模块和生成任务相关速度命令的高级策略,并训练以最大化内在奖励。为了估计因果动作影响,我们通过将动力学先验与训练期间收集的数据整合来学习环境的动力学。我们在模拟中实证展示了CAIMAN的优越样本效率和对多种场景的适应性,并成功将其转移到实际系统中而无需进一步微调。视频演示可在https://www.youtube.com/watch?v=dNyvT04Cqaw获取。
Summary / 总结
CAIMAN is a reinforcement learning framework designed to enable legged robots to perform non-prehensile loco-manipulation tasks, such as whole-body object pushing, by leveraging causal action influence as an intrinsic motivation. It uses a hierarchical control strategy with a low-level locomotion module and a high-level policy trained to maximize an intrinsic reward. CAIMAN demonstrates superior sample efficiency and adaptability in simulation and successfully transfers to real-world systems without further fine-tuning.
CAIMAN 是一种强化学习框架,旨在通过利用因果动作影响作为内在动机来使腿足机器人执行非抓取式移动操作任务,如全身物体推动。该框架采用层次控制策略,结合低级运动模块和高级策略,后者被训练以最大化内在奖励。CAIMAN 在仿真中展示了出色的样本效率和适应性,并成功转移到现实世界系统中无需进一步微调。
Physics-informed graph neural networks for flow field estimation in carotid arteries
Authors: Julian Suk, Dieuwertje Alblas, Barbara A. Hutten, Albert Wiegman, Christoph Brune, Pim van Ooij, Jelmer M. Wolterink
First: 2024-08-13T13:09:28+00:00 · Latest: 2026-02-20T15:40:16+00:00
Comments: Published in "Medical Image Analysis"
Abstract
Hemodynamic quantities are valuable biomedical risk factors for cardiovascular pathology such as atherosclerosis. Non-invasive, in-vivo measurement of these quantities can only be performed using a select number of modalities that are not widely available, such as 4D flow magnetic resonance imaging (MRI). In this work, we create a surrogate model for hemodynamic flow field estimation, powered by machine learning. We train graph neural networks that include priors about the underlying symmetries and physics, limiting the amount of data required for training. This allows us to train the model using moderately-sized, in-vivo 4D flow MRI datasets, instead of large in-silico datasets obtained by computational fluid dynamics (CFD), as is the current standard. We create an efficient, equivariant neural network by combining the popular PointNet++ architecture with group-steerable layers. To incorporate the physics-informed priors, we derive an efficient discretisation scheme for the involved differential operators. We perform extensive experiments in carotid arteries and show that our model can accurately estimate low-noise hemodynamic flow fields in the carotid artery. Moreover, we show how the learned relation between geometry and hemodynamic quantities transfers to 3D vascular models obtained using a different imaging modality than the training data. This shows that physics-informed graph neural networks can be trained using 4D flow MRI data to estimate blood flow in unseen carotid artery geometries.
中文标题/摘要
标题:基于物理的图神经网络在颈动脉流场估计中的应用
血流动力学量是心血管病理如动脉粥样硬化的重要生物医学风险因素。非侵入性、体内测量这些量只能通过少数几种不广泛可用的成像模态,如4D流磁共振成像(MRI)来完成。在本研究中,我们创建了一个代理模型,利用机器学习进行血流动力学流场估计。我们训练了包含关于潜在对称性和物理先验的图神经网络,从而减少了训练所需的数据量。这使我们能够使用中等大小的体内4D流MRI数据集进行训练,而不是使用通过计算流体动力学(CFD)获得的大规模模拟数据集,这是当前的标准。我们通过结合流行的PointNet++架构和群自旋层,创建了一个高效的、同构的神经网络。为了引入物理先验,我们推导了一种高效的微分算子离散化方案。我们在颈动脉中进行了广泛的实验,并展示了我们的模型可以准确估计颈动脉中的低噪声血流动力学流场。此外,我们展示了在训练数据使用的成像模态不同的3D血管模型中,所学的几何与血流动力学量之间的关系是如何转移的。这表明,基于物理的图神经网络可以通过4D流MRI数据训练来估计未见过的颈动脉几何结构中的血流。
Summary / 总结
This study aims to develop a surrogate model for estimating hemodynamic flow fields in carotid arteries using machine learning, specifically graph neural networks that incorporate physical priors. The model is trained on moderately-sized in-vivo 4D flow MRI datasets, reducing the need for large in-silico datasets. Key findings include accurate estimation of low-noise hemodynamic flow fields and the transferability of the learned model to different 3D vascular geometries, demonstrating the potential of physics-informed graph neural networks for non-invasive hemodynamic analysis.
本研究旨在利用机器学习中的图神经网络开发一个用于估计颈动脉血流场的代理模型,该模型结合了物理先验知识。模型使用中等规模的体内 4D 流 MRI 数据集进行训练,减少了对大规模计算流体动力学模拟数据集的需求。主要发现包括低噪声血流场的准确估计以及所学模型在不同 3D 血管几何结构上的可转移性,展示了物理信息图神经网络在非侵入性血流分析中的潜力。
Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies
Authors: Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang
First: 2026-02-20T15:38:02+00:00 · Latest: 2026-02-20T15:38:02+00:00
Abstract
Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5× to 5× improvement in sample efficiency.
中文标题/摘要
标题:扩散以协调:高效的在线多智能体扩散策略
在线多智能体强化学习(MARL)是实现智能体高效协调的突出框架。增强策略表达能力对于实现卓越性能至关重要。基于扩散的生成模型在图像生成和离线设置中展示了出色的表达能力和多模态表示,但其在在线MARL中的潜力尚未得到充分探索。主要障碍在于扩散模型的不可计算似然性阻碍了基于熵的探索和协调。为解决这一挑战,我们提出了第一个在线离策略MARL框架——基于扩散策略的(OMAD)以协调智能体。我们的关键创新是最大化缩放联合熵的松弛策略目标,这有助于有效探索而不依赖于可计算的似然性。在此基础上,我们采用集中训练分散执行(CTDE)范式,并使用联合分布价值函数来优化分散的扩散策略。该方法利用可计算的熵增强目标来指导扩散策略的同时更新,从而确保协调的稳定性。在MPE和MAMuJoCo上的广泛评估表明,我们的方法在10个不同任务中成为新的最先进的方法,显示出高达2.5倍至5倍的样本效率提升。
Summary / 总结
The paper proposes OMAD, an online off-policy multi-agent reinforcement learning framework using diffusion policies to enhance policy expressiveness and facilitate effective exploration. It introduces a relaxed policy objective that maximizes scaled joint entropy and employs a joint distributional value function within the CTDE paradigm to optimize decentralized diffusion policies. Experimental results on MPE and MAMuJoCo show that OMAD outperforms existing methods, achieving a significant improvement in sample efficiency ranging from 2.5 to 5 times.
研究旨在通过增强在线多智能体强化学习(MARL)中的策略表达性来提高智能体之间的协调能力。提出的OMAD框架使用了扩散策略,并在集中训练、分散执行(CTDE)的框架下优化了一个放宽的策略目标,该目标最大化了缩放后的联合熵。这种方法可以在不依赖可计算似然的情况下实现有效的探索。实验结果表明,OMAD在MPE和MAMuJoCo上的表现优于现有方法,在10个不同的任务中实现了2.5倍至5倍的样本效率提升。
Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion
Authors: Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu
Venue: NeurIPS 2025
First: 2025-10-14T02:58:10+00:00 · Latest: 2026-02-20T15:30:47+00:00
Comments: 13 pages, conference paper. Accepted to the Thirty-ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations. The code is available at https://github.com/XiaojianDing/2025-NeurIPS-HSACC.
中文标题/摘要
标题:基于层次语义对齐和协同完成的不完备多视图聚类
不完备多视图数据,其中某些视图对某些样本完全缺失,给传统的多视图聚类方法带来了重大挑战。现有的深度不完备多视图聚类方法通常依赖于静态融合策略或两阶段管道,导致融合结果次优和错误传播问题。为了解决这些局限性,本文提出了一种基于层次语义对齐和协同完成(HSACC)的新型不完备多视图聚类框架。HSACC通过双层语义空间设计实现了稳健的跨视图融合。在低层语义空间中,通过最大化视图间的互信息来确保一致性对齐。在高层语义空间中,根据各个视图与初始融合表示之间的分布亲和性动态分配视图权重,然后加权融合生成统一的全局表示。此外,HSACC通过将对齐的潜在表示投影到高维语义空间并联合优化重构和聚类目标,隐式恢复缺失视图,从而实现完成和聚类的协同学习。实验结果表明,HSACC在五个基准数据集上显著优于现有最先进的方法。消融研究验证了层次对齐和动态加权机制的有效性,而参数分析证实了该模型对超参数变化的鲁棒性。代码可在https://github.com/XiaojianDing/2025-NeurIPS-HSACC 获取。
Summary / 总结
This paper addresses the challenge of clustering incomplete multi-view data by proposing a novel framework called Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC uses a dual-level semantic space design to ensure consistency alignment and adaptive view weighting, and it implicitly recovers missing views through joint optimization of reconstruction and clustering objectives. Experimental results show that HSACC outperforms existing methods on five benchmark datasets, and ablation studies and parameter analysis further validate its effectiveness and robustness.
本文提出了一种新的框架Hierarchical Semantic Alignment and Cooperative Completion (HSACC),以解决不完整多视图数据的聚类问题。HSACC 使用双层语义空间设计来实现稳健的跨视图融合,在低层确保一致性对齐,并在高层动态分配视图权重。此外,通过联合优化重建和聚类目标隐式恢复缺失视图。实验结果表明,HSACC 在五个基准数据集上优于现有方法,并且消融研究和参数分析进一步验证了其有效性和鲁棒性。
HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation
Authors: Lei Xin, Yuhao Zheng, Ke Cheng, Changjiang Jiang, Zifan Zhang, Fanhu Zeng
First: 2026-02-20T15:11:40+00:00 · Latest: 2026-02-20T15:11:40+00:00
Comments: Preprint
Abstract
Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challenge, we propose HyTRec, a model featuring a Hybrid Attention architecture that explicitly decouples long-term stable preferences from short-term intent spikes. By assigning massive historical sequences to a linear attention branch and reserving a specialized softmax attention branch for recent interactions, our approach restores precise retrieval capabilities within industrial-scale contexts involving ten thousand interactions. To mitigate the lag in capturing rapid interest drifts within the linear layers, we further design a Temporal-Aware Delta Network (TADN) to dynamically upweight fresh behavioral signals while effectively suppressing historical noise. Empirical results on industrial-scale datasets confirm that our model maintains linear inference speed while outperforming strong baselines, notably delivering over 8% improvement in Hit Rate for users with ultra-long sequences.
中文标题/摘要
标题:HyTRec:一种用于长行为序列推荐的混合时态感知注意力架构
用户长序列行为建模已成为生成推荐中的关键前沿领域。然而,现有解决方案面临困境:线性注意力机制由于状态容量有限,以牺牲检索精度为代价换取效率,而softmax注意力则受制于高昂的计算开销。为解决这一挑战,我们提出HyTRec模型,该模型采用混合注意力架构,明确地将长期稳定偏好与短期意图波动分离。通过将大量历史序列分配给线性注意力分支,并为近期交互保留专门的softmax注意力分支,我们的方法在涉及一万次交互的工业规模场景中恢复了精确的检索能力。为了缓解线性层中捕捉快速兴趣漂移的滞后,我们进一步设计了时态感知差分网络(TADN),以动态提升新鲜行为信号的权重并有效抑制历史噪声。在工业规模数据集上的实验结果证实,我们的模型保持了线性推理速度,并优于强大的基线模型,尤其是对超长序列用户,命中率(Hit Rate)提升超过8%。
Summary / 总结
HyTRec is a hybrid temporal-aware attention model designed to handle long sequences of user behaviors in recommendation systems. It uses a hybrid attention architecture to separate long-term stable preferences from short-term intent spikes, combining linear and softmax attention mechanisms. The Temporal-Aware Delta Network dynamically emphasizes recent interactions while suppressing historical noise. Experiments on large-scale datasets show that HyTRec maintains fast inference speed and outperforms strong baselines, especially for users with very long sequences, achieving over 8% improvement in Hit Rate.
HyTRec 是一种混合时序感知注意力模型,旨在处理推荐系统中的长用户行为序列。它通过线性注意力分支处理历史序列,通过软最大化注意力分支处理近期互动,以平衡精度和效率。时序感知差分网络动态强调近期行为,提高模型的响应性。实验表明,HyTRec 在处理长序列用户时优于强基线模型,尤其在 Hit Rate 上有超过 8% 的显著提升。
DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control
Authors: Shiyan Du, Conghan Yue, Xinyu Cheng, Dongyu Zhang
Venue: AAAI 2026
First: 2026-02-20T15:11:04+00:00 · Latest: 2026-02-20T15:11:04+00:00
Comments: Accepted by AAAI 2026
Abstract
Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
中文标题/摘要
标题:DEIG:细粒度语义控制的详细增强实例生成
多实例生成在空间布局和属性绑定方面取得了显著进展。然而,现有方法在细粒度语义理解方面仍然面临挑战,尤其是在处理复杂文本描述时。为克服这些限制,我们提出了一种名为DEIG的新框架,用于细粒度和可控的多实例生成。DEIG结合了实例详细信息提取器(IDE),将文本编码嵌入转换为紧凑的、实例感知的表示,并结合了细节融合模块(DFM),通过实例基础的掩码注意力防止实例间属性泄露。这些组件使DEIG能够生成视觉上连贯的多实例场景,精确匹配丰富的局部文本描述。为了支持细粒度监督,我们构建了一个高质量的数据集,其中包含由VLM生成的详细、组合式的实例描述。我们还引入了DEIG-Bench,这是一个新的基准,具有区域级注释和针对人类和对象的多属性提示。实验表明,DEIG在多个基准测试中的空间一致性、语义准确性和组合泛化方面始终优于现有方法。此外,DEIG作为一个即插即用模块,使其易于集成到标准的扩散管道中。
Summary / 总结
DEIG is a novel framework for fine-grained and controllable multi-instance generation, addressing limitations in fine-grained semantic understanding. It uses an Instance Detail Extractor to create compact, instance-aware representations and a Detail Fusion Module to prevent attribute leakage. DEIG generates visually coherent scenes matching detailed textual descriptions and outperforms existing methods in spatial consistency, semantic accuracy, and compositional generalization. It also includes a high-quality dataset and DEIG-Bench for fine-grained supervision and evaluation.
DEIG 是一种新颖的框架,用于实现细粒度和可控的多实例生成,解决细粒度语义理解的局限性。它使用实例细节提取器创建紧凑的实例感知表示,并使用细节融合模块防止属性泄漏。DEIG 生成视觉上连贯的场景,与详细的文本描述相匹配,并在空间一致性、语义准确性和组合泛化方面优于现有方法。此外,它还引入了 DEIG-Bench,这是一个新的基准,带有区域级注释和针对人类和对象的多属性提示,用于评估。
Study of Training Dynamics for Memory-Constrained Fine-Tuning
Authors: Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
First: 2025-10-22T15:21:05+00:00 · Latest: 2026-02-20T15:06:57+00:00
Abstract
Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
中文标题/摘要
标题:记忆受限精细调优训练动力学研究
随着模型变得越来越大,而部署环境对资源的限制越来越严格,深度神经网络的高效训练变得越来越重要。我们提出了TraDy,这是一种新颖的迁移学习方案,利用了两个关键见解:层的重要性对于更新是架构依赖的,并且可以先验确定;而动态随机通道选择在梯度近似方面优于静态方法。我们引入了一种动态通道选择方法,在预选层之间以概率重新采样通道。广泛的实验表明,TraDy在各种下游任务和架构上实现了最先进的性能,同时严格保持了内存限制,实现了高达99%的激活稀疏性、95%的权重导数稀疏性以及97%的权重导数计算FLOPs减少。
Summary / 总结
The study addresses the challenge of training large deep neural networks with limited memory resources. It introduces TraDy, a transfer learning method that dynamically selects channels for gradient approximation, which is more efficient than static methods. Experiments show that TraDy outperforms existing methods across different tasks and architectures while significantly reducing memory usage and computational cost.
研究旨在解决在有限内存资源下训练大型深度神经网络的挑战。提出了一种名为TraDy的转移学习方法,该方法基于先验确定的层重要性动态选择用于梯度近似和更新的通道。实验表明,TraDy在不同任务和架构上优于现有方法,显著减少了内存使用,实现了高激活和权重导数稀疏性,并大幅减少了权重导数计算的FLOPs。
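The dynamic stochastic channel selection described for TraDy (resampling channels between epochs within preselected layers) can be sketched as follows. Uniform sampling, the per-layer budgets, and the layer names are assumptions made for illustration; the abstract does not specify the sampling distribution.

```python
import random

def resample_channels(layer_channels: dict[str, int], budget: dict[str, int],
                      rng: random.Random) -> dict[str, list[int]]:
    """Draw a fresh random subset of channel indices for each preselected layer."""
    return {
        layer: sorted(rng.sample(range(n), budget[layer]))
        for layer, n in layer_channels.items()
    }

# Hypothetical preselected layers with their channel counts and update budgets.
layer_channels = {"block3.conv": 64, "block4.conv": 128}
budget = {"block3.conv": 4, "block4.conv": 8}
rng = random.Random(0)
# One selection per epoch; only the chosen channels would receive weight updates.
masks_per_epoch = [resample_channels(layer_channels, budget, rng) for _ in range(3)]
```

The set of layers stays fixed across epochs while the channel subset changes, which mirrors the split between a priori layer selection and stochastic channel selection in the abstract.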
PRISM: Parallel Reward Integration with Symmetry for MORL
Authors: Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He
First: 2026-02-20T15:02:42+00:00 · Latest: 2026-02-20T15:02:42+00:00
Abstract
This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at https://github.com/EVIEHub/PRISM.
中文标题/摘要
标题:PRISM:并行奖励整合与对称性在多目标强化学习中的应用
这项工作研究了异构多目标强化学习(MORL),其中目标在时间频率上可能存在显著差异。这种异构性使得密集目标能够主导学习,而稀疏的长期奖励则难以获得足够的信用分配,导致样本效率低下。我们提出了一种并行奖励整合与对称性(PRISM)算法,该算法将反射对称性作为归纳偏置,用于对齐奖励通道。PRISM引入了ReSymNet,这是一种理论驱动的模型,用于解决目标之间的时间频率不匹配问题,通过残差块学习一个缩放的机会价值,加速探索同时保持最优策略。我们还提出了SymReg,这是一种反射等变正则化器,强制代理镜像并将策略搜索限制在一个反射等变子空间内。可以证明,这种限制能够降低假设复杂度并提高泛化能力。在MuJoCo基准测试中,PRISM始终优于稀疏奖励基线和使用完整密集奖励训练的oracle,改善了帕累托覆盖和分布平衡:其超体积相对基线的增益超过100%,相对oracle的增益最高可达32%。代码位于https://github.com/EVIEHub/PRISM。
Summary / 总结
This work addresses the challenge of learning from heterogeneous objectives in Multi-Objective Reinforcement Learning (MORL) by proposing PRISM, which uses reflectional symmetry as an inductive bias and introduces ReSymNet to align reward channels and SymReg to enforce reflectional equivariance. PRISM improves sample efficiency and generalization, achieving significant hypervolume gains over both sparse-reward baselines and full dense reward oracles in MuJoCo benchmarks.
该研究针对多目标强化学习(MORL)中异质目标的学习挑战,提出了PRISM算法,利用反射对称性作为归纳偏置来对齐奖励通道。PRISM引入了ReSymNet来解决时间频率不匹配问题,并提出了SymReg来强制执行反射对称性,从而减少假设复杂度并提高泛化能力。实验结果表明,PRISM在MuJoCo基准测试中优于稀疏奖励基线和使用完整密集奖励训练的oracle,显著提高了帕累托覆盖范围和分布平衡。
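A reflectional equivariance regulariser of the kind PRISM calls SymReg can be sketched as a penalty on the gap between acting on a mirrored state and mirroring the action. The sign-flip form of the reflection, the toy policies, and all function names are assumptions for illustration, not the paper's formulation.

```python
def reflect(vec, signs):
    """Apply a reflection given as per-coordinate signs (+1 keeps, -1 flips)."""
    return [s * x for s, x in zip(signs, vec)]

def equivariance_penalty(policy, states, state_signs, action_signs):
    """Mean squared gap between policy(reflect(s)) and reflect(policy(s))."""
    total = 0.0
    for s in states:
        mirrored_then_act = policy(reflect(s, state_signs))
        act_then_mirrored = reflect(policy(s), action_signs)
        total += sum((a - b) ** 2 for a, b in zip(mirrored_then_act, act_then_mirrored))
    return total / len(states)

def equivariant_policy(s):
    return [2.0 * s[0]]            # odd in s[0], so it mirrors correctly

def biased_policy(s):
    return [2.0 * s[0] + 0.5]      # constant offset breaks the symmetry

# 2-D state, 1-D action; the reflection flips state[0] and the action.
states = [[1.0, 0.3], [-0.5, 2.0]]
pen_ok = equivariance_penalty(equivariant_policy, states, [-1, 1], [-1])
pen_bad = equivariance_penalty(biased_policy, states, [-1, 1], [-1])
```

Adding such a penalty to the training loss pushes the policy toward the reflection-equivariant subspace: it vanishes for `equivariant_policy` and is positive for `biased_policy`.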
Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2
Authors: Naveenkumar G Venkataswamy, Yu Liu, Soumyabrata Dey, Stephanie Schuckers, Masudul H Imtiaz
First: 2025-10-07T17:33:41+00:00 · Latest: 2026-02-20T15:00:19+00:00
Comments: The new version is available at arXiv:2512.15548
Abstract
Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.
中文标题/摘要
标题:基于智能手机的高质量可见光虹膜识别.V2
基于智能手机的可见光(VIS)虹膜识别由于光照变化、色素差异以及缺乏标准化捕获控制而仍然困难。本研究提出了一种紧凑的端到端管道,确保在捕获过程中符合ISO/IEC 29794-6质量标准,并证明在普通设备上实现准确的VIS虹膜识别是可行的。通过一个定制的Android应用程序进行实时构图、清晰度评估和反馈,我们引入了包含47名受试者752张合规图像的CUVIRIS数据集。开发了一种轻量级的MobileNetV3多任务分割网络(LightIrisNet)用于设备端高效处理,并将变压器匹配器(IrisFormer)适应到VIS领域。在标准化协议下,与先前的CNN基线进行比较基准测试,OSIRIS在FAR=0.01时达到TAR为97.9%(EER=0.76%),而仅在UBIRIS.v2上训练的IrisFormer在CUVIRIS上的EER为0.057%。该捕获应用程序、训练模型以及数据集的公共子集已发布以支持可重复性。这些结果表明,标准化捕获和VIS适应的轻量级模型能够使智能手机上的虹膜识别准确且实用。
Summary / 总结
This work addresses the challenges of smartphone-based iris recognition in the visible spectrum by developing a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance. The authors introduce the CUVIRIS dataset and a lightweight MobileNetV3-based segmentation network (LightIrisNet) for efficient on-device processing, along with an adapted transformer matcher (IrisFormer). Experimental results show that OSIRIS achieves a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS, confirming the feasibility of accurate iris recognition on smartphones under standardized conditions.
该研究通过开发一个符合ISO/IEC 29794-6质量标准的紧凑型管道,解决了可见光谱下智能手机虹膜识别的挑战。研究人员创建了一个实时图像捕获和质量评估的自定义Android应用程序,并引入了CUVIRIS数据集。他们还开发了一个轻量级的MobileNetV3基分割网络(LightIrisNet),并为可见光谱领域调整了一个变压器匹配器(IrisFormer)。在标准化条件下,OSIRIS在FAR=0.01时实现了97.9%的TAR(等效错误率EER=0.76%),而仅在UBIRIS.v2上训练的IrisFormer在CUVIRIS上的EER为0.057%。
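The metrics quoted for the iris-recognition entry (TAR at a fixed FAR, and EER) can be computed from genuine and impostor match scores as in the sketch below. The scores are synthetic and the threshold sweep over observed scores is a simplification; this is not the paper's evaluation protocol.

```python
def far_tar(genuine, impostor, threshold):
    """FAR: share of impostor scores accepted; TAR: share of genuine scores accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    tar = sum(s >= threshold for s in genuine) / len(genuine)
    return far, tar

def equal_error_rate(genuine, impostor):
    """Approximate the EER as the point where FAR and FRR (= 1 - TAR) are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, tar = far_tar(genuine, impostor, t)
        gap = abs(far - (1.0 - tar))
        if best is None or gap < best[0]:
            best = (gap, (far + 1.0 - tar) / 2.0)
    return best[1]

# Synthetic similarity scores (higher means more likely the same iris).
genuine = [0.91, 0.85, 0.78, 0.66, 0.40]
impostor = [0.10, 0.22, 0.35, 0.45, 0.30]
eer = equal_error_rate(genuine, impostor)
far, tar = far_tar(genuine, impostor, threshold=0.5)
```

With real data the sweep would use many more comparisons, and the TAR would be read off at the threshold whose FAR matches the target (0.01 in the abstract).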
RoEL: Robust Event-based 3D Line Reconstruction
Authors: Gwangtak Bae, Jaeho Shin, Seunggu Kang, Junho Kim, Ayoung Kim, Young Min Kim
First: 2026-02-20T14:43:46+00:00 · Latest: 2026-02-20T14:43:46+00:00
Comments: IEEE Transactions on Robotics (T-RO)
Abstract
Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. While lines can constitute a robust intermediate representation that is consistently observed, the sparse nature of lines may lead to drastic deterioration with minor estimation errors. Only a few previous works, often accompanied by additional sensors, utilize lines to compensate for the severe domain discrepancies of event sensors along with unpredictable noise characteristics. We propose a method that can stably extract tracks of varying appearances of lines using a clever algorithmic process that observes multiple representations from various time slices of events, compensating for potential adversaries within the event data. We then propose geometric cost functions that can refine the 3D line maps and camera poses, eliminating projective distortions and depth ambiguities. The 3D line maps are highly compact and can be equipped with our proposed cost function, which can be adapted for any observations that can detect and extract line structures or projections of them, including 3D point cloud maps or image observations. We demonstrate that our formulation is powerful enough to exhibit a significant performance boost in event-based mapping and pose refinement across diverse datasets, and can be flexibly applied to multimodal scenarios. Our results confirm that the proposed line-based formulation is a robust and effective approach for the practical deployment of event-based perceptual modules. Project page: https://gwangtak.github.io/roel/
中文标题/摘要
标题:RoEL:稳健的基于事件的3D直线重建
运动中的事件相机倾向于检测物体边界或纹理边缘,产生亮度变化的线条,尤其是在人造环境中。虽然线条可以构成一种稳健的中间表示,这种表示在观察中是一致的,但线条的稀疏性可能导致轻微估计误差下的大幅退化。只有少数先前的工作,通常伴随额外的传感器,利用线条来补偿事件传感器严重的领域差异以及不可预测的噪声特性。我们提出了一种方法,可以稳定地提取线条的多种外观轨迹,使用巧妙的算法过程,从事件的多个时间切片的不同表示中观察,补偿事件数据中的潜在对手。然后我们提出了几何代价函数,可以细化3D直线图和相机姿态,消除投影失真和深度歧义。3D直线图非常紧凑,可以装备我们提出的方法,该方法可以适应任何可以检测和提取线条结构或它们的投影的观察,包括3D点云图或图像观察。我们证明,我们的公式在事件基地图构建和姿态细化方面具有显著的性能提升,适用于多种数据集,并且可以灵活应用于多模态场景。我们的结果证实,所提出的基于线条的公式是事件基感知模块实际部署的稳健而有效的方法。
Summary / 总结
The research aims to develop a robust method for 3D line reconstruction using event cameras, which are sensitive to object boundaries and texture edges. The method involves extracting line tracks from multiple time slices and refining 3D line maps and camera poses using geometric cost functions. The approach demonstrates significant performance improvements in event-based mapping and pose refinement across various datasets and can be applied to multimodal scenarios.
研究旨在利用事件摄像头(对物体边界和纹理边缘敏感)开发一种稳健的3D线重建方法。该方法包括从多个时间切片中提取线迹,并使用几何代价函数细化3D线图和相机姿态。该方法在多种数据集上显著提高了基于事件的映射和姿态校准性能,并且可以灵活应用于多模态场景。
SAMa: Material-aware 3D Selection and Segmentation
Authors: Michael Fischer, Iliyan Georgiev, Thibault Groueix, Vladimir G. Kim, Tobias Ritschel, Valentin Deschaintre
First: 2024-11-28T18:59:02+00:00 · Latest: 2026-02-20T14:37:26+00:00
Comments: Project Page: https://mfischer-ucl.github.io/sama
Abstract
Decomposing 3D assets into material parts is a common task for artists, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for in-the-wild objects in arbitrary 3D representations. Building on SAM2's video prior, we construct a material-centric video dataset that extends it to the material domain. We propose an efficient way to lift the model's 2D predictions to 3D by projecting each view into an intermediary 3D point cloud using depth. Nearest-neighbor lookups between any 3D representation and this similarity point cloud allow us to efficiently reconstruct accurate selection masks over objects' surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for costly per-asset optimization, and performs optimization-free selection in seconds. SAMa outperforms several strong baselines in selection accuracy and multiview consistency and enables various compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output with PBR materials or selecting and editing materials on NeRFs and 3DGS captures.
中文标题/摘要
标题:SAMa:材料感知的3D选择与分割
将3D资产分解为材料部分是艺术家的一项常见任务,但仍然是一个高度手动的过程。在本文中,我们介绍了“选择任意材料”(SAMa),这是一种针对任意3D表示中的真实场景(in-the-wild)对象的材料选择方法。基于SAM2的视频先验,我们构建了一个以材料为中心的视频数据集,将其扩展到材料领域。我们提出了一种高效的方法,利用深度将每个视图投影到中间3D点云,从而将模型的2D预测提升到3D。任何3D表示与这个相似性点云之间的最近邻查找允许我们高效地重建对象表面的准确选择掩码,可以从任何视图进行检查。我们的方法设计上是多视图一致的,消除了昂贵的逐个资产优化的需要,并能在几秒钟内进行无优化的选择。SAMa在选择准确性和多视图一致性方面优于几个强大的基线,并能实现各种令人兴奋的应用,例如用PBR材料替换文本到3D输出中的漫反射纹理材料,或在NeRF和3DGS捕获上选择和编辑材料。
Summary / 总结
The research aims to automate material selection on 3D assets, which is typically a manual task for artists. SAMa builds on SAM2's video prior, extends it to the material domain with a material-centric video dataset, and lifts the model's 2D predictions to 3D by projecting each view into an intermediate point cloud using depth. Nearest-neighbor lookups between any 3D representation and this point cloud yield selection masks over the object's surface. The method is multiview-consistent by design and performs optimization-free selection in seconds, outperforming several baselines in selection accuracy and multiview consistency and enabling applications such as material replacement and editing on NeRFs and 3DGS captures.
研究旨在自动化将3D资产分解为材料部分的过程,这通常是艺术家的一项手动任务。SAMa是一种材料选择方法,利用材料为中心的视频数据集,并通过深度将2D预测投影到3D。该方法能够高效地重建物体的准确选择掩码,确保多视角一致性,并实现快速、无需优化的选择。SAMa在选择准确性和多视角一致性方面优于其他基线,支持如材料替换和编辑等各类3D表示的应用。
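The lifting step described in the SAMa abstract can be illustrated with a minimal sketch. It assumes a single view, a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and a brute-force nearest-neighbor search; the function names `backproject` and `nearest_similarity` are illustrative and are not taken from the paper, which would also need camera poses to merge several views into one point cloud.

```python
import math

def backproject(depth, sim, fx, fy, cx, cy):
    """Lift each pixel (u, v) with depth z to a camera-space 3D point
    that carries the pixel's 2D similarity score."""
    cloud = []
    for v, (depth_row, sim_row) in enumerate(zip(depth, sim)):
        for u, (z, s) in enumerate(zip(depth_row, sim_row)):
            if z <= 0:  # skip pixels without valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            cloud.append(((x, y, z), s))
    return cloud

def nearest_similarity(query, cloud):
    """Return the similarity stored at the cloud point closest to `query`
    (brute force; a spatial index would be used for real assets)."""
    _, s = min(cloud, key=lambda item: math.dist(query, item[0]))
    return s

# Toy 2x2 view: one pixel has no valid depth and is dropped.
depth = [[1.0, 1.0], [1.0, 0.0]]
sim = [[0.9, 0.1], [0.2, 0.0]]
cloud = backproject(depth, sim, fx=1.0, fy=1.0, cx=0.0, cy=0.0)

# Query a surface point of some other 3D representation.
print(nearest_similarity((0.9, 0.1, 1.0), cloud))
```

Because the lookup only needs 3D positions, the same similarity point cloud can in principle be queried from mesh vertices, NeRF samples, or Gaussian centers, which is what makes the selection independent of the underlying representation.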
Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Noise-Stressed Synthetic Conditions
Authors: Riyaadh Gani
First: 2025-09-12T12:18:00+00:00 · Latest: 2026-02-20T14:36:11+00:00
Abstract
Non-invasive glucose monitoring outside controlled settings is dominated by low signal-to-noise ratio (SNR): hardware drift, environmental variation, and physiology suppress the glucose signature in NIR signals. We present a noise-stressed NIR simulator that injects 12-bit ADC quantisation, LED drift, photodiode dark noise, temperature/humidity variation, contact-pressure noise, Fitzpatrick I-VI melanin, and glucose variability to create a low-correlation regime (rho_glucose-NIR = 0.21). Using this platform, we benchmark six methods: Enhanced Beer-Lambert (physics-engineered ridge regression), Original PINN, Optimised PINN, RTE-inspired PINN, Selective RTE PINN, and a shallow DNN (SDNN). The physics-engineered Beer-Lambert model achieves the lowest error (13.6 mg/dL RMSE) with only 56 parameters and 0.01 ms inference, outperforming deeper PINNs and the SDNN baseline under low-SNR conditions. The study reframes the task as noise suppression under weak signal and shows that carefully engineered physics features can outperform higher-capacity models in this regime.
中文标题/摘要
标题:物理知情神经网络与物理模型在非侵入性血糖监测中的比较研究:在噪声压力下的合成条件
在受控环境之外进行非侵入性血糖监测主要受到低信噪比(SNR)的影响:硬件漂移、环境变化和生理因素抑制了NIR信号中的血糖特征。我们提出了一种噪声压力下的NIR模拟器,注入了12位ADC量化、LED漂移、光电二极管暗噪声、温度/湿度变化、接触压力噪声、Fitzpatrick I-VI型肤色的黑色素和血糖变异性,以创建一个低相关性环境(rho_glucose-NIR = 0.21)。使用该平台,我们基准测试了六种方法:增强的比尔-朗伯(Beer-Lambert)模型(物理工程岭回归)、原始PINN、优化的PINN、基于RTE的PINN、选择性RTE PINN以及一个浅层DNN(SDNN)。物理工程的比尔-朗伯模型在仅56个参数和0.01毫秒推理时间的情况下实现了最低误差(13.6 mg/dL RMSE),在低SNR条件下优于更深的PINN和SDNN基线。该研究将任务重新定义为在弱信号下的噪声抑制,并表明精心设计的物理特征可以在这种情况下超越更高容量的模型。
Summary / 总结
The study evaluates physics-informed neural networks (PINNs) against physics models for non-invasive glucose monitoring under low signal-to-noise ratio conditions, using a noise-stressed synthetic NIR simulator. Six methods were benchmarked, including a physics-engineered Beer-Lambert model, four PINN variants, and a shallow DNN. The physics-engineered Beer-Lambert model achieved the lowest error (13.6 mg/dL RMSE) with only 56 parameters and 0.01 ms inference, outperforming the deeper PINNs and the shallow DNN baseline under low-SNR conditions.
研究旨在评估物理知情神经网络(PINNs)和物理模型在低信噪比条件下的非侵入式血糖监测性能,实验基于噪声压力下的合成NIR模拟器。六种方法进行了基准测试,包括物理工程化的Beer-Lambert模型、四种PINN变体和一个浅层DNN。物理工程化的Beer-Lambert模型仅用56个参数和0.01毫秒的推理时间就取得了最低的均方根误差(13.6 mg/dL),在低信噪比条件下优于更深的PINNs和浅层DNN基线。
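To make the phrase "physics-engineered ridge regression" concrete, the sketch below shows one generic way such a model could be assembled: a Beer-Lambert absorbance feature, A = -log10(I / I0), followed by a closed-form ridge fit. The single-feature setup, the function names, and the toy numbers are assumptions for illustration only; the abstract does not describe the paper's 56-parameter feature set.

```python
import math

def absorbance(intensity, reference):
    """Beer-Lambert absorbance A = -log10(I / I0)."""
    return -math.log10(intensity / reference)

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x + b, penalizing w but not b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + lam)  # the penalty shrinks the slope toward zero
    b = mean_y - w * mean_x
    return w, b

# Hypothetical detector readings (reference intensity 100) and glucose labels.
readings = [80.0, 70.0, 60.0, 50.0]
glucose = [90.0, 110.0, 135.0, 160.0]
features = [absorbance(i, 100.0) for i in readings]

w, b = fit_ridge_1d(features, glucose, lam=0.01)
print(w * absorbance(65.0, 100.0) + b)  # prediction for a new reading
```

The regularization strength `lam` trades bias for variance, which is the relevant knob in a low-correlation regime where the glucose signature is weak relative to noise.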
Variational Distributional Neuron
Authors: Yves Ruffenach
First: 2026-02-20T14:35:53+00:00 · Latest: 2026-02-20T14:35:53+00:00
Comments: 29 pages, 7 figures. Code available at GitHub (link in paper)
Abstract
We propose a proof of concept for a variational distributional neuron: a compute unit formulated as a VAE brick, explicitly carrying a prior, an amortized posterior and a local ELBO. The unit is no longer a deterministic scalar but a distribution: computing is no longer about propagating values, but about contracting a continuous space of possibilities under constraints. Each neuron parameterizes a posterior, propagates a reparameterized sample and is regularized by the KL term of a local ELBO - hence, the activation is distributional. This "contraction" becomes testable through local constraints and can be monitored via internal measures. The amount of contextual information carried by the unit, as well as the temporal persistence of this information, are locally tuned by distinct constraints. This proposal addresses a structural tension: in sequential generation, causality is predominantly organized in the symbolic space and, even when latents exist, they often remain auxiliary, while the effective dynamics are carried by a largely deterministic decoder. In parallel, probabilistic latent models capture factors of variation and uncertainty, but that uncertainty typically remains borne by global or parametric mechanisms, while units continue to propagate scalars - hence the pivot question: if uncertainty is intrinsic to computation, why does the compute unit not carry it explicitly? We therefore draw two axes: (i) the composition of probabilistic constraints, which must be made stable, interpretable and controllable; and (ii) granularity: if inference is a negotiation of distributions under constraints, should the primitive unit remain deterministic or become distributional? We analyze "collapse" modes and the conditions for a "living neuron", then extend the contribution over time via autoregressive priors over the latent, per unit.
中文标题/摘要
标题:变分分布神经元
我们提出了一种变分分布神经元的概念证明:一种作为VAE砖块的计算单元,明确携带先验、近似后验和局部ELBO。该单元不再是一个确定性的标量,而是一个分布:计算不再关于传递值,而是关于在约束下收缩可能性的连续空间。每个神经元参数化一个后验,传递一个重参数化的样本,并通过局部ELBO的KL项进行正则化——因此,激活是分布性的。这种“收缩”可以通过局部约束进行测试,并可以通过内部指标进行监控。单元携带的上下文信息量以及这种信息的时序持久性,通过不同的约束进行局部调整。该提案解决了结构上的紧张关系:在序列生成中,因果性主要组织在符号空间中,即使存在潜在变量,它们通常仍然是辅助性的,而有效的动态是由一个主要确定性的解码器携带的。同时,概率潜在模型捕捉了变化和不确定性因素,但这种不确定性通常仍然由全局或参数机制承担,而单元继续传递标量——因此,关键问题在于:如果不确定性是计算的内在属性,为什么计算单元不明确地携带它?因此,我们绘制了两个轴:(i) 概率约束的组成,必须使其稳定、可解释和可控;(ii) 粒度:如果推理是在约束下对分布的谈判,那么原始单元应该保持确定性还是变得分布性?我们分析了“坍塌”模式以及“活神经元”的条件,然后通过单位上的潜在的自回归先验在时间上扩展贡献。
Summary / 总结
The paper introduces a variational distributional neuron as a proof of concept: a compute unit that explicitly carries a prior, an amortized posterior, and a local ELBO. Each neuron parameterizes a posterior, propagates a reparameterized sample, and is regularized by the KL term of its local ELBO, so the activation is a distribution rather than a deterministic scalar. The amount of contextual information a unit carries and the temporal persistence of that information are tuned locally by distinct constraints, and the resulting contraction of the space of possibilities can be monitored via internal measures. The proposal addresses a structural tension in sequential generation, where causality is organized mainly in the symbolic space and latents remain auxiliary, while in probabilistic latent models the uncertainty is borne by global or parametric mechanisms and units still propagate scalars.
论文提出了一种变分分布神经元作为概念验证:这是一种计算单元,明确携带先验、摊销后验和局部ELBO。每个神经元参数化一个后验,传播重参数化样本,并由局部ELBO的KL项进行正则化,因此其激活是分布式的,而不是确定性的标量。单元所携带的上下文信息量及其时间持久性由不同的约束在局部调节,而这种对可能性空间的“收缩”可以通过内部指标进行监控。该提案针对序列生成中的一种结构性张力:因果关系主要组织在符号空间中,潜在变量往往只起辅助作用;而在概率潜在模型中,不确定性通常由全局或参数化机制承担,单元本身仍然只传播标量。
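The per-neuron computation named in the abstract (parameterize a posterior, propagate a reparameterized sample, regularize with a KL term) can be sketched as follows. This is a minimal illustration that assumes a one-dimensional Gaussian posterior and a standard normal prior; the function name `neuron_forward`, the linear parameterization, and the omission of the likelihood part of the local ELBO are choices made here, not details from the paper.

```python
import math
import random

def neuron_forward(x, w_mu, w_logvar, rng):
    """One distributional neuron: returns a sample z and the KL penalty.

    Posterior q(z | x) = N(mu, sigma^2), with mu and log sigma^2 linear in x.
    Prior p(z) = N(0, 1) (an assumption of this sketch).
    """
    mu = sum(w * xi for w, xi in zip(w_mu, x))
    logvar = sum(w * xi for w, xi in zip(w_logvar, x))
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
    eps = rng.gauss(0.0, 1.0)
    z = mu + math.exp(0.5 * logvar) * eps
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ).
    kl = 0.5 * (mu * mu + math.exp(logvar) - 1.0 - logvar)
    return z, kl

rng = random.Random(0)
z, kl = neuron_forward([1.0, -0.5], [0.8, 0.2], [0.1, 0.3], rng)
print(z, kl)
```

When the KL term drives the posterior to match the prior, the unit carries no information about its input, which is one plausible reading of the "collapse" modes the abstract mentions.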
Learning to Orchestrate Agents in Natural Language with the Conductor
Authors: Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang
First: 2025-12-04T02:23:13+00:00 · Latest: 2026-02-20T14:32:50+00:00
Abstract
Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.
中文标题/摘要
标题:使用指挥家在自然语言中编排代理
来自不同提供商的强大大型语言模型(LLMs)经过昂贵的训练和微调,以适应不同的领域。在此项工作中,我们引入了一种新的指挥家模型,通过强化学习训练,自动发现LLMs之间的强大协调策略。我们的指挥家不仅学习设计有效的代理间通信拓扑结构,还学习通过提示工程为LLMs编写有针对性的指令,以最大化利用它们的个体能力。我们展示了通过学习在强大工人LLMs池中的最优协调策略,一个7B的指挥家取得了超越任何单一工人的显著性能提升,并在LiveCodeBench和GPQA等具有挑战性的推理基准测试中达到了最先进的结果。通过使用随机化的代理池进行训练,我们的指挥家能够适应任意的开源和闭源代理集合,满足任何用户需求。此外,允许指挥家选择自己作为工人,产生了递归拓扑结构,通过在线迭代适应实现了新的动态测试时扩展,从而提升性能。更广泛地说,我们的工作是早期证明语言模型协调可以通过强化学习解锁的示例之一,在这种情况下,强大的协调策略自然地在LLMs中通过纯粹的端到端奖励最大化而出现。
Summary / 总结
This work introduces a Conductor model trained with reinforcement learning to automatically discover effective coordination strategies among large language models (LLMs). The Conductor not only designs communication topologies for agent-to-agent collaboration but also writes focused instructions for the worker LLMs to make the most of their individual capabilities. Experimental results show that a 7B Conductor outperforms every individual worker LLM and achieves state-of-the-art results on challenging reasoning benchmarks such as LiveCodeBench and GPQA. Training with randomized agent pools lets the model adapt to arbitrary sets of open- and closed-source agents, and allowing the Conductor to select itself as a worker produces recursive topologies that raise performance through a form of dynamic test-time scaling.
这项工作引入了一个通过强化学习训练的Conductor模型,用于协调大型语言模型(LLMs)以实现有效的协作。Conductor设计通信拓扑并为LLMs提供指令,以充分利用它们的个体能力。实验表明,一个7B的Conductor在LiveCodeBench和GPQA等推理基准测试中显著优于单一的LLM,达到了最先进的结果。该模型能够适应各种不同的代理集合,并可以通过在线迭代适应来递归提升性能。
J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor
Authors: Benoit Tain, Raphael Millet, Romain Lemaire, Michal Szczepanski, Laurent Alacoque, Emmanuel Pluchart, Sylvain Choisnet, Rohit Prasad, Jerome Chossat, Pascal Pierunek, Pascal Vivet, Sebastien Thuries
First: 2025-06-18T09:46:02+00:00 · Latest: 2026-02-20T14:32:03+00:00
Abstract
This paper presents J3DAI, a tiny deep neural network-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor featuring an artificial intelligence (AI) chip integrating a Deep Neural Network (DNN)-based accelerator. The DNN accelerator is designed to efficiently perform neural network tasks such as image classification and segmentation. This paper focuses on the digital system of J3DAI, highlighting its Performance-Power-Area (PPA) characteristics and showcasing advanced edge AI capabilities on a CMOS image sensor. To support hardware, we utilized the Aidge comprehensive software framework, which enables the programming of both the host processor and the DNN accelerator. Aidge supports post-training quantization, significantly reducing memory footprint and computational complexity, making it crucial for deploying models on resource-constrained hardware like J3DAI. Our experimental results demonstrate the versatility and efficiency of this innovative design in the field of edge AI, showcasing its potential to handle both simple and computationally intensive tasks. Future work will focus on further optimizing the architecture and exploring new applications to fully leverage the capabilities of J3DAI. As edge AI continues to grow in importance, innovations like J3DAI will play a crucial role in enabling real-time, low-latency, and energy-efficient AI processing at the edge.
中文标题/摘要
标题:J3DAI:一种基于DNN的边缘AI加速器,用于3D堆叠CMOS图像传感器
本文介绍了J3DAI,一种基于3层3D堆叠CMOS图像传感器的微小深度神经网络硬件加速器,集成了一个基于深度神经网络(DNN)的加速器的人工智能(AI)芯片。DNN加速器旨在高效执行神经网络任务,如图像分类和分割。本文重点介绍了J3DAI的性能-功耗-面积(PPA)特性,并展示了其在CMOS图像传感器上的先进边缘AI能力。为了支持硬件,我们使用了Aidge综合软件框架,该框架能够编程主机处理器和DNN加速器。Aidge支持后训练量化,显著减少了内存占用和计算复杂性,对于在资源受限的硬件如J3DAI上部署模型至关重要。我们的实验结果展示了这种创新设计在边缘AI领域的灵活性和效率,展示了其处理简单和计算密集型任务的潜力。未来的工作将集中在进一步优化架构并探索新应用,以充分利用J3DAI的能力。随着边缘AI的重要性不断增加,像J3DAI这样的创新将在实现实时、低延迟和节能的AI处理方面发挥关键作用。
Summary / 总结
J3DAI is a tiny DNN-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor, designed to perform tasks such as image classification and segmentation efficiently. It is programmed through the Aidge software framework, which targets both the host processor and the DNN accelerator and supports post-training quantization to reduce memory footprint and computational complexity. The paper focuses on the digital system and its Performance-Power-Area (PPA) characteristics, and the experimental results indicate that J3DAI can handle both simple and computationally intensive tasks, pointing to its potential for real-time, low-latency, and energy-efficient edge AI processing.
J3DAI 是一种面向 3 层 3D 堆叠 CMOS 图像传感器的基于 DNN 的微型硬件加速器,旨在高效执行图像分类和分割等任务。它通过 Aidge 软件框架进行编程,该框架同时面向主机处理器和 DNN 加速器,并支持训练后量化以减少内存占用和计算复杂度。论文重点介绍数字系统及其性能-功耗-面积(PPA)特性,实验结果表明 J3DAI 能够处理简单任务和计算密集型任务,突显了其在实时、低延迟和节能的边缘 AI 处理方面的潜力。
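As background for the post-training quantization mentioned in the J3DAI abstract, the sketch below shows a generic symmetric per-tensor scheme in plain Python. It assumes an 8-bit target and does not use or reproduce the Aidge API; the bit width, the function names, and the rounding policy are illustrative assumptions, since the abstract does not specify them.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to approximate float weights."""
    return [qi * scale for qi in q]

weights = [0.42, -1.0, 0.07, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight is now stored in 1 byte instead of 4 (float32),
# at the cost of an error of at most scale / 2 per weight.
print(q, max(abs(w - r) for w, r in zip(weights, restored)))
```

Storing weights as 8-bit integers plus a single scale factor cuts weight memory roughly fourfold relative to float32 and allows integer arithmetic, which is the kind of saving that matters on resource-constrained accelerators.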