arXiv Paper Digest

2026-01-06 03:31
Snapshot: 20260106_0331
AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Authors: Jiewen Chan, Zhenjun Zhao, Yu-Lun Liu
First: 2026-01-02T18:59:55+00:00 · Latest: 2026-01-02T18:59:55+00:00
Comments: Project page: https://jiewenchan.github.io/AdaGaR/
Abstract
Reconstructing dynamic 3D scenes from monocular videos requires simultaneously capturing high-frequency appearance details and temporally continuous motion. Existing methods using single Gaussian primitives are limited by their low-pass filtering nature, while standard Gabor functions introduce energy instability. Moreover, lack of temporal continuity constraints often leads to motion artifacts during interpolation. We propose AdaGaR, a unified framework addressing both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce Adaptive Gabor Representation, extending Gaussians through learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we employ Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combining depth estimation, point tracking, and foreground masks establishes stable point cloud distributions in early training. Experiments on Tap-Vid DAVIS demonstrate state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/
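Title/abstract (translated from Chinese)
Title: AdaGaR: Adaptive Gabor Representation for Dynamic Scene Reconstruction
Reconstructing dynamic 3D scenes from monocular video requires capturing high-frequency appearance detail and temporally continuous motion at the same time. Existing methods built on single Gaussian primitives are limited by their low-pass filtering behavior, while standard Gabor functions introduce energy instability. In addition, the lack of temporal continuity constraints often produces motion artifacts during interpolation. We propose AdaGaR, a unified framework that achieves both frequency adaptivity and temporal continuity in explicit dynamic scene modeling. We introduce an Adaptive Gabor Representation that extends Gaussians with learnable frequency weights and adaptive energy compensation to balance detail capture and stability. For temporal continuity, we use Cubic Hermite Splines with Temporal Curvature Regularization to ensure smooth motion evolution. An Adaptive Initialization mechanism combines depth estimation, point tracking, and foreground masks to establish a stable point cloud distribution early in training. Experiments on Tap-Vid DAVIS show state-of-the-art performance (PSNR 35.49, SSIM 0.9433, LPIPS 0.0723) and strong generalization across frame interpolation, depth consistency, video editing, and stereo view synthesis. Project page: https://jiewenchan.github.io/AdaGaR/
Summary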
AdaGaR is a unified framework for reconstructing dynamic 3D scenes from monocular videos, addressing both frequency adaptivity and temporal continuity. It uses Adaptive Gabor Representation, which extends Gaussians with learnable frequency weights and adaptive energy compensation for better detail capture and stability. For temporal continuity, it employs Cubic Hermite Splines with Temporal Curvature Regularization. Experiments show AdaGaR achieves state-of-the-art performance with PSNR 35.49, SSIM 0.9433, and LPIPS 0.0723 on Tap-Vid DAVIS, and demonstrates strong generalization across various tasks.
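AdaGaR is a unified framework for reconstructing dynamic scenes from monocular video; it addresses the limitations of existing methods by introducing an Adaptive Gabor Representation and temporal continuity constraints. The method uses learnable frequency weights and adaptive energy compensation to capture high-frequency detail while remaining stable, and applies cubic Hermite interpolation with temporal curvature regularization to ensure smooth motion evolution. An adaptive initialization mechanism stabilizes the initial point cloud distribution. Experiments show that AdaGaR reaches state-of-the-art performance on PSNR, SSIM, and LPIPS and generalizes well to tasks such as frame interpolation and video editing.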
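The cubic Hermite interpolation mentioned in this entry can be illustrated with the textbook basis functions. The sketch below is not the authors' code: the function names are hypothetical, and `curvature_penalty` is only a simple second-difference stand-in for the idea of penalising curvature over time, not the paper's Temporal Curvature Regularization.

```python
import numpy as np

def hermite_interpolate(t, t0, t1, p0, p1, m0, m1):
    """Evaluate one cubic Hermite segment between keyframes (t0, p0) and (t1, p1).

    m0 and m1 are the tangents (velocities) at the two keyframes.
    """
    h = t1 - t0
    s = (t - t0) / h  # normalised time in [0, 1]
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * h * m0 + h01 * p1 + h11 * h * m1

def curvature_penalty(points):
    """Mean squared second finite difference of a sampled trajectory (N, D).

    Illustrative assumption only: a generic smoothness penalty.
    """
    points = np.asarray(points, dtype=float)
    second_diff = points[2:] - 2 * points[1:-1] + points[:-2]
    return float(np.mean(np.sum(second_diff**2, axis=-1)))
```

The segment passes through both keyframes exactly, and a trajectory sampled from uniform straight-line motion has zero penalty under this stand-in.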
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Authors: Valentin Noël
First: 2026-01-02T18:49:37+00:00 · Latest: 2026-01-02T18:49:37+00:00
Comments: 58 pages, 19 figures, Under Review
Abstract
We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
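Title/abstract (translated from Chinese)
Title: Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
We present a training-free method that detects valid mathematical reasoning in large language models through spectral analysis of attention patterns. Treating attention matrices as adjacency matrices of dynamic graphs, we extract four interpretable spectral diagnostics: the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. These show statistically significant differences between valid and invalid mathematical proofs. Experiments on seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) show that this spectral signature yields effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), giving 85.0% to 95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93% to 95% on the full dataset. The method needs no training data, fine-tuning, or learned classifier: a single threshold on one spectral metric is enough for high accuracy. Through systematic label correction, we find that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject because of technical failures. We also find an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), showing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification, with immediate applications to hallucination detection and AI safety monitoring.
Summary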
The study introduces a training-free method to detect valid mathematical reasoning in large language models using spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs, the method extracts four spectral diagnostics: Fiedler value, high-frequency energy ratio, graph signal smoothness, and spectral entropy. These diagnostics show significant differences between valid and invalid proofs. Experiments across seven transformer models from four independent architectural families demonstrate high classification accuracy (85.0–95.6%) and calibrated thresholds of 93–95% accuracy. The method identifies logical coherence rather than compiler acceptance and reveals architectural dependencies affecting the discriminative signal.
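The study proposes a training-free method that identifies valid mathematical reasoning in large language models by analyzing the spectral features of attention patterns. Treating attention matrices as adjacency matrices of dynamic graphs, the author extracts four spectral diagnostics: Fiedler value, high-frequency energy ratio, graph signal smoothness, and spectral entropy. These diagnostics differ significantly between valid and invalid proofs. Across seven transformer models from four architectural families, classification accuracy is 85.0% to 95.6%, and calibrated thresholds reach 93% to 95% on the full dataset. The method detects logical coherence rather than compiler acceptance, and it shows that the discriminative signal depends on the architecture.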
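To make the first of the four diagnostics concrete, here is a minimal sketch of a Fiedler value computed from a single attention matrix. The preprocessing is an assumption, not taken from the paper: the matrix is symmetrised by averaging with its transpose, self-loops are dropped, and the unnormalised graph Laplacian is used. The function name `fiedler_value` is hypothetical.

```python
import numpy as np

def fiedler_value(attention):
    """Second-smallest eigenvalue of the graph Laplacian built from attention.

    Assumptions (not from the paper): symmetrise with (A + A^T) / 2,
    remove self-loops, use the unnormalised Laplacian L = D - W.
    """
    a = np.asarray(attention, dtype=float)
    w = 0.5 * (a + a.T)           # undirected edge weights
    np.fill_diagonal(w, 0.0)      # drop self-attention loops
    laplacian = np.diag(w.sum(axis=1)) - w
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return float(eigenvalues[1])
```

A value near zero indicates a token graph that almost splits into disconnected parts; larger values indicate tighter connectivity. The digest entry reports that a single threshold on one spectral metric suffices for classification, but the choice of metric and threshold is the paper's and is not reproduced here.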
Categorical Reparameterization with Denoising Diffusion models
Authors: Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati
First: 2026-01-02T18:30:05+00:00 · Latest: 2026-01-02T18:30:05+00:00
Comments: working paper
Abstract
Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate admitting a pathwise (reparameterized) gradient, at the cost of optimizing a biased, temperature-dependent objective. In this paper, we extend this family of relaxations by introducing a diffusion-based soft reparameterization for categorical distributions. For these distributions, the denoiser under a Gaussian noising process admits a closed form and can be computed efficiently, yielding a training-free diffusion sampler through which we can backpropagate. Our experiments show that the proposed reparameterization trick yields competitive or improved optimization performance on various benchmarks.
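Title/abstract (translated from Chinese)
Title: Categorical Reparameterization with Denoising Diffusion models
Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate admitting a pathwise (reparameterized) gradient, at the cost of optimizing a biased, temperature-dependent objective. In this paper, we extend this family of relaxations by introducing a diffusion-based soft reparameterization for categorical distributions. For these distributions, the denoiser under a Gaussian noising process has a closed form and can be computed efficiently, which yields a training-free diffusion sampler that we can backpropagate through. Our experiments show that the proposed reparameterization trick gives competitive or improved optimization performance on a range of benchmarks.
Summary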
The paper addresses gradient-based optimization over categorical variables, where existing gradient estimators are either unbiased but noisy (score-function estimators) or biased (continuous relaxations). It introduces a diffusion-based soft reparameterization for categorical distributions: because the denoiser under Gaussian noising has a closed form, the resulting diffusion sampler needs no training and can be backpropagated through. The experiments show competitive or improved optimization performance across various benchmarks.
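The paper tackles the noise or bias that arises when optimizing discrete variables with gradient methods. It proposes a diffusion-based soft reparameterization for discrete distributions that can be computed efficiently and backpropagated through. Experiments show optimization performance on par with or better than existing techniques across a range of benchmarks.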
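The closed-form denoiser can be worked out in a few lines under one assumed noising convention, which may differ from the paper's schedule: a one-hot sample `x0` with class probabilities `pi` is noised as `x_t = alpha * x0 + sigma * eps` with standard Gaussian `eps`. Bayes' rule then gives the posterior mean `E[x0 | x_t] = softmax(log pi + alpha * x_t / sigma**2)`, because the terms of the squared distance that do not depend on the class cancel. The function name is hypothetical.

```python
import numpy as np

def categorical_denoiser(x_t, log_pi, alpha, sigma):
    """Posterior mean E[x0 | x_t] for one-hot x0 under Gaussian noising.

    Assumed convention (not from the paper): x_t = alpha * x0 + sigma * eps.
    """
    logits = log_pi + alpha * x_t / sigma**2
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Every operation here is differentiable in `log_pi`, which is what makes a sampler built from such a denoiser a candidate for pathwise gradients.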
Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
First: 2025-12-28T21:57:42+00:00 · Latest: 2026-01-02T18:25:09+00:00
Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
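Title/abstract (translated from Chinese)
Title: Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Recent advances in Reinforcement Learning (RL) for Large Language Models (LLMs) have improved reasoning tasks, but their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained with an R1-style method (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluation on the CheXpert and NIH benchmarks reveals a fundamental tension: GRPO recovers in-distribution performance (a 23% improvement on CheXpert, macro-F1 = 0.346) but reduces cross-dataset transferability (a 19% drop on NIH). This mirrors high-resource models such as NV-Reason-CXR-3B, suggesting that the problem stems from the RL paradigm rather than from scale. We identify a generalization paradox in which the SFT checkpoint uniquely improves on NIH before optimization, indicating that teacher-guided reasoning captures more institution-agnostic features. Cross-model comparisons further show that structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Curated supervised fine-tuning may therefore outperform aggressive RL for clinical deployments that require robustness across diverse populations.
Summary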
This paper explores the application of Reinforcement Learning (RL) to medical imaging using a vision-language model, ChexReason, trained with limited resources. GRPO improves in-distribution performance on CheXpert (a 23% improvement) but reduces cross-dataset transferability to NIH (a 19% drop), indicating a fundamental tension between benchmark success and clinical applicability. The study suggests that the RL paradigm itself, rather than scale, may be the cause, and that curated supervised fine-tuning might be more effective for robust clinical deployment across diverse populations.
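The study examines RL for medical imaging under limited resources, training the ChexReason vision-language model. Although RL has improved reasoning tasks, RL optimization here produces a trade-off between in-distribution performance and cross-dataset transferability. Performance improves by 23% on CheXpert but drops by 19% on NIH, suggesting that the problem stems from the RL paradigm itself rather than from insufficient scale. The study also reveals a generalization paradox: the supervised fine-tuning checkpoint improves cross-dataset performance before optimization, indicating that teacher-guided reasoning captures more institution-agnostic features.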
Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Authors: Shambhavi Mishra, Julio Silva-Rodriguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
First: 2024-11-26T00:15:37+00:00 · Latest: 2026-01-02T18:18:27+00:00
Comments: Added additional figures to communicate the algorithm
Abstract
Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods, yet being computationally efficient.
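Title/abstract (translated from Chinese)
Title: Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Large pre-trained vision-language models (VLMs) such as CLIP have shown unprecedented zero-shot performance across a wide range of tasks. However, these models can be unreliable under distribution shift, where their performance degrades significantly. In this work, we study how to use class text information efficiently to mitigate the distribution drift that VLMs encounter during inference. In particular, we propose generating pseudo-labels for noisy test-time samples by aligning visual embeddings with reliable text-based semantic anchors. To preserve the regular structure of the dataset, we formulate the problem as a batch-wise label assignment, which is solved efficiently with Optimal Transport. Our method, Semantic Anchor Transport (SAT), uses these pseudo-labels as supervisory signals for test-time adaptation, giving a principled cross-modal alignment solution. SAT also exploits heterogeneous textual clues through a multi-template distillation approach that replicates the multi-view contrastive learning strategies of unsupervised representation learning without adding computational complexity. Extensive experiments on several popular test-time adaptation benchmarks of varying complexity show that SAT achieves consistent gains over recent state-of-the-art methods while remaining computationally efficient.
Summary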
This work addresses the issue of distributional shifts in large pre-trained vision-language models (VLMs) like CLIP, which can degrade their performance during inference. The authors propose Semantic Anchor Transport (SAT), a method that generates pseudo-labels for test-time samples by aligning visual embeddings with reliable text-based semantic anchors using Optimal Transport. SAT uses these pseudo-labels for test-time adaptation, achieving consistent performance gains over recent state-of-the-art methods while maintaining computational efficiency. Extensive experiments on various benchmarks demonstrate SAT's effectiveness in mitigating distribution drifts.
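This work addresses the performance degradation that large pre-trained vision-language models (VLMs) suffer under distribution shift at inference time. The authors propose Semantic Anchor Transport (SAT), which generates pseudo-labels for noisy test samples by aligning visual embeddings with reliable text-based semantic anchors. SAT solves the batch-wise label assignment with Optimal Transport and uses multi-template distillation for cross-modal alignment. Experiments show that SAT outperforms recent state-of-the-art methods on several test-time adaptation benchmarks while remaining computationally efficient.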
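Batch-wise label assignment with Optimal Transport is commonly solved with entropy-regularised Sinkhorn iterations. The sketch below shows that generic technique, not SAT itself. Several things are assumptions: the uniform marginal over classes, the values of `epsilon` and `n_iters`, the function name, and similarities that lie in a bounded range such as cosine similarity.

```python
import numpy as np

def sinkhorn_pseudo_labels(similarity, epsilon=0.05, n_iters=200):
    """Soft pseudo-labels from a (batch, classes) image-text similarity matrix.

    Generic Sinkhorn sketch with uniform marginals (an assumption): each
    sample carries mass 1/batch and each class receives mass 1/classes.
    """
    similarity = np.asarray(similarity, dtype=float)
    n, k = similarity.shape
    kernel = np.exp((similarity - similarity.max()) / epsilon)
    row_marginal = np.full(n, 1.0 / n)
    col_marginal = np.full(k, 1.0 / k)
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)    # rescale rows to match sample mass
        v = col_marginal / (kernel.T @ u)  # rescale columns to match class mass
    plan = u[:, None] * kernel * v[None, :]
    return plan / plan.sum(axis=1, keepdims=True)  # rows become distributions
```

Compared with a per-sample softmax, the column constraint stops one class from absorbing the whole batch, which is one way to read the entry's remark about preserving the structure of the dataset.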
Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
First: 2026-01-02T18:17:22+00:00 · Latest: 2026-01-02T18:17:22+00:00
Comments: Accepted at IJCB 2025
Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we aim to explore the potential of MLLMs for audio deepfake detection. Combining audio inputs with a range of text prompts as queries to find out the viability of MLLMs to learn robust representations across modalities for audio deepfake detection. Therefore, we attempt to explore text-aware and context-rich, question-answer based prompts with binary decisions. We hypothesise that such a feature-guided reasoning will help in facilitating deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. Our experiments show that the models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
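Title/abstract (translated from Chinese)
Title: Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Although Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. This work explores the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to test whether MLLMs can learn robust cross-modal representations for audio deepfake detection. To that end, we explore text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning facilitates deeper multimodal understanding and enables more robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments suggest that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
Summary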
This study investigates the use of Multi-modal Large Language Models (MLLMs) for detecting audio deepfakes, focusing on text-aware and context-rich prompts. The research combines audio inputs with various text queries to explore robust multimodal representation learning. Experiments with Qwen2-Audio-7B-Instruct and SALMONN show that while these models perform poorly without task-specific training, they achieve good performance on in-domain data with minimal supervision, indicating potential for audio deepfake detection.
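The study investigates whether Multi-modal Large Language Models (MLLMs) can be used for audio deepfake detection, focusing on text-aware and context-rich prompts. It evaluates two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in zero-shot and fine-tuned modes. The experiments show that the models perform poorly without task-specific training but achieve good performance on in-domain data with minimal supervision, indicating potential for audio deepfake detection.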
Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
Authors: Yingtao Zhang, Diego Cerretti, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
First: 2025-01-31T13:04:37+00:00 · Latest: 2026-01-02T18:15:12+00:00
Abstract
Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties in keeping peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown ultra-sparse (less than 1% connectivity) advantage across various tasks compared to fully connected networks. Yet, CHT suffers two main drawbacks: (i) its time complexity is $O(Nd^3)$ - N node network size, d node degree - restricting it to ultra-sparse regimes. (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. Here, we design the first brain-inspired network model - termed bipartite receptive field (BRF) - to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. Additionally, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods in language modeling task.
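Title/abstract (translated from Chinese)
Title: Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
Dynamic sparse training (DST) can reduce the computational demands of ANNs but has difficulty keeping peak performance at high sparsity levels. Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT uses gradient-free, topology-driven link regrowth, which has shown an ultra-sparse (less than 1% connectivity) advantage over fully connected networks across various tasks. However, CHT has two main drawbacks: (i) its time complexity is $O(Nd^3)$, where N is the network size in nodes and d the node degree, which restricts it to ultra-sparse regimes; (ii) it selects the top link prediction scores, which is inappropriate in early training epochs, when the network's connections are unreliable. Here we design the first brain-inspired network model, termed bipartite receptive field (BRF), to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing the complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing exploration and exploitation of the network topology. In addition, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of the connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks on two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods on a language modeling task.
Summary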
The research aims to improve dynamic sparse training (DST) of artificial neural networks (ANNs), which struggles to maintain peak performance at high sparsity levels. The study introduces a brain-inspired network model, the bipartite receptive field (BRF), to initialize sparse connectivity, together with a soft-rule variant of Cannistraci-Hebb training (CHTs) that samples connections flexibly during link removal and regrowth, and a version with sigmoid gradual density decay (CHTss). With 1% of the connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks; with 5% of the connections, CHTss outperforms fully connected networks on two Transformer-based machine translation tasks. At 30% connectivity, both CHTs and CHTss surpass other DST methods on a language modeling task.
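The research aims to improve the performance of artificial neural networks (ANNs) at high sparsity by improving dynamic sparse training (DST), which has difficulty maintaining peak performance. It introduces a brain-inspired model, the bipartite receptive field (BRF), to initialize connectivity, and a flexible Cannistraci-Hebb training soft rule (CHTs) that balances exploration and exploitation of the network topology. With 1% of the connections, CHTs outperforms fully connected networks on image classification; with 5% of the connections, CHTss outperforms fully connected networks on machine translation; at 30% connectivity, both CHTs and CHTss outperform other DST methods on language modeling.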
LLM Agents for Combinatorial Efficient Frontiers: Investment Portfolio Optimization
Authors: Simon Paquette-Greenbaum, Jiangbo Yu
First: 2026-01-02T18:02:13+00:00 · Latest: 2026-01-02T18:02:13+00:00
Abstract
Investment portfolio optimization is a task conducted in all major financial institutions. The Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO) problem formulation is ubiquitous for portfolio optimization. The challenge of this type of portfolio optimization, a mixed-integer quadratic programming (MIQP) problem, arises from the intractability of solutions from exact solvers, where heuristic algorithms are used to find approximate portfolio solutions. CCPO entails many laborious and complex workflows and also requires extensive effort pertaining to heuristic algorithm development, where the combination of pooled heuristic solutions results in improved efficient frontiers. Hence, common approaches are to develop many heuristic algorithms. Agentic frameworks emerge as a promising candidate for many problems within combinatorial optimization, as they have been shown to be equally efficient with regard to automating large workflows and have been shown to be excellent in terms of algorithm development, sometimes surpassing human-level performance. This study implements a novel agentic framework for the CCPO and explores several concrete architectures. In benchmark problems, the implemented agentic framework matches state-of-the-art algorithms. Furthermore, complex workflows and algorithm development efforts are alleviated, while in the worst case, lower but acceptable error is reported.
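Title/abstract (translated from Chinese)
Title: LLM Agents for Combinatorial Efficient Frontiers: Investment Portfolio Optimization
Investment portfolio optimization is a task carried out in all major financial institutions. The Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO) formulation is ubiquitous in portfolio optimization. This kind of portfolio optimization is a mixed-integer quadratic programming (MIQP) problem; its difficulty comes from the intractability of exact solvers, so heuristic algorithms are used to find approximate portfolio solutions. CCPO involves many laborious and complex workflows and requires extensive effort on heuristic algorithm development, where pooling heuristic solutions improves the efficient frontier. The common approach is therefore to develop many heuristic algorithms. Agentic frameworks are a promising candidate for many problems in combinatorial optimization: they have been shown to be effective at automating large workflows and excellent at algorithm development, sometimes surpassing human-level performance. This study implements a novel agentic framework for CCPO and explores several concrete architectures. On benchmark problems, the implemented agentic framework matches state-of-the-art algorithms. Complex workflows and algorithm development effort are also alleviated, while in the worst case a lower but acceptable error is reported.
Summary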
This study addresses Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO), a mixed-integer quadratic programming problem for which exact solvers are intractable and heuristic algorithms are the norm. It implements an agentic framework that automates the complex workflows and algorithm development, and matches state-of-the-art algorithms on benchmark problems. Workflow and algorithm development effort is alleviated; for the worst case, the abstract reports "lower but acceptable error".
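This study tackles Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO), a mixed-integer quadratic programming problem, by implementing an agentic framework. It explores several architectures of the framework for optimizing portfolios. The results show that the framework can match state-of-the-art algorithms on benchmark problems and reduces the effort spent on complex workflows and algorithm development, at the price of a lower but acceptable error in the worst case.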
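For readers unfamiliar with CCPO, one common textbook statement of the problem is given below; the entry itself does not spell out the formulation, so every symbol is an assumption made for illustration. Here $w_i$ is the weight of asset $i$, $z_i$ indicates whether asset $i$ is held, $\mu_i$ and $\sigma_{ij}$ are expected returns and covariances, $K$ is the number of assets held, $\varepsilon_i$ and $\delta_i$ bound the weight of a held asset, and $\lambda \in [0,1]$ trades risk against return.

```latex
\begin{aligned}
\min_{w,\,z}\quad & \lambda \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \sigma_{ij} \;-\; (1-\lambda)\sum_{i=1}^{N} \mu_i w_i \\
\text{s.t.}\quad & \sum_{i=1}^{N} w_i = 1, \qquad \sum_{i=1}^{N} z_i = K, \\
& \varepsilon_i z_i \le w_i \le \delta_i z_i, \qquad z_i \in \{0,1\}, \quad i = 1,\dots,N.
\end{aligned}
```

The binary variables $z_i$ combined with the quadratic risk term make this a mixed-integer quadratic program, and sweeping $\lambda$ from 0 to 1 traces the efficient frontier that the heuristics approximate.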
Unified Primitive Proxies for Structured Shape Completion
Authors: Zhaiyu Chen, Yuqing Wang, Xiao Xiang Zhu
First: 2026-01-02T17:32:40+00:00 · Latest: 2026-01-02T17:32:40+00:00
Abstract
Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which enables primitive-based surface reconstruction. Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies, learnable queries that are contextualized to produce assembly-ready outputs. To ensure consistent optimization, our training strategy couples primitives and points with online target updates. Across synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: https://unico-completion.github.io.
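Title/abstract (translated from Chinese)
Title: Unified Primitive Proxies for Structured Shape Completion
Structured shape completion recovers missing geometry as primitives rather than as unstructured points, which makes primitive-based surface reconstruction possible. Instead of following the prevailing cascade, we rethink how primitives and points should interact and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which predicts, in a single feed-forward pass, a set of primitives with complete geometry, semantics, and inlier membership. To drive this unified representation, we introduce primitive proxies: learnable queries that are contextualized to produce assembly-ready outputs. To keep optimization consistent, our training strategy couples primitives and points through online target updates. On synthetic and real-world benchmarks with four independent assembly solvers, UniCo consistently outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. These results establish an attractive recipe for structured 3D understanding from incomplete data. Project page: https://unico-completion.github.io.
Summary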
The research aims to improve structured shape completion by predicting primitives rather than unstructured points. UniCo, a unified model, predicts complete primitives in a single pass, which includes geometry, semantics, and inlier membership. The model uses primitive proxies to generate assembly-ready outputs and couples primitives and points during training. Experimental results show that UniCo outperforms recent baselines, reducing Chamfer distance by up to 50% and improving normal consistency by up to 7%. This approach provides a promising method for structured 3D understanding from incomplete data.
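The work predicts, in a single feed-forward pass, primitives with complete geometry, semantics, and inlier membership. UniCo uses primitive proxies to generate assembly-ready outputs and outperforms recent baselines, lowering Chamfer distance by up to 50% and improving normal consistency by up to 7%. The approach offers a strong method for structured 3D understanding from incomplete data.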
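Chamfer distance, the headline metric in this entry, compares two point sets by nearest-neighbour distances in both directions. Conventions vary (squared or unsquared distances, sums or means), and the entry does not say which one the paper uses; the sketch below assumes the symmetric mean of squared distances, and the function name is hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, D) and b (M, D).

    Assumed convention: mean of squared nearest-neighbour distances,
    summed over both directions.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances, shape (N, M).
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(sq.min(axis=1).mean() + sq.min(axis=0).mean())
```

This brute-force version builds the full N by M distance matrix, so it suits small point sets; larger clouds usually call for a spatial index.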
Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics
Authors: Akash Samanta, Sheldon Williamson
First: 2025-12-30T19:57:52+00:00 · Latest: 2026-01-02T17:32:09+00:00
Comments: This preprint focuses on the theoretical framework and diagnostic behavior. Comprehensive experimental validation in application-specific settings is deferred to a companion experimental study
Abstract
Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference (TD) error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Within this framework, we introduce three diagnostic-driven instantiations: the Human-inspired Supervised Adaptive Optimizer (HSAO), Hybrid Error-Diagnostic Reinforcement Learning (HED-RL) for actor-critic methods, and the Meta-Learned Learning Policy (MLLP). Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to TD error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
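Title/abstract (translated from Chinese)
Title: Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics
Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics change over time. Although modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, noise, and alignment, which capture persistent drift, stochastic variability, and repeated directional excitation leading to overshoot, respectively. These diagnostics are computed online from lightweight statistics of loss or temporal-difference (TD) error trajectories and are independent of model architecture and task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Within this framework, we introduce three diagnostic-driven instantiations: the Human-inspired Supervised Adaptive Optimizer (HSAO), Hybrid Error-Diagnostic Reinforcement Learning (HED-RL) for actor-critic methods, and the Meta-Learned Learning Policy (MLLP). Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to TD error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
Summary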
This paper introduces a diagnostic-driven adaptive learning framework that decomposes error evolution into bias, noise, and alignment to address instability and slow convergence in nonstationary environments. The diagnostics are computed online from lightweight statistics of loss or temporal-difference error trajectories and do not depend on model architecture or task domain. The paper presents three instantiations, HSAO for supervised optimization, HED-RL for actor-critic reinforcement learning, and MLLP for learned optimizers, and establishes bounded effective updates and stability properties for each under standard smoothness assumptions. According to the arXiv comments, comprehensive experimental validation is deferred to a companion study.
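The paper proposes a diagnostic-driven adaptive learning framework that addresses instability in nonstationary environments by decomposing error evolution into bias, noise, and alignment. The framework computes these diagnostics online and applies them to supervised optimization, reinforcement learning, and learned optimizers. The main contributions are the introduction of HSAO, HED-RL, and MLLP, with bounded effective updates and stability properties established under smoothness assumptions. The work provides an interpretable, lightweight foundation for reliable adaptive learning in dynamic environments.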
Memory Bank Compression for Continual Adaptation of Large Language Models
Authors: Thomas Katraouras, Dimitrios Rafailidis
First: 2026-01-02T17:22:34+00:00 · Latest: 2026-01-02T17:22:34+00:00
Comments: Accepted to the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26)
Abstract
Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
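Title/abstract (translated from Chinese)
Title: Memory Bank Compression for Continual Adaptation of Large Language Models
Large Language Models (LLMs) have become a mainstay of many everyday applications. However, as data evolve, their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Methods such as full fine-tuning can incorporate new data, but they are computationally expensive and prone to catastrophic forgetting, in which prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, an external memory module that stores information for future use. These methods face a critical limitation, however: in real-world scenarios where large-scale data streams arrive, the memory bank keeps growing. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we apply Key-Value Low-Rank Adaptation in the attention layers of the LLM so that the compressed memory representations can be used efficiently. Experiments on benchmark question-answering datasets show that MBC reduces the memory bank size to 0.3% compared with the most competitive baseline while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
Summary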
This paper addresses the challenge of continual learning for Large Language Models (LLMs) by proposing MBC, which compresses the memory bank through codebook optimization and introduces an online resetting mechanism to prevent codebook collapse. The method also employs Key-Value Low-Rank Adaptation in attention layers to efficiently utilize compressed memory representations. Experiments show that MBC reduces the memory bank size to 0.3% compared to the most competitive baseline while maintaining high retention accuracy during online adaptation learning.
本文提出了一种名为MBC的方法,通过代码本优化压缩记忆库,并引入在线重置机制防止代码本崩溃。该方法还在注意力层中使用了键值低秩适应,以高效利用压缩的记忆表示。实验表明,与最具竞争力的基线相比,MBC将记忆库大小压缩到0.3%,同时在在线适应学习过程中保持了较高的保留准确性。
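The codebook idea in the MBC entry above can be illustrated with a minimal vector-quantization sketch. This is not the authors' implementation: the nearest-code assignment, the k-means-style update, and the rule that resets an unused code to a random memory vector are all assumptions chosen for illustration, and the names `compress` and `update_codebook` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def compress(memory, codebook):
    """Store one small integer per memory vector instead of the vector itself."""
    return [nearest_code(v, codebook) for v in memory]


def update_codebook(memory, codebook, rng):
    """One k-means-style step; unused ("dead") codes are reset (assumed rule)."""
    assignments = compress(memory, codebook)
    updated = []
    for i, code in enumerate(codebook):
        members = [v for v, a in zip(memory, assignments) if a == i]
        if members:
            # Move the code to the mean of the vectors assigned to it.
            updated.append([sum(m[d] for m in members) / len(members)
                            for d in range(len(code))])
        else:
            # Reset a dead code to a random memory vector to avoid collapse.
            updated.append(list(rng.choice(memory)))
    return updated


rng = random.Random(0)
# Toy memory bank with two well-separated groups (made-up values).
memory = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
codebook = [[0.0, 0.0], [100.0, 100.0], [1.0, 1.0]]
for _ in range(5):
    codebook = update_codebook(memory, codebook, rng)
indices = compress(memory, codebook)
```

After a few updates every code sits inside one of the two groups, so the memory bank can be stored as `indices` plus the small codebook.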
Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation
Authors: Lameya Sabrin, Md. Sanaullah Chowdhury, Salauddin Tapu, Noyon Kumar Sarkar, Ferdous Bin Ali
First: 2025-04-20T19:04:43+00:00 · Latest: 2026-01-02T16:47:20+00:00
Abstract
Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduced Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.
Summary / 总结
Med-2D SegNet is designed to address the challenges of accurate and efficient medical image segmentation, particularly in the variability of anatomical structures. It employs a compact Med Block to achieve state-of-the-art performance across multiple datasets with an average Dice similarity coefficient of 89.77%, while maintaining a minimal computational footprint of 2.07 million parameters. The model excels in cross-dataset generalization, especially in polyp segmentation, showing strong performance on unseen datasets.
Med-2D SegNet旨在解决医学图像分割中的准确性和效率问题,特别是在具有不同解剖结构的多样化数据集中的挑战。它通过紧凑的Med Block实现多项基准测试中的顶级性能,平均Dice相似系数为89.77%,同时保持了207万参数的低计算复杂度。该模型在跨数据集泛化方面表现出色,特别是在息肉分割中,展示了强大的零样本学习能力。
Stochastic Actor-Critic: Mitigating Overestimation via Temporal Aleatoric Uncertainty
Authors: Uğurcan Özalp
First: 2026-01-02T16:33:17+00:00 · Latest: 2026-01-02T16:33:17+00:00
Comments: 19 pages
Abstract
Off-policy actor-critic methods in reinforcement learning train a critic with temporal-difference updates and use it as a learning signal for the policy (actor). This design typically achieves higher sample efficiency than purely on-policy methods. However, critic networks tend to systematically overestimate values. This is often addressed by introducing a pessimistic bias based on uncertainty estimates. Current methods employ ensembling to quantify the critic's epistemic uncertainty (uncertainty due to limited data and model ambiguity) to scale pessimistic updates. In this work, we propose a new algorithm called Stochastic Actor-Critic (STAC) that incorporates temporal (one-step) aleatoric uncertainty (uncertainty arising from stochastic transitions, rewards, and policy-induced variability in Bellman targets) to scale pessimistic bias in temporal-difference updates, rather than relying on epistemic uncertainty. STAC uses a single distributional critic network to model the temporal return uncertainty, and applies dropout to both the critic and actor networks for regularization. Our results show that pessimism based on a distributional critic alone suffices to mitigate overestimation, and naturally leads to risk-averse behavior in stochastic environments. Introducing dropout further improves training stability and performance by means of regularization. With this design, STAC achieves improved computational efficiency using a single distributional critic network.
中文标题/摘要
标题:随机行为-批评家:通过时间偶然不确定性减轻过度估计
在强化学习中,离策行为-批评家方法使用时间差分更新训练批评家,并将其作为策略(行为家)的学习信号。这种设计通常比纯策方法具有更高的样本效率。然而,批评家网络倾向于系统地高估价值估计。这通常通过引入基于不确定性估计的悲观偏见来解决。当前的方法通过集成来量化批评家的先验不确定性——由于数据有限和模型模糊性导致的不确定性——以调整悲观更新。在本工作中,我们提出了一种名为随机行为-批评家(STAC)的新算法,该算法将时间(一步)偶然不确定性——来自随机转换、奖励和策略诱导的贝尔曼目标变异性——纳入时间差分更新的悲观偏见中,而不是依赖于先验不确定性。STAC 使用单一的分布批评家网络来建模时间回报不确定性,并在批评家和行为家网络中应用 dropout 以进行正则化。我们的结果显示,仅基于分布批评家的悲观性足以减轻过度估计,并自然导致在随机环境中表现出风险规避行为。引入 dropout 进一步通过正则化提高训练稳定性和性能。通过这种设计,STAC 使用单一的分布批评家网络实现了改进的计算效率。
Summary / 总结
The research aims to address the overestimation issue in off-policy actor-critic methods by incorporating temporal aleatoric uncertainty. The proposed Stochastic Actor-Critic (STAC) algorithm uses a single distributional critic network to model the uncertainty and applies dropout for regularization. Experimental results show that this approach effectively mitigates overestimation and leads to risk-averse behavior, improving performance in stochastic environments.
研究通过提出Stochastic Actor-Critic (STAC)算法,将时间上的 aleatoric 不确定性引入时间差分更新中,以解决 off-policy actor-critic 方法中的过估计问题。STAC 使用单一的分布性批评网络和 dropout 正则化,从而在随机环境中实现更好的样本效率和风险规避行为。
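As a rough illustration of uncertainty-scaled pessimism in a temporal-difference target, the sketch below subtracts a multiple of the critic's predicted standard deviation from its predicted mean before bootstrapping. The abstract does not give STAC's exact update, so this form, the coefficient `beta`, and the function name are assumptions.

```python
def pessimistic_td_target(reward, next_mean, next_std,
                          gamma=0.99, beta=0.5, done=False):
    """TD target r + gamma * (mean - beta * std), with an assumed pessimism scale.

    `next_mean` and `next_std` stand for the mean and standard deviation that a
    distributional critic predicts for the next state-action pair.
    """
    if next_std < 0:
        raise ValueError("standard deviation must be non-negative")
    # No bootstrapping past a terminal transition.
    bootstrap = 0.0 if done else next_mean - beta * next_std
    return reward + gamma * bootstrap


target = pessimistic_td_target(1.0, next_mean=10.0, next_std=2.0,
                               gamma=0.9, beta=0.5)
```

A larger predicted standard deviation lowers the target, which is what makes the resulting policy risk-averse in noisy regions.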
Data-Driven Analysis of Crash Patterns in SAE Level 2 and Level 4 Automated Vehicles Using K-means Clustering and Association Rule Mining
Authors: Jewel Rana Palit, Vijayalakshmi K Kumarasamy, Osama A. Osman
First: 2025-12-27T13:30:07+00:00 · Latest: 2026-01-02T16:28:22+00:00
Comments: 7 tables, 7 figures, 23 pages including references
Abstract
Automated Vehicles (AVs) hold potential to reduce or eliminate human driving errors, enhance traffic safety, and support sustainable mobility. Recently, crash data has increasingly revealed that AV behavior can deviate from expected safety outcomes, raising concerns about the technology's safety and operational reliability in mixed traffic environments. While past research has investigated AV crashes, most studies rely on small, California-centered datasets, with a limited focus on understanding crash trends across various SAE Levels of automation. This study analyzes over 2,500 AV crash records from the United States National Highway Traffic Safety Administration (NHTSA), covering SAE Levels 2 and 4, to uncover underlying crash dynamics. A two-stage data mining framework is developed. K-means clustering is first applied to segment crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors. Then, Association Rule Mining (ARM) is used to extract interpretable multivariate relationships between crash patterns and crash contributors including lighting conditions, surface condition, vehicle dynamics, and environmental conditions within each cluster. These insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks.
中文标题/摘要
标题:基于K-means聚类和关联规则挖掘的SAE二级和四级自动驾驶汽车碰撞模式数据驱动分析
自动驾驶汽车(AV)有可能减少或消除人类驾驶错误,提高交通安全,并支持可持续交通。最近,碰撞数据越来越多地揭示出AV行为可能偏离预期的安全结果,引发了对其在混合交通环境中安全性和操作可靠性的担忧。尽管以往的研究已经探讨了AV碰撞,但大多数研究依赖于以加利福尼亚为中心的小规模数据集,且对不同自动化水平下的碰撞趋势理解有限。本研究分析了来自美国国家高速公路交通安全管理局(NHTSA)的超过2,500起AV碰撞记录,涵盖了SAE二级和四级,以揭示潜在的碰撞动态。开发了两阶段数据挖掘框架。首先应用K-means聚类,根据时间、空间和环境因素将碰撞记录分为4个不同的行为簇。然后使用关联规则挖掘(ARM)提取每个簇内碰撞模式和碰撞因素(包括照明条件、路面状况、车辆动态和环境条件)之间的可解释多变量关系。这些见解为AV开发者、安全监管机构和政策制定者在制定AV部署策略和减少碰撞风险方面提供了可操作的指导。
Summary / 总结
This study aims to analyze crash patterns in SAE Level 2 and Level 4 automated vehicles using a two-stage data mining framework. K-means clustering is applied to segment crash records into four distinct behavioral clusters based on temporal, spatial, and environmental factors, followed by Association Rule Mining to extract multivariate relationships between crash patterns and contributors. Key findings include the identification of specific crash dynamics and contributing factors within each cluster, providing actionable insights for AV developers, safety regulators, and policymakers to minimize crash risks.
本研究旨在使用两阶段数据挖掘框架分析SAE Level 2和Level 4自动驾驶车辆的碰撞模式。首先应用K-means聚类将碰撞记录分为四个不同的行为簇,基于时间、空间和环境因素,然后使用关联规则挖掘(ARM)来识别碰撞模式与照明条件、路面状况、车辆动态和环境条件之间的多变量关系。主要发现包括识别出四个行为簇以及可解释的关系,这些可以指导自动驾驶车辆开发者、安全监管机构和政策制定者减少碰撞风险。
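The association-rule stage described above scores each rule by support, confidence, and lift. The sketch below computes those three standard measures for one rule. The four records are invented toy data, not NHTSA records, and the item names are hypothetical.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= record for record in records) / len(records)


def rule_metrics(records, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    sup_a = support(records, antecedent)
    sup_c = support(records, consequent)
    sup_both = support(records, set(antecedent) | set(consequent))
    confidence = sup_both / sup_a if sup_a else 0.0
    lift = confidence / sup_c if sup_c else 0.0
    return {"support": sup_both, "confidence": confidence, "lift": lift}


# Toy records for a single cluster (made up for illustration only).
records = [
    {"dark", "wet", "rear_end"},
    {"dark", "dry", "rear_end"},
    {"daylight", "dry", "sideswipe"},
    {"dark", "wet", "sideswipe"},
]
metrics = rule_metrics(records, {"dark"}, {"rear_end"})
```

A lift above 1 means the consequent occurs more often with the antecedent than it does overall.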
Precision Autotuning for Linear Solvers via Contextual Bandit-Based RL
Authors: Erin Carson, Xinye Chen
First: 2026-01-02T15:59:42+00:00 · Latest: 2026-01-02T15:59:42+00:00
Abstract
We propose a reinforcement learning (RL) framework for adaptive precision tuning of linear solvers that can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems $Ax = b$. In this application, our approach dynamically chooses precisions based on features calculated from the system. In detail, a Q-table maps discretized features (e.g., approximate condition number and matrix norm) to actions (chosen precision configurations for specific steps), optimized via an epsilon-greedy strategy to maximize a multi-objective reward balancing accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and offers insight into utilizing RL precision selection for other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on precision autotuning with RL that is verified on unseen datasets.
中文标题/摘要
标题:基于上下文多臂老虎机的线性求解器精度自调优
我们提出了一种基于强化学习(RL)的框架,用于自适应调整线性求解器的精度,并可扩展到一般算法。该框架被形式化为一个上下文多臂老虎机问题,并使用增量动作值估计和离散化状态空间来选择计算步骤的最佳精度配置,以平衡精度和计算效率。为了验证其有效性,我们将该框架应用于求解线性系统 $Ax = b$ 的迭代校正。在此应用中,我们的方法根据系统计算出的特征动态选择精度。具体而言,Q表将离散化的特征(例如,近似条件数和矩阵范数)映射到动作(特定步骤选择的精度配置),通过ε-贪婪策略优化以最大化平衡准确性和计算成本的多目标奖励。实验证明了有效的精度选择,降低了计算成本同时保持与双精度基线相当的准确性。该框架适用于各种离样本数据,并为利用RL选择精度以其他数值算法提供见解,推动了科学计算中混合精度数值方法的发展。据我们所知,这是首次使用RL进行精度自调优的工作,并已在未见数据集上得到验证。
Summary / 总结
The paper proposes a reinforcement learning framework for adaptive precision tuning of linear solvers, formulated as a contextual bandit problem. It uses incremental action-value estimation to select optimal precision configurations, balancing precision and computational efficiency. Applied to iterative refinement for solving linear systems, the approach dynamically chooses precisions based on system features, reducing computational cost while maintaining accuracy. The framework generalizes to unseen data and offers insights for other numerical algorithms, marking the first work on precision autotuning with RL.
论文提出了一种基于强化学习的线性求解器自适应精度调优框架,将其形式化为上下文臂问题。该框架使用增量动作值估计和离散化状态空间来选择最优的精度配置,平衡精度和计算效率。应用于求解线性系统的迭代细化,该方法根据系统特征动态选择精度,并通过ε-贪婪策略优化以平衡准确性和计算成本。结果表明,该方法能够有效选择精度,减少计算成本同时保持与双精度基线相当的准确性,并且能够泛化到未见过的数据集。
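The Q-table, epsilon-greedy selection, and incremental action-value estimation named in this entry can be sketched as follows. The bin edges, precision labels, and reward table are hypothetical. The optimistic initial value is an extra assumption, added here so that every action is tried at least once.

```python
import random
from collections import defaultdict


def discretize(condition_number, edges=(1e2, 1e6)):
    """Map an approximate condition number to a bin index (assumed edges)."""
    return sum(condition_number >= edge for edge in edges)


class EpsilonGreedyBandit:
    """Contextual bandit with a Q-table and sample-average updates."""

    def __init__(self, actions, epsilon=0.1, optimistic_init=2.0, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(lambda: optimistic_init)  # (state, action) -> value
        self.n = defaultdict(int)                      # (state, action) -> visits

    def greedy(self, state):
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def select(self, state):
        # Explore with probability epsilon, otherwise exploit the Q-table.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy(state)

    def update(self, state, action, reward):
        key = (state, action)
        self.n[key] += 1
        # Incremental mean: Q <- Q + (r - Q) / n
        self.q[key] += (reward - self.q[key]) / self.n[key]


# Hypothetical rewards trading accuracy against cost for each state bin.
TOY_REWARD = {
    0: {"fp16": 1.0, "fp32": 0.8, "fp64": 0.5},    # well-conditioned systems
    2: {"fp16": -1.0, "fp32": -0.2, "fp64": 0.5},  # ill-conditioned systems
}

bandit = EpsilonGreedyBandit(["fp16", "fp32", "fp64"])
for step in range(400):
    state = discretize(10.0 if step % 2 == 0 else 1e8)
    action = bandit.select(state)
    bandit.update(state, action, TOY_REWARD[state][action])
```

Under these toy rewards the table learns to prefer low precision for well-conditioned systems and double precision for ill-conditioned ones.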
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Venue: NeurIPS Spotlight
First: 2025-12-28T12:25:43+00:00 · Latest: 2026-01-02T15:48:28+00:00
Comments: Accepted by NeurIPS as a Spotlight paper. Code: https://github.com/JavisVerse/JavisGPT
Abstract
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
中文标题/摘要
标题:JavisGPT:统一多模态大语言模型用于音视频理解和生成
本文介绍了JavisGPT,这是首个用于联合音视频(JAV)理解和生成的统一多模态大语言模型(MLLM)。JavisGPT具有简洁的编码器-大语言模型-解码器架构,包含一个同步融合模块(SyncFusion)用于时空音视频融合和同步感知可学习查询,以连接预训练的JAV-DiT生成器。此设计使多模态指令下的音视频理解和生成具有时间一致性。我们设计了一个有效的三阶段训练管道,包括多模态预训练、音视频微调和大规模指令调优,逐步从现有的视觉语言模型构建多模态理解和生成。在指令调优方面,我们构建了JavisInst-Omni,这是一个高质量的指令数据集,包含超过20万GPT-4o精选的音视频文本对话,涵盖了多样性和多层次的理解和生成场景。在音视频理解和生成基准测试中,我们的实验表明JavisGPT在复杂和时间同步的设置中优于现有MLLM。
Summary / 总结
JavisGPT is the first unified multimodal large language model for joint audio-video comprehension and generation. It uses an encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal fusion and synchrony-aware queries. The model is trained in three stages: multimodal pretraining, audio-video fine-tuning, and instruction tuning with a large dataset. Experiments show that JavisGPT outperforms existing models, especially in complex and temporally synchronized settings.
JavisGPT 是首个用于联合音频-视频理解和生成的统一多模态大语言模型。它采用编码器-LLM-解码器架构,并包含一个 SyncFusion 模块用于融合音频和视频数据。该模型通过三阶段训练管道进行训练,并在 JAV 基准测试中表现出色,特别是在复杂和时间同步的场景中优于现有模型。
Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
Authors: Hao Guan, Li Zhou
First: 2026-01-02T15:12:06+00:00 · Latest: 2026-01-02T15:12:06+00:00
Comments: 8 pages, 6 figures
Abstract
Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
中文标题/摘要
标题:病理视觉语言模型在数据偏移下性能退化的检测
视觉语言模型在医学图像分析和疾病诊断方面展现了强大的潜力。然而,在部署后,当输入数据分布从开发期间的变化时,它们的性能可能会下降。检测这种性能退化对于临床可靠性至关重要,但对大型预训练VLMs来说,它们在没有标注数据的情况下运行,这使得检测变得具有挑战性。在本研究中,我们探讨了在先进病理VLM中数据偏移下性能退化的检测。我们研究了输入级数据偏移和输出级预测行为,以了解它们在监控模型可靠性中的各自作用。为了便于系统分析输入数据偏移,我们开发了DomainSAT,一个轻量级的图形界面工具箱,集成了代表性偏移检测算法,使数据偏移的直观探索成为可能。我们的分析表明,虽然输入数据偏移检测在识别分布变化和提供早期诊断信号方面是有效的,但它并不总是与实际性能退化相对应。受此观察的启发,我们进一步研究了基于输出的监控,并引入了一个无标签、基于置信度的退化指标,直接捕捉模型预测置信度的变化。我们发现,该指标与性能退化之间存在密切关系,并且可以作为输入偏移检测的有效补充。在大规模病理数据集上的肿瘤分类实验表明,结合输入数据偏移检测和基于输出置信度的指标,可以更可靠地检测和解释VLMs在数据偏移下的性能退化。这些发现为监测数字病理学中基础模型的可靠性提供了一个实用且互补的框架。
Summary / 总结
This study investigates performance degradation in a state-of-the-art pathology vision-language model under data shift. It develops DomainSAT, a lightweight toolbox for analyzing input-level data shift and introduces a label-free, confidence-based degradation indicator for output-level monitoring. The research finds that combining input data shift detection with output confidence-based indicators enhances the reliability of detecting and interpreting performance degradation in VLMs under data shift, providing a practical framework for monitoring model reliability in digital pathology.
本研究探讨了病理视觉语言模型在数据偏移下的性能退化问题,开发了DomainSAT轻量级工具箱用于检测输入级数据偏移,并引入了无标签、基于置信度的退化指标用于输出级监控。研究发现,结合输入数据偏移检测与输出置信度基指标,可以更可靠地检测和解释VLMs在数据偏移下的性能退化,为数字病理学中监测模型可靠性提供了实用框架。
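A label-free, confidence-based indicator of the kind described above could take many forms. The sketch below uses one simple option: the drop in mean maximum softmax probability between a reference batch and a deployment batch. The abstract does not define the authors' indicator, so this definition and the function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_confidence(batch_logits):
    """Average of the highest class probability across a batch."""
    return sum(max(softmax(logits)) for logits in batch_logits) / len(batch_logits)


def confidence_drop(reference_logits, deployed_logits):
    """Positive values mean the model is less confident after deployment."""
    return mean_confidence(reference_logits) - mean_confidence(deployed_logits)


# Toy logits (made up): confident on reference data, hesitant on shifted data.
reference = [[4.0, 0.0], [0.0, 4.0]]
deployed = [[0.5, 0.0], [0.0, 0.5]]
drop = confidence_drop(reference, deployed)
```

No labels are needed, so the value can be tracked continuously and compared against an input-level shift detector.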
Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks
Authors: Cory Fan, Wenchao Zhang
Venue: WACV
First: 2026-01-02T14:40:58+00:00 · Latest: 2026-01-02T14:40:58+00:00
Comments: 9 pages, 5 figures. To be published at WVAQ Workshop at WACV
Abstract
In digital imaging, image demosaicing is a crucial first step which recovers the RGB information from a color filter array (CFA). Oftentimes, deep learning is utilized to perform image demosaicing. Given that most modern digital imaging applications occur on mobile platforms, applying deep learning to demosaicing requires lightweight and efficient networks. Isotropic networks, also known as residual-in-residual networks, have often been employed for image demosaicing and joint-demosaicing-and-denoising (JDD). Most demosaicing isotropic networks avoid spatial downsampling entirely, and thus are often prohibitively expensive computationally for mobile applications. Contrary to previous isotropic network designs, this paper claims that spatial downsampling to a significant degree can improve the efficiency and performance of isotropic networks. To validate this claim, we design simple fully convolutional networks with and without downsampling using a mathematical architecture design technique adapted from DeepMAD, and find that downsampling improves empirical performance. Additionally, empirical testing of the downsampled variant, JD3Net, of our fully convolutional networks reveals strong empirical performance on a variety of image demosaicing and JDD tasks.
中文标题/摘要
标题:基于空间降采样的各向同性网络高效深度去马赛克
在数字成像中,图像去马赛克是至关重要的第一步,它从色彩滤波阵列(CFA)中恢复RGB信息。通常使用深度学习来进行图像去马赛克。鉴于大多数现代数字成像应用发生在移动平台上,将深度学习应用于去马赛克需要轻量级和高效的网络。各向同性网络,也称为残差中的残差网络,常用于图像去马赛克和联合去马赛克与降噪(JDD)。大多数去马赛克各向同性网络完全避免了空间降采样,因此对于移动应用来说往往是计算上过于昂贵的。与此相反,本文声称,对各向同性网络进行显著的空间降采样可以提高其效率和性能。为了验证这一说法,我们设计了带有和不带降采样的简单全卷积网络,并使用从DeepMAD改编的数学架构设计技术,发现降采样可以提高实际性能。此外,对我们的全卷积网络去马赛克3网络(JD3Net)的降采样变体进行的实证测试显示,它在各种图像去马赛克和JDD任务上表现出强大的实际性能。
Summary / 总结
This paper addresses the challenge of efficient image demosaicing for mobile applications by proposing spatially downsampled isotropic networks. The authors design simple fully convolutional networks with and without spatial downsampling and find that downsampling enhances both efficiency and performance. Empirical testing on various demosaicing and joint-demosaicing-and-denoising tasks demonstrates the effectiveness of the downsampled variant, JD3Net.
该论文旨在通过使用空间下采样的等向性网络解决移动平台上的高效图像去马赛克问题。作者设计了具有和不具有下采样的简单全卷积网络,并发现下采样可以同时提高效率和性能。在各种去马赛克和联合去马赛克与降噪任务上的实验证明了所提出方法的有效性,特别是下采样的变体JD3Net。
BSAT: B-Spline Adaptive Tokenizer for Long-Term Time Series Forecasting
Authors: Maximilian Reinwardt, Michael Eichelbeck, Matthias Althoff
First: 2026-01-02T14:27:54+00:00 · Latest: 2026-01-02T14:27:54+00:00
Comments: 20 pages, 7 figures
Abstract
Long-term time series forecasting using transformers is hampered by the quadratic complexity of self-attention and the rigidity of uniform patching, which may be misaligned with the data's semantic structure. In this paper, we introduce the B-Spline Adaptive Tokenizer (BSAT), a novel, parameter-free method that adaptively segments a time series by fitting it with B-splines. BSAT algorithmically places tokens in high-curvature regions and represents each variable-length basis function as a fixed-size token, composed of its coefficient and position. Further, we propose a hybrid positional encoding that combines an additive learnable positional encoding with Rotary Positional Embedding featuring a layer-wise learnable base: L-RoPE. This allows each layer to attend to different temporal dependencies. Our experiments on several public benchmarks show that our model remains competitive, with strong performance at high compression rates. This makes it particularly well-suited for use cases with strong memory constraints.
中文标题/摘要
标题:BSAT: B-样条自适应分词器用于长期时间序列预测
使用变压器进行长期时间序列预测受到自注意力的二次复杂性和均匀分词的刚性限制,这可能与数据的语义结构不一致。本文提出了一种新颖的、无需参数的方法——B-样条自适应分词器(BSAT),该方法通过拟合B-样条自适应地分割时间序列。BSAT算法性地在高曲率区域放置分词,并将每个可变长度基函数表示为固定大小的分词,由其系数和位置组成。此外,我们提出了一种混合位置编码,结合了可学习的加性位置编码和具有逐层可学习基的旋转位置嵌入:L-RoPE。这使得每一层能够关注不同的时间依赖性。我们在几个公开基准上的实验表明,我们的模型在高压缩率下具有很强的竞争力。这使其特别适用于具有强大内存限制的应用场景。
Summary / 总结
The research addresses the challenges of long-term time series forecasting using transformers, specifically the quadratic complexity of self-attention and the rigidity of uniform patching. It introduces BSAT, a parameter-free method that adaptively segments time series using B-splines, placing tokens in high-curvature regions. The model also employs a hybrid positional encoding combining learnable and Rotary Positional Embedding. Experiments on public benchmarks demonstrate competitive performance with high compression rates, making it suitable for memory-constrained applications.
研究旨在通过引入B-Spline自适应分词器(BSAT),使用B样条自适应分割时间序列数据,解决使用变压器进行长期时间序列预测时遇到的挑战。该方法在高曲率区域放置分词,并结合使用可学习的位置编码和旋转位置嵌入,使每一层能够关注不同的时间依赖性。实验表明,该模型即使在高压缩率下也能表现出色,特别适用于内存受限的应用场景。
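To make the layer-wise base in L-RoPE concrete, the sketch below applies standard rotary position embedding with a configurable base. The per-layer base values are hypothetical constants standing in for learned parameters, and the additive learnable encoding from the abstract is left out.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    if dim % 2:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        # Standard RoPE frequency for this pair: base^(-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical per-layer bases; in L-RoPE these would be learned.
layer_bases = [100.0, 10000.0]
query = [1.0, 0.0, 1.0, 0.0]
key = [1.0, 0.0, 1.0, 0.0]
scores = [dot(rope(query, 3, base), rope(key, 1, base)) for base in layer_bases]
```

The attention score depends only on the position offset, and changing the base changes how quickly that score varies with the offset, which is how separate layers can attend to different time scales.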
ARISE: Adaptive Reinforcement Integrated with Swarm Exploration
Authors: Rajiv Chaitanya M, D R Ramesh Babu
First: 2026-01-02T14:09:22+00:00 · Latest: 2026-01-02T14:09:22+00:00
Comments: 12 pages. Accepted for presentation at WCSC 2026
Abstract
Effective exploration remains a key challenge in RL, especially with non-stationary rewards or high-dimensional policies. We introduce ARISE, a lightweight framework that enhances reinforcement learning by augmenting standard policy-gradient methods with a compact swarm-based exploration layer. ARISE blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in the action space, and modulates exploration adaptively using reward-variance cues. While easy benchmarks exhibit only slight improvements (e.g., +0.7% on CartPole-v1), ARISE yields substantial gains on more challenging tasks, including +46% on LunarLander-v3 and +22% on Hopper-v4, while preserving stability on Walker2d and Ant. Under non-stationary reward shifts, ARISE provides marked robustness advantages, outperforming PPO by +75 points on CartPole and improving LunarLander accordingly. Ablation studies confirm that both the swarm component and the adaptive mechanism contribute to the performance. Overall, ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures.
中文标题/摘要
标题:ARISE:自适应强化学习与群探索集成
有效的探索仍然是RL中的关键挑战,尤其是在非平稳奖励或高维策略的情况下。我们引入了ARISE,这是一种轻量级框架,通过将紧凑的基于群的探索层与标准策略梯度方法相结合来增强强化学习。ARISE将策略动作与粒子驱动的提案相结合,其中每个粒子代表在动作空间中采样的候选策略轨迹,并使用奖励方差提示自适应地调节探索。虽然在简单的基准测试中仅表现出轻微的改进(例如,在CartPole-v1上提高了0.7%),但在更具挑战性的任务中,ARISE却取得了显著的提升,包括在LunarLander-v3上提高了46%,在Hopper-v4上提高了22%,同时在Walker2d和Ant上保持了稳定性。在非平稳奖励变化下,ARISE提供了显著的鲁棒性优势,在CartPole上比PPO高出75分,在LunarLander上也相应地提高了表现。消融研究证实,群组件和自适应机制都对性能有所贡献。总体而言,ARISE提供了一种简单且架构无关的方法,以实现更具探索性和鲁棒性的RL代理,而不改变核心算法结构。
Summary / 总结
ARISE is a lightweight framework that improves reinforcement learning by integrating a compact swarm-based exploration layer with standard policy-gradient methods. It enhances exploration by blending policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory. ARISE shows significant improvements on challenging tasks such as LunarLander-v3 (+46%) and Hopper-v4 (+22%), while maintaining stability on Walker2d and Ant. It also demonstrates robustness under non-stationary reward shifts, outperforming PPO on CartPole by +75 points. Ablation studies confirm the importance of both the swarm component and the adaptive mechanism.
ARISE 是一个轻量级框架,通过将基于群的探索层与标准策略梯度方法集成来改进强化学习。它通过将策略动作与粒子驱动的提案结合,并基于奖励方差进行探索调节来增强探索。ARISE 在具有挑战性的任务上表现出显著的性能提升,例如在 LunarLander-v3 上提高了 46%,在 Hopper-v4 上提高了 22%,同时在 Walker2d 和 Ant 上保持了稳定性。它还展示了在非平稳奖励变化下的鲁棒性优势,在 CartPole 上比 PPO 高出 75 分。消融研究证实了群组组件和自适应机制的重要性。
Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model
Authors: Kaiwen Tang, Zhanglu Yan, Weng-Fai Wong
Venue: ICML 2025
First: 2024-09-04T10:20:50+00:00 · Latest: 2026-01-02T13:42:47+00:00
Comments: Accepted by ICML 2025. Camera-ready version
Abstract
For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a Bit Shifting PowerNorm (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference. Our code is publicly available at https://github.com/Kaiwen-Tang/Sorbet.
中文标题/摘要
标题:Sorbet:一种与神经形态硬件兼容的基于变压器的脉冲语言模型
由于隐私等因素,存在在边缘部署语言模型的应用场景。这催生了针对资源受限设备的微型语言模型,其中能源效率至关重要。脉冲神经网络(SNN)因其能源效率而具有前景,并且已经有人在SNN上实现基于变压器的模型。然而,关键操作如Softmax和层归一化(LN)在神经形态硬件上难以实现,早期许多工作都绕过了这些操作。为了解决这些挑战,我们提出了Sorbet,一种更与神经形态硬件兼容的基于变压器的脉冲语言模型。Sorbet引入了一种新颖的基于位移的Softmax,称为PTsoftmax,以及一种位移幂归一化(BSPN),两者都旨在替换相应的高能耗操作。通过利用知识蒸馏和模型量化,Sorbet实现了高度压缩的二进制权重模型,保持了竞争力的同时实现了27.16倍的能源节省,与BERT相比。我们通过在GLUE基准测试和一系列消融研究中进行广泛测试,验证了Sorbet作为语言模型推理的能源高效解决方案的潜力。我们的代码已公开发布在https://github.com/Kaiwen-Tang/Sorbet
Summary / 总结
Sorbet is a transformer-based spiking language model designed for neuromorphic hardware, addressing the challenges of energy-intensive operations like softmax and layer normalization. By incorporating PTsoftmax and BSPN, Sorbet achieves a highly compressed binary weight model with $27.16\times$ energy savings compared to BERT while maintaining competitive performance. Sorbet was validated through extensive testing on the GLUE benchmark and ablation studies, showing its potential as an energy-efficient solution for language model inference.
Sorbet 是一种针对神经形态硬件设计的变压器基语言模型,旨在解决如 softmax 和层规范化等高能耗操作的挑战。它引入了 PTsoftmax 和 BSPN 来替换这些操作,并通过知识蒸馏和模型量化实现了高度压缩的二进制权重模型。Sorbet 在 GLUE 基准测试和消融研究中的广泛测试中展示了与 BERT 相比高达 $27.16\times$ 的能效提升,同时保持了竞争力的性能。
Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery
Authors: Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg
First: 2025-07-11T19:18:42+00:00 · Latest: 2026-01-02T13:21:46+00:00
Abstract
Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. While hybrid approaches like Physics-Informed Neural Networks (PINNs) embed domain knowledge as functional constraints, they can be brittle under model misspecification. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that instead embeds domain knowledge into the training data to establish a structural prior. By pretraining on synthetic corpora spanning diverse model structures and observational artifacts, SGNNs learn the broad patterns of physical possibility. This allows the model to internalize the underlying dynamics of a system without being forced to satisfy a single, potentially incorrect equation. We evaluated SGNNs across scientific disciplines and found that this approach confers significant robustness. In prediction tasks, SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines. In tests on dengue outbreaks, SGNNs outperformed physics-constrained models even when both were restricted to incorrect human-to-human transmission equations, demonstrating that SGNNs are potentially more robust to model misspecification. For inference, SGNNs extend the logic of simulation-based inference to enable supervised learning for unobservable targets, estimating early COVID-19 transmissibility more accurately than traditional methods. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability that maps real-world data back to the simulated manifold to identify underlying processes. By unifying these disparate simulation-based techniques into a single framework, we demonstrate that mechanistic simulations can serve as effective training data for robust scientific inference that generalizes beyond the limitations of fixed functional forms.
中文标题/摘要
标题:模拟作为监督:基于机制的预训练以促进科学发现
科学建模在可解释的机制理论与机器学习的预测能力之间存在权衡。虽然像物理感知神经网络(PINNs)这样的混合方法将领域知识嵌入为函数约束,但在模型误设时可能会变得脆弱。我们提出了基于模拟的神经网络(SGNNs)框架,该框架将领域知识嵌入训练数据中,以建立结构先验。通过在涵盖多种模型结构和观测特征的合成语料库上进行预训练,SGNNs 学习了物理可能性的广泛模式。这使模型能够内化系统的潜在动力学,而无需被迫满足单一的、可能不正确的方程。我们在多个科学领域进行了评估,发现这种方法提供了显著的鲁棒性。在预测任务中,SGNNs 的 COVID-19 预测能力几乎提高了 CDC 基线的三倍。在登革热爆发测试中,即使两种模型都受限于错误的人际传播方程,SGNNs 也优于物理约束模型,证明了 SGNNs 对模型误设具有更强的鲁棒性。对于推断,SGNNs 将基于模拟的推断逻辑扩展到监督学习,以估计早期 COVID-19 传播性,比传统方法更准确。最后,SGNNs 使回溯到模拟的归因成为可能,这是一种机制可解释性形式,将现实世界数据映射回模拟流形以识别潜在过程。通过将这些不同的基于模拟的技术统一到一个框架中,我们证明了机制模拟可以作为有效训练数据,以实现超越固定函数形式限制的稳健科学推断。
Summary / 总结
The paper introduces Simulation-Grounded Neural Networks (SGNNs), which pretrain on synthetic corpora to learn broad physical patterns, enhancing robustness in scientific modeling. In prediction tasks, SGNNs nearly triple COVID-19 forecasting skill relative to CDC baselines, and on dengue outbreaks they outperform physics-constrained models even when both are restricted to incorrect transmission equations. SGNNs also improve inference accuracy and enable back-to-simulation attribution, providing mechanistic interpretability by mapping real-world data to the simulated manifold.
论文提出了基于模拟的神经网络(SGNNs),以解决科学建模中解释性和预测能力之间的权衡问题。SGNNs通过预训练合成数据集来学习广泛的物理可能性模式,使模型能够内化系统动力学而不受单一错误方程的限制。实验表明,SGNNs在COVID-19预测中显著提高性能,并在登革热暴发预测中优于物理约束模型,即使两者都使用了错误的传播方程。SGNNs还提高了推断准确性,并实现了回溯到模拟流形的归因,提供了一个统一框架以实现稳健的科学推断。
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Authors: Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson
First: 2026-01-02T13:04:47+00:00 · Latest: 2026-01-02T13:04:47+00:00
Abstract
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.
中文标题/摘要
标题:像素到4D:基于相机控制的图像到视频生成与动态3D高斯分布
人类能够仅凭一张图片预测场景的未来动态。能够模仿这种能力的视频生成模型是智能系统的重要组成部分。近期的方法在单图条件下的视频生成中提高了时间连贯性和3D一致性。然而,这些方法往往缺乏稳健的用户可控性,例如修改相机路径,限制了其在实际应用中的适用性。大多数现有的相机控制图像到视频模型在准确建模相机运动、保持时间连贯性和保持几何完整性方面存在困难。利用显式的中间3D表示提供了一种有前景的解决方案,通过这种方式可以生成与给定的相机轨迹一致的连贯视频。这些方法通常使用3D点云来渲染场景并在后期引入物体运动,这种两步过程虽然允许对相机运动进行精确控制,但仍然无法实现完全的时间连贯性。我们提出了一种新的框架,该框架在给定单张图片的情况下,在一次前向传递中构建3D高斯场景表示并采样可能的物体运动。这使得无需通过迭代去噪将物体运动注入渲染帧,即可实现快速、相机引导的视频生成。在KITTI、Waymo、RealEstate10K和DL3DV-10K数据集上的大量实验表明,我们的方法在视频质量和推理效率方面达到了最先进的水平。项目页面可在https://melonienimasha.github.io/Pixel-to-4D-Website/获取。
Summary / 总结
The research aims to develop a method for generating coherent 4D videos from a single image with user-controlled camera paths. The proposed Pixel-to-4D framework uses a single forward pass to construct a 3D Gaussian scene representation and sample plausible object motion, enabling fast and controllable video generation. Experiments show that the method outperforms existing approaches in terms of video quality and inference efficiency on various datasets including KITTI, Waymo, RealEstate10K, and DL3DV-10K.
研究旨在开发一种从单张图像生成具有摄像机控制的连贯4D视频的方法,解决现有模型在用户可控性和时间一致性方面的局限性。提出的Pixel-to-4D框架在单次前向传递中构建3D高斯场景表示并采样可能的物体运动,从而实现无需迭代去噪即可快速、摄像机引导的视频生成。在KITTI、Waymo、RealEstate10K和DL3DV-10K数据集上的实验表明,该方法实现了最先进的视频质量和推理效率。
PoseStreamer: A Multi-modal Framework for 3D Tracking of Unseen Moving Objects
Authors: Huiming Yang, Linglin Liao, Fei Ding, Sibo Wang, Zijian Zeng
First: 2025-12-28T15:52:58+00:00 · Latest: 2026-01-02T12:58:07+00:00
Abstract
Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically on high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
Summary / 总结
PoseStreamer is a multi-modal framework for 6DoF pose estimation of unseen moving objects, addressing the limitations of standard RGB cameras in high-speed and low-light scenarios. It integrates an Adaptive Pose Memory Queue, an Object-centric 2D Tracker, and a Ray Pose Filter. Experiments show that PoseStreamer outperforms existing methods in high-speed moving scenarios and demonstrates strong generalizability for unseen objects.
PoseStreamer 是一个用于未见移动物体的 6DoF 姿态估计的多模态框架,旨在解决高速和低光照场景下的挑战。它结合了自适应姿态记忆队列、对象中心的 2D 跟踪器和沿摄像机射线的姿态滤波器。实验表明,PoseStreamer 在高速移动场景中表现出色,并且作为无模板框架具有良好的通用性。
IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
Authors: Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Fuzhen Li, Liu Kang, Feng Jiang, Zhiyong Zheng, Fan Yang
First: 2026-01-02T12:57:06+00:00 · Latest: 2026-01-02T12:57:06+00:00
Comments: 14 pages, 4 figures
Abstract
Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
中文标题/摘要
标题:IRPO:通过强化学习扩展布雷德利-泰利模型
生成奖励模型(GRMs)由于其可解释性、推理时的可扩展性和通过强化学习(RL)进行细化的潜力,在奖励建模方面引起了广泛的研究兴趣。然而,广泛使用的成对GRMs在与组相对策略优化(Group Relative Policy Optimization,GRPO)等RL算法集成时,会形成计算瓶颈。这种瓶颈来源于两个因素:(i)为获得相对评分所需的成对比较具有O(n^2)的时间复杂度,以及(ii)为提高性能而进行重复采样或额外思维链(CoT)推理所带来的计算开销。为了解决第一个因素,我们提出了组间相对偏好优化(IRPO),这是一种新颖的RL框架,将成熟的布雷德利-泰利模型纳入GRPO中。通过为每个响应生成逐点评分,IRPO在RL训练期间可以高效地评估任意数量的候选者,同时保持可解释性和细粒度的奖励信号。实验结果表明,IRPO在多个基准测试中取得了逐点GRMs中的最佳(SOTA)性能,其性能与当前领先的成对GRMs相当。此外,我们还展示了IRPO在后训练评估中显著优于成对GRMs。
Summary / 总结
The paper introduces IRPO, a reinforcement learning framework that incorporates the Bradley-Terry model into GRPO to remove the O(n^2) pairwise-comparison bottleneck of pairwise generative reward models. IRPO generates a pointwise score for each response, allowing efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experiments show that IRPO achieves state-of-the-art performance among pointwise GRMs, is comparable to leading pairwise GRMs on multiple benchmarks, and significantly outperforms pairwise GRMs in post-training evaluations.
论文提出了IRPO,这是一种将布雷德利-泰利模型纳入GRPO的强化学习框架,旨在解决成对生成奖励模型与RL算法结合时因O(n^2)成对比较而产生的计算瓶颈。IRPO为每个响应生成逐点评分,允许在训练期间高效评估大量候选者,同时保持可解释性和细粒度的奖励信号。实验表明,IRPO在多个基准测试中取得了逐点GRMs中的最先进性能,与当前领先的成对GRMs性能相当,并且在后训练评估中显著优于成对GRMs。
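The pointwise scoring idea in the IRPO entry above can be illustrated with the Bradley-Terry model itself. The following is a minimal sketch, not the authors' implementation: the scores are made-up values and the function names are hypothetical. It shows that once each response has one scalar score, the preference probability for any pair follows from the score difference, so n responses need n scoring calls rather than n(n-1)/2 pairwise judgments.

```python
import math
from itertools import combinations


def bradley_terry_prob(score_i: float, score_j: float) -> float:
    """Bradley-Terry probability that response i is preferred over response j.

    Scores are treated as log-strengths, so the probability is the
    logistic function of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def num_pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs a pairwise judge would have to compare."""
    return n * (n - 1) // 2


# Hypothetical pointwise scores, one per candidate response.
scores = [1.2, 0.3, -0.5, 2.0]

# Every pairwise preference is derived from the 4 scores; no extra judge calls.
prefs = {
    (i, j): bradley_terry_prob(scores[i], scores[j])
    for i, j in combinations(range(len(scores)), 2)
}
```

With 4 candidates the pointwise route needs 4 scores while a pairwise judge would need 6 comparisons; the gap grows quadratically with the group size.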
Frequent subgraph-based persistent homology for graph classification
Authors: Xinyang Chen, Amaël Broustet, Guanyuan Zeng, Cheng He, Guoting Chen
First: 2025-12-31T15:21:15+00:00 · Latest: 2026-01-02T12:53:32+00:00
Comments: v2: Author list updated to include previously omitted co-authors
Abstract
Persistent homology (PH) has recently emerged as a powerful tool for extracting topological features. Integrating PH into machine learning and deep learning models enhances topology awareness and interpretability. However, most PH methods on graphs rely on a limited set of filtrations, such as degree-based or weight-based filtrations, which overlook richer features like recurring information across the dataset and thus restrict expressive power. In this work, we propose a novel graph filtration called Frequent Subgraph Filtration (FSF), which is derived from frequent subgraphs and produces stable and information-rich frequency-based persistent homology (FPH) features. We study the theoretical properties of FSF and provide both proofs and experimental validation. Beyond persistent homology itself, we introduce two approaches for graph classification: an FPH-based machine learning model (FPH-ML) and a hybrid framework that integrates FPH with graph neural networks (FPH-GNNs) to enhance topology-aware graph representation learning. Our frameworks bridge frequent subgraph mining and topological data analysis, offering a new perspective on topology-aware feature extraction. Experimental results show that FPH-ML achieves competitive or superior accuracy compared with kernel-based and degree-based filtration methods. When integrated into graph neural networks, FPH yields relative performance gains ranging from 0.4 to 21 percent, with improvements of up to 8.2 percentage points over GCN and GIN backbones across benchmarks.
中文标题/摘要
标题:基于频繁子图的持久同调法用于图分类
持久同调(PH)最近已成为提取拓扑特征的强大工具。将PH集成到机器学习和深度学习模型中可增强拓扑感知能力和可解释性。然而,大多数图上的PH方法依赖于有限的滤过(filtration)集合,如基于度或基于权重的滤过,这忽略了数据集中反复出现的信息等更丰富的特征,从而限制了表达能力。在本文中,我们提出了一种新的图滤过,称为频繁子图滤过(FSF),它源自频繁子图,并产生稳定且信息丰富的基于频率的持久同调(FPH)特征。我们研究了FSF的理论性质并提供了证明和实验验证。除了持久同调本身,我们还介绍了两种图分类方法:基于FPH的机器学习模型(FPH-ML),以及将FPH与图神经网络结合的混合框架(FPH-GNNs),以增强拓扑感知的图表示学习。我们的框架将频繁子图挖掘与拓扑数据分析相结合,提供了拓扑感知特征提取的新视角。实验结果表明,与基于核和基于度滤过的方法相比,FPH-ML取得了相当或更优的准确率。当集成到图神经网络中时,FPH在各基准测试中带来0.4%到21%的相对性能提升,在GCN和GIN骨干网络上最高提升8.2个百分点。
Summary / 总结
This paper introduces Frequent Subgraph Filtration (FSF) for graph classification, which enhances persistent homology (PH) by focusing on recurring subgraph patterns. The method produces stable and informative frequency-based PH features, leading to improved topology-aware graph representations. Experiments show that FPH-ML, a machine learning model based on these features, achieves competitive or superior accuracy compared to kernel-based and degree-based methods. Integrating FPH with graph neural networks (GNNs) further improves performance, with relative gains ranging from 0.4% to 21%, and up to 8.2 percentage points over GCN and GIN backbones.
本文提出了用于图分类的频繁子图滤过(FSF),生成稳定且信息丰富的基于频率的持久同调(FPH)特征,从而改进拓扑感知的图表示。实验结果表明,FPH-ML在准确率上与基于核和基于度滤过的方法相当或更优。将FPH与图神经网络结合(FPH-GNNs)可进一步提高性能,相对增益为0.4%至21%,在GCN和GIN骨干网络上最高提升8.2个百分点。
Fast-weight Product Key Memory
Authors: Tianyu Zhao, Llion Jones
First: 2026-01-02T12:37:53+00:00 · Latest: 2026-01-02T12:37:53+00:00
Abstract
Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While Softmax attention offers unbounded storage at prohibitive quadratic costs, linear variants provide efficiency but suffer from limited, fixed-size storage. We propose Fast-weight Product Key Memory (FwPKM), a novel architecture that resolves this tension by transforming the sparse Product Key Memory (PKM) from a static module into a dynamic, "fast-weight" episodic memory. Unlike PKM, FwPKM updates its parameters dynamically at both training and inference time via local chunk-level gradient descent, allowing the model to rapidly memorize and retrieve new key-value pairs from input sequences. Experiments reveal that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle in a Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
中文标题/摘要
标题:快速权重乘积键记忆
现代语言模型中的序列建模层通常在存储容量和计算效率之间面临权衡。Softmax注意力提供无界的存储,但需付出高昂的二次方计算成本;线性变体虽然高效,但存储容量有限且大小固定。我们提出了一种名为快速权重乘积键记忆(FwPKM)的新架构,通过将稀疏的乘积键记忆(PKM)从静态模块转变为动态的“快速权重”情景记忆来解决这一矛盾。与PKM不同,FwPKM在训练和推理时均通过局部块级梯度下降动态更新其参数,使模型能够快速记忆和检索输入序列中的新键值对。实验表明,FwPKM作为有效的情景记忆,能够补充标准模块的语义记忆,显著降低长上下文数据集上的困惑度。值得注意的是,在“大海捞针”(Needle in a Haystack)评估中,FwPKM仅在4K词元序列上训练,却能泛化到128K词元的上下文。
Summary / 总结
The paper addresses the trade-off between storage capacity and computational efficiency in the sequence modeling layers of modern language models. It introduces Fast-weight Product Key Memory (FwPKM), which turns the sparse Product Key Memory (PKM) into a dynamic episodic memory whose parameters are updated at both training and inference time via local chunk-level gradient descent, allowing rapid memorization and retrieval of new key-value pairs. Experiments show that FwPKM significantly reduces perplexity on long-context datasets and, in Needle in a Haystack evaluations, generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
论文解决了现代语言模型中序列建模层在存储容量和计算效率之间的权衡问题。它提出了快速权重乘积键记忆(FwPKM),将稀疏的乘积键记忆(PKM)转变为动态的情景记忆,在训练和推理时通过局部块级梯度下降更新其参数,从而快速记忆和检索新的键值对。实验表明,FwPKM显著降低了长上下文数据集上的困惑度,并且在“大海捞针”评估中,仅在4K词元序列上训练即可泛化到128K词元的上下文。
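For readers unfamiliar with the Product Key Memory that FwPKM builds on, the sparse lookup can be sketched as below. This is a simplified illustration of the general product-key idea under stated assumptions, not the FwPKM architecture: the fast-weight gradient update is not shown, and the function names and toy vectors are hypothetical. The query is split into two halves, each half is scored against its own small set of sub-keys, and the best full keys are found among the combinations of the per-half winners instead of scanning every key in the Cartesian product.

```python
from itertools import product


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def product_key_lookup(query, sub_keys_1, sub_keys_2, k):
    """Return up to k (score, slot) pairs for the best-matching memory slots.

    A full key is a pair (i, j) of sub-keys and its score is the sum of the
    two half scores, so the top-k full keys always lie among the
    combinations of the top-k sub-keys from each half.
    """
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    s1 = [dot(q1, c) for c in sub_keys_1]
    s2 = [dot(q2, c) for c in sub_keys_2]
    best1, best2 = top_k(s1, k), top_k(s2, k)
    n2 = len(sub_keys_2)
    # Slot index i * n2 + j addresses the value table of the memory.
    candidates = [(s1[i] + s2[j], i * n2 + j) for i, j in product(best1, best2)]
    candidates.sort(reverse=True)
    return candidates[:k]


# Toy example: 2 x 2 sub-keys address 4 memory slots.
sub_keys_1 = [[1, 0], [0, 1]]
sub_keys_2 = [[1, 0], [0, 1]]
hits = product_key_lookup([1, 0, 0, 1], sub_keys_1, sub_keys_2, k=2)
```

The point of the design is that two codebooks of n sub-keys each address n squared slots while the lookup only scores 2n sub-keys plus k squared candidates.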
Three factor delay learning rules for spiking neural networks
Authors: Luke Vassallo, Nima Taherinejad
First: 2026-01-02T12:28:53+00:00 · Latest: 2026-01-02T12:28:53+00:00
Comments: 7 pages, 5 figures
Abstract
Spiking Neural Networks (SNNs) are dynamical systems that operate on spatiotemporal data, yet their learnable parameters are often limited to synaptic weights, contributing little to temporal pattern recognition. Learnable parameters that delay spike times can improve classification performance in temporal tasks, but existing methods rely on large networks and offline learning, making them unsuitable for real-time operation in resource-constrained environments. In this paper, we introduce synaptic and axonal delays to leaky integrate and fire (LIF)-based feedforward and recurrent SNNs, and propose three-factor learning rules to simultaneously learn delay parameters online. We employ a smooth Gaussian surrogate to approximate spike derivatives exclusively for the eligibility trace calculation, and together with a top-down error signal determine parameter updates. Our experiments show that incorporating delays improves accuracy by up to 20% over a weights-only baseline, and for networks with similar parameter counts, jointly learning weights and delays yields up to 14% higher accuracy. On the SHD speech recognition dataset, our method achieves similar accuracy to offline backpropagation-based approaches. Compared to state-of-the-art methods, it reduces model size by 6.6x and inference latency by 67%, with only a 2.4% drop in classification accuracy. Our findings benefit the design of power and area-constrained neuromorphic processors by enabling on-device learning and lowering memory requirements.
中文标题/摘要
标题:脉冲神经网络的三因素延迟学习规则
脉冲神经网络(SNNs)是处理时空数据的动力系统,但其可学习参数通常仅限于突触权重,对时间模式识别贡献甚微。能够延迟脉冲时间的可学习参数可以提高时间任务中的分类性能,但现有方法依赖于大型网络和离线学习,使其不适合资源受限环境中的实时操作。本文向基于泄漏积分发放(LIF)的前馈和循环SNN引入了突触和轴突延迟,并提出了三因素学习规则以在线同时学习延迟参数。我们使用平滑的高斯替代函数来近似脉冲导数,且仅用于资格迹计算,并结合自上而下的误差信号确定参数更新。实验表明,与仅学习权重的基线相比,引入延迟可将准确率最多提高20%;对于参数量相近的网络,同时学习权重和延迟可使准确率最多提高14%。在SHD语音识别数据集上,我们的方法与基于反向传播的离线方法具有相近的准确率。与最先进的方法相比,它将模型大小减少了6.6倍,推理延迟降低了67%,分类准确率仅下降2.4%。我们的发现有助于功耗和面积受限的神经形态处理器的设计,使其能够在设备上进行学习并降低内存需求。
Summary / 总结
This paper introduces three-factor delay learning rules for spiking neural networks (SNNs) to improve temporal pattern recognition. It adds synaptic and axonal delays to LIF-based feedforward and recurrent SNNs and uses a smooth Gaussian surrogate, applied only in the eligibility trace calculation, so that delay parameters can be learned online. Experiments show that delays improve accuracy by up to 20% over a weights-only baseline, and that jointly learning weights and delays yields up to 14% higher accuracy for networks with similar parameter counts. On the SHD speech recognition dataset the approach reaches accuracy similar to offline backpropagation-based methods; compared to state-of-the-art methods it reduces model size by 6.6x and inference latency by 67%, with a 2.4% drop in classification accuracy.
本文通过引入可学习的突触和轴突延迟,解决了传统脉冲神经网络(SNN)在处理时间模式方面的局限性,并提出了适用于前馈和循环SNN的三因素延迟学习规则,实现了延迟参数的在线学习。实验表明,与仅学习权重的基线相比,引入延迟可将分类准确率最多提高20%;对于参数量相近的网络,同时学习权重和延迟可使准确率最多提高14%。该方法在SHD语音识别数据集上达到了与基于反向传播的离线方法相近的准确率;与最先进的方法相比,模型大小减少了6.6倍,推理延迟降低了67%,分类准确率仅下降2.4%。
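The general shape of a three-factor rule with a Gaussian surrogate, as mentioned in the entry above, can be sketched as follows. This is a generic illustration only: the paper's actual delay-specific update equations are not given in the abstract, and every function name, constant and sign convention here is an assumption. A local eligibility trace combines presynaptic activity with a smooth stand-in for the spike derivative, and a top-down error signal gates that trace to produce the parameter update.

```python
import math


def gaussian_surrogate(v, threshold=1.0, width=0.5):
    """Smooth stand-in for the spike derivative, peaked at the firing threshold.

    threshold and width are hypothetical values chosen for illustration.
    """
    return math.exp(-((v - threshold) ** 2) / (2.0 * width ** 2))


def update_eligibility(trace, pre_activity, v_post, decay=0.9):
    """Factors 1 and 2: presynaptic activity times the postsynaptic surrogate,
    accumulated in an exponentially decaying local trace."""
    return decay * trace + pre_activity * gaussian_surrogate(v_post)


def three_factor_update(param, trace, error_signal, lr=0.01):
    """Factor 3: a top-down error signal gates the local trace.

    Assumes error_signal is a loss gradient, so the step is a descent step.
    """
    return param - lr * error_signal * trace


# One online step for a single hypothetical parameter.
trace = update_eligibility(trace=0.0, pre_activity=1.0, v_post=0.8)
new_param = three_factor_update(param=0.5, trace=trace, error_signal=0.3)
```

Because the trace is updated from locally available quantities at every step, no stored history of past activity is needed, which is what makes rules of this kind attractive for on-device learning.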
EXAONE Deep: Reasoning Enhanced Language Models
Authors: Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Yongmin Park, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
First: 2025-03-16T14:39:33+00:00 · Latest: 2026-01-02T12:15:21+00:00
Abstract
We present EXAONE Deep series, which exhibits superior capabilities in various reasoning tasks, including math and coding benchmarks. We train our models mainly on the reasoning-specialized dataset that incorporates long streams of thought processes. Evaluation results show that our smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, demonstrates competitive performance against leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE.
中文标题/摘要
标题:EXAONE Deep:增强推理能力的语言模型
我们介绍了EXAONE Deep系列,该系列在包括数学和编程基准测试在内的各种推理任务中表现出色。我们主要使用包含长思维过程的推理专用数据集来训练模型。评估结果显示,我们的较小模型EXAONE Deep 2.4B和7.8B优于同等规模的其他模型,而最大的模型EXAONE Deep 32B与领先的开放权重模型相比具有竞争力。所有EXAONE Deep模型均公开提供用于研究目的,并可从https://huggingface.co/LGAI-EXAONE下载。
Summary / 总结
The EXAONE Deep series targets reasoning tasks such as math and coding. The models are trained mainly on a reasoning-specialized dataset that incorporates long streams of thought processes. The smaller models, EXAONE Deep 2.4B and 7.8B, outperform other models of comparable size, while the largest model, EXAONE Deep 32B, is competitive with leading open-weight models. All EXAONE Deep models are openly available for research purposes and can be downloaded from Hugging Face.
EXAONE Deep系列模型面向数学和编程等推理任务。模型主要在包含长思维过程的推理专用数据集上训练。较小的模型EXAONE Deep 2.4B和7.8B优于同等规模的其他模型,而最大的模型EXAONE Deep 32B与领先的开放权重模型相比具有竞争力。所有EXAONE Deep模型均公开提供用于研究目的,并可从Hugging Face下载。
Flattening Hierarchies with Policy Bootstrapping
Authors: John L. Zhou, Jonathan C. Kao
Venue: NeurIPS 2025 Spotlight
First: 2025-05-20T23:31:30+00:00 · Latest: 2026-01-02T12:08:01+00:00
Comments: NeurIPS 2025 (Spotlight, top 3.2%)
Abstract
Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/
中文标题/摘要
标题:使用策略自举扁平化层级结构
离线目标条件强化学习(GCRL)是一种在大型无奖励轨迹数据集上预训练通用策略的有前途的方法,类似于用于训练计算机视觉和自然语言处理基础模型的自监督目标。然而,由于稀疏奖励和折扣的组合,将GCRL扩展到更长的时间范围仍然具有挑战性,这使得原始动作与远期目标之间的比较优势变得模糊。层级强化学习方法在长期目标任务上取得了显著的实验结果,但它们依赖于模块化、时间尺度特定的策略和子目标生成,这增加了额外的复杂性,阻碍了向高维目标空间的扩展。在本文中,我们提出了一种算法,通过优势加权重要性采样从子目标条件策略中训练一个扁平(非层级)的目标条件策略。我们的方法消除了对(子)目标空间生成模型的需要,我们发现这对于在大型状态空间中扩展到高维控制至关重要。我们进一步表明,现有的层级和自举方法对应于我们推导中的特定设计选择。在一系列全面的基于状态和像素的移动和操作基准测试中,我们的方法匹配或超越了最先进的离线GCRL算法,并扩展到先前方法失败的复杂、长期任务。项目页面:https://johnlyzhou.github.io/saw/
Summary / 总结
This paper addresses the challenge of scaling offline goal-conditioned reinforcement learning (GCRL) to long-horizon tasks by training a flat (non-hierarchical) goal-conditioned policy that bootstraps on subgoal-conditioned policies with advantage-weighted importance sampling. The approach removes the need for a generative model over the (sub)goal space, which the authors find is key for scaling to high-dimensional control in large state spaces. Across state- and pixel-based locomotion and manipulation benchmarks, the method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail.
该研究针对离线目标条件强化学习(GCRL)难以扩展到长时域任务的挑战,提出通过在子目标条件策略上进行自举来训练扁平(非层级)的目标条件策略。该方法使用优势加权重要性采样,并且无需在(子)目标空间上建立生成模型,这对于在大型状态空间中扩展到高维控制至关重要。实验结果表明,所提出的方法在各种基于状态和像素的运动与操作基准测试中与最先进的离线GCRL算法相当或更优,并能扩展到先前方法失败的复杂长时域任务。
EXAONE 3.0 7.8B Instruction Tuned Language Model
Authors: Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Moontae Lee, Seungjun Lee, Woohyung Lim, Sangha Park, Sooyoun Park, Yongmin Park, Boseong Seo, Sihoon Yang, Heuiyeen Yeen, Kyungjae Yoo, Hyeongu Yun
First: 2024-08-07T04:38:38+00:00 · Latest: 2026-01-02T12:07:31+00:00
Abstract
We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct.
中文标题/摘要
标题:EXAONE 3.0 7.8B指令调优语言模型
我们介绍了EXAONE 3.0指令调优语言模型,这是由LG AI Research开发的大规模语言模型(LLMs)家族中的首个开源模型。在不同模型规模中,我们公开发布了7.8B指令调优模型,以促进开放研究和创新。通过广泛评估各种公共和内部基准,EXAONE 3.0在指令遵循能力方面展示了与同类最佳开源模型相当的竞争力。我们的比较分析表明,EXAONE 3.0在韩语方面尤为出色,同时在通用任务和复杂推理方面也表现出色。凭借其强大的实际效果和双语能力,我们希望EXAONE继续为专家人工智能的进步做出贡献。我们的EXAONE 3.0指令调优模型可在https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct获取。
Summary / 总结
The research introduces EXAONE 3.0, a 7.8B instruction-tuned language model developed by LG AI Research, which excels in Korean and performs well in general tasks and complex reasoning. The model was evaluated across various public and in-house benchmarks, showing competitive real-world performance compared to other state-of-the-art open models of similar size. The EXAONE 3.0 model is publicly available for further research and innovation.
研究介绍了由LG AI Research开发的EXAONE 3.0,这是一个7.8B参数的指令调优语言模型。在各种公共和内部基准上的广泛评估显示,EXAONE 3.0在韩语方面尤为突出,并在通用任务和复杂推理方面表现良好,与同等规模的其他最先进开放模型相比具有很强的竞争力。
Episodic Contextual Bandits with Knapsacks under Conversion Models
Authors: Wang Chi Cheung, Zitian Li
First: 2025-07-09T14:00:05+00:00 · Latest: 2026-01-02T11:55:23+00:00
Abstract
We study an online setting, where a decision maker (DM) interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes. These episodes start with different resource amounts, and the contexts' probability distributions are non-stationary in an episode. All episodes share the same latent conversion model, which governs the random outcome contingent upon a request's context and an allocation decision. Our model captures applications such as dynamic pricing on perishable resources with episodic replenishment, and first price auctions in repeated episodes with different starting budgets. We design an online algorithm that achieves a regret sub-linear in $T$, the number of episodes, assuming access to a \emph{confidence bound oracle} that achieves an $o(T)$-regret. Such an oracle is readily available from existing contextual bandit literature. We overcome the technical challenge with arbitrarily many possible contexts, which leads to a reinforcement learning problem with an unbounded state space. Our framework provides improved regret bounds in certain settings when the DM is provided with unlabeled feature data, which is novel to the contextual BwK literature.
中文标题/摘要
标题:转化模型下带背包约束的回合式上下文赌博机
我们研究了一个在线设置,其中决策者(DM)在重复的回合中与上下文带背包赌博机(BwK)实例进行交互。这些回合以不同的资源量开始,并且回合内的上下文概率分布是非平稳的。所有回合共享相同的潜在转化模型,该模型决定了在给定请求的上下文和分配决策时的随机结果。我们的模型涵盖了诸如带有回合式补货的易腐资源动态定价,以及具有不同起始预算的重复回合中的首价拍卖等应用。我们设计了一个在线算法,在假设可以访问一个实现 $o(T)$ 遗憾的置信界预言机(confidence bound oracle)的情况下,该算法的遗憾关于回合数 $T$ 是次线性的。这样的预言机可以直接从现有的上下文赌博机文献中获得。我们克服了可能的上下文数量任意多所带来的技术挑战,这一情形会导致一个状态空间无界的强化学习问题。当DM获得未标记的特征数据时,我们的框架在某些设置下给出了改进的遗憾界,这在上下文BwK文献中是新颖的。
Summary / 总结
The paper studies an online setting where a decision maker interacts with contextual bandit-with-knapsack (BwK) instances in repeated episodes, each starting with a different resource amount and having non-stationary context distributions. All episodes share a latent conversion model that governs the random outcome given a request's context and the allocation decision. The authors design an online algorithm whose regret is sub-linear in the number of episodes $T$, assuming access to a confidence bound oracle with $o(T)$ regret. The framework also gives improved regret bounds in certain settings when unlabeled feature data is provided, which is novel in the contextual BwK literature.
论文研究了一个在线设置,其中决策者在重复的回合中与上下文带背包赌博机(BwK)实例进行交互,每个回合开始时资源量不同,且上下文分布是非平稳的。所有回合共享一个潜在转化模型,该模型决定了在给定请求上下文和分配决策时的随机结果。论文设计了一个在线算法,在假设可以访问置信界预言机的情况下,实现了关于回合数的次线性遗憾。当提供未标记的特征数据时,该框架在某些设置下可以改进遗憾界,这在上下文BwK文献中是新颖的。
CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
Authors: Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman
First: 2026-01-02T11:39:00+00:00 · Latest: 2026-01-02T11:39:00+00:00
Comments: Accepted at TMLR 2026
Abstract
Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
中文标题/摘要
标题:CRoPS:一种无需训练的视觉-语言模型幻觉抑制框架
尽管大型视觉-语言模型(LVLMs)取得了快速的成功,但它们生成幻觉内容的倾向一直是一个持续的挑战,这在实际应用中削弱了其可靠性。现有的无需训练的方法虽然可以解决幻觉问题,但存在两个局限性:(i) 它们依赖于幻觉来源的狭窄假设,(ii) 它们在生成过程中后期效果下降,而幻觉最有可能在此时发生。一种常见的策略是通过完全或部分移除视觉标记并将其与原始模型进行对比来构建幻觉模型。然而,这本身是不够的,因为视觉信息仍然会传递到生成的文本中。基于这一洞察,我们提出了一种新的幻觉模型,通过选择性地移除关键文本标记来捕捉幻觉效果。我们进一步引入了广义对比解码,它将多种幻觉模型整合在一起,以代表多种幻觉来源。这些想法共同构成了CRoPS,一种无需训练的幻觉抑制框架,它通过提高CHAIR分数20%并在六个基准和三个LVLM家族中实现一致的改进,超越了最先进的无需训练的方法。
Summary / 总结
The research aims to address the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing CRoPS, a training-free framework. CRoPS mitigates hallucinations by selectively removing key text tokens and integrating multiple hallucinated models. The framework significantly improves CHAIR scores by 20% and consistently outperforms existing training-free methods across six benchmarks and three LVLM families.
研究旨在解决大型视觉-语言模型(LVLM)中的幻觉问题,这可能会影响其可靠性。提出的CRoPS方法引入了一种无需训练的框架,通过选择性地移除关键文本令牌并整合多个幻觉模型来缓解幻觉。该方法使CHAIR得分提高了20%,并在六个基准和三种LVLM家族中实现了持续的改进,优于现有的最佳无需训练方法。
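The contrastive step described in the CRoPS entry above can be illustrated with a small sketch. The abstract does not give the formula for Generalized Contrastive Decoding, so the weighted form below is an assumption that extends the common single-model contrastive decoding rule to several "hallucinated" models; the function name, the weights `alphas` and the toy logits are all hypothetical. The idea shown is that tokens favored by the degraded models are pushed down relative to tokens that only the original model supports.

```python
def contrastive_logits(original, hallucinated_list, alphas):
    """Combine next-token logits from the original model and several
    hallucinated models.

    Assumed form: (1 + sum(alphas)) * original - sum(alpha_i * hallucinated_i).
    With a single hallucinated model this reduces to the usual
    (1 + alpha) * original - alpha * hallucinated rule.
    """
    total = sum(alphas)
    combined = []
    for t, logit in enumerate(original):
        penalty = sum(a * h[t] for a, h in zip(alphas, hallucinated_list))
        combined.append((1.0 + total) * logit - penalty)
    return combined


# Toy vocabulary of two tokens. Token 0 is also favored by the hallucinated
# model, so its lead over token 1 disappears after contrasting.
original = [2.0, 1.0]
hallucinated = [[2.0, 0.0]]
adjusted = contrastive_logits(original, hallucinated, alphas=[1.0])
```

With an empty list of hallucinated models the function returns the original logits unchanged, which makes it easy to compare decoding with and without the contrastive term.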
Reconstructing Building Height from Spaceborne TomoSAR Point Clouds Using a Dual-Topology Network
Authors: Zhaiyu Chen, Yuanyuan Wang, Yilei Shi, Xiao Xiang Zhu
First: 2026-01-02T11:34:35+00:00 · Latest: 2026-01-02T11:34:35+00:00
Comments: Accepted for publication in IEEE Transactions on Geoscience and Remote Sensing
Abstract
Reliable building height estimation is essential for various urban applications. Spaceborne SAR tomography (TomoSAR) provides weather-independent, side-looking observations that capture facade-level structure, offering a promising alternative to conventional optical methods. However, TomoSAR point clouds often suffer from noise, anisotropic point distributions, and data voids on incoherent surfaces, all of which hinder accurate height reconstruction. To address these challenges, we introduce a learning-based framework for converting raw TomoSAR points into high-resolution building height maps. Our dual-topology network alternates between a point branch that models irregular scatterer features and a grid branch that enforces spatial consistency. By jointly processing these representations, the network denoises the input points and inpaints missing regions to produce continuous height estimates. To our knowledge, this is the first proof of concept for large-scale urban height mapping directly from TomoSAR point clouds. Extensive experiments on data from Munich and Berlin validate the effectiveness of our approach. Moreover, we demonstrate that our framework can be extended to incorporate optical satellite imagery, further enhancing reconstruction quality. The source code is available at https://github.com/zhu-xlab/tomosar2height.
中文标题/摘要
标题:基于双拓扑网络从星载TomoSAR点云重建建筑高度
可靠的建筑高度估计对于各种城市应用至关重要。星载SAR层析成像(TomoSAR)提供了不受天气影响的侧视观测,能够捕捉立面级结构,为传统的光学方法提供了有前景的替代方案。然而,TomoSAR点云经常受到噪声、各向异性点分布和不相干表面数据空洞的影响,这些都阻碍了准确的高度重建。为了解决这些挑战,我们提出了一种基于学习的框架,将原始TomoSAR点转换为高分辨率的建筑高度图。我们的双拓扑网络交替处理建模不规则散射体特性的点分支和确保空间一致性的网格分支。通过联合处理这些表示,网络对输入点进行去噪并填补缺失区域,生成连续的高度估计。据我们所知,这是首次直接从TomoSAR点云进行大规模城市高度制图的概念验证。在慕尼黑和柏林的数据上进行的大量实验验证了我们方法的有效性。此外,我们展示了我们的框架可以扩展以结合光学卫星图像,进一步提高重建质量。源代码可在https://github.com/zhu-xlab/tomosar2height获取。
Summary / 总结
This study addresses the challenge of accurately estimating building heights from noisy and incomplete TomoSAR point clouds by introducing a dual-topology network. The network alternates between a point branch for irregular features and a grid branch for spatial consistency, effectively denoising and inpainting the data to produce high-resolution height maps. Experiments on Munich and Berlin data show the effectiveness of this approach, and the framework can be extended to incorporate optical satellite imagery for further improvement.
本文通过引入双拓扑网络解决了从噪声较大的TomoSAR点云中可靠估计建筑物高度的问题。该网络交替处理点分支以处理不规则特征和网格分支以确保空间一致性,从而有效去除噪声并填补缺失区域,生成高分辨率的高度图。在慕尼黑和柏林的数据上进行的实验表明该方法的有效性,并且该框架可以扩展以结合光学卫星图像以获得更好的重建质量。
Quality Detection of Stored Potatoes via Transfer Learning: A CNN and Vision Transformer Approach
Authors: Shrikant Kapse, Priyankkumar Dhrangdhariya, Priya Kedia, Manasi Patwardhan, Shankar Kausley, Soumyadipta Maiti, Beena Rai, Shirish Karande
First: 2026-01-02T11:10:55+00:00 · Latest: 2026-01-02T11:10:55+00:00
Abstract
Image-based deep learning provides a non-invasive, scalable solution for monitoring potato quality during storage, addressing key challenges such as sprout detection, weight loss estimation, and shelf-life prediction. In this study, images and corresponding weight data were collected over a 200-day period under controlled temperature and humidity conditions. Leveraging powerful pre-trained architectures of ResNet, VGG, DenseNet, and Vision Transformer (ViT), we designed two specialized models: (1) a high-precision binary classifier for sprout detection, and (2) an advanced multi-class predictor to estimate weight loss and forecast remaining shelf-life with remarkable accuracy. DenseNet achieved exceptional performance, with 98.03% accuracy in sprout detection. Shelf-life prediction models performed best with coarse class divisions (2-5 classes), achieving over 89.83% accuracy, while accuracy declined for finer divisions (6-8 classes) due to subtle visual differences and limited data per class. These findings demonstrate the feasibility of integrating image-based models into automated sorting and inventory systems, enabling early identification of sprouted potatoes and dynamic categorization based on storage stage. Practical implications include improved inventory management, differential pricing strategies, and reduced food waste across supply chains. While predicting exact shelf-life intervals remains challenging, focusing on broader class divisions ensures robust performance. Future research should aim to develop generalized models trained on diverse potato varieties and storage conditions to enhance adaptability and scalability. Overall, this approach offers a cost-effective, non-destructive method for quality assessment, supporting efficiency and sustainability in potato storage and distribution.
中文标题/摘要
标题:基于迁移学习的储存马铃薯质量检测:一种CNN和视觉变换器方法
基于图像的深度学习为监测储存期间马铃薯质量提供了无损、可扩展的解决方案,解决了诸如发芽检测、重量损失估计和保质期预测等关键挑战。在本研究中,研究人员在受控温度和湿度条件下,收集了200天的图像和相应的重量数据。利用强大的预训练架构ResNet、VGG、DenseNet和视觉变换器(ViT),设计了两个专门模型:(1)高精度二分类器用于发芽检测;(2)先进的多分类预测器用于估计重量损失并准确预测剩余保质期。DenseNet在发芽检测中表现出色,准确率为98.03%。保质期预测模型在粗分类(2-5类)中表现最佳,准确率超过89.83%,而细分类(6-8类)由于细微的视觉差异和每类数据有限,准确率下降。这些发现表明,将基于图像的模型集成到自动化分拣和库存系统中是可行的,可以实现早期识别发芽马铃薯并根据储存阶段进行动态分类。实际意义包括改进库存管理、差异化定价策略和减少供应链中的食物浪费。虽然预测确切的保质期区间仍然具有挑战性,但专注于更广泛的分类区间可以确保稳健的性能。未来的研究应致力于开发适用于多种马铃薯品种和储存条件的通用模型,以增强适应性和可扩展性。总体而言,这种方法提供了一种成本效益高、无损的评估方法,支持马铃薯储存和分配中的效率和可持续性。
Summary / 总结
This study uses image-based deep learning to monitor potato quality during storage, focusing on sprout detection and weight loss estimation. By collecting images and weight data over 200 days under controlled conditions, the researchers designed specialized models using pre-trained architectures like ResNet, VGG, DenseNet, and Vision Transformer. DenseNet showed high precision in sprout detection with 98.03% accuracy. Shelf-life prediction models performed best with coarse class divisions, achieving over 89.83% accuracy. The findings suggest that image-based models can be integrated into automated sorting and inventory systems, improving inventory management and reducing food waste.
该研究旨在利用基于图像的深度学习技术监控储存期间的马铃薯质量,解决诸如发芽检测和重量损失估计等挑战。作者在受控条件下收集了200天内的图像和重量数据,并使用预训练的ResNet、VGG、DenseNet和Vision Transformer架构开发了专门的模型。DenseNet在发芽检测中达到了98.03%的准确率,而货架期预测模型在2-5个类别中表现最佳,准确率超过89.83%。研究结果表明,基于图像的模型可以集成到自动分拣系统中,实现早期发芽检测和动态库存管理,支持可持续性和减少食物浪费。
MCD: Marginal Contrastive Discrimination for conditional density estimation
Authors: Katia Meziani, Aminata Ndiaye, Benjamin Riu
First: 2022-06-03T14:22:29+00:00 · Latest: 2026-01-02T10:19:25+00:00
Abstract
We consider the problem of conditional density estimation, which is a major topic of interest in the fields of statistical and machine learning. Our method, Marginal Contrastive Discrimination (MCD), reformulates the conditional density function into two factors: the marginal density function of the target variable and a ratio of density functions that can be estimated through binary classification. Like noise-contrastive methods, MCD can leverage state-of-the-art supervised learning techniques, including neural networks, to perform conditional density estimation. Our benchmark shows that, in practice, our method significantly outperforms existing methods on most density models and regression datasets.
中文标题/摘要
标题:MCD:边际对比鉴别法在条件密度估计中的应用
我们考虑条件密度估计的问题,这是统计学和机器学习领域的重要研究课题。我们的方法称为边际对比鉴别法(MCD),将条件密度函数重新表述为两个部分:目标变量的边际密度函数和一个可以通过二元分类估计的密度函数比值。与噪声对比方法类似,MCD 可以利用最先进的监督学习技术进行条件密度估计,包括神经网络。我们的基准测试表明,与现有的大多数密度模型和回归数据集上的方法相比,我们的方法在实践中表现显著更优。
Summary / 总结
The research aims to address the challenge of conditional density estimation, a key issue in statistical and machine learning. The proposed method, MCD (Marginal Contrastive Discrimination), decomposes the conditional density function into the marginal density of the target variable and a density ratio estimated via binary classification. This approach leverages supervised learning techniques, such as neural networks, for conditional density estimation. Experimental results show that MCD significantly outperforms existing methods on most density models and regression datasets.
研究旨在通过提出边际对比区分(MCD)方法解决条件密度估计的问题,该方法将条件密度函数分解为目标变量的边际密度和通过二元分类估计的密度比。MCD 利用神经网络等监督学习技术来估计条件密度。实验结果表明,MCD 在大多数密度模型和回归数据集上的表现优于现有方法。
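The factorization described in the MCD abstract can be checked numerically. The sketch below is illustrative only and is not the authors' implementation: it assumes a standard bivariate normal with correlation `rho`, assumes joint pairs and product-of-marginals pairs are presented to the classifier with equal prior probability, and replaces the trained classifier with a hypothetical closed-form `oracle_classifier`. Under those assumptions the classifier output q gives the density ratio q / (1 - q), and multiplying it by the marginal of y gives p(y | x).

```python
import math


def normal_pdf(z, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((z - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def joint_pdf(x, y, rho):
    """Density of a standard bivariate normal with correlation rho."""
    det = 1 - rho ** 2
    quad = (x * x - 2 * rho * x * y + y * y) / det
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))


def oracle_classifier(x, y, rho):
    """Bayes-optimal P(pair was drawn from the joint | x, y).

    Hypothetical stand-in for a trained binary classifier; assumes joint
    pairs and product-of-marginals pairs are equally likely a priori.
    """
    p_joint = joint_pdf(x, y, rho)
    p_indep = normal_pdf(x, 0.0, 1.0) * normal_pdf(y, 0.0, 1.0)
    return p_joint / (p_joint + p_indep)


def conditional_density(x, y, marginal_y, classifier):
    """Estimate p(y | x) as marginal_y(y) * q / (1 - q), q = classifier(x, y)."""
    q = classifier(x, y)
    ratio = q / (1 - q)  # equals p(x, y) / (p(x) p(y)) for an ideal classifier
    return marginal_y(y) * ratio
```

With `rho = 0.6`, calling `conditional_density(0.5, -0.3, lambda y: normal_pdf(y, 0.0, 1.0), lambda x, y: oracle_classifier(x, y, 0.6))` should agree with the exact conditional `normal_pdf(-0.3, 0.3, 0.64)` up to floating-point error. In practice both the marginal and the classifier would be learned from data, so the estimate is only as good as those two models.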
Matrix-free Second-order Optimization of Gaussian Splats with Residual Sampling
Authors: Hamza Pehlivan, Andrea Boscolo Camiletto, Lin Geng Foo, Marc Habermann, Christian Theobalt
First: 2025-04-17T12:52:08+00:00 · Latest: 2026-01-02T10:18:11+00:00
Abstract
3D Gaussian Splatting (3DGS) is widely used for novel view synthesis due to its high rendering quality and fast inference time. However, 3DGS predominantly relies on first-order optimizers such as Adam, which leads to long training times. To address this limitation, we propose a novel second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG), which we specifically tailor towards Gaussian Splatting. Our key insight is that the Jacobian in 3DGS exhibits significant sparsity since each Gaussian affects only a limited number of pixels. We exploit this sparsity by proposing a matrix-free and GPU-parallelized LM optimization. To further improve its efficiency, we propose sampling strategies for both the camera views and loss function and, consequently, the normal equation, significantly reducing the computational complexity. In addition, we increase the convergence rate of the second-order approximation by introducing an effective heuristic to determine the learning rate that avoids the expensive computation cost of line search methods. As a result, our method achieves a $3\times$ speedup over standard LM and outperforms Adam by $\sim 6\times$ when the Gaussian count is low while remaining competitive for moderate counts. Project Page: https://vcai.mpi-inf.mpg.de/projects/LM-IS
中文标题/摘要
标题:基于残差采样的高斯泼溅无矩阵二阶优化
3D高斯泼溅(3DGS)因其高渲染质量和快速推理时间而广泛用于新颖视图合成。然而,3DGS主要依赖于一阶优化器如Adam,导致训练时间较长。为解决这一问题,我们提出了一种基于Levenberg-Marquardt(LM)和共轭梯度(CG)的新型二阶优化策略,并特别针对高斯泼溅进行了定制。我们的关键见解是,在3DGS中,雅可比矩阵表现出显著的稀疏性,因为每个高斯仅影响少量像素。我们通过提出一种无矩阵的GPU并行LM优化来利用这种稀疏性。为了进一步提高其效率,我们提出了针对相机视图和损失函数的采样策略,进而对正规方程进行采样,从而显著降低了计算复杂度。此外,我们通过引入一种有效的学习率确定启发式方法来提高二阶近似的收敛速度,该方法避免了线搜索方法的昂贵计算成本。因此,我们的方法相比标准LM实现了3倍的加速,并在高斯数量较低时比Adam快约6倍,而在中等数量时仍具有竞争力。项目页面:https://vcai.mpi-inf.mpg.de/projects/LM-IS
Summary / 总结
This paper addresses the long training times of 3D Gaussian Splatting (3DGS) by proposing a matrix-free second-order optimization strategy based on Levenberg-Marquardt (LM) and Conjugate Gradient (CG). The method exploits the sparsity of the Jacobian in 3DGS, where each Gaussian affects only a limited number of pixels, to achieve GPU-parallelized optimization. By sampling camera views and loss terms and using a heuristic to set the learning rate instead of a line search, the method reduces computational complexity, achieving a 3x speedup over standard LM and roughly a 6x improvement over Adam when the Gaussian count is low, while remaining competitive for moderate counts.
研究旨在通过提出基于Levenberg-Marquardt (LM)和Conjugate Gradient (CG)的无矩阵二阶优化方法来提高3D Gaussian Splatting (3DGS)的训练效率。该方法利用雅可比矩阵的稀疏性,并采用GPU并行优化以提高效率。此外,它引入了对相机视角和损失函数的采样策略以减少计算复杂性,并通过有效的学习率启发式方法来避免昂贵的线搜索计算。结果表明,该方法比标准LM快$3\times$,在高斯数量较低时比Adam快约$6\times$,而在中等数量时仍具有竞争力。
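The matrix-free idea in the abstract above can be illustrated with a generic Levenberg-Marquardt step solved by Conjugate Gradient. This is a minimal sketch under stated assumptions, not the paper's method: it is plain CPU Python rather than a GPU kernel, it omits the view and loss sampling and the learning-rate heuristic, and the names `jvp`, `vjp`, and `lm_step` are hypothetical. The point it shows is that the damped normal equation (J^T J + damping I) delta = -J^T r can be solved using only Jacobian-vector and transposed-Jacobian-vector products, so J is never stored.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(apply_a: Callable[[Vector], Vector], b: Vector,
                       iters: int = 50, tol: float = 1e-10) -> Vector:
    """Solve A x = b for symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs <= tol:    # squared residual norm is small enough
            break
        ap = apply_a(p)
        alpha = rs / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)
        rs = rs_new
    return x


def lm_step(jvp: Callable[[Vector], Vector], vjp: Callable[[Vector], Vector],
            residual: Vector, damping: float) -> Vector:
    """One LM update: solve (J^T J + damping * I) delta = -J^T r matrix-free.

    jvp(v) must return J v and vjp(u) must return J^T u.
    """
    rhs = [-g for g in vjp(residual)]

    def normal_op(v: Vector) -> Vector:
        # J^T (J v) + damping * v, without ever forming J^T J
        return axpy(damping, v, vjp(jvp(v)))

    return conjugate_gradient(normal_op, rhs)
```

For a linear residual r(theta) = A theta - b, a single call to `lm_step` from theta = 0 with a very small damping value should land close to the least-squares solution. In a splatting setting the two callbacks would be supplied by the renderer's forward and backward passes, which is where the sparsity mentioned in the abstract would be exploited.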
RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
Authors: Junxiao Xue, Pavel Smirnov, Ziao Li, Yunyun Shi, Shi Chen, Xinyi Yin, Xiaohan Yue, Lei Wang, Yiduo Wang, Feng Lin, Yijia Chen, Xiao Ma, Xiaoran Yan, Qing Zhang, Fengjian Xue, Xuecheng Wu
First: 2026-01-02T09:48:48+00:00 · Latest: 2026-01-02T09:48:48+00:00
Abstract
We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients' motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras, which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients' actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1 ms to track a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient's true motion state, making it visually smoother. Finally, we use the Unity platform for real-time monitoring and evaluation of patients' motion during rehabilitation, and to display the muscle stress conditions to assist patients with their rehabilitation training.
中文标题/摘要
标题:RePose:康复训练中的实时3D人体姿态估计与生物力学分析框架
我们提出了一种名为RePose的实时3D人体姿态估计和运动分析方法,用于康复训练。该方法能够实时监控和评估患者在康复过程中的运动,提供即时反馈和指导,帮助患者正确执行康复练习。首先,我们介绍了一种统一的端到端实时人体姿态估计和运动分析管道,使用来自多个摄像头的RGB视频输入,适用于康复训练领域。该管道有助于监控和纠正患者的动作,从而帮助他们恢复肌肉力量和运动功能。其次,我们提出了一种快速跟踪方法,适用于多人员干扰的医疗康复场景,单帧跟踪时间少于1ms。此外,我们对SmoothNet进行了修改,以实现实时姿态估计,有效减少了姿态估计误差,恢复了患者的真实运动状态,使其视觉上更加平滑。最后,我们使用Unity平台对患者在康复过程中的运动进行实时监控和评估,并显示肌肉应力情况,以辅助患者的康复训练。
Summary / 总结
RePose is a real-time 3D human pose estimation and biomechanical analysis framework for rehabilitation, utilizing a unified pipeline that processes RGB video from multiple cameras for motion monitoring and correction. It includes a fast tracking method for scenarios with multiple people and modifies SmoothNet for real-time posture estimation, reducing errors and improving visual smoothness. The system uses the Unity platform for real-time feedback and displays muscle stress conditions to aid patients during rehabilitation exercises.
RePose 是一种用于康复训练的实时 3D 人体姿态估计和生物力学分析框架。它采用统一的管道从多个摄像头获取实时人体姿态估计,并提出了一种快速跟踪方法来处理多人干扰,每帧跟踪时间少于 1ms。该框架还对 SmoothNet 进行了修改,以实现实时姿态估计,减少了错误并提供了更平滑的视觉运动。关键发现包括能够实时监控和纠正患者动作,帮助恢复肌肉力量和运动功能。
LEL: Lipschitz Continuity Constrained Ensemble Learning for Efficient EEG-Based Intra-subject Emotion Recognition
Authors: Shengyu Gong, Yueyang Li, Zijian Kang, Bo Chai, Weiming Zeng, Hongjie Yan, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
First: 2025-04-12T09:41:23+00:00 · Latest: 2026-01-02T09:44:11+00:00
Abstract
Accurate and efficient recognition of emotional states is critical for human social functioning, and impairments in this ability are associated with significant psychosocial difficulties. While electroencephalography (EEG) offers a powerful tool for objective emotion detection, existing EEG-based Emotion Recognition (EER) methods suffer from three key limitations: (1) insufficient model stability, (2) limited accuracy in processing high-dimensional nonlinear EEG signals, and (3) poor robustness against intra-subject variability and signal noise. To address these challenges, we introduce Lipschitz continuity-constrained Ensemble Learning (LEL), a novel framework that enhances EEG-based emotion recognition by enforcing Lipschitz continuity constraints on Transformer-based attention mechanisms, spectral extraction, and normalization modules. This constraint ensures model stability, reduces sensitivity to signal variability and noise, and improves generalization capability. Additionally, LEL employs a learnable ensemble fusion strategy that optimally combines decisions from multiple heterogeneous classifiers to mitigate single-model bias and variance. Extensive experiments on three public benchmark datasets (EAV, FACED, and SEED) demonstrate superior performance, achieving average recognition accuracies of 74.25%, 81.19%, and 86.79%, respectively. The official implementation code is available at https://github.com/NZWANG/LEL.
中文标题/摘要
标题:LEL:面向高效基于EEG的个体内情绪识别的Lipschitz连续性约束集成学习
准确且高效的识别情绪状态对于人类社会功能至关重要,而这种能力的缺陷与重大的心理社会困难相关联。尽管脑电图(EEG)提供了一种强大的工具用于客观情绪检测,但现有的基于EEG的情绪识别(EER)方法存在三个关键限制:(1)模型稳定性不足,(2)在处理高维非线性EEG信号时准确性有限,以及(3)对个体内差异和信号噪声的鲁棒性差。为了解决这些挑战,我们引入了Lipschitz连续性约束的集成学习(LEL),这是一种新颖的框架,通过在基于Transformer的注意力机制、频谱提取和归一化模块上施加Lipschitz连续性约束来增强基于EEG的情绪识别。这种约束确保了模型的稳定性,减少了对信号变化和噪声的敏感性,并提高了泛化能力。此外,LEL采用了一种可学习的集成融合策略,该策略通过优化组合多个异质分类器的决策来减轻单模型偏差和方差。在三个公开基准数据集(EAV、FACED和SEED)上的广泛实验表明,其性能优越,分别实现了74.25%、81.19%和86.79%的平均识别准确率。官方实现代码可在https://github.com/NZWANG/LEL获取。
Summary / 总结
The research aims to improve the accuracy and efficiency of EEG-based emotion recognition by addressing model instability, high-dimensional signal processing, and intra-subject variability. The proposed LEL framework uses Lipschitz continuity constraints on attention mechanisms, spectral extraction, and normalization to enhance model stability and robustness. It also employs a learnable ensemble fusion strategy to combine decisions from multiple classifiers. Experiments on three benchmark datasets show that LEL achieves average recognition accuracies of 74.25%, 81.19%, and 86.79%.
论文提出了LEL框架,通过施加Lipschitz连续性约束和使用可学习的集成融合策略来提升基于EEG的情绪识别。该方法解决了模型不稳定、高维信号处理和个体内变异性的问题。在三个基准数据集(EAV、FACED和SEED)上的实验结果显示,平均识别准确率分别为74.25%、81.19%和86.79%。
Do Chatbot LLMs Talk Too Much? The YapBench Benchmark
Authors: Vadim Borisov, Michael Gröger, Mina Mikhael, Richard H. Schreiber
First: 2026-01-02T09:43:52+00:00 · Latest: 2026-01-02T09:43:52+00:00
Abstract
Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini increasingly act as general-purpose copilots, yet they often respond with unnecessary length on simple requests, adding redundant explanations, hedging, or boilerplate that increases cognitive load and inflates token-based inference cost. Prior work suggests that preference-based post-training and LLM-judged evaluations can induce systematic length bias, where longer answers are rewarded even at comparable quality. We introduce YapBench, a lightweight benchmark for quantifying user-visible over-generation on brevity-ideal prompts. Each item consists of a single-turn prompt, a curated minimal-sufficient baseline answer, and a category label. Our primary metric, YapScore, measures excess response length beyond the baseline in characters, enabling comparisons across models without relying on any specific tokenizer. We summarize model performance via the YapIndex, a uniformly weighted average of category-level median YapScores. YapBench contains over three hundred English prompts spanning three common brevity-ideal settings: (A) minimal or ambiguous inputs where the ideal behavior is a short clarification, (B) closed-form factual questions with short stable answers, and (C) one-line coding tasks where a single command or snippet suffices. Evaluating 76 assistant LLMs, we observe an order-of-magnitude spread in median excess length and distinct category-specific failure modes, including vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests. We release the benchmark and maintain a live leaderboard for tracking verbosity behavior over time.
中文标题/摘要
标题:聊天机器人大语言模型是否说得太多?YapBench基准测试
大型语言模型(LLMs)如ChatGPT、Claude和Gemini越来越多地扮演通用副驾的角色,但在处理简单请求时,它们经常给出不必要的长篇回答,增加冗余解释、犹豫不决或模板化的内容,这增加了认知负担并增加了基于令牌的推理成本。先前的研究表明,基于偏好的后训练和LLM评估可能会导致系统性的长度偏差,即使答案质量相当,较长的答案也会受到奖励。 我们引入了YapBench,这是一个轻量级基准,用于量化简明提示下的过度生成。每个项目包含一个单轮提示、一个精心挑选的最小充分基线答案和一个类别标签。我们的主要指标YapScore衡量超出基线的额外响应长度(以字符为单位),这使得在不依赖任何特定分词器的情况下,可以跨模型进行比较。我们通过YapIndex,即按类别加权平均的中位YapScore,来总结模型性能。 YapBench包含超过三百个英语提示,涵盖了三种常见的简明提示设置:(A)最小或模糊输入,理想的反应是简短的澄清;(B)封闭形式的事实性问题,有简短稳定的答案;(C)一行代码任务,只需一个命令或片段即可。评估76个助手LLM后,我们观察到中位数额外长度的量级差异,并且在不同类别中存在特定的失败模式,包括在模糊输入上进行真空填充以及在一行技术请求上增加解释或格式化开销。我们发布了该基准,并维护了一个实时排行榜,以跟踪随着时间的推移而变化的冗长行为。
Summary / 总结
The research aims to address the issue of large language models (LLMs) providing overly long responses to simple requests, which increases cognitive load and token-based inference cost. To measure this, the authors developed YapBench, a benchmark of over three hundred brevity-ideal English prompts whose primary metric, YapScore, counts the characters a response adds beyond a minimal-sufficient baseline answer. Evaluating 76 assistant LLMs, they find an order-of-magnitude spread in median excess length across models and distinct category-specific failure modes, such as vacuum-filling on ambiguous inputs and explanation or formatting overhead on one-line technical requests.
研究旨在解决大型语言模型(LLMs)提供过长响应的问题,这会增加认知负担和基于令牌的推理成本。研究引入了YapBench,这是一个用于衡量简明理想提示中过度生成的基准。主要发现包括76个模型在中位数过长长度上的数量级差异,以及不同的失败模式,例如在模糊输入上的真空填充和在技术请求上的解释冗余。
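The two metrics named in the YapBench abstract can be sketched in a few lines. This is an illustrative reading of the abstract, not the benchmark's released code: the function names `yap_score` and `yap_index` and the `(category, response, baseline)` tuple layout are hypothetical, and flooring the score at zero for responses shorter than the baseline is an assumption the abstract does not confirm.

```python
from collections import defaultdict
from statistics import median
from typing import Iterable, Tuple


def yap_score(response: str, baseline: str) -> int:
    """Excess response length over the baseline answer, in characters.

    Assumption: responses shorter than the baseline score 0 rather than
    a negative value.
    """
    return max(0, len(response) - len(baseline))


def yap_index(items: Iterable[Tuple[str, str, str]]) -> float:
    """Uniformly weighted average of per-category median YapScores.

    Each item is (category, response, baseline).
    """
    by_category = defaultdict(list)
    for category, response, baseline in items:
        by_category[category].append(yap_score(response, baseline))
    medians = [median(scores) for scores in by_category.values()]
    return sum(medians) / len(medians)
```

Because the score is measured in characters, two models with different tokenizers can be compared directly. Taking the median within each category before averaging keeps one very long response, or one category with many prompts, from dominating the index.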