arXiv 论文速递

2026-03-13 03:51
Snapshot: 20260313_0351
COMIC: Agentic Sketch Comedy Generation
Authors: Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
First: 2026-03-11T17:59:59+00:00 · Latest: 2026-03-11T17:59:59+00:00
Comments: Project page: https://susunghong.github.io/COMIC/
Abstract
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
中文标题/摘要
标题:COMIC:智能体小品喜剧生成
我们提出了一种全自动AI系统,能够生成类似于《周六夜现场》等小品喜剧节目的喜剧短视频。该系统从角色参考出发,采用一组大致参照真实制作工作室角色设计的智能体,通过迭代的竞争、评估和改进来优化创意与输出的质量和多样性。一个关键贡献是引入了LLM评论家:通过分析YouTube上的喜剧视频语料库,使其与真实观众偏好对齐,从而自动评估幽默感。我们的实验表明,该框架生成的结果接近专业制作的小品的质量,同时在视频生成方面表现出最先进的性能。
Summary / 总结
The research aims to develop a fully automated AI system that generates short comedic videos in the style of sketch shows. Starting from character references, it uses a population of AI agents loosely modeled on production studio roles, which compete, evaluate, and improve iteratively to raise the quality and diversity of ideas and outputs. Humor is evaluated automatically by LLM critics aligned with real viewer preferences through analysis of a corpus of YouTube comedy videos. Experiments show that the system produces results approaching the quality of professionally produced sketches and demonstrates state-of-the-art performance in video generation.
研究旨在开发一个能够生成类似于小品喜剧节目的喜剧短视频的AI系统。该系统使用代表制作角色的AI智能体群体进行迭代竞争和改进,并通过分析YouTube上的喜剧视频使LLM评论家与观众偏好对齐,以自动评估幽默感。实验表明,该系统生成的喜剧视频质量接近专业制作的小品,并在视频生成方面展示了最先进的性能。
LiTo: Surface Light Field Tokenization
Authors: Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel
Venue: ICLR 2026
First: 2026-03-11T17:59:59+00:00 · Latest: 2026-03-11T17:59:59+00:00
Comments: ICLR 2026; Project page: https://apple.github.io/ml-lito/
Abstract
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
中文标题/摘要
标题:LiTo:表面光场令牌化
我们提出了一种3D潜在表示,同时建模对象几何和视点依赖的外观。大多数先前的工作要么专注于重建3D几何,要么预测视点无关的漫反射外观,因此难以捕捉到真实的视点依赖效果。我们的方法利用了RGB-深度图像提供了表面光场的样本。通过将这种表面光场的随机子样本编码为紧凑的潜在向量集,我们的模型学会了在统一的3D潜在空间中表示几何和外观。这种表示在复杂光照下能够再现视点依赖效果,如镜面高光和菲涅尔反射。我们进一步在该表示上训练了一个潜在流匹配模型,使其在单张输入图像的条件下学习其分布,从而能够生成与输入中的光照和材料一致的3D对象。实验表明,我们的方法在视觉质量和输入保真度方面优于现有方法。
Summary / 总结
The research aims to develop a 3D latent representation that captures both object geometry and view-dependent appearance. The method encodes random subsamples of a surface light field into latent vectors, allowing the model to learn a unified 3D latent space for geometry and appearance. Experiments demonstrate that this approach outperforms existing methods in visual quality and input fidelity, particularly in reproducing view-dependent effects like specular highlights and Fresnel reflections under complex lighting conditions.
研究旨在开发一种能够同时捕捉物体几何形状和视点依赖外观的3D潜在表示。方法通过将RGB-深度图像中的表面光场的随机子样本编码为紧凑的潜在向量集,使模型能够在统一的3D潜在空间中表示几何形状和外观。实验结果表明,该方法在视觉质量和输入保真度方面优于现有方法,能够有效再现复杂的光照条件下的镜面高光和菲涅尔反射等视点依赖效果。
Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation
Authors: Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette
First: 2026-03-11T17:59:42+00:00 · Latest: 2026-03-11T17:59:42+00:00
Comments: 27 pages, 15 figures
Abstract
We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab-lab-princeton.github.io/nefty/
中文标题/摘要
标题:神经场热成像:用于非破坏性评估的可微物理框架
我们提出了神经场热成像(NeFTY),这是一种用于从瞬态表面温度测量中定量重建材料属性的可微物理框架。传统热成像依赖于像素级的一维近似,忽略了横向扩散,而软约束物理感知神经网络(PINNs)在瞬态扩散场景中往往由于梯度刚性而失效。NeFTY 将三维扩散场参数化为通过严格的数值求解器优化的连续神经场。通过利用可微物理求解器,我们的方法将热力学定律作为硬约束强制执行,同时保持高分辨率三维成像所需的内存效率。我们通过先离散化后优化的范式有效缓解了逆热传导固有的频谱偏差和病态性,从而能够在任意尺度上恢复次表面缺陷。合成数据的实验验证表明,NeFTY 显著提高了次表面缺陷定位的准确性。更多细节请参见 https://cab-lab-princeton.github.io/nefty/
Summary / 总结
Neural Field Thermal Tomography (NeFTY) is a differentiable physics framework designed for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. It parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver, addressing limitations of traditional thermography and soft-constrained PINNs. Experiments on synthetic data show that NeFTY enhances the accuracy of subsurface defect localization compared to existing methods.
Neural Field Thermal Tomography (NeFTY) 是一种从瞬态表面温度测量中定量重建 3D 材料属性的可微物理框架。它将 3D 扩散率场参数化为连续的神经场,并通过可微物理求解器将热力学定律作为硬约束强制执行,克服了传统热成像和软约束 PINNs 的局限性。在合成数据上的实验表明,NeFTY 在次表面缺陷定位的准确性上优于基线方法。
Agentar-Fin-OCR
Authors: Siyi Qian, Xiongfei Bai, Bingtao Fu, Yichen Lu, Gaoyang Zhang, Xudong Yang, Peng Zhang
First: 2026-03-11T17:59:42+00:00 · Latest: 2026-03-11T17:59:42+00:00
Abstract
In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.
中文标题/摘要
标题:Agentar-Fin-OCR
本文提出了一种名为Agentar-Fin-OCR的文档解析系统,专门针对金融领域的文档,能够将超长的金融PDF文件转换为语义一致、高度准确的结构化输出,并具有审计级别的溯源性。为解决金融领域特有的挑战,如复杂的页面布局、跨页结构断点以及单元格级别的引用能力,Agentar-Fin-OCR结合了(1)跨页内容整合算法以恢复页面间的连续性,以及文档级别标题层次重建(DHR)模块以构建全局一致的目录树,用于结构化检索;(2)一种适应难度的课程学习训练策略用于表格解析,以及使用结构锚定标记从解码器隐藏状态中定位表格单元格的CellBBoxRegressor模块,无需外部检测器。实验表明,我们的模型在OmniDocBench的表格解析指标上表现出色。为了在金融领域实现现实的评估,我们进一步引入了FinDocBench基准,其中包括六类金融文档类别,并附有专家验证的注释和评估指标,如基于目录树编辑距离的相似性(TocEDS)、跨页连接的TEDS以及表格单元格交并比(C-IoU)。我们在FinDocBench上评估了多种最先进的模型,以评估它们在金融文档上的能力和剩余的局限性。总体而言,Agentar-Fin-OCR和FinDocBench为可靠的下游金融文档应用提供了实用的基础。
Summary / 总结
Agentar-Fin-OCR is a document parsing system designed for financial-domain documents. It addresses complex layouts and cross-page discontinuities with a Cross-page Contents Consolidation algorithm and a Document-level Heading Hierarchy Reconstruction (DHR) module, and handles table parsing with a difficulty-adaptive curriculum learning strategy and a CellBBoxRegressor module that localizes table cells without external detectors. Experiments show high performance on the table parsing metrics of OmniDocBench. The authors also introduce FinDocBench, a benchmark covering six financial document categories with expert-verified annotations and the metrics TocEDS, cross-page concatenated TEDS, and C-IoU, and use it to assess the capabilities and remaining limitations of state-of-the-art models on financial documents.
Agentar-Fin-OCR 是一种针对金融领域文档的解析系统,通过跨页内容整合算法和文档层级标题重构模块解决复杂布局和跨页断续等问题。它还采用了一种难度自适应的课程学习策略和单元格边界回归模块进行表格解析。实验结果显示在表格解析指标上的高性能。此外,引入了FinDocBench基准,包含六个金融文档类别并附有专家验证的注释,用于评估模型在金融领域的表现,强调其在下游应用中的可靠性。
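The Table Cell Intersection over Union (C-IoU) metric named in the abstract builds on the standard box IoU. The sketch below shows only that standard computation; the abstract does not say how predicted and ground-truth cells are matched or aggregated in FinDocBench, so the index-based pairing in `mean_cell_iou` is a hypothetical choice made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_cell_iou(pred_cells, gt_cells):
    """Average IoU over cells paired by index (hypothetical pairing rule).

    Ground-truth cells without a prediction contribute 0 because zip()
    stops at the shorter list while the divisor stays len(gt_cells).
    """
    if not gt_cells:
        return 0.0
    total = sum(box_iou(p, g) for p, g in zip(pred_cells, gt_cells))
    return total / len(gt_cells)
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` has an overlap of 1 and a union of 7, so the IoU is 1/7.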
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
First: 2026-03-11T17:59:40+00:00 · Latest: 2026-03-11T17:59:40+00:00
Comments: Project page: https://genjib.github.io/v2m_zero/
Abstract
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
中文标题/摘要
标题:V2M-Zero:零配对时间对齐的视频到音乐生成
现有文本到音乐模型缺乏精细的时间控制,因此难以生成与视频事件在时间上对齐的音乐。我们提出了V2M-Zero,这是一种零配对的视频到音乐生成方法,可以输出与视频时间对齐的音乐。我们的方法受到一个关键观察的启发:时间同步需要匹配变化发生的时间和幅度,而不是变化的内容。尽管音乐事件和视觉事件在语义上不同,但它们具有共享的时间结构,且这种结构可以在每个模态内独立捕捉。我们使用预训练的音乐和视频编码器,根据模态内相似性计算事件曲线来捕捉这种结构。通过独立测量每个模态内的时间变化,这些曲线提供了跨模态的可比表示。这使得一个简单的训练策略成为可能:在音乐事件曲线上微调文本到音乐模型,然后在推理时替换为视频事件曲线,无需跨模态训练或配对数据。在OES-Pub、MovieGenBench-Music和AIST++上,V2M-Zero相比配对数据基线取得了显著的改进:音频质量提高了5-21%,语义对齐提高了13-15%,时间同步提高了21-52%,舞蹈视频的节拍对齐提高了28%。通过大规模的众包主观听觉测试,我们发现了类似的结果。总体而言,我们的结果验证了通过模态内特征进行时间对齐,而不是跨模态配对监督,对于视频到音乐生成是有效的。结果可在https://genjib.github.io/v2m_zero/ 查看。
Summary / 总结
V2M-Zero is a zero-pair video-to-music generation approach that uses event curves from pretrained encoders to achieve fine-grained temporal alignment. It outperforms paired-data baselines by 5-21% in audio quality, 13-15% in semantic alignment, 21-52% in temporal synchronization, and 28% in beat alignment on dance videos. This method relies on within-modality features rather than cross-modal supervision, demonstrating its effectiveness in generating time-aligned music for video events.
V2M-Zero 是一种零配对的视频到音乐生成方法,能够输出与视频时间对齐的音乐。它使用预训练编码器中的事件曲线来捕获每个模态内的共享时间结构,从而可以在没有配对数据的情况下对文本到音乐模型进行微调。V2M-Zero 在各种基准测试和主观听觉测试中表现出显著的改进,实现了更高的音频质量、更好的语义对齐和更好的时间同步。
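To make the event-curve idea concrete, here is a minimal sketch. The abstract says only that the curves are computed from intra-modal similarity with pretrained encoders; the specific definition used here (one minus the cosine similarity of consecutive embeddings) and the toy embeddings are assumptions for illustration, not the paper's formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def event_curve(embeddings):
    """Assumed definition: change[t] = 1 - cos(e[t], e[t + 1]).

    Only similarities within one modality are used, so curves from
    encoders with different embedding sizes remain comparable.
    """
    return [
        1.0 - cosine(embeddings[t], embeddings[t + 1])
        for t in range(len(embeddings) - 1)
    ]


# Toy stand-ins for encoder outputs (hypothetical values).
video_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
music_embeddings = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 3.0]]

video_curve = event_curve(video_embeddings)
music_curve = event_curve(music_embeddings)
```

Both toy curves peak at the same step even though the two embedding spaces share nothing, which is the property that lets a model conditioned on music-event curves accept video-event curves at inference.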
DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
Authors: Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan
First: 2026-03-11T17:59:31+00:00 · Latest: 2026-03-11T17:59:31+00:00
Comments: 18 pages, 10 figures
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
中文标题/摘要
标题:DynVLA:学习世界动力学以实现自动驾驶中的动作推理
我们提出了DynVLA,一种驾驶VLA模型,引入了一种新的CoT范式,称为动力学CoT。DynVLA在行动生成之前预测紧凑的动力学,使决策更加明智且物理上合理。为了获得紧凑的动力学表示,DynVLA引入了动力学分词器,将未来演变压缩为少量的动力学令牌。考虑到交互密集型驾驶场景中的丰富环境动力学,DynVLA解耦了以自我为中心和环境为中心的动力学,从而更准确地建模世界动力学。然后,我们通过SFT和RFT训练DynVLA在行动之前生成动力学令牌,提高决策质量同时保持低延迟的推理。与缺乏精细时空理解的文本CoT相比,以及由于密集图像预测而引入大量冗余的视觉CoT,动力学CoT以紧凑、可解释和高效的形式捕捉世界演变。在NAVSIM、Bench2Drive和一个大规模的内部数据集上的广泛实验表明,DynVLA在Textual CoT和Visual CoT方法上始终表现出色,验证了动力学CoT的有效性和实际价值。
Summary / 总结
DynVLA is a driving VLA model that introduces a new CoT paradigm called Dynamics CoT, which forecasts compact world dynamics before action generation. It uses a Dynamics Tokenizer to compress future evolution into a small set of tokens, improving decision-making accuracy. Experiments on various datasets show that DynVLA outperforms Textual CoT and Visual CoT methods, validating the effectiveness of Dynamics CoT in autonomous driving scenarios.
DynVLA 是一种驾驶 VLA 模型,引入了名为 Dynamics CoT 的新 CoT 范式,用于自动驾驶中的动作推理。它在动作生成前预测紧凑的世界动态,使用 Dynamics Tokenizer 将未来演变压缩成一小组动态令牌。该方法解耦了以自我为中心和以环境为中心的动态,在提高决策质量的同时保持了低延迟的推理。实验表明,DynVLA 在各种数据集上优于 Textual CoT 和 Visual CoT 方法,验证了 Dynamics CoT 的有效性。
Instruction set for the representation of graphs
Authors: Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez
First: 2026-03-11T17:57:44+00:00 · Latest: 2026-03-11T17:57:44+00:00
Abstract
We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy GraphToString algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling.
中文标题/摘要
标题:图的表示指令集
我们提出了IsalGraph方法,该方法使用九字符指令字母表将任何有限简单图的结构紧凑地表示为字符串。编码由一个小虚拟机执行,该虚拟机包含一个稀疏图、一个循环双链表(CDLL)的图节点引用以及两个遍历指针。指令要么通过CDLL移动指针,要么将节点或边插入图中。一个关键设计属性是字母表上的每个字符串都可以解码为一个有效的图,没有任何无效状态可到达。贪婪的GraphToString算法以节点数的多项式时间将任何连接图编码为字符串;一种穷举回溯变体通过选择所有起始节点和所有有效遍历顺序中最短的字典序最小字符串来生成一个规范字符串。我们在五个真实世界的图基准数据集(IAM Letter LOW/MED/HIGH、LINUX和AIDS)上评估了该表示方法,并表明IsalGraph字符串之间的Levenshtein距离与图编辑距离(GED)之间存在很强的相关性。这些特性使IsalGraph字符串成为图结构的紧凑、同构不变且语言模型兼容的序列编码,具有直接应用于图相似性搜索、图生成和图条件语言建模的应用。
Summary / 总结
The research aims to develop a compact string representation for graph structures using a nine-character instruction alphabet. The method, IsalGraph, employs a small virtual machine with a sparse graph, a circular doubly-linked list of graph-node references, and two traversal pointers to encode any finite, simple graph, and every string over the alphabet decodes to a valid graph. The key finding is that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance on five real-world benchmark datasets, making the encoding suitable for graph similarity search, graph generation, and graph-conditioned language modelling.
IsalGraph 是一种使用九字符指令集将图结构编码为紧凑字符串的方法。它使用一个虚拟机,包含稀疏图、循环双链表和两个遍历指针来执行移动指针或插入节点和边的指令。该方法确保所有字符串都能解码为有效的图,并包括一个高效编码连接图的贪心 GraphToString 算法。在五个真实世界数据集上的实验显示,IsalGraph 字符串与图编辑距离之间有很强的相关性,使其成为紧凑、同构不变且语言模型兼容的图结构表示,具有直接应用于图相似性搜索、图生成和图条件语言建模的应用。
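The string-side distance used in the evaluation is the standard Levenshtein edit distance, sketched below with the usual dynamic program. The abstract does not list the nine instruction characters, so the inputs in the usage note are ordinary placeholder strings rather than real IsalGraph encodings.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between a[:i - 1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

As a sanity check on the routine itself, `levenshtein("kitten", "sitting")` is 3. Applied to two graph encodings, a small value would suggest the graphs differ by few edits, which is the correlation with GED that the abstract reports.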
Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
Authors: Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown
First: 2026-03-11T17:49:45+00:00 · Latest: 2026-03-11T17:49:45+00:00
Comments: 12 pages, 12 figures
Abstract
VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.
中文标题/摘要
标题:AI能否像艺术史家一样看画?视觉语言模型如何识别艺术风格的解读
视觉语言模型(VLMs)在一系列计算机视觉任务中变得越来越熟练,包括视觉问答和物体检测。这包括在艺术领域的强大能力,从分析艺术品到生成艺术。在计算机科学家与艺术史家的跨学科合作中,我们描述了VLMs预测艺术风格的机制,并评估了它们与艺术史家用来推理艺术风格的标准的一致性程度。我们采用潜在空间分解方法来识别驱动艺术风格预测的概念,并进行定量评估、因果分析和艺术史家的评估。我们的研究发现,73%提取的概念被认为由艺术史家判断具有连贯且语义上有意义的视觉特征,90%用于预测特定艺术品风格的概念被认为相关。在使用不相关概念成功预测风格的情况下,艺术史家指出了可能的原因;例如,模型可能“理解”概念在更形式化的层面,如明暗对比。
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
Authors: Muhammad Saif Ullah Khan, Didier Stricker
First: 2026-02-24T11:31:20+00:00 · Latest: 2026-03-11T17:44:41+00:00
Comments: Camera-ready version
Abstract
Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.
中文标题/摘要
标题:SIMSPINE:一种生物力学感知的3D脊柱运动注释和基准框架
脊柱运动建模是理解人类生物力学的基础,但由于脊柱复杂的多关节运动学和缺乏大规模3D注释,这一领域在计算机视觉中的研究仍然不足。我们提出了一种生物力学感知的关键点模拟框架,该框架利用肌肉骨骼建模从解剖学上一致地为现有的人体姿态数据集添加3D脊柱关键点。利用该框架,我们创建了第一个开放数据集SIMSPINE,该数据集提供了自然全身运动在室内多摄像头捕捉下的稀疏椎体级别3D脊柱注释,无需外部约束。该数据集包含214万帧,使得可以从细微姿态变化中学习椎体运动学,并弥合了肌肉骨骼模拟与计算机视觉之间的差距。此外,我们还发布了预训练基准模型,包括微调的2D检测器、单目3D姿态提升模型和多视图重建管道,建立了生物力学有效的脊柱运动估计的统一基准。具体而言,我们的2D脊柱基准模型在受控环境中将最先进的AUC从0.63提高到0.80,在野外脊柱跟踪中将AP从0.91提高到0.93。通过该模拟框架和SIMSPINE数据集,我们推进了基于视觉的生物力学、运动分析和数字人体建模的研究,使其能够在自然条件下实现可重复的、解剖学基础的3D脊柱估计。
Summary / 总结
The research aims to enhance the understanding of human biomechanics through 3D spine motion annotation, addressing the lack of large-scale 3D annotations for the spine due to its complex kinematics. The authors developed a biomechanics-aware keypoint simulation framework to augment existing human pose datasets with anatomically consistent 3D spinal keypoints. This led to the creation of the SIMSPINE dataset, which includes 2.14 million frames of vertebra-level 3D spinal annotations for natural full-body motions. The dataset and pretrained baselines establish a benchmark for spine motion estimation, improving state-of-the-art performance in both controlled and real-world environments.
研究旨在通过提供一个生物力学感知的关键点模拟框架来改善对人类生物力学的理解。方法是将现有的人体姿态数据集与解剖学一致的3D脊椎关键点相结合。关键实验发现包括创建SIMSPINE数据集,包含214万帧自然全身运动的椎体级3D脊椎注释,并发布预训练基线,这些基线提高了脊椎运动估计,AUC和AP分数分别从0.63提高到0.80和从0.91提高到0.93。
Pixel Motion Diffusion is What We Need for Robot Control
Authors: E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
Venue: CVPR 2026
First: 2025-09-26T17:59:59+00:00 · Latest: 2026-03-11T17:37:15+00:00
Comments: Accepted to CVPR 2026. Project page: https://eronguyen.github.io/DAWN
Abstract
We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/
中文标题/摘要
标题:像素运动扩散是机器人控制所需
我们提出了DAWN(扩散是机器人控制所需的一切),这是一种统一的基于扩散的框架,用于语言条件下的机器人操作,通过结构化的像素运动表示将高层运动意图与低层机器人动作连接起来。在DAWN中,高层和低层控制器都被建模为扩散过程,从而形成一个完全可训练、端到端的系统,具有可解释的中间运动抽象。DAWN在具有挑战性的CALVIN基准测试中取得了最先进的成果,展示了其强大的多任务性能,并进一步在MetaWorld中验证了其有效性。尽管模拟与现实之间存在巨大的领域差距,并且缺乏实际数据,我们仅通过少量微调就展示了可靠的现实世界转移,这表明基于扩散的运动抽象在机器人控制中的实用可行性。我们的结果表明,将扩散建模与以运动为中心的表示相结合,作为可扩展和鲁棒机器人学习的强基线的有效性。
Summary / 总结
DAWN is a unified diffusion-based framework for language-conditioned robotic manipulation, which models both high-level and low-level controllers as diffusion processes to achieve interpretable intermediate motion abstractions. It demonstrates state-of-the-art results on the CALVIN benchmark and MetaWorld, showing strong multi-task performance and reliable real-world transfer with minimal fine-tuning despite the domain gap between simulation and reality.
DAWN 是一种统一的基于扩散的框架,用于语言条件下的机器人操作,将高层和低层控制器都建模为扩散过程,以实现一个完全可训练、端到端的系统,并具有可解释的运动抽象。DAWN 在 CALVIN 基准和 MetaWorld 上展示了最先进的性能,并通过最少的微调展示了可靠的现实世界转移,突显了基于扩散的运动抽象在机器人控制中的实用可行性。
Geometric Scaling of Bayesian Inference in LLMs
Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra
First: 2025-12-27T05:29:55+00:00 · Latest: 2026-03-11T17:34:01+00:00
Comments: fixed buggy references
Abstract
Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.
中文标题/摘要
标题:大语言模型中贝叶斯推断的几何缩放
近期研究表明,小型Transformer在受控的“风洞”环境中训练时可以实现精确的贝叶斯推断,并且其训练动力学产生了一种几何基底,即低维值流形和逐渐正交的键,这种基底编码了后验结构。我们研究这种几何特征是否在生产级语言模型中持续存在。在Pythia、Phi-2、Llama-3和Mistral系列中,我们发现最后一层的值表示沿着单一主导轴组织,其位置与预测熵强烈相关,并且领域限制的提示会将这种结构压缩到与合成环境中观察到的相同的低维流形中。为了探究这种几何结构的作用,我们在上下文学习过程中对Pythia-410M的熵对齐轴进行有针对性的干预。移除或扰动这条轴会选择性地破坏局部不确定性几何结构,而匹配的随机轴干预则使其保持不变。然而,这些单层操作并没有使类贝叶斯行为产生成比例的特异性退化,表明该几何结构是不确定性的一种特权读出,而不是单一的计算瓶颈。综上所述,我们的结果表明,现代语言模型保留了在风洞中实现贝叶斯推断的几何基底,并沿着这一基底组织其近似贝叶斯更新。
Summary / 总结
This study explores whether the geometric signature observed in small transformers trained for Bayesian inference persists in larger, production-grade language models. Across various models like Pythia, Phi-2, Llama-3, and Mistral, the research finds that the last-layer value representations align along a dominant axis related to predictive entropy, and domain-restricted prompts collapse this structure into low-dimensional manifolds similar to those seen in synthetic settings. Interventions on this entropy-aligned axis during in-context learning selectively disrupt the local uncertainty geometry, suggesting that this geometry is a privileged readout of uncertainty rather than a singular computational bottleneck.
研究探讨了在小型Transformer的贝叶斯推断训练中观察到的几何特征,是否在更大规模的生产级语言模型中依然存在。研究人员在Pythia、Phi-2、Llama-3和Mistral等模型中发现,最后一层的值表示沿着与预测熵相关的主导轴排列,领域限制的提示会将这种结构压缩为与合成环境中类似的低维流形。在上下文学习过程中对这条轴进行干预会选择性地破坏局部不确定性几何结构,表明这种几何结构是不确定性的一种特权读出,而不是单一的计算瓶颈。
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Authors: Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-10T10:31:58+00:00 · Latest: 2026-03-11T17:27:13+00:00
Comments: Accepted to ICLR 2026
Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8× faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
中文标题/摘要
标题:去冗存精,协同重要性多样性:VLMs中的视觉标记压缩
视觉语言模型(VLMs)因视觉标记过度生成面临显著的计算效率问题。尽管先前工作表明大量视觉标记是冗余的,但现有压缩方法难以在重要性保存和信息多样性之间取得平衡。为解决这一问题,我们提出了一种名为PruneSID的无训练协同重要性多样性方法,其包含两阶段管道:(1)主语义成分分析(PSCA)用于将标记聚类为语义一致的组,确保全面的概念覆盖;(2)组内非最大抑制(NMS)用于去除冗余标记同时保留每个组内的关键代表性标记。此外,PruneSID还引入了一种基于图像复杂性的信息感知动态压缩比机制,根据图像复杂性优化标记压缩率,从而在多种场景中实现更有效的平均信息保存。大量实验表明,PruneSID在LLaVA-1.5上达到96.3%的准确率,仅保留11.1%的标记,并在LLaVA-NeXT上以5.6%的极端压缩率实现92.8%的准确率,相比先前方法提高了2.5%,且预填充速度比原模型快7.8倍。我们的框架适用于多种VLMs和图像、视频模态,展示了强大的跨模态通用性。代码可在https://github.com/ZhengyaoFang/PruneSID获取。
Summary / 总结
The research aims to address the computational inefficiencies in vision-language models (VLMs) due to redundant visual tokens. PruneSID, a training-free method, uses a two-stage pipeline: Principal Semantic Components Analysis (PSCA) for clustering tokens and Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key ones. It also includes an information-aware dynamic compression ratio mechanism. Experiments show that PruneSID achieves high accuracy with significant token reduction, outperforming previous methods and offering faster prefilling speeds.
论文针对视觉语言模型(VLMs)因冗余视觉标记而导致的计算效率低下问题,提出了一种名为PruneSID的无训练方法,该方法采用两阶段管道:主语义成分分析(PSCA)进行标记聚类和组内非最大抑制(NMS)进行冗余标记的修剪,同时保留关键标记。该方法还包含一种基于信息的动态压缩率机制。实验表明,PruneSID即使在极端压缩率下也能保持高准确性,优于先前的方法,并提供更快的预填充速度。该框架在不同VLMs和模态下具有很强的通用性。
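The second stage described above (intra-group NMS) can be illustrated with a minimal sketch. It is not the authors' implementation: the group assignment is passed in as a stand-in for PSCA, and the importance scores, the cosine-similarity criterion, and the `sim_threshold` parameter are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def intra_group_nms(tokens, scores, groups, sim_threshold=0.9):
    """Return sorted indices of the tokens kept after per-group suppression.

    tokens: list of feature vectors
    scores: hypothetical importance score per token
    groups: group id per token (stand-in for the PSCA clustering stage)
    """
    by_group = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)

    kept = []
    for members in by_group.values():
        # Visit tokens from most to least important within the group.
        members.sort(key=lambda i: scores[i], reverse=True)
        survivors = []
        for i in members:
            # Keep a token only if it is not too similar to a kept one.
            if all(cosine(tokens[i], tokens[j]) < sim_threshold for j in survivors):
                survivors.append(i)
        kept.extend(survivors)
    return sorted(kept)
```

Because suppression runs inside each group separately, every group keeps at least its top-scoring token, which is one simple way to retain coverage of each concept while dropping near-duplicates.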
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra
First: 2025-12-27T05:31:44+00:00 · Latest: 2026-03-11T17:25:35+00:00
Comments: fixed buggy references
Abstract
Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = α_{ij}\bigl(b_{ij}-\mathbb{E}_{α_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ Δv_j = -η\sum_i α_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $α_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
Summary / 总结
The paper aims to elucidate the mechanisms by which gradient-based learning shapes the internal geometry of transformer attention heads under cross-entropy training. The authors derive an advantage-based routing law for attention scores and a responsibility-weighted update for value vectors. These dynamics create a positive feedback loop in which routing and content specialize together, resembling a two-timescale EM procedure. Controlled simulations demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the Bayesian manifolds needed for probabilistic reasoning, giving a unified picture of optimization, geometry, and function in transformers.
论文研究了梯度学习如何塑造transformer中的注意力机制,提供了交叉熵训练的一阶分析。关键发现包括基于优势的路由法则和责任加权更新法则。这些动态形成了一个正反馈循环,使得路由和内容共同专业化,类似于两阶段EM过程。实验结果表明,这些梯度动态不仅最小化了交叉熵,还创建了Bayesian推理所需的几何结构,统一了优化、几何和功能。
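The routing law quoted in the abstract can be checked numerically for a single query. The sketch below compares the closed form α_j (b_j − E_α[b]), with b_j = uᵀv_j, against a central finite difference. The squared-error loss is an assumption chosen for simplicity (the paper studies cross-entropy); the law itself depends only on the upstream gradient u = ∂L/∂o, so the check works the same way for any differentiable loss.

```python
import math


def softmax(s):
    m = max(s)
    e = [math.exp(x - m) for x in s]
    z = sum(e)
    return [x / z for x in e]


def attend(s, values):
    """Attention output o = sum_j alpha_j * v_j for one query."""
    a = softmax(s)
    dim = len(values[0])
    return [sum(a[j] * values[j][d] for j in range(len(s))) for d in range(dim)]


def loss(s, values, target):
    """Stand-in loss: 0.5 * ||o - target||^2 (an assumption, not cross-entropy)."""
    o = attend(s, values)
    return 0.5 * sum((o[d] - target[d]) ** 2 for d in range(len(o)))


def routing_law_grad(s, values, target):
    """Closed form: dL/ds_j = alpha_j * (b_j - E_alpha[b]), with b_j = u . v_j."""
    a = softmax(s)
    o = attend(s, values)
    u = [o[d] - target[d] for d in range(len(o))]  # upstream gradient dL/do
    b = [sum(u[d] * v[d] for d in range(len(u))) for v in values]
    mean_b = sum(a[j] * b[j] for j in range(len(s)))
    return [a[j] * (b[j] - mean_b) for j in range(len(s))]


def finite_diff_grad(s, values, target, eps=1e-6):
    """Central finite-difference estimate of dL/ds_j."""
    grads = []
    for j in range(len(s)):
        up = list(s)
        dn = list(s)
        up[j] += eps
        dn[j] -= eps
        grads.append((loss(up, values, target) - loss(dn, values, target)) / (2 * eps))
    return grads
```

The closed-form gradients sum to zero across keys, reflecting that softmax is invariant to adding a constant to all scores: a score is pushed in one direction only relative to the attention-weighted average of b.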
Factorized Neural Implicit DMD for Parametric Dynamics
Authors: Siyuan Chen, Zhecheng Wang, Yixin Chen, Yue Chang, Peter Yichen Chen, Eitan Grinspun, Jonathan Panuelos
First: 2026-03-11T17:20:34+00:00 · Latest: 2026-03-11T17:20:34+00:00
Abstract
A data-driven, model-free approach to modeling the temporal evolution of physical systems mitigates the need for explicit knowledge of the governing equations. Even when physical priors such as partial differential equations are available, such systems often reside in high-dimensional state spaces and exhibit nonlinear dynamics, making traditional numerical solvers computationally expensive and ill-suited for real-time analysis and control. Consider the problem of learning a parametric flow of a dynamical system: with an initial field and a set of physical parameters, we aim to predict the system's evolution over time in a way that supports long-horizon rollouts, generalization to unseen parameters, and spectral analysis. We propose a physics-coded neural field parameterization of the Koopman operator's spectral decomposition. Unlike a physics-constrained neural field, which fits a single solution surface, and neural operators, which directly approximate the solution operator at fixed time horizons, our model learns a factorized flow operator that decouples spatial modes and temporal evolution. This structure exposes underlying eigenvalues, modes, and stability of the underlying physical process to enable stable long-term rollouts, interpolation across parameter spaces, and spectral analysis. We demonstrate the efficacy of our method on a range of dynamics problems, showcasing its ability to accurately predict complex spatiotemporal phenomena while providing insights into the system's dynamic behavior.
中文标题/摘要
标题:因子化神经隐式DMD参数动力学
一种基于数据、无模型的方法用于建模物理系统的时间演化,减少了对显式掌握支配方程的需求。即使在有物理先验如偏微分方程的情况下,这些系统通常存在于高维状态空间中并表现出非线性动力学,使得传统数值求解器在计算上昂贵且不适合实时分析和控制。考虑学习动力系统参数流的问题:给定初始场和一组物理参数,我们旨在以支持长时序滚动、对未见参数的泛化和频谱分析的方式预测系统的演化。 我们提出了一种物理编码的神经场参数化Koopman算子的频谱分解。与物理约束的神经场不同,后者拟合单一解表面,而神经算子直接在固定时间区间内近似解算子,我们的模型学习一个因子化的流算子,将空间模式和时间演化解耦。这种结构揭示了潜在的特征值、模式和物理过程的稳定性,以实现稳定的长期滚动、参数空间的插值和频谱分析。我们在一系列动力学问题上展示了该方法的有效性,展示了其准确预测复杂时空现象并提供系统动态行为见解的能力。
Summary / 总结
The research aims to develop a data-driven method for modeling the temporal evolution of physical systems without requiring explicit knowledge of governing equations. The proposed method, Factorized Neural Implicit DMD, learns a factorized flow operator that decouples spatial modes and temporal evolution, enabling spectral analysis and stable long-term predictions. Key experimental findings include accurate predictions of complex spatiotemporal phenomena and the ability to generalize to unseen parameters.
研究旨在开发一种无需显式了解控制方程的数据驱动方法来建模物理系统的时态演化。方法提出了一种物理编码的神经场参数化Koopman算子的谱分解,将空间模式和时间演化解耦,从而实现长期稳定预测和谱分析。关键实验发现表明,所提出的方法能够准确预测复杂的时空现象,并提供系统动态行为的见解。
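The factorization into spatial modes and temporal evolution can be illustrated with the standard DMD-style expansion u(x, t) = Σ_k b_k φ_k(x) exp(λ_k t). In the sketch below the modes, eigenvalues, and amplitudes are fixed lists supplied by the caller; in the paper they come from a neural field conditioned on physical parameters, so everything here (function names included) is a hypothetical stand-in.

```python
import cmath


def rollout(modes, eigenvalues, amplitudes, t):
    """Evaluate u(x, t) = sum_k b_k * phi_k(x) * exp(lambda_k * t) on sampled x.

    modes: list of spatial modes, each a list of values at the sample points
    eigenvalues: continuous-time eigenvalues lambda_k (may be complex)
    amplitudes: mode amplitudes b_k
    """
    n_points = len(modes[0])
    field = [0j] * n_points
    for phi, lam, b in zip(modes, eigenvalues, amplitudes):
        temporal = b * cmath.exp(lam * t)  # time dependence is independent of x
        for i in range(n_points):
            field[i] += temporal * phi[i]
    return [z.real for z in field]


def is_stable(eigenvalues):
    """A rollout cannot grow without bound if no eigenvalue has positive real part."""
    return all(complex(lam).real <= 0.0 for lam in eigenvalues)
```

Since time enters only through exp(λ_k t), the state at any horizon is evaluated directly rather than by stepping, and the real parts of the eigenvalues indicate whether a long rollout stays bounded.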
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-11T17:18:12+00:00 · Latest: 2026-03-11T17:18:12+00:00
Comments: accepted by CVPR2026
Abstract
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.
中文标题/摘要
标题:太生动以至于不真实?生成色彩保真度基准测试与校准
近年来,文本到图像(T2I)生成技术在视觉质量方面取得了显著进步,但生成出与现实世界摄影视觉上真实的照片仍然具有挑战性。这在一定程度上是由于现有评估范式的偏见:人类评分和偏好训练的度量标准往往偏好视觉上过于鲜艳、饱和度和对比度夸张的图像,即使在要求生成现实风格图像时,生成的图像也往往过于生动而不真实。为解决这一问题,我们提出了色彩保真度数据集(CFD)和色彩保真度度量(CFM),用于客观评估现实风格生成中的色彩保真度。CFD包含超过130万张真实和合成图像,具有不同程度的色彩现实性,而CFM采用多模态编码器学习感知色彩保真度。此外,我们提出了一种无需训练的色彩保真度精炼(CFR),它能够自适应地调节生成中的空间-时间指导尺度,从而增强色彩的真实性。结合使用,CFD支持CFM进行评估,其学习到的注意力进一步引导CFR精炼T2I保真度,形成一个逐步框架,用于评估和改进现实风格T2I生成中的色彩保真度。数据集和代码可在https://github.com/ZhengyaoFang/CFM/获取。
Summary / 总结
The paper addresses the challenge of generating images that appear visually authentic by introducing the Color Fidelity Dataset (CFD) and the Color Fidelity Metric (CFM). CFD consists of over 1.3 million real and synthetic images with varying levels of color realism, while CFM uses a multimodal encoder to assess color fidelity. Additionally, a training-free Color Fidelity Refinement (CFR) method is proposed to enhance color authenticity in generated images, forming a progressive framework for evaluating and improving color fidelity in text-to-image generation.
研究旨在通过解决图像过于生动的问题,提高文本到图像生成的视觉真实性。引入了色彩保真度数据集(CFD)和色彩保真度度量(CFM)以客观评估色彩保真度,并提出了一种无需训练的色彩保真度精炼(CFR)方法来增强色彩的真实性。CFM使用多模态编码器学习感知色彩保真度,而CFR通过自适应调节空间-时间指导尺度来精炼T2I保真度,形成一个逐步框架以评估和提高现实风格T2I生成的色彩保真度。
MCMC Informed Neural Emulators for Uncertainty Quantification in Dynamical Systems
Authors: Heikki Haario, Zhi-Song Liu, Martin Simon, Hendrik Weichel
First: 2026-03-11T17:16:48+00:00 · Latest: 2026-03-11T17:16:48+00:00
Abstract
Neural networks are a commonly used approach to replace physical models with computationally cheap surrogates. Parametric uncertainty quantification can be included in training, assuming that an accurate prior distribution of the model parameters is available. Here we study the common opposite situation, where direct screening or random sampling of model parameters leads to exhaustive training times and evaluations at unphysical parameter values. Our solution is to decouple uncertainty quantification from network architecture. Instead of sampling network weights, we introduce the model-parameter distribution as an input to network training via Markov chain Monte Carlo (MCMC). In this way, the surrogate achieves the same uncertainty quantification as the underlying physical model, but with substantially reduced computation time. The approach is fully agnostic with respect to the neural network choice. In our examples, we present a quantile emulator for prediction and a novel autoencoder-based ODE network emulator that can flexibly estimate different trajectory paths corresponding to different ODE model parameters. Moreover, we present a mathematical analysis that provides a transparent way to relate potential performance loss to measurable distribution mismatch.
中文标题/摘要
标题:基于MCMC的神经模拟器用于动力系统中的不确定性量化
神经网络常被用来替代物理模型,用以提供计算成本较低的替代方案。可以在训练中包含参数不确定性量化,前提是需要有准确的模型参数先验分布。然而,在本文中,我们研究了相反的情况,即直接筛选或随机采样模型参数会导致训练时间耗尽,并且在不合理的参数值上进行评估。我们的解决方案是将不确定性量化与网络架构解耦。我们不是采样网络权重,而是通过马尔可夫链蒙特卡洛(MCMC)将模型参数分布作为网络训练的输入。这样,模拟器可以像底层物理模型一样实现相同的不确定性量化,但计算时间大大减少。该方法对神经网络的选择完全无偏见。在我们的示例中,我们提供了一个分位数模拟器用于预测,并提出了一种基于自编码器的新型ODE网络模拟器,可以灵活地估计不同ODE模型参数对应的不同轨迹路径。此外,我们还提供了一种数学分析,以透明的方式将潜在的性能损失与可测量的分布不匹配联系起来。
Summary / 总结
This paper addresses the challenge of parametric uncertainty quantification in dynamical systems by proposing a method that decouples uncertainty quantification from the neural network architecture. It uses Markov chain Monte Carlo (MCMC) to introduce the model-parameter distribution as an input during training, allowing for efficient surrogate modeling without exhaustive sampling. The approach is agnostic to the choice of neural network and demonstrates its effectiveness through a quantile emulator and an autoencoder-based ODE network emulator, which can estimate different trajectory paths corresponding to various ODE model parameters. The analysis also provides a way to quantify potential performance loss due to distribution mismatch.
该研究通过将参数不确定性量化与神经网络架构解耦来解决动力系统中的参数不确定性量化问题。通过使用马尔可夫链蒙特卡洛(MCMC)将模型参数分布作为输入,这种方法能够在保持与物理模型相同不确定性量化精度的同时,大幅减少计算时间。该方法对神经网络的选择是无偏的,并在分位数模拟和ODE网络模拟中展示了有效性,还提供了一种透明的方式来关联性能损失与分布不匹配。
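As a rough illustration of feeding an MCMC-derived parameter distribution into emulator training, the sketch below draws parameter samples with a random-walk Metropolis sampler. The sampler variant, the one-dimensional parameter, and the function name `metropolis` are assumptions for illustration; the abstract does not specify which MCMC algorithm the authors use.

```python
import math
import random


def metropolis(log_post, theta0, n_samples, step=0.5, seed=0):
    """Random-walk Metropolis sampler for a scalar parameter.

    log_post: function returning the unnormalized log posterior
    theta0: starting value (should have finite log posterior)
    """
    rng = random.Random(seed)
    theta = theta0
    lp = log_post(theta)
    chain = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = proposal, lp_prop
        chain.append(theta)
    return chain
```

The resulting chain can then serve as the set of parameter values at which the physical model is evaluated to build training pairs, so the surrogate is fitted where the posterior has mass and proposals at unphysical values (log posterior of minus infinity) are never accepted.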
Federated Learning-driven Beam Management in LEO 6G Non-Terrestrial Networks
Authors: Maria Lamprini Bartsioka, Ioannis A. Bartsiokas, Athanasios D. Panagopoulos, Dimitra I. Kaklamani, Iakovos S. Venieris
First: 2026-03-11T17:13:41+00:00 · Latest: 2026-03-11T17:13:41+00:00
Comments: 2 pages with 2 figures and 1 table. Accepted in 2026 International Applied Computational Electromagnetics Society (ACES) Symposium
Abstract
Low Earth Orbit (LEO) Non-Terrestrial Networks (NTNs) require efficient beam management under dynamic propagation conditions. This work investigates Federated Learning (FL)-based beam selection in LEO satellite constellations, where orbital planes operate as distributed learners through the utilization of High-Altitude Platform Stations (HAPS). Two models, a Multi-Layer Perceptron (MLP) and a Graph Neural Network (GNN), are evaluated using realistic channel and beamforming data. Results demonstrate that GNN surpasses MLP in beam prediction accuracy and stability, particularly at low elevation angles, enabling lightweight and intelligent beam management for future NTN deployments.
中文标题/摘要
标题:基于联邦学习的低地球轨道6G非地面网络波束管理
低地球轨道(LEO)非地面网络(NTNs)在动态传播条件下需要高效的波束管理。本研究探讨了基于联邦学习(FL)的波束选择在LEO卫星星座中的应用,其中轨道平面作为分布式学习者通过高海拔平台站(HAPS)的利用。评估了两种模型,多层感知机(MLP)和图神经网络(GNN),使用了现实的信道和波束形成数据。结果表明,GNN在波束预测精度和稳定性方面优于MLP,尤其是在低仰角时,这使得未来NTN部署能够实现轻量级和智能的波束管理。
Summary / 总结
This work explores Federated Learning (FL)-based beam selection in Low Earth Orbit (LEO) Non-Terrestrial Networks (NTNs), where orbital planes act as distributed learners with the help of High-Altitude Platform Stations (HAPS). Two models, an MLP and a GNN, are evaluated on realistic channel and beamforming data. The GNN outperforms the MLP in beam prediction accuracy and stability, especially at low elevation angles, enabling lightweight and intelligent beam management for future NTN deployments.
该研究通过使用联邦学习(FL)进行波束选择,以解决LEO NTN系统中波束管理的效率问题。使用真实数据评估了两种模型MLP和GNN,结果显示GNN在波束预测精度和稳定性方面优于MLP,特别是在低仰角情况下,有助于未来NTN部署的轻量级和智能化波束管理。
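The aggregation step in such a federated setup can be sketched with sample-weighted averaging (FedAvg). This is an assumption: the abstract does not name the aggregation rule, and treating each orbital plane as one client with a flat weight vector is a simplification for illustration.

```python
def fed_avg(client_weights, client_sizes):
    """Sample-weighted average of client model weights (FedAvg-style).

    client_weights: one flat list of floats per client (e.g. per orbital plane)
    client_sizes: number of local training samples per client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]
```

Only model weights are exchanged in this scheme; the local channel and beamforming data never leave the client.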
Resource Allocation in Hybrid Radio-Optical IoT Networks using GNN with Multi-task Learning
Authors: Aymen Hamrouni, Sofie Pollin, Hazem Sallouha
First: 2025-10-29T15:02:28+00:00 · Latest: 2026-03-11T17:11:34+00:00
Comments: Accepted for publication in IEEE Transactions on Machine Learning in Communications and Networking (TMLCN). 20 pages, 17 figures, 3 tables
Abstract
This paper addresses the problem of dual-technology scheduling in hybrid Internet-of-Things (IoT) networks that integrate Optical Wireless Communication (OWC) with Radio Frequency (RF). We first present an optimization formulation that jointly maximizes throughput and minimizes delivery-based Age of Information (AoI) between access points and IoT nodes under energy and link availability constraints. However, solving such NP-hard problems at scale is computationally intractable and typically assumes full channel observability, which is impractical in real deployments. To address this challenge, we propose the Dual-Graph Embedding with Transformer (DGET) framework, a supervised multi-task learning architecture that combines a two-stage Graph Neural Network (GNN) with a Transformer encoder. The first stage employs a transductive GNN to encode the known graph topology together with initial node and link states, such as energy levels, available links, and queued transmissions. The second stage introduces an inductive GNN for temporal refinement, enabling the model to generalize these embeddings to evolving network states while capturing variations in energy and queue dynamics over time through a consistency loss. The resulting embeddings are then processed by a Transformer-based classifier that models cross-link dependencies using multi-head self-attention. Simulation results show that hybrid RF-OWC networks outperform standalone RF systems by supporting higher traffic loads and reducing AoI by up to 20% while maintaining comparable energy consumption. Compared with optimization-based methods, the proposed DGET framework achieves near-optimal scheduling with over 90% classification accuracy, lower computational complexity, and improved robustness under partial channel observability.
中文标题/摘要
标题:混合无线光通信物联网网络中基于多任务学习的图神经网络资源分配
本文探讨了将光无线通信(OWC)与射频(RF)结合的混合物联网(IoT)网络中的双技术调度问题。我们首先提出了一种优化公式,旨在在满足能量和链路可用性约束的情况下,同时最大化吞吐量并最小化接入点与物联网节点之间的基于交付的信息时延(AoI)。然而,解决此类NP难问题在大规模情况下是计算上不可行的,通常假设完全的信道可观测性,这在实际部署中是不切实际的。为解决这一挑战,我们提出了双图嵌入变换器(DGET)框架,这是一种监督多任务学习架构,结合了两阶段图神经网络(GNN)和变换器编码器。第一阶段使用归纳GNN来编码已知的图拓扑结构以及初始节点和链路状态,如能量水平、可用链路和排队传输。第二阶段引入了递归GNN进行时间细化,使模型能够将这些嵌入推广到不断变化的网络状态,同时通过一致性损失捕捉能量和队列动力学随时间的变化。最终嵌入通过基于变换器的分类器进行处理,该分类器使用多头自注意力模型跨链路依赖关系。仿真结果表明,混合RF-OWC网络在支持更高流量负载和减少AoI最多20%的同时,保持了与独立RF系统相当的能耗。与基于优化的方法相比,所提出的DGET框架实现了接近最优的调度,分类准确率超过90%,计算复杂度更低,并且在部分信道可观测性下具有更好的鲁棒性。
Summary / 总结
This paper tackles the challenge of scheduling in hybrid RF-OWC IoT networks by formulating an optimization problem that maximizes throughput and minimizes AoI. To overcome the computational intractability of this problem, the authors propose the DGET framework, which uses a two-stage GNN and Transformer to handle evolving network states and partial channel observability. Simulation results demonstrate that hybrid RF-OWC networks outperform standalone RF systems in terms of traffic load and AoI reduction, while maintaining similar energy consumption. The DGET framework achieves near-optimal scheduling with high accuracy and lower computational complexity compared to optimization-based methods.
本文针对混合RF-OWC物联网网络中的双技术调度问题,通过优化模型最大化吞吐量并最小化AoI,同时考虑能量和链路可用性约束。为克服计算难题和实际限制,作者提出了DGET框架,结合两阶段GNN和Transformer编码器。该模型编码初始网络状态并在时间上进行细化,实现了在部分可观测性下的近最优调度,准确率和鲁棒性较高。仿真结果表明,混合RF-OWC网络在流量负载和AoI减少方面优于单独的RF系统。
Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Authors: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang
Venue: ICLR 2026
First: 2025-06-09T08:11:20+00:00 · Latest: 2026-03-11T17:08:49+00:00
Comments: 12 pages, 5 figures
Abstract
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over 5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
中文标题/摘要
标题:学习强化学习无法掌握的内容:最难问题的交错在线微调
大型语言模型(LLM)推理的最新进展表明,诸如规划和自我反思等复杂行为可以通过强化学习(RL)涌现出来。然而,尽管取得了这些成功,当前形式的RL仍然不足以诱导超出基模型限制的能力,因为它主要基于模型现有知识进行优化,而不是促进新信息的获取。为了解决这一局限性,我们采用监督微调(SFT)来学习RL无法掌握的内容,通过利用高质量的示范数据,可以引入新的知识和推理模式。我们分析了RL和SFT在LLM推理中的训练动态,发现RL在保持和提高模型原有能力范围内的问题性能方面表现出色,而SFT则更有效地使模型能够解决超出当前模型范围的问题。受RL和SFT互补优势的启发,我们提出了一种新的训练方法——ReLIFT(Reinforcement Learning Interleaved with Online Fine-Tuning)。在ReLIFT中,模型主要使用RL进行训练,但在遇到难题时,会收集高质量的解决方案进行微调,并交替进行RL和微调训练,以增强模型的推理能力。ReLIFT在五个竞赛级基准和一个分布外基准上,相对于其他零RL模型,平均提高了超过5.2分。此外,我们证明ReLIFT仅使用13%的详细示范数据就能超越RL和SFT,突显了其可扩展性。这些结果提供了ReLIFT克服RL基本局限性的有力证据,并强调了其巨大的潜力。
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
Authors: Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique
First: 2026-03-11T17:04:30+00:00 · Latest: 2026-03-11T17:04:30+00:00
Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2--7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
中文标题/摘要
标题:GroundCount:通过对象检测实现视觉-语言模型的空间定位以减轻计数幻觉
视觉语言模型(VLMs)在计数任务中表现出持续的幻觉现象,准确率远低于其他视觉推理任务(不包括情感分析)。这一现象在最先进的具有推理能力的VLMs中依然存在。相反,基于CNN的对象检测模型(ODMs)如YOLO在空间定位和实例计数方面表现出色,且计算开销较小。我们提出了一种名为GroundCount的框架,该框架通过从ODMs引入显式的空间定位来增强VLMs,以减轻计数幻觉。在最佳情况下,我们的基于提示的增强策略在性能最佳的模型(Ovis2.5-2B)上实现了81.3%的计数准确率,比基线提高了6.6个百分点,同时通过消除幻觉驱动的推理循环将推理时间减少了22%。我们进行了全面的消融研究,表明位置编码是关键组件,对强模型有利但对弱模型不利。相比之下,置信度分数对大多数架构引入了噪声,其移除在四个模型中提高了性能。我们进一步评估了特征级融合架构,发现通过结构化提示实现的显式符号定位优于隐式的特征融合,尽管使用了复杂的跨注意力机制。我们的方法在四个模型中实现了一致的改进(6.2-7.5个百分点),其中一个模型由于迭代反射机制与结构化提示不兼容而表现出性能下降。这些结果表明,计数失败的根本原因在于空间语义整合的局限性,而不是特定架构的缺陷,同时强调了增强策略中架构兼容性的重要性。
Summary / 总结
GroundCount is a framework that augments Vision Language Models (VLMs) with explicit spatial grounding from object detection models to reduce counting hallucinations. In the best case, its prompt-based augmentation strategy reaches 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a 6.6 percentage point improvement, while cutting inference time by 22%. Ablation studies show that positional encoding is beneficial for stronger models but detrimental for weaker ones, and that confidence scores introduce noise for most architectures. Explicit symbolic grounding via structured prompts outperforms implicit feature-level fusion, and four of the five evaluated VLMs improve by 6.2 to 7.5 percentage points. The results suggest that counting failures arise from spatial-semantic integration limitations rather than architecture-specific deficiencies.
论文通过提出GroundCount框架,将对象检测模型(ODMs)与视觉语言模型(VLMs)结合,以解决VLMs在计数任务中的幻觉问题。通过ODMs的空间定位,该方法在最佳性能模型上实现了81.3%的计数准确率,提高了6.6个百分点,并减少了推理时间。消融研究表明,位置编码对较强模型至关重要,但对较弱模型则不然,而置信度分数通常会引入噪声。研究还发现,通过结构化提示进行显式符号定位优于隐式特征融合,这在五个评估的VLM架构中有四个架构中带来了持续的改进。
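A prompt-based augmentation of this kind can be sketched as follows. The prompt wording, the detection dictionary layout (`label`, `box`), and the function name are hypothetical; only the general idea of passing detector output to the VLM as structured text comes from the abstract. Confidence scores are left out on purpose, following the reported finding that they add noise for most architectures.

```python
from collections import Counter


def build_grounded_prompt(question, detections, include_position=True):
    """Turn detector output into a structured text prefix for a VLM prompt.

    detections: list of dicts with 'label' and 'box' (x0, y0, x1, y1),
                box coordinates normalized to [0, 1] (assumed layout).
    """
    counts = Counter(d["label"] for d in detections)
    lines = ["Detected objects:"]
    for label in sorted(counts):
        lines.append(f"- {label}: {counts[label]}")
    if include_position:
        lines.append("Positions (normalized center x, y):")
        for d in detections:
            x0, y0, x1, y1 = d["box"]
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
            lines.append(f"- {d['label']} at ({cx:.2f}, {cy:.2f})")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

The `include_position` switch mirrors the ablation in the abstract: positional information helped stronger models and hurt weaker ones, so it is worth toggling per model.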
FRIEND: Federated Learning for Joint Optimization of multi-RIS Configuration and Eavesdropper Intelligent Detection in B5G Networks
Authors: Maria Lamprini A. Bartsioka, Ioannis A. Bartsiokas, Anastasios K. Papazafeiropoulos, Maria A. Seimeni, Dimitra I. Kaklamani, Iakovos S. Venieris
First: 2026-03-11T17:02:40+00:00 · Latest: 2026-03-11T17:02:40+00:00
Comments: 8 pages with 5 figures and 2 tables. Accepted in 29th Conference on Innovation in Clouds, Internet and Networks (ICIN 2026)
Abstract
As wireless systems evolve toward Beyond 5G (B5G), the adoption of cell-free (CF) millimeter-wave (mmWave) architectures combined with Reconfigurable Intelligent Surfaces (RIS) is emerging as a key enabler for ultra-reliable, high-capacity, scalable, and secure Industrial Internet of Things (IIoT) communications. However, safeguarding these complex and distributed environments against eavesdropping remains a critical challenge, particularly when conventional security mechanisms struggle to overcome scalability and latency constraints. In this paper, a novel framework for detecting malicious users in RIS-enhanced cell-free mmWave networks using Federated Learning (FL) is presented. The envisioned setup features multiple access points (APs) operating without traditional cell boundaries, assisted by RIS nodes to dynamically shape the wireless propagation environment. Edge devices collaboratively train a Deep Convolutional Neural Network (DCNN) on locally observed Channel State Information (CSI), eliminating the need for raw data exchange. Moreover, an early-exit mechanism is incorporated in that model to jointly satisfy computational complexity requirements. Performance evaluation indicates that the integration of FL and multi-RIS coordination improves the achieved secrecy rate (SR) by approximately 30% compared to baseline non-RIS-assisted methods while maintaining near-optimal detection accuracy levels. This work establishes a distributed, privacy-preserving approach to physical layer eavesdropping detection tailored for next-generation IIoT deployments.
中文标题/摘要
标题:FRIEND: 联邦学习在增强型5G网络中联合优化多RIS配置和窃听智能检测
随着无线系统向增强型5G (B5G) 进化,基于无小区(CF)毫米波(mmWave)架构结合可重构智能表面(RIS)的应用正在成为实现超可靠、高容量、可扩展和安全的工业物联网(IIoT)通信的关键使能器。然而,保护这些复杂且分布式的环境免受窃听仍然是一个关键挑战,尤其是在传统安全机制难以克服可扩展性和延迟约束时。本文提出了一种新的框架,利用联邦学习(FL)在增强型5G网络中的RIS辅助下检测恶意用户。设想的设置包括多个无传统小区边界的接入点(APs),由RIS节点协助动态塑造无线传播环境。边缘设备协作在本地观测的信道状态信息(CSI)上训练深度卷积神经网络(DCNN),消除原始数据交换的需要。此外,该模型中还集成了早期退出机制,以共同满足计算复杂度要求。性能评估表明,将FL与多RIS协调结合使用,与非RIS辅助的基本方法相比,可提高约30%的保密率(SR),同时保持接近最优的检测准确性水平。本文为下一代IIoT部署提供了一种分布式、隐私保护的物理层窃听检测方法。
Summary / 总结
This paper presents a novel framework using Federated Learning (FL) to detect malicious users in RIS-enhanced cell-free mmWave networks. Edge devices collaboratively train a Deep Convolutional Neural Network (DCNN) on local Channel State Information (CSI) without exchanging raw data, and the model incorporates an early-exit mechanism to manage computational complexity. Performance evaluation shows that integrating FL and multi-RIS coordination improves the secrecy rate by approximately 30% compared to baseline non-RIS-assisted methods, while maintaining near-optimal detection accuracy. This approach provides a distributed and privacy-preserving solution for physical layer eavesdropping detection in next-generation IIoT deployments.
本文提出了一种使用联邦学习(FL)检测RIS增强的无小区边界cell-free mmWave网络中恶意用户的新型框架。该框架在本地CSI上协作训练一个深度卷积神经网络(DCNN),无需交换原始数据,并引入了早期退出机制以管理计算复杂性。结果表明,将FL与多RIS协调结合使用可将保密率提高约30%,同时保持接近最优的检测准确性。该方法提供了一种适用于下一代IIoT部署的分布式和隐私保护的物理层窃听检测解决方案。
Mamba Neural Operator: Who Wins? Transformers vs. State-Space Models for PDEs
Authors: Chun-Wun Cheng, Jiahao Huang, Yi Zhang, Guang Yang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
First: 2024-10-03T00:32:31+00:00 · Latest: 2026-03-11T17:00:24+00:00
Comments: Accepted in Journal of Computational Physics 2025
Abstract
Partial differential equations (PDEs) are widely used to model complex physical systems, but solving them efficiently remains a significant challenge. Recently, Transformers have emerged as the preferred architecture for PDEs due to their ability to capture intricate dependencies. However, they struggle with representing continuous dynamics and long-range interactions. To overcome these limitations, we introduce the Mamba Neural Operator (MNO), a novel framework that enhances neural operator-based techniques for solving PDEs. MNO establishes a formal theoretical connection between structured state-space models (SSMs) and neural operators, offering a unified structure that can adapt to diverse architectures, including Transformer-based models. By leveraging the structured design of SSMs, MNO captures long-range dependencies and continuous dynamics more effectively than traditional Transformers. Through extensive analysis, we show that MNO significantly boosts the expressive power and accuracy of neural operators, making it not just a complement but a superior framework for PDE-related tasks, bridging the gap between efficient representation and accurate solution approximation. Our code is available on https://github.com/Math-ML-X/Mamba-Neural-Operator
中文标题/摘要
标题:Mamba神经算子:谁胜出?用于PDE的Transformer与状态空间模型对比
偏微分方程(PDEs)广泛用于建模复杂的物理系统,但高效求解它们仍然是一个重大挑战。最近,Transformer因其能够捕捉复杂的依赖关系而成为PDEs的首选架构。然而,它们在表示连续动力学和长程相互作用方面存在困难。为克服这些限制,我们引入了Mamba神经算子(MNO),这是一种新颖的框架,可以增强基于神经算子的技术来解决PDEs。MNO建立了结构化状态空间模型(SSMs)和神经算子之间的正式理论联系,提供了一种统一的结构,可以适应各种架构,包括基于Transformer的模型。通过利用SSMs的结构化设计,MNO比传统Transformer更有效地捕捉长程依赖关系和连续动力学。通过广泛的分析,我们表明MNO显著增强了神经算子的表达能力和准确性,使其不仅是一种补充,而且是PDE相关任务的优越框架,填补了高效表示和准确解近似之间的差距。我们的代码可在https://github.com/Math-ML-X/Mamba-Neural-Operator获取
Summary / 总结
The paper addresses the challenge of efficiently solving partial differential equations (PDEs) by introducing the Mamba Neural Operator (MNO), which enhances neural operator techniques. MNO integrates structured state-space models (SSMs) with neural operators, improving the ability to capture long-range dependencies and continuous dynamics. Experiments demonstrate that MNO outperforms traditional Transformers in terms of both expressive power and accuracy for PDE-related tasks.
论文通过引入Mamba神经算子(MNO),结合结构化状态空间模型和神经算子的优点,解决了偏微分方程(PDEs)的高效求解问题。MNO增强了对长程依赖性和连续动态的捕捉能力,超越了传统的Transformer。大量实验表明,MNO显著提高了神经算子的准确性和表达能力,使其成为PDE相关任务的更优框架。
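The link between state-space models and continuous dynamics can be illustrated with a scalar linear SSM, x'(t) = a·x(t) + b·u(t), y = c·x, discretized by zero-order hold. This is a deliberately reduced sketch: the parameters are fixed scalars, whereas Mamba-style models use input-dependent, multi-dimensional parameters, and the function names are hypothetical.

```python
import math


def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of x' = a*x + b*u with step dt."""
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b if a != 0.0 else dt * b
    return a_bar, b_bar


def ssm_scan(a, b, c, dt, inputs):
    """Run the recurrence x_k = a_bar*x_{k-1} + b_bar*u_k and emit y_k = c*x_k."""
    a_bar, b_bar = discretize_zoh(a, b, dt)
    x = 0.0
    outputs = []
    for u in inputs:
        x = a_bar * x + b_bar * u
        outputs.append(c * x)
    return outputs
```

For piecewise-constant inputs the zero-order-hold recurrence reproduces the continuous solution exactly at the sample times, which is one reason SSMs are a natural fit for continuous dynamics. The single carried state also gives cost linear in sequence length.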
ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging
Authors: Athanasios Angelakis
First: 2026-02-20T01:38:59+00:00 · Latest: 2026-03-11T16:56:59+00:00
Comments: 24 pages, 15 figures, 5 tables. Code and models available at https://github.com/Bluesman79/ZACH-ViT
Abstract
Vision Transformers rely on positional embeddings and class tokens encoding fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token, achieving permutation-invariant patch processing via global average pooling. Zero-token denotes removal of the dedicated aggregation token and positional encodings. Patch tokens remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT across seven MedMNIST datasets under a strict few-shot protocol (50 samples/class, fixed hyperparameters, five seeds). Results reveal regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves strongest advantage on BloodMNIST and remains competitive on PathMNIST, while relative advantage decreases on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST), consistent with our hypothesis. Component and pooling ablations show positional support becomes mildly beneficial as spatial structure increases, whereas reintroducing a [CLS] token is consistently unfavorable. These findings support that architectural alignment with data structure can outweigh universal benchmark dominance. Despite minimal size and no pretraining, ZACH-ViT achieves competitive performance under data-scarce conditions, relevant for compact medical imaging and low-resource settings. Code: https://github.com/Bluesman79/ZACH-ViT
中文标题/摘要
标题:ZACH-ViT:紧凑型视觉变换器在医学成像中的依赖于范式的归纳偏见
视觉变换器依赖于位置嵌入和类标记编码固定的空间先验。虽然对于自然图像有效,但在医学成像中,当空间布局信息较弱时,这些先验可能并不理想。我们引入了ZACH-ViT(Zero-token Adaptive Compact Hierarchical Vision Transformer),这是一种紧凑型视觉变换器,去除了位置嵌入和[CLS]标记,并通过全局平均池化实现置换不变的图像块处理。Zero-token表示去除了专用聚合标记和位置编码,块标记保持不变。自适应残差投影在严格参数约束下保持训练稳定性。我们在严格的少样本协议(每类50个样本,固定超参数,五个随机种子)下,对ZACH-ViT在七个MedMNIST数据集上进行了评估。结果表明,其表现依赖于数据情形:ZACH-ViT(0.25M参数,从头开始训练)在BloodMNIST上优势最为明显,在PathMNIST上保持竞争力,但在具有更强解剖先验的数据集(OCTMNIST,OrganAMNIST)上相对优势下降,这与我们的假设一致。组件和池化消融实验表明,随着空间结构的增强,位置信息的支持变得略微有益,而重新引入[CLS]标记始终是不利的。这些发现支持了架构与数据结构的对齐可以比在通用基准上的全面领先更重要的观点。尽管ZACH-ViT体积极小且未进行预训练,但在数据稀缺条件下仍能取得有竞争力的性能,对于紧凑型医学成像和低资源环境具有重要意义。代码:https://github.com/Bluesman79/ZACH-ViT
Summary / 总结
ZACH-ViT is a compact Vision Transformer that removes positional embeddings and class tokens, focusing on permutation-invariant patch processing via global average pooling. Evaluated on seven MedMNIST datasets, ZACH-ViT shows regime-dependent performance, excelling on BloodMNIST and remaining competitive on PathMNIST but with decreasing advantage on datasets with strong anatomical priors. This aligns with the hypothesis that architectural alignment with data structure can outweigh universal benchmark dominance, making ZACH-ViT suitable for compact medical imaging in low-resource settings.
ZACH-ViT 是一种紧凑的 Vision Transformer,移除了位置嵌入和类标记,通过全局平均池化实现置换不变的 patch 处理。在七个 MedMNIST 数据集上的评估显示,ZACH-ViT 的表现依赖于数据集特性:在 BloodMNIST 上表现出色,在 PathMNIST 上保持竞争力,但在具有强烈解剖先验的数据集上优势减弱。这与假设一致,即架构与数据结构的对齐可以比在通用基准上的全面领先更重要,使 ZACH-ViT 适用于低资源环境下的紧凑医疗成像。
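The permutation-invariance claim can be illustrated with a minimal sketch. It assumes patch tokens are plain lists of floats and uses a hypothetical `global_average_pool` helper; it is not the authors' implementation, only a demonstration that mean pooling over tokens that carry no positional information ignores patch order.

```python
import random


def global_average_pool(tokens):
    """Average a list of equal-length patch embeddings into one vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


# Hypothetical patch embeddings: 4 patches, 3 features each.
patches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]

# Shuffle the patch order with a fixed seed so the run is repeatable.
shuffled = patches[:]
random.Random(0).shuffle(shuffled)

pooled = global_average_pool(patches)
pooled_shuffled = global_average_pool(shuffled)
```

Both pooled vectors are identical, whereas a [CLS] token combined with positional embeddings would in general produce an order-dependent summary.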
Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
Authors: Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao
First: 2026-03-11T16:55:49+00:00 · Latest: 2026-03-11T16:55:49+00:00
Comments: 16 pages
Abstract
Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. Project page is https://contact-coverage-guided-exploration.github.io.
中文标题/摘要
标题:接触覆盖引导探索在通用灵巧操作中的应用
深度强化学习(DRL)在具有明确奖励结构的领域取得了显著成功,如Atari游戏和运动控制。相比之下,灵巧操作缺乏通用的奖励形式,通常依赖于特定任务的手工设计先验来指导手-物体交互。我们提出了一种名为接触覆盖引导探索(CCGE)的一般探索方法,旨在用于通用灵巧操作任务。CCGE将接触状态表示为物体表面点与预定义的手指关键点之间的交集,鼓励灵巧的手指发现多样且新颖的接触模式,即哪些手指接触哪些物体区域。它通过学习哈希码获得离散化的物体状态来维护一个接触计数器,捕捉每个手指与不同物体区域交互的频率。该计数器以两种互补的方式利用:(1)基于计数的接触覆盖奖励,促进探索新颖的接触模式;(2)能量导向的接近奖励,引导智能体向未充分探索的接触区域移动。我们在包括杂乱物体分离、受限物体检索、手中重新定向和双臂操作在内的多种灵巧操作任务上评估了CCGE。实验结果表明,CCGE在现有探索方法上显著提高了训练效率和成功率,并且通过CCGE学习到的接触模式能够稳健地转移到实际的机器人系统中。项目页面为https://contact-coverage-guided-exploration.github.io。
Summary / 总结
The research aims to address the challenge of general-purpose dexterous manipulation by proposing Contact Coverage-Guided Exploration (CCGE), which uses a novel method to explore diverse contact patterns between hands and objects. CCGE represents contact states and uses a contact counter to encourage the discovery of novel contact patterns and under-explored regions. The method improves training efficiency and success rates in various manipulation tasks and transfers well to real-world robotic systems.
研究旨在通过提出接触覆盖引导探索(CCGE)方法解决通用灵巧操作的挑战,该方法鼓励发现多样化的接触模式。CCGE通过表示接触状态并使用接触计数器跟踪手指与不同物体区域的交互频率,提供基于接触覆盖的探索奖励和基于能量的接近奖励来引导智能体探索未探索的区域。在各种操作任务上的实验表明,CCGE在训练效率和成功率方面显著优于现有方法,并且学习到的接触模式能够很好地转移到实际的机器人系统中。
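The count-based contact coverage reward described above can be sketched as follows. This is a minimal illustration, not the authors' method: the `ContactCounter` class, the integer `state_code` standing in for a learned hash code, and the `1 / sqrt(count)` bonus are all assumptions (the abstract does not give the reward formula; the inverse square root is a common choice for count-based exploration bonuses).

```python
import math
from collections import defaultdict


class ContactCounter:
    """Counts (object-state code, finger, object region) contact events."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update_and_reward(self, state_code, contacts):
        """Update counts for this step's contacts and return a novelty bonus.

        contacts: iterable of (finger_id, region_id) pairs currently in contact.
        """
        reward = 0.0
        for finger_id, region_id in contacts:
            key = (state_code, finger_id, region_id)
            self.counts[key] += 1
            # Assumed bonus form: rarely seen contacts earn more.
            reward += 1.0 / math.sqrt(self.counts[key])
        return reward


counter = ContactCounter()
first = counter.update_and_reward(state_code=3, contacts=[(0, 12)])
repeat = counter.update_and_reward(state_code=3, contacts=[(0, 12)])
novel = counter.update_and_reward(state_code=3, contacts=[(1, 12)])
```

Repeating the same finger-region contact yields a smaller bonus than touching the same region with a different finger, which is the pressure toward diverse contact patterns that the abstract describes.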
TOSSS: a CVE-based Software Security Benchmark for Large Language Models
Authors: Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen
First: 2026-03-11T16:54:01+00:00 · Latest: 2026-03-11T16:54:01+00:00
Abstract
With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
中文标题/摘要
标题:TOSSS:基于CVE的软件安全基准测试用于大型语言模型
随着其能力的不断增强,大型语言模型(LLMs)现在被广泛应用于许多行业。它们已成为软件工程师的有用工具,并支持广泛的开发任务。随着LLMs在软件开发工作流程中的应用越来越广泛,一个关键问题出现了:LLMs在软件安全方面表现如何?与此同时,世界各国组织都在大力投资网络安全,以减少受到破坏性攻击的暴露。将LLMs集成到软件工程工作流程中可能会引入新的漏洞,削弱现有的安全努力。 我们提出了TOSSS(Two-Option Secure Snippet Selection),这是一个基准测试,用于衡量LLMs在选择安全代码片段和易受攻击代码片段之间的能力。现有的针对LLMs的安全基准测试仅涵盖有限范围的漏洞。相比之下,TOSSS依赖于CVE数据库,并提供了一个可扩展的框架,可以随着时间的推移整合新披露的漏洞。我们的基准测试根据模型的行为给每个模型一个0到1之间的安全评分;得分为1表示模型总是选择安全的代码片段,得分为0表示它总是选择易受攻击的代码片段。我们在C/C++和Java代码上评估了14个广泛使用的开源和闭源模型,并观察到评分范围从0.48到0.89。LLMs提供商已经发布了许多模型的基准测试评分,TOSSS可以成为这些报告中的一个补充的安全重点评分。
Summary / 总结
The paper introduces TOSSS, a benchmark for evaluating Large Language Models (LLMs) in software security by measuring their ability to choose between secure and vulnerable code snippets. Leveraging the CVE database, TOSSS provides an extensible framework for integrating new vulnerabilities. Evaluating 14 LLMs on C/C++ and Java code, the benchmark yields scores ranging from 0.48 to 0.89, indicating the models' varying levels of security awareness. This work aims to address the critical question of LLMs' suitability in software security and to complement existing benchmark scores with a security-focused metric.
论文介绍了TOSSS基准,用于评估大型语言模型(LLMs)在软件安全方面的表现,通过衡量其选择安全或漏洞代码片段的能力。TOSSS利用CVE数据库提供了一个可扩展的框架,以评估各种漏洞。对14种LLM在C/C++和Java代码上的评估结果显示了0.48到0.89的安全分数,表明模型在安全意识方面的差异。这项工作旨在填补现有安全基准的空白,并为LLM提供一个安全焦点评分。
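A score with the stated endpoints (1 when the secure snippet is always chosen, 0 when the vulnerable one is always chosen) can be sketched as a simple fraction. The `security_score` function and the `"secure"` / `"vulnerable"` labels are hypothetical; the abstract does not say whether the benchmark applies any weighting across CVEs, so the unweighted mean is an assumption.

```python
def security_score(choices):
    """Fraction of trials in which the model picked the secure snippet."""
    if not choices:
        raise ValueError("need at least one trial")
    return sum(1 for c in choices if c == "secure") / len(choices)


# Hypothetical outcomes of four two-option trials.
trials = ["secure", "vulnerable", "secure", "secure"]
score = security_score(trials)  # 3 of 4 picks were secure
```

Under this reading, a model that guesses uniformly at random between the two options would score about 0.5, which gives context for the reported range of 0.48 to 0.89.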
Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI
Authors: Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra
Venue: MICCAI 2026 (submitted)
First: 2026-03-11T16:52:21+00:00 · Latest: 2026-03-11T16:52:21+00:00
Comments: 11 pages, 2 figures. Submitted to MICCAI 2026
Abstract
Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significantly improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
中文标题/摘要
标题:Med-DualLoRA:针对3D心脏MRI的医学基础模型局部适应
基础模型(FMs)在经过特定任务的适应后,有望在包括心脏磁共振成像(CMR)在内的各种医疗成像任务和模态中取得稳健的下游性能。然而,使用单一站点的数据进行适应可能导致性能不佳和模型偏差增加,而在临床数据上进行集中式微调由于隐私限制往往不可行。联邦微调提供了一种隐私保护的替代方案;然而,传统方法在异构、非IID多中心数据下表现不佳,并且在适应大型模型时会产生大量通信开销。在本文中,我们研究了3D CMR疾病检测的联邦FM微调,并提出了一种客户端感知的参数高效微调(PEFT)联邦框架Med-DualLoRA,通过加性分解分离全局共享和局部低秩适应(LoRA)。全局和局部LoRA模块均在本地训练,但仅共享和聚合全局组件,局部适配器保持私有。此设计提高了个性化能力,同时显著降低了通信成本。实验表明,仅适应两个Transformer块即可保持性能并进一步提高效率。我们在一个多中心的最先进cine 3D CMR FM上评估了我们的方法,该模型使用ACDC和合并的M&Ms数据集针对疾病检测进行微调,并将每个设备厂商视为一个联邦客户端。与其他联邦PEFT基线相比,Med-DualLoRA实现了统计上显著的性能提升(平衡准确率0.768,特异性0.612),同时保持了通信效率。我们的方法为在现实临床约束下进行医学FM的联邦局部适应提供了一种可扩展的解决方案。
Summary / 总结
The research aims to improve the performance of foundation models in 3D cardiac MRI disease detection by addressing the limitations of single-site and centralized fine-tuning. Med-DualLoRA, a client-aware parameter-efficient fine-tuning framework, is proposed to disentangle global and local low-rank adaptations through additive decomposition. Experiments show that adapting only two transformer blocks preserves performance while reducing communication cost, and Med-DualLoRA achieves statistically significant improved performance compared to other federated PEFT baselines.
该研究旨在解决在多个中心适应用于3D心脏MRI疾病检测的基础模型时的隐私保护问题。提出了Med-DualLoRA,一种联邦学习框架,将全局和局部低秩适应分离。该框架通过仅共享全局组件并保持本地适配器私有来减少通信开销。实验表明,仅适应两个Transformer块即可保持性能并提高效率;与其他联邦PEFT基线相比,其性能提升具有统计显著性。
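The split between shared and private adapters can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' code: each adapter is represented by a hypothetical flattened weight delta rather than separate low-rank factors, the client names and values are invented, and the server uses an unweighted mean (the abstract does not specify the aggregation rule).

```python
def average(vectors):
    """Element-wise unweighted mean of equal-length parameter vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


# Hypothetical flattened adapter deltas per client after local training.
clients = {
    "vendor_a": {"global": [0.25, 0.5], "local": [1.0, -1.0]},
    "vendor_b": {"global": [0.75, 0.0], "local": [0.5, 0.5]},
}

# Only the global adapters are sent to the server and averaged.
shared = average([c["global"] for c in clients.values()])

# Every client receives the averaged global adapter; local adapters stay on site.
for c in clients.values():
    c["global"] = list(shared)

# Additive decomposition: effective update = global part + local part.
effective = {
    name: [g + l for g, l in zip(c["global"], c["local"])]
    for name, c in clients.items()
}
```

Because only the `global` vectors cross the network, communication scales with the size of one adapter, while the `local` vectors let each site keep a personalised correction.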
Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition
Authors: Jian Sun, Mohammad H. Mahoor
Venue: Neural Comput & Applic 38, 107 (2026)
First: 2026-03-11T16:51:55+00:00 · Latest: 2026-03-11T16:51:55+00:00
Comments: 9 figures, 10 tables
Abstract
Video quality significantly affects video classification. We found this problem when we classified Mild Cognitive Impairment well from clear videos, but worse from blurred ones. From then, we realized that referring to Video Quality Assessment (VQA) may improve video classification. This paper proposed Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to fulfill the goal. SSL-V3 leverages Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersected point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3's effectiveness.
中文标题/摘要
标题:基于对比学习的视频质量评估-联合视频视觉变换器用于视频识别
视频质量显著影响视频分类。我们在从清晰视频中很好地分类轻度认知障碍时发现了这个问题,但在模糊视频中效果较差。从那时起,我们意识到参考视频质量评估(VQA)可能提高视频分类效果。本文提出了一种基于自我监督学习的视频视觉变换器(SSL-V3),结合无参考VQA,以实现这一目标。SSL-V3利用联合自我监督机制将VQA融入视频分类,解决VQA中常见的标签短缺问题,使得无法提供准确的视频质量评分。简而言之,联合自我监督机制将视频质量评分作为因素直接调整视频分类的特征图。然后,评分作为交点,将VQA和分类连接起来,利用监督分类任务调整VQA的参数。SSL-V3在两个数据集上取得了稳健的实验结果。例如,在I-CONECT(一个涉及面部视频的医疗保健数据集)的一些访谈视频上达到了94.87%的准确率,验证了SSL-V3的有效性。
Summary / 总结
This paper addresses the impact of video quality on classification tasks, particularly in recognizing Mild Cognitive Impairment. It proposes SSL-V3, a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA, which uses a combined self-supervised learning mechanism to integrate video quality assessment into video classification. The method effectively addresses the label shortage issue in video datasets. Experiments on two datasets showed that SSL-V3 achieved high accuracy, reaching 94.87% on interview videos in the I-CONECT dataset.
本文提出了SSL-V3方法,将基于自监督学习的视频视觉变换器与无参考视频质量评估相结合,以解决视频质量影响分类准确性的问题。该方法使用组合自监督学习机制将视频质量分数直接融入视频分类中,缓解了视频数据集中质量标签不足的问题。实验结果表明,SSL-V3在包括I-CONECT在内的两个数据集上取得了高准确率,例如在访谈视频上的准确率为94.87%,证明了其有效性。
Pointy - A Lightweight Transformer for Point Cloud Foundation Models
Authors: Konrad Szafer, Marek Kraft, Dominik Belter
Venue: ACIVS 2025 (earlier version at the SCI-FM workshop, ICLR 2025)
First: 2026-03-11T16:50:46+00:00 · Latest: 2026-03-11T16:50:46+00:00
Comments: To appear in the proceedings of ACIVS 2025. An earlier version was presented at the SCI-FM workshop at ICLR 2025
Abstract
Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.
中文标题/摘要
标题:Pointy:一种用于点云基础模型的轻量级Transformer
点云数据的基础模型最近在能力上有了显著增长,通常依赖于来自语言或视觉的大规模表示学习。在本研究中,我们采取了一种更受控的方法,引入了一种基于Transformer的轻量级点云架构。与对跨模态监督的大量依赖不同,我们的模型仅在39,000个点云上进行训练,但其性能却超过了多个在超过200,000个训练样本上训练的更大的基础模型。有趣的是,我们的方法接近了在超过一百万个点云、图像和文本样本上训练的模型所取得的最先进结果,这表明了精心设计的训练设置和架构的价值。为了确保严格的评估,我们进行了一项全面的复现研究,在多个点云架构之间统一了训练方案和基准测试。这种统一的实验框架隔离了架构选择的影响,使比较更加透明,并突显了我们的设计和其他无分词器架构的优势。我们的结果显示,简单的骨干网络可以取得与更复杂或数据更丰富的策略相当的结果。该实现包括代码、预训练模型和训练协议,可在https://github.com/KonradSzafer/Pointy 获取。
Summary / 总结
This paper introduces Pointy, a lightweight transformer-based architecture for point cloud data, which is trained on a smaller dataset of 39k point clouds and outperforms larger models trained on over 200k samples. The model approaches state-of-the-art results seen in models with extensive cross-modal supervision, highlighting the effectiveness of a carefully curated training setup and architecture. The authors conduct a replication study to ensure rigorous evaluation and demonstrate that simple backbones can achieve competitive performance compared to more complex strategies. The implementation details, including code and pre-trained models, are available online.
该研究引入了基于Transformer的轻量级点云模型Pointy,仅使用39k个点云训练,其性能优于多个使用超过200k样本训练的更大模型。该模型接近了见过上百万样本的模型所取得的最先进结果,展示了精心设计的训练设置和架构的有效性。研究包括了复现实验以确保严格的评估,并表明简单骨干网络可以取得与更复杂或数据更丰富的策略相当的结果。相关实现细节和代码已公开在线。
Bio-Inspired Self-Supervised Learning for Wrist-worn IMU Signals
Authors: Prithviraj Tarale, Kiet Chu, Abhishek Varghese, Kai-Chun Liu, Maxwell A Xu, Mohit Iyyer, Sunghoon I. Lee
First: 2026-03-11T16:48:33+00:00 · Latest: 2026-03-11T16:48:33+00:00
Abstract
Wearable accelerometers have enabled large-scale health and wellness monitoring, yet learning robust human-activity representations has been constrained by the scarcity of labeled data. While self-supervised learning offers a potential remedy, existing approaches treat sensor streams as unstructured time series, overlooking the underlying biological structure of human movement, a factor we argue is critical for effective Human Activity Recognition (HAR). We introduce a novel tokenization strategy grounded in the submovement theory of motor control, which posits that continuous wrist motion is composed of superposed elementary basis functions called submovements. We define our token as the movement segment, a unit of motion composed of a finite sequence of submovements that is readily extractable from wrist accelerometer signals. By treating these segments as tokens, we pretrain a Transformer encoder via masked movement-segment reconstruction to model the temporal dependencies of movement segments, shifting the learning focus beyond local waveform morphology. Pretrained on the NHANES corpus (approximately 28k hours; approximately 11k participants; approximately 10M windows), our representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks. Furthermore, they demonstrate stronger data efficiency in data-scarce settings. Code and pretrained weights will be made publicly available.
中文标题/摘要
标题:生物启发的自监督学习用于腕戴式IMU信号
可穿戴加速度计使大规模健康和福祉监测成为可能,但学习稳健的人类活动表示受到标注数据稀缺的限制。虽然自监督学习提供了一种潜在的解决方案,但现有方法将传感器流视为无结构的时间序列,忽视了人类运动的潜在生物结构,我们认为这是有效的人体活动识别(HAR)的关键因素。我们引入了一种基于运动控制的子运动理论的新型分词策略,该理论认为连续的手腕运动由称为子运动的基本函数叠加组成。我们将我们的分词定义为运动片段,这是由有限序列的子运动组成的运动单位,可以从手腕加速度计信号中轻松提取。通过将这些片段视为分词,我们使用掩码运动片段重建预训练了一个Transformer编码器,以建模运动片段的时间依赖性,将学习重点从局部波形形态转移到了更广泛的方面。在NHANES语料库(约28000小时;约11000名参与者;约1000万个窗口)上预训练后,我们的表示在六个受试者不重叠的HAR基准测试中优于强大的可穿戴自监督学习基线,并且在数据稀缺的环境中表现出更强的数据效率。代码和预训练权重将公开提供。
Summary / 总结
This study addresses the challenge of learning robust human-activity representations from wrist-worn accelerometer data using self-supervised learning. It introduces a novel tokenization strategy based on submovements, which are fundamental components of human motion. By treating these submovement segments as tokens, the researchers pretrained a Transformer encoder to model temporal dependencies, improving performance on various HAR benchmarks and showing better data efficiency in data-scarce settings compared to existing methods.
该研究旨在利用自监督学习从腕戴加速度计数据中学习稳健的人类活动表示。它引入了一种基于子运动理论的新分词策略,将运动片段作为分词进行预训练。预训练的表示在多个HAR基准测试中优于现有自监督学习方法,并在数据稀缺的情况下表现出更好的数据效率。
Ranking Reasoning LLMs under Test-Time Scaling
Authors: Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
First: 2026-03-11T16:47:41+00:00 · Latest: 2026-03-11T16:47:41+00:00
Comments: Code is available at https://github.com/mohsenhariri/scorio
Abstract
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
中文标题/摘要
标题:测试时缩放下推理大模型的排名
测试时缩放通过每条提示采样多个输出来评估推理大模型,但在这种模式下对模型进行排名仍较少被探索。我们形式化了测试时缩放下的密集基准排名,并引入了Scorio库,该库实现了配对比较模型、项目反应理论(IRT)模型、投票规则以及基于图和谱的方法等统计排名方法。在四个奥林匹克风格的数学基准(AIME'24、AIME'25、HMMT'25和BrUMO'25;最多$N=80$次试验)上的20个推理模型中,大多数全试验排名与贝叶斯黄金标准$\mathrm{Bayes}_{\mathcal{U}}@80$高度一致(平均Kendall's $\tau_b = 0.93$--$0.95$),且有19到34种方法恢复出完全相同的排序。在单试验模式下,最佳方法达到$\tau_b \approx 0.86$。使用贪婪解码作为经验先验($\mathrm{Bayes}_{\mathbf{R}_0}@N$)在$N=1$时可将方差减少16%到52%,但在贪婪解码和随机采样结果不一致时可能会导致排名偏差。这些结果确定了适用于高预算和低预算测试时缩放的可靠排名方法。我们已在https://github.com/mohsenhariri/scorio上开源了Scorio库。
Summary / 总结
This study evaluates reasoning LLMs under test-time scaling by sampling multiple outputs per prompt and introduces Scorio, a library for statistical ranking methods. Across 20 reasoning models on four Olympiad-style math benchmarks, most full-trial rankings closely match the Bayesian gold standard, with Kendall's τ_b ranging from 0.93 to 0.95. In the single-trial regime, the best methods achieve τ_b around 0.86. Greedy decoding reduces variance but can bias rankings when disagreeing with stochastic sampling. The study provides reliable ranking methods for both high- and low-budget test-time scaling scenarios.
该研究通过每次提示采样多个输出来评估推理LLM在测试时缩放下的表现,并引入了Scorio库,该库实现了配对比较模型、项目反应理论(IRT)模型、投票规则以及基于图和谱的方法等统计排名方法。在四个奥林匹克风格的数学基准测试上,20个推理模型的大多数全试次排名与贝叶斯黄金标准非常接近,Kendall's τ_b范围在0.93到0.95之间。在单试次情况下,最佳方法的τ_b约为0.86。以贪婪解码作为经验先验可以减少方差,但在其与随机采样结果不一致时可能使排名产生偏差。该研究确定了适用于高预算和低预算测试时缩放的可靠排名方法,并发布了Scorio作为开源库。
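The agreement figures above are reported as Kendall's τ_b. A minimal pure-Python sketch of that statistic follows, for readers who want to see what a value such as 0.93 measures. The `kendall_tau_b` function and the example scores are illustrative and are not taken from the Scorio library.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical scores for five models under a reference and a cheaper method.
gold = [5, 4, 3, 2, 1]
estimate = [5, 3, 4, 2, 1]  # one adjacent pair swapped
tau = kendall_tau_b(gold, estimate)
```

With five models there are ten pairs; one swapped adjacent pair gives nine concordant and one discordant pair, so τ_b is (9 − 1) / 10 = 0.8.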
When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra
Authors: Mira Jürgens, Gaetan De Waele, Morteza Rakhshaninejad, Willem Waegeman
First: 2026-03-11T16:40:50+00:00 · Latest: 2026-03-11T16:40:50+00:00
Abstract
Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.
中文标题/摘要
标题:我们在何时应信任注释?从质谱数据检索分子结构的选择性预测
用于从串联质谱数据(MS/MS)识别分子结构的机器学习方法取得了快速进展,但当前方法仍存在显著的错误率。在临床代谢组学和环境筛查等高风险应用中,错误的注释可能会产生严重后果,因此确定何时可以信任预测变得至关重要。我们提出了一种从MS/MS谱图检索分子结构的选择性预测框架,使模型在不确定性过高时可以放弃预测。我们在风险-覆盖率权衡框架内表述该问题,并在两种粒度级别上全面评估不确定性量化策略:针对预测分子指纹位的指纹级不确定性,以及针对候选排名的检索级不确定性。我们比较了多种评分函数,包括一阶置信度度量、来自二阶分布的偶然(aleatoric)和认知(epistemic)不确定性估计,以及潜在空间中的基于距离的度量。所有实验均在MassSpecGym基准上进行。我们的分析表明,虽然指纹级不确定性分数是检索成功的不良代理指标,但计算成本低廉的一阶置信度度量和检索级偶然不确定性在各评估设置中都实现了良好的风险-覆盖率权衡。我们证明,通过应用基于泛化界的无分布假设风险控制,从业者可以指定可容忍的错误率,并以高概率获得满足该约束条件的注释子集。
Summary / 总结
The research aims to improve the reliability of molecular structure identification from mass spectra by developing a selective prediction framework. This framework allows models to avoid making predictions when uncertainty is high, addressing the critical issue of error rates in high-stakes applications. Key findings show that fingerprint-level uncertainty scores are not effective predictors of retrieval success, but first-order confidence measures and retrieval-level aleatoric uncertainty provide a strong balance between risk and coverage. By using distribution-free risk control, practitioners can ensure a specified error rate with high probability.
研究旨在提高从质谱数据识别分子结构的可靠性,尤其是在错误注释可能产生严重后果的高风险应用中。研究引入了一种选择性预测框架,允许模型在不确定性过高时放弃预测。研究评估了指纹级和检索级的多种不确定性量化策略,发现指纹级不确定性分数不能很好地反映检索是否成功,而一阶置信度度量和检索级的偶然(aleatoric)不确定性在风险与覆盖率之间取得了良好的权衡。通过使用无分布假设的风险控制,实践者可以指定可容忍的错误率,并以高概率获得满足该约束的注释子集。
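The risk-coverage tradeoff used to compare scoring functions can be sketched as follows. The `risk_coverage_curve` helper and the toy confidences are illustrative assumptions, not the paper's code: predictions are kept in order of decreasing confidence, and at each coverage level the risk is the error rate among the predictions kept so far.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) points as the abstention threshold is relaxed.

    confidences: one score per prediction, higher means more trusted.
    correct: matching booleans, True if the prediction was right.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    points = []
    errors = 0
    for kept, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((kept / len(order), errors / kept))
    return points


# Hypothetical confidences and correctness flags for four annotations.
conf = [0.9, 0.8, 0.6, 0.3]
hit = [True, True, False, True]
curve = risk_coverage_curve(conf, hit)
```

A good uncertainty score pushes errors toward the low-confidence end, so risk stays near zero at low coverage and rises only as more predictions are accepted. In this toy case the model can answer half of the queries with no errors.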
Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
Authors: Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn, Johannes Breidenbach, Stefano Puliti, Nils Noelke
First: 2025-12-17T12:09:41+00:00 · Latest: 2026-03-11T16:33:47+00:00
Comments: 34 pages, 17 figures, submitted to Forestry: An International Journal of Forest Research
Abstract
Aiming to advance research in the field of interpretability of deep learning models for tree species classification using TLS 3D point clouds, we present insights into the classification abilities of YOLOv8 through a new framework which enables systematic analysis of saliency maps derived from CAM (Class Activation Mapping). To investigate the contribution of structural tree features to the classification decisions of the models, we link regions with high saliency derived from the application of Finer-CAM to segments of 2D side-view images that correspond to structural tree features. Using TLS 3D point clouds from 2445 trees across seven European tree species, we trained five YOLOv8 models with cross-validation, reaching a mean accuracy of 96% (SD = 0.24%) when applied to the test data. Our results demonstrate that Finer-CAM can be considered faithful in identifying regions that discriminate target tree species. This renders Finer-CAM suitable for enhancing the interpretability of the tree species classification models. Analysis of 630 saliency maps indicates that the models primarily rely on image regions associated with tree crowns for species classification. While this result is pronounced in Silver Birch, European Beech, English oak, and Norway Spruce, image regions associated with stems contribute more frequently to the differentiation of European ash, Scots pine, and Douglas-fir. We demonstrate that the visibility of detailed structural tree features in the 2D side-view images enhances the discriminative performance of the models, indicating YOLOv8's ability to leverage detailed point cloud representations. Our results represent a first step toward enhancing the understanding of the classification decision processes of tree species classification models, aiding in the identification of data set and model limitations, and building confidence in model predictions.
中文标题/摘要
标题:增强树种分类:YOLOv8和可解释AI应用于TLS点云投影的新见解
为了推进使用TLS 3D点云进行树种分类的深度学习模型可解释性研究,我们通过一个新框架展示了对YOLOv8分类能力的见解,该框架允许系统分析由CAM(类别激活映射)得到的显著性图。为了研究树木结构特征对模型分类决策的贡献,我们将应用Finer-CAM得到的高显著性区域与2D侧视图图像中对应于树木结构特征的片段相关联。使用来自七个欧洲树种的2445棵树的TLS 3D点云,我们通过交叉验证训练了五个YOLOv8模型,在测试数据上达到96%(标准差=0.24%)的平均准确率。我们的结果表明,Finer-CAM能够忠实地识别出区分目标树种的区域,这使得Finer-CAM适合用于增强树种分类模型的可解释性。对630个显著性图的分析表明,模型主要依赖于与树冠相关的图像区域进行树种分类。这一结果在垂枝桦(Silver Birch)、欧洲山毛榉(European Beech)、夏栎(English oak)和挪威云杉(Norway Spruce)中尤为明显,而与树干相关的图像区域则更频繁地有助于区分欧洲白蜡(European ash)、欧洲赤松(Scots pine)和花旗松(Douglas-fir)。我们证明了2D侧视图图像中详细树木结构特征的可见性提升了模型的区分性能,表明YOLOv8能够利用详细的点云表示。我们的结果代表了增强对树种分类模型分类决策过程理解的第一步,有助于识别数据集和模型的局限性,并增强对模型预测的信心。
Summary / 总结
This study aims to improve the interpretability of deep learning models for tree species classification using TLS 3D point clouds. It employs YOLOv8 and Finer-CAM to analyze saliency maps, linking structural tree features to classification decisions. With a dataset of 2445 trees from seven European species, the models achieved a mean accuracy of 96% (SD = 0.24%). The analysis indicates that tree crowns are primarily used for species classification, while stems play a more significant role in differentiating certain species. This work enhances the understanding of classification processes and aids in identifying model limitations.
本研究旨在通过TLS 3D点云提高树种分类深度学习模型的可解释性,采用YOLOv8和Finer-CAM分析显著性图,将结构树特征与分类决策联系起来。使用来自七个欧洲树种的2445棵树的数据集,模型的平均准确率为96%(SD = 0.24%)。分析表明,树冠主要用于物种分类,而树干在区分某些物种时更为重要。这项工作增强了对分类过程的理解,并有助于识别模型的局限性。
InstantSfM: Towards GPU-Native SfM for the Deep Learning Era
Authors: Jiankun Zhong, Zitong Zhan, Quankai Gao, Ziyu Chen, Haozhe Lou, Jiageng Mao, Ulrich Neumann, Chen Wang, Yue Wang
First: 2025-10-15T08:58:05+00:00 · Latest: 2026-03-11T16:28:28+00:00
Abstract
Structure-from-Motion (SfM) is a fundamental technique for recovering camera poses and scene structure from multi-view imagery, serving as a critical upstream component for applications ranging from 3D reconstruction to modern neural scene representations such as 3D Gaussian Splatting. However, most mature SfM systems remain CPU-centric and built upon traditional optimization toolchains, creating a growing mismatch with modern GPU-based, learning-driven pipelines and limiting scalability in large-scale scenes. While recent advances in GPU-accelerated bundle adjustment (BA) have demonstrated the potential of parallel sparse optimization, extending these techniques to build a complete global SfM system remains challenging due to unresolved issues in metric scale recovery and numerical robustness. In this paper, we implement a fully GPU-based and PyTorch-compatible global SfM system, named InstantSfM, to integrate seamlessly with modern learning pipelines. InstantSfM embeds metric depth priors directly into both global positioning and BA through a depth-constrained Jacobian structure, thereby resolving scale ambiguity within the optimization framework. To ensure numerical stability, we employ explicit filtering of under-constrained variables for the Jacobian matrix in an optimized GPU-friendly manner. Extensive experiments on diverse datasets demonstrate that InstantSfM achieves state-of-the-art efficiency while maintaining reconstruction accuracy comparable to both established classical pipelines and recent learning-based methods, showing up to ${\sim40\times}$ speedup over COLMAP on large-scale scenes.
Summary / 总结
The research addresses the mismatch between traditional CPU-centric SfM systems and modern GPU-based learning pipelines. The authors developed InstantSfM, a fully GPU-based, PyTorch-compatible global SfM system. It embeds metric depth priors into both global positioning and bundle adjustment through a depth-constrained Jacobian structure to resolve scale ambiguity, and explicitly filters under-constrained variables for numerical stability. Experiments show that InstantSfM achieves state-of-the-art efficiency, with up to roughly 40 times speedup over COLMAP on large-scale scenes, while maintaining reconstruction accuracy comparable to classical pipelines and recent learning-based methods.
本文介绍了InstantSfM,这是一个完全基于GPU的全局Structure-from-Motion (SfM)系统,旨在与现代学习管线无缝集成。它将度量深度先验嵌入全局定位和光束法平差以消除尺度歧义,并通过显式过滤欠约束变量来确保数值稳定性,从而解决了传统SfM系统与GPU加速的学习驱动管线之间的不匹配问题。实验表明,InstantSfM在效率上达到了最先进的水平,同时保持了与经典管线和近期基于学习的方法相当的重建精度,在大规模场景中比COLMAP最高快约40倍。
Task Aware Modulation Using Representation Learning for Upscaling of Terrestrial Carbon Fluxes
Authors: Aleksei Rozanov, Arvind Renganathan, Vipin Kumar
Venue: AAAI 2026
First: 2026-03-10T17:59:29+00:00 · Latest: 2026-03-11T16:27:50+00:00
Comments: Accepted to the KGML Bridge at AAAI 2026 (non-archival)
Abstract
Accurately upscaling terrestrial carbon fluxes is central to estimating the global carbon budget, yet remains challenging due to the sparse and regionally biased distribution of ground measurements. Existing data-driven upscaling products often fail to generalize beyond observed domains, leading to systematic regional biases and high predictive uncertainty. We introduce Task-Aware Modulation with Representation Learning (TAM-RL), a framework that couples spatio-temporal representation learning with knowledge-guided encoder-decoder architecture and loss function derived from the carbon balance equation. Across 150+ flux tower sites representing diverse biomes and climate regimes, TAM-RL improves predictive performance relative to existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance (R2) from 19.4% to 43.8%, depending on the target flux. These results demonstrate that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
中文标题/摘要
标题:基于表示学习的任务感知调制用于陆地碳通量升尺度
准确地对陆地碳通量进行升尺度估算对于估算全球碳预算至关重要,但由于地面测量稀疏且存在区域偏差,这一任务仍然具有挑战性。现有的数据驱动升尺度产品往往无法泛化到观测域之外,导致系统性的区域偏差和较高的预测不确定性。我们引入了基于表示学习的任务感知调制(TAM-RL)框架,该框架将时空表示学习与知识引导的编码器-解码器架构以及源自碳平衡方程的损失函数相结合。在代表不同生物群落和气候区的150多个通量塔站点上,TAM-RL相比现有最先进的数据集在预测性能上有所提升,RMSE降低了8-9.6%,解释方差(R2)从19.4%增加到43.8%,具体取决于目标通量。这些结果表明,将物理约束与自适应表示学习相结合可以显著增强全球碳通量估计的稳健性和可迁移性。
Summary / 总结
The research aims to improve the accuracy of upscaling terrestrial carbon fluxes, which is central to estimating the global carbon budget. The authors developed Task-Aware Modulation with Representation Learning (TAM-RL), a framework that combines spatio-temporal representation learning with a knowledge-guided encoder-decoder architecture and a loss function derived from the carbon balance equation. Across 150+ flux tower sites, TAM-RL improves on existing state-of-the-art datasets, reducing RMSE by 8-9.6% and increasing explained variance (R2) from 19.4% to 43.8%, depending on the target flux. This suggests that integrating physically grounded constraints with adaptive representation learning can substantially enhance the robustness and transferability of global carbon flux estimates.
研究旨在提高陆地碳通量升尺度估算的精度,这对于估算全球碳预算至关重要。作者开发了基于表示学习的任务感知调制(TAM-RL)框架,该框架结合了时空表示学习、知识引导的编码器-解码器架构以及源自碳平衡方程的损失函数。在150多个通量塔站点上,TAM-RL 优于现有最先进的数据集,RMSE 降低了8-9.6%,解释方差(R2)从19.4% 提高到43.8%,具体取决于目标通量。这表明将物理约束与自适应表示学习相结合可以显著增强全球碳通量估算的稳健性和可迁移性。
Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
Authors: Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum
First: 2026-03-11T16:24:20+00:00 · Latest: 2026-03-11T16:24:20+00:00
Abstract
Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
中文标题/摘要
标题:超越期望的Safe RLHF:基于随机占优的通用谱风险控制
安全的基于人类反馈的强化学习(Safe RLHF)通常通过期望成本约束来确保安全性,但期望值只捕捉了成本分布的一个统计特征,未能考虑分布不确定性,特别是在重尾分布或罕见灾难性事件下。当鲁棒性和风险敏感性至关重要时,这一局限性会带来问题。随机占优提供了一种有原则的替代方案,它比较整个成本分布而非仅仅比较平均值,从而可以直接控制尾部风险和潜在的分布外失效,而这些是基于期望的约束可能忽略的。在本文中,我们提出了基于占优的风险敏感对齐(RAD),这是一种新颖的对齐框架,用一阶随机占优(FSD)约束替代标量期望成本约束。我们在最优传输(OT)框架内比较目标策略与参考策略的成本分布来实现这一约束,并使用熵正则化和Sinkhorn迭代获得一个可微且计算高效的目标函数,以实现稳定的端到端优化。此外,我们引入了分位数加权FSD约束,并证明加权FSD可以普遍控制一大类谱风险度量(SRM),使得加权占优下的改进意味着相应谱风险的改进有保证。这为通过分位数加权函数调整模型的风险特征提供了一种有原则的机制。实验结果表明,RAD在无害性方面优于基线,同时在有用性方面保持竞争力,并在分布外无害性评估中表现出更强的鲁棒性。
Summary / 总结
This paper addresses the limitation of expected cost constraints in Safe RLHF by proposing Risk-sensitive Alignment via Dominance (RAD), a framework that replaces them with First-Order Stochastic Dominance (FSD) constraints on the entire cost distribution. The method compares the target policy's cost distribution to that of a reference policy within an Optimal Transport framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable objective for stable end-to-end optimization. Quantile-weighted FSD constraints are shown to control a broad class of Spectral Risk Measures. Empirically, RAD improves harmlessness over baselines while remaining competitive in helpfulness, and is more robust on out-of-distribution harmlessness evaluations.
本文提出了一种新的框架Risk-sensitive Alignment via Dominance (RAD),使用First-Order Stochastic Dominance (FSD)约束来控制整个成本分布,而不是仅仅控制期望成本。该方法在Optimal Transport框架内比较目标策略与参考策略的成本分布,并借助熵正则化和Sinkhorn迭代得到可微且计算高效的目标函数。实验结果表明,RAD在无害性方面优于基线,同时在有用性方面保持竞争力,并在分布外无害性评估中表现出更强的鲁棒性。
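The contrast drawn in the RAD abstract above, between an expected-cost constraint and a dominance constraint over the whole cost distribution, can be illustrated on small samples. The following is a minimal sketch only: it uses sorted samples as empirical quantiles rather than the Optimal Transport and Sinkhorn objective the paper describes, and the function names (`weighted_fsd_violation`, `spectral_risk`, `cvar_weights`) and sample values are hypothetical.

```python
import numpy as np

def empirical_quantiles(costs):
    """Sorted samples act as the empirical quantile function on an equal-size grid."""
    return np.sort(np.asarray(costs, dtype=float))

def weighted_fsd_violation(policy_costs, reference_costs, weights=None):
    """Weighted amount by which policy cost quantiles exceed reference quantiles.

    Zero means the policy's costs are no higher than the reference's at every
    quantile level of this sample. Assumes both samples have the same size.
    """
    q_pol = empirical_quantiles(policy_costs)
    q_ref = empirical_quantiles(reference_costs)
    if weights is None:
        weights = np.full(q_pol.shape, 1.0 / q_pol.size)
    return float(np.sum(weights * np.maximum(q_pol - q_ref, 0.0)))

def spectral_risk(costs, weights):
    """Discrete spectral risk: weighted sum of sorted costs.

    weights should be non-negative, non-decreasing, and sum to 1.
    """
    return float(np.dot(weights, empirical_quantiles(costs)))

def cvar_weights(n, alpha):
    """Weights that average the worst (1 - alpha) fraction of n samples (CVaR)."""
    tail = max(1, int(round((1.0 - alpha) * n)))
    w = np.zeros(n)
    w[-tail:] = 1.0 / tail
    return w

# Hypothetical cost samples for a reference policy and two candidates.
reference = [1.0, 2.0, 3.0, 10.0]
heavy_tail = [0.0, 0.0, 0.0, 12.0]  # lower mean than reference, worse tail
dominated = [1.0, 2.0, 3.0, 4.0]    # no worse than reference at any quantile
w = cvar_weights(4, alpha=0.75)

print(np.mean(heavy_tail) < np.mean(reference))
print(weighted_fsd_violation(heavy_tail, reference))
print(weighted_fsd_violation(dominated, reference))
print(spectral_risk(dominated, w), spectral_risk(reference, w))
```

In this toy data the heavy-tailed candidate would pass an expected-cost comparison against the reference yet shows a positive dominance violation, while the candidate with zero violation also has no higher CVaR than the reference, which is the kind of implication the weighted-dominance result formalizes.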
Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand
Authors: Korel Gundem, Zhengling Qi
First: 2025-04-14T02:57:51+00:00 · Latest: 2026-03-11T16:18:36+00:00
Abstract
In this paper, we study the offline sequential feature-based pricing and inventory control problem where the current demand depends on the past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage the offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamic without censoring can be modeled by Markov decision process (MDP), the primary obstacle arises from the observed process where demand censoring is present, resulting in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving a specialized Bellman equation tailored for this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms for solving these Bellman equations and, thus, estimate the optimal policy. Furthermore, we establish finite-sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at https://github.com/gundemkorel/Inventory_Pricing_Control
中文标题/摘要
标题:离线动态库存和定价策略:应对删失和相依需求
在本文中,我们研究离线的、基于特征的序贯定价和库存控制问题,其中当前需求依赖于过去的需求水平,且超出可用库存的需求将会流失。我们的目标是利用包含过去价格、订货量、库存水平、协变量和删失销售水平的离线数据集,估计最大化长期利润的最优定价和库存控制策略。虽然在没有删失的情况下,基础动态可以由马尔可夫决策过程(MDP)建模,但主要障碍来自存在需求删失的观测过程,它导致利润信息缺失、马尔可夫性质失效以及最优策略非平稳。为克服这些挑战,我们首先通过求解一个由连续删失次数刻画的高阶MDP来近似最优策略,这最终归结为求解针对该问题专门设计的贝尔曼方程。受离线强化学习和生存分析的启发,我们提出了两种新的数据驱动算法来求解这些贝尔曼方程,从而估计最优策略。此外,我们建立了有限样本遗憾界以验证这些算法的有效性。最后,我们通过数值实验展示了这些算法在估计最优策略方面的有效性。据我们所知,这是首个在具有删失和相依需求的序贯决策环境中学习最优定价和库存控制策略的数据驱动方法。所提出算法的实现可在 https://github.com/gundemkorel/Inventory_Pricing_Control 获取。
Summary / 总结
This paper addresses offline dynamic inventory control and pricing when demand depends on past demand levels and any demand exceeding available inventory is lost, so observed sales are censored. The authors use an offline dataset to estimate the policy that maximizes long-term profit. To handle censoring, they approximate the optimal policy by solving a high-order Markov decision process characterized by the number of consecutive censoring instances, and propose two algorithms inspired by offline reinforcement learning and survival analysis. The algorithms are supported by finite-sample regret bounds, and numerical experiments demonstrate their effectiveness in estimating the optimal policy under censored and dependent demand.
本文研究了当前需求依赖于过去需求水平、且超出库存的需求将会流失的离线动态库存和定价策略问题。作者利用离线数据集来估计最大化长期利润的最优策略。他们通过求解高阶马尔可夫决策过程及其对应的贝尔曼方程来近似最优策略,从而克服需求删失带来的挑战。文中提出了两种受离线强化学习和生存分析启发的数据驱动算法,并建立了有限样本遗憾界来验证其有效性。数值实验展示了这些算法在删失和相依需求场景下估计最优策略的有效性。
Zero-Shot Transferable Solution Method for Parametric Optimal Control Problems
Authors: Xingjian Li, Kelvin Kan, Deepanshu Verma, Krishna Kumar, Stanley Osher, Ján Drgoňa
First: 2025-09-22T20:38:05+00:00 · Latest: 2026-03-11T16:17:48+00:00
Comments: 11 pages, 6 figures, 3 tables
Abstract
This paper presents a transferable solution method for optimal control problems with varying objectives using function encoder (FE) policies. Traditional optimization-based approaches must be re-solved whenever objectives change, resulting in prohibitive computational costs for applications requiring frequent evaluation and adaptation. The proposed method learns a reusable set of neural basis functions that spans the control policy space, enabling efficient zero-shot adaptation to new tasks through either projection from data or direct mapping from problem specifications. The key idea is an offline-online decomposition: basis functions are learned once during offline imitation learning, while online adaptation requires only lightweight coefficient estimation. Numerical experiments across diverse dynamics, dimensions, and cost structures show our method delivers near-optimal performance with minimal overhead when generalizing across tasks, enabling semi-global feedback policies suitable for real-time deployment.
中文标题/摘要
标题:参数最优控制问题的零样本可移植解决方案方法
本文提出了一种使用函数编码器(FE)策略解决具有变化目标的最优控制问题的可移植解决方案方法。传统的基于优化的方法在目标变化时必须重新求解,导致频繁评估和适应的应用具有高昂的计算成本。所提出的方法学习一组可重用的神经基函数,覆盖控制策略空间,通过数据投影或直接从问题规范映射实现新的任务的零样本高效适应。关键思想是离线-在线分解:基函数在离线模仿学习过程中仅需学习一次,而在线适应只需进行轻量级的系数估计。在不同动力学、维度和成本结构的数值实验中显示,该方法在跨任务泛化时几乎达到最优性能,具有最小的开销,能够实现适用于实时部署的半全局反馈策略。
Summary / 总结
This paper introduces a transferable solution method for optimal control problems with varying objectives using function encoder (FE) policies. The method learns reusable neural basis functions offline, allowing for efficient zero-shot adaptation to new tasks through coefficient estimation. Experiments demonstrate near-optimal performance with minimal overhead, suitable for real-time deployment.
论文针对不同目标需要重新求解优化问题的挑战,提出了一种使用函数编码器策略来学习可重用的神经基函数的方法。这使得新任务的零样本适应无需重新求解优化问题即可高效实现。实验表明,该方法在跨任务泛化时能实现接近最优的性能,且开销极小,适用于实时部署。
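The offline-online decomposition described in the function encoder abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: a fixed polynomial basis stands in for the neural basis functions learned offline, the online step is an ordinary least-squares projection, and the names `basis`, `fit_coefficients`, and `policy` are hypothetical.

```python
import numpy as np

def basis(x):
    """Stand-in for learned basis functions g_1..g_k evaluated at states x.

    A fixed polynomial basis is used purely for illustration; in the paper's
    setting these would be neural networks trained once, offline.
    """
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)  # shape (n, 3)

def fit_coefficients(states, controls):
    """Online step: least-squares projection of observed controls onto the basis."""
    G = basis(states)
    coeffs, *_ = np.linalg.lstsq(G, np.asarray(controls, dtype=float), rcond=None)
    return coeffs

def policy(states, coeffs):
    """Evaluate the adapted policy u(x) = sum_j c_j * g_j(x)."""
    return basis(states) @ coeffs

# A hypothetical new task whose control law is u(x) = -2x + 0.5x^2.
states = np.linspace(-1.0, 1.0, 20)
controls = -2.0 * states + 0.5 * states**2
coeffs = fit_coefficients(states, controls)
print(coeffs)
print(policy(np.array([0.5]), coeffs))
```

Only the small coefficient vector is estimated for the new task; the basis itself is untouched, which is what keeps the online adaptation lightweight.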
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
First: 2026-03-11T16:13:19+00:00 · Latest: 2026-03-11T16:13:19+00:00
Abstract
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
中文标题/摘要
标题:终身模仿学习:多模态潜在回放与增量调整
我们提出了一种终身模仿学习框架,能够在现实的内存和数据约束下,实现跨序列任务的持续策略优化。我们的方法不同于传统的经验回放,完全在多模态潜在空间中操作,将视觉、语言和机器人状态信息的紧凑表示存储和重用以支持未来的学习。为了进一步稳定适应,我们引入了一种增量特征调整机制,通过角度间隔约束来规范任务嵌入的演变,从而保持任务间的区分度。我们的方法在LIBERO基准测试中达到了新的最先进水平,与此前的领先方法相比,AUC提高了10-17个点,遗忘减少了多达65%。消融研究证实了每个组件的有效性,显示了相对于替代策略的一致改进。代码可在 https://github.com/yfqi/lifelong_mlr_ifa 获取。
Summary / 总结
The research aims to develop a lifelong imitation learning framework that can continuously refine policies across sequential tasks with limited memory and data. The method uses a multimodal latent space to store and reuse compact representations of visual, linguistic, and robot state information, and introduces an incremental feature adjustment mechanism that regularizes task embeddings with an angular margin constraint to stabilize adaptation. Experimental results show that the approach sets a new state of the art on the LIBERO benchmarks, with 10-17 point gains in AUC and up to 65% less forgetting than previous leading methods. Ablation studies validate the effectiveness of each component of the method.
研究旨在开发一种终身模仿学习框架,能够在有限的内存和数据条件下,持续优化跨序列任务的策略。该方法使用多模态潜在空间存储和重用视觉、语言和机器人状态信息的紧凑表示,并引入基于角度间隔约束的增量特征调整机制以稳定适应。实验结果表明,该方法在LIBERO基准测试中达到了新的最先进水平,与此前的领先方法相比,AUC提高了10-17个点,遗忘减少了多达65%。消融研究验证了该方法每个组件的有效性。
BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs
Authors: Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela
First: 2026-02-05T08:41:00+00:00 · Latest: 2026-03-11T16:12:46+00:00
Abstract
Selecting the top $m$ from $n$ items via expensive $k$-wise comparisons is central to settings ranging from LLM-based document reranking to crowdsourced evaluation and tournament design. Existing methods either rely on heuristics that fail to fully exploit the information each comparison reveals, or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise ranking. Our key observation is that each $k$-item comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences; aggregating these into a global preference graph and computing its transitive closure yields many additional orderings without further oracle calls. We formalize when an item's rank is certifiably determined and design a greedy query schedule that maximizes information gain towards identifying the top-$m$ items. The framework also gracefully handles non-transitive preferences (cycles induced by real-world oracles) by collapsing them into equivalence classes that yield principled tiered rankings. Applied to LLM reranking across 14 benchmarks and 5 models, our method achieves Pareto dominance over existing approaches: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable methods, and $7\times$ fewer than pairwise reranking at near-identical quality.
中文标题/摘要
标题:BLITZRANK:基于锦标赛图的原则性零样本排名代理
通过代价高昂的$k$元比较从$n$个项目中选出前$m$个,是从基于LLM的文档重排序到众包评估和锦标赛设计等多种场景的核心问题。现有方法要么依赖于未能充分利用每次比较所揭示信息的启发式方法,要么在利用这些信息时效率低下。我们提出了一种锦标赛图框架,为$k$元排序提供了有原则的基础。我们的关键观察是,每次$k$项比较揭示了一个包含$\binom{k}{2}$个两两偏好的完整锦标赛;将这些偏好聚合到全局偏好图中并计算其传递闭包,无需进一步调用oracle即可得到许多额外的排序。我们形式化了一个项目的排名何时可以被确证,并设计了一种贪心查询调度,以最大化识别前$m$个项目的信息增益。该框架还能妥善处理非传递性偏好(由现实世界的oracle引起的环),将它们合并为等价类,从而得到有原则的分层排名。在14个基准和5个模型上应用于LLM重排序时,我们的方法相对现有方法实现了帕累托占优:在达到或超过其准确率的同时,所需token比同类方法少25-40%,并且在质量几乎相同的情况下约为两两重排序的$1/7$。
Summary / 总结
The research aims to select the top $m$ of $n$ items more efficiently when each $k$-wise comparison is expensive. It introduces a tournament graph framework that expands every $k$-wise comparison into its $\binom{k}{2}$ pairwise preferences, aggregates them into a global preference graph, and uses the transitive closure to infer further orderings without extra oracle calls. On LLM reranking across 14 benchmarks and 5 models, the method matches or exceeds the accuracy of existing approaches while using 25-40% fewer tokens than comparable methods and $7\times$ fewer than pairwise reranking at near-identical quality.
论文提出了BLITZRANK方法,利用tournament图框架从$n$个物品中通过$k$-wise比较选出前$m$个。该方法通过聚合每次$k$-wise比较中的两两偏好,计算传递闭包来确定物品的排名,无需额外的oracle调用。在14个基准测试和5个模型上,BLITZRANK在达到或超过现有方法准确率的同时,所需tokens比同类方法少25-40%,且仅为pairwise重排方法的$1/7$,质量相近。
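The tournament-graph idea in the BLITZRANK abstract above can be sketched in a few lines: each $k$-wise comparison is expanded into its $\binom{k}{2}$ pairwise preferences, the preferences accumulate in a global graph, and the transitive closure exposes orderings that were never queried directly. The function names and the simple certification rule below are illustrative assumptions rather than the authors' implementation; cycle handling and the greedy query schedule are omitted, and preferences are assumed acyclic.

```python
from itertools import combinations

def add_comparison(edges, ranked_items):
    """Record one k-wise comparison; ranked_items is ordered best to worst."""
    # combinations() preserves input order, so the first element of each pair wins.
    for winner, loser in combinations(ranked_items, 2):
        edges.add((winner, loser))

def transitive_closure(items, edges):
    """Return all (a, b) such that a beats b directly or through a chain."""
    beats = {a: set() for a in items}
    for a, b in edges:
        beats[a].add(b)
    closure = set()
    for start in items:
        stack, seen = list(beats[start]), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(beats[node])
        closure.update((start, other) for other in seen)
    return closure

def certified_top(items, closure, m):
    """Items known to beat at least n - m others, hence certainly in the top m."""
    n = len(items)
    wins = {a: 0 for a in items}
    for a, _ in closure:
        wins[a] += 1
    return sorted(a for a in items if wins[a] >= n - m)

# Two 3-wise comparisons over five hypothetical items.
items = ["a", "b", "c", "d", "e"]
edges = set()
add_comparison(edges, ["a", "b", "c"])
add_comparison(edges, ["c", "d", "e"])
closure = transitive_closure(items, edges)
print(len(edges), len(closure))
print(certified_top(items, closure, m=2))
```

In this toy run the two comparisons yield six direct preferences, and the closure extends them to all ten pairs, so the top two items are certified without any further query. The shared item `c` is what links the two comparisons.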
ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection
Authors: Kadir-Kaan Özer, René Ebeling, Markus Enzweiler
First: 2026-03-11T16:08:56+00:00 · Latest: 2026-03-11T16:08:56+00:00
Comments: 6 pages, 3 figures, 5 tables
Abstract
Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate ${\approx}$0.022) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy.
中文标题/摘要
标题:ECoLAD:面向部署的汽车时间序列异常检测评估
时间序列异常检测器通常在工作站级硬件上、在不受限制的执行条件下进行比较。然而,车载监控需要在有限的CPU并行性下具有可预测的延迟和稳定的行为。因此,仅看准确率的排行榜可能无法如实反映哪些方法在部署相关约束下仍然可行。我们提出了ECoLAD(Efficiency Compute Ladder for Anomaly Detection,面向异常检测的效率计算阶梯),这是一种面向部署的评估协议,并以在专有汽车遥测数据(异常率约为0.022)和补充的公共基准上的实证研究加以实例化。ECoLAD使用机械确定的、仅含整数的缩放规则和显式的CPU线程上限,对异构检测器家族施加单调的计算缩减阶梯,同时记录每一次应用的配置更改。吞吐量受限时的行为通过扫描目标评分速率来刻画,并报告(i)覆盖率(满足目标的实体所占比例)和(ii)在满足目标的已测阶梯配置中可达到的最佳AUC-PR。在受限的汽车遥测数据上,轻量级经典检测器在整个吞吐量扫描范围内都保持了覆盖率以及高于随机基线的检测提升。若干深度方法在失去准确率之前就先失去了可行性。
Summary / 总结
ECoLAD evaluates time-series anomaly detectors under deployment-relevant constraints, contrasting workstation-class evaluations with in-vehicle requirements. It uses a compute-reduction ladder across various detector families, applying explicit CPU thread caps and mechanically determined scaling rules. On automotive telemetry, lightweight classical detectors maintain both coverage and detection lift, while several deep methods become infeasible before losing accuracy.
ECoLAD 在部署相关的约束条件下评估时间序列异常检测器,将工作站级评估与车载需求进行对比。它对各类检测器家族施加计算缩减阶梯,使用显式的CPU线程上限和机械确定的缩放规则。在汽车遥测数据上,轻量级经典检测器在整个吞吐量范围内保持覆盖率和检测提升,而若干深度方法在失去准确率之前就已变得不可行。
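The two quantities the ECoLAD abstract reports for each target scoring rate, coverage and the best AUC-PR among ladder configurations that meet the target, can be computed from a table of measurements. The sketch below is a hedged illustration: the record layout (`entity`, `rate`, `auc_pr`), the per-entity aggregation, and the numbers are assumptions, since the abstract does not specify them.

```python
def throughput_summary(measurements, target_rate):
    """Summarise one detector at one target scoring rate.

    measurements: list of dicts with keys 'entity', 'rate' (scores per second)
    and 'auc_pr', one per measured ladder configuration.
    Returns (coverage, best_auc_pr_by_entity), where an entity counts as
    covered if at least one of its configurations reaches target_rate.
    """
    entities = {m["entity"] for m in measurements}
    best = {}
    for m in measurements:
        if m["rate"] >= target_rate:
            e = m["entity"]
            best[e] = max(best.get(e, float("-inf")), m["auc_pr"])
    coverage = len(best) / len(entities) if entities else 0.0
    return coverage, best

# Hypothetical measurements: entity A has a full and a reduced configuration,
# entity B only a slow full configuration.
measurements = [
    {"entity": "A", "rate": 100.0, "auc_pr": 0.30},
    {"entity": "A", "rate": 500.0, "auc_pr": 0.20},
    {"entity": "B", "rate": 80.0, "auc_pr": 0.40},
]

for target in (50.0, 90.0, 200.0):
    print(target, throughput_summary(measurements, target))
```

Sweeping the target upward shows the pattern the abstract describes: entity B drops out of coverage at a rate it cannot sustain even though its accuracy is the highest, and entity A stays covered only by falling back to a cheaper, less accurate configuration.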