arXiv 论文速递

2026-01-14 03:29
Snapshot: 20260114_0329
SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
Authors: Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-12T18:59:45+00:00 · Latest: 2026-01-12T18:59:45+00:00
Abstract
Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains.
中文标题/摘要
标题:SecureCAI:在对抗性网络安全环境中具有注入抗性的LLM辅助工具
大型语言模型已成为安全运营中心的变革性工具,能够实现自动化日志分析、钓鱼处理和恶意软件解释;然而,在对抗性网络安全环境中部署时,模型暴露于提示注入攻击中,恶意指令嵌入安全数据中操纵模型行为。本文介绍了SecureCAI,这是一种新颖的防御框架,结合了安全意识护栏、自适应宪法进化和直接偏好优化以消除不安全的响应模式,解决了传统安全机制在高风险安全环境中对抗复杂对手操纵时不足的问题。实验评估表明,与基线模型相比,SecureCAI将攻击成功率降低了94.7%,同时在良性安全分析任务上的准确率保持在95.1%,框架还集成了持续的红队反馈循环,以实现动态适应新兴攻击策略,并在持续的对抗压力下实现宪法遵守得分超过0.92,从而为将语言模型能力安全地集成到运营网络安全工作流中奠定了基础,并解决了当前对抗性领域中AI安全方法的关键空白。
Summary / 总结
SecureCAI is a defense framework designed to protect large language models in cybersecurity operations from prompt injection attacks. It extends Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns. Experimental results show that SecureCAI reduces attack success rates by 94.7% relative to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, and that it adapts to new attack strategies through continuous red-teaming feedback loops.
SecureCAI 是一种防御框架,旨在保护大型语言模型免受网络安全操作中的提示注入攻击。它结合了宪法AI原则和安全意识护栏,并采用自适应宪法进化来消除不安全的响应模式。SecureCAI 将攻击成功率降低了 94.7%,同时在良性任务上的准确性保持在 95.1%,展示了其在高风险安全环境中的有效性。
Tuning-free Visual Effect Transfer across Videos
Authors: Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang
First: 2026-01-12T18:59:32+00:00 · Latest: 2026-01-12T18:59:32+00:00
Comments: Project Page: https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/
Abstract
We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video's existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/.
中文标题/摘要
标题:无需调参的视频视觉效果转移
我们提出了RefVFX,这是一种新的框架,能够以前馈方式将参考视频中的复杂时间效果转移到目标视频或图像上。现有方法在基于提示或关键帧条件编辑方面表现出色,但在处理动态时间效果(如动态光照变化或角色变形)方面存在困难,这些效果难以通过文本或静态条件描述。将视频效果转移是一项挑战,因为模型必须将新的时间动态与输入视频的现有运动和外观相结合。为此,我们引入了一个大规模的三元组数据集,每个三元组包含一个参考效果视频、一个输入图像或视频以及一个显示转移效果的输出视频。创建这些数据并不容易,尤其是自然不存在的视频到视频效果三元组。为了生成这些数据,我们提出了一种可扩展的自动化管道,该管道创建高质量的配对视频,旨在保留输入的运动和结构,同时根据某些固定可重复的效果进行转换。然后,我们用源自LoRA适配器的图像到视频效果以及通过程序化组合生成的基于代码的时间效果对这些数据进行增强。基于我们新的数据集,我们使用最近的文本到视频骨干网络训练参考条件模型。实验结果表明,RefVFX生成了视觉上一致且时间上连贯的编辑,能够在未见过的效果类别上泛化,并在定量指标和人类偏好方面优于仅提示基线。
Summary / 总结
RefVFX is a framework that transfers complex temporal effects from a reference video to a target video or image. It addresses the challenge of dynamic temporal effects that are hard to describe through text or static conditions. The method uses a large-scale dataset of triplets and a scalable automated pipeline to generate paired videos, which are then augmented with image-to-video effects and code-based temporal effects. Experiments show that RefVFX produces visually consistent and temporally coherent edits, generalizes well, and outperforms prompt-only baselines both quantitatively and in human preference tests.
RefVFX 是一种框架,可以从参考视频中转移复杂的动态效果到目标视频或图像。该方法使用大规模的三元组数据集和可扩展的自动化管道生成配对视频,并通过 LoRA 适配器和代码生成的时间效果进行增强。实验表明,RefVFX 生成了视觉上一致且时间上连贯的编辑效果,具有良好的泛化能力,并在定量指标和人类偏好测试中优于仅通过提示的方法。
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head
Authors: Kewei Zhang, Ye Huang, Yufan Deng, Jincheng Yu, Junsong Chen, Huan Ling, Enze Xie, Daquan Zhou
First: 2026-01-12T18:59:18+00:00 · Latest: 2026-01-12T18:59:18+00:00
Comments: Code: https://github.com/DAGroup-PKU/MHLA/ Project website: https://dagroup-pku.github.io/MHLA/
Abstract
While the Transformer architecture dominates many fields, its quadratic self-attention complexity hinders its use in large-scale applications. Linear attention offers an efficient alternative, but its direct application often degrades performance, with existing fixes typically re-introducing computational overhead through extra modules (e.g., depthwise separable convolution) that defeat the original purpose. In this work, we identify a key failure mode in these methods: global context collapse, where the model loses representational diversity. To address this, we propose Multi-Head Linear Attention (MHLA), which preserves this diversity by computing attention within divided heads along the token dimension. We prove that MHLA maintains linear complexity while recovering much of the expressive power of softmax attention, and verify its effectiveness across multiple domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
中文标题/摘要
标题:MHLA:通过令牌级多头恢复线性注意力的表达性
尽管Transformer架构在许多领域占主导地位,但其二次自注意力复杂性阻碍了其在大规模应用中的使用。线性注意力提供了一种高效的替代方案,但其直接应用通常会降低性能,现有的修复方法通常通过额外模块(例如深度可分离卷积)重新引入计算开销,从而违背了原始目的。在本文中,我们识别了这些方法中的一个关键失败模式:全局上下文崩溃,模型在此过程中失去了表示多样性。为了解决这一问题,我们提出了多头线性注意力(MHLA),通过在令牌维度上将注意力计算分割到不同的头中来保持这种多样性。我们证明MHLA保持了线性复杂性,同时恢复了softmax注意力的大部分表达能力,并在多个领域验证了其有效性,分别在ImageNet分类上提高了3.6%,在NLP上提高了6.3%,在图像生成上提高了12.6%,在视频生成上提高了41%,且在相同的时间复杂度下。
Summary / 总结
The research aims to improve the expressivity of linear attention while maintaining its efficiency. The authors propose Multi-Head Linear Attention (MHLA), which computes attention within divided heads along the token dimension to preserve representational diversity. Experiments show MHLA improves performance across various domains, achieving a 3.6% improvement on ImageNet classification, a 6.3% gain on NLP, a 12.6% improvement on image generation, and a 41% enhancement on video generation under the same time complexity.
该研究通过提出Multi-Head Linear Attention (MHLA),在沿token维度划分头内计算注意力的同时保持表示多样性,解决了线性注意力方法的性能下降问题。MHLA保持线性复杂度的同时恢复了softmax注意力的大部分表达能力,在多个领域取得了显著提升,包括ImageNet分类提升了3.6%,NLP提升了6.3%,图像生成提升了12.6%,视频生成提升了41%(均在相同时间复杂度下)
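Illustration (not from the paper): to make the complexity argument concrete, the sketch below shows generic kernelized linear attention, checks it against the equivalent quadratic computation, and adds one possible token-blocked variant. The feature map `phi` (ELU + 1), all function names, and the way tokens are split into blocks are assumptions made for illustration; the abstract does not give MHLA's exact formulation, so this should not be read as the authors' implementation.

```python
import math

def phi(x):
    # Positive feature map (ELU + 1); an assumed choice, common in linear attention.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Linear-time attention: keys and values are summarized once, then reused per query."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """The same quantity computed with an explicit n x n weight matrix, for checking."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wi * v[b] for wi, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out

def blocked_linear_attention(Q, K, V, num_blocks):
    """Hypothetical token-level split: each block of tokens keeps its own summary."""
    n = len(Q)
    size = math.ceil(n / num_blocks)
    out = []
    for s in range(0, n, size):
        out.extend(linear_attention(Q[s:s + size], K[s:s + size], V[s:s + size]))
    return out
```

With a single global summary `S`, every query reads from the same compressed context; keeping separate summaries per token block is one simple way to retain more diversity at the same asymptotic cost, which is the intuition the abstract describes.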
Failure-Aware RL: Reliable Offline-to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation
Authors: Huanyu Li, Kun Lei, Sheng Zang, Kaizhe Hu, Yongyuan Liang, Bo An, Xiaoli Li, Huazhe Xu
First: 2026-01-12T18:53:11+00:00 · Latest: 2026-01-12T18:53:11+00:00
Comments: Project page: https://failure-aware-rl.github.io
Abstract
Post-training algorithms based on deep reinforcement learning can push the limits of robotic models for specific objectives, such as generalizability, accuracy, and robustness. However, Intervention-requiring Failures (IR Failures) (e.g., a robot spilling water or breaking fragile glass) during real-world exploration happen inevitably, hindering the practical deployment of such a paradigm. To tackle this, we introduce Failure-Aware Offline-to-Online Reinforcement Learning (FARL), a new paradigm minimizing failures during real-world reinforcement learning. We create FailureBench, a benchmark that incorporates common failure scenarios requiring human intervention, and propose an algorithm that integrates a world-model-based safety critic and a recovery policy trained offline to prevent failures during online exploration. Extensive simulation and real-world experiments demonstrate the effectiveness of FARL in significantly reducing IR Failures while improving performance and generalization during online reinforcement learning post-training. FARL reduces IR Failures by 73.1% while elevating performance by 11.3% on average during real-world RL post-training. Videos and code are available at https://failure-aware-rl.github.io.
中文标题/摘要
标题:失败感知RL:面向真实世界操作、具备自我恢复能力的可靠离线到在线强化学习
基于深度强化学习的后训练算法可以推动机器人模型在特定目标上的极限,如泛化能力、准确性和鲁棒性。然而,在现实世界探索过程中,不可避免地会发生需要干预的失败(IR失败,例如机器人洒水或打碎易碎玻璃),这阻碍了这种范式的实际部署。为了解决这个问题,我们引入了失败感知离线到在线强化学习(FARL),这是一种新的范式,旨在减少现实世界强化学习中的失败。我们创建了FailureBench基准,该基准包含需要人类干预的常见失败场景,并提出了一种算法,该算法结合了基于世界模型的安全评论器和在离线训练的恢复策略,以防止在线探索中的失败。广泛的模拟和现实世界的实验表明,FARL在显著减少IR失败的同时,提高了在线强化学习后的性能和泛化能力。FARL在现实世界RL后训练中将IR失败减少了73.1%,平均性能提高了11.3%。有关视频和代码,请访问https://failure-aware-rl.github.io。
Summary / 总结
The research addresses the challenge of Intervention-requiring Failures (IR Failures) during real-world robotic exploration, which hinders the practical deployment of deep reinforcement learning. It introduces Failure-Aware Offline-to-Online Reinforcement Learning (FARL), which uses a world-model-based safety critic and a recovery policy trained offline to prevent IR Failures during online exploration. Experiments show that FARL reduces IR Failures by 73.1% and improves performance by 11.3% on average during real-world reinforcement learning post-training.
研究通过引入Failure-Aware Offline-to-Online Reinforcement Learning (FARL)来应对现实世界强化学习中的干预要求失败(IR Failures)问题。该方法使用FailureBench基准来模拟常见故障场景,并提出一种算法,其中包括基于世界模型的安全评论家和一个离线训练的恢复策略,以防止在线探索期间发生此类故障。实验表明,FARL将IR Failures减少了73.1%,并在现实世界的RL后训练中平均提高了11.3%的性能。
CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace
Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
First: 2025-12-01T07:58:21+00:00 · Latest: 2026-01-12T18:49:06+00:00
Comments: Revised for clarity and correctness; improved exposition and fixed minor issues
Abstract
We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment can yield substantially narrower prediction intervals at a fixed target coverage, particularly on small to medium tabular datasets where data are scarce and uncertainty modeling is informative. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners assess when and why intervals shrink. Across multiple benchmarks using the same MLP backbone, CLAPS achieves nominal coverage and offers the most efficient intervals on small to medium datasets with mild heterogeneity, while remaining competitive and diagnostically transparent on large-scale heterogeneous data where Normalized-CP and CQR attain the tightest intervals.
中文标题/摘要
标题:CLAPS:基于最后层拉普拉斯近似的后验感知保形区间
我们提出了CLAPS,一种后验感知的保形回归方法,将最后层拉普拉斯近似与分割保形校准相结合。从得到的高斯后验中,CLAPS 定义了一个简单的双侧后验CDF分数,使一致性度量与预测的完整形状对齐,而不仅仅是点估计。这种对齐可以在固定目标覆盖率下显著收窄预测区间,特别是在数据稀缺、不确定性建模能提供有效信息的中小型表格数据集上。我们还提供了一个轻量级的诊断套件,用于分离偶然(aleatoric)和认知(epistemic)不确定性成分并可视化后验行为,帮助实践者评估区间何时以及为何收窄。在使用相同MLP骨干网络的多个基准测试中,CLAPS 达到名义覆盖率,并在具有轻微异质性的中小型数据集上提供最高效的区间;在大规模异质数据上,Normalized-CP 和 CQR 获得最紧的区间,而 CLAPS 仍保持竞争力和诊断透明性。
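Illustration (not from the paper): a minimal sketch of split-conformal calibration with a two-sided CDF score, assuming each prediction is summarized by a Gaussian mean `mu` and standard deviation `sigma`. The score form `|2F(y) - 1|` and all function names are assumptions; the abstract names a "two-sided posterior CDF score" without giving its formula, and the Laplace approximation that would produce `mu` and `sigma` is not shown here.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def cdf_score(y, mu, sigma):
    """Two-sided CDF score: 0 at the predictive median, approaching 1 in the tails."""
    return abs(2.0 * STD_NORMAL.cdf((y - mu) / sigma) - 1.0)

def calibrate(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return 1.0  # too few calibration points: the interval becomes unbounded
    return sorted(scores)[k - 1]

def interval(mu, sigma, q):
    """Invert the score: all y with cdf_score(y, mu, sigma) <= q."""
    if q >= 1.0:
        return (-math.inf, math.inf)
    half = sigma * STD_NORMAL.inv_cdf((1.0 + q) / 2.0)
    return (mu - half, mu + half)
```

Because the score lives on the CDF scale, the interval width scales with each point's own `sigma`, so inputs with a confident predictive distribution get narrower intervals than a fixed-width residual score would give.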
ORACLE: Explaining Feature Interactions in Neural Networks with ANOVA
Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
First: 2025-09-13T14:44:45+00:00 · Latest: 2026-01-12T18:46:44+00:00
Comments: v4: Revised for clarity and correctness; improved exposition and fixed minor issues
Abstract
We introduce ORACLE, a framework for explaining neural networks on tabular data and scientific factorial designs. ORACLE summarizes a trained network's prediction surface with main effects and pairwise interactions by treating the network as a black-box response, discretizing the inputs onto a grid, and fitting an orthogonal factorial (ANOVA-style) surrogate -- the $L^2$ orthogonal projection of the model response onto a finite-dimensional factorial subspace. A simple centering and $μ$-rebalancing step then expresses this surrogate as main- and interaction-effect tables that remain faithful to the original model in the $L^2$ sense. The resulting grid-based interaction maps are easy to visualize, comparable across backbones, and directly aligned with classical design-of-experiments practice. On synthetic factorial benchmarks and low- to medium-dimensional tabular regression tasks, ORACLE more accurately recovers ground-truth interaction structure and hotspots than Monte Carlo SHAP-family interaction methods, as measured by ranking, localization, and cross-backbone stability. We also discuss its scope in latent image and text settings: grid-based factorial surrogates are most effective when features admit an interpretable factorial structure, making ORACLE particularly well-suited to scientific and engineering workflows that require stable DoE-style interaction summaries.
中文标题/摘要
标题:ORACLE:通过ANOVA解释神经网络中的特征交互
我们引入了ORACLE框架,用于解释表格数据和科学因子设计中的神经网络。ORACLE通过将网络视为黑盒响应,将输入离散化到网格上,并拟合正交因子(ANOVA风格)的替代模型——模型响应在有限维因子子空间上的$L^2$正交投影,来总结训练网络的预测表面,包括主效应和两两交互。通过简单的中心化和$μ$重新平衡步骤,将此替代模型表示为主效应和交互效应表,这些表在$L^2$意义上忠实于原始模型。基于网格的交互图易于可视化,可以在不同骨干网络之间进行比较,并直接与经典的设计实验实践对齐。在合成因子基准和低到中维表格回归任务上,ORACLE比蒙特卡洛SHAP族交互方法更准确地恢复了真实交互结构和热点,这通过排名、定位和跨骨干网络稳定性来衡量。我们还讨论了其在潜在图像和文本设置中的适用范围:基于网格的因子替代模型在特征允许可解释的因子结构时最有效,使ORACLE特别适合需要稳定的设计实验风格交互总结的科学和工程工作流。
Summary / 总结
ORACLE is a framework that explains feature interactions in neural networks by treating the network as a black-box response, fitting an ANOVA-style surrogate model, and expressing it as main- and interaction-effect tables. It outperforms Monte Carlo SHAP-family methods in recovering ground-truth interaction structures and hotspots on synthetic benchmarks and tabular regression tasks. ORACLE is particularly useful for scientific and engineering workflows requiring stable DoE-style interaction summaries.
ORACLE 是一个框架,通过将神经网络视为黑盒响应,拟合 ANOVA 样式的替代模型,并将其表示为主效应和交互效应表来解释特征交互。它在合成基准和表格回归任务中优于 Monte Carlo SHAP 家族方法,更准确地恢复了真实交互结构和热点。ORACLE 特别适用于需要稳定 DoE 样式交互总结的科学和工程工作流。
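Illustration (not from the paper): a minimal sketch of a two-factor ANOVA-style decomposition on a grid, using uniform weights over grid cells. The stand-in `model` function, the grid levels, and the uniform weighting are assumptions; the abstract's centering and μ-rebalancing step is not specified in enough detail to reproduce, so plain centering is used instead.

```python
def anova_tables(grid):
    """Decompose grid[i][j] = mu + a[i] + b[j] + ab[i][j] with uniform weights."""
    n_rows, n_cols = len(grid), len(grid[0])
    mu = sum(map(sum, grid)) / (n_rows * n_cols)
    # Main effects: row and column means, centered on the grand mean.
    a = [sum(row) / n_cols - mu for row in grid]
    b = [sum(grid[i][j] for i in range(n_rows)) / n_rows - mu for j in range(n_cols)]
    # Pairwise interaction: whatever the additive part leaves unexplained.
    ab = [[grid[i][j] - mu - a[i] - b[j] for j in range(n_cols)] for i in range(n_rows)]
    return mu, a, b, ab

def model(x1, x2):
    # Stand-in for a trained network queried as a black box (hypothetical).
    return 1.0 + 2.0 * x1 - x2 + 3.0 * x1 * x2

levels = [0.0, 0.5, 1.0]
grid = [[model(u, v) for v in levels] for u in levels]
mu, a, b, ab = anova_tables(grid)
```

The tables sum back to the grid exactly, and each effect table averages to zero along its own axes, which is what makes the main-effect and interaction tables readable independently of one another.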
More Images, More Problems? A Controlled Analysis of VLM Failure Modes
Authors: Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, Brais Martinez
First: 2026-01-12T18:45:13+00:00 · Latest: 2026-01-12T18:45:13+00:00
Comments: 19 pages, 16 figures
Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments show that these remedies substantially improve cross-image aggregation while also enhancing performance on existing multi-image benchmarks, outperforming the prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.
中文标题/摘要
标题:更多图像,更多问题?对VLM失败模式的受控分析
大型视觉语言模型(LVLMs)展示了卓越的能力,但在理解和推理多个图像方面的熟练程度仍鲜有探索。尽管现有基准已经启动了对多图像模型的评估,但对其核心弱点及其原因的全面分析仍然缺乏。在本文中,我们引入了MIMIC(多图像模型见解与挑战),这是一种新的基准,旨在严格评估LVLMs的多图像能力。使用MIMIC,我们进行了一系列诊断实验,揭示了普遍存在的问题:LVLMs往往无法在图像间汇总信息,并且难以同时跟踪或关注多个概念。为解决这些失败,我们提出了两种新的互补补救措施。在数据方面,我们提出了一种过程化的数据生成策略,将单图像注释组合成丰富的、有针对性的多图像训练示例。在优化方面,我们分析了逐层注意力模式,并推导出一种针对多图像输入的注意力掩蔽方案。实验显著提高了跨图像聚合能力,同时也在现有多图像基准测试上提高了性能,超越了先前的最先进水平。数据和代码将在https://github.com/anurag-198/MIMIC上提供。
Summary / 总结
This study addresses the limitations of Large Vision Language Models (LVLMs) in handling multiple images by introducing MIMIC, a new benchmark. Through diagnostic experiments, it reveals that LVLMs struggle to aggregate information across images and track multiple concepts simultaneously. To improve these capabilities, the authors propose a procedural data-generation strategy and an attention-masking scheme, which significantly enhance cross-image aggregation and outperform previous state-of-the-art models on multi-image benchmarks.
该研究通过引入MIMIC新基准来解决大型视觉语言模型(LVLM)在处理多张图片时的局限性。诊断实验显示,LVLM在跨图片信息聚合和跟踪多个概念方面存在困难。为改进这些能力,作者提出了一种程序化数据生成策略和注意力掩码方案,显著提升了跨图片聚合能力,并在多图片基准测试中超越了之前的最佳模型。
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev
First: 2025-09-29T21:40:10+00:00 · Latest: 2026-01-12T18:44:30+00:00
Comments: Code: https://github.com/ontocord/mixturevitae
Abstract
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals that are typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameters/300B-tokens setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
中文标题/摘要
标题:MixtureVitae:开放的网络规模预训练数据集,基于宽松许可的文本源构建,包含高质量的指令和推理数据
我们介绍了MixtureVitae,一个开放访问的预训练语料库,旨在最小化法律风险同时提供强大的下游性能。MixtureVitae 采用了一种宽松许可优先、风险缓解的采样策略,结合了公共领域和宽松许可的文本(例如CC-BY/Apache),以及经过仔细验证的低风险添加(例如政府作品和欧盟TDM合格来源)。MixtureVitae 采用了一种简单的单阶段预训练配方,整合了大量的宽松许可合成指令和推理数据信号,这些信号通常在后训练阶段引入,而在宽松许可的网络语料库中通常较为稀缺。我们将所有来源分为三级方案,反映不同的风险级别,并提供分片级别的来源元数据,以支持风险意识使用。在使用开放科学参考训练协议(固定架构和超参数;130M-1.7B参数,50B和300B令牌预算)的受控实验中,使用MixtureVitae 训练的模型在一系列标准基准测试中始终优于其他宽松许可的数据集,在1.7B参数/300B令牌设置下,它们超过了FineWeb-Edu,并接近DCLM的后期训练表现。特别是在MMLU和数学、代码基准测试中,表现尤为突出:一个使用300B MixtureVitae令牌预训练的1.7B模型,在GSM8K、HumanEval和MBPP基准测试中,尽管使用了超过36倍少的令牌(300B vs. ~11T),但与强大的1.7B指令调优基线相当或超过。通过彻底的去污分析支持,这些结果表明,基于许可优先的数据,按许可证和来源相关风险分级,可以提供一种实用且风险缓解的基础,用于训练强大的语言模型,减少对广泛网络抓取的依赖,而不牺牲竞争力。
Summary / 总结
MixtureVitae is an open-access pretraining dataset that combines public-domain and permissively licensed text with low-risk additions to minimize legal risk while providing strong downstream performance. It uses a simple, single-stage pretraining recipe with a large proportion of synthetic instruction and reasoning data. In experiments, models trained on MixtureVitae outperform other permissive datasets on standard benchmarks, especially on MMLU, math, and code benchmarks, where a 1.7B model pretrained on 300B tokens matches or exceeds a strong 1.7B instruction-tuned baseline despite using far fewer tokens. This shows that permissive-first data with high instruction and reasoning density can provide a practical and risk-mitigated foundation for training capable LLMs.
MixtureVitae 是一个开放获取的预训练数据集,结合了公共领域和宽松许可的文本以及精心选择的低风险添加,以最小化法律风险同时提供强大的下游性能。它使用简单的单阶段预训练配方,整合了大量的合成指令和推理数据。实验表明,使用 MixtureVitae 预训练的模型在各种基准测试中优于其他宽松许可数据集,特别是在 MMLU、数学和代码任务中,一个在 300B 令牌上预训练的 1.7B 模型的表现与一个强大的 1.7B 指令调优基线相当或更优,尽管使用了显著较少的令牌。
The Confidence Trap: Gender Bias and Predictive Certainty in LLMs
Authors: Ahmed Sabir, Markus Kängsepp, Rajesh Sharma
Venue: AAAI 2026 Oral
First: 2026-01-12T18:38:05+00:00 · Latest: 2026-01-12T18:38:05+00:00
Comments: AAAI 2026 (AISI Track), Oral. Project page: https://bit.ly/4p8OKQD
Abstract
The increased use of Large Language Models (LLMs) in sensitive domains leads to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate if calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models, Gemma-2 demonstrates the worst calibration according to the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs' confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.
中文标题/摘要
标题:自信陷阱:性别偏见和LLM预测确定性
在敏感领域中对大型语言模型(LLMs)使用的增加引发了对其自信评分与公平性和偏见之间关系的兴趣。本研究探讨了LLM预测自信与人类标注偏见判断之间的契合度。研究重点是性别偏见,调查了涉及性别代词解析的自信概率校准情况。目标是评估基于预测自信评分的校准指标是否能有效捕捉LLMs中的公平性差异。结果显示,在六个最先进的模型中,Gemma-2在性别偏见基准测试中的校准最差。本工作的主要贡献是对LLMs的自信校准进行公平性意识评估,为伦理部署提供指导。此外,我们引入了一个新的校准指标,性别-ECE,用于衡量解析任务中的性别差异。
Summary / 总结
This study investigates the relationship between confidence scores and gender bias in Large Language Models (LLMs). By focusing on gender bias in contexts involving gendered pronoun resolution, the research evaluates the calibration of confidence scores across six state-of-the-art models. The findings indicate that Gemma-2 has the worst calibration according to the gender bias benchmark. The study introduces a new calibration metric, Gender-ECE, to measure gender disparities in resolution tasks, contributing to a fairness-aware evaluation of LLMs for ethical deployment.
研究探讨了大型语言模型(LLMs)在性别偏见敏感情境下,如性别代词解析,其预测置信度与实际性别偏见的一致性。通过评估六种最先进的模型,研究发现Gemma-2在性别偏见基准测试中的校准效果最差。主要贡献是引入了一个新的校准度量标准,性别ECE,用于衡量解析任务中的性别差异,旨在指导LLMs的伦理部署。
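Illustration (not from the paper): a minimal sketch of standard Expected Calibration Error (ECE) with equal-width bins, plus a per-group comparison. The `group_ece_gap` helper is a hypothetical disparity measure introduced here for illustration only; the abstract does not define Gender-ECE, so this is not that metric.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            total += len(b) / n * abs(acc - avg_conf)
    return total

def group_ece_gap(confidences, correct, groups):
    """Hypothetical disparity measure: spread of per-group ECE values."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = ece([confidences[i] for i in idx], [correct[i] for i in idx])
    return max(per_group.values()) - min(per_group.values()), per_group
```

A model can have a low overall ECE while being well calibrated for one group and overconfident for another, which is why a per-group breakdown is informative for the fairness question the study raises.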
Exchange Is All You Need for Remote Sensing Change Detection
Authors: Sijun Dong, Siming Fu, Kaiyu Li, Xiangyong Cao, Xiaoliang Meng, Bo Du
First: 2026-01-12T18:36:51+00:00 · Latest: 2026-01-12T18:36:51+00:00
Abstract
Remote sensing change detection fundamentally relies on the effective fusion and discrimination of bi-temporal features. Prevailing paradigms typically utilize Siamese encoders bridged by explicit difference computation modules, such as subtraction or concatenation, to identify changes. In this work, we challenge this complexity with SEED (Siamese Encoder-Exchange-Decoder), a streamlined paradigm that replaces explicit differencing with parameter-free feature exchange. By sharing weights across both Siamese encoders and decoders, SEED effectively operates as a single parameter set model. Theoretically, we formalize feature exchange as an orthogonal permutation operator and prove that, under pixel consistency, this mechanism preserves mutual information and Bayes optimal risk, whereas common arithmetic fusion methods often introduce information loss. Extensive experiments across five benchmarks, including SYSU-CD, LEVIR-CD, PX-CLCD, WaterCD, and CDD, and three backbones, namely SwinT, EfficientNet, and ResNet, demonstrate that SEED matches or surpasses state of the art methods despite its simplicity. Furthermore, we reveal that standard semantic segmentation models can be transformed into competitive change detectors solely by inserting this exchange mechanism, referred to as SEG2CD. The proposed paradigm offers a robust, unified, and interpretable framework for change detection, demonstrating that simple feature exchange is sufficient for high performance information fusion. Code and full training and evaluation protocols will be released at https://github.com/dyzy41/open-rscd.
中文标题/摘要
标题:无需复杂操作,交换即所有所需——遥感变化检测
遥感变化检测本质上依赖于有效融合和区分双时相特征。现有主流方法通常使用通过显式差异计算模块(如减法或连接)连接的双编码器结构来识别变化。本文通过SEED(双编码器-交换-解码器)简化了这一过程,用无参数特征交换替代了显式差异计算。通过在双编码器和解码器中共享权重,SEED 实际上作为一个单一参数集模型运作。理论上,我们将特征交换形式化为正交置换操作,并证明在像素一致性条件下,该机制能够保持互信息和贝叶斯最优风险,而常见的算术融合方法往往会导致信息损失。在SYSU-CD、LEVIR-CD、PX-CLCD、WaterCD和CDD五个基准数据集以及SwinT、EfficientNet和ResNet三种骨干网络上进行的大量实验表明,尽管结构简单,SEED仍能匹配或超越现有最先进的方法。此外,我们发现,通过插入这种交换机制,标准语义分割模型可以转变为具有竞争力的变化检测器,称为SEG2CD。所提出的方法为变化检测提供了一个稳健、统一且可解释的框架,证明了简单的特征交换足以实现高性能的信息融合。代码和完整的训练与评估协议将在https://github.com/dyzy41/open-rscd/发布。
Summary / 总结
This paper addresses the challenge of remote sensing change detection by proposing SEED (Siamese Encoder-Exchange-Decoder), which simplifies the process by replacing explicit feature differencing with a parameter-free feature exchange mechanism. Theoretical analysis shows that this approach preserves mutual information and Bayes optimal risk under pixel consistency, while common arithmetic methods often lead to information loss. Experimental results across five benchmarks and three backbones demonstrate that SEED matches or outperforms existing methods, proving that simple feature exchange can achieve high performance in change detection. Additionally, the exchange mechanism can transform standard semantic segmentation models into competitive change detectors, offering a unified and interpretable framework for remote sensing change detection.
本文提出SEED(Siamese Encoder-Exchange-Decoder)来简化遥感变化检测过程,通过参数无关的特征交换机制替代显式的特征差异计算。理论分析表明,在像素一致性条件下,该方法能保持互信息和贝叶斯最优风险,而常见的算术融合方法往往会引入信息损失。实验结果表明,SEED 在五个基准和三种骨干网络上与现有方法相当或更优,证明简单的特征交换足以实现高性能的信息融合。此外,通过插入此交换机制,标准语义分割模型可以转变为具有竞争力的变化检测器,提供了一个统一且可解释的变化检测框架。
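Illustration (not from the paper): a minimal sketch contrasting parameter-free feature exchange with explicit subtraction on two per-pixel feature vectors. The alternating-channel swap pattern and the function names are assumptions; the abstract only states that exchange acts as a permutation and does not say which elements are swapped.

```python
def exchange(f1, f2):
    """Swap odd-indexed channels between two feature lists (no learned parameters)."""
    g1 = [b if c % 2 else a for c, (a, b) in enumerate(zip(f1, f2))]
    g2 = [a if c % 2 else b for c, (a, b) in enumerate(zip(f1, f2))]
    return g1, g2

def subtract(f1, f2):
    """Explicit differencing baseline; many input pairs map to the same output."""
    return [a - b for a, b in zip(f1, f2)]
```

Applying `exchange` twice returns the original pair, so the operation is invertible and discards nothing, whereas `subtract` sends `([1, 2], [0, 1])` and `([5, 6], [4, 5])` to the same difference. That is the informal version of the information-preservation claim in the abstract.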
Discovering Coordinated Joint Options via Inter-Agent Relative Dynamics
Authors: Raul D. Steleac, Mohan Sridharan, David Abel
First: 2025-12-31T12:39:22+00:00 · Latest: 2026-01-12T18:29:50+00:00
Abstract
Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the Fermat state, and use it to define a measure of spreadness, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
中文标题/摘要
标题:通过代理相对动力学发现协调联合选项
时间扩展的动作在单代理环境中提高了探索和规划的能力。在多代理环境中,随着代理数量的增加,联合状态空间的指数增长使得协调行为更加有价值。然而,这种指数增长使得多代理选项的设计变得尤为具有挑战性。现有的多代理选项发现方法往往通过产生松散耦合或完全独立的行为来牺牲协调性。为了解决这些限制,我们描述了一种新的多代理选项发现方法。具体来说,我们提出了一种联合状态抽象,该抽象压缩了状态空间,同时保留了发现强烈协调行为所需的信息。我们的方法基于这样一个归纳偏见:代理状态的同步为在没有明确目标的情况下协调提供了自然的基础。我们首先近似一个与团队最大对齐的虚构状态,即“费马”状态,并使用它来定义一个“分散度”的度量,捕捉每个个体状态维度上的团队级不对齐。基于这种表示,我们然后使用神经图拉普拉斯估计器来推导出捕捉代理间状态同步模式的选项。我们在两个多代理领域中的多个场景中评估了这些选项,结果显示它们在下游协调能力方面优于其他选项发现方法。
Summary / 总结
The paper addresses the challenge of discovering coordinated joint options in multi-agent systems, where the exponential growth of the joint state space complicates the design of coordinated behaviors. The authors propose a novel approach that uses a joint-state abstraction to compress the state space while preserving the necessary information for discovering strongly coordinated behaviors. By approximating a fictitious state of maximal alignment and defining a measure of spreadness, the method identifies options that capture state synchronization patterns between agents. Experimental results across multiple scenarios in two multi-agent domains demonstrate that these options provide stronger coordination capabilities compared to existing methods.
该论文通过提出一种新颖的方法来解决多智能体系统中发现协调行为的挑战,该方法利用联合状态抽象来压缩状态空间,同时保留足够的信息以实现强协调。该方法使用一个最大对齐的虚构状态,称为费马状态,来定义一个衡量分散性的度量,该度量捕捉团队在每个个体状态维度上的不协调程度。通过使用神经图拉普拉斯估计器,该方法提取出能够捕捉智能体之间状态同步模式的选项,这些选项在各种场景中的下游协调能力比现有方法更强。
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Authors: Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
First: 2025-03-27T18:04:05+00:00 · Latest: 2026-01-12T18:27:42+00:00
Comments: To be presented at EACL 2026
Abstract
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
中文标题/摘要
标题:StarFlow:从草图图像生成结构化工作流输出
工作流是企业平台自动化的基本组成部分,能够实现任务编排、数据处理和系统集成。尽管被广泛使用,但构建工作流往往复杂,通常需要通过低代码平台或可视化编程工具进行手动配置。为了简化这一过程,我们探索了使用生成基础模型,特别是视觉语言模型(VLMs),从视觉输入自动生成结构化工作流的方法。将手绘草图或计算机生成的图表转换为可执行的工作流具有挑战性,因为自由形式的绘制具有歧义性,图表风格存在差异,从视觉元素中推断执行逻辑也具有难度。为了解决这一问题,我们引入了StarFlow框架,用于使用视觉语言模型从草图生成结构化工作流输出。我们收集了一个多样化的数据集,包括合成的、手动标注的和实际工作流示例,以实现稳健的训练和评估。我们对多个视觉语言模型进行了微调和基准测试,并进行了一系列消融研究,以分析我们方法的优势和局限性。我们的结果显示,微调显著提高了结构化工作流生成的效果,在此任务上优于大型视觉语言模型。
Summary / 总结
The paper aims to simplify the process of building workflows by using generative foundation models, specifically vision-language models (VLMs), to automatically generate structured workflows from sketch images. The authors introduce StarFlow, a framework that curates a diverse dataset and finetunes VLMs to address the challenges of translating ambiguous free-form drawings into executable workflows. Experimental results demonstrate that finetuning VLMs significantly improves structured workflow generation, outperforming large VLMs on this task.
研究动机是利用生成基础模型,特别是视觉语言模型(VLMs),从手绘草图等视觉输入自动生成结构化的工作流,从而简化工作流的构建过程。作者提出了StarFlow框架,收集多样化的数据集并对VLMs进行微调,以解决将含义模糊的自由形式绘图转换为可执行工作流的挑战。实验结果表明,微调显著提高了结构化工作流的生成能力,在该任务上优于大型VLMs。
AgentCompress: Task-Aware Compression for Affordable Large Language Model Agents
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-08T18:13:46+00:00 · Latest: 2026-01-12T18:25:18+00:00
Abstract
Large language models hold considerable promise for various applications, but their computational requirements create a barrier that many institutions cannot overcome. A single session using a 70-billion-parameter model can cost around $127 in cloud computing fees, which puts these tools out of reach for organizations operating on limited budgets. We present AgentCompress, a framework that tackles this problem through task-aware dynamic compression. The idea comes from a simple observation: not all tasks require the same computational effort. Complex reasoning, for example, is far more demanding than text reformatting, yet conventional compression applies the same reduction to both. Our approach uses a lightweight neural controller that looks at the first few tokens of each request, estimates how complex the task will be, and sends it to an appropriately quantized version of the model. This routing step adds only about 12 milliseconds of overhead. We tested the framework on 290 multi-stage workflows from domains including computer science, physics, chemistry, and biology. The results show a 68.3% reduction in computational costs while preserving 96.2% of the original success rate. These findings suggest that routing queries intelligently can make powerful language models substantially more affordable without sacrificing output quality.
中文标题/摘要
标题:AgentCompress:面向任务的压缩技术以实现负担得起的大语言模型代理
大型语言模型在各种应用中具有巨大的潜力,但其计算需求构成了许多机构无法克服的障碍。使用一个包含700亿参数的模型进行单次会话的云计算费用约为127美元,这使得这些工具对于预算有限的组织来说遥不可及。我们提出了AgentCompress框架,通过任务感知动态压缩来解决这一问题。这一想法源于一个简单的观察:并非所有任务都需要相同的计算量。例如,复杂的推理远比文本重排的计算开销更大,而传统的压缩方法对两者都应用相同的缩减。我们的方法使用一个轻量级的神经控制器,它查看每个请求的前几个词元,估计任务的复杂性,并将其发送到适当量化版本的模型。这一路由步骤仅增加了大约12毫秒的开销。我们在包括计算机科学、物理学、化学和生物学在内的领域中的290个多阶段工作流上测试了该框架。结果显示,在保持原始成功率96.2%的情况下,计算成本降低了68.3%。这些发现表明,智能路由查询可以显著降低强大语言模型的成本,而不会牺牲输出质量。
Summary / 总结
AgentCompress is a framework that uses task-aware dynamic compression to reduce the computational costs of large language models. By estimating the complexity of each task based on the first few tokens of the request, it routes the task to an appropriately quantized version of the model, adding only 12 milliseconds of overhead. The framework achieved a 68.3% reduction in computational costs while maintaining 96.2% of the original success rate across 290 multi-stage workflows from various scientific domains.
AgentCompress 是一种任务感知压缩框架,旨在降低大型语言模型的计算成本,使其对预算有限的组织更具可访问性。它使用轻量级神经控制器根据任务复杂度动态压缩模型,增加的开销很小。该框架在计算机科学、物理学、化学和生物学等多个科学领域中的 290 个多阶段工作流上进行了测试,实现了 68.3% 的计算成本降低,同时保持了 96.2% 的原始成功率。
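The routing idea in the AgentCompress abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in: the tier names, relative costs, thresholds, and the keyword heuristic replace the paper's learned neural controller, whose actual design is not given in the abstract. The last helper only does the arithmetic of applying the reported 68.3% reduction to the quoted $127 session cost, assuming the reduction applies uniformly to such a session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # hypothetical label for a quantized model variant
    bits: int             # quantization bit-width (assumed values)
    relative_cost: float  # cost relative to the least-compressed variant (assumed)


# Assumed tiers, ordered from cheapest to most expensive.
TIERS = [
    Tier("int4", 4, 0.25),
    Tier("int8", 8, 0.50),
    Tier("fp16", 16, 1.00),
]

# Toy cue words standing in for a learned complexity estimator.
REASONING_CUES = {"prove", "derive", "why", "plan", "debug", "optimize"}


def estimate_complexity(prompt: str, prefix_tokens: int = 8) -> float:
    """Score the first few whitespace tokens of a request in [0, 1]."""
    prefix = prompt.lower().split()[:prefix_tokens]
    if not prefix:
        return 0.0
    hits = sum(1 for tok in prefix if tok.strip(".,?:") in REASONING_CUES)
    return min(1.0, hits / 2)


def route(prompt: str) -> Tier:
    """Send low-complexity requests to more aggressively quantized tiers."""
    score = estimate_complexity(prompt)
    if score < 0.5:
        return TIERS[0]
    if score < 1.0:
        return TIERS[1]
    return TIERS[2]


def session_cost(baseline_usd: float, reduction: float) -> float:
    """Cost remaining after a fractional reduction."""
    return baseline_usd * (1.0 - reduction)


if __name__ == "__main__":
    for request in ("Reformat this list as a table",
                    "Debug this function please",
                    "Prove and derive why the bound holds"):
        print(request, "->", route(request).name)
    print(f"${session_cost(127.0, 0.683):.2f}")
```

Under the uniform-reduction assumption, a $127 session would cost about $40.26. Inspecting only a short prefix is what keeps the routing overhead small in this sketch; the trade-off is that a request whose difficulty only shows up later in the prompt can be routed to too small a tier.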
Near-Real-Time Resource Slicing for QoS Optimization in 5G O-RAN using Deep Reinforcement Learning
Authors: Peihao Yan, Jie Lu, Huacheng Zeng, Y. Thomas Hou
Venue: P. Yan, J. Lu, H. Zeng and Y. Thomas Hou, "Near-Real-Time Resource Slicing for QoS Optimization in 5G O-RAN Using Deep Reinforcement Learning," in IEEE Transactions on Networking, vol. 34, pp. 1596-1611, 2026
First: 2025-09-17T18:20:04+00:00 · Latest: 2026-01-12T18:11:14+00:00
Comments: Published in: IEEE Transactions on Networking
Abstract
Open-Radio Access Network (O-RAN) has become an important paradigm for 5G and beyond radio access networks. This paper presents an xApp called xSlice for the Near-Real-Time (Near-RT) RAN Intelligent Controller (RIC) of 5G O-RANs. xSlice is an online learning algorithm that adaptively adjusts MAC-layer resource allocation in response to dynamic network states, including time-varying wireless channel conditions, user mobility, traffic fluctuations, and changes in user demand. To address these network dynamics, we first formulate the Quality-of-Service (QoS) optimization problem as a regret minimization problem by quantifying the QoS demands of all traffic sessions through weighting their throughput, latency, and reliability. We then develop a deep reinforcement learning (DRL) framework that utilizes an actor-critic model to combine the advantages of both value-based and policy-based updating methods. A graph convolutional network (GCN) is incorporated as a component of the DRL framework for graph embedding of RAN data, enabling xSlice to handle a dynamic number of traffic sessions. We have implemented xSlice on an O-RAN testbed with 10 smartphones and conducted extensive experiments to evaluate its performance in realistic scenarios. Experimental results show that xSlice can reduce performance regret by 67% compared to the state-of-the-art solutions. Source code is available at https://github.com/xslice-5G/code.
中文标题/摘要
标题:基于深度强化学习的5G O-RAN服务质量优化的近实时资源切片
开放无线接入网络(O-RAN)已成为5G及更高级无线接入网络的重要范式。本文提出了一种名为xSlice的xApp,用于5G O-RAN的近实时(Near-RT)无线接入网络智能控制器(RIC)。xSlice是一种在线学习算法,能够根据动态网络状态,包括时间变化的无线信道条件、用户移动性、流量波动和用户需求变化,自适应地调整MAC层资源分配。为应对这些网络动态,我们首先将服务质量(QoS)优化问题形式化为一个遗憾最小化问题,通过加权计算所有流量会话的吞吐量、延迟和可靠性来量化QoS需求。然后,我们开发了一种深度强化学习(DRL)框架,利用演员-评论家模型结合基于价值和基于策略更新方法的优点。图卷积网络(GCN)被用作DRL框架的一部分,用于RAN数据的图嵌入,使xSlice能够处理动态数量的流量会话。我们已在包含10部智能手机的O-RAN测试平台上实现了xSlice,并进行了广泛的实验以评估其在现实场景中的性能。实验结果表明,与最先进的解决方案相比,xSlice可将性能遗憾降低67%。源代码可在https://github.com/xslice-5G/code/获取。
Summary / 总结
This paper introduces xSlice, an xApp for the 5G O-RAN RIC that uses deep reinforcement learning to adaptively adjust MAC-layer resource allocation in response to dynamic network conditions. By formulating the QoS optimization problem as a regret minimization task and leveraging a graph convolutional network, xSlice effectively handles varying traffic sessions. Experimental results demonstrate that xSlice reduces performance regret by 67% compared to existing solutions.
该论文提出了一种名为xSlice的在线学习算法,用于5G O-RAN的近实时RIC,能够根据动态网络状态自适应调整MAC层资源分配。算法将QoS优化问题表述为一个遗憾最小化任务,并采用深度强化学习框架结合演员-评论家模型和图卷积网络进行高效的资源切片。实验结果表明,xSlice相比现有解决方案将性能遗憾降低了67%。
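The xSlice abstract describes quantifying each traffic session's QoS demand by weighting throughput, latency, and reliability, then minimizing regret. The sketch below is one possible reading, not the paper's formulation: the normalized shortfall terms, the field names, the units, and the default weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Demanded QoS targets (hypothetical units and field names).
    min_throughput_mbps: float
    max_latency_ms: float
    min_reliability: float
    # Values achieved in the current control interval.
    throughput_mbps: float
    latency_ms: float
    reliability: float


def session_regret(s: Session, w_tp: float = 1.0, w_lat: float = 1.0,
                   w_rel: float = 1.0) -> float:
    """Weighted sum of normalized QoS shortfalls; zero when all demands are met."""
    tp_short = max(0.0, 1.0 - s.throughput_mbps / s.min_throughput_mbps)
    lat_excess = max(0.0, s.latency_ms / s.max_latency_ms - 1.0)
    rel_short = max(0.0, 1.0 - s.reliability / s.min_reliability)
    return w_tp * tp_short + w_lat * lat_excess + w_rel * rel_short


def total_regret(sessions, **weights) -> float:
    """Sum regret over however many sessions are currently active."""
    return sum(session_regret(s, **weights) for s in sessions)


if __name__ == "__main__":
    satisfied = Session(10.0, 20.0, 0.99, 10.0, 20.0, 0.99)
    starved = Session(10.0, 20.0, 0.99, 5.0, 30.0, 0.99)
    print(total_regret([satisfied, starved]))
```

Because `total_regret` simply sums over a list, the objective is defined for any number of sessions, which mirrors the abstract's requirement that the controller handle a dynamic session count. In the paper this role is played by a GCN embedding inside an actor-critic DRL framework, which the sketch does not attempt to reproduce.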
Kinship Data Benchmark for Multi-hop Reasoning
Authors: Tianda Sun, Dimitar Kazakov
First: 2026-01-12T18:07:41+00:00 · Latest: 2026-01-12T18:07:41+00:00
Comments: 11 pages, 2 figures, 9 tables
Abstract
Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.
中文标题/摘要
标题:亲属关系数据基准测试用于多跳推理
大型语言模型(LLMs)越来越多地被评估其进行多跳推理的能力,即结合多个信息进行连贯的推理。我们引入了KinshipQA,一个通过推理亲属关系来测试这种能力的基准测试。我们工作的主要贡献是一个生成管道,可以根据需求生成大规模、现实且文化特定的家谱数据:一系列相互连接的家庭树集合,满足与不同亲属制度相关的明确婚姻约束。这使得任务难度、文化假设和关系深度可以系统地控制和变化。从这些家谱中,我们推导出需要推理隐含关系链的文本推理任务。我们使用六种最先进的LLM进行评估,这些模型涵盖了开源和闭源模型,采用统一的零样本协议和确定性解码。性能通过精确匹配和基于集合的度量进行衡量。我们的结果表明,KinshipQA产生了广泛的结果,并在模型和文化背景下揭示了多跳推理的系统性差异。
Summary / 总结
The research aims to evaluate large language models' ability to perform multi-hop reasoning through a new benchmark called KinshipQA, which involves reasoning over kinship relations. The method involves generating large-scale, realistic, and culture-specific genealogical data using a generative pipeline. Key findings show that KinshipQA exposes systematic differences in multi-hop reasoning capabilities across various state-of-the-art models and cultural settings, with a wide spread of outcomes observed. Performance is measured using exact-match and set-based metrics under a zero-shot protocol.
研究旨在通过一个新的基准KinshipQA来评估大型语言模型在亲属关系推理中的多跳推理能力。方法是使用生成管道生成大规模、现实且具有文化特异性的家谱数据。关键发现表明,KinshipQA揭示了不同模型和不同文化背景之间在多跳推理能力上的系统性差异,观察到的结果分布范围广泛。性能在零样本协议下通过精确匹配和基于集合的度量进行评估。
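To make the exact-match and set-based metrics mentioned in the KinshipQA abstract concrete, here is a minimal sketch. The tiny family tree, the names, and the choice of set-level F1 as the set-based metric are illustrative assumptions; the abstract does not say which set-based metrics the benchmark uses.

```python
# Hypothetical toy genealogy: child -> set of parents.
PARENTS = {
    "carol": {"alice", "bob"},
    "alice": {"dora", "emil"},
    "bob": {"fay"},
}


def grandparents(person: str) -> set:
    """Two-hop query: follow the parent relation twice."""
    return {gp
            for parent in PARENTS.get(person, set())
            for gp in PARENTS.get(parent, set())}


def normalize(name: str) -> str:
    return name.strip().lower()


def exact_match(pred: str, gold: str) -> bool:
    """Single-answer metric: strings must agree after normalization."""
    return normalize(pred) == normalize(gold)


def set_f1(pred, gold) -> float:
    """Set-level F1 between predicted and gold answer sets."""
    p = {normalize(x) for x in pred}
    g = {normalize(x) for x in gold}
    if not p and not g:
        return 1.0
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = grandparents("carol")
    print(sorted(gold))
    print(set_f1(["Dora", "Emil"], gold))
```

A set-based score gives partial credit when a model recovers some but not all relatives, while exact match suits questions with a single answer. In this toy case the prediction misses one of three grandparents, giving precision 1, recall 2/3, and F1 of 0.8.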
Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
Authors: Yahya Masri, Emily Ma, Zifu Wang, Joseph Rogers, Chaowei Yang
First: 2026-01-12T18:02:33+00:00 · Latest: 2026-01-12T18:02:33+00:00
Comments: 28 pages, 5 figures, 7 tables
Abstract
System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B improves from 20.25% under few-shot prompting to 85.28% with RAG. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy despite weak performance without retrieval. In contrast, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG. Efficiency measurements further separate models: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log while achieving <10% accuracy. These findings suggest that (1) architectural design, (2) training objectives, and (3) the ability to integrate retrieved context under strict output constraints jointly determine performance. By emphasizing small, deployable models, this benchmark aligns with real-time requirements of digital twin (DT) systems and shows that severity classification serves as a lens for evaluating model competence and real-time deployability, with implications for root cause analysis (RCA) and broader DT integration.
中文标题/摘要
标题:小型语言模型和小型推理语言模型在系统日志严重性分类中的基准测试
系统日志对于监控和诊断现代计算基础设施至关重要,但其规模和复杂性需要可靠的高效自动化解释。由于严重性级别是系统日志消息中的预定义元数据,因此仅让模型对其进行分类提供的独立实用价值有限,无法揭示其解释系统日志的潜在能力。我们认为,将严重性分类视为测试运行时日志理解能力的基准,而不是作为最终任务,更具信息性。使用来自Linux生产服务器的实际journalctl数据,我们评估了九种小型语言模型(SLMs)和小型推理语言模型(SRLMs)在零样本、少样本和检索增强生成(RAG)提示下的表现。结果表明存在明显的分层。Qwen3-4B在RAG下达到95.64%的最高准确率,而Gemma3-1B在少样本提示下从20.25%提高到RAG下的85.28%。值得注意的是,尽管在没有检索的情况下表现较弱,但Qwen3-0.6B仍达到88.12%的准确率。相比之下,包括Qwen3-1.7B和DeepSeek-R1-Distill-Qwen-1.5B在内的几种SRLMs在与RAG配对时表现大幅下降。效率测量进一步区分了模型:大多数Gemma和Llama变体每条日志的推理时间少于1.2秒,而Phi-4-Mini-Reasoning每条日志超过228秒,准确率低于10%。这些发现表明,(1)架构设计,(2)训练目标,以及(3)在严格输出约束下整合检索上下文的能力共同决定了性能。通过强调小型、可部署的模型,该基准与数字孪生(DT)系统的实时要求相一致,并表明严重性分类作为评估模型能力和实时部署性的镜像,具有对根本原因分析(RCA)和更广泛DT集成的含义。
Summary / 总结
The study evaluates small language models (SLMs) and small reasoning language models (SRLMs) on system log severity classification, using real-world journalctl data from Linux servers. Models were tested under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. Key findings include strong performance stratification, with Qwen3-4B achieving 95.64% accuracy with RAG, and Qwen3-0.6B reaching 88.12% despite weak performance without retrieval. Efficiency measurements show that most Gemma and Llama variants complete inference quickly, while Phi-4-Mini-Reasoning is much slower and less accurate. These results highlight the importance of architectural design, training objectives, and the ability to integrate retrieved context for effective severity classification.
该研究评估了小语言模型(SLMs)和小推理语言模型(SRLMs)在系统日志严重性分类任务上的表现,将其作为运行时日志理解的基准。九个模型分别在零样本、少样本和检索增强生成(RAG)提示下进行了测试。Qwen3-4B在RAG提示下达到了最高的95.64%准确率,而Gemma3-1B从少样本提示下的20.25%提高到了RAG下的85.28%。尽管在没有检索时表现较弱,Qwen3-0.6B仍达到了88.12%的准确率。几个SRLMs,包括Qwen3-1.7B和DeepSeek-R1-Distill-Qwen-1.5B,在与RAG结合时性能大幅下降。效率测试显示,大多数Gemma和Llama变体每条日志的推理时间在1.2秒以内,而Phi-4-Mini-Reasoning则超过了228秒且准确率低于10%。这些结果表明,架构设计、训练目标以及在严格输出约束下整合检索上下文的能力共同决定了模型性能。
Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
Authors: Wei Fang, James Glass
First: 2026-01-12T17:58:39+00:00 · Latest: 2026-01-12T17:58:39+00:00
Abstract
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
中文标题/摘要
标题:超越单轮:通过查询规划实现多步工具检索
在操作大规模动态工具库的LLM代理依赖于有效的检索,但标准的一轮密集检索器在处理复杂请求时存在困难。这些失败主要源于抽象用户目标与技术文档之间的脱节,以及固定大小嵌入的有限能力来建模工具组合。为了解决这些挑战,我们提出TOOLQP,这是一种轻量级框架,将检索建模为迭代查询规划。TOOLQP 不是进行一轮匹配,而是将指令分解为子任务,并动态生成查询与检索器交互,通过针对组合所需的特定子任务来有效弥合语义差距。我们使用合成查询轨迹训练TOOLQP,然后通过可验证奖励的强化学习(RLVR)进行优化。实验表明,TOOLQP 达到了最先进的性能,展示了出色的零样本泛化能力、跨不同检索器的鲁棒性以及下游代理执行的显著改进。
Summary / 总结
The research addresses the limitations of single-shot dense retrievers in handling complex requests for LLM agents operating over large, dynamic tool libraries. It introduces TOOLQP, a framework that models retrieval as iterative query planning, decomposing instructions into sub-tasks and dynamically generating queries. Experiments show that TOOLQP outperforms existing methods in zero-shot generalization and robustness across different retrievers, and enhances downstream agentic execution.
研究针对大规模动态工具库中LLM代理处理复杂请求时单次检索方法的局限性。提出了一种名为TOOLQP的框架,将检索建模为迭代查询规划,将指令分解为子任务并动态生成查询。实验结果表明,TOOLQP在零样本泛化和不同检索器下的鲁棒性方面优于现有方法,并提升了下游代理执行效果。
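The contrast between single-shot matching and iterative query planning in the TOOLQP abstract can be sketched with a toy example. The tool catalogue, the token-overlap retriever, and the rule-based splitter below are all hypothetical stand-ins: in the paper the planner is a trained model (synthetic trajectories followed by RLVR) and the retriever is a dense retriever.

```python
import re

# Hypothetical tool library: name -> documentation string.
TOOLS = {
    "weather_lookup": "get current weather forecast for a city",
    "flight_search": "search flights between two airports",
    "currency_convert": "convert an amount between currencies",
}


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, k: int = 1) -> list:
    """Toy retriever: rank tools by token overlap with the query."""
    q = tokenize(query)
    ranked = sorted(TOOLS, key=lambda name: (-len(q & tokenize(TOOLS[name])), name))
    return [name for name in ranked[:k] if q & tokenize(TOOLS[name])]


def plan_subtasks(instruction: str) -> list:
    """Stand-in planner: split the instruction on simple connectives."""
    parts = re.split(r"\b(?:and then|then|and)\b", instruction)
    return [p.strip(" ,.") for p in parts if p.strip(" ,.")]


def multi_step_retrieve(instruction: str) -> list:
    """Issue one query per sub-task and collect the distinct tools found."""
    selected = []
    for subtask in plan_subtasks(instruction):
        for tool in retrieve(subtask):
            if tool not in selected:
                selected.append(tool)
    return selected


if __name__ == "__main__":
    request = ("Search flights to Tokyo and convert 200 dollars between "
               "currencies, then get the weather forecast")
    print("single-shot:", retrieve(request))
    print("multi-step: ", multi_step_retrieve(request))
```

With `k=1`, the single-shot call can return only one tool for a request that needs three, whereas querying per sub-task recovers all of them. That is the compositional gap the abstract attributes to fixed-size embeddings, shown here in a deliberately simplified form.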
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
First: 2026-01-12T17:55:51+00:00 · Latest: 2026-01-12T17:55:51+00:00
Comments: 31 pages, 11 figures, 12 tables
Abstract
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
中文标题/摘要
标题:OS-Symphony:一种全面的鲁棒且通用的计算机使用代理框架
尽管视觉语言模型(VLMs)显著推进了计算机使用代理(CUAs)的发展,但当前框架在长时序工作流程中的鲁棒性和新领域中的泛化能力方面存在局限。这些局限源于对历史视觉上下文编纂缺乏细粒度控制以及缺乏视觉感知的教程检索。为解决这些问题,我们提出了OS-Symphony,一种全面的框架,该框架包括一个协调两个关键创新的协调器,以实现鲁棒自动化:(1)一个反思记忆代理,利用里程碑驱动的长期记忆来实现轨迹级自我纠正,有效缓解长时序任务中的视觉上下文丢失问题;(2)多功能工具代理,配备多模态搜索器,采用“看做-行动”(SeeAct)范式在基于浏览器的沙箱中导航以合成实时、视觉对齐的教程,从而解决未见过场景中的保真度问题。实验结果表明,OS-Symphony在不同模型规模下实现了显著的性能提升,在三个在线基准测试中建立了新的最先进结果,特别是在OSWorld上达到65.84%。
Summary / 总结
The paper addresses the limitations of current Vision-Language Models (VLMs) in Computer-Using Agents (CUAs) by introducing OS-Symphony, a holistic framework. It includes an Orchestrator that manages a Reflection-Memory Agent and Versatile Tool Agents. The Reflection-Memory Agent uses milestone-driven long-term memory for trajectory-level self-correction, while the Versatile Tool Agents employ a Multimodal Searcher to create live, visually aligned tutorials in a browser-based sandbox. The framework significantly improves performance across different model scales and sets new state-of-the-art results on three online benchmarks, achieving 65.84% on OSWorld.
研究针对当前视觉-语言模型(VLMs)在计算机使用代理(CUAs)中的局限性,引入了OS-Symphony这一整体框架。该框架包括一个协调反射记忆代理和多功能工具代理的协调器。反射记忆代理使用长期记忆进行轨迹级自我纠正,而多功能工具代理通过SeeAct范式生成实时、视觉对齐的教程。该框架显著提高了不同模型规模下的性能,并在三个在线基准测试中建立了新的最先进结果,特别是在OSWorld上达到了65.84%。
Towards Mitigating Excessive Forgetting in LLM Unlearning via Entanglement-Guidance with Proxy Constraint
Authors: Zhihao Liu, Jian Lou, Yuke Hu, Xiaochen Li, Yitian Chen, Tailun Chen, Zhizhen Qin, Kui Ren, Zhan Qin
First: 2025-08-28T05:45:40+00:00 · Latest: 2026-01-12T17:50:04+00:00
Abstract
Large language models (LLMs) are trained on massive datasets that may include private or copyrighted content. Due to growing privacy and ownership concerns, data owners may request the removal of their data from trained models. Machine unlearning provides a practical solution by removing the influence of specific data without full retraining. However, most existing methods still suffer from over-unlearning due to the lack of a principled mechanism to regulate the forgetting boundary, leading to unnecessary utility degradation and heightened privacy and robustness risks. In this work, we propose EGUP (Entanglement-Guided Unlearning with Proxy Constraint), a novel framework that leverages entanglement and proxy constraint to guide the unlearning process while mitigating over-unlearning. Within each iteration, EGUP employs inter-sample entanglement to adaptively reweight the unlearning strength, assigning greater unlearning efforts to forget samples that are semantically closer to retained knowledge. Across iterations, EGUP leverages intra-sample entanglement to track the representation shift of each forget sample and dynamically adjust its unlearning effort. In addition, we incorporate a proxy constraint that approximates the model's expected outputs after unlearning, forming a reference boundary that softly regularizes the unlearning process. EGUP is compatible with existing gradient-based objectives and serves as a plug-and-play enhancement. We evaluate EGUP on the TOFU and MUSE benchmarks, demonstrating consistent improvements in the unlearning-utility trade-off across multiple LLMs. Moreover, EGUP achieves performance close to the retrained model while remaining scalable and robust.
中文标题/摘要
标题:通过带代理约束的纠缠引导缓解LLM机器遗忘中的过度遗忘
大型语言模型(LLMs)在大规模数据集上进行训练,这些数据集可能包含隐私或受版权保护的内容。由于隐私和所有权问题日益突出,数据所有者可能会要求从已训练的模型中移除其数据。机器遗忘(machine unlearning)提供了一种实用的解决方案,无需完全重新训练即可消除特定数据的影响。然而,大多数现有方法由于缺乏调节遗忘边界的原则性机制,仍然存在过度遗忘的问题,导致不必要的效用下降以及更高的隐私和鲁棒性风险。在本研究中,我们提出了EGUP(带代理约束的纠缠引导遗忘),这是一种新颖的框架,利用纠缠和代理约束来引导遗忘过程并减轻过度遗忘。在每次迭代内,EGUP 使用样本间纠缠自适应地重新加权遗忘强度,为语义上更接近保留知识的待遗忘样本分配更大的遗忘力度。在跨迭代层面,EGUP 利用样本内纠缠跟踪每个待遗忘样本的表示变化,并动态调整其遗忘力度。此外,我们引入了一个代理约束,它近似遗忘后模型的预期输出,形成一个对遗忘过程进行软正则化的参考边界。EGUP 与现有的基于梯度的目标兼容,可作为即插即用的增强模块。我们在TOFU和MUSE基准上评估EGUP,在多个LLM上展示了遗忘-效用权衡的一致改进。此外,EGUP 达到了与重新训练模型相近的性能,同时保持了可扩展性和鲁棒性。
Summary / 总结
This work addresses the issue of excessive forgetting in machine unlearning of large language models (LLMs) by proposing EGUP, a framework that uses entanglement and proxy constraint to guide the unlearning process. EGUP reweights unlearning strength based on inter-sample entanglement and adjusts unlearning efforts dynamically using intra-sample entanglement. It also incorporates a proxy constraint to approximate expected outputs after unlearning, providing a reference boundary for the unlearning process. Experiments on TOFU and MUSE benchmarks show consistent improvements in the unlearning-utility trade-off and performance close to retrained models.
本文提出了一种名为EGUP(Entanglement-Guided Unlearning with Proxy Constraint)的方法,以解决大型语言模型(LLMs)机器遗忘过程中的过度遗忘问题。该方法利用纠缠和代理约束来引导遗忘过程:基于样本间纠缠重新加权遗忘强度,基于样本内纠缠跟踪表示变化并动态调整遗忘力度,并通过代理约束近似遗忘后的预期输出以提供参考边界。在TOFU和MUSE基准上的实验表明,EGUP在多个LLM上一致地改善了遗忘-效用权衡,性能接近重新训练的模型,同时保持可扩展性和鲁棒性。
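The within-iteration reweighting step described in the EGUP abstract (greater unlearning effort for forget samples that are semantically closer to retained knowledge) could look roughly like the sketch below. The entanglement measure (maximum cosine similarity to any retain embedding), the softmax normalization, and the `temperature` parameter are assumptions chosen for illustration; the abstract does not define how entanglement is computed.

```python
import math


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def entanglement(forget_vec, retain_vecs) -> float:
    """Assumed proxy: closeness of a forget sample to its nearest retain sample."""
    return max(cosine(forget_vec, r) for r in retain_vecs)


def unlearning_weights(forget_vecs, retain_vecs, temperature: float = 1.0):
    """Softmax over entanglement scores, rescaled so the weights average to 1."""
    scores = [entanglement(f, retain_vecs) / temperature for f in forget_vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    n = len(forget_vecs)
    return [n * e / total for e in exps]


if __name__ == "__main__":
    retain = [[1.0, 0.0]]
    forget = [[1.0, 0.0], [0.0, 1.0]]  # first sample overlaps retained knowledge
    print(unlearning_weights(forget, retain))
```

Each weight would multiply the corresponding forget sample's term in a gradient-based unlearning loss. Rescaling to mean 1 keeps the overall loss magnitude comparable to the unweighted objective, which is one way to stay plug-and-play with existing objectives as the abstract describes.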
Accelerating Discrete Facility Layout Optimization: A Hybrid CDCL and CP-SAT Architecture
Authors: Joshua Gibson, Kapil Dhakal
First: 2025-12-19T20:03:37+00:00 · Latest: 2026-01-12T17:46:55+00:00
Abstract
Discrete facility layout design involves placing physical entities to minimize handling costs while adhering to strict safety and spatial constraints. This combinatorial problem is typically addressed using Mixed Integer Linear Programming (MILP) or Constraint Programming (CP), though these methods often face scalability challenges as constraint density increases. This study systematically evaluates the potential of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as an alternative computational engine for discrete layout problems. Using a unified benchmarking harness, we conducted a controlled comparison of CDCL, CP-SAT, and MILP across varying grid sizes and constraint densities. Experimental results reveal a distinct performance dichotomy: while CDCL struggles with optimization objectives due to cost-blind branching, it demonstrates unrivaled dominance in feasibility detection, solving highly constrained instances orders of magnitude faster than competing paradigms. Leveraging this finding, we developed a novel "Warm-Start" hybrid architecture that utilizes CDCL to rapidly generate valid feasibility hints, which are then injected into a CP-SAT optimizer. Our results confirm that this layered approach successfully accelerates exact optimization, using SAT-driven pruning to bridge the gap between rapid satisfiability and proven optimality.
中文标题/摘要
标题:加速离散设施布局优化:混合CDCL和CP-SAT架构
离散设施布局设计涉及将物理实体放置以最小化处理成本,同时遵守严格的安全和空间约束。这是一个组合问题,通常使用混合整数线性规划(MILP)或约束编程(CP)来解决,但这些方法在约束密度增加时往往面临可扩展性挑战。本研究系统评估了冲突驱动的子句学习(CDCL)与VSIDS启发式算法作为离散布局问题替代计算引擎的潜力。使用统一的基准测试框架,我们在不同的网格大小和约束密度下对CDCL、CP-SAT和MILP进行了受控比较。实验结果揭示了一种性能二分法:虽然CDCL由于成本盲目的分支在优化目标上表现不佳,但在可行性检测方面却表现出无与伦比的主导地位,比竞争范式快了几个数量级。利用这一发现,我们开发了一种新颖的“预热启动”混合架构,利用CDCL快速生成有效的可行性提示,然后将其注入CP-SAT优化器。我们的结果证实,这种分层方法成功地加速了精确优化,利用SAT驱动的剪枝来弥合快速满足性和证明最优性之间的差距。
Summary / 总结
This study addresses the scalability challenges of discrete facility layout optimization by evaluating the potential of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics. Through a controlled comparison with Constraint Programming (CP) and Mixed Integer Linear Programming (MILP), the research demonstrates that while CDCL is less effective for optimization objectives, it excels in feasibility detection. A novel hybrid architecture combining CDCL for feasibility hints with CP-SAT for optimization was developed, significantly accelerating exact optimization for highly constrained instances.
该研究通过评估冲突驱动的子句学习(CDCL)与VSIDS启发式算法在离散设施布局优化中的潜力,应对该问题的可扩展性挑战。通过与约束编程(CP)和混合整数线性规划(MILP)的对照实验,研究发现虽然CDCL在优化目标方面表现较弱,但在可行性检测方面表现出色。提出了一种结合CDCL生成可行性提示与CP-SAT进行优化的新型混合架构,显著加速了精确优化过程。
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2025-10-29T08:21:59+00:00 · Latest: 2026-01-12T17:46:52+00:00
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025
Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
中文标题/摘要
标题:LangHOPS:基于语言的开放词汇层次部件分割
我们提出了LangHOPS,这是第一个基于多模态大型语言模型(MLLM)的开放词汇对象部件实例分割框架。给定一张图像,LangHOPS 可以从开放词汇候选类别中联合检测和分割层次化对象和部件实例。与依赖启发式或可学习视觉分组的先前方法不同,我们的方法将对象部件层次结构扎根于语言空间。它将 MLLM 集成到对象部件解析管道中,利用其丰富的知识和推理能力,并在层次结构内链接多粒度概念。我们在多个具有挑战性的场景中评估了 LangHOPS,包括领域内和跨数据集对象部件实例分割以及零样本语义分割。LangHOPS 达到了最先进的结果,在 PartImageNet 数据集上超越了先前方法 5.5% 的平均精度(AP)(领域内)和 4.8%(跨数据集),以及在 ADE20K 中未见过的对象部件上达到了 2.5% 的 mIOU(零样本)。消融研究进一步验证了语言扎根层次结构和 MLLM 驱动的部件查询精炼策略的有效性。代码将在此发布。
Summary / 总结
LangHOPS is a framework that uses a Multimodal Large Language Model to perform open-vocabulary object-part instance segmentation. It can detect and segment hierarchical object and part instances from a wide range of categories in an image. Unlike previous methods that use visual grouping, LangHOPS grounds object-part hierarchies in language space, integrating the MLLM into the parsing pipeline to leverage its knowledge and reasoning capabilities. LangHOPS outperforms previous methods by 5.5% AP in-domain and 4.8% AP cross-dataset on PartImageNet, and by 2.5% mIOU on unseen object parts in ADE20K for zero-shot segmentation. Ablation studies confirm the effectiveness of the language-grounded hierarchy and part query refinement strategy.
LangHOPS 是一种使用多模态大型语言模型进行开放词汇对象部件实例分割的框架。它可以检测和分割图像中各种类别的层次化对象和部件实例。与以往依赖视觉分组的方法不同,LangHOPS 将对象-部件层次结构建立在语言空间中,并将 MLLM 集成到解析管道中,以利用其知识和推理能力。LangHOPS 在 PartImageNet 上的领域内和跨数据集 AP 分别比之前的方法高出 5.5% 和 4.8%,在 ADE20K 上对未见过的对象部件的零样本分割 mIOU 高出 2.5%。消融研究进一步验证了基于语言的层次结构和部件查询精炼策略的有效性。
Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding
Authors: Yanxiang Huang, Guohua Gao, Zhaoyang Wei, Jianyuan Ni
Venue: ICME 2026
First: 2026-01-12T17:46:10+00:00 · Latest: 2026-01-12T17:46:10+00:00
Comments: 6 pages
Abstract
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
中文标题/摘要
标题:视频证据到推理:通过明确的证据关联实现高效视频理解
大型视觉-语言模型(LVLMs)在视频推理中面临一个根本性的困境:它们在冗长推理的高昂计算成本和高效但未关联方法的幻觉风险之间徘徊。为了解决这一问题,我们引入了证据链(CoE),这是一种新颖的框架,通过架构解耦和联合优化感知关联和推理效率。CoE 包含两个核心创新:(1)一种轻量级的证据关联模块(EGM),作为查询引导的过滤器,动态识别并提取一组紧凑的高质量视觉证据;(2)一种通过强化学习优化的证据锚定协议。关键的是,我们设计了一种复合奖励机制,强制模型在推理过程中严格参考已识别的时间锚点,从而减轻幻觉。为了实现这一点,我们构建了CoE-指令,这是一个大规模数据集(164,000个样本),包含一种新的双注释方案,用于分别监督感知和推理。在包括Video-MME、MVBench和VSI-Bench在内的五个基准上的广泛实验表明,增强后的CoE模型达到了新的最佳水平。它们在准确性上显著优于现有方法,证明CoE是一种强大且实用的框架,用于可靠的视频理解。
Summary / 总结
The paper addresses the challenge of efficient video understanding by introducing the Chain of Evidence (CoE) framework, which decouples perceptual grounding and reasoning efficiency. CoE includes a lightweight Evidence Grounding Module (EGM) that filters visual evidence and an Evidence-Anchoring Protocol optimized via Reinforcement Learning. The composite reward mechanism ensures the model references temporal anchors, reducing hallucinations. Experiments on five benchmarks show that CoE-enhanced models outperform existing methods in accuracy, establishing a new state-of-the-art.
本文提出了一种名为Chain of Evidence (CoE)的框架,以解决高效视频理解的挑战,该框架将感知接地和推理效率分离。CoE 包含一个轻量级的Evidence Grounding Module (EGM) 和一个通过强化学习优化的Evidence-Anchoring Protocol。该框架使用复合奖励机制确保模型严格参考时间锚点,减少幻觉。在五个基准上的实验表明,CoE增强的模型达到了最先进的准确率,优于现有方法。
Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning
Authors: Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li
First: 2026-01-12T17:45:31+00:00 · Latest: 2026-01-12T17:45:31+00:00
Abstract
Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor's algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, an RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.
中文标题/摘要
标题:Free-RBF-KAN:具有自适应径向基函数的柯尔莫哥洛夫-阿诺尔德网络,用于高效函数学习
柯尔莫哥洛夫-阿诺尔德网络(KANs)在高效逼近复杂非线性函数方面显示出强大的潜力。然而,原始的KAN公式依赖于B样条基函数,由于De Boor算法,这会导致大量的计算开销。为了解决这一限制,最近的工作探索了替代基函数,如径向基函数(RBFs),以提高计算效率和灵活性。然而,标准的RBF-KAN通常在准确度上不如原始的KAN设计。在本文中,我们提出了一种基于RBF的KAN架构——Free-RBF-KAN,该架构结合了自适应学习网格和可训练的平滑度,以弥补这一性能差距。我们的方法使用可自由学习的RBF形状,动态地使网格表示与激活模式对齐,从而实现具有表达性和适应性的函数逼近。此外,我们将平滑度视为与网络权重联合优化的核参数,而不增加计算复杂度。我们为RBF-KAN提供了一般性通用性证明,涵盖了我们的Free-RBF-KAN公式。通过一系列广泛的实验,包括多尺度函数逼近、基于物理的机器学习和PDE解算器学习,Free-RBF-KAN在准确度上与基于B样条的原始KAN相当,同时提供更快的训练和推理速度。这些结果突显了Free-RBF-KAN在计算效率和自适应分辨率之间的平衡,特别是在高维结构化建模任务中具有吸引力。
Summary / 总结
Free-RBF-KAN is a novel RBF-based Kolmogorov-Arnold Network that uses adaptive learning grids and trainable smoothness to improve computational efficiency and accuracy. It dynamically adjusts RBF shapes to align with activation patterns, enabling efficient and flexible function approximation. Experimental results show that Free-RBF-KAN achieves comparable accuracy to the original B-spline-based KAN while offering faster training and inference, especially for high-dimensional structured modeling tasks.
Free-RBF-KAN 是一种基于 RBF 的 Kolmogorov-Arnold 网络,通过自适应学习网格和可训练的平滑度来提高计算效率和准确性。它动态调整 RBF 形状以与激活模式对齐,提供比传统 KAN 更快的训练和推理速度。实验表明,Free-RBF-KAN 在准确度上与基于 B-spline 的 KAN 相当,但在高维结构化建模任务中更为高效。
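The idea of learning RBF centers and widths jointly with the weights can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the Gaussian kernel, the names `rbf_edge` and `sgd_step`, and the plain squared-error update are not taken from the paper, which does not specify its kernel or training loop in the abstract.

```python
import math

def rbf_edge(x, centers, widths, coeffs):
    """One edge function: phi(x) = sum_k c_k * exp(-((x - mu_k) / s_k)^2).
    The Gaussian form is an assumed kernel, chosen only for illustration."""
    return sum(
        c * math.exp(-((x - mu) / s) ** 2)
        for mu, s, c in zip(centers, widths, coeffs)
    )

def sgd_step(x, target, centers, widths, coeffs, lr=0.05):
    """One gradient step on 0.5 * (phi(x) - target)^2 that updates the
    coefficients, the centers (the "grid") and the widths (the "smoothness")
    together. Returns the loss measured before the update."""
    err = rbf_edge(x, centers, widths, coeffs) - target
    for k in range(len(centers)):
        z = (x - centers[k]) / widths[k]
        g = math.exp(-z * z)
        # Partial derivatives of the loss with respect to c_k, mu_k and s_k.
        d_c = err * g
        d_mu = err * coeffs[k] * g * 2.0 * z / widths[k]
        d_s = err * coeffs[k] * g * 2.0 * z * z / widths[k]
        coeffs[k] -= lr * d_c
        centers[k] -= lr * d_mu
        widths[k] = max(1e-3, widths[k] - lr * d_s)  # keep widths positive
    return 0.5 * err * err

centers, widths, coeffs = [-1.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
first = sgd_step(0.5, 1.0, centers, widths, coeffs)
second = sgd_step(0.5, 1.0, centers, widths, coeffs)
```

Because the centers and widths are ordinary parameters in the same update, adapting them adds no extra pass over the data, which is one plausible reading of the abstract's "without increasing computational complexity".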
MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models
Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller
First: 2025-12-24T15:15:18+00:00 · Latest: 2026-01-12T17:32:16+00:00
Abstract
Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.
中文标题/摘要
标题:MiST:理解中期科学训练在开发化学推理模型中的作用
大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而,最近的研究揭示了一个关键限制:强化学习仅在基础模型已经对正确答案赋予不可忽略的概率时才能成功——我们称这一特性为“潜在可解性”。本研究探讨了化学推理能力的出现及其先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件:1)符号能力,2)潜在的化学知识。我们提出了中期科学训练(MiST):一系列中期训练技术以满足这些条件,包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上的继续预训练以及在10亿个标记上的监督微调。这些步骤将3B和7B模型的潜在可解性得分提高至最高1.8倍,并使强化学习在有机反应命名中的top-1准确率从10.9%提升至63.9%,在无机材料生成中的top-1准确率从40.6%提升至67.4%。对于其他具有挑战性的化学任务,也观察到了类似的结果,同时产生了可解释的推理痕迹。我们的结果定义了化学推理训练的明确先决条件,并突显了中期训练在解锁推理能力中的更广泛作用。
Summary / 总结
This study investigates the role of mid-stage scientific training (MiST) in developing chemical reasoning capabilities in large language models. It identifies two prerequisites: symbolic competence and latent chemical knowledge. The proposed MiST techniques, including data-mixing and pre-processing, significantly improve the models' latent solvability, enabling reinforcement learning to achieve top-1 accuracy of 63.9% in organic reaction naming and 67.4% in inorganic material generation, compared to 10.9% and 40.6% respectively without MiST.
本研究探讨了中期科学训练(MiST)在开发化学推理能力中的作用。它确定了两个先决条件:符号能力与潜在的化学知识。通过数据混合、持续预训练和监督微调等MiST技术,显著提高了模型的潜在可解性,使有机反应命名和无机材料生成任务的准确率分别从10.9%提升至63.9%和从40.6%提升至67.4%。
Cost-Awareness in Tree-Search LLM Planning: A Systematic Study
Authors: Zihao Zhang, Hui Wei, Kenan Jiang, Shijia Pan, Shu Kai, Fei Liu
First: 2025-05-20T17:43:33+00:00 · Latest: 2026-01-12T17:29:58+00:00
Abstract
Planning under resource constraints is central to real-world decision making, yet most large language model (LLM) planners assume uniform action costs. We systematically analyze whether tree-search LLM planners are cost-aware and whether they efficiently generate budget-feasible plans. In contrast to black-box prompting, explicit search trees expose intermediate decisions, node evaluations, and failure modes, which allows for controlled ablations of planner behavior. We study depth-first search, breadth-first search, Monte Carlo Tree Search, and bidirectional search within a unified framework. Our experiments show that existing tree-based LLM planners often struggle to find cost-optimal plans, and that additional search computation does not reliably improve optimality. Among the methods evaluated, bidirectional search achieves the best overall efficiency and success rate. MCTS achieves the highest optimality on short-horizon tasks. Tree-search planners are especially valuable for studying LLM planning because their reasoning steps are explicit, in contrast to plain LLMs that internalize planning dynamics through post-training trajectories. Our findings suggest that improving LLM planning under resource constraints will likely require new search algorithms, rather than solely scaling inference-time compute.
中文标题/摘要
标题:树搜索LLM规划中的成本意识:一项系统研究
在资源约束下的规划是现实世界决策的核心,然而大多数大型语言模型(LLM)规划者假设动作成本均匀。我们系统地分析了树搜索LLM规划者是否具有成本意识以及它们是否能高效地生成预算可行的计划。与黑盒提示不同,显式的搜索树暴露了中间决策、节点评估和失败模式,这使得可以对规划者行为进行控制性的消融分析。我们在一个统一框架内研究了深度优先搜索、广度优先搜索、蒙特卡洛树搜索和双向搜索。实验结果显示,现有的基于树的LLM规划者往往难以找到成本最优的计划,而额外的搜索计算并不能可靠地提高最优性。在评估的方法中,双向搜索在整体效率和成功率方面表现最佳。蒙特卡洛树搜索在短期任务中达到最高的最优性。树搜索规划者特别适用于研究LLM规划,因为它们的推理步骤是明确的,与通过后训练轨迹内部化规划动力学的普通LLM不同。我们的研究结果表明,改进在资源约束下的LLM规划可能需要新的搜索算法,而不仅仅是扩展推理时的计算量。
Summary / 总结
This study investigates the cost-awareness of tree-search large language model (LLM) planners by analyzing depth-first, breadth-first, Monte Carlo Tree Search, and bidirectional search methods. The research finds that existing planners often fail to find cost-optimal plans, and additional search computation does not reliably improve optimality. Bidirectional search shows the best overall efficiency and success rate, while MCTS achieves the highest optimality on short-horizon tasks. The study highlights the need for new search algorithms to enhance LLM planning under resource constraints.
研究系统分析了树搜索大语言模型(LLM)规划器的成本意识。研究在统一框架内评估了深度优先搜索、广度优先搜索、蒙特卡洛树搜索和双向搜索,发现现有规划器往往难以找到成本最优的计划,且额外的搜索计算并不能可靠地提高最优性。双向搜索的整体效率和成功率最佳,蒙特卡洛树搜索在短时域任务上的最优性最高。研究强调,为了在资源约束下改进LLM规划,需要开发新的搜索算法。
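What "cost-aware" and "budget-feasible" mean for a tree search can be shown with a small sketch. This is not the paper's planner: the graph, the function name `cheapest_plan`, and the use of fixed edge costs in place of LLM-proposed actions are all assumptions made for illustration. It is a plain depth-first search with two prunes, one on the budget and one on the best complete plan found so far, and it assumes non-negative action costs.

```python
def cheapest_plan(graph, start, goal, budget):
    """Depth-first search over an explicit action graph.

    graph maps a state to a list of (next_state, action_cost) pairs.
    A partial plan is dropped as soon as it exceeds the budget or cannot
    beat the best complete plan found so far (valid for costs >= 0).
    Returns (total_cost, path) or None if no plan fits the budget.
    """
    best = None
    stack = [(start, 0, [start])]
    while stack:
        node, cost, path = stack.pop()
        if best is not None and cost >= best[0]:
            continue  # cannot improve on the incumbent plan
        if node == goal:
            best = (cost, path)
            continue
        for nxt, step_cost in graph.get(node, []):
            new_cost = cost + step_cost
            if nxt not in path and new_cost <= budget:
                stack.append((nxt, new_cost, path + [nxt]))
    return best

# Hypothetical toy problem: three routes from A to D with costs 3, 5 and 6.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)]}
plan = cheapest_plan(graph, "A", "D", budget=10)
```

A planner that ignores costs would accept the first route it reaches; the two prunes are what make the search return the cheapest plan within the budget.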
PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials
Authors: Teddy Koker, Abhijeet Gangan, Mit Kotak, Jaime Marian, Tess Smidt
First: 2026-01-12T17:20:09+00:00 · Latest: 2026-01-12T17:20:09+00:00
Abstract
Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP (trained on Materials Project) by 55% on average across phonon thermodynamic properties and achieves state-of-the-art performance among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.
中文标题/摘要
标题:PFT:声子微调以优化机器学习原子势
许多材料性质依赖于势能面的高阶导数,然而,通过标准损失函数(基于能量、力和应力误差)训练的机器学习原子势(MLIPs)可能会在曲率上出现误差,从而影响振动性质的预测。我们引入了声子微调(PFT),它直接监督材料的二阶力常数,通过使MLIP的能量海森矩阵(Hessian)与从有限位移声子计算中得到的DFT计算力常数匹配。为了扩展到大型超晶胞,PFT随机采样海森矩阵的列,并使用单个海森-向量乘积计算损失。我们还使用简单的协同训练方案来整合上游数据,以减轻灾难性遗忘。在MDR声子基准测试中,PFT使Nequix MP(基于Materials Project训练)在各项声子热力学性质上平均提升55%,并实现了在Materials Project轨迹上训练的模型中的最佳性能。PFT还能够泛化以改进超出二阶导数的性质,提高依赖于势能三阶导数的热导率预测。
Summary / 总结
The research aims to reduce curvature errors in machine learned interatomic potentials (MLIPs), which degrade predictions of vibrational properties that depend on higher-order derivatives of the potential energy surface. The method, phonon fine-tuning (PFT), directly supervises second-order force constants by matching MLIP energy Hessians to DFT-computed force constants, sampling Hessian columns stochastically to scale to large supercells. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art performance among models trained on Materials Project trajectories. It also improves thermal conductivity predictions, which rely on third-order derivatives.
研究解决了机器学习原子势(MLIPs)在曲率预测上的误差问题,这会降低振动性质的预测精度。研究引入了声子微调(PFT)方法,直接监督二阶力常数,通过匹配MLIP能量海森矩阵与密度泛函理论(DFT)计算的力常数。PFT在MDR声子基准测试中将Nequix MP的性能平均提高了55%,并达到了基于Materials Project轨迹训练的模型中的最佳性能。此外,PFT还能够改进依赖于势能三阶导数的热导率预测。
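The step of sampling one Hessian column and scoring it with a single Hessian-vector product can be sketched on a toy system. This is an illustration under stated assumptions, not the authors' implementation: the harmonic-chain energy, the spring constants standing in for "model" and "DFT reference", and all function names are hypothetical, and the Hessian-vector product is approximated with a central finite difference of the gradient where an MLIP would more likely use automatic differentiation.

```python
import random

def grad_energy(x, k):
    """Gradient of a toy energy E = 0.5 * k * sum_i (x[i+1] - x[i])^2."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = k * (x[i + 1] - x[i])
        g[i] -= d
        g[i + 1] += d
    return g

def hessian_column(x, j, k, eps=1e-4):
    """Column j of the Hessian, i.e. the Hessian-vector product H @ e_j,
    approximated as (grad(x + eps*e_j) - grad(x - eps*e_j)) / (2*eps)."""
    plus, minus = list(x), list(x)
    plus[j] += eps
    minus[j] -= eps
    gp, gm = grad_energy(plus, k), grad_energy(minus, k)
    return [(a - b) / (2.0 * eps) for a, b in zip(gp, gm)]

def sampled_column_loss(x, k_model, k_ref, rng):
    """Draw one random column and return the mean squared difference
    between the model's and the reference's force constants in it."""
    j = rng.randrange(len(x))
    model = hessian_column(x, j, k_model)
    ref = hessian_column(x, j, k_ref)
    return sum((m - r) ** 2 for m, r in zip(model, ref)) / len(x)

rng = random.Random(0)
loss = sampled_column_loss([0.0, 0.1, 0.3], k_model=1.0, k_ref=2.0, rng=rng)
```

Each loss evaluation touches one column instead of the full N-by-N Hessian, which is why the cost stays manageable as the supercell grows.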
Evaluating the encoding competence of visual language models using uncommon actions
Authors: Chen Ling, Nai Ding
First: 2026-01-12T17:15:45+00:00 · Latest: 2026-01-12T17:15:45+00:00
Abstract
We propose the UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-commonsense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even a lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.
中文标题/摘要
标题:使用不常见动作评估视觉语言模型的编码能力
我们提出了UAIT(不常识动作图像-文本)数据集,这是一个新的评估基准,旨在测试视觉语言模型(VLMs)在不常识动作场景中的语义理解能力。与以往主要关注统计上占优势的常见视觉场景的数据集不同,UAIT 挑战模型处理语法合理但语义上违背常识的图像-文本配对。这些任务要求模型超越表面模式识别,展示出对执行者-受体关系和物理可行性深刻的理解。为了构建UAIT,我们设计了一种半自动过程,使用大型语言模型、少量提示工程和文本到图像生成来合成高质量的不常识图像-文本样本。每个样本都附带了一个精心设计的选择题,以测试模型在细粒度推理中的能力。我们评估了多个最先进的视觉语言模型,并将其与基于对比学习的模型进行了比较。实验表明,所有模型在语义判断上的表现都远逊于人类,尤其是在区分语法正确性与语义合理性方面。进一步的实验表明,即使是轻量级模型在微调后也能提高其准确性,这表明定向适应的巨大潜力。这项研究不仅揭示了视觉语言模型的关键弱点,还为开发具有真实视觉语义推理能力的稳健模型提供了诊断工具和研究方向。
Summary / 总结
The study introduces UAIT, a new dataset to evaluate the semantic understanding of visual language models in uncommon-sense action scenes. Unlike previous datasets, UAIT focuses on grammatically reasonable but semantically counter-intuitive image-text pairs, challenging models to understand agent-patient relationships and physical feasibility. Experiments show that state-of-the-art models perform poorly in distinguishing grammatical correctness from semantic rationality, and even lightweight models can improve with fine-tuning. This highlights the need for models to go beyond superficial pattern recognition and develop robust visual semantic reasoning capabilities.
研究引入了UAIT数据集,旨在评估视觉语言模型在不常见常识动作场景中的语义理解能力。不同于以往专注于常见视觉场景的数据集,UAIT通过提供语法规则正确但语义上不合常理的图像-文本配对,挑战模型对动作场景中主体-客体关系和物理可行性进行深层次理解的能力。实验结果显示,最先进的模型在区分语法规则正确性与语义合理性方面表现不佳,但即使是轻量级模型通过微调也能提高准确性,表明定向适应的巨大潜力。这项研究不仅揭示了视觉语言模型的关键弱点,还为开发具有真实视觉语义推理能力的稳健模型提供了诊断工具和研究方向。
FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning
Authors: Davide Domini, Gianluca Aguzzi, Lukas Esterle, Mirko Viroli
First: 2025-02-12T17:10:53+00:00 · Latest: 2026-01-12T17:08:29+00:00
Abstract
In recent years, federated learning (FL) has become a popular solution to train machine learning models in domains with high privacy concerns. However, FL scalability and performance face significant challenges in real-world deployments where data across devices are not independent and identically distributed (non-IID). The heterogeneity in data distribution frequently arises from the spatial distribution of devices, leading to degraded model performance in the absence of proper handling. Additionally, FL's typical reliance on centralized architectures introduces bottlenecks and single-point-of-failure risks, which are particularly problematic at scale or in dynamic environments. To close this gap, we propose Field-Based Federated Learning (FBFL), a novel approach leveraging macroprogramming and field coordination to address these limitations through: (i) distributed spatial-based leader election for personalization to mitigate non-IID data challenges; and (ii) construction of a self-organizing, hierarchical architecture using advanced macroprogramming patterns. Moreover, FBFL not only overcomes the aforementioned limitations, but also enables the development of more specialized models tailored to the specific data distribution in each subregion. This paper formalizes FBFL and evaluates it extensively using MNIST, FashionMNIST, and Extended MNIST datasets. We demonstrate that, when operating under IID data conditions, FBFL performs comparably to the widely-used FedAvg algorithm. Furthermore, in challenging non-IID scenarios, FBFL not only outperforms FedAvg but also surpasses other state-of-the-art methods, namely FedProx and Scaffold, which have been specifically designed to address non-IID data distributions. Additionally, we showcase the resilience of FBFL's self-organizing hierarchical architecture against server failures.
中文标题/摘要
标题:FBFL:一种针对联邦学习中数据异构性的场基协调方法
近年来,联邦学习(FL)已成为处理高隐私关注领域机器学习模型训练的一种流行解决方案。然而,在实际部署中,设备间数据非独立且非同分布(non-IID)导致FL的可扩展性和性能面临重大挑战。数据分布的异构性通常源于设备的空间分布,这在缺乏适当处理的情况下会降低模型性能。此外,FL通常依赖于集中式架构,这在大规模或动态环境中引入了瓶颈和单点故障风险。为解决这些问题,我们提出了一种新的方法——场基联邦学习(FBFL),该方法通过利用宏编程和场协调来解决这些限制:(i)分布式基于空间的领导者选举以减轻非IID数据挑战;(ii)使用先进的宏编程模式构建自组织、分层架构。此外,FBFL不仅克服了上述限制,还能够开发出更专门化的模型,以适应每个子区域的数据分布。本文形式化了FBFL,并使用MNIST、FashionMNIST和扩展MNIST数据集进行了广泛评估。我们证明,在IID数据条件下,FBFL的表现与广泛使用的FedAvg算法相当。在具有挑战性的非IID场景中,FBFL不仅优于FedAvg,还超越了专门设计用于解决非IID数据分布的FedProx和Scaffold等其他最先进的方法。此外,我们展示了FBFL自组织分层架构在服务器故障情况下的鲁棒性。
Summary / 总结
FBFL is a federated learning approach that addresses scalability and performance issues in non-IID data scenarios through distributed spatial-based leader election and a self-organizing hierarchical architecture. Extensive evaluations on the MNIST, FashionMNIST, and Extended MNIST datasets show that FBFL performs comparably to FedAvg under IID conditions and outperforms FedAvg, FedProx, and Scaffold in non-IID scenarios. The self-organizing hierarchy is also shown to be resilient to server failures.
FBFL 是一种新型的联邦学习方法,旨在解决非IID数据场景下的可扩展性和性能问题。它通过分布式空间领导者选举和自组织分层架构来缓解非IID数据挑战并减少集中化风险。实验结果显示,FBFL 在 IID 条件下与 FedAvg 性能相当,并在非IID场景中优于 FedAvg、FedProx 和 Scaffold 等其他先进方法。
Hiking in the Wild: A Scalable Perceptive Parkour Framework for Humanoids
Authors: Shaoting Zhu, Ziwen Zhuang, Mengjie Zhao, Kun-Ying Lee, Hang Zhao
First: 2026-01-12T16:50:50+00:00 · Latest: 2026-01-12T16:50:50+00:00
Comments: Project Page: https://project-instinct.github.io/hiking-in-the-wild
Abstract
Achieving robust humanoid hiking in complex, unstructured environments requires transitioning from reactive proprioception to proactive perception. However, integrating exteroception remains a significant challenge: mapping-based methods suffer from state estimation drift; for instance, LiDAR-based methods do not handle torso jitter well. Existing end-to-end approaches often struggle with scalability and training complexity; specifically, some previous works using virtual obstacles are implemented case-by-case. In this work, we present Hiking in the Wild, a scalable, end-to-end parkour perceptive framework designed for robust humanoid hiking. To ensure safety and training stability, we introduce two key mechanisms: a foothold safety mechanism combining scalable Terrain Edge Detection with Foot Volume Points to prevent catastrophic slippage on edges, and a Flat Patch Sampling strategy that mitigates reward hacking by generating feasible navigation targets. Our approach utilizes a single-stage reinforcement learning scheme, mapping raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Extensive field experiments on a full-size humanoid demonstrate that our policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced to facilitate reproducible research and deployment on real robots with minimal hardware modifications.
中文标题/摘要
标题:在野地远足:为类人机器人设计的可扩展感知极限运动框架
在复杂、未结构化的环境中实现稳健的人形远足需要从反应式本体感知过渡到主动感知。然而,整合外部感知仍然是一个重大挑战:基于地图的方法会遭受状态估计漂移;例如,基于LiDAR的方法无法很好地处理躯干抖动。现有的端到端方法通常在可扩展性和训练复杂性方面存在问题;具体来说,一些先前的工作使用虚拟障碍物是逐案实现的。在本文中,我们提出了“在野地远足”(Hiking in the Wild),这是一种为实现稳健的人形远足而设计的可扩展、端到端的感知极限运动框架。为了确保安全和训练稳定性,我们引入了两种关键机制:一种结合了可扩展的“地形边缘检测”与“足部体积点”的足部安全机制,以防止在边缘处发生灾难性滑动;以及一种“平坦区域采样”策略,通过生成可行的导航目标来缓解奖励作弊。我们的方法采用了一阶段强化学习方案,直接将原始深度输入和本体感知映射到关节动作,而不依赖于外部状态估计。在全尺寸人形机器人上的大量实地实验表明,我们的策略能够在2.5 m/s的速度下实现复杂地形的稳健穿越。训练和部署代码已开源,以促进可重复研究和在最少硬件修改的情况下部署到真实机器人。
Summary / 总结
This paper introduces Hiking in the Wild, a scalable end-to-end parkour perceptive framework for humanoid robots to navigate complex, unstructured environments. It addresses the challenges of integrating exteroception by incorporating a foothold safety mechanism and a flat patch sampling strategy. The framework utilizes a single-stage reinforcement learning scheme that maps raw depth inputs and proprioception directly to joint actions, without relying on external state estimation. Field experiments on a full-size humanoid show that the policy enables robust traversal of complex terrains at speeds up to 2.5 m/s. The training and deployment code is open-sourced.
本文介绍了Hiking in the Wild,一种用于人形机器人在复杂环境中徒步行走的可扩展端到端框架。该框架通过结合地形边缘检测和足部体积点来防止滑落,并使用平坦区域采样策略避免奖励作弊。该方法使用单阶段强化学习直接将深度输入和本体感受映射到关节动作,稳健穿越的速度最高可达每秒2.5米。该框架已开源,以促进在真实机器人上的可重复研究和部署。
Deep Whole-body Parkour
Authors: Ziwen Zhuang, Shaoting Zhu, Mengjie Zhao, Hang Zhao
First: 2026-01-12T16:33:16+00:00 · Latest: 2026-01-12T16:33:16+00:00
Abstract
Current approaches to humanoid control generally fall into two paradigms: perceptive locomotion, which handles terrain well but is limited to pedal gaits, and general motion tracking, which reproduces complex skills but ignores environmental capabilities. This work unites these paradigms to achieve perceptive general motion control. We present a framework where exteroceptive sensing is integrated into whole-body motion tracking, permitting a humanoid to perform highly dynamic, non-locomotion tasks on uneven terrain. By training a single policy to perform multiple distinct motions across varied terrestrial features, we demonstrate the non-trivial benefit of integrating perception into the control loop. Our results show that this framework enables robust, highly dynamic multi-contact motions, such as vaulting and dive-rolling, on unstructured terrain, significantly expanding the robot's traversability beyond simple walking or running. https://project-instinct.github.io/deep-whole-body-parkour
中文标题/摘要
标题:深度全身parkour
当前的人形控制方法通常分为两类:感知性移动,能够处理地形但仅限于足步态;以及通用运动跟踪,能够再现复杂技能但忽略环境能力。本研究将这两类方法结合起来,实现感知性通用运动控制。我们提出了一种框架,将外部感知整合到全身运动跟踪中,使一个人形机器人能够在不平坦的地形上执行高度动态且非移动任务。通过训练单一策略在不同地形特征上执行多种不同动作,我们展示了将感知整合到控制回路中的非平凡益处。我们的结果表明,这种框架能够实现稳健且高度动态的多接触动作,如翻越和俯冲滚翻,显著扩展了机器人的可穿越性,远超简单的行走或跑步。https://project-instinct.github.io/deep-whole-body-parkour
Summary / 总结
This work aims to combine perceptive locomotion and general motion tracking to achieve robust and dynamic whole-body control for humanoid robots. The method integrates exteroceptive sensing into whole-body motion tracking, allowing the robot to perform complex tasks like vaulting and dive-rolling on uneven terrain. Key findings show that this approach enables the robot to traverse unstructured environments with high dynamic motions, far beyond simple walking or running, demonstrating the non-trivial benefits of integrating perception into the control loop.
该研究旨在结合感知式移动和通用运动跟踪,实现人形机器人的稳健和动态全身控制。方法将外部感知整合到全身运动跟踪中,使机器人能够在不平坦地形上执行复杂的任务,如翻越和滚翻。关键发现表明,这种方法使机器人能够以高动态运动穿越未结构化的环境,远超简单的行走或跑步,展示了将感知整合到控制回路中的非平凡益处。
Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
Authors: Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai
First: 2026-01-12T16:26:42+00:00 · Latest: 2026-01-12T16:26:42+00:00
Abstract
Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
中文标题/摘要
标题:平滑操作员:平滑可验证奖励激活视觉语言模型的空间推理能力
视觉语言模型(VLMs)在实现精确的数值预测以理解3D场景方面面临关键瓶颈。传统的强化学习(RL)方法,主要基于相对排名,通常遭受严重的奖励稀疏性和梯度不稳定性,无法有效利用由3D物理约束提供的可验证信号。值得注意的是,在标准GRPO框架中,相对归一化导致“接近但未命中”的样本(特征为小但非零的误差)遭受优势坍塌。这导致优化过程中有价值的边界样本被丢弃,造成严重的数据利用瓶颈。为解决这一问题,我们引入了平滑数值奖励激活(SNRA)操作和绝对保持GRPO(AP-GRPO)框架。SNRA采用动态参数化的Sigmoid函数将原始反馈转换为密集的连续奖励连续体。同时,AP-GRPO整合绝对标量梯度以减轻传统相对排名机制固有的数值信息损失。通过这种方法,我们构建了包含50,000个可验证3D子任务的数据集Numerical3D-50k。实验证明,AP-GRPO在性能上与大规模监督方法相当,同时保持更高的数据效率,有效激活了VLMs中的潜在3D推理能力,无需进行架构修改。
Summary / 总结
The research aims to improve the precision of numerical predictions in 3D scene understanding for Vision-Language Models (VLMs) by addressing the issues of reward sparsity and gradient instability in traditional reinforcement learning methods. It introduces the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA transforms raw feedback into a dense, continuous reward, while AP-GRPO mitigates numerical information loss. These methods enable the construction of Numerical3D-50k, a dataset of 50,000 verifiable 3D subtasks, and demonstrate that AP-GRPO achieves performance comparable to large-scale supervised methods with higher data efficiency, enhancing 3D reasoning in VLMs without architectural changes.
研究通过引入Smooth Numerical Reward Activation (SNRA) 操作符和Absolute-Preserving GRPO (AP-GRPO) 框架来解决Vision-Language Models (VLMs) 在3D场景理解中的精确数值预测问题。SNRA将原始反馈转换为密集的连续奖励,而AP-GRPO整合绝对标量梯度以减少数值信息损失。这些方法应用于Numerical3D-50k数据集,结果显示AP-GRPO在性能上与大规模监督方法相当,但具有更高的数据效率,从而增强VLMs的3D推理能力。
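The advantage collapse described in the abstract, and how a smooth reward avoids it, can be reproduced in a few lines. This is a minimal sketch under assumptions: the abstract does not give the SNRA formula, so the sigmoid shape, the `scale` parameter, the tolerance, and the function names below are all hypothetical, and the absolute-gradient term of AP-GRPO is not modeled.

```python
import math
import statistics

def binary_reward(abs_error, tol=0.01):
    """Sparse verifiable reward: 1 only when the prediction is within tolerance."""
    return 1.0 if abs_error <= tol else 0.0

def smooth_reward(abs_error, scale=0.5):
    """Assumed sigmoid-shaped reward in (0, 1]: exactly 1 at zero error,
    decaying smoothly as the error grows."""
    return 2.0 / (1.0 + math.exp(abs_error / scale))

def group_advantages(rewards):
    """Group-relative normalization, (r - mean) / std, as used in GRPO-style
    training. When every reward in the group ties, all advantages are zero."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four sampled answers, all near misses of different quality.
errors = [0.05, 0.2, 0.8, 1.5]
sparse = group_advantages([binary_reward(e) for e in errors])
dense = group_advantages([smooth_reward(e) for e in errors])
```

With the binary reward every sample in the group scores zero, so the normalized advantages vanish and the group contributes no gradient. With the smooth reward the same group is ranked by error, so the closest answer receives the largest advantage.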
Leveraging 3D Representation Alignment and RGB Pretrained Priors for LiDAR Scene Generation
Authors: Nicolas Sereyjol-Garros, Ellington Kirby, Victor Besnier, Nermin Samet
First: 2026-01-12T16:20:20+00:00 · Latest: 2026-01-12T16:20:20+00:00
Abstract
LiDAR scene synthesis is an emerging solution to scarcity in 3D data for robotic tasks such as autonomous driving. Recent approaches employ diffusion or flow matching models to generate realistic scenes, but 3D data remains limited compared to RGB datasets with millions of samples. We introduce R3DPA, the first LiDAR scene generation method to unlock image-pretrained priors for LiDAR point clouds, and leverage self-supervised 3D representations for state-of-the-art results. Specifically, we (i) align intermediate features of our generative model with self-supervised 3D features, which substantially improves generation quality; (ii) transfer knowledge from large-scale image-pretrained generative models to LiDAR generation, mitigating limited LiDAR datasets; and (iii) enable point cloud control at inference for object inpainting and scene mixing with solely an unconditional model. On the KITTI-360 benchmark R3DPA achieves state of the art performance. Code and pretrained models are available at https://github.com/valeoai/R3DPA.
Summary / 总结
The research aims to address the scarcity of 3D data for robotic tasks by proposing R3DPA, a method that aligns intermediate features of a generative model with self-supervised 3D features and transfers knowledge from large-scale image-pretrained models to LiDAR scene generation. Key findings include improved generation quality, state-of-the-art performance on the KITTI-360 benchmark, and the ability to control point clouds for object inpainting and scene mixing with an unconditional model. Code and pretrained models are available at https://github.com/valeoai/R3DPA.
研究旨在通过提出R3DPA方法来解决机器人任务中3D数据稀缺的问题,该方法通过将生成模型的中间特征与自监督3D特征对齐,并从大规模图像预训练模型中转移知识到LiDAR场景生成。关键发现包括生成质量的提升、在KITTI-360基准上的最先进性能以及能够通过无条件模型控制点云进行物体修复和场景混合。代码和预训练模型可在https://github.com/valeoai/R3DPA获取。
Physics-Informed Singular-Value Learning for Cross-Covariances Forecasting in Financial Markets
Authors: Efstratios Manolakis, Christian Bongiorno, Rosario Nunzio Mantegna
First: 2026-01-12T16:18:08+00:00 · Latest: 2026-01-12T16:18:08+00:00
Abstract
A new wave of work on covariance cleaning and nonlinear shrinkage has delivered asymptotically optimal analytical solutions for large covariance matrices. Building on this progress, these ideas have been generalized to empirical cross-covariance matrices, whose singular-value shrinkage characterizes comovements between one set of assets and another. Existing analytical cross-covariance cleaners are derived under strong stationarity and large-sample assumptions, and they typically rely on mesoscopic regularity conditions such as bounded spectra; macroscopic common modes (e.g., a global market factor) violate these conditions. When applied to real equity returns, where dependence structures drift over time and global modes are prominent, we find that these theoretically optimal formulas do not translate into robust out-of-sample performance. We address this gap by designing a random-matrix-inspired neural architecture that operates in the empirical singular-vector basis and learns a nonlinear mapping from empirical singular values to their corresponding cleaned values. By construction, the network can recover the analytical solution as a special case, yet it remains flexible enough to adapt to non-stationary dynamics and mode-driven distortions. Trained on a long history of equity returns, the proposed method achieves a more favorable bias-variance trade-off than purely analytical cleaners and delivers systematically lower out-of-sample cross-covariance prediction errors. Our results demonstrate that combining random-matrix theory with machine learning makes asymptotic theories practically effective in realistic time-varying markets.
中文标题/摘要
标题:基于物理信息的奇异值学习在金融市场协方差预测中的应用
关于协方差清理和非线性收缩的新一波研究已经提供了大型协方差矩阵的渐近最优解析解。在此基础上,这些想法被推广到经验交叉协方差矩阵,其奇异值收缩表征了一组资产与另一组资产之间的共同变动。现有的解析交叉协方差清理器是在强平稳性和大样本假设下推导出来的,通常依赖于中间尺度的正则条件,如有界谱;宏观尺度的共同模式(例如,全球市场因子)违反这些条件。当应用于实际的股票回报时,其中依赖结构随时间漂移且全球模式显著,我们发现这些理论上最优的公式并未转化为稳健的外样本表现。我们通过设计一种基于随机矩阵的神经架构来解决这一差距,该架构在经验奇异向量基上操作,并学习从经验奇异值到相应清理值的非线性映射。通过构造,该网络可以恢复解析解作为特殊情况,但仍然足够灵活以适应非平稳动态和模式驱动的失真。在长期股票回报历史数据上训练,所提出的方法在偏倚-方差权衡上优于纯粹的解析清理器,并系统地降低了外样本交叉协方差预测误差。我们的结果表明,将随机矩阵理论与机器学习相结合,使渐近理论在现实的时变市场中实际有效。
Summary / 总结
The research aims to improve the out-of-sample performance of cross-covariance forecasting in financial markets by addressing the limitations of existing analytical cleaners, which often fail due to non-stationary dynamics and prominent global modes. The authors propose a physics-informed singular-value learning method using a neural network that operates in the empirical singular-vector basis, allowing for flexibility in adapting to time-varying dependence structures. This method outperforms purely analytical cleaners in terms of bias-variance trade-off and prediction accuracy.
研究旨在通过改进现有的分析解决方案来提高金融市场的跨协方差预测性能,这些解决方案在非平稳和模式驱动的环境中往往表现不佳。作者提出了一种基于奇异值学习的方法,利用神经网络从经验奇异值到清洁值学习非线性映射。该方法通过实现更好的偏差-方差权衡和减少外样本预测误差,优于纯粹的分析清洁器。
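The cleaning step described in the abstract above (keep the empirical singular vectors, change only the singular values) can be sketched with NumPy. This is a minimal illustration, not the authors' architecture: the function names, the synthetic data, and in particular the `soft_threshold` mapping are assumptions that stand in for the learned nonlinear mapping.

```python
import numpy as np

def clean_cross_covariance(X, Y, shrink):
    """Clean the empirical cross-covariance of X (T x n) and Y (T x m).

    `shrink` maps the empirical singular values to cleaned values;
    the singular vectors are left untouched.
    """
    T = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / T                                  # empirical cross-covariance (n x m)
    U, s, Vt = np.linalg.svd(C, full_matrices=False)   # empirical singular basis
    s_clean = shrink(s)                                # only the spectrum is modified
    return U @ np.diag(s_clean) @ Vt, s, s_clean

def soft_threshold(s, tau=0.05):
    # Hypothetical placeholder for the learned mapping (assumption, not from the paper).
    return np.maximum(s - tau, 0.0)

# Synthetic returns: Y partly co-moves with the first five columns of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
Y = 0.5 * X[:, :5] + rng.standard_normal((500, 5))
C_clean, s, s_clean = clean_cross_covariance(X, Y, soft_threshold)
```

Passing the identity mapping as `shrink` returns the raw empirical cross-covariance, which mirrors, in a loose sense, the idea that a flexible mapping can contain simpler cleaners as special cases.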
Self-Creating Random Walks for Decentralized Learning under Pac-Man Attacks
Authors: Xingran Chen, Parimal Parag, Rohit Bhagat, Salim El Rouayheb
First: 2026-01-12T16:00:21+00:00 · Latest: 2026-01-12T16:00:21+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2508.05663
Abstract
Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the "Pac-Man" attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the CREATE-IF-LATE (CIL) algorithm, which is a fully decentralized, resilient mechanism that enables self-creating RWs and prevents RW extinction in the presence of Pac-Man. Our theoretical analysis shows that the CIL algorithm guarantees several desirable properties, such as (i) non-extinction of the RW population, (ii) almost sure boundedness of the RW population, and (iii) convergence of RW-based stochastic gradient descent even in the presence of Pac-Man with a quantifiable deviation from the true optimum. Moreover, the learning process experiences at most a linear time delay due to Pac-Man interruptions and RW regeneration. Our extensive empirical results on both synthetic and public benchmark datasets validate our theoretical findings.
中文标题/摘要
标题:自我生成随机游走用于在Pac-Man攻击下的去中心化学习
基于随机游走(RW)的算法由于低开销和可扩展性,在分布式系统中长期流行,并且最近在去中心化学习中的应用也在增长。然而,它们依赖于局部交互,使其天生容易受到恶意行为的影响。在这项工作中,我们研究了一种我们称之为“Pac-Man”攻击的敌对威胁,在这种攻击中,恶意节点以概率方式终止访问它的任何RW。这种隐蔽的行为逐渐消除了网络中的活跃RW,有效地停止了学习过程,而不会触发故障警报。为了应对这一威胁,我们提出了CREATE-IF-LATE (CIL)算法,这是一种完全去中心化的、具有弹性的机制,能够在Pac-Man存在的情况下使RW自我生成并防止RW灭绝。我们的理论分析表明,CIL算法保证了几个期望的性质,如(i) RW种群的非灭绝,(ii) RW种群的几乎肯定有界,以及(iii) 即使在Pac-Man存在的情况下,基于RW的随机梯度下降也收敛,其偏差可以量化。此外,由于Pac-Man中断和RW再生,学习过程最多经历线性时间延迟。我们在合成数据集和公共基准数据集上的广泛实验证明了我们的理论发现。
Summary / 总结
The research addresses the vulnerability of random walk (RW) algorithms in decentralized learning to a stealthy attack called "Pac-Man", in which a malicious node probabilistically terminates any RW that visits it. The study proposes the CREATE-IF-LATE (CIL) algorithm, which guarantees non-extinction and almost sure boundedness of the RW population and preserves the convergence of RW-based stochastic gradient descent, with a quantifiable deviation from the true optimum. The theoretical analysis also shows that the learning process incurs at most a linear time delay due to interruptions and RW regeneration, and experiments on synthetic and public benchmark datasets validate these findings.
论文针对Pac-Man攻击(恶意节点终止访问它的随机游走)对基于随机游走(RW)算法的脆弱性进行了研究。作者提出了CIL算法,确保RW的非灭绝和基于RW的随机梯度下降的收敛性。理论分析和实验结果表明,CIL算法保证了RW的非灭绝、有界性和收敛性,且由于中断和RW再生导致的时间延迟最多为线性级。
MoE3D: A Mixture-of-Experts Module for 3D Reconstruction
Authors: Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park
First: 2026-01-08T18:33:52+00:00 · Latest: 2026-01-12T15:58:03+00:00
Abstract
We propose a simple yet effective approach to enhance the performance of feed-forward 3D reconstruction models. Existing methods often struggle near depth discontinuities, where standard regression losses encourage spatial averaging and thus blur sharp boundaries. To address this issue, we introduce a mixture-of-experts formulation that handles uncertainty at depth boundaries by combining multiple smooth depth predictions. A softmax weighting head dynamically selects among these hypotheses on a per-pixel basis. By integrating our mixture model into a pre-trained state-of-the-art 3D model, we achieve a substantial reduction of boundary artifacts and gains in overall reconstruction accuracy. Notably, our approach is highly compute efficient, delivering generalizable improvements even when fine-tuned on a small subset of training data while incurring only negligible additional inference computation, suggesting a promising direction for lightweight and accurate 3D reconstruction.
中文标题/摘要
标题:MoE3D:一种用于3D重建的混合专家模块
我们提出了一种简单而有效的方法,以提高前馈3D重建模型的性能。现有方法在深度不连续区域经常遇到困难,标准回归损失倾向于进行空间平均,从而模糊了锐利边界。为了解决这一问题,我们引入了一种混合专家公式,通过结合多个平滑的深度预测来处理深度边界处的不确定性。一个softmax加权头在每个像素基础上动态选择这些假设。通过将我们的混合模型集成到预训练的最新3D模型中,我们实现了边界伪影的显著减少和整体重建精度的提升。值得注意的是,我们的方法具有很高的计算效率,在仅对少量训练数据进行微调的情况下,仍能提供可泛化的改进,且仅产生微小的额外推理计算,这表明了轻量级且准确的3D重建的一个有前景的方向。
Summary / 总结
This paper proposes MoE3D, a mixture-of-experts module to improve 3D reconstruction models by addressing depth discontinuities. The method uses a softmax weighting head to dynamically select among multiple smooth depth predictions, reducing boundary artifacts and improving overall accuracy. It integrates seamlessly into pre-trained models and provides significant improvements with minimal additional computational cost.
该论文提出了MoE3D模块,通过解决深度不连续性问题来提升3D重建模型的性能。它使用softmax权重头在每个像素基础上动态选择多个平滑的深度预测,从而减少边界伪影并提高整体重建精度。该方法计算效率高,即使在有限的微调数据下也能实现显著的改进。
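The per-pixel softmax weighting described above can be illustrated with a small pure-Python sketch. It is not the MoE3D implementation: the two constant "expert" depth maps, the gate logits, and the function names are assumptions chosen to show why confident gating can keep a depth edge sharp where plain averaging would blur it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_depth(expert_depths, gate_logits):
    """Blend expert depth predictions per pixel.

    expert_depths[k][i] is the depth from expert k at pixel i;
    gate_logits[i][k] is the gating logit for expert k at pixel i.
    """
    mixed = []
    for i, logits in enumerate(gate_logits):
        weights = softmax(logits)
        mixed.append(sum(w * expert_depths[k][i] for k, w in enumerate(weights)))
    return mixed

# Hypothetical 1D scanline crossing a depth discontinuity.
foreground = [1.0, 1.0, 1.0, 1.0]   # smooth prediction of expert 0
background = [5.0, 5.0, 5.0, 5.0]   # smooth prediction of expert 1
gate_logits = [[10.0, -10.0], [10.0, -10.0], [-10.0, 10.0], [-10.0, 10.0]]
depth = mix_depth([foreground, background], gate_logits)
```

With confident logits the output jumps from about 1.0 to about 5.0 between the second and third pixel, while uniform logits would return the blurred average of the two experts.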
REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion
Authors: Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi
Venue: AAAI 2026
First: 2025-11-21T21:59:24+00:00 · Latest: 2026-01-12T15:56:26+00:00
Comments: 26 pages; Accepted to AAAI 2026; Code available at https://github.com/merlresearch/radar-bbox-diffusion
Abstract
Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on implicit cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO utilizes these noisy 3D BBoxes to guide an explicit cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. By accounting for prior knowledge that the person is in contact with the ground, REXO reduces the number of diffusion parameters by determining them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset. The REXO implementation is available at https://github.com/merlresearch/radar-bbox-diffusion.
中文标题/摘要
标题:REXO:通过3D边界框扩散的室内多视角雷达目标检测
多视角室内雷达感知由于其成本效益和低隐私风险而引起了关注。现有方法通常依赖于隐式的跨视角雷达特征关联,如RFMask中的提案配对或RETR中的查询到特征的交叉注意力,这可能导致特征匹配模糊并在复杂室内场景中降低检测性能。为了解决这些限制,我们提出了**REXO**(多视角雷达目标检测中的3D边界框扩散),将DiffusionDet中的2D边界框扩散过程提升到3D雷达空间。REXO利用这些嘈杂的3D边界框来引导显式的跨视角雷达特征关联,增强跨视角雷达条件下的去噪过程。通过考虑先验知识,即人与地面接触,REXO减少了扩散参数的数量,从先验中确定它们。在两个公开的室内雷达数据集上评估,我们的方法在HIBER数据集上以+4.22 AP的优势超过了最先进的方法,在MMVR数据集上以+11.02 AP的优势超过了最先进的方法。REXO的实现可在https://github.com/merlresearch/radar-bbox-diffusion获得。
Summary / 总结
REXO is designed to improve indoor multi-view radar object detection by addressing the limitations of implicit cross-view feature association in existing methods. It uses 3D bounding box diffusion to guide explicit cross-view radar feature association, enhancing denoising. By leveraging the prior knowledge that people are in contact with the ground, REXO reduces the number of diffusion parameters. The approach outperforms state-of-the-art methods by 4.22 AP on HIBER and 11.02 AP on MMVR datasets.
REXO旨在通过解决现有方法中隐式跨视图特征关联的局限性,改进室内多视图雷达目标检测。它利用3D边界框扩散过程来引导显式的跨视图雷达特征关联,增强去噪过程。REXO在HIBER数据集上实现了4.22 AP的提升,在MMVR数据集上实现了11.02 AP的提升,优于最先进的方法。
Advancing Multinational License Plate Recognition Through Synthetic and Real Data Fusion: A Comprehensive Evaluation
Authors: Rayson Laroca, Valter Estevam, Gladston J. P. Moreira, Rodrigo Minetto, David Menotti
First: 2026-01-12T15:52:52+00:00 · Latest: 2026-01-12T15:52:52+00:00
Comments: IET Intelligent Transport Systems, vol. 19, no. 1, p. e70086, 2025
Abstract
Automatic License Plate Recognition is a frequent research topic due to its wide-ranging practical applications. While recent studies use synthetic images to improve License Plate Recognition (LPR) results, there remain several limitations in these efforts. This work addresses these constraints by comprehensively exploring the integration of real and synthetic data to enhance LPR performance. We subject 16 Optical Character Recognition (OCR) models to a benchmarking process involving 12 public datasets acquired from various regions. Several key findings emerge from our investigation. Primarily, the massive incorporation of synthetic data substantially boosts model performance in both intra- and cross-dataset scenarios. We examine three distinct methodologies for generating synthetic data: template-based generation, character permutation, and utilizing a Generative Adversarial Network (GAN) model, each contributing significantly to performance enhancement. The combined use of these methodologies demonstrates a notable synergistic effect, leading to end-to-end results that surpass those reached by state-of-the-art methods and established commercial systems. Our experiments also underscore the efficacy of synthetic data in mitigating challenges posed by limited training data, enabling remarkable results to be achieved even with small fractions of the original training data. Finally, we investigate the trade-off between accuracy and speed among different models, identifying those that strike the optimal balance in each of the intra-dataset and cross-dataset settings.
中文标题/摘要
标题:通过合成与真实数据融合推动跨国车牌识别技术发展:全面评估
自动车牌识别是频繁的研究主题,因其广泛的实际应用。尽管最近的研究使用合成图像来提高车牌识别(LPR)结果,但这些努力仍存在一些局限性。本研究通过全面探索真实和合成数据的整合来克服这些限制,以提升LPR性能。我们对16种光学字符识别(OCR)模型进行了基准测试,涉及来自不同地区的12个公开数据集。我们的研究发现几个关键点。首先,大量使用合成数据在跨数据集和数据集内部场景中显著提升了模型性能。我们探讨了三种生成合成数据的方法:基于模板生成、字符排列以及使用生成对抗网络(GAN)模型,每种方法都对性能提升有显著贡献。这些方法的结合展示了明显的协同效应,最终结果超过了最先进的方法和现有商业系统。我们的实验还强调了合成数据在缓解有限训练数据带来的挑战方面的有效性,即使使用原始训练数据的小部分也能取得显著成果。最后,我们研究了不同模型在准确性和速度之间的权衡,确定了在每个数据集内部和跨数据集设置中达到最佳平衡的模型。
Summary / 总结
This study aims to improve License Plate Recognition (LPR) by fusing real and synthetic data. It benchmarks 16 OCR models on 12 public datasets and finds that the massive incorporation of synthetic data substantially boosts performance in both intra- and cross-dataset scenarios. Three generation methods (template-based generation, character permutation, and a GAN model) each contribute, and their combination yields end-to-end results that surpass state-of-the-art methods and established commercial systems. Synthetic data also mitigates data scarcity, enabling strong results with small fractions of the original training data. Finally, the study examines the trade-off between accuracy and speed, identifying the models that strike the best balance in each of the intra-dataset and cross-dataset settings.
本研究旨在通过融合真实和合成数据来提升自动车牌识别(LPR)性能。研究评估了16种OCR模型在12个公共数据集上的表现,并发现大量使用合成数据显著提升了LPR在跨数据集和同数据集场景中的性能。研究采用了三种合成数据生成方法:模板生成、字符置换和生成对抗网络(GAN)模型,每种方法都对性能提升有贡献。这些方法的综合使用使得最终结果超越了最先进的方法和商用系统。此外,研究还探讨了不同模型在准确性和速度之间的权衡,确定了在每个数据集内和跨数据集设置中表现最佳的模型。
Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
Authors: Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao
First: 2026-01-12T15:47:35+00:00 · Latest: 2026-01-12T15:47:35+00:00
Comments: Source code is available at https://github.com/TANIGUCHIREI/ASL
Abstract
Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in KV cache and prune others, are one of the most popular schemes. They primarily adopt a set of pre-defined layers, at which tokens are selected. Such design is inflexible in the sense that the accuracy significantly varies across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget requirement. ASL operates during the prefilling stage and can be jointly used with existing KV cache reduction methods such as SnapKV to optimize the decoding stage. By evaluations on the InfiniteBench, RULER, and NIAH benchmarks, we show that equipped with one-shot token selection, where tokens are selected at a layer and propagated to deeper layers, ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
中文标题/摘要
标题:适应性层选择在LLM推理中按层剪枝令牌
由于大型语言模型(LLMs)的普及,LLM推理中的键值(KV)缓存减少受到了极大的关注。近年来,提出的各种方法中,按层剪枝令牌方法是最受欢迎的方案之一。这些方法主要采用一组预定义的层,在这些层上选择令牌并剪枝其他令牌。这种设计在灵活性方面存在不足,因为其准确率在不同任务中差异显著,在如键值检索等较难的任务中会下降。在本文中,我们提出了一种无需训练的方法ASL,该方法能够适应性地选择KV缓存减少的层,利用按注意力分数排序的令牌排名的方差。该方法在满足用户指定的KV预算要求的同时,平衡了不同任务的性能。ASL在预填充阶段运行,并可以与现有的KV缓存减少方法(如SnapKV)联合使用,以优化解码阶段。通过在InfiniteBench、RULER和NIAH基准上的评估,我们展示了ASL在准确率上优于最先进的按层剪枝令牌方法,同时保持了解码速度和KV缓存减少。
Summary / 总结
This paper addresses key-value (KV) cache reduction in large language model (LLM) inference by proposing ASL, a training-free adaptive layer selection method. Unlike existing layer-wise token pruning approaches that select tokens at pre-defined layers, ASL chooses the selection layer adaptively, based on the variance of token ranks ordered by attention score. The method balances performance across different tasks while meeting the user-specified KV budget. Experiments on the InfiniteBench, RULER, and NIAH benchmarks show that ASL outperforms state-of-the-art layer-wise token selection methods in accuracy while maintaining decoding speed and KV cache reduction.
本文提出了一种名为ASL的自适应层选择方法,以解决大型语言模型推理中的关键值缓存减少问题。与之前使用固定层进行 token 剪枝的方法不同,ASL 根据注意力分数排序的 token 排序的方差动态选择剪枝层。该方法在满足用户指定的关键值预算的同时,提高了不同任务的性能。实验结果表明,ASL 在 InfiniteBench、RULER 和 NIAH 基准上的准确率优于现有层间 token 选择方法,同时保持了解码速度和关键值缓存减少。
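The one-shot token selection mentioned in the abstract above (tokens chosen at one layer and propagated to deeper layers, within a KV budget) can be sketched as follows. This covers only that generic step: how ASL picks the selection layer from the variance of token ranks is not reproduced here, and the function names, scores, and toy cache layout are assumptions for illustration.

```python
def select_tokens(attention_scores, kv_budget):
    """Return positions of the kv_budget highest-scoring tokens, in original order."""
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:kv_budget])

def prune_kv(kv_cache, keep):
    """Keep only the selected token entries of one layer's KV cache."""
    return [kv_cache[i] for i in keep]

# Hypothetical per-token attention scores observed at the selection layer.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
keep = select_tokens(scores, kv_budget=2)

# Toy KV caches for three deeper layers, five tokens each.
deeper_layers = [[(f"k{layer}_{tok}", f"v{layer}_{tok}") for tok in range(5)]
                 for layer in range(3)]
# One-shot selection: every deeper layer reuses the same `keep` set.
pruned = [prune_kv(layer_cache, keep) for layer_cache in deeper_layers]
```

Reusing a single `keep` set means the ranking is computed once rather than at every layer, which is consistent with the abstract's description of selection being propagated to deeper layers.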
Modern Neuromorphic AI: From Intra-Token to Inter-Token Processing
Authors: Osvaldo Simeone
First: 2026-01-01T07:38:07+00:00 · Latest: 2026-01-12T15:47:06+00:00
Abstract
The rapid growth of artificial intelligence (AI) has brought novel data processing and generative capabilities but also escalating energy requirements. This challenge motivates renewed interest in neuromorphic computing principles, which promise brain-like efficiency through discrete and sparse activations, recurrent dynamics, and non-linear feedback. In fact, modern AI architectures increasingly embody neuromorphic principles through heavily quantized activations, state-space dynamics, and sparse attention mechanisms. This paper elaborates on the connections between neuromorphic models, state-space models, and transformer architectures through the lens of the distinction between intra-token processing and inter-token processing. Most early work on neuromorphic AI was based on spiking neural networks (SNNs) for intra-token processing, i.e., for transformations involving multiple channels, or features, of the same vector input, such as the pixels of an image. In contrast, more recent research has explored how neuromorphic principles can be leveraged to design efficient inter-token processing methods, which selectively combine different information elements depending on their contextual relevance. Implementing associative memorization mechanisms, these approaches leverage state-space dynamics or sparse self-attention. Along with a systematic presentation of modern neuromorphic AI models through the lens of intra-token and inter-token processing, training methodologies for neuromorphic AI models are also reviewed. These range from surrogate gradients leveraging parallel convolutional processing to local learning rules based on reinforcement learning mechanisms.
中文标题/摘要
标题:现代类脑AI:从词内处理到词间处理
人工智能(AI)的迅速发展带来了新颖的数据处理和生成能力,但也伴随着不断上升的能量需求。这一挑战促使人们重新关注类脑计算原理,这些原理通过离散和稀疏激活、循环动力学和非线性反馈,有望实现类似大脑的效率。事实上,现代AI架构越来越多地通过高度量化激活、状态空间动力学和稀疏注意力机制体现类脑原理。本文从词内处理和词间处理的区别视角,探讨了类脑模型、状态空间模型和变换器架构之间的联系。早期的类脑AI研究主要基于脉冲神经网络(SNNs)进行词内处理,即对同一向量输入(如图像的像素)的多个通道或特征进行变换。相比之下,最近的研究则探索了如何利用类脑原理设计高效的词间处理方法,这些方法根据上下文相关性选择性地组合不同的信息元素。这些方法利用状态空间动力学或稀疏自注意力实现关联记忆机制。除了从词内处理和词间处理的视角系统地介绍现代类脑AI模型外,本文还回顾了类脑AI模型的训练方法,这些方法从利用并行卷积处理的替代梯度到基于强化学习机制的局部学习规则不等。
Summary / 总结
This paper explores the integration of neuromorphic principles into modern AI architectures, focusing on the distinction between intra-token and inter-token processing. It reviews how early work utilized spiking neural networks for intra-token processing, while recent research has applied neuromorphic principles to inter-token processing, which involves combining different information elements based on contextual relevance. The study also covers training methodologies, including surrogate gradients and local learning rules based on reinforcement learning.
该论文探讨了将神经形态原理整合到现代AI架构中的方法,重点关注内令牌处理和跨令牌处理的区别。早期工作使用脉冲神经网络进行内令牌处理,而最近的研究则将神经形态原理应用于跨令牌处理,即根据上下文相关性结合不同的信息元素。研究还涵盖了训练方法,包括利用并行卷积处理的替代梯度方法和基于强化学习机制的局部学习规则。
Variational Contrastive Learning for Skeleton-based Action Recognition
Authors: Dang Dinh Nguyen, Decky Aspandi Latif, Titus Zaharia
First: 2026-01-12T15:45:40+00:00 · Latest: 2026-01-12T15:45:40+00:00
Abstract
In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that, compared to other methods, the features provided by our method are more relevant to the motion and sample characteristics and focus more on important skeleton joints.
中文标题/摘要
标题:基于变分对比学习的骨架动作识别
近年来,基于对比学习方法的发展,骨架动作识别的自监督表示学习取得了进步。然而,大多数对比学习范式本质上是区分性的,往往难以捕捉人类运动内在的多样性和不确定性。为了解决这一问题,我们提出了一种结合概率潜在建模与对比自监督学习的变分对比学习框架。该框架能够学习跨不同数据集和监督水平的一般化结构化和语义化表示。在三个广泛使用的骨架动作识别基准上的广泛实验表明,我们提出的方法在低标签情况下始终优于现有方法。此外,定性分析表明,与其他方法相比,我们方法提供的特征更相关,更注重重要骨架关节,与动作和样本特征更为匹配。
Summary / 总结
The research aims to improve skeleton-based action recognition by addressing the limitations of discriminative contrastive learning methods, which often fail to capture motion variability and uncertainty. The proposed variational contrastive learning framework combines probabilistic latent modeling with contrastive self-supervised learning to generate structured and semantically meaningful representations. Experimental results on three benchmarks demonstrate that the proposed method outperforms existing approaches, especially in scenarios with limited labeled data, and the generated features are more relevant and focused on important joints compared to other methods.
论文提出了一种变分对比学习框架,用于骨架基于的动作识别,以解决对比学习方法在捕捉动作的多样性和不确定性方面的局限性。通过将概率潜变量建模与对比自监督学习相结合,该方法能够学习结构化和语义上具有意义的表示,这些表示在不同数据集和监督水平下具有良好的泛化能力。实验结果表明,该方法在有限标注数据的情况下优于现有方法。