We present Recurrent Video Masked-Autoencoders (RVM): a novel video representation learning approach that uses a transformer-based recurrent neural network to aggregate dense image features over time, effectively capturing the spatio-temporal structure of natural video data. RVM learns via an asymmetric masked prediction task requiring only a standard pixel reconstruction objective. This design yields a highly efficient "generalist" encoder: RVM achieves competitive performance with state-of-the-art video models (e.g. VideoMAE, V-JEPA) on video-level tasks like action recognition and point/object tracking, while also performing favorably against image models (e.g. DINOv2) on tasks that test geometric and dense spatial understanding. Notably, RVM achieves strong performance in the small-model regime without requiring knowledge distillation, exhibiting up to 30x greater parameter efficiency than competing video masked autoencoders. Moreover, we demonstrate that RVM's recurrent nature allows for stable feature propagation over long temporal horizons with linear computational cost, overcoming some of the limitations of standard spatio-temporal attention-based architectures. Finally, we use qualitative visualizations to highlight that RVM learns rich representations of scene semantics, structure, and motion.
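The linear-cost claim is easy to see in miniature: a recurrent aggregator touches each frame once, while joint spatio-temporal attention compares every frame pair. The sketch below is illustrative only; the `ema_update` stand-in is my assumption, not RVM's learned transformer update.

```python
# Illustrative sketch (not the paper's architecture): a recurrent aggregator
# touches each frame once, so cost grows linearly with video length T,
# whereas full attention over all frames scales with T^2 comparisons.

def recurrent_aggregate(frame_features, update):
    """Fold per-frame feature vectors into a single state, one step per frame."""
    state = [0.0] * len(frame_features[0])
    steps = 0
    for z in frame_features:
        state = update(state, z)
        steps += 1                      # one update per frame -> O(T)
    return state, steps

def ema_update(state, z, decay=0.9):
    """A stand-in for the learned recurrent update: exponential moving average."""
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, z)]

def pairwise_attention_cost(num_frames):
    """Frame-pair comparisons needed by joint attention over all frames."""
    return num_frames * num_frames

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
state, steps = recurrent_aggregate(feats, ema_update)
print(steps, pairwise_attention_cost(len(feats)))  # 4 16
```

The recurrent path does 4 updates for 4 frames where pairwise attention would do 16 comparisons; the gap widens linearly vs. quadratically as the horizon grows.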
https://arxiv.org/abs/2512.13684
Envy is a common human behavior that shapes competitiveness and can alter outcomes in team settings. As large language models (LLMs) increasingly act on behalf of humans in collaborative and competitive workflows, there is a pressing need to evaluate whether, and under what conditions, they exhibit envy-like preferences. In this paper, we test whether LLMs show envy-like behavior toward each other. We consider two scenarios: (1) a point allocation game that tests whether a model tries to win over its peer, and (2) a workplace setting that observes behavior when recognition is unfair. Our findings reveal consistent evidence of envy-like patterns in certain LLMs, with large variation across models and contexts. For instance, GPT-5-mini and Claude-3.7-Sonnet show a clear tendency to pull down the peer model to equalize outcomes, whereas Mistral-Small-3.2-24B instead focuses on maximizing its own individual gains. These results highlight the need to consider competitive dispositions as a safety and design factor in LLM-based multi-agent systems.
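The "pull down the peer to equalize" pattern can be made concrete with the classic Fehr-Schmidt inequity-aversion utility; this framing is mine, not the paper's protocol, and the `alpha`/`beta` weights are illustrative.

```python
# Hedged formalization (not the paper's setup): Fehr-Schmidt inequity
# aversion. alpha weighs envy (peer ahead), beta weighs guilt (peer behind).

def utility(own, peer, alpha=0.0, beta=0.0):
    envy = max(peer - own, 0)
    guilt = max(own - peer, 0)
    return own - alpha * envy - beta * guilt

# Two outcomes: equalized scores (5, 5) vs. a higher own score with the
# peer further ahead (6, 9). A purely self-interested agent takes the 6;
# a sufficiently envious agent sacrifices a point to equalize.
outcomes = [(5, 5), (6, 9)]
selfish_pick = max(outcomes, key=lambda o: utility(*o))
envious_pick = max(outcomes, key=lambda o: utility(*o, alpha=1.0))
print(selfish_pick, envious_pick)  # (6, 9) (5, 5)
```

With `alpha=1.0`, the (6, 9) outcome is worth 6 - 1*(9-6) = 3 < 5, so the envious agent prefers the equalized split, mirroring the behavior reported for some models.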
https://arxiv.org/abs/2512.13481
Electroencephalographic (EEG) signals have long been applied in the field of affective brain-computer interfaces (aBCIs). Cross-subject EEG-based emotion recognition has demonstrated significant potential in practical applications due to its suitability across diverse people. However, most studies on cross-subject EEG-based emotion recognition neglect inter-individual variability and negative transfer during model training. To address this issue, this paper introduces a cross-subject EEG-based emotion recognition method that performs source selection with an adversarial strategy. The proposed method comprises two modules: the source selection network (SS) and the adversarial strategies network (AS). The SS uses domain labels to reverse-engineer the training process of domain adaptation. Its key idea is to disrupt class separability and magnify inter-domain differences, thereby raising the classification difficulty and forcing the model to learn domain-invariant yet emotion-relevant representations. The AS receives the source-domain selection results and the pretrained domain discriminators from the SS. The pretrained domain discriminators compute a novel loss aimed at enhancing domain-classification performance during adversarial training, ensuring the balance of the adversarial strategies. This paper provides theoretical insights into the proposed method, which achieves outstanding performance on two EEG-based emotion datasets, SEED and SEED-IV. The code can be found at this https URL.
https://arxiv.org/abs/2512.13458
Continuous sign language recognition (CSLR) requires precise spatio-temporal modeling to accurately recognize sequences of gestures in videos. Existing frameworks often rely on CNN-based spatial backbones combined with temporal convolution or recurrent modules. These techniques fail to capture fine-grained hand and facial cues and to model long-range temporal dependencies. To address these limitations, we propose the Unified Spatio-Temporal Modeling (USTM) framework, a spatio-temporal encoder that effectively models complex patterns using a Swin Transformer backbone enhanced with a lightweight temporal adapter with positional embeddings (TAPE). Our framework captures fine-grained spatial features alongside short- and long-term temporal context, enabling robust sign language recognition from RGB videos without relying on multi-stream inputs or auxiliary modalities. Extensive experiments on benchmark datasets including PHOENIX14, PHOENIX14T, and CSL-Daily demonstrate that USTM achieves state-of-the-art performance against RGB-based as well as multi-modal CSLR approaches, while maintaining competitive performance against multi-stream approaches. These results highlight the strength and efficacy of the USTM framework for CSLR. The code is available at this https URL
https://arxiv.org/abs/2512.13415
Face recognition systems rely on learning highly discriminative and compact identity clusters to enable accurate retrieval. However, as with other surveillance-oriented technologies, such systems raise serious privacy concerns due to their potential for unauthorized identity tracking. While several works have explored machine unlearning as a means of privacy protection, their applicability to face retrieval - especially for modern embedding-based recognition models - remains largely unexplored. In this work, we study the problem of face identity unlearning for retrieval systems and present its inherent challenges. The goal is to make selected identities unretrievable by dispersing their embeddings on the hypersphere and preventing the formation of compact identity clusters that enable re-identification in the gallery. The primary challenge is to achieve this forgetting effect while preserving the discriminative structure of the embedding space and the retrieval performance of the model for the remaining identities. To address this, we evaluate several existing approximate class unlearning methods (e.g., Random Labeling, Gradient Ascent, Boundary Unlearning, and other recent approaches) in the context of face retrieval and propose a simple yet effective dispersion-based unlearning approach. Extensive experiments on standard benchmarks (VGGFace2, CelebA) demonstrate that our method achieves superior forgetting behavior while preserving retrieval utility.
https://arxiv.org/abs/2512.13317
Visual Place Recognition (VPR) has advanced significantly with high-capacity foundation models like DINOv2, achieving remarkable performance. Nonetheless, their substantial computational cost makes deployment on resource-constrained devices impractical. In this paper, we introduce an efficient asymmetric VPR framework that pairs a high-capacity gallery model for offline feature extraction with a lightweight query network for online processing. A key challenge in this setting is ensuring compatibility between these heterogeneous networks, which conventional approaches address through computationally expensive k-NN-based compatible training. To overcome this, we propose a geographical memory bank that structures gallery features using the geolocation metadata inherent in VPR databases, eliminating the need for exhaustive k-NN computations. Additionally, we introduce an implicit embedding augmentation technique that enables the query network to model feature variations despite its limited capacity. Extensive experiments demonstrate that our method not only significantly reduces computational costs but also outperforms existing asymmetric retrieval techniques, opening a new direction for VPR in resource-limited environments. The code is available at this https URL
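One way to picture the geographical memory bank (the gridding scheme here is my assumption, not the paper's exact construction): bucket gallery descriptors by a coarse latitude/longitude cell, so nearby references come from a dictionary lookup rather than exhaustive k-NN over the whole gallery.

```python
import math
from collections import defaultdict

# Illustrative sketch: geolocation metadata already in a VPR database lets
# us group gallery features by spatial cell, replacing global k-NN search
# with an O(1) cell lookup. Cell size and entries are made up.

def grid_cell(lat, lon, cell_deg=0.01):
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def build_memory_bank(entries, cell_deg=0.01):
    bank = defaultdict(list)
    for lat, lon, feat in entries:
        bank[grid_cell(lat, lon, cell_deg)].append(feat)
    return bank

gallery = [
    (48.85837, 2.29448, "f_tower_a"),   # two views of the same place
    (48.85841, 2.29452, "f_tower_b"),
    (48.86061, 2.33764, "f_museum"),    # a different place, different cell
]
bank = build_memory_bank(gallery)
neighbours = bank[grid_cell(48.85839, 2.29450)]
print(neighbours)  # ['f_tower_a', 'f_tower_b']
```

The lookup returns only features from the query's own spatial neighborhood, which is the structure compatible training needs, without ever computing pairwise distances across the full gallery.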
https://arxiv.org/abs/2512.13055
Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which directly performs layer/frame-wise processing on the layer-wise hidden state outputs of pre-trained models, extracting fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the experiments. Concurrently, it stood out in terms of model compactness and exhibited inference efficiency comparable to the existing systems. These results highlight the advantages of the proposed layer-aware processing approach. Future work includes exploring joint training with SSL frontends and the incorporation of score calibration to further enhance state-of-the-art verification performance.
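A minimal sketch of layer-wise aggregation in this spirit: each frame mixes the hidden states of all SSL layers with softmax-normalized weights. The scalar `layer_logits` scoring is my simplification of L-TDNN's learned frame-adaptive aggregation, not its actual parameterization.

```python
import math

# Sketch: instead of using only the top SSL layer, a frame combines every
# layer's hidden state with its own softmax weights over the layer axis.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_frame(layer_states, layer_logits):
    """layer_states: one vector per SSL layer; layer_logits: one score per layer."""
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    return [sum(w * layer[d] for w, layer in zip(weights, layer_states))
            for d in range(dim)]

# Three layers with 2-dim states; this frame's logits strongly prefer layer 1.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed = aggregate_frame(states, [0.0, 5.0, 0.0])
print(mixed)  # dominated by layer 1's state [0.0, 1.0]
```

Because the weights are per frame, different frames can draw on different depths of the encoder, which is the "layer dimension" the abstract argues is usually ignored.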
https://arxiv.org/abs/2409.07770
Large Language Models (LLMs) are often fine-tuned to adapt their general-purpose knowledge to specific tasks and domains such as cyber threat intelligence (CTI). Fine-tuning is mostly done on proprietary datasets that may contain sensitive information. Owners expect their fine-tuned model not to inadvertently leak this information to potentially adversarial end users. Using CTI as a use case, we demonstrate that data-extraction attacks can recover sensitive information from models fine-tuned on CTI reports, underscoring the need for mitigation. Retraining the full model to eliminate this leakage is computationally expensive and impractical. We propose an alternative approach, which we call privacy alignment, inspired by safety alignment in LLMs. Just as safety alignment teaches the model to abide by safety constraints through a few examples, we enforce privacy alignment through few-shot supervision, integrating a privacy classifier and a privacy redactor, both handled by the same underlying LLM. We evaluate our system, called CTIGuardian, using GPT-4o mini and Mistral-7B Instruct models, benchmarking against Presidio, a named entity recognition (NER) baseline. Results show that CTIGuardian provides a better privacy-utility trade-off than NER-based models. While we demonstrate its effectiveness on a CTI use case, the framework is generic enough to be applicable to other sensitive domains.
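The classifier-plus-redactor split can be sketched as a two-stage pipeline; the regex "classifier" below is a toy stand-in for the LLM-based components described above, and the IP-address pattern is just one example of a sensitive CTI indicator.

```python
import re

# Toy sketch of a classify-then-redact flow: a sensitivity check gates a
# redaction pass, mirroring the two-stage privacy classifier + redactor
# design (both of which are LLM-backed in the real system).

SENSITIVE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")   # IPv4 literals

def is_sensitive(text):
    return bool(SENSITIVE.search(text))

def redact(text):
    return SENSITIVE.sub("[REDACTED]", text)

def answer(query, draft):
    # Only invoke the redactor when the classifier flags the draft,
    # preserving utility on benign outputs.
    return redact(draft) if is_sensitive(draft) else draft

print(answer("Which host was hit?", "The C2 beacon hit 10.0.0.12 nightly."))
# The C2 beacon hit [REDACTED] nightly.
```

Gating redaction behind a classifier is what buys the privacy-utility trade-off: unflagged answers pass through untouched, so only genuinely sensitive drafts pay the edit cost.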
https://arxiv.org/abs/2512.12914
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
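The describe-retrieve-reason flow can be sketched as follows; the `vlm_describe` and `llm_decide` stubs and the token-overlap similarity are placeholders for the VLM, the LLM, and the vector-database retrieval, and the sign codes/descriptions are illustrative MUTCD-style entries.

```python
# Schematic of the zero-shot RAG pipeline: VLM description -> retrieve
# top-k candidate designs -> LLM picks the final sign. Only the control
# flow mirrors the abstract; the models here are stubs.

REFERENCE_SIGNS = {
    "R1-1": "octagonal red sign with white legend STOP",
    "R1-2": "downward pointing triangular sign with legend YIELD",
    "R2-1": "white rectangular sign with black legend SPEED LIMIT",
}

def vlm_describe(image):
    # Stub: a real system would query a vision-language model here.
    return image["description"]

def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve(description, k=2):
    """Stand-in for vector-database retrieval: rank by token overlap."""
    ranked = sorted(REFERENCE_SIGNS, reverse=True,
                    key=lambda code: overlap(description, REFERENCE_SIGNS[code]))
    return ranked[:k]

def llm_decide(description, candidates):
    # Stub: a real system would prompt an LLM to reason over candidates.
    return max(candidates,
               key=lambda code: overlap(description, REFERENCE_SIGNS[code]))

image = {"description": "red octagonal sign with white STOP legend"}
candidates = retrieve(vlm_describe(image))
print(llm_decide(vlm_describe(image), candidates))  # R1-1
```

Restricting the LLM's final reasoning to a small retrieved candidate set is what keeps the approach tractable over all 303 reference designs without any task-specific training.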
https://arxiv.org/abs/2512.12885
Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
https://arxiv.org/abs/2512.12822
Voice-based interaction has emerged as a natural and intuitive modality for controlling IoT devices. However, speech-driven edge devices face a fundamental trade-off between cloud-based solutions, which offer stronger language understanding capabilities at the cost of latency, connectivity dependence, and privacy concerns, and edge-based solutions, which provide low latency and improved privacy but are limited by computational constraints. This paper presents ASTA, an adaptive speech-to-action solution that dynamically routes voice commands between edge and cloud inference to balance performance and system resource utilization. ASTA integrates on-device automatic speech recognition and lightweight offline language-model inference with cloud-based LLM processing, guided by real-time system metrics such as CPU workload, device temperature, and network latency. A metric-aware routing mechanism selects the inference path at runtime, while a rule-based command validation and repair component ensures successful end-to-end command execution. We implemented our solution on an NVIDIA Jetson-based edge platform and evaluated it using a diverse dataset of 80 spoken commands. Experimental results show that ASTA successfully routes all input commands for execution, achieving a balanced distribution between online and offline inference. The system attains an ASR accuracy of 62.5% and generates executable commands without repair for only 47.5% of inputs, highlighting the importance of the repair mechanism in improving robustness. These results suggest that adaptive edge-cloud orchestration is a viable approach for resilient and resource-aware voice-controlled IoT systems.
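A metric-aware router of this kind reduces to a small decision rule; the thresholds and the fall-back-to-edge-when-the-network-is-slow behavior below are illustrative assumptions, not ASTA's actual configuration.

```python
# Hedged sketch of metric-aware edge/cloud routing: prefer on-device
# inference, offload to the cloud when the edge is stressed, but stay
# on-device when the network would make offloading pointless.

def route(cpu_load, temp_c, net_latency_ms,
          max_cpu=0.85, max_temp=70.0, max_latency=250.0):
    edge_stressed = cpu_load > max_cpu or temp_c > max_temp
    network_ok = net_latency_ms <= max_latency
    if edge_stressed and network_ok:
        return "cloud"
    return "edge"

print(route(0.30, 55.0, 80.0))   # edge: device healthy
print(route(0.95, 72.0, 80.0))   # cloud: device stressed, network fine
print(route(0.95, 72.0, 900.0))  # edge: network too slow to offload
```

Evaluating the network condition alongside device metrics is what lets the router degrade gracefully: a stressed device with no usable link still answers locally rather than stalling on the cloud path.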
https://arxiv.org/abs/2512.12769
Large-scale digitization initiatives have unlocked massive collections of historical newspapers, yet effective computational access remains hindered by OCR corruption, multilingual orthographic variation, and temporal language drift. We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents. Our approach integrates: (i) semantic query expansion and multi-query fusion using Reciprocal Rank Fusion to improve retrieval robustness against vocabulary mismatch; (ii) a carefully engineered generation prompt that enforces strict grounding in retrieved evidence and explicit abstention when evidence is insufficient; and (iii) a modular architecture enabling systematic component evaluation. We conduct comprehensive ablation studies on Named Entity Recognition and embedding model selection, demonstrating the importance of syntactic coherence in entity extraction and balanced performance-efficiency trade-offs in dense retrieval. Our end-to-end evaluation framework shows that the pipeline generates faithful answers for well-supported queries while correctly abstaining from unanswerable questions. The hybrid retrieval strategy improves recall stability, particularly benefiting from RRF's ability to smooth performance variance across query formulations. We release our code and configurations at this https URL, providing a reproducible foundation for robust historical document question answering.
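Reciprocal Rank Fusion itself is a one-liner: each document scores the sum of 1/(k + rank) across the query formulations, with k = 60 as the commonly used constant from the original RRF formulation. The ranked lists below are illustrative.

```python
# Reciprocal Rank Fusion for multi-query retrieval: documents that rank
# consistently well across query variants rise to the top, smoothing out
# variance from any single formulation. Ranks are 1-based.

def rrf(rankings, k=60):
    """rankings: one ranked list of doc ids per query formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variants disagree on the top hit, but "d2" is consistently
# near the top and wins after fusion.
fused = rrf([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d3", "d2", "d1"],
])
print(fused[0])  # d2
```

This is exactly the "smoothing" property the abstract credits RRF with: no single noisy OCR-era query formulation can dominate the fused ranking.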
https://arxiv.org/abs/2512.12694
Despite the rapid progress of deep learning in video action recognition (VAR) in recent years, privacy leakage in videos remains a critical concern. Current state-of-the-art privacy-preserving methods often rely on anonymization. These methods suffer from (1) low concealment: they produce visually distorted videos that attract attackers' attention during transmission; and (2) spatiotemporal disruption: they degrade the spatiotemporal features essential for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and, for the first time, performs VAR directly in the steganographic domain. Throughout both data transmission and action analysis, the spatiotemporal information of the hidden secret video remains complete, while the natural appearance of the cover videos ensures the concealment of transmission. Considering the difficulty of steganographic-domain analysis, we propose Secret Spatio-Temporal Promotion (STeP) and Cross-Band Difference Attention (CroDA) for analysis within the steganographic domain. STeP uses the secret video to guide spatiotemporal feature extraction in the steganographic domain during training. CroDA suppresses cover interference by capturing cross-band semantic differences. Experiments demonstrate that StegaVAR achieves superior VAR and privacy-preserving performance on widely used datasets. Moreover, our framework is effective for multiple steganographic models.
https://arxiv.org/abs/2512.12586
The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving a 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a Perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
https://arxiv.org/abs/2512.12537
Few-shot image classification remains difficult under limited supervision and visual domain shift. Recent cache-based adaptation approaches (e.g., Tip-Adapter) address this challenge to some extent by learning lightweight residual adapters over frozen features, yet they still inherit CLIP's tendency to encode global, general-purpose representations that are not discriminative enough to adapt a generalist model to a specialist domain in low-data regimes. We address this limitation with a novel patch-driven relational refinement that learns cache adapter weights from intra-image patch dependencies rather than treating an image embedding as a monolithic vector. Specifically, we introduce a relational gated graph attention network that constructs a patch graph and performs edge-aware attention to emphasize informative inter-patch interactions, producing context-enriched patch embeddings. A learnable multi-aggregation pooling then composes these into compact, task-discriminative representations that better align cache keys with the target few-shot classes. Crucially, the proposed graph refinement is used only during training to distil relational structure into the cache, incurring no additional inference cost beyond standard cache lookup. Final predictions are obtained by a residual fusion of cache similarity scores with CLIP zero-shot logits. Extensive evaluations on 11 benchmarks show consistent gains over state-of-the-art CLIP adapter and cache-based baselines while preserving zero-shot efficiency. We further validate battlefield relevance by introducing an Injured vs. Uninjured Soldier dataset for casualty recognition, motivated by the operational need to support triage decisions within the "platinum minutes" and the broader "golden hour" window in time-critical UAV-driven search-and-rescue and combat casualty care.
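The residual fusion step follows the Tip-Adapter recipe the abstract builds on: cache affinities are sharpened with an exponential and added to CLIP's zero-shot logits. The `alpha`/`beta` values and the tiny 2-dim features below are illustrative, not tuned settings.

```python
import math

# Sketch of cache-based residual fusion in the Tip-Adapter style: each
# cached key votes for its class with a sharpened cosine affinity, and the
# cache logits are added residually to CLIP's zero-shot logits.

def cache_logits(query, keys, key_labels, num_classes, alpha=1.0, beta=5.5):
    """query, keys: L2-normalized feature vectors; key_labels: class per key."""
    out = [0.0] * num_classes
    for key, label in zip(keys, key_labels):
        sim = sum(q * k for q, k in zip(query, key))      # cosine similarity
        affinity = math.exp(-beta * (1.0 - sim))           # sharpen
        out[label] += alpha * affinity
    return out

def fused_logits(clip_zero_shot, query, keys, key_labels):
    cache = cache_logits(query, keys, key_labels, len(clip_zero_shot))
    return [z + c for z, c in zip(clip_zero_shot, cache)]

keys = [[1.0, 0.0], [0.0, 1.0]]     # one cached few-shot key per class
query = [1.0, 0.0]                   # matches class 0's cached key exactly
logits = fused_logits([0.1, 0.2], query, keys, [0, 1])
print(logits.index(max(logits)))  # 0: cache evidence overturns the zero-shot tie
```

Since the graph refinement only reshapes the cache keys during training, inference stays exactly this cheap: one similarity pass over the cache plus an addition to the zero-shot logits.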
https://arxiv.org/abs/2512.12498
With the advent of machine learning and quantum computing, the 21st century has gone from a place of relative algorithmic security to one of speculative unease and, possibly, cyber catastrophe. Modern algorithms like Elliptic Curve Cryptography (ECC) are the bastion of current cryptographic security protocols that form the backbone of consumer protection, ranging from Hypertext Transfer Protocol Secure (HTTPS) in the modern internet browser to cryptographic financial instruments like Bitcoin. And there's been very little work put into testing the strength of these ciphers. Practically the only study that I could find was on side-channel recognition, a joint paper from the University of Milan, Italy and King's College, London\cite{battistello2025ecc}. These algorithms are already considered bulletproof by many consumers, but exploits already exist for them, and with computing power and distributed, federated compute on the rise, it's only a matter of time before these current bastions fade away into obscurity, and it's on all of us to stand up when we notice something is amiss, lest such weaknesses claim victims in the process. In this paper, we seek to explore the use of modern language model architecture in cracking the association between a known public key and its associated private key, by intuitively learning to reverse engineer the public keypair generation process, effectively solving the curve. Additionally, we attempt to ascertain modern machine learning's ability to memorize public-private secp256r1 keypairs, and to then test their ability to reverse engineer the public keypair generation process. It is my belief that proof for would be just as valuable as proof against in either of these categories. Finally, we'll conclude with some number crunching on where we see this particular field heading in the future.
https://arxiv.org/abs/2512.12483
Olympic Taekwondo has faced challenges in spectator engagement due to static, defensive gameplay and contentious scoring. Current Protector and Scoring Systems (PSS) rely on impact sensors and simplistic logic, encouraging safe strategies that diminish the sport's dynamism. This paper proposes an AI-powered scoring system that integrates existing PSS sensors with additional accelerometers, gyroscopes, magnetic/RFID, and impact force sensors in a sensor fusion framework. The system classifies kicks in real-time to identify technique type, contact location, impact force, and even the part of the foot used. A machine learning pipeline employing sensor fusion and Support Vector Machines (SVMs) is detailed, enabling automatic kick technique recognition for scoring. We present a novel kick scoring rubric that awards points based on specific kick techniques (e.g., turning and spinning kicks) to incentivize dynamic attacks. Drawing on a 2024 study achieving 96-98% accuracy, we validate the feasibility of real-time kick classification and further propose enhancements to this methodology, such as ensemble SVM classifiers and expanded datasets, to achieve the high-stakes accuracy required by the sport. We analyze how the proposed system can improve scoring fairness, reduce rule exploitation and illegitimate tactics, encourage more dynamic techniques, and enhance spectator understanding and excitement. The paper includes system design illustrations, a kick scoring table from an AI-augmented rule set, and discusses anticipated impacts on Olympic Taekwondo.
https://arxiv.org/abs/2512.12474
Although Large Language Model (LLM)-powered information extraction (IE) systems have shown impressive capabilities, current fine-tuning paradigms face two major limitations: high training costs and difficulties in aligning with LLM preferences. To address these issues, we propose a novel universal IE paradigm, the Self-Correcting Iterative Refinement (SCIR) framework, along with a Multi-task Bilingual (Chinese-English) Self-Correcting (MBSC) dataset containing over 100,000 entries. The SCIR framework achieves plug-and-play compatibility with existing LLMs and IE systems through its Dual-Path Self-Correcting module and feedback-driven optimization, thereby significantly reducing training costs. Concurrently, the MBSC dataset tackles the challenge of preference alignment by indirectly distilling GPT-4's capabilities into IE result detection models. Experimental results demonstrate that SCIR outperforms state-of-the-art IE methods across three key tasks: named entity recognition, relation extraction, and event extraction, achieving a 5.27 percent average improvement in span-based Micro-F1 while reducing training costs by 87 percent compared to baseline approaches. These advancements not only enhance the flexibility and accuracy of IE systems but also pave the way for lightweight and efficient IE paradigms.
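The feedback-driven loop at the core of such a framework can be sketched generically: extract, run a detector over the result, and re-extract with the detected errors as feedback until the detector is satisfied or a budget runs out. The extractor and detector below are hypothetical stand-ins, not SCIR's actual models.

```python
# Generic sketch of an extract -> detect -> refine loop for information extraction.
def refine_loop(text, extract, detect, max_rounds=3):
    """Iteratively refine an IE result until the detector reports no errors."""
    feedback = None
    result = extract(text, feedback)
    for _ in range(max_rounds):
        errors = detect(text, result)
        if not errors:          # detector is satisfied: accept the result
            break
        feedback = errors       # feed detected errors back into extraction
        result = extract(text, feedback)
    return result

# Toy stand-ins: the extractor misses an entity until given feedback.
def toy_extract(text, feedback):
    entities = ["Paris"]
    if feedback:
        entities.append("France")
    return entities

def toy_detect(text, result):
    return [] if "France" in result else ["missing entity: France"]

print(refine_loop("Paris is the capital of France.", toy_extract, toy_detect))
# -> ['Paris', 'France']
```

Because the loop only touches the extractor and detector through their call signatures, any existing LLM-based IE system can slot in without retraining, which is the plug-and-play property the abstract emphasizes.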
https://arxiv.org/abs/2512.12337
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, \ie, matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic object proposals across categories. Importantly, the proposal embeddings are class-specific and enable a new application, object retrieval, supporting retrieval of objects in historical data. (3) Integration with LMMs for referring expression comprehension (REC). We further propose WeDetect-Ref, an LMM-based object classifier to handle complex referring expressions, which retrieves target objects from the proposal list extracted by WeDetect-Uni. It discards next-token prediction and classifies objects in a single forward pass. Together, the WeDetect family unifies detection, proposal generation, object retrieval, and REC under a coherent retrieval framework, achieving state-of-the-art performance across 15 benchmarks with high inference efficiency.
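The retrieval view of non-fusion detection reduces recognition to a nearest-text lookup: region and text embeddings share one space, and each region takes the label of its most cosine-similar query. The dimensions and random embeddings below are placeholders, not WeDetect's actual features.

```python
# Sketch of recognition-as-retrieval in a shared region/text embedding space.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def match_regions_to_text(region_emb, text_emb):
    """Return, per region, the index of the best-matching text query,
    plus the full cosine-similarity matrix."""
    sims = l2_normalize(region_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1), sims

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 256))        # e.g. queries "person", "dog", "car"
# Two noisy region embeddings derived from queries 2 and 0.
region_emb = text_emb[[2, 0]] + 0.1 * rng.normal(size=(2, 256))
labels, sims = match_regions_to_text(region_emb, text_emb)
print(labels)  # -> [2 0]: each region retrieves its source query
```

Because the text side is computed once and the matching is a single matrix product, swapping in a different vocabulary (or searching archived region embeddings, as in the object-retrieval application) costs no detector re-run.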
https://arxiv.org/abs/2512.12309
As artificial intelligence (AI) systems become increasingly embedded in our daily life, the ability to recognize and adapt to human emotions is essential for effective human-computer interaction. Facial expression recognition (FER) provides a primary channel for inferring affective states, but the dynamic and culturally nuanced nature of emotions requires models that can learn continuously without forgetting prior knowledge. In this work, we propose a hybrid framework for FER in a continual learning setting that mitigates catastrophic forgetting. Our approach integrates two complementary modalities: deep convolutional features and facial Action Units (AUs) derived from the Facial Action Coding System (FACS). The combined representation is modelled through Bayesian Gaussian Mixture Models (BGMMs), which provide a lightweight, probabilistic solution that avoids retraining while offering strong discriminative power. Using the Compound Facial Expression of Emotion (CFEE) dataset, we show that our model can first learn basic expressions and then progressively recognize compound expressions. Experiments demonstrate improved accuracy, stronger knowledge retention, and reduced forgetting. This framework contributes to the development of emotionally intelligent AI systems with applications in education, healthcare, and adaptive user interfaces.
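The continual-learning scheme can be sketched with scikit-learn's `BayesianGaussianMixture`: one generative model per expression class, so adding a new (compound) expression fits one new model and never retrains the old ones, which is what sidesteps catastrophic forgetting. The synthetic features below stand in for the concatenated CNN + Action Unit vectors.

```python
# Sketch: per-class Bayesian GMMs for continual expression recognition.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

class ContinualBGMMClassifier:
    def __init__(self):
        self.models = {}  # class name -> fitted BGMM

    def add_class(self, name, X):
        """Fit a mixture for one new class; existing classes are untouched."""
        gmm = BayesianGaussianMixture(n_components=2, random_state=0)
        self.models[name] = gmm.fit(X)

    def predict(self, X):
        """Assign each sample to the class whose mixture gives it the
        highest log-likelihood."""
        names = list(self.models)
        scores = np.stack([self.models[n].score_samples(X) for n in names])
        return [names[i] for i in scores.argmax(axis=0)]

clf = ContinualBGMMClassifier()
clf.add_class("happy", rng.normal(0.0, 0.3, size=(100, 8)))
clf.add_class("sad", rng.normal(3.0, 0.3, size=(100, 8)))
# Later: add a compound expression without revisiting the basic ones.
clf.add_class("happily_surprised", rng.normal(-3.0, 0.3, size=(100, 8)))
print(clf.predict(rng.normal(3.0, 0.3, size=(3, 8))))
```

Since each class is scored independently, old classes cannot be overwritten by new training data; the trade-off is that classes compete only at inference time, via likelihood comparison.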
https://arxiv.org/abs/2512.12277