Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system's decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenges the reliance on global EER for privacy evaluations.
https://arxiv.org/abs/2506.09521
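A minimal sketch of the attack idea above, using an off-the-shelf sentence-transformers encoder as a stand-in for the paper's adapted BERT; the trial transcripts, checkpoint, and scoring setup are illustrative assumptions, not the authors' pipeline:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_curve

# Stand-in encoder; the paper adapts BERT as the attacker ASV model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Toy trial list: (enrollment transcript, test transcript, same speaker?)
trials = [
    ("the whaling voyage was underway", "harpoons and whale oil on deck", 1),
    ("the whaling voyage was underway", "parliament debated the tax bill", 0),
    ("she studied the old star charts", "constellations filled her notebooks", 1),
    ("she studied the old star charts", "the recipe called for fresh basil", 0),
]
e1 = encoder.encode([t[0] for t in trials], normalize_embeddings=True)
e2 = encoder.encode([t[1] for t in trials], normalize_embeddings=True)
scores = (e1 * e2).sum(axis=1)            # cosine similarity of transcript pairs
labels = np.array([t[2] for t in trials])

# EER: the operating point where false-accept and false-reject rates meet.
fpr, tpr, _ = roc_curve(labels, scores)
eer = fpr[np.nanargmin(np.abs((1 - tpr) - fpr))]
print(f"EER ~ {eer:.2%}")
```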
The proliferation of online news enables the widespread publication of perceived low-quality news headlines/links. As a result, we investigated whether it was possible to automatically distinguish perceived lower-quality news headlines/links from perceived higher-quality headlines/links. We evaluated twelve machine learning models on a binary, balanced dataset of 57,544,214 worldwide news website links/headings from 2018-2024 (28,772,107 per class) with 115 extracted linguistic features. Binary labels for each text were derived from scores based on expert consensus regarding the respective news domain quality. Traditional ensemble methods, particularly the bagging classifier, showed strong performance (88.1% accuracy, 88.3% F1, 80/20 train/test split). Fine-tuned DistilBERT achieved the highest accuracy (90.3%, 80/20 train/test split) but required more training time. The results suggest that both NLP features with traditional classifiers and deep learning models can effectively differentiate perceived news headline/link quality, with some trade-off between predictive performance and training time.
https://arxiv.org/abs/2506.09381
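To make the classical-baseline side concrete, a hedged sketch of a bagging classifier on headline text; the paper's 115 handcrafted linguistic features, expert-derived labels, and 57.5M real headlines are not reproduced here, so TF-IDF features and toy examples stand in:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real headline corpus and quality labels.
headlines = ["Scientists publish peer-reviewed climate findings",
             "You won't BELIEVE this one weird trick doctors hate",
             "Central bank raises rates by 25 basis points",
             "SHOCKING secret celebs don't want you to know"]
labels = [1, 0, 1, 0]  # 1 = perceived higher quality

X_tr, X_te, y_tr, y_te = train_test_split(
    headlines, labels, test_size=0.5, random_state=0, stratify=labels)
clf = make_pipeline(TfidfVectorizer(),
                    BaggingClassifier(n_estimators=50, random_state=0))
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```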
We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. Unlike existing metrics such as Average Trajectory Error (ATE), our metric allows comparison of SLAM performance across scenes, enabling researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit this https URL for the dataset, code, videos, and submission.
https://arxiv.org/abs/2506.09035
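A rough reading of the proposed metric, sketched under the assumption that it measures the image-space displacement (induced optical flow) of scene points projected under the ground-truth versus estimated poses; the exact definition and its scale normalization are in the paper:

```python
import numpy as np

def flow_metric(K, T_gt, T_est, points_w):
    """Mean pixel displacement induced by pose error: project the same 3D
    points under the ground-truth and estimated camera poses and compare
    image coordinates (an assumption-laden sketch, not the paper's formula)."""
    def project(T, X):
        Xc = T[:3, :3] @ X.T + T[:3, 3:4]     # world -> camera
        uv = K @ Xc
        return (uv[:2] / uv[2]).T             # perspective divide
    return np.linalg.norm(project(T_gt, points_w) - project(T_est, points_w),
                          axis=1).mean()

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
T_gt = np.eye(4)
T_est = np.eye(4); T_est[0, 3] = 0.01         # 1 cm translation error
pts = np.random.default_rng(0).uniform([-1, -1, 2], [1, 1, 6], (100, 3))
print(f"mean induced flow: {flow_metric(K, T_gt, T_est, pts):.2f} px")
```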
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually require many more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer toward Adam-Scale Speed. FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. It also accelerates per-batch computation through the use of Rademacher random vector perturbations coupled with CUDA's parallel processing. Extensive experiments on diverse models, including RoBERTa-large, OPT (350M-66B), Phi-2, and Llama3, across 11 tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by 3 percent in accuracy while requiring 3 times fewer forward passes. For RoBERTa-large, FZOO achieves average improvements of 5.6 percent in accuracy and an 18 times reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO's formal equivalence to a normalized-SGD update rule and its convergence guarantees. FZOO integrates smoothly into PEFT techniques, enabling even larger memory savings. Overall, our results make single-GPU, high-speed, full-parameter fine-tuning practical and point toward future work on memory-efficient pre-training.
https://arxiv.org/abs/2506.09034
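A toy sketch of the batched one-sided zeroth-order estimate with Rademacher perturbations; the step-size rule below (normalizing by the spread of the batched loss estimates) is an assumption loosely following the abstract, not FZOO's exact update:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta, batch):
    # Toy quadratic stand-in for a training loss.
    return np.mean([(theta - b) @ (theta - b) for b in batch])

theta = rng.normal(size=10)
batch = rng.normal(size=(8, 10))
eps, base_lr = 1e-3, 0.1

for step in range(100):
    # Batched one-sided estimates: g ~ mean_i [(L(th + eps*u_i) - L(th))/eps] * u_i
    l0 = loss(theta, batch)
    us = rng.choice([-1.0, 1.0], size=(4, theta.size))   # Rademacher directions
    deltas = np.array([(loss(theta + eps * u, batch) - l0) / eps for u in us])
    grad_est = (deltas[:, None] * us).mean(axis=0)
    # Adapt the step size by the spread of the loss estimates (assumed rule).
    lr = base_lr / (deltas.std() + 1e-8)
    theta -= lr * grad_est

print(f"final loss: {loss(theta, batch):.4f}")
```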
The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantBert, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture, known for its disentangled attention and robust contextual encoding, PlantBert is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantBert to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantBert exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields. By providing a scalable and reproducible framework for high-resolution entity recognition, PlantBert bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
https://arxiv.org/abs/2506.08897
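A minimal sketch of the modeling setup, assuming a standard DeBERTa token-classification head; the label schema and checkpoint below are illustrative placeholders (the released PlantBert weights and Crop Ontology schema are the real artifacts), and the head here is untrained:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Illustrative label set; the paper's hierarchical Crop Ontology schema differs.
labels = ["O", "B-STRESSOR", "I-STRESSOR", "B-TRAIT", "I-TRAIT"]
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)
ner = pipeline("token-classification", model=model, tokenizer=tok,
               aggregation_strategy="simple")
# The untrained head yields placeholder predictions; fine-tuning on the
# expert-annotated corpus is where the actual method lives.
print(ner("Lentil seedlings showed reduced root growth under salt stress."))
```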
We explore a generative relation extraction (RE) pipeline tailored to the study of interactions in the intestinal microbiome, a complex and low-resource biomedical domain. Our method leverages summarization with large language models (LLMs) to refine context before extracting relations via instruction-tuned generation. Preliminary results on a dedicated corpus show that summarization improves generative RE performance by reducing noise and guiding the model. However, BERT-based RE approaches still outperform generative models. This ongoing work demonstrates the potential of generative methods to support the study of specialized domains in low-resource settings.
https://arxiv.org/abs/2506.08647
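A skeletal sketch of the summarize-then-extract pipeline; `llm` is a hypothetical instruction-following callable and both prompts are invented for illustration, not taken from the paper:

```python
def llm(prompt: str) -> str:
    # Hypothetical wrapper around any instruction-tuned model.
    raise NotImplementedError("plug in your instruction-following LLM here")

def extract_relations(passage: str) -> str:
    # Step 1: summarization refines the context before extraction.
    summary = llm("Summarize the microbiome interactions described below, "
                  "keeping only entities and their relations:\n" + passage)
    # Step 2: instruction-tuned generative relation extraction.
    return llm("From the text, list triples as (entity1, relation, entity2). "
               "Use relations such as 'increases' or 'inhibits':\n" + summary)
```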
The computation of excited states in strongly interacting quantum many-body systems is of fundamental importance. Yet, it is notoriously challenging due to the exponential scaling of the Hilbert space dimension with the system size. Here, we introduce a neural network-based algorithm that can simultaneously output multiple low-lying excited states of a quantum many-body spin system in an accurate and efficient fashion. This algorithm, dubbed the neural quantum excited-state (NQES) algorithm, requires no explicit orthogonalization of the states and is generally applicable to higher dimensions. We demonstrate, through concrete examples including the Haldane-Shastry model with all-to-all interactions, that the NQES algorithm is capable of efficiently computing multiple excited states and their related observable expectations. In addition, we apply the NQES algorithm to two classes of long-range interacting trapped-ion systems in a two-dimensional Wigner crystal. For non-decaying all-to-all interactions with alternating signs, our computed low-lying excited states bear spatial correlation patterns similar to those of the ground states, which closely match recent experimental observations that the quasi-adiabatically prepared state accurately reproduces analytical ground-state correlations. For a system of up to 300 ions with power-law decaying antiferromagnetic interactions, we successfully uncover its gap scaling and correlation features. Our results establish a scalable and efficient algorithm for computing excited states of interacting quantum many-body systems, with potential applications ranging from benchmarking quantum devices to photoisomerization.
https://arxiv.org/abs/2506.08594
Vision-Language Navigation (VLN) enables intelligent agents to navigate environments by integrating visual perception and natural language instructions, yet faces significant challenges due to the scarcity of fine-grained cross-modal alignment annotations. Existing datasets primarily focus on global instruction-trajectory matching, neglecting sub-instruction-level and entity-level alignments critical for accurate navigation action decision-making. To address this limitation, we propose FCA-NIG, a generative framework that automatically constructs navigation instructions with dual-level fine-grained cross-modal annotations. In this framework, an augmented trajectory is first divided into sub-trajectories, which are then processed through GLIP-based landmark detection, crafted instruction construction, OFA-Speaker based R2R-like instruction generation, and CLIP-powered entity selection, generating sub-instruction-trajectory pairs with entity-landmark annotations. Finally, these sub-pairs are aggregated to form a complete instruction-trajectory pair. The framework generates the FCA-R2R dataset, the first large-scale augmentation dataset featuring precise sub-instruction-sub-trajectory and entity-landmark alignments. Extensive experiments demonstrate that training with FCA-R2R significantly improves the performance of multiple state-of-the-art VLN agents, including SF, EnvDrop, RecBERT, and HAMT. Incorporating sub-instruction-trajectory alignment enhances agents' state awareness and decision accuracy, while entity-landmark alignment further boosts navigation performance and generalization. These results highlight the effectiveness of FCA-NIG in generating high-quality, scalable training data without manual annotation, advancing fine-grained cross-modal learning in complex navigation tasks.
https://arxiv.org/abs/2506.08566
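A sketch of what the CLIP-powered entity-selection step could look like: scoring candidate entity phrases against a detected landmark crop; the candidate phrases and the blank stand-in image are assumptions, and the real pipeline feeds GLIP detections in:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

landmark_crop = Image.new("RGB", (224, 224))   # stand-in for a GLIP detection crop
candidates = ["a wooden staircase", "a potted plant", "a red armchair"]

inputs = processor(text=candidates, images=landmark_crop,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_candidates)
print("selected entity phrase:", candidates[int(logits.argmax())])
```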
Differential equations are involved in modeling many engineering problems, and many efforts have been devoted to solving them. Due to the flexibility of neural networks, Physics Informed Neural Networks (PINNs) have recently been proposed to solve complex differential equations and have demonstrated superior performance in many applications. While the L2 loss function is usually a default choice in PINNs, the corresponding numerical solutions have been shown to be inaccurate and unstable for some complex equations. In this work, we propose a new PINNs framework named Kernel Packet accelerated PINNs (KP-PINNs), which expresses the loss function in terms of the reproducing kernel Hilbert space (RKHS) norm and uses the Kernel Packet (KP) method to accelerate the computation. Theoretical results show that KP-PINNs can be stable across various differential equations. Numerical experiments illustrate that KP-PINNs can solve differential equations effectively and efficiently. This framework provides a promising direction for improving the stability and accuracy of PINNs-based solvers in scientific computing.
https://arxiv.org/abs/2506.08563
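One way to read the RKHS-norm loss, sketched with a Matern-1/2 kernel: treating the PDE residual values at collocation points through their minimal-norm interpolant gives the squared norm r^T K^{-1} r. This is an assumption about the construction; the Kernel Packet acceleration of the solve is not shown:

```python
import numpy as np

def rkhs_loss(x, r, lengthscale=0.2):
    # Gram matrix of a Matern-1/2 (Laplacian) kernel on the collocation grid;
    # the squared RKHS norm of the residual interpolant is r^T K^{-1} r.
    K = np.exp(-np.abs(x[:, None] - x[None, :]) / lengthscale)
    return r @ np.linalg.solve(K + 1e-10 * np.eye(len(x)), r)

x = np.linspace(0, 1, 50)
residual = 0.01 * np.sin(2 * np.pi * x)   # stand-in for PDE residual values
print(f"L2-type loss:   {np.mean(residual**2):.3e}")
print(f"RKHS-norm loss: {rkhs_loss(x, residual):.3e}")
```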
The rapid adoption of Small Language Models (SLMs) for on-device and resource-constrained deployments has outpaced our understanding of their ethical risks. To the best of our knowledge, we present the first large-scale audit of instruction-tuned SLMs spanning 0.5 to 5 billion parameters, an overlooked "middle tier" between BERT-class encoders and flagship LLMs. Our evaluation includes nine open-source models from the Qwen 2.5, LLaMA 3.2, Gemma 3, and Phi families. Using the BBQ benchmark under zero-shot prompting, we analyze both utility and fairness across ambiguous and disambiguated contexts. This evaluation reveals three key insights. First, competence and fairness need not be antagonistic: Phi models achieve F1 scores exceeding 90 percent while exhibiting minimal bias, showing that efficient and ethical NLP is attainable. Second, social bias varies significantly by architecture: Qwen 2.5 models may appear fair, but this often reflects vacuous neutrality, random guessing, or evasive behavior rather than genuine ethical alignment. In contrast, LLaMA 3.2 models exhibit stronger stereotypical bias, suggesting overconfidence rather than neutrality. Third, compression introduces nuanced trade-offs: 4-bit AWQ quantization improves F1 scores in ambiguous settings for LLaMA 3.2-3B but increases disability-related bias in Phi-4-Mini by over 7 percentage points. These insights provide practical guidance for the responsible deployment of SLMs in applications demanding fairness and efficiency, particularly benefiting small enterprises and resource-constrained environments.
https://arxiv.org/abs/2506.08487
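A hedged sketch of zero-shot multiple-choice scoring of the kind BBQ evaluations use, ranking answer options by causal-LM loss; the example item is invented (not a BBQ sample), and the checkpoint is one member of the audited families, chosen here for size:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def option_loss(context: str, option: str) -> float:
    # Score a candidate answer by the LM loss of the full sequence;
    # lower loss means the model prefers that completion.
    ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Invented BBQ-style ambiguous item for illustration only.
context = ("A man and a woman both interviewed for the engineering role. "
           "Who was unqualified? Answer: ")
options = ["The man", "The woman", "Cannot be determined"]
print(min(options, key=lambda o: option_loss(context, o)))
# In ambiguous contexts, BBQ treats "Cannot be determined" as the unbiased answer.
```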
The rise of large language models (LLMs) has created new possibilities for digital twins in healthcare. However, the deployment of such systems in consumer health contexts raises significant concerns related to hallucination, bias, lack of transparency, and ethical misuse. In response to recommendations from health authorities such as the World Health Organization (WHO), we propose Responsible Health Twin (RHealthTwin), a principled framework for building and governing AI-powered digital twins for well-being assistance. RHealthTwin processes multimodal inputs that guide a health-focused LLM to produce safe, relevant, and explainable responses. At the core of RHealthTwin is the Responsible Prompt Engine (RPE), which addresses the limitations of traditional LLM configuration. Conventionally, users supply an unstructured prompt and system instruction to configure the LLM, which increases the risk of hallucination. In contrast, RPE extracts predefined slots dynamically to structure both inputs. This guides the language model to generate responses that are context aware, personalized, fair, reliable, and explainable for well-being assistance. The framework further adapts over time through a feedback loop that updates the prompt structure based on user satisfaction. We evaluate RHealthTwin across four consumer health domains including mental support, symptom triage, nutrition planning, and activity coaching. RPE achieves state-of-the-art results with BLEU = 0.41, ROUGE-L = 0.63, and BERTScore = 0.89 on benchmark datasets. Also, we achieve over 90% in ethical compliance and instruction-following metrics using LLM-as-judge evaluation, outperforming baseline strategies. We envision RHealthTwin as a forward-looking foundation for responsible LLM-based applications in health and well-being.
https://arxiv.org/abs/2506.08486
Social media platforms are critical spaces for public discourse, shaping opinions and community dynamics, yet their widespread use has amplified harmful content, particularly hate speech, threatening online safety and inclusivity. While hate speech detection has been extensively studied in languages like English and Spanish, Urdu remains underexplored, especially using translation-based approaches. To address this gap, we introduce a trilingual dataset of 10,193 tweets in English (3,834 samples), Urdu (3,197 samples), and Spanish (3,162 samples), collected via keyword filtering, with a balanced distribution of 4,849 Hateful and 5,344 Not-Hateful labels. Our methodology leverages attention layers as a precursor to transformer-based models and large language models (LLMs), enhancing feature extraction for multilingual hate speech detection. For non-transformer models, we use TF-IDF for feature extraction. The dataset is benchmarked using state-of-the-art models, including GPT-3.5 Turbo and Qwen 2.5 72B, alongside traditional machine learning models such as SVM and fine-tuned transformers (e.g., BERT, RoBERTa). Three annotators, following rigorous guidelines, ensured high dataset quality, achieving a Fleiss' Kappa of 0.821. Our approach, integrating attention layers with GPT-3.5 Turbo and Qwen 2.5 72B, achieves strong performance, with macro F1 scores of 0.87 for English (GPT-3.5 Turbo), 0.85 for Spanish (GPT-3.5 Turbo), 0.81 for Urdu (Qwen 2.5 72B), and 0.88 for the joint multilingual model (Qwen 2.5 72B). These results reflect improvements of 8.75% in English (over SVM baseline 0.80), 8.97% in Spanish (over SVM baseline 0.78), 5.19% in Urdu (over SVM baseline 0.77), and 7.32% in the joint multilingual model (over SVM baseline 0.82). Our framework offers a robust solution for multilingual hate speech detection, fostering safer digital communities worldwide.
https://arxiv.org/abs/2506.08147
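A minimal sketch of an attention-weighted pooling layer of the kind described as a precursor to the classifiers; the dimensions and the single-head wiring are assumptions:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned attention over token embeddings, producing one sentence
    vector to feed a hate/not-hate classifier head (details assumed)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, token_emb, mask):
        # token_emb: (batch, seq, dim); mask: (batch, seq), 1 = real token
        w = self.score(token_emb).squeeze(-1)
        w = w.masked_fill(mask == 0, float("-inf")).softmax(dim=-1)
        return torch.einsum("bs,bsd->bd", w, token_emb)

pool = AttentionPooling(dim=768)
emb = torch.randn(2, 16, 768)          # stand-in transformer token embeddings
mask = torch.ones(2, 16)
print(pool(emb, mask).shape)           # torch.Size([2, 768])
```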
Large language models (LLMs) exhibit pronounced conservative bias in relation extraction tasks, frequently defaulting to the No_Relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson's choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.
https://arxiv.org/abs/2506.08120
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch -- attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles -- generating adversarial prompts and safeguarding against them -- while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
https://arxiv.org/abs/2506.07468
Spatial awareness is a critical capability for embodied agents, as it enables them to anticipate and reason about unobserved regions. The primary challenge arises from learning the distribution of indoor semantics, complicated by sparse, imbalanced object categories and diverse spatial scales. Existing methods struggle to robustly generate unobserved areas in real time and do not generalize well to new environments. To this end, we propose MapBERT, a novel framework designed to effectively model the distribution of unseen spaces. Motivated by the observation that the one-hot encoding of semantic maps aligns naturally with the binary structure of bit encoding, we, for the first time, leverage a lookup-free BitVAE to encode semantic maps into compact bitwise tokens. Building on this, a masked transformer is employed to infer missing regions and generate complete semantic maps from limited observations. To enhance object-centric reasoning, we propose an object-aware masking strategy that masks entire object categories concurrently and pairs them with learnable embeddings, capturing implicit relationships between object embeddings and spatial tokens. By learning these relationships, the model more effectively captures indoor semantic distributions crucial for practical robotic tasks. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions.
https://arxiv.org/abs/2506.07350
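A toy sketch of the object-aware masking idea at the semantic-category level; note the paper actually masks the BitVAE's bitwise token grid, so this category-level view is a simplification:

```python
import torch

def object_aware_mask(token_grid, category_ids, mask_token=0):
    """Mask every token belonging to the sampled object categories at once,
    rather than masking random positions (a sketch of the strategy)."""
    masked = token_grid.clone()
    for cid in category_ids:
        masked[token_grid == cid] = mask_token
    return masked

# Toy 4x4 semantic-map grid: 1 = floor, 2 = chair, 3 = table
grid = torch.tensor([[1, 1, 2, 2],
                     [1, 3, 3, 1],
                     [1, 3, 3, 1],
                     [2, 2, 1, 1]])
print(object_aware_mask(grid, category_ids=[3]))  # all "table" tokens masked
```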
We present JavelinGuard, a suite of low-cost, high-performance model architectures designed for detecting malicious intent in Large Language Model (LLM) interactions, optimized specifically for production deployment. Recent advances in transformer architectures, including compact BERT (Devlin et al., 2019) variants such as ModernBERT (Warner et al., 2024), allow us to build highly accurate classifiers with as few as approximately 400M parameters that achieve rapid inference speeds even on standard CPU hardware. We systematically explore five progressively sophisticated transformer-based architectures: Sharanga (baseline transformer classifier), Mahendra (enhanced attention-weighted pooling with deeper heads), Vaishnava and Ashwina (hybrid neural ensemble architectures), and Raudra (an advanced multi-task framework with specialized loss functions). Our models are rigorously benchmarked across nine diverse adversarial datasets, including popular sets like the NotInject series, BIPIA, Garak, ImprovedLLM, ToxicChat, WildGuard, and our newly introduced JavelinBench, specifically crafted to test generalization on challenging borderline and hard-negative cases. Additionally, we compare our architectures against leading open-source guardrail models as well as large decoder-only LLMs such as gpt-4o, demonstrating superior cost-performance trade-offs in terms of accuracy and latency. Our findings reveal that while Raudra's multi-task design offers the most robust performance overall, each architecture presents unique trade-offs in speed, interpretability, and resource requirements, guiding practitioners in selecting the optimal balance of complexity and efficiency for real-world LLM security applications.
https://arxiv.org/abs/2506.07330
The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on datasets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.
https://arxiv.org/abs/2506.07295
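For readers unfamiliar with the hyperparameter itself, a self-contained sketch of how temperature reshapes the output distribution before sampling:

```python
import numpy as np

def sample_dist(logits, T):
    # Temperature divides the logits before softmax: T < 1 sharpens the
    # distribution toward the argmax, T > 1 flattens it toward uniform.
    z = logits / T
    p = np.exp(z - z.max())   # subtract max for numerical stability
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.2])
for T in (0.5, 1.0, 2.0):
    print(T, np.round(sample_dist(logits, T), 3))
```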
Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
https://arxiv.org/abs/2506.07248
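A hedged sketch of the sentence-ranking idea: score sentences by normalized TF-IDF mass blended with sentence length, then keep the top-k in original order. The blending weight here is a guess, not the paper's tuned strategy:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(sentences, top_k=2, alpha=0.7):
    """Select the most informative sentences; alpha balances normalized
    TF-IDF score against sentence length (assumed weighting)."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (n_sents, vocab)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    scores = scores / (scores.max() + 1e-12)
    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
    lengths = lengths / lengths.max()
    blended = alpha * scores + (1 - alpha) * lengths
    keep = sorted(np.argsort(blended)[::-1][:top_k])     # preserve document order
    return [sentences[i] for i in keep]

doc = ["The election results were announced on Tuesday.",
       "Officials counted ballots across all districts.",
       "It was a sunny day.",
       "Turnout reached a record high of 71 percent."]
print(rank_sentences(doc))
```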
3D Gaussian Splatting (3DGS) has recently gained significant attention for high-quality and efficient view synthesis, making it widely adopted in fields such as AR/VR, robotics, and autonomous driving. Despite its impressive algorithmic performance, real-time rendering on resource-constrained devices remains a major challenge due to tight power and area budgets. This paper presents an architecture-algorithm co-design to address these inefficiencies. First, we reveal substantial redundancy caused by repeated computation of common terms/expressions during the conventional rasterization. To resolve this, we propose axis-oriented rasterization, which pre-computes and reuses shared terms along both the X and Y axes through a dedicated hardware design, effectively reducing multiply-and-add (MAC) operations by up to 63%. Second, by identifying the resource and performance inefficiency of the sorting process, we introduce a novel neural sorting approach that predicts order-independent blending weights using an efficient neural network, eliminating the need for costly hardware sorters. A dedicated training framework is also proposed to improve its algorithmic stability. Third, to uniformly support rasterization and neural network inference, we design an efficient reconfigurable processing array that maximizes hardware utilization and throughput. Furthermore, we introduce a π-trajectory tile schedule, inspired by Morton encoding and Hilbert curve, to optimize Gaussian reuse and reduce memory access overhead. Comprehensive experiments demonstrate that the proposed design preserves rendering quality while achieving a speedup of 23.4x to 27.8x and energy savings of 28.8x to 51.4x compared to edge GPUs for real-world scenes. We plan to open-source our design to foster further development in this field.
https://arxiv.org/abs/2506.07069
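A numpy illustration of the redundancy that axis-oriented rasterization exploits: in the 2D Gaussian exponent, the pure-x and pure-y terms can be computed once per row/column of a tile and reused, leaving only the cross term as per-pixel work. The dedicated hardware design itself is not modeled here:

```python
import numpy as np

# Exponent of a splatted 2D Gaussian: q(x, y) = a*dx^2 + 2*b*dx*dy + c*dy^2,
# where [[a, b], [b, c]] is the conic (inverse covariance).
a, b, c = 2.0, 0.3, 1.5
mx, my = 8.0, 8.0                            # Gaussian mean in pixel coordinates
xs, ys = np.arange(16.0), np.arange(16.0)    # one 16x16 tile

qx = a * (xs - mx) ** 2                      # depends only on x: shared per column
qy = c * (ys - my) ** 2                      # depends only on y: shared per row
cross = 2 * b * np.outer(ys - my, xs - mx)   # the only genuinely per-pixel term
q = qx[None, :] + qy[:, None] + cross
alpha = np.exp(-0.5 * q)                     # per-pixel Gaussian weight
print(alpha.shape)                           # (16, 16)
```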
The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack--which, ironically, increases the true positive rate at a 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT--adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors--including neural network-based, watermark-based, and zero-shot approaches--our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
https://arxiv.org/abs/2506.07001
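A simplified sketch of detector-guided paraphrasing as best-of-n selection; the paper's guidance is more tightly coupled to decoding than this. `paraphrase_candidates` is a placeholder, and the detector checkpoint and its label names are assumptions about one public OpenAI-RoBERTa detector:

```python
from transformers import pipeline

# Detector used as the guidance signal; label names ("Real"/"Fake") may
# differ by checkpoint, so verify before relying on them.
detector = pipeline("text-classification",
                    model="openai-community/roberta-large-openai-detector")

def paraphrase_candidates(text: str, n: int) -> list[str]:
    raise NotImplementedError("call an instruction-following LLM here")

def adversarial_paraphrase(ai_text: str, n: int = 8) -> str:
    def ai_score(t: str) -> float:
        out = detector(t)[0]
        return out["score"] if out["label"] == "Fake" else 1.0 - out["score"]
    # Keep the candidate the detector finds least AI-like.
    return min(paraphrase_candidates(ai_text, n), key=ai_score)
```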