Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER's merit both with intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay, and on downstream tasks using BERT-architecture models trained for Hebrew.
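To make the intuition concrete, the following toy Python sketch shows the kind of rearrangement SPLINTER aims for: pulling a word's root consonants into one contiguous span so a subword tokenizer can learn it as a unit. The `linearize` function, the root argument, and the separator are hypothetical illustrations, not the paper's actual algorithm.

```python
def linearize(word: str, root: str) -> str:
    """Move the root consonants to the front; keep the residual template after."""
    matched, residual = [], []
    root_chars = list(root)
    for ch in word:
        if root_chars and ch == root_chars[0]:
            matched.append(root_chars.pop(0))
        else:
            residual.append(ch)
    return "".join(matched) + "|" + "".join(residual)

# Arabic "kataba" (he wrote) and "kutiba" (it was written) share the root k-t-b.
print(linearize("kataba", "ktb"))  # ktb|aaa
print(linearize("kutiba", "ktb"))  # ktb|uia
# After rearrangement the shared root "ktb" is contiguous in both surface forms,
# so a BPE/UnigramLM vocabulary can pick it up as a single meaningful token.
```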
https://arxiv.org/abs/2503.14433
The advancement of large language models (LLMs) has allowed them to be proficient in various tasks, including content generation. However, their unregulated usage can lead to malicious activities such as plagiarism and generating and spreading fake news, especially for low-resource languages. Most existing machine-generated text detectors are trained on high-resource languages like English, French, etc. In this study, we developed the first large-scale detector that can distinguish between human- and machine-generated content in Hausa. We scraped seven Hausa-language media outlets for the human-generated text and used the Gemini-2.0 Flash model to automatically generate corresponding Hausa-language articles from the human-written article headlines. We fine-tuned four pre-trained Afri-centric models (AfriTeVa, AfriBERTa, AfroXLMR, and AfroXLMR-76L) on the resulting dataset and assessed their performance using accuracy and F1-score metrics. AfroXLMR achieved the highest performance with an accuracy of 99.23% and an F1 score of 99.21%, demonstrating its effectiveness for Hausa text detection. Our dataset is made publicly available to enable further research.
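A minimal sketch of the fine-tuning setup with Hugging Face transformers follows; the AfroXLMR checkpoint name, hyperparameters, and the two placeholder rows are assumptions, not the authors' exact configuration.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "Davlan/afro-xlmr-base"  # assumed AfroXLMR checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Placeholder rows; the real dataset pairs scraped news articles (label 0) with
# Gemini-2.0 Flash generations from the same headlines (label 1).
train_ds = Dataset.from_dict({"text": ["labari na farko", "labari na biyu"],
                              "label": [0, 1]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hausa-detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds.map(tokenize, batched=True),
)
trainer.train()
```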
https://arxiv.org/abs/2503.13101
Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained ALBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at this https URL.
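As a rough picture of the architecture, here is a plain-PyTorch sketch of a bottleneck task adapter and a softmax-weighted fusion over twelve of them; the layer sizes and the simple learned-weight fusion are assumptions standing in for the paper's adapter-fusion setup on ALBERT.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter applied on top of frozen transformer states."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual connection

class AdapterFusion(nn.Module):
    """Softmax-weighted combination of the outputs of several task adapters."""
    def __init__(self, adapters):
        super().__init__()
        self.adapters = nn.ModuleList(adapters)
        self.weights = nn.Parameter(torch.zeros(len(adapters)))

    def forward(self, h):
        outs = torch.stack([a(h) for a in self.adapters])  # (n, batch, seq, hidden)
        w = torch.softmax(self.weights, dim=0)
        return torch.tensordot(w, outs, dims=1)            # weighted sum over n

# Twelve adapters, one per physical-property/association task, fused on top of
# frozen ALBERT hidden states of shape (batch, seq, 768).
fusion = AdapterFusion([Adapter() for _ in range(12)])
print(fusion(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```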
https://arxiv.org/abs/2503.12667
This paper presents UniBERT, a compact multilingual language model that leverages an innovative training framework integrating three components: masked language modeling, adversarial training, and knowledge distillation. Pre-trained on a meticulously curated Wikipedia corpus spanning 107 languages, UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks. Comprehensive evaluations on four tasks -- named entity recognition, natural language inference, question answering, and semantic textual similarity -- demonstrate that our multilingual training strategy, enhanced by an adversarial objective, significantly improves cross-lingual generalization. Specifically, UniBERT models show an average relative improvement of 7.72%, compared with only 1.17% for traditional baselines, and statistical analysis confirms the significance of these gains (p-value = 0.0181). This work highlights the benefits of combining adversarial training and knowledge distillation to build scalable and robust language models, thereby advancing the field of multilingual and cross-lingual natural language processing.
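A sketch of how the three objectives could be combined into a single training loss, assuming a student model, a frozen teacher, and an adversarial language discriminator; the weighting coefficients and the negated-cross-entropy adversarial term (often implemented with gradient reversal instead) are illustrative assumptions, not UniBERT's published recipe.

```python
import torch.nn.functional as F

def unibert_style_loss(mlm_logits, mlm_labels,          # masked language modeling
                       student_hidden, teacher_hidden,  # knowledge distillation
                       disc_logits, lang_labels,        # adversarial language classifier
                       alpha=1.0, beta=0.5, gamma=0.1): # hypothetical weights
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    distill = F.mse_loss(student_hidden, teacher_hidden)
    # The encoder is rewarded for *confusing* the language discriminator,
    # pushing its representations toward language neutrality.
    adversarial = -F.cross_entropy(disc_logits, lang_labels)
    return alpha * mlm + beta * distill + gamma * adversarial
```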
https://arxiv.org/abs/2503.12608
A nonparametric adaptive crane control system is proposed in which the crane payload tracks a desired trajectory with feedback from the payload position. The payload motion is controlled through the position of the crane tip using partial feedback linearization, made possible by a novel model structure given in Cartesian coordinates. This Cartesian model structure makes it possible to implement a nonparametric adaptive controller that cancels disturbances by approximating the effects of unknown disturbance forces and structurally unknown dynamics in a reproducing kernel Hilbert space (RKHS). It is shown that the nonparametric adaptive controller leads to uniformly ultimately bounded errors in the presence of unknown forces and unmodeled dynamics. Moreover, the Cartesian formulation is shown to offer advantages for payload tracking control even in the non-adaptive case. The performance of the nonparametric adaptive controller is validated with good results in simulations and experiments.
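To illustrate the nonparametric ingredient, the sketch below approximates an unknown disturbance as an expansion over Gaussian kernels and adapts the coefficients online from the tracking error; the gains, kernel width, and update law are illustrative assumptions, not the paper's controller.

```python
import numpy as np

class RKHSDisturbanceEstimator:
    def __init__(self, centers, width=0.5, gain=5.0):
        self.centers = np.asarray(centers)    # kernel centers in state space
        self.width = width
        self.gain = gain
        self.coeffs = np.zeros(len(centers))  # RKHS expansion coefficients

    def kernel(self, x):
        return np.exp(-np.sum((self.centers - x) ** 2, axis=1)
                      / (2 * self.width ** 2))

    def estimate(self, x):
        # d_hat(x) = sum_i a_i * k(x, x_i)
        return self.coeffs @ self.kernel(x)

    def adapt(self, x, tracking_error, dt):
        # Gradient-type update driven by the tracking error, so the estimate
        # grows to cancel the disturbance in the feedback-linearized dynamics.
        self.coeffs += self.gain * tracking_error * self.kernel(x) * dt

est = RKHSDisturbanceEstimator(centers=np.linspace(-1.0, 1.0, 20).reshape(-1, 1))
est.adapt(np.array([0.3]), tracking_error=0.05, dt=0.01)
print(est.estimate(np.array([0.3])))
```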
https://arxiv.org/abs/2503.12250
Clinically acquired brain MRIs and radiology reports are valuable but underutilized resources due to the challenges of manual analysis and data heterogeneity. We fine-tuned language models (LMs) to classify brain MRI reports as normal (reports with limited pathology) or abnormal, training BERT, BioBERT, ClinicalBERT, and RadBERT on 44,661 reports. We also explored the reasoning capabilities of a leading LM, Gemini 1.5-Pro, for normal report categorization. Automated image processing and modeling generated brain growth charts from LM-classified normal scans, which we compared to human-derived charts. Fine-tuned LMs achieved high classification performance (F1-Score >97%), with unbalanced training mitigating class imbalance. Performance was robust on out-of-distribution data, with full text outperforming summary (impression) sections. Gemini 1.5-Pro showed promising categorization performance, especially with clinical inference. LM-derived brain growth charts were nearly identical to human-annotated charts (r = 0.99, p < 2.2e-16). Our LMs offer scalable analysis of radiology reports, enabling automated classification of brain MRIs in large datasets. One application is automated generation of brain growth charts for benchmarking quantitative image features. Further research is needed to address data heterogeneity and optimize LM reasoning.
https://arxiv.org/abs/2503.12143
Large Language Models (LLMs) may portray discrimination towards certain individuals, especially those characterized by multiple attributes (aka intersectional bias). Discovering intersectional bias in LLMs is challenging, as it involves complex inputs on multiple attributes (e.g. race and gender). To address this challenge, we propose HInter, a test technique that synergistically combines mutation analysis, dependency parsing and metamorphic oracles to automatically detect intersectional bias in LLMs. HInter generates test inputs by systematically mutating sentences using multiple mutations, validates inputs via a dependency invariant and detects biases by checking the LLM response on the original and mutated sentences. We evaluate HInter using six LLM architectures and 18 LLMs (GPT-3.5, Llama2, BERT, etc.) and find that 14.61% of the inputs generated by HInter expose intersectional bias. Results also show that our dependency invariant reduces false positives (incorrect test inputs) by an order of magnitude. Finally, we observed that 16.62% of intersectional bias errors are hidden, meaning that their corresponding atomic cases do not trigger biases. Overall, this work emphasizes the importance of testing LLMs for intersectional bias.
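The sketch below illustrates an HInter-style test loop: mutate a sentence along two protected attributes at once, keep only mutants whose dependency parse is unchanged, and flag a metamorphic-oracle violation when the model's answer differs. The tiny attribute lexicons are placeholders, and `query_llm` is a hypothetical stand-in for a call to the model under test.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
GENDER = {"he": "she"}            # placeholder lexicons; HInter uses larger ones
RACE = {"European": "African"}

def mutate(sentence: str) -> list[str]:
    """Apply gender and race substitutions jointly (intersectional mutants)."""
    return [m for g_src, g_tgt in GENDER.items()
            for r_src, r_tgt in RACE.items()
            if (m := sentence.replace(g_src, g_tgt).replace(r_src, r_tgt)) != sentence]

def dependency_tree(sentence: str):
    return [(tok.dep_, tok.head.i) for tok in nlp(sentence)]

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: ask the LLM under test

def exposes_bias(sentence: str) -> bool:
    base = query_llm(sentence)
    return any(query_llm(m) != base                               # metamorphic oracle
               for m in mutate(sentence)
               if dependency_tree(m) == dependency_tree(sentence))  # dependency invariant
```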
https://arxiv.org/abs/2503.11962
This study aims to develop an efficient and accurate model for detecting malicious comments, addressing the increasingly severe issue of false and harmful content on social media platforms. We propose a deep learning model that combines BERT and BiLSTM. The BERT model, through pre-training, captures deep semantic features of text, while the BiLSTM network excels at processing sequential data and can further model the contextual dependencies of text. Experimental results on the Jigsaw Unintended Bias in Toxicity Classification dataset demonstrate that the BERT+BiLSTM model achieves superior performance in malicious comment detection tasks, with a precision of 0.94, recall of 0.93, and accuracy of 0.94. This surpasses other models, including standalone BERT, TextCNN, TextRNN, and traditional machine learning algorithms using TF-IDF features. These results confirm the superiority of the BERT+BiLSTM model in handling imbalanced data and capturing deep semantic features of malicious comments, providing an effective technical means for social media content moderation and for maintaining a healthy online environment.
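A plausible reading of the architecture, sketched in PyTorch: BERT's contextual token embeddings feed a bidirectional LSTM, and a pooled representation drives a binary classifier. The LSTM hidden size and mean pooling are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTM(nn.Module):
    def __init__(self, lstm_hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, 2)  # toxic / non-toxic

    def forward(self, input_ids, attention_mask):
        states = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(states)                  # BiLSTM over token states
        return self.classifier(lstm_out.mean(dim=1))     # mean-pool over tokens

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tok(["you are wonderful", "you are awful"], padding=True, return_tensors="pt")
print(BertBiLSTM()(batch["input_ids"], batch["attention_mask"]).shape)  # (2, 2)
```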
https://arxiv.org/abs/2503.11084
Applying deep learning and computational intelligence to finance has been a popular area of applied research, both within academia and industry, and continues to attract active attention. The inherently high volatility and non-stationarity of the data pose substantial challenges to machine learning models, especially so for today's expressive and highly parameterized deep learning models. Recent work that augments models based purely on historical price data with natural language processing of social media data has received particular attention. Previous work has achieved state-of-the-art performance on this task by combining techniques such as bidirectional GRUs, variational autoencoders, word and document embeddings, self-attention, graph attention, and adversarial training. In this paper, we demonstrate the efficacy of BERTweet, a variant of BERT pre-trained specifically on a Twitter corpus, and of the transformer architecture, achieving performance competitive with the existing literature and setting a new baseline for the Matthews Correlation Coefficient on the Stocknet dataset without auxiliary data sources.
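For reference, the headline metric is straightforward to compute; the sketch below scores dummy next-day movement predictions (such as those a fine-tuned vinai/bertweet-base classifier would emit) with scikit-learn's Matthews Correlation Coefficient.

```python
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual movement (1 = price up)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # dummy classifier predictions
print(matthews_corrcoef(y_true, y_pred))  # 0.5 on this toy example
```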
https://arxiv.org/abs/2503.10957
The increasing volume of textual data poses challenges in reading and comprehending large documents, particularly for scholars who need to extract useful information from research articles. Automatic text summarization has emerged as a powerful tool to condense lengthy documents into concise and informative summaries. Depending on the approach used, text summarization can be categorized as either extractive or abstractive. While extractive methods are commonly used due to their simplicity, they often miss important information. Abstractive summarization, on the other hand, can generate more coherent and informative summaries by understanding the underlying meaning of the text. Abstractive techniques have gained attention in various languages, and recent advancements have been achieved through pre-trained models such as BERT, BART, and T5. However, the challenge of summarizing long documents remains, and alternative models like Longformer have been introduced to address this limitation. In this context, this paper focuses on abstractive summarization in the Persian language. The authors introduce a new dataset of 300,000 full-text Persian papers obtained from the Ensani website and apply the ARMAN model, based on the Longformer architecture, to generate summaries. The experimental results demonstrate promising performance in Persian text summarization. The paper provides a comprehensive overview of related work, discusses the methodology, presents the experimental results, and concludes with future research directions.
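As a sketch of the long-document setup, the snippet below runs a Longformer encoder-decoder (LED) for abstractive summarization; the English LED checkpoint is a stand-in for the Persian ARMAN weights, which are a different model.

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

name = "allenai/led-base-16384"  # Longformer encoder-decoder, 16k-token inputs
tokenizer = LEDTokenizer.from_pretrained(name)
model = LEDForConditionalGeneration.from_pretrained(name)

paper_text = "..."  # placeholder: full text of one paper from the corpus
inputs = tokenizer(paper_text, return_tensors="pt",
                   truncation=True, max_length=16384)
global_attention = torch.zeros_like(inputs["input_ids"])
global_attention[:, 0] = 1  # global attention on the first token, as LED expects

summary_ids = model.generate(inputs["input_ids"],
                             global_attention_mask=global_attention,
                             max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```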
https://arxiv.org/abs/2503.10233
Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained transformers such as BERT and Mental-RoBERTa, and fine-tuned open-source LLMs such as Mental-Alpaca and Mental-FLAN-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines on Depression Severity and CSSRS prediction suggest dataset-specific limitations, likely due to our use of a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
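To make the SC-CoT strategy concrete, here is a sketch that samples several chain-of-thought completions for one post and majority-votes the extracted labels; the prompt, `sample_cot`, and the label format are hypothetical stand-ins for the paper's setup.

```python
from collections import Counter

FEW_SHOT_COT = """Post: "I can't sleep and everything feels pointless."
Reasoning: The author reports anhedonia and sleep disturbance, both depression markers.
Label: depression

Post: "{post}"
Reasoning:"""

def sample_cot(prompt: str, temperature: float) -> str:
    raise NotImplementedError  # hypothetical: one sampled LLM completion

def extract_label(completion: str) -> str:
    return completion.rsplit("Label:", 1)[-1].strip().lower()

def classify_sc_cot(post: str, n_samples: int = 5) -> str:
    prompt = FEW_SHOT_COT.format(post=post)
    labels = [extract_label(sample_cot(prompt, temperature=0.7))
              for _ in range(n_samples)]
    return Counter(labels).most_common(1)[0][0]  # majority vote over samples
```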
https://arxiv.org/abs/2503.10095
Singing voice beat tracking is a challenging task due to the absence of musical accompaniment, which often contains the robust rhythmic and harmonic patterns that most existing beat tracking systems rely on and that can be essential for estimating beats. In this paper, a novel temporal convolutional network-based beat-tracking approach featuring self-supervised learning (SSL) representations and adapter tuning is proposed to track the beat and downbeat of singing voices jointly. The SSL DistilHuBERT representations are utilized to capture the semantic information of singing voices and are further fused with generic spectral features to facilitate beat estimation. Sources of variability that are particularly prominent in the non-homogeneous singing voice data are reduced by efficient adapter tuning. Extensive experiments show that feature fusion and adapter tuning each improve performance individually, and their combination leads to significantly better performance than the un-adapted baseline system, with up to 31.6% and 42.4% absolute F1-score improvements on beat and downbeat tracking, respectively.
https://arxiv.org/abs/2503.10086
In personalized technology and psychological research, precisely detecting demographic features and personality traits from digital interactions is increasingly important. Whereas conventional personality prediction techniques mostly depend on explicitly self-reported labels, this work investigates implicit categorization: inferring personality and gender variables directly from linguistic patterns in Telegram conversation data. We fine-tune a Transformer-based language model (RoBERTa) to capture complex linguistic cues indicative of personality traits and gender differences, using a dataset comprising 138,866 messages from 1,602 users annotated with MBTI types and 195,016 messages from 2,598 users annotated with gender. Filtering predictions by confidence level raises personality-classification accuracy to 86.16%, demonstrating RoBERTa's capacity to consistently identify implicit personality types from conversational text. For gender classification, the model obtained an accuracy of 74.4%, capturing gender-specific language patterns. Personality dimension analysis showed that people with introverted and intuitive preferences are especially active in text-based interactions. Overall, the results highlight the usefulness of Transformer architectures for implicit personality and gender classification, while underscoring the practical trade-off between accuracy and coverage in realistic conversational environments.
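The accuracy/coverage trade-off mentioned above can be implemented as a simple confidence filter: label a message only when its softmax probability clears a threshold. The threshold value below is an assumption.

```python
import torch

def confident_predictions(logits: torch.Tensor, threshold: float = 0.8):
    probs = torch.softmax(logits, dim=-1)
    conf, preds = probs.max(dim=-1)
    keep = conf >= threshold               # messages the model may label
    coverage = keep.float().mean().item()  # fraction of messages labeled
    return preds[keep], coverage

logits = torch.tensor([[2.5, 0.1],   # confident -> kept
                       [0.6, 0.5],   # ambiguous -> abstained
                       [0.1, 3.0]])  # confident -> kept
preds, coverage = confident_predictions(logits)
print(preds, coverage)  # tensor([0, 1]) 0.666...
```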
https://arxiv.org/abs/2503.09853
Foodborne gastrointestinal (GI) illness is a common cause of ill health in the UK. However, many cases do not interact with the healthcare system, posing significant challenges for traditional surveillance methods. The growth of publicly available online restaurant reviews and advancements in large language models (LLMs) present potential opportunities to extend disease surveillance by identifying public reports of GI illness. In this study, we introduce a novel annotation schema, developed with experts in GI illness, applied to the Yelp Open Dataset of reviews. Our annotations extend beyond binary disease detection, to include detailed extraction of information on symptoms and foods. We evaluate the performance of open-weight LLMs across these three tasks: GI illness detection, symptom extraction, and food extraction. We compare this performance to RoBERTa-based classification models fine-tuned specifically for these tasks. Our results show that using prompt-based approaches, LLMs achieve micro-F1 scores of over 90% for all three of our tasks. Using prompting alone, we achieve micro-F1 scores that exceed those of smaller fine-tuned models. We further demonstrate the robustness of LLMs in GI illness detection across three bias-focused experiments. Our results suggest that publicly available review text and LLMs offer substantial potential for public health surveillance of GI illness by enabling highly effective extraction of key information. While LLMs appear to exhibit minimal bias in processing, the inherent limitations of restaurant review data highlight the need for cautious interpretation of results.
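For the extraction tasks, micro-F1 over a multi-label symptom set can be computed as sketched below; the symptom list and predictions are toy values, not the paper's annotation schema.

```python
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

SYMPTOMS = ["vomiting", "diarrhea", "nausea", "fever"]
gold = [{"vomiting", "nausea"}, {"diarrhea"}, set()]  # annotated reviews
pred = [{"vomiting"}, {"diarrhea", "fever"}, set()]   # LLM extractions

mlb = MultiLabelBinarizer(classes=SYMPTOMS)
print(f1_score(mlb.fit_transform(gold), mlb.transform(pred),
               average="micro"))  # 0.667: 2 TP, 1 FP (fever), 1 FN (nausea)
```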
https://arxiv.org/abs/2503.09743
Detecting harmful and non-inclusive terminology in technical contexts is critical for fostering inclusive environments in computing. This study explores the impact of model architecture on harmful language detection by evaluating a curated database of technical terms, each paired with specific use cases. We tested a range of encoder, decoder, and encoder-decoder language models, including BERT-base-uncased, RoBERTa-large-mnli, Gemini Flash 1.5 and 2.0, GPT-4, Claude AI Sonnet 3.5, T5-large, and BART-large-mnli. Each model was presented with a standardized prompt to identify harmful and non-inclusive language across 64 terms. Results reveal that decoder models, particularly Gemini Flash 2.0 and Claude AI, excel in nuanced contextual analysis, while encoder models like BERT exhibit strong pattern recognition but struggle with classification certainty. We discuss the implications of these findings for improving automated detection tools and highlight model-specific strengths and limitations in fostering inclusive communication in technical domains.
https://arxiv.org/abs/2503.09341
The rise of Generative AI has led to a surge in AI-generated reviews, often posing a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services, and authentic reviews play a vital role in consumer decision-making. The presence of fabricated content misleads consumers, undermines trust, and facilitates potential fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages where research in this domain is relatively under-explored. We explored a range of approaches, from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa, and MalayalamBERT. Our findings highlight the effectiveness of state-of-the-art transformers in accurately identifying AI-generated content, demonstrating their potential to enhance fake review detection in low-resource language settings.
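One of the traditional baselines can be sketched as character n-gram TF-IDF with logistic regression, a common default for morphologically rich scripts; the texts and labels below are placeholders for the Tamil/Malayalam review data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a human-written product review", "an AI-generated product review"]
labels = [0, 1]  # 0 = human-written, 1 = AI-generated

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(texts))
```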
https://arxiv.org/abs/2503.09289
In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from academic research to practical applications are essential for economic growth. Existing data infrastructures, however, face three major limitations: fragmentation, incomplete coverage, and insufficient evaluative capacity. This paper introduces DeepInnovationAI, the first comprehensive global dataset designed to bridge the gap between academic papers and industrial patents and to document AI innovation trajectories. The dataset comprises three structured files. The first (this http URL) contains 2,356,204 patent records with 8 field-specific attributes; the second (this http URL) encompasses 3,511,929 academic publications with 13 metadata fields. These two files employ large language models, multilingual text analysis, and dual-layer BERT classifiers to accurately identify AI-related content, and use hypergraph analysis methods to create robust innovation metrics. The third file (this http URL) applies semantic vector proximity analysis to present approximately one hundred million computed paper-patent similarity pairs, enhancing understanding of how theoretical advancements translate into commercial technologies. This enables researchers, policymakers, and industry leaders to anticipate trends and identify emerging areas for collaboration. With its extensive temporal and geographical scope, DeepInnovationAI supports detailed analysis of technological development patterns and international competition dynamics, providing a robust foundation for modeling AI innovation dynamics and technology transfer processes.
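The paper-patent similarity file can be pictured as the output of a step like the following: embed abstracts with a sentence encoder and take cosine similarity. The encoder checkpoint here is an assumption; the abstract does not name the dataset's actual embedding model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed
paper = "We propose a graph attention network for molecular property prediction."
patent = "A system using graph neural networks to predict chemical properties."
emb = model.encode([paper, patent], convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]).item())  # high score -> candidate transfer pair
```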
https://arxiv.org/abs/2503.09257
Query and product relevance prediction is a critical component for ensuring a smooth user experience in e-commerce search. Traditional studies mainly focus on BERT-based models to assess the semantic relevance between queries and products. However, the discriminative paradigm and limited knowledge capacity of these approaches restrict their ability to comprehend the relevance between queries and products fully. With the rapid advancement of Large Language Models (LLMs), recent research has begun to explore their application to industrial search systems, as LLMs provide extensive world knowledge and flexible optimization for reasoning processes. Nonetheless, directly leveraging LLMs for relevance prediction tasks introduces new challenges, including a high demand for data quality, the necessity for meticulous optimization of reasoning processes, and an optimistic bias that can result in over-recall. To overcome the above problems, this paper proposes a novel framework called the LLM-based RElevance Framework (LREF) aimed at enhancing e-commerce search relevance. The framework comprises three main stages: supervised fine-tuning (SFT) with Data Selection, Multiple Chain of Thought (Multi-CoT) tuning, and Direct Preference Optimization (DPO) for de-biasing. We evaluate the performance of the framework through a series of offline experiments on large-scale real-world datasets, as well as online A/B testing. The results indicate significant improvements in both offline and online metrics. Ultimately, the model was deployed in a well-known e-commerce application, yielding substantial commercial benefits.
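For the de-biasing stage, the standard Direct Preference Optimization loss looks as sketched below in PyTorch; this is the generic DPO formulation, not LREF-specific code, and beta is a hypothetical value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # L = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
    #                          - (log pi(y_l) - log pi_ref(y_l))])
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy call: preferring a calibrated "not relevant" label (chosen) over an
# over-recalled "relevant" one (rejected), to counter the optimistic bias.
print(dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
               torch.tensor([-1.2]), torch.tensor([-1.8])))
```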
https://arxiv.org/abs/2503.09223
This paper introduces a novel approach to Dialogue State Tracking (DST) that leverages Large Language Models (LLMs) to generate natural language descriptions of dialogue states, moving beyond traditional slot-value representations. Conventional DST methods struggle with open-domain dialogues and noisy inputs. Motivated by the generative capabilities of LLMs, our Natural Language DST (NL-DST) framework trains an LLM to directly synthesize human-readable state descriptions. We demonstrate through extensive experiments on MultiWOZ 2.1 and Taskmaster-1 datasets that NL-DST significantly outperforms rule-based and discriminative BERT-based DST baselines, as well as generative slot-filling GPT-2 DST models, in both Joint Goal Accuracy and Slot Accuracy. Ablation studies and human evaluations further validate the effectiveness of natural language state generation, highlighting its robustness to noise and enhanced interpretability. Our findings suggest that NL-DST offers a more flexible, accurate, and human-understandable approach to dialogue state tracking, paving the way for more robust and adaptable task-oriented dialogue systems.
https://arxiv.org/abs/2503.08857
Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is a crucial area within Natural Language Processing (NLP): it fundamentally empowers machines to discern semantic relationships between assorted sections of text. While considerable work has been done for English, efforts for Spanish remain relatively sparse. With this in view, this paper focuses on generating ESNLIR, a multi-genre Spanish dataset for NLI, particularly accounting for causal relationships. A preliminary baseline was built and evaluated using models drawn from the BERT family. The findings indicate that enriching the genres contributes to the model's ability to generalize. The code, notebooks, and full datasets for these experiments are available at: this https URL. The dataset alone is available at: this https URL.
https://arxiv.org/abs/2503.08803