We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take text block positions into account in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
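As a rough illustration of the core idea, the sketch below computes bucketed relative polar coordinates (distance and angle) between text-block centers, the kind of quantity that could index learned attention-bias embeddings. The function name, bucket counts, and log-scale bucketing are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def polar_relative_positions(centers, num_dist_buckets=8, num_angle_buckets=8, max_dist=1000.0):
    """Bucketed relative polar coordinates between text-block centers.

    centers: (N, 2) array of (x, y) block centers on the page.
    Returns two (N, N) integer arrays of distance- and angle-bucket ids,
    which could index learned attention-bias embeddings.
    """
    deltas = centers[None, :, :] - centers[:, None, :]   # (N, N, 2) pairwise offsets
    dist = np.linalg.norm(deltas, axis=-1)               # radial distance
    angle = np.arctan2(deltas[..., 1], deltas[..., 0])   # angle in [-pi, pi]

    # Log-scale distance buckets give nearby blocks finer resolution.
    dist_bucket = np.minimum(
        (np.log1p(dist) / np.log1p(max_dist) * num_dist_buckets).astype(int),
        num_dist_buckets - 1,
    )
    # Uniform angle buckets over the full circle.
    angle_bucket = ((angle + np.pi) / (2 * np.pi) * num_angle_buckets).astype(int) % num_angle_buckets
    return dist_bucket, angle_bucket

# Three blocks on a page: one to the right, one below the first
centers = np.array([[10.0, 20.0], [110.0, 20.0], [10.0, 220.0]])
dist_ids, angle_ids = polar_relative_positions(centers)
```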
https://arxiv.org/abs/2507.08606
This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components (document representation, dimensionality reduction, and model training) across 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLPs). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
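A minimal scikit-learn sketch of the kind of feature-centric pipeline the paper evaluates: TF-IDF features, PCA reduction, and an MLP for multi-label emotion prediction. The texts, label set, and hyperparameters are illustrative assumptions, not the system's configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

texts = ["I love this!", "This is terrifying...", "So proud and happy today"]
y = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])  # multi-label targets over [joy, fear, pride]

pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    FunctionTransformer(lambda X: X.toarray(), accept_sparse=True),  # PCA needs dense input
    PCA(n_components=2),  # illustrative; tuned per language in practice
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
pipeline.fit(texts, y)
print(pipeline.predict(["What a wonderful day"]))
```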
https://arxiv.org/abs/2507.08499
In this study, we focused on proposing an optimal clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary across firms and countries. Hence, if one wants to extend the data already collected in O*NET to occupations defined by different tasks, a mapping between the definitions becomes a vital requirement. We proposed a pipeline that combines several BERT-based techniques with various clustering approaches to obtain such a mapping. We also examined the effect of dimensionality reduction on several metrics used to measure the performance of clustering algorithms. Finally, we improved our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish occupations automatically, creating new paths for people wanting to change their careers.
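The proposed pipeline (BERT-based embeddings, dimensionality reduction, clustering, and silhouette scoring) can be sketched as follows; the encoder checkpoint, task statements, and cluster count are placeholders rather than the study's configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Hypothetical O*NET-style task statements
tasks = [
    "Analyze financial records to prepare tax returns",
    "Diagnose and treat illnesses in patients",
    "Write and test code for software applications",
    "Prepare budgets and audit financial statements",
    "Develop software systems and debug programs",
    "Examine patients and prescribe medication",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any BERT-based encoder
embeddings = encoder.encode(tasks)

reduced = PCA(n_components=3).fit_transform(embeddings)  # dimensionality reduction
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)

# Silhouette as the clustering-quality metric the specialized approach builds on
print("silhouette:", silhouette_score(reduced, labels))
```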
https://arxiv.org/abs/2507.07582
The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in materials science research. However, extracting structured information about ORR catalysts from the vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relation types between entity pairs. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of the fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.
https://arxiv.org/abs/2507.07499
Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
https://arxiv.org/abs/2507.07421
Dialectical systems are a mathematical formalism for modeling an agent that updates a knowledge base while seeking consistency. Introduced in the 1970s by Roberto Magari, they were originally conceived to capture how a working mathematician or a research community refines beliefs in the pursuit of truth. Dialectical systems also serve as natural models for the belief change of an automated agent, offering a unifying, computable framework for dynamic belief management. The literature distinguishes three main models of dialectical systems: (d-)dialectical systems, which revise beliefs when they are seen to be inconsistent; p-dialectical systems, which revise beliefs upon finding a counterexample; and q-dialectical systems, which can do both. We answer an open problem in the literature by proving that q-dialectical systems are strictly more powerful than p-dialectical systems, which are themselves known to be strictly stronger than (d-)dialectical systems. This result highlights the complementary roles of counterexample and contradiction in automated belief revision, and thus also in the reasoning processes of mathematicians and research communities.
https://arxiv.org/abs/2507.06798
Self-supervised learning (SSL) models such as Wav2Vec 2.0 and HuBERT have shown remarkable success in extracting phonetic information from raw audio without labelled data. While prior work has demonstrated that SSL embeddings encode phonetic features at the frame level, it remains unclear whether these models preserve temporal structure; specifically, whether embeddings at phoneme boundaries reflect the identity and order of adjacent phonemes. This study investigates the extent to which boundary-sensitive embeddings from HubertSoft, a soft-clustering variant of HuBERT, encode phoneme transitions. Using the CORPRES Russian speech corpus, we labelled 20 ms embedding windows with triplets of phonemes corresponding to their start, centre, and end segments. A neural network was trained to predict each of these positions separately, and multiple evaluation metrics, including ordered accuracy, unordered accuracy, and a flexible centre accuracy, were used to assess temporal sensitivity. Results show that embeddings extracted at phoneme boundaries capture both phoneme identity and temporal order, with especially high accuracy at segment boundaries. Confusion patterns further suggest that the model encodes articulatory detail and coarticulatory effects. These findings contribute to our understanding of the internal structure of SSL speech representations and their potential for phonological analysis and fine-grained transcription tasks.
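A hedged sketch of the probing setup described: a small network predicts the (start, centre, end) phoneme triplet from a single boundary-window embedding. The embedding dimension, phoneme inventory size, and shared-backbone design are our assumptions.

```python
import torch
import torch.nn as nn

EMB_DIM, NUM_PHONEMES = 256, 50  # illustrative sizes

class TripletPhonemeHead(nn.Module):
    """Predict (start, centre, end) phonemes from one boundary embedding."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(EMB_DIM, 512), nn.ReLU())
        # One classification head per position in the triplet
        self.heads = nn.ModuleList([nn.Linear(512, NUM_PHONEMES) for _ in range(3)])

    def forward(self, x):
        h = self.backbone(x)
        return [head(h) for head in self.heads]  # three logit tensors

model = TripletPhonemeHead()
emb = torch.randn(8, EMB_DIM)                     # batch of 20 ms window embeddings
targets = torch.randint(0, NUM_PHONEMES, (8, 3))  # (start, centre, end) labels
loss = sum(nn.functional.cross_entropy(logits, targets[:, i])
           for i, logits in enumerate(model(emb)))
loss.backward()
```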
https://arxiv.org/abs/2507.06794
We propose a unified food-domain QA framework that combines a large-scale multimodal knowledge graph (MMKG) with generative AI. Our MMKG links 13,000 recipes, 3,000 ingredients, 140,000 relations, and 14,000 images. We generate 40,000 QA pairs using 40 templates and LLaVA/DeepSeek augmentation. Joint fine-tuning of Meta LLaMA 3.1-8B and Stable Diffusion 3.5-Large improves BERTScore by 16.2%, reduces FID by 37.8%, and boosts CLIP alignment by 31.1%. Diagnostic analyses, namely CLIP-based mismatch detection (reducing mismatches from 35.2% to 7.3%) and LLaVA-driven hallucination checks, ensure factual and visual fidelity. A hybrid retrieval-generation strategy achieves 94.1% accurate image reuse and 85% adequacy in synthesis. Our results demonstrate that structured knowledge and multimodal generation together enhance reliability and diversity in food QA.
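The CLIP-based mismatch detection can be illustrated with a simple similarity threshold; the checkpoint and threshold below are assumptions, not the paper's settings.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def is_mismatch(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Flag an image-text pair whose CLIP similarity falls below a threshold."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings
    sim = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return sim < threshold
```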
https://arxiv.org/abs/2507.06571
In this paper, we, the DS@GT team for CLEF 2025 CheckThat! Task 4a (Scientific Web Discourse Detection), present the methods we explored for this task. For this multiclass classification task, we determined whether a tweet contained a scientific claim, a reference to a scientific study or publication, and/or mentions of scientific entities, such as a university or a scientist. We present three modeling approaches for this task: transformer fine-tuning, few-shot prompting of LLMs, and a combined ensemble model whose design was informed by earlier experiments. Our team placed 7th in the competition, achieving a macro-averaged F1 score of 0.8611, an improvement over the DeBERTaV3 baseline of 0.8375. Our code is available on GitHub at this https URL.
https://arxiv.org/abs/2507.06205
Numerical claims (statements involving quantities, comparisons, and temporal references) pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset, building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) on numerical tasks. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57, placing us among the top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at this https URL.
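R2L tokenization of numbers is typically implemented by segmenting digit runs from the right so that place values align across numbers; the sketch below follows that common scheme, which may differ from the paper's exact setup.

```python
import re

def r2l_number_chunks(text: str, group: int = 3) -> str:
    """Split each digit run into fixed-size groups from the right,
    so '1234567' becomes '1 234 567' and place values align."""
    def split_num(m: re.Match) -> str:
        digits = m.group(0)
        chunks = []
        while digits:
            chunks.append(digits[-group:])
            digits = digits[:-group]
        return " ".join(reversed(chunks))
    return re.sub(r"\d+", split_num, text)

print(r2l_number_chunks("Revenue grew from 987654 to 1234567"))
# -> "Revenue grew from 987 654 to 1 234 567"
```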
https://arxiv.org/abs/2507.06195
Technical reports and articles often contain valuable information in the form of semi-structured data like charts and figures. Interpreting these and using the information from them is essential for downstream tasks such as question answering (QA). Current approaches to visual question answering often struggle with the precision required for scientific data interpretation, particularly in handling numerical values, multi-step reasoning over visual elements, and maintaining consistency between visual observation and textual reasoning. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. We conducted a series of experiments using models with 5B to 8B parameters. Our strongest individual model, InternVL3, achieved ROUGE-1 and ROUGE-L F1 scores of 0.740 and a BERTScore of 0.983 on the SciVQA test split. We also developed an ensemble model with multiple vision-language models (VLMs). Through error analysis on the validation split, our ensemble approach improved performance compared to most individual models, though InternVL3 remained the strongest standalone performer. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving a model's visual question answering ability.
https://arxiv.org/abs/2507.06183
We introduce OpenFActScore, an open-source implementation of the FActScore framework for evaluating the factuality of text generated by large language models (LLMs). FActScore evaluates the factual accuracy of long-form text by using Atomic Fact Generation (AFG) to extract individual factual claims and Atomic Fact Validation (AFV) to verify each claim against a trusted knowledge source. While the original FActScore relies on closed-source and commercial models such as InstructGPT and ChatGPT, OpenFActScore enables the use of any Hugging Face-compatible model for both AFG and AFV. We provide a detailed technical overview of our implementation, highlighting design choices and modifications made to support open models. We evaluate multiple open-source LLMs on both AFG and AFV using the original FActScore benchmark, reporting BERTScore-F1 for AFG and Error Rate relative to human annotations for AFV. Our results show that open models can approximate the performance of closed-source systems, with Gemma achieving the best overall performance, and our final setup obtains a 0.99 Pearson correlation with the original FActScore experiments. OpenFActScore promotes transparency, reproducibility, and cost-effective evaluation, and is available at: this https URL.
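A hedged skeleton of the two-stage AFG/AFV flow using Hugging Face pipelines; the checkpoints, prompt, and retrieve_evidence hook are placeholders, not OpenFActScore's actual components.

```python
from transformers import pipeline

# Stage 1: Atomic Fact Generation (AFG) with any HF instruction-tuned model
generator = pipeline("text-generation", model="google/gemma-2-2b-it", max_new_tokens=256)
# Stage 2: Atomic Fact Validation (AFV) framed as NLI against retrieved evidence
nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def factscore(text: str, retrieve_evidence) -> float:
    prompt = f"Break the following text into independent atomic facts, one per line:\n{text}\n"
    raw = generator(prompt)[0]["generated_text"][len(prompt):]
    facts = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    supported = 0
    for fact in facts:
        evidence = retrieve_evidence(fact)  # caller supplies the knowledge source
        verdict = nli({"text": evidence, "text_pair": fact})[0]  # premise/hypothesis
        supported += verdict["label"] == "entailment"
    return supported / max(len(facts), 1)
```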
https://arxiv.org/abs/2507.05965
Motivated by the growing demand for retrieval systems that operate across modalities, we introduce llama-nemoretriever-colembed, a unified text-image retrieval model that delivers state-of-the-art performance across multiple benchmarks. We release two model variants, 1B and 3B. The 3B model achieves state-of-the-art performance, scoring an NDCG@5 of 91.0 on ViDoRe V1 and 63.5 on ViDoRe V2, placing first on both leaderboards as of June 27, 2025. Our approach leverages the NVIDIA Eagle2 vision-language model (VLM), modifies its architecture by replacing causal attention with bidirectional attention, and integrates a ColBERT-style late interaction mechanism to enable fine-grained multimodal retrieval in a shared embedding space. While this mechanism delivers superior retrieval accuracy, it introduces trade-offs in storage and efficiency. We provide a comprehensive analysis of these trade-offs. Additionally, we adopt a two-stage training strategy to enhance the model's retrieval capabilities.
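ColBERT-style late interaction reduces to a MaxSim operation over token embeddings; a minimal sketch with illustrative dimensions:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: each query token takes the similarity of its
    best-matching document token; scores are summed over query tokens.

    query_emb: (num_q_tokens, dim), doc_emb: (num_d_tokens, dim),
    both assumed L2-normalized so the dot product is cosine similarity.
    """
    sim = query_emb @ doc_emb.T          # (num_q_tokens, num_d_tokens)
    return sim.max(dim=1).values.sum()   # MaxSim over doc tokens, sum over query

q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(50, 128), dim=-1)
print(late_interaction_score(q, d))
```

The storage and efficiency trade-off the paper analyzes follows directly from this design: every document contributes a full matrix of token embeddings rather than a single vector, raising index size and scoring cost in exchange for fine-grained matching.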
https://arxiv.org/abs/2507.05513
Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content, including its style and topic, and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives: Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM responsible for the generation. Both tasks are based on fine-tuning of Generative Pre-trained Transformer (GPT-4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned GPT-4o-mini and BERT models achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
https://arxiv.org/abs/2507.05157
Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques, including zero-shot and in-context learning, our study evaluates performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
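The sentence-based chunking strategy can be sketched as greedy packing of whole sentences under a length budget, followed by a second-stage summary of the partial summaries. The word-based budget and the summarize callable below are stand-ins for a tokenizer budget and the actual LLM call.

```python
import re

def sentence_chunks(text: str, max_words: int = 400):
    """Greedily pack whole sentences into chunks under a word budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def multi_stage_summary(text: str, summarize) -> str:
    """Stage 1: summarize each chunk. Stage 2: summarize the concatenation."""
    partials = [summarize(chunk) for chunk in sentence_chunks(text)]
    return summarize(" ".join(partials))
```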
https://arxiv.org/abs/2507.05123
We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To this end, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.
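The token-correction objective can be sketched as corrupting a fraction of note tokens with bounded noise and training the model to recover the originals. The id-space shift below is our simplification of "a limited amount of noise", and the vocabulary size is illustrative.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 1000  # illustrative note-token vocabulary size

def corrupt_tokens(tokens: torch.Tensor, noise_rate: float = 0.15, max_shift: int = 3):
    """Corrupt a sampled fraction of note tokens by a small random id shift."""
    noisy = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < noise_rate
    shift = torch.randint(-max_shift, max_shift + 1, tokens.shape)
    noisy[mask] = (noisy[mask] + shift[mask]).clamp(0, VOCAB_SIZE - 1)
    return noisy, mask

tokens = torch.randint(0, VOCAB_SIZE, (2, 64))               # note-token sequences
noisy, mask = corrupt_tokens(tokens)
logits = torch.randn(2, 64, VOCAB_SIZE, requires_grad=True)  # stand-in encoder output
# Token-correction loss: recover the original ids at corrupted positions
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```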
https://arxiv.org/abs/2507.04776
In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce the Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation, and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model's perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB, the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay's live Edge Recommendation system since January 2024, currently serving the app's recommendation traffic across 8.4 million daily active devices.
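The parameter-quantization stage in EI-BERT is part of the authors' bespoke pipeline; as a generic point of reference, post-training dynamic quantization of a stock BERT checkpoint in PyTorch looks like this.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Weights of all Linear layers go to int8; activations stay float and are
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "bert_int8.pt")  # noticeably smaller on disk
```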
https://arxiv.org/abs/2507.04636
The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general-purpose large language AI models (LLMs) show a tendency to deliver creative answers, often called "hallucinations", which have slowed their adoption in the medical and biomedical fields, where accuracy is paramount. We propose that the design and use of much smaller, domain- and even task-specific LMs may be a more rational and appropriate use of this technology in biomedical research. In this work we apply a very small LM by today's standards to the specialized task of predicting regulatory interactions between molecular components to fill gaps in our current understanding of intracellular pathways. Toward this end, we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameter) LM based on the Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy, using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may even have been detrimental to the learning rate.
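The entropy-guided selection rule can be sketched directly: score each prediction's certainty with Shannon entropy and prefer the incorrectly classified examples with the lowest entropy. The toy data below is illustrative.

```python
import numpy as np

def entropy(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy per prediction; low entropy = high model certainty."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_tuning(probs, preds, labels, k=10):
    """Pick the k incorrectly classified examples the model was most certain
    about (lowest entropy), the selection the study found most effective."""
    wrong = np.where(preds != labels)[0]
    h = entropy(probs[wrong])
    return wrong[np.argsort(h)[:k]]

probs = np.array([[0.99, 0.01], [0.60, 0.40], [0.05, 0.95], [0.51, 0.49], [0.90, 0.10]])
preds = probs.argmax(axis=1)
labels = np.array([1, 1, 1, 0, 0])
print(select_for_tuning(probs, preds, labels, k=2))  # -> [0 1]
```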
https://arxiv.org/abs/2507.04432
The recent advancement of artificial intelligence, especially machine learning (ML), has significantly impacted software engineering research, including bug report analysis. ML aims to automate the understanding, extraction, and correlation of information from bug reports. Despite its growing importance, there has been no comprehensive review in this area. In this paper, we present a systematic literature review covering 1,825 papers, selecting 204 for detailed analysis. We derive eight key findings: 1) Extensive use of CNN, LSTM, and kNN for bug report analysis, with advanced models like BERT underutilized due to their complexity. 2) Word2Vec and TF-IDF are popular for feature representation, with a rise in deep learning approaches. 3) Stop word removal is the most common preprocessing step, with structural methods rising after 2020. 4) Eclipse and Mozilla are the most frequently evaluated software projects. 5) Bug categorization is the most common task, followed by bug localization and severity prediction. 6) There is increasing attention on specific bugs like non-functional and performance bugs. 7) Common evaluation metrics are F1-score, Recall, Precision, and Accuracy, with k-fold cross-validation preferred for model evaluation. 8) Many studies lack robust statistical tests. We also identify six promising future research directions to provide useful insights for practitioners.
https://arxiv.org/abs/2507.04422
Recent advances in natural language processing (NLP) have been driven by pretrained language models like BERT, RoBERTa, T5, and GPT. These models excel at understanding complex texts, but biomedical literature, with its domain-specific terminology, poses challenges that models like Word2Vec and bidirectional long short-term memory (Bi-LSTM) can't fully address. GPT and T5, despite capturing context, fall short in tasks needing bidirectional understanding, unlike BERT. Addressing this, we proposed MedicalBERT, a pretrained BERT model trained on a large biomedical dataset and equipped with domain-specific vocabulary that enhances the comprehension of biomedical terminology. The MedicalBERT model is further optimized and fine-tuned to address diverse tasks, including named entity recognition, relation extraction, question answering, sentence similarity, and document classification. Performance metrics such as the F1-score, accuracy, and Pearson correlation are employed to showcase the efficiency of our model in comparison to other BERT-based models such as BioBERT, SciBERT, and ClinicalBERT. MedicalBERT outperforms these models on most of the benchmarks, and surpasses the general-purpose BERT model by 5.67% on average across all the tasks evaluated. This work also underscores the potential of leveraging pretrained BERT models for medical NLP tasks, demonstrating the effectiveness of transfer learning techniques in capturing domain-specific information.
https://arxiv.org/abs/2507.08013