Large language models (LLMs) have been shown to respond in a variety of ways on classification tasks outside of question answering. LLM responses are sometimes called "hallucinations" when the output is not what is expected. Memorization strategies in LLMs are being studied in detail, with the goal of understanding how LLMs respond. We perform a deep dive into a classification task based on United States Supreme Court (SCOTUS) decisions. The SCOTUS corpus is an ideal testbed for studying LLM memory accuracy because it presents significant challenges: extensive sentence length, complex legal terminology, non-standard structure, and domain-specific vocabulary. We experiment with the latest LLM fine-tuning and retrieval-based approaches, such as parameter-efficient fine-tuning and auto-modeling, on two traditional category-based SCOTUS classification tasks: one with 15 labeled topics and another with 279. We show that prompt-based models with memories, such as DeepSeek, can be more robust than previous BERT-based models, scoring about 2 points higher on both tasks than previous models not based on prompting.
https://arxiv.org/abs/2512.13654
This study investigates emotion drift, the change in emotional state across a single text, in mental health-related messages. While sentiment analysis typically classifies an entire message as positive, negative, or neutral, the nuanced shift of emotions over the course of a message is often overlooked. This study detects sentence-level emotions and measures emotion drift scores using pre-trained transformer models such as DistilBERT and RoBERTa. The results provide insights into patterns of emotional escalation or relief in mental health conversations. This methodology can be applied to better understand emotional dynamics in content.
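As a rough illustration of the sentence-level scoring described above, the sketch below assigns each sentence a signed valence with an off-the-shelf classifier and takes the first-to-last change as a drift score; the checkpoint, sentence splitting, and drift definition are illustrative assumptions rather than the study's exact protocol.

```python
# Hedged sketch: sentence-level valence scoring and a simple drift score.
# Checkpoint choice and the first-vs-last drift definition are assumptions.
from transformers import pipeline

# Any sentence-level sentiment/emotion classifier could be substituted here.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

def emotion_drift(message: str) -> float:
    # Naive sentence splitting; a proper sentence tokenizer could be used instead.
    sentences = [s.strip() for s in message.split(".") if s.strip()]
    scores = []
    for s in sentences:
        out = clf(s)[0]
        sign = 1.0 if out["label"] == "POSITIVE" else -1.0
        scores.append(sign * out["score"])   # signed valence in [-1, 1]
    # Drift = change in valence from the first sentence to the last.
    return scores[-1] - scores[0] if len(scores) > 1 else 0.0

print(emotion_drift("I felt hopeless this morning. Talking to a friend helped. I feel calmer now."))
```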
https://arxiv.org/abs/2512.13363
The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
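For reference, the HSIC constraint mentioned above is commonly implemented as a kernel-based dependence penalty between two feature sets; a minimal sketch with assumed RBF kernels and bandwidth (not the paper's exact configuration) is:

```python
# Minimal sketch of a (biased) HSIC estimator with RBF kernels, as commonly
# used to penalize statistical dependence between two feature sets.
# Kernel choice and bandwidth are illustrative assumptions.
import torch

def rbf_kernel(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # x: (n, d) -> (n, n) Gram matrix
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Biased HSIC estimate: trace(K H L H) / (n - 1)^2, with H the centering matrix.
    n = x.size(0)
    k, l = rbf_kernel(x), rbf_kernel(y)
    h = torch.eye(n) - torch.ones(n, n) / n
    return torch.trace(k @ h @ l @ h) / (n - 1) ** 2

# Driving hsic(causal_feats, noncausal_feats) toward zero encourages independence.
causal_feats, noncausal_feats = torch.randn(32, 64), torch.randn(32, 64)
print(hsic(causal_feats, noncausal_feats))
```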
https://arxiv.org/abs/2512.13285
The rapid acceleration of scientific publishing has created substantial challenges for researchers attempting to discover, contextualize, and interpret relevant literature. Traditional keyword-based search systems provide limited semantic understanding, while existing AI-driven tools typically focus on isolated tasks such as retrieval, clustering, or bibliometric visualization. This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The system builds a comprehensive corpus by merging full-text data from arXiv with structured metadata from OpenAlex. A hybrid retrieval architecture fuses BM25 lexical search with embedding-based semantic search using Reciprocal Rank Fusion. Topic modeling is performed on retrieved results using BERTopic or non-negative matrix factorization depending on computational resources. A knowledge graph unifies papers, authors, institutions, countries, and extracted topics into an interpretable structure. The system provides a multi-layered exploration environment that reveals not only relevant publications but also the conceptual and relational landscape surrounding a query. Evaluation across multiple queries demonstrates improvements in retrieval relevance, topic coherence, and interpretability. The proposed framework contributes an extensible foundation for AI-assisted scientific discovery.
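For concreteness, Reciprocal Rank Fusion combines the BM25 and embedding-based rankings by summing reciprocal ranks; a minimal sketch (using the conventional constant k = 60, an assumed setting for this system) is:

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF) over a lexical (BM25) ranking
# and an embedding-based ranking. k = 60 is the conventional default and an
# assumption about this particular system's configuration.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["paperA", "paperB", "paperC"]
dense_hits = ["paperB", "paperD", "paperA"]
print(rrf([bm25_hits, dense_hits]))  # documents found by both retrievers rise to the top
```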
https://arxiv.org/abs/2512.12760
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets - a proprietary single-label dataset and the public WIPO-Alpha patent dataset (extreme multi-label classification) - show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is very competitive with - even surpassing - fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields impressive classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
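A hedged sketch of approach (1), a causal LLM with a classification head trained under 4-bit quantization and LoRA, might look as follows; the checkpoint, label count, and LoRA hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: causal LLM + classification head, 4-bit quantization, LoRA adapters.
# Model name, num_labels, and LoRA settings are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.bfloat16)

# For decoder-only models, this head pools the final (last non-padding) token
# hidden state as the sequence representation.
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # assumed checkpoint
    num_labels=10,                  # assumed label count
    quantization_config=bnb_cfg,
)

lora_cfg = LoraConfig(task_type="SEQ_CLS", r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only LoRA + head parameters remain trainable
```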
https://arxiv.org/abs/2512.12677
The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology that relies on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline where an expert-guided LLM (Gemini) generates a candidate corpus, which is then refined and annotated by native speakers. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and a 0.75 F1-Macro on Named Entity Recognition, massively outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a perplexity of 3.85, an order of magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
https://arxiv.org/abs/2512.12537
The rapid expansion of online courses and social media has generated large volumes of unstructured learner-generated text. Understanding how learners construct knowledge in these spaces is crucial for analysing learning processes, informing content design, and providing feedback at scale. However, existing approaches typically rely on manual coding of well-structured discussion forums, which does not scale to the fragmented discourse found in online learning. This study proposes and validates a framework that combines a codebook inspired by the Interaction Analysis Model with an automated classifier to enable large-scale analysis of knowledge construction in unstructured online discourse. We adapt four comment-level categories of knowledge construction: Non-Knowledge Construction, Share, Explore, and Integrate. Three trained annotators coded a balanced sample of 20,000 comments from YouTube education channels. The codebook demonstrated strong reliability, with Cohen's kappa = 0.79 on the main dataset and 0.85 to 0.93 across four additional educational domains. For automated classification, bag-of-words baselines were compared with transformer-based language models using 10-fold cross-validation. A DeBERTa-v3-large model achieved the highest macro-averaged F1 score (0.841), outperforming all baselines and other transformer models. External validation on four domains yielded macro-F1 above 0.705, with stronger transfer in medicine and programming, where discourse was more structured and task-focused, and weaker transfer in language and music, where comments were more varied and context-dependent. Overall, the study shows that theory-driven, semi-automated analysis of knowledge construction at scale is feasible, enabling the integration of knowledge-construction indicators into learning analytics and the design of online learning environments.
https://arxiv.org/abs/2510.19858
Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
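As a rough sketch of this kind of workflow with the BERTopic library, the snippet below fits topics and traces their prevalence over time; the corpus loaders are hypothetical placeholders and the embedding model is an assumed choice, not necessarily the one used in the study.

```python
# Hedged sketch: topic extraction and temporal analysis with BERTopic.
# load_newspaper_articles / load_publication_years are hypothetical helpers,
# and the SentenceTransformer checkpoint is an assumed embedding choice.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = load_newspaper_articles()        # hypothetical helper: list of article texts
timestamps = load_publication_years()   # hypothetical helper: one year per article, e.g. 1955-2018

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())   # topic sizes and top keywords

# Trace how topic prevalence shifts across the study period.
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
```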
https://arxiv.org/abs/2512.11635
We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.
https://arxiv.org/abs/2512.11502
Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts' decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs to the Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance, including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic or non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable classification of judicial philosophy and demonstrates its potential for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at this https URL.
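One common way to realize the class-weighting strategy mentioned above is an inverse-frequency weighted cross-entropy; the sketch below is illustrative, with assumed label counts rather than the paper's actual label distribution or its asymmetric-loss variant.

```python
# Hedged sketch of inverse-frequency class weighting for imbalanced labels.
# The per-class counts and the weighting scheme are illustrative assumptions.
import torch
import torch.nn as nn

label_counts = torch.tensor([5200.0, 910.0, 430.0])   # assumed per-class frequencies
weights = label_counts.sum() / (len(label_counts) * label_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)                        # classifier outputs for a batch
targets = torch.tensor([0, 2, 1, 0, 0, 2, 1, 0])  # gold argument-type labels
loss = criterion(logits, targets)                 # rare classes contribute more per example
```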
https://arxiv.org/abs/2512.11374
The rapid integration of generative artificial intelligence into education has driven digital transformation in e-teaching, yet user perceptions of AI educational apps remain underexplored. This study performs a sentiment-driven evaluation of user reviews from top AI ed-apps on the Google Play Store to assess efficacy, challenges, and pedagogical implications. Our pipeline involved scraping app data and reviews, RoBERTa for binary sentiment classification, GPT-4o for key point extraction, and GPT-5 for synthesizing top positive/negative themes. Apps were categorized into seven types (e.g., homework helpers, math solvers, language tools), with overlaps reflecting multifunctional designs. Results indicate predominantly positive sentiments, with homework apps like Edu AI (95.9% positive) and this http URL (92.7%) leading in accuracy, speed, and personalization, while language/LMS apps (e.g., Teacher AI at 21.8% positive) lag due to instability and limited features. Positives emphasize efficiency in brainstorming, problem-solving, and engagement; negatives center on paywalls, inaccuracies, ads, and glitches. Trends show that homework helpers outperform specialized tools, highlighting AI's democratizing potential amid risks of dependency and inequity. The discussion proposes future ecosystems with hybrid AI-human models, VR/AR for immersive learning, and a roadmap for developers (adaptive personalization) and policymakers (monetization regulation for inclusivity). This underscores generative AI's role in advancing e-teaching by enabling ethical refinements that foster equitable, innovative environments. The full dataset is available here (this https URL).
https://arxiv.org/abs/2512.11934
Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees, but its slow performance has limited practical deployment. Recent works proposed ASICs to accelerate FHE, but these require expensive advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance on GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written, hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, matching the prior FHE ASIC CraterLake outright. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and the first to demonstrate encrypted inference for BERT-Base and Llama3-8B in 8 seconds and 134 seconds, respectively.
https://arxiv.org/abs/2512.11269
SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size and validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding, including scholarly document processing.
https://arxiv.org/abs/2512.11192
We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.
https://arxiv.org/abs/2512.10959
LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
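A minimal sketch of the fusion step described above, concatenating a backbone embedding with LLM-derived per-class scores and feeding the result to a small MLP; dimensions and layer sizes are assumptions rather than the package's exact architecture.

```python
# Hedged sketch of the learned fusion: transformer embedding + LLM per-class
# scores -> small MLP. Dimensions and hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, embed_dim: int = 768, num_classes: int = 4, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, backbone_embedding, llm_class_scores):
        joint = torch.cat([backbone_embedding, llm_class_scores], dim=-1)
        return self.net(joint)

fusion = FusionMLP()
emb = torch.randn(2, 768)                               # e.g. RoBERTa sequence embeddings
llm_scores = torch.softmax(torch.randn(2, 4), dim=-1)   # LLM-derived per-class scores
logits = fusion(emb, llm_scores)                        # final fused prediction logits
```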
https://arxiv.org/abs/2512.10793
This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data, including textual semantics, prosodic speech features, and temporal behavioral trends, to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
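As a rough sketch of the attention-equipped BiLSTM component, the snippet below pools a sequence of fused multimodal features with additive attention before classification; the feature dimensions and number of states are illustrative assumptions, not the system's exact architecture.

```python
# Hedged sketch: BiLSTM encoder with attention pooling over per-turn features.
# Input dimension, hidden size, and number of affective/cognitive states are assumptions.
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    def __init__(self, in_dim: int = 768, hidden: int = 128, num_states: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, num_states)

    def forward(self, x):                        # x: (batch, time, in_dim) fused features
        h, _ = self.lstm(x)                      # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # attention weights over time steps
        pooled = (w * h).sum(dim=1)              # weighted summary of the sequence
        return self.head(pooled)

model = AttentiveBiLSTM()
features = torch.randn(2, 20, 768)   # e.g. per-turn text + prosody feature vectors
logits = model(features)             # predicted cognitive/affective state logits
```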
https://arxiv.org/abs/2512.10441
Large language models (LLMs) like Claude, Mistral AI, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
https://arxiv.org/abs/2512.10440
The integrity and reliability of the scientific literature are under serious threat from adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g., "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
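For intuition, token-level pseudo-perplexity with a masked language model such as SciBERT can be computed by masking each token in turn and scoring it with the model; the loop below is a common formulation, and the example inputs and any thresholding are illustrative rather than the paper's exact pipeline.

```python
# Hedged sketch of token-level pseudo-perplexity with a masked language model.
# Each token is masked in turn and scored; anomaly thresholds are not shown.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
mlm = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased").eval()

def pseudo_perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):             # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return float(torch.tensor(nlls).mean().exp())

# Tortured phrases tend to score far higher than standard terminology.
print(pseudo_perplexity("counterfeit consciousness improves paramount care"))
print(pseudo_perplexity("artificial intelligence improves health care"))
```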
https://arxiv.org/abs/2512.10435
Accurate spatial understanding is essential for image-guided surgery, augmented reality integration, and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While deep learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel deep learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, achieving improved matching precision and lower epipolar error compared to related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
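A minimal sketch of matching by cosine-similarity thresholding between descriptor sets from two frames is shown below; descriptor extraction, shapes, and the threshold value are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: nearest-neighbour matching of frame descriptors with a
# cosine-similarity cutoff. Shapes and the threshold are assumptions.
import torch
import torch.nn.functional as F

def match_descriptors(desc_a: torch.Tensor, desc_b: torch.Tensor, thr: float = 0.8):
    # desc_a: (Na, D), desc_b: (Nb, D) feature embeddings from two frames
    a = F.normalize(desc_a, dim=-1)
    b = F.normalize(desc_b, dim=-1)
    sim = a @ b.t()                          # (Na, Nb) cosine similarities
    best_sim, best_idx = sim.max(dim=1)      # nearest neighbour in frame B
    keep = best_sim > thr                    # discard weak or ambiguous matches
    return torch.nonzero(keep).squeeze(1), best_idx[keep]

desc_a, desc_b = torch.randn(500, 384), torch.randn(480, 384)  # e.g. adapted backbone features
idx_a, idx_b = match_descriptors(desc_a, desc_b)               # matched indices in A and B
```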
https://arxiv.org/abs/2512.10379
Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features. Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.
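As a rough sketch of the signed, similarity-weighted message graph described above, the snippet below assigns each edge a sign from label agreement and a weight from textual similarity; the embedding model, similarity cutoff, and example messages are illustrative assumptions.

```python
# Hedged sketch: build a signed belief graph where nodes are messages,
# edge sign encodes agreement of predicted belief labels, and edge weight
# is textual similarity. Model and cutoff are illustrative assumptions.
import itertools
import networkx as nx
from sentence_transformers import SentenceTransformer, util

messages = ["they are hiding the real numbers from us",
            "does anyone know when the clinic opens",
            "the official figures are being covered up"]
labels = [1, 0, 1]   # 1 = conspiratorial, 0 = not (from the fine-tuned classifier)

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
emb = encoder.encode(messages, convert_to_tensor=True)

g = nx.Graph()
g.add_nodes_from(range(len(messages)))
for i, j in itertools.combinations(range(len(messages)), 2):
    sim = float(util.cos_sim(emb[i], emb[j]))
    if sim > 0.3:                                   # keep only sufficiently similar pairs
        sign = 1 if labels[i] == labels[j] else -1
        g.add_edge(i, j, sign=sign, weight=sim)

print(g.edges(data=True))
```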
https://arxiv.org/abs/2512.10105