We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel then builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our approach is training-free and does not rely on embeddings from a CLIP/BERT text encoder; instead, we perform text-to-text search directly with MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex RES tasks. The code will be open-sourced.
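To make the text-to-text search step concrete, here is a minimal sketch. The scene map, the query, and the word-overlap scorer standing in for an MLLM relevance judgment are all illustrative assumptions, not the paper's implementation.

```python
# Sketch of text-to-text search over a captioned scene map.
# A real system would ask an MLLM which caption matches the query;
# here a simple word-overlap score stands in for the MLLM judgment.

def match_query_to_groups(scene_map, query):
    """Return group ids ranked by how well their caption matches the query."""
    q_words = set(query.lower().split())
    scores = {}
    for group_id, caption in scene_map.items():
        c_words = set(caption.lower().split())
        # Jaccard overlap as a stand-in for MLLM text-to-text relevance.
        scores[group_id] = len(q_words & c_words) / len(q_words | c_words)
    return sorted(scores, key=scores.get, reverse=True)

scene_map = {
    0: "a wooden dining table with four chairs",
    1: "a red sofa facing the television",
    2: "a potted plant next to the window",
}
ranking = match_query_to_groups(scene_map, "the plant by the window")
```

Given a referring expression, the top-ranked group's voxels would then be returned as the segmentation.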
https://arxiv.org/abs/2601.09575
Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
https://arxiv.org/abs/2601.09421
Standardized Student Evaluations of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement. We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents the five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. A BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
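The anchor-to-evidence alignment can be sketched as cross-attention where each teaching-dimension anchor queries the evidence tokens. The 2-dim embeddings, fixed anchor values, and absence of learned query/key projections are simplifications for illustration.

```python
import math

# Sketch of cross-attention aligning learnable dimension anchors with
# evidence token embeddings. Dimensions and values are toy choices; the
# paper's encoder learns anchors and projections end to end.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(anchors, tokens):
    """Each anchor (query) pools the token embeddings (keys = values)."""
    d = len(tokens[0])
    pooled = []
    for q in anchors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * t[j] for w, t in zip(weights, tokens))
                       for j in range(d)])
    return pooled

# Two anchors (say, "expertise" and "demeanor") over three evidence tokens.
anchors = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
pooled = cross_attend(anchors, tokens)
```

Each pooled vector is the evidence summary for one teaching dimension, which a downstream head would score for that label.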
https://arxiv.org/abs/2601.09246
Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on the USPTO HUPD dataset, EPO patent collections, and the Patent-CE benchmark demonstrate substantial improvements: a 7.6-point ROUGE-L gain over GPT-4o, an 8.3% BERTScore improvement over Llama-3.1-8B, and a 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4% cross-jurisdictional performance retention versus 76.2% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.
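A shape-level sketch of the multi-head attention used for relationship modeling is below. The paper uses eight specialized heads; two heads over a 4-dim embedding with identity projections are shown here purely for brevity.

```python
import math

# Minimal multi-head self-attention: the embedding is split across heads,
# each head attends over the sequence in its own slice, and the heads are
# concatenated back. Projections are omitted for clarity.

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def multi_head_attention(seq, n_heads):
    d = len(seq[0])
    hd = d // n_heads                      # per-head dimension
    out = []
    for q in seq:
        row = []
        for h in range(n_heads):           # each head sees its own slice
            lo, hi = h * hd, (h + 1) * hd
            scores = [sum(q[j] * k[j] for j in range(lo, hi)) / math.sqrt(hd)
                      for k in seq]
            w = softmax(scores)
            row += [sum(wi * k[j] for wi, k in zip(w, seq))
                    for j in range(lo, hi)]
        out.append(row)                    # heads concatenated back to d
    return out

seq = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(seq, n_heads=2)
```

With eight heads, each head can specialize in one relationship type (e.g. claim-to-prior-art similarity) over its own subspace.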
https://arxiv.org/abs/2601.09120
Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations--especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF-IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
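The bi-encoder retrieval step reduces to nearest-neighbor search in the shared embedding space. The toy 3-dim vectors below stand in for real encoder outputs; the skill names and values are illustrative assumptions.

```python
import math

# Sketch of bi-encoder retrieval: job-ad sentences and ESCO skill
# descriptions are embedded into a shared space, then matched by cosine
# similarity and the top-k skills are returned.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_top_k(sentence_vec, skill_vecs, k):
    """Return the k skill ids whose embeddings are closest to the sentence."""
    ranked = sorted(skill_vecs,
                    key=lambda s: cosine(sentence_vec, skill_vecs[s]),
                    reverse=True)
    return ranked[:k]

skill_vecs = {
    "python programming": [0.9, 0.1, 0.0],
    "team management":    [0.0, 0.9, 0.3],
    "data analysis":      [0.7, 0.2, 0.6],
}
sentence = [0.8, 0.1, 0.1]   # e.g. embedding of "experience writing Python scripts"
top = retrieve_top_k(sentence, skill_vecs, k=2)
```

A metric like F1@5 then compares the top-5 retrieved skills against the gold labels per sentence.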
https://arxiv.org/abs/2601.09119
Digital platforms have an ever-expanding user base and act as a hub for communication, business, and connectivity. However, this has also allowed hate speech and misogyny to spread. Artificial intelligence models have emerged as an effective solution for countering online hate speech, but they remain underexplored for low-resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet and mBERT + ResNet trained on a dataset of approximately 4,218 memes. It also provides feature-importance scores using explainability techniques including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators to promote further research in the field, combat gender-based digital violence, and ensure a safe digital space. The system has been evaluated by human evaluators who provided responses to the Chatbot Usability Questionnaire (CUQ) and User Experience Questionnaire (UEQ) to determine overall usability.
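The intuition behind per-word importance scores can be conveyed with a leave-one-out occlusion probe, a much simpler relative of LIME and SHAP. The keyword scorer below is a stub standing in for a trained misogyny classifier; both it and the example comment are illustrative assumptions.

```python
# Occlusion-style word importance: remove one word at a time and measure
# how the prediction shifts. LIME instead fits a local linear model over
# many perturbed samples, but the output has the same shape: a score per
# word indicating its contribution to the decision.

def stub_classifier(words):
    """Toy probability that a comment is misogynistic."""
    flagged = {"stupid", "woman"}
    return sum(1 for w in words if w in flagged) / max(len(words), 1)

def word_importance(words, classify):
    """Score each word by how much removing it changes the prediction."""
    base = classify(words)
    scores = {}
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        scores[w] = base - classify(reduced)
    return scores

comment = ["that", "woman", "is", "stupid"]
scores = word_importance(comment, stub_classifier)
```

Words with positive scores pushed the prediction toward "misogynistic", which is exactly what the application surfaces to moderators.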
https://arxiv.org/abs/2601.08457
Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or only provide figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy combining supervised learning with reinforcement learning (RL), utilizing CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves a superior 0.726 mAP@0.5:0.95 for detection and significantly outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits remarkable zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
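The gating idea can be sketched as a sigmoid gate that scales each token feature before it conditions the detection queries. The gate weights, bias, and 2-dim features are illustrative; the paper's module is learned end to end together with the detector.

```python
import math

# Sketch of a gated filter that down-weights noisy token features.
# Informative tokens pass nearly unchanged; noise tokens are suppressed.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_filter(token_feats, gate_w, gate_b):
    """Scale each token feature by a learned gate in [0, 1]."""
    out = []
    for f in token_feats:
        g = sigmoid(sum(w * x for w, x in zip(gate_w, f)) + gate_b)
        out.append([g * x for x in f])
    return out

# One "informative" token and one "noise" token under this toy gate.
feats = [[2.0, 2.0], [-2.0, -2.0]]
filtered = gated_filter(feats, gate_w=[1.0, 1.0], gate_b=0.0)
```

Because the gate output is bounded in [0, 1], the filtered features can only shrink, which is what stabilizes the query space against phrasing noise.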
https://arxiv.org/abs/2601.08026
Sanskrit Subhasitas encapsulate centuries of cultural and philosophical wisdom, yet remain underutilized in the digital age due to linguistic and contextual barriers. In this work, we present Pragya, a retrieval-augmented generation (RAG) framework for semantic recommendation of Subhasitas. We curate a dataset of 200 verses annotated with thematic tags such as motivation, friendship, and compassion. Using sentence embeddings (IndicBERT), the system retrieves top-k verses relevant to user queries. The retrieved results are then passed to a generative model (Mistral LLM) to produce transliterations, translations, and contextual explanations. Experimental evaluation demonstrates that semantic retrieval significantly outperforms keyword matching in precision and relevance, while user studies highlight improved accessibility through generated summaries. To our knowledge, this is the first attempt at integrating retrieval and generation for Sanskrit Subhasitas, bridging cultural heritage with modern applied AI.
https://arxiv.org/abs/2601.06607
This paper describes our system used in the BLP-2025 Task 1: Hate Speech Detection. We participated in Subtask 1A and Subtask 1B, addressing hate speech classification in Bangla text. Our approach employs a unified architecture that integrates BanglaBERT embeddings with multiple parallel processing branches based on GRUs and CNNs, followed by attention and dense layers for final classification. The model is designed to capture both contextual semantics and local linguistic cues, enabling robust performance across subtasks. The proposed system demonstrated high competitiveness, obtaining 0.7345 micro F1-Score (2nd place) in Subtask 1A and 0.7317 micro F1-Score (5th place) in Subtask 1B.
https://arxiv.org/abs/2601.06306
We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.
https://arxiv.org/abs/2601.05911
Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.
https://arxiv.org/abs/2601.05143
The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish--a hybrid of Hindi and English--used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
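Subword tokenization is what lets mBERT cope with Romanized Hinglish spelling variants and out-of-vocabulary slang. A greedy longest-match split in the WordPiece style is sketched below; the tiny vocabulary and the example words are illustrative assumptions, not mBERT's actual vocabulary.

```python
# Greedy longest-match subword tokenization (WordPiece style): an unseen
# word is split into the longest in-vocabulary pieces from left to right,
# with "##" marking continuation pieces.

def wordpiece(word, vocab, unk="[UNK]"):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:                       # try the longest piece first
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand               # continuation marker
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]                         # no decomposition found
        pieces.append(piece)
        start = end
    return pieces

vocab = {"bahut", "acha", "##a", "ach", "##ha", "jhak", "##aas"}
tokens = wordpiece("jhakaas", vocab)   # slang not in the vocab as a whole word
```

Even though "jhakaas" is out-of-vocabulary, it decomposes into known pieces instead of collapsing to `[UNK]`, so the model still receives a usable signal.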
https://arxiv.org/abs/2601.05091
In the process of digital transformation, enterprises face problems such as insufficient semantic understanding of unstructured data and a lack of intelligent decision-making basis in their driving mechanisms. This study proposes a method that combines a large language model (LLM) with a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model performs entity recognition and relation extraction on multi-source heterogeneous texts, and GPT-4 generates semantically enhanced vector representations; second, a two-layer graph neural network (GNN) architecture fuses the semantic vectors output by the LLM with business metadata to construct a dynamic, scalable enterprise knowledge graph; finally, reinforcement learning is introduced to optimize decision-path generation, with a reward function driving mechanism iteration. In a manufacturing case study, this mechanism reduced the response time for equipment-failure scenarios from 7.8 hours to 3.7 hours, reached an F1 score of 94.3%, and cut the annual cost of compensating for decision errors during digital transformation by 45.3%. By integrating large-model semantic understanding with structured knowledge, the method significantly enhances the intelligence and execution efficiency of the digital transformation driving mechanism.
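The two-layer GNN can be sketched as two rounds of message passing over the graph, after which each node's representation reflects its two-hop neighborhood. Mean aggregation, the chain graph, and the 1-dim features are illustrative simplifications of the learned architecture.

```python
# Minimal two-layer message passing: each layer lets every node average
# its own feature with its neighbours'. Two layers propagate information
# two hops across the knowledge graph.

def gnn_layer(feats, adj):
    """One round: each node averages itself with its neighbours."""
    out = []
    for i, f in enumerate(feats):
        neigh = [feats[j] for j in adj[i]] + [f]
        out.append([sum(v[k] for v in neigh) / len(neigh)
                    for k in range(len(f))])
    return out

# Chain graph: node 0 -- node 1 -- node 2
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = [[1.0], [0.0], [0.0]]
h1 = gnn_layer(feats, adj)   # after one layer, node 1 has seen node 0
h2 = gnn_layer(h1, adj)      # after two layers, node 2 has seen node 0
```

The test below checks the two-hop property: node 2's feature stays zero after one layer but becomes nonzero after the second.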
https://arxiv.org/abs/2601.04696
The alignment of Large Language Models (LLMs) is constantly evolving, and Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse such as fake news and online fraud. The generalization ability of fine-tuned detectors depends heavily on dataset quality, and simply expanding the sources of MGT is insufficient; further augmentation of the generation process is required. According to the theory of HC-Var, enhancing the alignment of generated text not only facilitates attacks on existing detectors to test their robustness, but also helps improve the generalization ability of detectors fine-tuned on it. We therefore propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to the reasoning process, with our systematically proposed Reinforced Learning from Detectors Feedback (RLDF) as a key component. In our experiments, a RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization-detection AUC, while the MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, providing a useful reference for future research on the generalization ability of detectors.
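The spirit of detector-feedback rewards can be sketched as follows: generated text that a detector still flags as machine-written earns a low reward. The stub detector, the stock-phrase heuristic, and the reward shape `1 - P(machine)` are all illustrative assumptions; the paper's RLDF formulation is not reproduced here.

```python
# Sketch of a detector-feedback reward: the generator is rewarded for
# text a detector judges human-like. The detector here is a trivial stub.

def stub_detector(text):
    """Toy P(machine-generated): penalizes a telltale stock phrase."""
    return 0.9 if "as an ai language model" in text.lower() else 0.2

def rldf_reward(text, detector):
    """Higher reward the more human-like the detector finds the text."""
    return 1.0 - detector(text)

aligned = "Honestly, I think the movie dragged in the middle."
unaligned = "As an AI language model, I found the movie enjoyable."
r_good = rldf_reward(aligned, stub_detector)
r_bad = rldf_reward(unaligned, stub_detector)
```

In an RL loop, this reward would push generations away from detector-recognizable patterns, which is exactly what makes the resulting corpus a harder (and more instructive) training set for detectors.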
https://arxiv.org/abs/2601.04633
As large language models (LLMs) are increasingly integrated into social decision-making, understanding their political positioning and alignment behavior is critical for safety and fairness. This study presents a sociotechnical audit of 26 prominent LLMs, triangulating their positions across three psychometric inventories (Political Compass, SapplyValues, 8 Values) and evaluating their performance on a large-scale news labeling task ($N \approx 27{,}000$). Our results reveal a strong clustering of models in the Libertarian-Left region of the ideological space, encompassing 96.3% of the cohort. Alignment signals appear to be consistent architectural traits rather than stochastic noise ($\eta^2 > 0.90$); however, we identify substantial discrepancies in measurement validity. In particular, the Political Compass exhibits a strong negative correlation with cultural progressivism ($r=-0.64$) when compared against multi-axial instruments, suggesting a conflation of social conservatism with authoritarianism in this context. We further observe a significant divergence between open-weights and closed-source models, with the latter displaying markedly higher cultural progressivism scores ($p<10^{-25}$). In downstream media analysis, models exhibit a systematic "center-shift," frequently categorizing neutral articles as left-leaning, alongside an asymmetric detection capability in which "Far Left" content is identified with greater accuracy (19.2%) than "Far Right" content (2.0%). These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are necessary to characterize alignment behavior in deployed LLMs. Our code and data will be made public.
https://arxiv.org/abs/2601.06194
The rapid development of large language models has led to an increase in AI-generated text, with students increasingly using LLM-generated content as their own work, which violates academic integrity. This paper presents an evaluation of AI text detection methods, including both traditional machine learning models and transformer-based architectures. We utilize two datasets, HC3 and DAIGT v2, to build a unified benchmark and apply a topic-based data split to prevent information leakage. This approach ensures robust generalization across unseen domains. Our experiments show that TF-IDF logistic regression achieves a reasonable baseline accuracy of 82.87%. However, deep learning models outperform it. The BiLSTM classifier achieves an accuracy of 88.86%, while DistilBERT achieves a similar accuracy of 88.11% with the highest ROC-AUC score of 0.96, demonstrating the strongest overall performance. The results indicate that contextual semantic modeling is significantly superior to lexical features and highlight the importance of mitigating topic memorization through appropriate evaluation protocols. The limitations of this work are primarily related to dataset diversity and computational constraints. In future work, we plan to expand dataset diversity and utilize parameter-efficient fine-tuning methods such as LoRA. We also plan to explore smaller or distilled models and employ more efficient batching strategies and hardware-aware optimization.
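The topic-based split and the TF-IDF baseline can be sketched together. Keeping every document of a topic on one side of the split is what prevents the classifier from memorizing topic vocabulary; the toy topics, documents, and the unsmoothed TF-IDF formula are illustrative assumptions.

```python
import math

# Sketch of (1) a topic-based train/test split that prevents information
# leakage and (2) a tiny TF-IDF vectorizer of the kind feeding the
# logistic-regression baseline.

def topic_split(docs, held_out_topics):
    """Split so that no topic appears in both train and test."""
    train_docs = [d for d in docs if d["topic"] not in held_out_topics]
    test_docs = [d for d in docs if d["topic"] in held_out_topics]
    return train_docs, test_docs

def tfidf(docs_tokens):
    """Return one {term: weight} dict per document (tf * log(N/df))."""
    n = len(docs_tokens)
    df = {}
    for toks in docs_tokens:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    vecs = []
    for toks in docs_tokens:
        vec = {}
        for t in set(toks):
            tf = toks.count(t) / len(toks)
            vec[t] = tf * math.log(n / df[t])
        vecs.append(vec)
    return vecs

docs = [
    {"topic": "finance", "text": "markets rallied today"},
    {"topic": "finance", "text": "stocks and markets fell"},
    {"topic": "health",  "text": "vaccine trial results announced"},
]
train_docs, test_docs = topic_split(docs, held_out_topics={"health"})
vecs = tfidf([d["text"].split() for d in train_docs])
```

Note how a term shared by every training document ("markets") gets zero weight, while topic-specific terms keep theirs; under a random split, such topic cues would leak into the test set.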
https://arxiv.org/abs/2601.03812
A major challenge in operating large language models (LLMs) is predicting whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that enables causal language models to predict their own output quality during the prefilling phase, without affecting generation, using introspective tokens. By introducing a token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large-model usage by up to 50 percent at matched reliability.
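The token-conditional idea can be sketched as applying the low-rank LoRA update only at the introspective token's position, leaving every other token on the frozen backbone weights. The 2x2 identity backbone, rank-1 LoRA matrices, and toy inputs are illustrative assumptions about shapes, not the paper's actual parameters.

```python
# Sketch of token-conditional LoRA: the low-rank delta B(A x) is added
# only at the introspective token's position, so the backbone's behavior
# on all other tokens is preserved exactly.

def matvec(m, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in m]

def forward(tokens, w, lora_a, lora_b, intro_idx):
    """Apply W to every token; add the LoRA update only at intro_idx."""
    out = []
    for i, x in enumerate(tokens):
        y = matvec(w, x)
        if i == intro_idx:                      # gate on the special token
            z = matvec(lora_a, x)               # down-project (rank 1)
            delta = matvec(lora_b, z)           # up-project
            y = [yi + di for yi, di in zip(y, delta)]
        out.append(y)
    return out

w = [[1.0, 0.0], [0.0, 1.0]]                    # frozen backbone (identity)
lora_a = [[1.0, 1.0]]                           # 2 -> 1 down-projection
lora_b = [[0.5], [0.5]]                         # 1 -> 2 up-projection
tokens = [[1.0, 0.0], [0.0, 1.0]]               # token 1 is the introspective one
out = forward(tokens, w, lora_a, lora_b, intro_idx=1)
```

Only the introspective token's output is modified, which is why the adapter can be trained to predict output quality without changing what the backbone generates.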
https://arxiv.org/abs/2601.03511
Advancements in the field of AI are increasingly giving rise to various threats, one of the most prominent being the synthesis and misuse of deepfakes. To sustain trust in the digital age, detecting and tagging deepfakes is essential. In this paper, a novel architecture for deepfake detection in images and videos is presented. The architecture uses cross-attention between spatial- and frequency-domain features, along with a blood-detection module, to classify an image as real or fake. This paper aims to develop a unified architecture and provide insights into each step. Through this approach we achieve results better than the state of the art: 99.80% and 99.88% AUC on FF++ and Celeb-DF respectively when using Swin Transformer and BERT, and 99.55% and 99.38% when using EfficientNet-B4 and BERT. The approach also generalizes very well, achieving strong cross-dataset results.
https://arxiv.org/abs/2601.03382
Sentiment analysis focuses on identifying the emotional polarity expressed in textual data, typically categorized as positive, negative, or neutral. Hate speech detection, on the other hand, aims to recognize content that incites violence, discrimination, or hostility toward individuals or groups based on attributes such as race, gender, sexual orientation, or religion. Both tasks play a critical role in online content moderation by enabling the detection and mitigation of harmful or offensive material, thereby contributing to safer digital environments. In this study, we examine the performance of three transformer-based models, BERT-base-multilingual-cased, RoBERTa-base, and XLM-RoBERTa-base with the first eight layers frozen, on multilingual sentiment analysis and hate speech detection. The evaluation is conducted across five languages: English, Korean, Japanese, Chinese, and French. The models are compared using standard performance metrics, including accuracy, precision, recall, and F1-score. To enhance model interpretability and provide deeper insight into prediction behavior, we integrate the Local Interpretable Model-agnostic Explanations (LIME) framework, which highlights the contribution of individual words to the models' decisions. By combining state-of-the-art transformer architectures with explainability techniques, this work aims to improve both the effectiveness and transparency of multilingual sentiment analysis and hate speech detection systems.
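The word-level attributions that LIME produces can be approximated with a much simpler occlusion sketch: score each word by how much the classifier's output drops when that word is removed. Note this is a deliberate simplification for illustration; LIME proper fits a weighted linear surrogate over many random maskings rather than one mask per word.

```python
def word_contributions(text, predict):
    """Occlusion-style word attribution: the contribution of each word
    is the drop in the positive-class score when it is masked out.
    `predict` is any callable mapping a string to a score in [0, 1]."""
    words = text.split()
    base = predict(" ".join(words))
    contribs = {}
    for i, w in enumerate(words):
        masked = " ".join(words[:i] + words[i + 1:])
        contribs[w] = base - predict(masked)
    return contribs
```

With a real pipeline, `predict` would wrap the frozen-layer transformer's softmax output; here any toy classifier demonstrates the interface.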
https://arxiv.org/abs/2601.02697
Structured extraction with LLMs fails in production not because models lack understanding, but because output formatting is unreliable across models and prompts. A prompt that returns clean JSON on GPT-4 may produce fenced, prose-wrapped, or malformed output on Llama, causing strict parsers to reject otherwise correct extractions. We formalize this as format collapse and introduce a dual-metric evaluation framework: ROS (strict parsing, measuring operational reliability) and CSS (post-canonicalization, measuring semantic capability). On a 37,346-example camera metadata benchmark across six model families, we find severe format collapse (for example, Gemma-2B: ROS 0.116 versus CSS 0.246) and large cross-model portability gaps (0.4 to 0.6 F1). We then present PromptPort, a reliability layer combining deterministic canonicalization with a lightweight verifier (DistilBERT) and a safe-override policy. PromptPort recovers format failures (plus 6 to 8 F1), adds verifier-driven semantic selection (plus 14 to 16 F1 beyond canonicalization), and approaches per-field oracle performance (0.890 versus 0.896 in zero-shot) without modifying base models. The method generalizes to held-out model families and provides explicit abstention when uncertain, enabling reliable structured extraction in production deployments.
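The deterministic-canonicalization layer can be sketched as a small recovery function that unwraps fenced or prose-wrapped model output before strict parsing. The heuristics below (fence regex, first-to-last-brace fallback) are illustrative assumptions, not PromptPort's exact rules:

```python
import json
import re

def canonicalize(raw):
    """Recover a JSON object from fenced, prose-wrapped, or otherwise
    decorated model output so a strict parser can accept it."""
    text = raw.strip()
    # 1. Unwrap a Markdown code fence such as ```json ... ```
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1).strip()
    # 2. Try strict parsing; fall back to the outermost {...} span.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start = text.find("{")
        end = text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise  # nothing recoverable: surface the failure (abstention)
```

In the full system this sits in front of the DistilBERT verifier, which then selects among candidate extractions; canonicalization alone is what closes the ROS-versus-CSS gap the abstract quantifies.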
https://arxiv.org/abs/2601.06151