Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in the Indian context. In India, the issue is particularly acute: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets cover standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects of 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs such as GPT-4o and Gemini 2.5 perform poorly on the classification task, whereas fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6\% to 89.8\% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, owing to the difficulty of generating dialect sentences, for language-to-dialect translation a ``rule-based followed by AI" approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59. INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
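The dialect-classification F1 figures above are macro-averaged over labels; the metric itself is standard and can be sketched in pure Python (the dialect labels in the toy example are made up for illustration):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-label F1 over all gold labels, then the mean."""
    labels = sorted(set(y_true))
    f1s = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with hypothetical dialect labels.
gold = ["awadhi", "bhojpuri", "awadhi", "sambalpuri"]
pred = ["awadhi", "awadhi", "awadhi", "sambalpuri"]
print(round(macro_f1(gold, pred), 3))  # → 0.6
```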
https://arxiv.org/abs/2601.10388
Precisely estimating the 6D pose of objects is a primary challenge in computer vision, yet many current approaches remain fragile and have trouble generalizing from synthetic data to real-world situations with fluctuating lighting, textureless objects, and significant occlusions. To address these limitations, we propose VLM6D, a novel dual-stream architecture that leverages the distinct strengths of visual and geometric data from RGB-D input for robust and precise pose estimation. Our framework uniquely integrates two specialized encoders: a powerful, self-supervised Vision Transformer (DINOv2) processes the RGB modality, harnessing its rich, pretrained understanding of visual grammar to achieve remarkable resilience against texture and lighting variations. Concurrently, a PointNet++ encoder processes the 3D point cloud derived from depth data, enabling robust geometric reasoning that excels even with the sparse, fragmented data typical of severe occlusion. These complementary feature streams are effectively fused to inform a multi-task prediction head. We demonstrate through comprehensive experiments that VLM6D achieves new SOTA performance on the challenging Occluded-LineMOD benchmark, validating its superior robustness and accuracy.
https://arxiv.org/abs/2511.00120
Manually generating catchy names and descriptions is labor-intensive and slow for retailers. Although generative AI offers an automation path in the form of vision-to-language models (VLMs), current VLMs are prone to factual "hallucinations". Siloed, single-task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end-to-end, multi-task system that generates factually grounded textual listings from a single image. This study contributes two proposals for the model architecture. First, a multi-task learning approach for fine-tuning a vision encoder, where a single vision backbone is jointly trained on attribute prediction (such as color, hemline, and neck style) and price regression. Second, a hierarchical generation process where the model's own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi-task approach outperforms both independent price regression, with a 3.6% higher R2 value, and attribute classification, with a 6.6% improvement in F1 score. Critically, the hierarchical generation process proves highly effective, cutting the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non-hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 compared to a direct vision-to-language model of similar size. One minor caveat is that the model performs 3.5% worse than the direct vision-to-language model on ROUGE-L score.
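The hierarchical generation step, embedding the model's own predicted attributes into the decoder prompt, amounts to structured prompt templating. A minimal sketch; the attribute names and wording below are illustrative assumptions, not the paper's actual template:

```python
def build_grounded_prompt(attrs: dict, price: float) -> str:
    """Embed predicted attributes into the decoder prompt so the generated
    listing is conditioned on (and constrained by) them, which is what
    reduces factual hallucinations in the described pipeline."""
    attr_str = ", ".join(f"{k}: {v}" for k, v in sorted(attrs.items()))
    return (
        "Write a product listing for a garment with these verified attributes: "
        f"{attr_str}. Target price: ${price:.2f}. "
        "Mention only the attributes listed above."
    )

prompt = build_grounded_prompt(
    {"color": "navy", "hemline": "asymmetric", "neck": "v-neck"}, 39.99)
print(prompt)
```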
https://arxiv.org/abs/2510.21835
We present TinyBEV, a unified, camera-only Bird's Eye View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher (UniAD [19]) into a compact, real-time student model. Unlike prior efficient camera-only baselines such as VAD [23] and VADv2 [7], TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone, achieving a 78% reduction in parameters over UniAD [19]. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to effectively transfer high-capacity multi-modal knowledge to a lightweight BEV representation. On nuScenes [4], TinyBEV achieves 39.0 mAP for detection, 1.08 minADE for motion forecasting, and a 0.32 collision rate, while running 5x faster (11 FPS) and requiring only camera input. These results demonstrate that full-stack driving intelligence can be retained in resource-constrained settings, bridging the gap between large-scale, multi-modal perception-planning models and deployment-ready real-time autonomy.
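The multi-stage distillation objective (feature-level, output-level, and region-aware supervision) can be sketched as a weighted sum of per-term losses. This is a generic distillation-loss skeleton under assumed flat feature vectors, not the paper's implementation:

```python
def mse(a, b):
    """Mean squared error between two flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distill_loss(student_feat, teacher_feat, student_out, teacher_out,
                 region_weights, alpha=1.0, beta=1.0):
    """Feature-level plus output-level distillation. `region_weights` is a
    per-cell weight map standing in for adaptive region-aware supervision
    (e.g. emphasizing BEV cells near agents)."""
    feat_term = sum(w * (s - t) ** 2 for w, s, t in
                    zip(region_weights, student_feat, teacher_feat)) / len(student_feat)
    out_term = mse(student_out, teacher_out)
    return alpha * feat_term + beta * out_term
```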
https://arxiv.org/abs/2509.18372
Accurate prediction of clinical scores is critical for early detection and prognosis of Alzheimer's disease (AD). While existing approaches primarily focus on forecasting the global ADAS-Cog score, they often overlook the predictive value of its 13 sub-scores, which capture domain-specific cognitive decline. In this study, we propose a multi-task learning (MTL) framework that jointly predicts the global ADAS-Cog score and its 13 sub-scores at Month 24, using baseline MRI and longitudinal clinical scores from baseline and Month 6. The main goal is to examine how individual sub-scores, particularly those associated with MRI features, contribute to the prediction of the global score, an aspect largely neglected in prior MTL studies. We employ Vision Transformer (ViT) and Swin Transformer architectures to extract imaging features, which are fused with longitudinal clinical inputs to model cognitive progression. Our results show that incorporating sub-score learning improves global score prediction. Sub-score-level analysis reveals that a small subset, especially Q1 (Word Recall), Q4 (Delayed Recall), and Q8 (Word Recognition), consistently dominates the predicted global score. However, some of these influential sub-scores exhibit high prediction errors, pointing to model instability. Further analysis suggests that this is caused by clinical feature dominance, where the model prioritizes easily predictable clinical scores over more complex MRI-derived features. These findings emphasize the need for improved multimodal fusion and adaptive loss weighting to achieve more balanced learning. Our study demonstrates the value of sub-score-informed modeling and provides insights into building more interpretable and clinically robust AD prediction frameworks. (GitHub repo provided)
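The joint objective implied above, error on the global score plus errors on the sub-scores, can be sketched as follows. The uniform default weights are an assumption for illustration; the abstract itself argues for adaptive loss weighting:

```python
def mtl_loss(pred_global, true_global, pred_subs, true_subs, sub_weights=None):
    """Joint MTL objective: squared error on the global ADAS-Cog score plus a
    weighted mean of squared errors on the sub-scores (13 in the paper)."""
    if sub_weights is None:
        sub_weights = [1.0] * len(pred_subs)  # uniform weighting (assumption)
    global_term = (pred_global - true_global) ** 2
    sub_term = sum(w * (p - t) ** 2
                   for w, p, t in zip(sub_weights, pred_subs, true_subs))
    return global_term + sub_term / len(pred_subs)
```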
https://arxiv.org/abs/2508.17619
Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. Unlike natural scene segmentation, lesion delineation must reconcile subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and a strong need for precise boundary localization to support downstream diagnosis. In this paper we introduce Our method, a novel ResNet-inspired dual-resolution architecture designed specifically for melanocytic tumor segmentation. Our method maintains a full-resolution stream that preserves fine-grained boundary information, while a complementary pooled stream aggregates multi-scale contextual cues for robust lesion recognition. The streams are tightly coupled by boundary-aware residual connections that inject high-frequency edge information into deep feature maps, and by a channel attention module that adapts color and texture sensitivity to dermoscopic appearance. To further address common imaging artifacts and the limited size of clinical datasets, we propose a lightweight artifact suppression block and a multi-task training objective that combines a Dice-Tversky segmentation loss with an explicit boundary loss and a contrastive regularizer for feature stability. The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that Our method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems.
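The Tversky component of the training objective is standard and fits in a few lines; the alpha/beta values below are common illustrative defaults, not the paper's settings:

```python
def tversky_index(pred, target, alpha=0.7, beta=0.3, eps=1e-6):
    """Tversky index over flattened binary masks. alpha penalizes false
    positives, beta false negatives; alpha = beta = 0.5 recovers Dice."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(pred, target, **kw):
    """Loss form used inside a combined segmentation objective."""
    return 1.0 - tversky_index(pred, target, **kw)
```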
https://arxiv.org/abs/2508.06816
Infrared spectroscopy offers rapid, non-destructive measurement of chemical and material properties, but suffers from high-dimensional, overlapping spectral bands that challenge conventional chemometric approaches. Emerging large language models (LLMs), with their capacity for generalization and reasoning, offer promising potential for automating complex scientific workflows. Despite this promise, their application to IR spectral analysis remains largely unexplored. This study addresses the critical challenge of achieving accurate, automated infrared spectral interpretation under low-data conditions using an LLM-driven framework. We introduce an end-to-end, LLM-driven agent framework that integrates a structured literature knowledge base, automated spectral preprocessing, feature extraction, and multi-task reasoning in a unified pipeline. By querying a curated corpus of peer-reviewed IR publications, the agent selects scientifically validated routines. The selected methods transform each spectrum into low-dimensional feature sets, which are fed into few-shot prompt templates for classification, regression, and anomaly detection. A closed-loop, multi-turn protocol iteratively appends mispredicted samples to the prompt, enabling dynamic refinement of predictions. Across datasets of diverse materials (stamp pad ink, Chinese medicine, Pu'er tea, Citri Reticulatae Pericarpium, and wastewater COD), the multi-turn LLM consistently outperforms single-turn inference, rivaling or exceeding machine learning and deep learning models under low-data regimes.
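The closed-loop, multi-turn protocol, re-querying with mispredicted samples appended as extra context, can be sketched as a loop. The `predict` callable stands in for one LLM call; the toy stand-in below is purely illustrative:

```python
def multi_turn_refine(samples, labels, predict, max_turns=3):
    """Closed-loop protocol: each turn, append mispredicted (sample, label)
    pairs to the shared context and re-query, until all predictions are
    correct or the turn budget is exhausted."""
    context = []
    for _ in range(max_turns):
        wrong = [(s, y) for s, y in zip(samples, labels)
                 if predict(s, context) != y]
        if not wrong:
            break
        context.extend(wrong)  # corrections become few-shot examples next turn
    return [predict(s, context) for s in samples]

# Toy stand-in: guesses 0 unless the sample appears in the appended context.
def toy_predict(sample, context):
    for s, y in context:
        if s == sample:
            return y
    return 0

print(multi_turn_refine([1, 2, 3], [0, 1, 1], toy_predict))  # → [0, 1, 1]
```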
https://arxiv.org/abs/2507.21471
Early and accurate interpretation of screening mammograms is essential for effective breast cancer detection, yet it remains a complex challenge due to subtle imaging findings and diagnostic ambiguity. Many existing AI approaches fall short by focusing on single-view inputs or single-task outputs, limiting their clinical utility. To address these limitations, we propose a novel multi-view, multi-task hybrid deep learning framework that processes all four standard mammography views and jointly predicts diagnostic labels and BI-RADS scores for each breast. Our architecture integrates a hybrid CNN-VSSM backbone, combining convolutional encoders for rich local feature extraction with Visual State Space Models (VSSMs) to capture global contextual dependencies. To improve robustness and interpretability, we incorporate a gated attention-based fusion module that dynamically weights information across views, effectively handling cases with missing data. We conduct extensive experiments across diagnostic tasks of varying complexity, benchmarking our proposed hybrid models against baseline CNN architectures and VSSM models in both single-task and multi-task learning settings. Across all tasks, the hybrid models consistently outperform the baselines. In the binary BI-RADS 1 vs. 5 classification task, the shared hybrid model achieves an AUC of 0.9967 and an F1 score of 0.9830. For the more challenging ternary classification, it attains an F1 score of 0.7790, while in the five-class BI-RADS task the best F1 score reaches 0.4904. These results highlight the effectiveness of the proposed hybrid framework and underscore both the potential and the limitations of multi-task learning for improving diagnostic performance and enabling clinically meaningful mammography analysis.
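The gated fusion over views reduces to softmax-normalized gate weights, with missing views masked out before normalization. A minimal sketch under assumed flat per-view feature vectors, not the paper's module:

```python
import math

def gated_fusion(view_feats, gates, present):
    """Fuse per-view feature vectors with softmax-normalized gate scores.
    Views flagged absent (present=False) are masked out before
    normalization, so missing data contributes zero weight."""
    exp = [math.exp(g) if p else 0.0 for g, p in zip(gates, present)]
    z = sum(exp) or 1.0  # guard against all views missing
    weights = [e / z for e in exp]
    dim = len(view_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, view_feats))
            for i in range(dim)]
```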
https://arxiv.org/abs/2507.16955
Current multi-channel speech enhancement systems mainly adopt a single-output architecture, which faces significant challenges in preserving spatio-temporal signal integrity during multiple-input multiple-output (MIMO) processing. To address this limitation, we propose a novel neural network for MIMO speech enhancement, termed WTFormer, that leverages the multi-resolution characteristics of the wavelet transform and multi-dimensional collaborative attention to effectively capture globally distributed spatial features, while using a Conformer for time-frequency modeling. A multi-task loss strategy incorporating the MUSIC algorithm is further proposed for training, to protect spatial information to the greatest extent. Experimental results on the LibriSpeech dataset show that WTFormer achieves denoising performance comparable to advanced systems while preserving more spatial information, with only 0.98M parameters.
https://arxiv.org/abs/2506.22001
To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while aligning more closely with the cognitive logic of manual annotation.
https://arxiv.org/abs/2505.00017
Leukemia is the 10th most frequently diagnosed cancer and one of the leading causes of cancer-related deaths worldwide. Realistic analysis of leukemia requires white blood cell (WBC) localization, classification, and morphological assessment. Despite deep learning advances in medical imaging, leukemia analysis lacks a large, diverse multi-task dataset, while existing small datasets lack domain diversity, limiting real-world applicability. To overcome these dataset challenges, we present a large-scale WBC dataset named the Large Leukemia Dataset (LLD), together with novel methods for detecting WBCs and their attributes. Our contribution is threefold. First, we present a large-scale leukemia dataset collected through Peripheral Blood Films (PBF) from several patients, using multiple microscopes, multiple cameras, and multiple magnifications. To enhance diagnostic explainability and medical expert acceptance, each leukemia cell is annotated at 100x with 7 morphological attributes, ranging from cell size to nuclear shape. Second, we propose a multi-task model that not only detects WBCs but also predicts their attributes, providing an interpretable and clinically meaningful solution. Third, we propose a method for WBC detection with attribute analysis using sparse annotations. This approach reduces the annotation burden on hematologists, requiring them to mark only a small area within the field of view. Our method enables the model to leverage the entire field of view rather than just the annotated regions, enhancing learning efficiency and diagnostic accuracy. From diagnostic explainability to overcoming domain-shift challenges, the presented datasets can be used for many challenging aspects of microscopic image analysis. The datasets, code, and demo are available at: this https URL
https://arxiv.org/abs/2504.02602
The essence of a modern e-commerce search system lies in matching user intent with available candidates based on the user's query, providing personalized and precise service. However, a user's query may be incorrect due to ambiguous input or typos, leading to inaccurate search. Such cases can be mitigated by query rewrite: modifying the query into another representation or expanding it. However, traditional query rewrite relies on a static rewrite vocabulary, which is manually established and lacks interaction both with domain knowledge in the e-commerce system and with common knowledge of the real world. In this paper, leveraging the text generation ability of Large Language Models (LLMs), we provide an iterative framework for generating query rewrites. The framework incorporates a three-stage procedure in each iteration: rewrite generation with domain knowledge via Retrieval-Augmented Generation (RAG) and query understanding via Chain-of-Thought (CoT); online signal collection with automatic positive-rewrite updates; and post-training of the LLM with a multi-task objective to generate new rewrites. Our work (named IterQR) provides a comprehensive framework to generate \textbf{Q}uery \textbf{R}ewrites with both domain and real-world knowledge. It automatically updates and self-corrects the rewrites across \textbf{iter}ations. IterQR has been deployed in Meituan Delivery's search system (China's leading food delivery platform), serving users with significant improvement.
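The three-stage iteration can be sketched as a loop over three callables standing in for the RAG+CoT generator, online signal collection, and LLM post-training; the stubs and their interfaces are assumptions for illustration, not IterQR's actual API:

```python
def iterate_query_rewrites(query, generate, collect_signal, post_train, n_iters=3):
    """One pass of the described loop per iteration: generate candidate
    rewrites, keep those receiving positive online signal, post-train on
    the positives, and feed them back as context for the next round."""
    rewrites = [query]
    for _ in range(n_iters):
        candidates = generate(query, rewrites)           # stage 1: RAG + CoT
        positives = [r for r in candidates if collect_signal(r)]  # stage 2
        if positives:
            rewrites = positives
            post_train(positives)                        # stage 3: multi-task post-training
    return rewrites

# Toy demo with trivial stubs.
demo = iterate_query_rewrites(
    "running shoes",
    generate=lambda q, prev: [q + " sneakers"],
    collect_signal=lambda r: True,
    post_train=lambda rs: None,
)
print(demo)  # → ['running shoes sneakers']
```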
https://arxiv.org/abs/2504.05309
Parameter-efficient fine-tuning (PEFT) methods are widely used in LLMs and in generative models for computer vision. In particular, several such adapters can be combined at inference time to change the behavior of the base model. In this paper we investigate whether multiple LoRA adapters trained on computer vision tasks can be merged together and used during inference without loss in performance. If so, multi-task models can be created simply by merging different LoRAs; merging them reduces inference time and requires no additional retraining. We trained adapters on six different tasks and evaluated their performance when merged together. For comparison, we used a model with a frozen backbone and a fine-tuned head. Our results show that, even with simple merging techniques, creating a multi-task model by merging adapters is achievable, with only a slight loss of performance in some cases. In our experiments we merged up to three adapters together. Depending on the task and the similarity of the data the adapters were trained on, merges can outperform head fine-tuning. We observed that LoRAs trained on dissimilar datasets tend to perform better when merged than those trained on similar datasets.
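The simplest merging technique, a (weighted) average of adapter weight deltas, can be sketched as below. Flat lists stand in for the per-layer low-rank products B @ A of real LoRA adapters; this is a generic sketch, not the paper's code:

```python
def merge_lora_deltas(deltas, weights=None):
    """Merge task-specific LoRA weight deltas by a weighted average.
    Each delta is a flat vector standing in for one adapter's B @ A update."""
    if weights is None:
        weights = [1.0 / len(deltas)] * len(deltas)  # plain average by default
    dim = len(deltas[0])
    return [sum(w * d[i] for w, d in zip(weights, deltas)) for i in range(dim)]

def apply_merged(base, merged_delta):
    """Add the merged delta onto the frozen base weights."""
    return [b + d for b, d in zip(base, merged_delta)]
```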
https://arxiv.org/abs/2411.14064
Evaluating Large Language Models (LLMs) in low-resource and linguistically diverse languages remains a significant challenge in NLP, particularly for languages using non-Latin scripts, such as those spoken in India. Existing benchmarks predominantly focus on English, leaving substantial gaps in assessing LLM capabilities in these languages. We introduce MILU, a Multi-task Indic Language Understanding Benchmark, a comprehensive evaluation benchmark designed to address this gap. MILU spans 8 domains and 42 subjects across 11 Indic languages, reflecting both general and culturally specific knowledge. With its India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. We evaluate over 42 LLMs and find that current LLMs struggle with MILU, with GPT-4o achieving the highest average accuracy at 72 percent. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines. Models also perform better in high-resource languages than in low-resource ones. Domain-wise analysis indicates that models perform poorly in culturally relevant areas like Arts and Humanities or Law and Governance, compared to general fields like STEM. To the best of our knowledge, MILU is the first benchmark of its kind focused on Indic languages, serving as a crucial step towards comprehensive cultural evaluation. All code, benchmarks, and artifacts will be made publicly available to foster open research.
https://arxiv.org/abs/2411.02538
For the diagnosis of diabetic retinopathy (DR) in retinal images, this paper proposes an artificial-intelligence-based classification method. The core is a new data augmentation method, GreenBen, which first extracts the green-channel grayscale image from the retinal image and then performs Ben enhancement. Considering that diabetic macular edema (DME) is a complication closely related to DR, this paper constructs a joint DR and DME classification framework based on multi-task learning and an attention module, and uses GreenBen for data augmentation to reduce variation across DR images and improve classification accuracy. We conducted extensive experiments on three publicly available datasets, and our method achieved the best results. Whether based on the ResNet50 network or the Swin Transformer network, and whether for individual classification or joint DME classification, GreenBen achieved stable and significant improvements in DR classification over other data augmentation methods, with an accuracy increase of 10%.
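GreenBen's two steps, green-channel extraction followed by Ben enhancement, can be sketched on nested pixel lists. The 4*I - 4*blur(I) + 128 constants follow the widely used Ben Graham recipe for fundus images and are an assumption here, not taken from the paper; a real pipeline would pass a Gaussian blur for `blur`:

```python
def green_ben(rgb_image, blur):
    """Sketch of the described augmentation: take the green channel of an
    RGB image (rows of (r, g, b) tuples), then apply a Ben-style
    enhancement: a scaled difference from a blurred copy plus a gray
    offset, clipped to the 8-bit range. `blur` is any smoothing function
    mapping a 2D grayscale list to one of the same shape."""
    green = [[px[1] for px in row] for row in rgb_image]
    blurred = blur(green)
    return [[max(0, min(255, 4 * g - 4 * b + 128))
             for g, b in zip(grow, brow)]
            for grow, brow in zip(green, blurred)]
```

With an identity blur every pixel maps to the gray offset 128, which is a quick sanity check that the difference term cancels.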
https://arxiv.org/abs/2410.09444
Pretrained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval. However, these models often exhibit limited generalization capabilities and face challenges in improving in-domain accuracy. Recent research has explored using large language models (LLMs) as retrievers, achieving SOTA performance across various tasks. Despite these advancements, the specific benefits of LLMs over traditional retrievers, and the impact of different LLM configurations, such as parameter size, pretraining duration, and alignment processes, on retrieval tasks remain unclear. In this work, we conduct a comprehensive empirical study across a wide range of retrieval tasks, including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. We evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that larger models and extensive pretraining consistently enhance in-domain accuracy and data efficiency. Additionally, larger models demonstrate significant potential in zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. These results underscore the advantages of LLMs as versatile and effective backbone encoders in dense retrieval, providing valuable insights for future research and development in this field.
https://arxiv.org/abs/2408.12194
This paper describes the winning solution to all 5 tasks of the Amazon KDD Cup 2024 Multi-Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track: we fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets and by using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employ logits processors to constrain the model output to tokens relevant to each task. AWQ 4-bit quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes, depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon's KDD Cup 2024.
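wise-ft itself is plain weight-space interpolation between the original (zero-shot) checkpoint and the fine-tuned one; a one-function sketch, with flat lists standing in for the parameter tensors:

```python
def wise_ft(theta_zero, theta_ft, alpha=0.5):
    """Weight-space ensembling: theta = (1 - alpha) * theta_zero + alpha * theta_ft.
    alpha = 1 recovers the fine-tuned model, alpha = 0 the original one;
    intermediate values trade in-distribution accuracy against robustness
    to distribution shift."""
    return [(1 - alpha) * z + alpha * f for z, f in zip(theta_zero, theta_ft)]
```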
https://arxiv.org/abs/2408.04658
Parkinson's disease is easy to diagnose when it is advanced, but very difficult to diagnose in its early stages; early diagnosis is essential to be able to treat the symptoms. The disease impacts daily activities and reduces the quality of life of both patients and their families, and it is the second most prevalent neurodegenerative disorder after Alzheimer's in people over the age of 60. Most current studies on predicting Parkinson's severity are carried out in advanced stages of the disease. This work analyzes a set of variables that can be easily extracted from voice analysis, making the technique very non-intrusive. We propose a method based on deep learning techniques with two purposes: first, to determine whether a person has severe or non-severe Parkinson's disease; second, to determine, by means of regression techniques, the degree of progression of the disease in a given patient. The UPDRS (Unified Parkinson's Disease Rating Scale) is used, taking into account both the motor and total labels, and the best results were obtained using a mixed multi-layer perceptron (MLP) that classifies and regresses at the same time, taking as input the most important features of the data, selected with an autoencoder. A success rate of 99.15% was achieved on the problem of predicting whether a person suffers from severe or non-severe Parkinson's disease. On the disease-progression prediction problem, an MSE (mean squared error) of 0.15 was obtained. Using a full deep learning pipeline for data preprocessing and classification proves very promising in the Parkinson's field, outperforming state-of-the-art proposals.
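A head that "classifies and regresses at the same time" implies a joint objective; a minimal sketch combining a binary log loss on the severity label with a lambda-weighted squared error on the UPDRS score (the lambda weighting is an assumption for illustration, not the paper's exact loss):

```python
import math

def mixed_mlp_loss(cls_logit, cls_label, reg_pred, reg_target, lam=1.0):
    """Joint objective for a mixed classification + regression head:
    binary cross-entropy on severe/non-severe plus lambda-weighted
    squared error on the UPDRS score."""
    p = 1.0 / (1.0 + math.exp(-cls_logit))  # sigmoid probability
    ce = -(cls_label * math.log(p) + (1 - cls_label) * math.log(1.0 - p))
    sq = (reg_pred - reg_target) ** 2
    return ce + lam * sq
```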
帕金森病在晚期容易诊断,但在早期阶段很难诊断,而早期诊断对症状治疗至关重要。这种疾病影响患者的日常活动,降低患者及其家属的生活质量,也是60岁以上人群中仅次于阿尔茨海默病的第二大常见神经退行性疾病。目前关于预测帕金森病严重程度的研究大多在疾病晚期进行。本研究分析了一组可以轻松从语音分析中提取的变量,因此是一种非常无创的技术。本文提出了一种基于多种深度学习技术的方法,具有两个目的:一方面判断一个人患有重度还是非重度帕金森病,另一方面通过回归技术确定特定患者疾病的进展程度。研究使用了统一帕金森病评定量表(UPDRS),同时考虑运动标签和总分标签;最佳结果由一种同时进行分类和回归的混合多层感知器(MLP)获得,其输入为通过自编码器提取的数据最重要特征。在预测一个人患有重度还是非重度帕金森病的问题上取得了99.15%的成功率;在疾病受累程度预测问题上,获得了0.15的均方误差(MSE)。使用完整的深度学习流程进行数据预处理和分类在帕金森病领域展现出很大的前景,表现优于现有最先进的方法。
https://arxiv.org/abs/2402.05491
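The "mixed" MLP above shares one network between a classification head and a regression head. A minimal dependency-free sketch of that shared-trunk, two-head forward pass is given below; the layer sizes, weights, and feature values are toy assumptions, not the paper's trained model.

```python
import math

# Hedged sketch of a multi-task MLP: one shared hidden layer feeds both
# a classification head (severe vs. non-severe) and a regression head
# (a UPDRS-style score). All weights here are illustrative constants.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    # W holds one row of weights per output unit.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mixed_mlp(features):
    # Shared trunk: 3 voice-derived features -> 2 hidden units.
    h = relu(linear(features,
                    [[0.5, -0.2, 0.1], [0.3, 0.4, -0.1]],
                    [0.0, 0.1]))
    # Classification head: sigmoid probability of "severe".
    logit = linear(h, [[1.0, -1.0]], [0.0])[0]
    p_severe = 1.0 / (1.0 + math.exp(-logit))
    # Regression head: predicted severity score.
    score = linear(h, [[10.0, 5.0]], [20.0])[0]
    return p_severe, score

p, score = mixed_mlp([1.0, 0.5, 0.2])
```

In training, the two heads would be optimized jointly (a classification loss plus a regression loss on the shared trunk), which is what lets the tasks regularize each other.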
Source-free test-time adaptation for medical image segmentation aims to enhance the adaptability of segmentation models to diverse and previously unseen test sets of the target domain, which contributes to the generalizability and robustness of medical image segmentation models without access to the source domain. Ensuring consistency between target edges and paired inputs is crucial for test-time adaptation. To improve the performance of test-time domain adaptation, we propose a multi-task consistency-guided source-free test-time domain adaptation method for medical image segmentation that ensures the consistency of local boundary predictions and the global prototype representation. Specifically, we introduce a local boundary consistency constraint that explores the relationship between the tissue region segmentation and tissue boundary localization tasks. Additionally, we propose a global feature consistency constraint to enhance intra-class compactness. We conduct extensive experiments on the segmentation of benchmark fundus images. Compared to direct prediction by the source-domain model, the segmentation Dice score improves by 6.27\% and 0.96\% on the RIM-ONE-r3 and Drishti-GS datasets, respectively. The experimental results further demonstrate that our proposed method outperforms existing competitive domain adaptation segmentation algorithms.
无源(source-free)测试时适应的医学图像分割旨在增强分割模型对目标域中多样化且前所未见的测试集的适应能力,从而在无法访问源域的情况下提升医学图像分割模型的泛化能力和稳健性。确保目标边缘与成对输入之间的一致性对测试时适应至关重要。为了提高测试时域适应的性能,我们提出了一种多任务一致性引导的无源测试时域适应医学图像分割方法,确保局部边界预测与全局原型表示的一致性。具体来说,我们引入了一种局部边界一致性约束方法,探索组织区域分割与组织边界定位任务之间的关系。此外,我们还提出了全局特征一致性约束以增强类内紧凑性。我们在基准眼底图像分割上进行了广泛实验。与源域模型的直接预测相比,在RIM-ONE-r3和Drishti-GS数据集上分割Dice分数分别提高了6.27%和0.96%。此外,实验结果表明,我们提出的方法优于现有的竞争性域适应分割算法。
https://arxiv.org/abs/2310.11766
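The Dice score reported above measures overlap between a predicted mask and the ground truth. A minimal sketch of that metric on flat binary masks follows; it is an illustration of the standard formula, not the paper's evaluation code.

```python
# Hedged sketch: Dice score 2*|P ∩ T| / (|P| + |T|) on binary masks
# represented as flat 0/1 lists (e.g. a flattened segmentation map).

def dice_score(pred, target):
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Both masks empty counts as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

score = dice_score([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])  # 2*2 / (3+3) = 2/3
```

A Dice score of 1.0 means the masks coincide exactly, and 0.0 means they are disjoint; the percentage gains quoted in the abstract are differences in this quantity.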
Logical reasoning is fundamental for humans yet presents a substantial challenge in the domain of Artificial Intelligence. Initially, researchers used Knowledge Representation and Reasoning (KR) systems that did not scale and required non-trivial manual effort. Recently, the emergence of large language models (LLMs) has demonstrated the ability to overcome various limitations of formal KR systems. Consequently, there is growing interest in using LLMs for logical reasoning via natural language. This work strives to understand the proficiency of LLMs in logical reasoning by offering a brief review of the latest progress in the area, with a focus on logical reasoning datasets, tasks, and the methods adopted to utilize LLMs for reasoning. To offer a thorough analysis, we have compiled a benchmark titled LogiGLUE, which includes 24 varied datasets encompassing deductive, abductive, and inductive reasoning. We have standardized these datasets into Seq2Seq tasks to facilitate straightforward training and evaluation for future research. Using LogiGLUE as a foundation, we have trained an instruction fine-tuned language model, resulting in LogiT5. We study single-task training, multi-task training, and a chain-of-thought knowledge distillation fine-tuning technique to assess the performance of the model across the different logical reasoning categories. Through this comprehensive process, we aim to shed light on the capabilities of, and potential pathways for enhancing, logical reasoning proficiency in LLMs, paving the way for more advanced and nuanced developments in this critical field.
逻辑推理是人类的基本能力,但在人工智能领域却是一项重大挑战。起初,研究人员使用难以扩展且需要大量人工投入的知识表示与推理(KR)系统。最近,大型语言模型(LLM)的出现展示了克服形式化知识表示(KR)系统各种局限的能力。因此,通过自然语言利用LLM进行逻辑推理的兴趣日益增长。这项工作旨在通过简要回顾该领域的最新进展来理解LLM在逻辑推理方面的能力,重点关注逻辑推理数据集、任务以及利用LLM进行推理的方法。为了进行全面分析,我们汇编了一个名为LogiGLUE的基准,其中包括24个不同的数据集,涵盖演绎、溯因和归纳推理。我们将这些数据集标准化为Seq2Seq任务,以便未来研究能够直接进行训练和评估。以LogiGLUE为基础,我们训练了一个指令微调的语言模型,得到LogiT5。我们研究了单任务训练、多任务训练以及思维链知识蒸馏微调技术,以评估模型在不同逻辑推理类别中的表现。通过这一全面的过程,我们旨在阐明LLM的逻辑推理能力及其提升路径,为这一关键领域更高级、更精细的发展铺平道路。
https://arxiv.org/abs/2310.00836
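The Seq2Seq standardization described above flattens each logic example into a plain (input text, target text) pair so any sequence-to-sequence model can train on it. The sketch below illustrates the idea on an entailment-style example; the field names and prompt template are assumptions for illustration, not the actual LogiGLUE schema.

```python
# Hedged sketch: converting a structured logic example into a flat
# Seq2Seq (source, target) text pair. Keys and template are illustrative.

def to_seq2seq(example):
    source = (f"premises: {' '.join(example['premises'])} "
              f"hypothesis: {example['hypothesis']}")
    target = example["label"]
    return source, target

src, tgt = to_seq2seq({
    "premises": ["All birds can fly.", "Tweety is a bird."],
    "hypothesis": "Tweety can fly.",
    "label": "entailed",
})
```

Once every dataset is rendered into this uniform text-to-text shape, single-task and multi-task training differ only in which (source, target) pairs are mixed into the training stream.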